Text processing before printing in awk

Question

I am not much of a scripter, but I have managed to write a few scripts with help from this forum. I have now run into a problem that I cannot get to work (and I am not sure it is even possible).

I have a file, fileY, with this content:

lrwxrwxrwx  1  user1 gp  35  2021-09-07  2000  /folder/subfolder1/subfolder2/subfolder3/main/summary.txt
lrwxrwxrwx  1  user1 gp  35  2021-09-08  1400  /folder/subfolder1/subfolder2/main/summary.txt
lrwxrwxrwx  1  user1 gp  35  2021-09-09  1800  /folder/subfolder1/subfolder2/subfolder3/subfolder4/main/summary.txt

I want to output columns 3, 6, 7 and 8, followed by the name of the folder that comes just before "main", like this:

user1 2021-09-07  2000  /folder/subfolder1/subfolder2/subfolder3/main/summary.txt subfolder3
user1 2021-09-08  1400  /folder/subfolder1/subfolder2/main/summary.txt subfolder2
user1 2021-09-09  1800  /folder/subfolder1/subfolder2/subfolder3/subfolder4/main/summary.txt subfolder4

How can I fold the sed command below into the awk command, so that its result becomes one of the values passed to print?

awk '{print $3,$6,$7,$8}' fileY
sed 's/\// /g; s/\./ /g' fileY | awk '{for(i=8;i<=NF;i++){if($i~/^main/){a=i}} print $(a-1)}'

Noted, @Ed Morton. Sorry about that. I will start a new thread for new questions in future.
– Azman, Commented Oct 4, 2021 at 2:47

Ed Morton · Accepted Answer · 2021-10-02 18:41:29Z

You never need sed when you're using awk. If the directory you want is always 3rd-last in the path as in your examples then all you need is this using any awk:

$ awk '{print $3, $6, $7, $8, p[split($8,p,"/")-2]}' file
user1 2021-09-07 2000 /folder/subfolder1/subfolder2/subfolder3/main/summary.txt subfolder3
user1 2021-09-08 1400 /folder/subfolder1/subfolder2/main/summary.txt subfolder2
user1 2021-09-09 1800 /folder/subfolder1/subfolder2/subfolder3/subfolder4/main/summary.txt subfolder4

Otherwise using GNU awk for the 3rd arg to match():

$ awk '{match($8,"([^/]+)/main/",a); print $3, $6, $7, $8, a[1]}' file
user1 2021-09-07 2000 /folder/subfolder1/subfolder2/subfolder3/main/summary.txt subfolder3
user1 2021-09-08 1400 /folder/subfolder1/subfolder2/main/summary.txt subfolder2
user1 2021-09-09 1800 /folder/subfolder1/subfolder2/subfolder3/subfolder4/main/summary.txt subfolder4

or using any awk:

$ awk '{match($8,"[^/]+/main/"); print $3, $6, $7, $8, substr($8,RSTART,RLENGTH-6)}' file
user1 2021-09-07 2000 /folder/subfolder1/subfolder2/subfolder3/main/summary.txt subfolder3
user1 2021-09-08 1400 /folder/subfolder1/subfolder2/main/summary.txt subfolder2
user1 2021-09-09 1800 /folder/subfolder1/subfolder2/subfolder3/subfolder4/main/summary.txt subfolder4
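
To see why RLENGTH-6 works in the last command, here is a minimal sketch that prints the values match() sets for one of the sample paths. The variable names path and result are only for illustration. RSTART is where the match begins, RLENGTH is the length of the whole match "<dir>/main/", and "/main/" accounts for 6 of those characters.

```shell
# One sample path from the question.
path=/folder/subfolder1/subfolder2/subfolder3/main/summary.txt

result=$(printf '%s\n' "$path" | awk '{
    # match() sets RSTART (start of match) and RLENGTH (length of match)
    match($0, "[^/]+/main/")
    # whole match first, then the match with the trailing "/main/" cut off
    print RSTART, RLENGTH, substr($0, RSTART, RLENGTH), substr($0, RSTART, RLENGTH - 6)
}')
printf '%s\n' "$result"
```

For this path the match is "subfolder3/main/", so dropping the last 6 characters leaves just the folder name.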

terdon · Accepted Answer · 2021-10-02 12:49:17Z

I don't really understand why you would want the sed there; you can do it all in a single awk. Of course, this assumes that you never have spaces or newlines in the folder names, so that we can safely use whitespace as the field delimiter. Please edit your question and add a more comprehensive example if that is not true.

$ awk '{
            n=split($8,dirs,"/")      # dirs[1..n] hold the path components
            dir=""
            for(i=1;i<n;i++){         # counted loop: fixed order, and dirs[i+1] always exists
                if(dirs[i+1]=="main"){
                    dir=dirs[i]
                }
            }
            print $3,$6,$7,$8,dir}' fileY
user1 2021-09-07 2000 /folder/subfolder1/subfolder2/subfolder3/main/summary.txt subfolder3
user1 2021-09-08 1400 /folder/subfolder1/subfolder2/main/summary.txt subfolder2
user1 2021-09-09 1800 /folder/subfolder1/subfolder2/subfolder3/subfolder4/main/summary.txt subfolder4

The trick here is using split() to split the 8th field into the dirs array, using / as the delimiter. We then iterate over dirs and keep the last array entry we find whose next array entry is main. Note that this means that if you have more than one occurrence of main, you will only match the last one.
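
To make the split() step concrete, here is a small sketch that dumps the array for one sample path. It walks the array with a counter so that the elements are shown in path order. Note that the first element is empty, because the path starts with the delimiter.

```shell
# One sample path from the question.
path=/folder/subfolder1/subfolder2/main/summary.txt

result=$(printf '%s\n' "$path" | awk '{
    # split() returns the number of pieces and fills dirs[1..n]
    n = split($0, dirs, "/")
    for (i = 1; i <= n; i++)
        printf "dirs[%d]=\"%s\"\n", i, dirs[i]
}')
printf '%s\n' "$result"
```

Here dirs[5] is "main", so the element kept as dir is dirs[4], i.e. "subfolder2".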

schrodingerscatcuriosity · Accepted Answer · 2021-10-02 12:52:02Z

A different approach, using rev, taking advantage of the fact that the wanted folder is the third item in reverse using / as separator, assuming the folder name structure is consistent with the sample given (<wanted folder>/main/summary.txt):

$ rev file | awk -F'/' '{ print $3,$0 }' | rev | awk '{ print $3,$6,$7,$8,$9 }'
user1 2021-09-07 2000 /folder/subfolder1/subfolder2/subfolder3/main/summary.txt subfolder3
user1 2021-09-08 1400 /folder/subfolder1/subfolder2/main/summary.txt subfolder2
user1 2021-09-09 1800 /folder/subfolder1/subfolder2/subfolder3/subfolder4/main/summary.txt subfolder4
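
The pipeline is easier to follow one stage at a time. This sketch runs the first three stages on a single sample line and keeps each intermediate result; the variable names line, stage1, stage2 and stage3 are only for illustration, and it assumes the rev utility (from util-linux or BSD) is installed.

```shell
line='lrwxrwxrwx 1 user1 gp 35 2021-09-08 1400 /folder/subfolder1/subfolder2/main/summary.txt'

# Stage 1: reverse the line, so the end of the path comes first.
stage1=$(printf '%s\n' "$line" | rev)
# Stage 2: with "/" as separator, field 3 is now the (reversed) wanted folder;
# print it in front of the reversed line.
stage2=$(printf '%s\n' "$stage1" | awk -F'/' '{ print $3, $0 }')
# Stage 3: reverse again; the folder name ends up as a new last column.
stage3=$(printf '%s\n' "$stage2" | rev)

printf '%s\n' "$stage1" "$stage2" "$stage3"
```

After the third stage the folder name is field 9 of the original line, which is why the final awk prints $3,$6,$7,$8,$9.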

guest_7 · Accepted Answer · 2021-10-03 15:09:17Z

You can do it just as easily with sed. To make the sed code easier to write, I define some helper shell variables. This uses GNU sed in its extended regex mode.

Based on the observations raised by @Ed Morton, the delimiters are now colons, to avoid confusion with the ERE metacharacter |.

_s='[:space:]'
s="[${_s}]" S="[^${_s}]" F="$S+$s+"
sed -Ee "
  s:^($F){2}($F)($F){2}:\2:
  s:/([^/]+)/main/$S+\$:& \1:
" file

edited Oct 3, 2021 at 15:09

answered Oct 2, 2021 at 20:24


"just as easily" is a bit of a stretch :-). Since you're already requiring GNU sed you could use \s and \S instead of those shell variables though. IMHO using a regexp metachar like | as the delimiter obfuscates the code and you should use a char that's always literal instead, e.g. :.
– Ed Morton, Commented Oct 3, 2021 at 13:29

GNU sed is there for its extended regex feature. The shell variables allow the script to be made POSIX friendly. And the choice of delimiter is very much a matter of preference; one can use whatever one prefers.
– guest_7, Commented Oct 3, 2021 at 14:53

Right, but I'm just saying that since you're already using GNU sed, there's no point in trying to make it POSIX friendly when you can't run it in a POSIX sed. You may as well take advantage of other GNU features and make it a self-contained, concise GNU sed script; then you don't need to wrap it in double quotes and escape anything you don't want the shell to expand, e.g. $.
– Ed Morton, Commented Oct 3, 2021 at 14:59

This point is well made and I agree with it. Based on your observations, I have redone the delimiter code. But I don't agree with your earlier suggestion to use \s. The whole point of those variables was to make it usable in both cases.
– guest_7, Commented Oct 3, 2021 at 15:06

What do you mean by both cases though? You cannot run that script in a POSIX sed, as it's using the non-POSIX option -E and its associated ERE syntax, so using shell variables to make it more POSIX friendly doesn't help. If you want it to run in a POSIX sed then you could get rid of -E and change the +s to \{1,\}s and the {2} to \{2\}s, but if you don't do that then using a shell variable instead of \s is of limited use (that might also let it run in a BSD sed, but still not in a POSIX one).
– Ed Morton, Commented Oct 3, 2021 at 15:09
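
As a rough sketch of the conversion described in the last comment, the same two substitutions can be written with POSIX basic regular expressions: no -E, \{1,\} instead of +, \{2\} instead of {2}, and (an addition the comment does not spell out) \( \) instead of ( ) for groups. The variable names follow the answer; the delimiters / and | are a choice made here so that the delimiter never appears inside a bracket expression, and the sample line is single-spaced for readability.

```shell
# Building blocks as in the answer, but using BRE interval syntax.
S='[^[:space:]]\{1,\}'   # one or more non-whitespace characters
s='[[:space:]]\{1,\}'    # one or more whitespace characters
F="$S$s"                 # one field plus its trailing whitespace

line='lrwxrwxrwx 1 user1 gp 35 2021-09-08 1400 /folder/subfolder1/subfolder2/main/summary.txt'

# 1st expression: replace the first five fields with the third one.
# 2nd expression: append the folder that precedes /main/ to the line.
result=$(printf '%s\n' "$line" | sed -e "s/^\($F\)\{2\}\($F\)\($F\)\{2\}/\2/" \
    -e "s|/\([^/]\{1,\}\)/main/$S\$|& \1|")
printf '%s\n' "$result"
```

The script is still wrapped in double quotes so that the shell variables expand, which is why the end-of-line anchor has to be written as \$.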


sseLtaH · Accepted Answer · 2021-10-03 21:30:19Z

Using GNU sed with nested capture groups:

$ sed -E 's|.*\s[0-9]\s\s(.[^ ]*).*([0-9]{4}-.*/(.[^/]*).*/.*/.*)|\1 \2 \3|' input_file
user1 2021-09-07  2000  /folder/subfolder1/subfolder2/subfolder3/main/summary.txt subfolder3
user1 2021-09-08  1400  /folder/subfolder1/subfolder2/main/summary.txt subfolder2
user1 2021-09-09  1800  /folder/subfolder1/subfolder2/subfolder3/subfolder4/main/summary.txt subfolder4

answered Oct 3, 2021 at 21:30

