1

I use the command find . -maxdepth 1 -not -type d which generates output like ./filename.1.out

I pipe the find command output to awk. The goal is to split on either the literal ./ or a literal . (dot). I have it working using:

find . -maxdepth 1 -not -type d | gawk 'BEGIN { FS = "(\\./)|(\\.)" } ; { print NF }'

In fact, it also works if I drop the first backslash in the first set of parentheses. For example:

find . -maxdepth 1 -not -type d | gawk 'BEGIN { FS = "(\./)|(\\.)" } ; { print NF }'

What I don't understand, and my actual question, is why it does not work if I use:

find . -maxdepth 1 -not -type d | gawk 'BEGIN { FS = "(\./)|(\.)" } ; { print NF }'

By "not work" I mean NF comes back with a number as if the second parenthesized group were the regex . metacharacter (which matches any character). Maybe I'm answering my own question, but looking at the commands and their behavior, it appears that the initial backslash is being ignored. In fact, gawk printed a warning saying that the escape sequence \. was being treated as plain '.'. But I didn't really understand what it was doing until I began printing NF.

And indeed, the awk doc for escape sequences (https://www.gnu.org/software/gawk/manual/html_node/Escape-Sequences.html#Escape-Sequences) says:

The backslash character itself is another character that cannot be included normally; you must write \\ to put one backslash in the string or regexp.

So if I wanted to write a regex to match a dollar sign, I would need FS="\\$"?

This post was originally going to ask why this was happening. I believe I have since pieced things together. If I am wrong, please set me straight.

6
  • Your second case works by luck: de-escaped-with-warning (\./|\\.) means field delim is 'either any character and slash, or dot (by itself)'. It happens that in your input the only character that ever precedes slash is dot. Similarly (\./|\.) does indeed match any character (and every character) as a field delimiter. FYI you don't need the parentheses. For FS as a regex to match $ yes you must escape. Note however that if FS is a single character it is NOT treated as a regex, just a character, so the single character $ will also work. Commented Feb 22, 2016 at 5:08
  • Do you really want -not -type d rather than -type f ? Commented Feb 22, 2016 at 10:37
  • @symcbean: -not -type d does not mean -type f. Just as "not negative" does not mean "positive": it can be zero. Commented Feb 22, 2016 at 11:32
  • @cuonglm: quite aware of that, just wondering why Gregg wants to parse device nodes and pipes. Commented Feb 22, 2016 at 12:19
  • @symcbean: I had not even considered that what I was doing (by using -not -type d) would include those things. Honestly, I'm not really sure what they are and should read up on them. But I think it is safe to say that using -type f is what I was really after. Thanks! Wish I could upvote comments. Commented Feb 22, 2016 at 19:15

2 Answers

3

The FS value is scanned twice: first as a string literal, then as an ERE (see Lexical Conventions).

Also, POSIX does not specify the behavior of \c when c is not one of ", /, \, a, b, f, n, r, t, v, or ddd (where each d is an octal digit). So you cannot know whether the string \c will be passed to the ERE stage as \c or as c.

gawk, nawk, and Brian Kernighan's own version give you c, while mawk gives you \c:

$ for AWK in gawk mawk nawk bk-awk; do
  printf '<%s>\n' "$AWK"
  echo | "$AWK" -F '\.' '{print FS}'
done
<gawk>
gawk: warning: escape sequence `\.' treated as plain `.'
.
<mawk>
\.
<nawk>
.
<bk-awk>
.

Because \\ is always recognized as \, you are safe with \\c:

$ for AWK in gawk mawk nawk bk-awk; do
printf '<%s>\n' "$AWK"; echo | "$AWK" -F '\\.' '{print FS}'
done
<gawk>
\.
<mawk>
\.
<nawk>
\.
<bk-awk>
\.

The string value of \\c is \c, so using it as an ERE gives you the desired result.
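Applying the same rule to the dollar sign asked about in the question, the two scanning stages can be made visible with a small sketch. The function names show_fs and count_dollar_fields are made up for illustration, and the sample input a$b$c is an assumption, not something from the original post.

```shell
# Stage 1 (string scan): the literal "\\$" is reduced to the two characters \$.
show_fs() {
  awk 'BEGIN { FS = "\\$"; print FS, length(FS) }'
}

# Stage 2 (ERE scan): the two-character value \$ matches a literal dollar sign,
# so the number of fields is one more than the number of $ characters.
count_dollar_fields() {
  printf '%s\n' "$1" | awk 'BEGIN { FS = "\\$" } { print NF }'
}

show_fs                      # prints: \$ 2
count_dollar_fields 'a$b$c'  # prints: 3
```

Because the value that reaches the ERE stage is \$ in every implementation listed above, this form should not depend on how a given awk treats an unknown escape such as \$ inside a string.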

1
  • Thanks for your answer and the links where I could read more about how the conversions are done. I'm really trying to understand why certain things work instead of just accepting that it works. :) It took me a while to mark as answered because I was trying to read everything you pointed me to - to see if it would answer a different question (stackoverflow.com/questions/35564207/…). But it seems like the new problem I had was somewhat unrelated. Commented Feb 22, 2016 at 21:30
0

\x becomes one character in a double-quoted string (just like in most shells and C) before it's regarded as a regex, so you do need to type \\. in the string to end up with the regex \. (a literal dot).

Let's test that (you don't need the parentheses since the alternation operator | has the lowest precedence):

$ echo ./a.b.c | gawk 'BEGIN { FS = "\.|\./" } { for (i=1; i<=NF; i++) { print i ": " $i } }'
gawk: cmd. line:1: warning: escape sequence `\.' treated as plain `.'
1: 
2: 
3: 
4: 
5: 
6: 
7: 

The warning is telling you that the escape sequence in the string is superfluous. So FS is .|./ and you're splitting on every character, yielding a bunch of empty fields.

Now with the doubled-up \:

$ echo ./a.b.c | gawk 'BEGIN { FS = "\\.|\\./" } { for (i=1; i<=NF; i++) { print i ": " $i } }'
1: 
2: a
3: b
4: c
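If the doubled backslashes are hard to read, one alternative is a bracket expression: [.] matches only a literal dot and involves no escape sequence at the string stage. This is a different technique from the one shown above, offered as a sketch; the helper name count_fields is hypothetical.

```shell
# Split on "./" or "." without any backslashes.
# [.] is a bracket expression that matches only a literal dot.
count_fields() {
  printf '%s\n' "$1" | awk 'BEGIN { FS = "[.]/|[.]" } { print NF }'
}

count_fields ./filename.1.out   # prints: 4 (empty field, filename, 1, out)
```

The leading empty field is still counted, just as in the \\. version, because the input begins with the ./ separator.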
1
  • That's not always true. mawk is an exception. Commented Feb 22, 2016 at 10:46
