6

I know this has been asked before, but my case is slightly different: I need to remove all comments, except where the # is escaped or otherwise not meant to start a comment (for example, between single or double quotes).

Starting with the following text:

test
# comment
comment on midline # comment
escaped hash "\# this is an escaped hash"
escaped hash "\\# this is not a comment"
not a comment "# this is not a comment - double apices"
not a comment '# this is not a comment - single apices'
this is a comment \\# this is a comment
this is not a comment \# this is not a comment

I would like to obtain

test
comment on midline
escaped hash "\# this is an escaped hash"
escaped hash "\\# this is not a comment"
not a comment "# this is not a comment - double apices"
not a comment '# this is not a comment - single apices'
this is a comment \\
this is not a comment \# this is not a comment

I tried

grep -o '^[^#]*' file

but this also deletes escaped hashes.

NOTE: the text I'm working on does contain escaped hashes (\#) but no double-escaped ones (\\#), so it does not matter to me whether those are kept or not. I guess it is neater to delete them, since in that case the hash is not actually escaped.

  • What should happen with the line gotcha "this # is a hash" Commented May 6, 2016 at 13:56
  • gotcha indeed. my bad. As the hash does not start a comment, the line should be displayed. Commented May 6, 2016 at 16:25
  • just the ones in the question text: single and double apices Commented May 6, 2016 at 16:37
  • Hey @don_crissti, Thank you for your question! I'm working on configuration files whose parameters won't span over multiple lines Commented May 6, 2016 at 16:42
  • @don_crissti You're of course right, let's say I do not need to cover all cases, just the ones in the example above. straight single or double quotes Commented May 6, 2016 at 16:51

3 Answers

5

With sed you could delete the lines that start with a # (preceded by zero or more blanks) and remove everything from a # onwards when that # is not preceded by a single backslash (and only if it's not between quotes[1]):

sed '/^[[:blank:]]*#/d
/["'\''].*#.*["'\'']/!{
s/\\\\#.*/\\\\/
s/\([^\]\)#.*/\1/
}' infile

1: this solution assumes a single pair of quotes on a line
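
As a rough illustration of that footnote, the sketch below wraps the same sed script in a function and feeds it three made-up lines. The name strip_simple and the sample lines are assumptions for the example, not part of the answer.

```shell
# strip_simple is a hypothetical wrapper; the sed script is the one shown above.
strip_simple() {
  sed '/^[[:blank:]]*#/d
/["'\''].*#.*["'\''"]/!{
s/\\\\#.*/\\\\/
s/\([^\]\)#.*/\1/
}'
}

# Made-up sample lines (assumptions for this demo):
#   1. an ordinary trailing comment
#   2. a comment that itself contains a second pair of quotes
#   3. an escaped hash with no quotes at all
printf '%s\n' \
  'port = 80 # plain comment' \
  'name = "a" # comment with "quotes"' \
  'path = \# escaped' | strip_simple
```

The first line loses its comment and the third is left alone, but the second keeps its comment: the quote-hash-quote test matches across both quoted strings, so the substitutions are skipped for that line.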

2

This is a more complicated problem than it sounds like, but not beyond the ability of regex. To analyze it: A whole line consists of non-commented text optionally followed by commented text. What can appear in non-commented text:

  1. Any character other than \, #, ', "
  2. \ followed by any character
  3. A quoted string, which starts and ends with " and may contain
    • A) any character other than \ or "
    • B) \ followed by any character
  4. A quoted string, which starts and ends with ' and may contain
    • any character other than '

(The difference in treatment of the two kinds of quote is based on how unix shells handle it - adjust to taste)

Translating that directly into regex, you want:

s/^([non comment])[comment]$/\1/
non comment = ([^\\"'#]|\\.|"([^\\"]|\\.)*"|'[^']*')*
              (11111111|222|3(AAAAAA|BBB)33|4444444)*
comment = #.*
Therefore
s/^(([^\\"'#]|\\.|"([^\\"]|\\.)*"|'[^']*')*)#.*$/\1/

For sed's default (basic) regex syntax, you need backslashes before the (, |, and ) characters (note that \| is a GNU extension):

s/^\(\([^\\"'#]\|\\.\|"\([^\\"]\|\\.\)*"\|'[^']*'\)*\)#.*$/\1/

And bash needs additional quoting:

sed 's/^\(\([^\\"'\''#]\|\\.\|"\([^\\"]\|\\.\)*"\|'\''[^'\'']*'\''\)*\)#.*$/\1/'
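
A minimal sketch of how this could be tried on a few lines from the question, assuming a sed that accepts -E (GNU and BSD sed do), which lets the unescaped form of the regex be used directly. The function name strip_comments is made up for the example, and the extra expression that deletes whole-line comments is borrowed from the other sed answer rather than being part of this one.

```shell
# strip_comments is a hypothetical helper name, not part of the answer.
# Assumes sed supports -E (extended regular expressions).
strip_comments() {
  sed -E \
    -e '/^[[:blank:]]*#/d' \
    -e 's/^(([^\\"'\''#]|\\.|"([^\\"]|\\.)*"|'\''[^'\'']*'\'')*)#.*$/\1/'
}

# A few lines from the question; the quoted heredoc keeps backslashes literal.
strip_comments <<'EOF'
test
# comment
comment on midline # comment
escaped hash "\# this is an escaped hash"
not a comment '# this is not a comment - single apices'
this is a comment \\# this is a comment
this is not a comment \# this is not a comment
EOF
```

Note that the substitution keeps whatever precedes the #, so a line such as "comment on midline # comment" ends with a trailing space afterwards.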

EDIT: I hadn't realized grep -o existed until I saw @StéphaneChazelas' answer. The same core regex can be adapted to this approach, and grep -E (egrep) lets you avoid most of the extra backslashes:

grep -Eo '^([^\\"'\''#]|\\.|"([^\\"]|\\.)*"|'\''[^'\'']*'\'')*'
grep -Eo "^([^\\\\\"'#]|\\\\.|\"([^\\\\\"]|\\\\.)*\"|'[^']*')*"

Both of these are identical in meaning (and fortuitously the same length); they only differ in their approach to shell quoting. I personally prefer the first because the single quote is the only character I have to worry about, but you may find the second more readable, and it closely resembles what you would write in other programming languages.

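One caveat is that the regex doesn't know what to do with lines that contain mismatched quotes. Nothing from the unmatched quote onwards can match, so the sed command won't remove a comment that follows it, whereas the grep command will print only the text before the unmatched quote (and nothing at all if the line starts with it).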
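
To see this on a concrete line, here is a small sketch that runs both commands on the same input. The sample line and the function names with_sed and with_grep are assumptions for the demo; it also assumes sed -E is available and that grep supports -o (a GNU extension).

```shell
# Hypothetical sample line: the double quote is never closed.
line='name = "unterminated # comment'

# with_sed / with_grep are made-up names wrapping the two commands.
with_sed() {
  sed -E 's/^(([^\\"'\''#]|\\.|"([^\\"]|\\.)*"|'\''[^'\'']*'\'')*)#.*$/\1/'
}
with_grep() {
  grep -Eo '^([^\\"'\''#]|\\.|"([^\\"]|\\.)*"|'\''[^'\'']*'\'')*'
}

# sed: the pattern cannot reach the # past the stray quote, so nothing is replaced.
printf '%s\n' "$line" | with_sed

# grep -o: prints only the part of the line that the pattern managed to match.
printf '%s\n' "$line" | with_grep
```

Comparing the two outputs shows how differently the tools fail on the same malformed line, which matters if the input cannot be trusted to have balanced quotes.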

  • No need to double the backslashes inside [...]. You can also use both '...' and "..." quoting as in grep -Eo '^(\\.|"(\\.|[^\"])*"|'"'[^']*'|[^'#"'\"])*' Commented May 6, 2016 at 20:17
  • -o is a GNU extension to grep, but the OP's already using it so presumably his implementation is GNU's. Commented May 6, 2016 at 20:18
  • @StéphaneChazelas I actually had considered including an example of that sort of quote-mixing, but I didn't want to go into too long a discussion of shell quoting techniques, especially since this isn't code golf. As for the backslash, I wasn't sure, since you can put \] and \- in a character class. Commented May 6, 2016 at 20:56
  • grep -E '[\]]' matches on \] Commented May 6, 2016 at 21:12
0

This command should work for the simple cases (it does not take quotes into account):

sed -e '/^#/d' -e 's/\([^\\]\)#.*$/\1/' <file-path>

  • Thank you @favadi. your sed oneliner answers to my original question - what about the amended one :) Commented May 6, 2016 at 16:28
