3

I've got a simple bash script to strip comments from a js file:

#!/bin/bash
sed -E '/^[[:blank:]]*(\/\/|#)/d;s/#.*//' $1 >> stripped.js

This works almost perfectly, except for comments that occur inline, such as:

// file-to-be-stripped.js
...
...
const someVar = 'var' // this comment won't be stripped
// this comment will be stripped

What am I missing to strip inline comments?

UPDATE:

What's really strange is that I fired up your example in an online bash shell and it works flawlessly! However, when I run the exact same code locally, it does not strip the inline comments. Any idea why this could be? I'm obviously missing something... very strange.

Here is my updated code:

My script: stripper.sh

#!/bin/bash
sed -E -e 's:(\s+(//|#)|^\s*(//|#)).*$::; /^$/d' $1 > "stripped.${1}"

My test file: test.js

// testies one
const testies = 'two'
console.log(testies) // three
// testies FOUR!?
console.log('Mmmmm toast') // I won't be stripped of my rights!

Then I execute ./stripper.sh test.js and the output is:

const testies = 'two'
console.log(testies) // three
console.log('Mmmmm toast') // I won't be stripped of my rights!

Any ideas why running the exact same code locally only strips whole-line comments, yet running it in the online bash interpreter works as expected? (Unfortunately I cannot share the exact link to my shell, because it is a bit.ly link and apparently that's a "no no" here.)

  • that removes the whole line. Commented Aug 4, 2017 at 15:42
  • What about lines that have const url = 'http://stackexchange.com' or const x = '###'? Commented Aug 4, 2017 at 17:05
  • Let's see here. Nope, they both pass with no problem, but the inline ones (console.log(something) // some comment) are still not being removed (locally, that is; like I said, it works perfectly with an online shell). My only guess is different versions of sed. Thoughts? Commented Aug 4, 2017 at 19:22
  • @archae0pteryx Which version of sed are you using (check with sed --version)? Try once with sed -E -e 's:(\s+(//|#)|^\s*(//|#)).*$::' -e '/^$/d' infile.txt Commented Aug 4, 2017 at 19:34
  • It may be that the online version is only POSIX. That is, it doesn't have GNU extensions or anything. Did you try my version online? I tried to avoid any extensions. In fact when I run each of them with GNU sed's --posix flag mine still works but the other behaves exactly as you describe. Commented Aug 4, 2017 at 19:36
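One plausible reading of the --posix observation above is that \s is a GNU extension, and a sed without it may treat \s as a literal s, so only whole-line comments match. Below is a minimal sketch of the asker's script with \s swapped for the POSIX class [[:blank:]]. The function name strip_comments is made up for illustration, and -E is assumed to be available (it is in GNU and BSD sed). Note that [[:blank:]] covers only space and tab, which is slightly narrower than \s.

```shell
# Hypothetical helper: same logic as stripper.sh, but with [[:blank:]]
# in place of \s so it does not depend on GNU regex extensions.
strip_comments() {
  sed -E -e 's:([[:blank:]]+(//|#)|^[[:blank:]]*(//|#)).*$::' -e '/^$/d' "$@"
}

# Feed one inline-comment line and one whole-line comment from test.js.
printf '%s\n' 'console.log(testies) // three' '// testies FOUR!?' | strip_comments
# prints: console.log(testies)
```

If this version behaves the same locally and online, the difference in \s support is the likely culprit.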

4 Answers

6

POSIXly, you'd do:

sed '
  s|[[:blank:]]*//.*||; # remove //comments
  s|[[:blank:]]*#.*||; # remove #comments
  t prune
  b
  :prune
  /./!d; # remove empty lines, but only those that
         # become empty as a result of comment stripping'

Which with GNU sed we can shorten to:

sed -E 's@[[:blank:]]*(//|#).*@@;T;/./!d'

Note that it would happily remove #things and //things that are not comments, as in:

const url = 'http://stackexchange.com';
x = "foo#bar";
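To see that caveat in action, here is a small sketch that wraps the POSIX script above in a shell function reading stdin. The name strip_naive is made up for illustration, and the in-script comments are dropped to keep the sketch short.

```shell
# Hypothetical wrapper around the POSIX sed script shown above.
strip_naive() {
  sed '
    s|[[:blank:]]*//.*||
    s|[[:blank:]]*#.*||
    t prune
    b
    :prune
    /./!d'
}

# Both lines are cut short, because sed has no notion of string literals.
printf '%s\n' "const url = 'http://stackexchange.com';" 'x = "foo#bar";' | strip_naive
# prints:
#   const url = 'http:
#   x = "foo
```

Lines that were already empty in the input are kept, since the t branch is only taken when a substitution happened.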

To ignore the #, // inside quotes, you could do:

perl -ne 'if (/./) {
   s{\s*(?://|#).*|("(?:\\.|[^"])*"|'"'(?:\\\\.|[^'])*'"'|.)}{$1}g;
   print if /./} else {print}'

On an input like:

#blah
// testies one
const testies = 'two';
console.log(testies) // three

const url = 'http://stackexchange.com';
x = "not#a comment";
y = "foo\"bar" # comment
y = 'foo\'bar' # it's a comment

It gives:

const testies = 'two';
console.log(testies)

const url = 'http://stackexchange.com';
x = "not#a comment";
y = "foo\"bar"
y = 'foo\'bar'

You might need to adapt this for the actual language of those files. I'm not aware that JavaScript supports # as a comment, except maybe for a first line starting with #! in Node.js.

3
sed -e '/^\/\//d' -e 's@\(.*\)[[:blank:]]\{1,\}//.*@\1@' your_file

This sed command deletes lines that begin with a comment and for inline comments it removes everything from the whitespace separating code from comment to the end of the line. It's POSIX (no GNU extensions used) and, per OP's original example and for ease of reading, this version only supports // comments (more on that below).

Details

This sed call includes two sed commands: a "delete on pattern match" and a substitution.

The former is /^\/\//d. The pattern ^\/\/ matches lines that begin with two slashes (e.g. "// foo bar"). Such lines are deleted and the next line is brought in immediately (i.e. the substitution is skipped).

The pattern in the substitution is \(.*\)[[:blank:]]\{1,\}//.*. Note: I'm using @ as a delimiter in order to avoid some of the character escaping that a / delimiter would require.

  • \( .. \) - anything matched within is available as a back reference
  • .* - match 0 or more characters (anything but newline); in the substitution section we can refer back to whatever is matched here because of the surrounding \( and \).
  • [[:blank:]] - a blank character (space or tab)
  • \{1,\} - match one or more of the thing that precedes it ([[:blank:]] in this case)
  • // - match two slashes (i.e. the beginning of a comment)
  • .* - same as above except not available as a back reference

The substitution part is just \1 which says to replace whatever we matched with the first backreference, i.e. the .* that preceded [[:blank:]].

So it works just how I described: for inline comments remove everything from the whitespace separating code from comment to the end of the line.

'#' Comments

With GNU sed, adding handling of # comments is just a matter of replacing // with the alternation (#|//) (or, if we need escaping, \(#\|\/\/\)). Doing it the POSIX way, though, is much more verbose because alternation is not supported in basic regular expressions. You could obviously brute-force it by repeating the existing sed commands with versions for #. Better yet, there's already an answer posted that shows a cleaner way to do it. Either way, I'll not repeat a solution here.
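For readers who want to see it anyway, the brute-force route could look roughly like the sketch below. It repeats each // command for #, using the simpler substitution form from the EDIT further down. The function name strip_both is made up for illustration.

```shell
# Hypothetical helper: each // command is duplicated for #.
strip_both() {
  sed -e '/^\/\//d' -e '/^#/d' \
      -e 's@[[:blank:]]\{1,\}//.*@@' \
      -e 's@[[:blank:]]\{1,\}#.*@@' "$@"
}

printf '%s\n' '# top' 'a = 1 # one' 'b = 2 // two' | strip_both
# prints:
#   a = 1
#   b = 2
```

Be aware that /^#/d would also delete a leading #! line, which may or may not be wanted.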

EDIT:

Having revisited this after much time, I realize that the substitution is more complicated than it needs to be and, as pointed out in the comments, it also misses certain corner cases (e.g. in "something // foo // bar", only "// bar" is removed).

I believe this is all we need...

sed -e '/^\/\//d' -e 's@[[:blank:]]\{1,\}//.*@@' your_file

That is, the substitution part says "at the first whitespace-slash-slash we encounter remove it and everything that follows while leaving any preceding text alone".
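The difference between the two substitutions can be checked with a short sketch. The function names strip_greedy and strip_first are made up for illustration; the sed commands are the original and the revised ones from this answer.

```shell
# Original form: the greedy \(.*\) swallows everything up to the LAST " //".
strip_greedy() { sed -e '/^\/\//d' -e 's@\(.*\)[[:blank:]]\{1,\}//.*@\1@'; }

# Revised form: the match starts at the FIRST blank-slash-slash.
strip_first() { sed -e '/^\/\//d' -e 's@[[:blank:]]\{1,\}//.*@@'; }

line='something // foo // bar'
printf '%s\n' "$line" | strip_greedy   # prints: something // foo
printf '%s\n' "$line" | strip_first    # prints: something
```

Both forms leave a line such as const url = 'http://stackexchange.com'; alone, because no blank precedes the slashes there.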

  • Why not simply 's@[[:blank:]]*//.*@@' for the second part? This will do the same as long as there is only one pair of slashes in the line. But I would suggest requiring at least one blank before the slashes, or you could end up with some lines ending with https: (-: Commented Aug 4, 2017 at 16:23
  • This errors out with "no such file". Is that because of the chaining, do you think? Commented Aug 4, 2017 at 16:23
  • Here's my exact code with your example: sed -E '/^\/\//d' -e 's@\(.*\)[[:blank:]]*//.*@\1@' $1 > "stripped.${1}" Commented Aug 4, 2017 at 16:24
  • I didn't include the file because it's implied. And someone might want to pipe the file through rather than include it as an argument. Commented Aug 4, 2017 at 16:24
  • @Jun Yes. Done. Commented Apr 11, 2019 at 4:57
0

Using GNU sed, we can write a mini-parser that filters out both C++-style comments (//) and sh-style comments (#).

To make the construction modular and scalable, we utilize canned regexes defined in shell variables and appropriately quoted.

The sed code lets empty/blank lines pass through. Then it looks for unbalanced double quotes in a line and keeps appending the next line until the quotes are balanced. This handles quoted strings that spill over onto multiple lines.

Same is done for single quotes as well.

Next we look for any continuation lines, identified via a trailing backslash.

Finally we keep skipping over quoted words or barewords that are not comments.

If after this transformation we are left with nothing, we delete the line; otherwise we print the decommented line to stdout.

P.S.: We use a mixture of single and double quotes in the sed -e ... argument to work around bash history expansion on interactive command lines, where the ! character inside double quotes cannot be suppressed; hence we put it in single quotes.

# symbol names
q=\' Q=\"
d=\$ b=\\
B=$b$b

# construct regexes using symbolic names
single_quotes_open="$q[^$b$q]*($B.[^$b$q]*)*$d"
single_quoted_word="$q[^$b$q]*($B.[^$b$q]*)*$q"
double_quoted_word="$Q[^$b$Q]*($B.[^$b$Q]*)*$Q"
double_quotes_open="$Q[^$b$Q]*($B.[^$b$Q]*)*$d"
quoted_word="$double_quoted_word|$single_quoted_word"

# decomment a c++ file
sed -Ee '
   /\S/!b'"
   :a;/(^|\s)$double_quotes_open/{N;ba;}
   :b;/(^|\s)$single_quotes_open/{N;bb;}
   :c;/$B$d/{N;bc;}
   s_\s*(//|#).*|($quoted_word|.)_\2_g
   "'/\S/!d
' c_file
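To check the quoting, it can help to print what the canned regexes expand to before handing them to sed. Here is a small sketch reusing the symbol names and two of the definitions above; only the printf call is added.

```shell
# Same symbolic names as above.
q=\' Q=\"
d=\$ b=\\
B=$b$b

double_quoted_word="$Q[^$b$Q]*($B.[^$b$Q]*)*$Q"
double_quotes_open="$Q[^$b$Q]*($B.[^$b$Q]*)*$d"

# Print the regex text that sed will actually receive.
printf '%s\n' "$double_quoted_word" "$double_quotes_open"
# prints:
#   "[^\"]*(\\.[^\"]*)*"
#   "[^\"]*(\\.[^\"]*)*$
```

Read as an extended regex, the first pattern is an opening quote, a run of characters that are neither a backslash nor a quote, any number of backslash-escaped characters each followed by another such run, and a closing quote. The second is the same but runs to the end of the line instead of a closing quote, which is how an unbalanced quote is detected.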
0

If you want to strip comments from a source file, you can try my comcat tool. The latest nightly build is available on GitHub.

  • It can display only the comments or display everything but the comments.
  • It is a very young project, so expect some bugs.

I do realize this is a question about sed. If you do not find this answer useful, it can be removed.

Disclaimer: I am the maintainer of comcat.
