33

Is it possible to do a multiline pattern match using sed, awk or grep? For example, I would like to get all the lines between { and }.

So it should be able to match

 1. {}
 2. {.....}
 3. {.....
.....}

Initially the question used <p> as an example. I have edited the question to use { and } instead.

5
  • afaik, you can do it with perl regex but not with sed/awk/grep. Commented Mar 28, 2011 at 11:19
  • 1
    @forcefsck> You can do multiline pattern matching with 'sed' and 'awk', but in both cases you need more than a single command... Commented Mar 29, 2011 at 15:03
  • 1
    don't ask like "is it possible to use sed to do ...." you can use sed to do anything within the area of text processing. LOL Commented Oct 6, 2013 at 3:13
  • @CiroSantilli - there's nothing wrong with a similar Q showing up on the various SE sites, only if the original poster posted the identical Q on multiple sites. Commented Sep 16, 2014 at 1:34
  • @sim I did not mean to imply that =) Commented Sep 16, 2014 at 6:28

5 Answers

26

While I agree with the advice above that you'll want a real parser for anything more than tiny or completely ad-hoc jobs, it is (barely ;-) possible to match multi-line blocks between curly braces with sed.

Here's a debugging version of the sed code

sed -n '/[{]/,/[}]/{
    p
    /[}]/a\
     end of block matching brace

    }' *.txt

Some notes,

  • -n suppresses the default printing of each line as it is processed.
  • 'p' prints the current line.
  • The construct /[{]/,/[}]/ is a range expression. sed scans until a line matches the first pattern (/[{]/), then applies the commands inside the script's { } block to that line and to every following line, up to and including the first line that matches the second pattern (/[}]/). In this case the commands are 'p' and the debugging code (not explained here; use it, modify it, or take it out as works best for you).

You can remove the /[}]/a\ end-of-block debugging lines once you have satisfied yourself that the code really matches blocks delimited by { and }.

This code sample will skip over anything not inside a curly brace pair. It will, as noted by others above, be easily confused if you have any extra { or } embedded in strings, regexps, etc., or where the closing brace is on the same line as the opening brace (with thanks to fred.bear).
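The same-line caveat is easy to reproduce. The sketch below is not part of the original answer: it feeds the sample text from fred.bear's comment through the plain range expression, then through one possible workaround that prints one-line blocks first and deletes them before the range address can see them. It assumes at most one block per line and no nested braces.

```shell
# Sample input taken from the test script in the comments:
# one block on a single line, one block spanning two lines.
sample='fred
{block 1}
betty
{block 2 line 1
 block 2 line 2}
barney'

# Plain range: once "{block 1}" opens the range, sed only looks for "}"
# on later lines, so "betty" is printed as well.
range_out=$(printf '%s\n' "$sample" | sed -n '/[{]/,/[}]/p')

# Workaround (an assumption, not the answer's code): print lines that hold
# a complete {...} and delete them, so they never start the range.
fixed_out=$(printf '%s\n' "$sample" |
  sed -n -e '/[{].*[}]/{p;d;}' -e '/[{]/,/[}]/p')

printf 'range only:\n%s\n\nwith workaround:\n%s\n' "$range_out" "$fixed_out"
```

With the workaround the "betty" line no longer appears, while the two-line block is still printed in full.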

I hope this helps.

3
  • 1
    > When a range expression matches the first pattern, it will only start searching for a match to the second pattern after it has finished processing that current line... This means that if { and } are on the same line, things are going to get messy... here is as test script which shows it: div==========; echo $div in; text="fred\n{block 1}\nbetty\n{block 2 line 1\n block 2 line 2}\nbarney"; echo -e "$text\n$div out"; echo -e "$text" |sed -n '/[{]/,/[}]/{'$'\n''p'$'\n''/[}]/a\'$'\n''end of block matching brace'$'\n''}'; echo "$div" ... The "betty" line shouldn't be there. Commented Mar 29, 2011 at 13:54
  • @fred.bear :You're definitely right. I have extended my cautionary last paragraph to mention this. Thanks! Commented Mar 29, 2011 at 17:13
  • Do you have a link to docs explaining this "range expression"? Because I cannot find it anywhere, I only get stuff about e.g [1-9] type "range expressions" when searching for that term. EDIT: Ahh maybe here? pement.org/sed/sedfaq4.html#s4.23.1 Commented Nov 12, 2023 at 22:40
15

You can use the -M (multiline) option for pcregrep:

pcregrep -M '\{(\s*.*\s*)*\}' test.txt

\s is whitespace (including newlines), so this matches zero or more occurrences of (whitespace followed by .* followed by whitespace), all enclosed in braces.

Update:

This should do the non-greedy matching:

pcregrep -n -M '\{(\n*.*?\n*)*?\}' test.txt
4
  • > It seems like a handy tool... Yes, it is being greedy... Can you show how to invert the greedy nature? ... and I noticed in my Ubuntu man pcregrep: ...8K characters are available for forward matching, and 8K for previous matching... Commented Mar 29, 2011 at 14:33
  • Adding a ? after a quantifier makes it non-greedy. (asdf)* is greedy, and (asdf)*? is non greedy. Commented Mar 30, 2011 at 13:17
  • > Thanks.. It's brilliant... It works as "advertised" and with (optional) line numbers! :) Commented Mar 30, 2011 at 14:31
  • Thanks for mentioning pcregrep! This was the only tool with which I succeeded in eliminating arbitrary multiline patterns from multiline input strings (with pcregrep -v -M -F -- "$pattern") Commented Mar 2, 2016 at 23:19
7

XML-like expressions (infinitely recursive tags) are not a 'regular language' and therefore cannot be parsed with regular expressions (regex). Here's why:

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/

http://www.perlmonks.org/?node_id=668353

https://stackoverflow.com/questions/1379524/textual-protocol-which-is-not-a-regular-language

3
  • FYI, I used <p></p> just as an example. Let's change this to a C block {,}. Will I still be able to do pattern matching to extract a C block, which can span multiple lines? Commented Mar 28, 2011 at 12:38
  • @johnsamuel: One problem is that unless you can fully parse a particular language, you can't tell if "{" (for example) is part of a comment or a quoted literal or is actually the start of your "block"... It only takes one misinterpretation to upset everything Commented Mar 28, 2011 at 13:08
  • 1
    (1) The slogan about non-regular languages concerns a technical notion of regular expressions that is more limited than most regex engines now in use. (2) The question was about what can be done with sed or awk, not with their regex engines specifically. And these languages are Turing complete. I'm not saying that writing anything beyond a trivial parser in them is going to be pretty or efficient. Commented Apr 19, 2012 at 21:20
7

parser.awk:

#!/usr/bin/awk -f
function die(msg) { print msg > "/dev/stderr"; exit 1 }
BEGIN {
  FS=opener
  if (mode=="l") linewise=1
  else if (mode=="i") trim_closer=length(closer)
  else if (mode!="a") die("mode must be one of: l,i,a")
}
{
  live=level
  for (f=1; f<=NF; f++) {
    if (f>1) {
      live=++level
      if (mode=="i" && level>1 || mode=="a") printf "%s", opener
    }
    cur=$f
    level-=gsub(closer, "", cur)
    if (level<0) die("Unbalanced")
    if (!linewise) {
      cur=$f
      if (sub(".*" closer, "", cur)) printf "%s", 
        substr($f, 1, length($f) - length(cur) - (level ? 0 : trim_closer))
      else if (live) printf "%s", $f
    }
  }
  if (live) {
    if (linewise) print
    else print ""
  }
}
END { if (level>0) die("Unbalanced") }

Call as awk -v'opener={' -v'closer=}' -v'mode=a' -f parser.awk. If mode is a, it prints the brackets and contents of all outermost, balanced {...}; if mode is i, it prints only their contents; if mode is l, it prints complete lines where an outermost {...} begins, remains open, or closes.
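For readers who only want the depth-counting idea behind the script, here is a much smaller sketch. It is not a replacement for parser.awk: it handles only single-character { and } delimiters, always prints the braces with the contents (roughly what mode a does), and silently ignores a stray closing brace. The function name extract_blocks is made up for this illustration.

```shell
# extract_blocks (hypothetical name): print every outermost {...} block,
# braces included, by counting nesting depth one character at a time.
extract_blocks() {
  awk '
    {
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        if (c == "{") depth++              # entering a (possibly nested) block
        if (depth > 0) buf = buf c         # collect text only while inside a block
        if (c == "}" && depth > 0) {
          depth--
          if (depth == 0) { print buf; buf = "" }   # outermost block just closed
        }
      }
      if (depth > 0) buf = buf "\n"        # block continues on the next line
    }
    END { if (depth > 0) { print "Unbalanced" > "/dev/stderr"; exit 1 } }
  '
}

# Usage: one nested block on a single line, one block spanning two lines.
printf 'a {x {y} z} b\n{p\nq}\n' | extract_blocks
# prints:
# {x {y} z}
# {p
# q}
```

Keeping a single depth counter is what lets this cope with nesting, which a lone regular expression cannot do; braces inside strings or comments will still confuse it.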

1

Regular expressions cannot find matching nested parentheses.

If you are certain that no pair of braces is nested inside the one you are searching for, you can match up to the first closing brace. For example:

sed -r 's#\{([^}]*)\}#\1#'

This replaces the first '{...}' on each line with the text between the braces.
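Because sed reads one line at a time, a block that spans several lines has to be joined in the pattern space before a substitution can see both braces. A minimal sketch of that idea, assuming no nested braces and at most one block per line; the helper name strip_braces is invented for this example:

```shell
# strip_braces (hypothetical name): print only the text between { and },
# joining lines with N until the closing brace has been read.
# A block that is still open at end of input is dropped.
strip_braces() {
  sed -n '
    /[{]/{
      :join
      /[}]/!{
        N
        b join
      }
      s/^[^{]*[{]\([^}]*\)[}].*$/\1/p
    }'
}

# Usage: a one-line block, a line to skip, and a two-line block.
printf 'foo {abc} bar\nskip me\n{p\nq}\n' | strip_braces
# prints:
# abc
# p
# q
```

The loop keeps appending the next input line while the pattern space has no closing brace, so the final substitution always runs on text that contains a complete pair.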

1
  • 1
    > s#\{([^}])\}#\1# will only match a single non-} char... It needs a zero-to-many * wildcard after the closing square bracket ]*... and also, sed always operates only on a single line, unless you do some buffer hold of the line and sub-process subsequent lines until you find a matching '}' Commented Mar 29, 2011 at 8:53