3

I have a log file which reports on the output of a process. I'd like to extract all lines between the last occurrence of two patterns.

The patterns will be along the lines of:

Summary process started at <datestring>

and

Summary process finished at <datestring> with return code <num>

There will be several instances of these patterns throughout the file, along with a lot of other information. I'd like to print only the last occurrence.

I know that I can use:

sed -n '/StartPattern/,/EndPattern/p' FileName

to get lines between the patterns, but I'm not sure how to get the last instance.

sed or awk solutions would be fine.

Edit: I've not been clear at all about the behaviour that I want when multiple StartPatterns appear with no EndPattern, or if there's no EndPattern before the end of file, after detecting a StartPattern.

  • For multiple StartPatterns with missing EndPattern, I'd only like lines from the last StartPattern to the EndPattern.
  • For a StartPattern which reaches the EOF without an EndPattern, I'd like everything up to the EOF, followed by a warning that EOF was reached prematurely.

5 Answers

6

You can always do:

tac < fileName | sed  '/EndPattern/,$!d;/StartPattern/q' | tac

If your system doesn't have GNU tac, you may be able to use tail -r instead.
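As a quick sanity check of the pipeline above, here is a minimal sketch run against a made-up log. The sample contents and the function name last_block are assumptions for illustration only, and it assumes GNU tac is available:

```shell
# Hypothetical sample log: two complete blocks surrounded by noise.
sample=$(mktemp)
printf '%s\n' \
  'noise' \
  'StartPattern 1' \
  'a' \
  'EndPattern 1' \
  'noise' \
  'StartPattern 2' \
  'b' \
  'EndPattern 2' \
  'trailing noise' > "$sample"

# Reverse the file, drop everything before the first EndPattern,
# quit at the first StartPattern after it, then restore the order.
last_block() {
  tac < "$1" | sed '/EndPattern/,$!d;/StartPattern/q' | tac
}

last_block "$sample"
```

On this sample it should print only the second block (StartPattern 2, b, EndPattern 2), since sed stops reading as soon as it meets the first StartPattern in the reversed stream.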

You can also do it like:

awk '
  inside {
    text = text $0 RS
    if (/EndPattern/) inside=0
    next
  }
  /StartPattern/ {
    inside = 1
    text = $0 RS
  }
  END {printf "%s", text}' < filename

But that means reading the whole file.

Note that it may give different results from the tac approach if there's another StartPattern between a StartPattern and the next EndPattern, if the last StartPattern has no matching EndPattern, or if there are lines matching both StartPattern and EndPattern.

awk '
  /StartPattern/ {
    inside = 1
    text = ""
  }
  inside {text = text $0 RS}
  /EndPattern/ {inside = 0} 
  END {printf "%s", text}' < filename

Would make it behave more like the tac+sed+tac approach (except for the unclosed trailing StartPattern case).

That last one seems to be the closest to your edited requirements. To add the warning would simply be:

awk '
  /StartPattern/ {
    inside = 1
    text = ""
  }
  inside {text = text $0 RS}
  /EndPattern/ {inside = 0} 
  END {
    printf "%s", text
    if (inside)
      print "Warning: EOF reached without seeing the end pattern" > "/dev/stderr"
  }' < filename

To avoid reading the whole file:

tac < filename | awk '
  /StartPattern/ {
    printf "%s", $0 RS text
    if (!inside)
      print "Warning: EOF reached without seeing the end pattern" > "/dev/stderr"
    exit
  }
  /EndPattern/ {inside = 1; text = ""}
  {text = $0 RS text}'

Portability note: for /dev/stderr, you need either a system with such a special file, or an awk implementation that emulates it, like gawk, mawk or busybox awk. Beware that on Linux, if stderr is open on a seekable file, writing through the real /dev/stderr puts the text at the beginning of the file instead of at the current position within the file; the emulating awk implementations work around that issue.

On other systems, you can replace print ... > "/dev/stderr" with print ... | "cat>&2".
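For instance, the forward-reading awk version with the warning might look like this once the redirection is swapped for the pipe. This is only a sketch: the sample file and the function name last_block_warn are made up for the example.

```shell
# Hypothetical sample: the last StartPattern is never closed.
log=$(mktemp)
printf '%s\n' \
  'StartPattern 1' \
  'x' \
  'EndPattern 1' \
  'StartPattern 2' \
  'y' > "$log"

# Same logic as the awk version with the warning, but the message is
# piped to "cat>&2" so no /dev/stderr special file is needed.
last_block_warn() {
  awk '
    /StartPattern/ { inside = 1; text = "" }
    inside         { text = text $0 RS }
    /EndPattern/   { inside = 0 }
    END {
      printf "%s", text
      if (inside)
        print "Warning: EOF reached without seeing the end pattern" | "cat>&2"
    }' < "$1"
}

last_block_warn "$log"
```

Here standard output should carry the unterminated block (StartPattern 2, y) and the warning should go to standard error, so the two can still be redirected separately.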

3
  • "But that means reading the whole file." Doesn't tac need to read the whole file? Commented Jun 14, 2016 at 10:48
  • Thanks for the answer and explanation of behaviour. I've realised that my question was vague around expected behaviour in unusual cases, so have edited for clarity. You have my +1 already though. The file is unlikely to be huge, so reading it all won't be a problem. Commented Jun 14, 2016 at 11:42
  • @Arronical, see edit. Commented Jun 14, 2016 at 12:08
4

You can use GNU sed like so:

sed '/START/{:1;$!{/END/!{N;b1};h}};${x;p};d' file

This overwrites the hold space at every occurrence of the full multiline match, and prints the hold space at the end of the file.

This gives consistent behaviour in the following cases:

  • If both START and END are on the same line, that line is matched.
  • If further STARTs follow the initial START, everything up to END is matched.
  • If there is no END, the unterminated match is not printed; the last complete START-to-END occurrence is printed instead.
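To see the hold-space approach in action, here is a small sketch on made-up input. The sample lines and the function name last_complete_block are assumptions, and GNU sed is required for this one-line syntax:

```shell
# Hypothetical sample input; START/END stand in for the real patterns.
demo=$(mktemp)
printf '%s\n' noise 'START a' x 'END a' noise 'START b' y 'END b' tail > "$demo"

# Each complete START..END block is gathered in the pattern space with N
# and then copied over the hold space (h). On the last input line the
# hold space is swapped back in and printed (x;p).
last_complete_block() {
  sed '/START/{:1;$!{/END/!{N;b1};h}};${x;p};d' "$1"
}

last_complete_block "$demo"
```

On this input it should print START b, y and END b. One thing to watch: the h sits inside $!{...}, so a block whose END is the very last line of the input does not appear to get stored, and the previous complete block is printed instead.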
3
  • Thanks for the answer and explanation of behaviour. I've realised that my question was vague around expected behaviour in unusual cases, so have edited for clarity. You have my +1 already though. Commented Jun 14, 2016 at 11:41
  • 1
    (note that you can ping editors with @... even if it doesn't offer you completion) GNU specific: } not preceded by ;, labels followed by something, } followed by something. The portable equivalent would be sed -e '/START/{:1' -e '$!{/END/!{N;b1' -e '}' -e 'h;}' -e '}' -e '${x;p;}' -e d (or use separate lines instead of additional -e options). Commented Jun 14, 2016 at 13:33
  • Ah right, didn't know you can ping editor, thanks. I've deleted my comment from your answer. Thanks for the posix version as well. Commented Jun 14, 2016 at 13:36
0

With GNU sed, another solution could be (with variables P1/P2 as start/end patterns):

sed -n "/${P1}/,/${P2}/H; /${P1}/h; \${g;p}"

The main differences from @Stéphane Chazelas's solution are that here:

  • if there are multiple STARTs before the last END/EOF, we display from the last START until the last END/EOF.
  • any END on the same line as START is ignored
  • an END on the last input line is supported
  • if there is no END after the last START, we print from the last START until EOF
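A short sketch of how this behaves, using made-up sample files; the pattern values, file contents and the function name last_from_start are assumptions for illustration:

```shell
# Hypothetical patterns and sample inputs.
P1='START'
P2='END'
closed=$(mktemp)
unclosed=$(mktemp)
printf '%s\n' noise 'START a' x 'END a' noise 'START b' y 'END b' tail > "$closed"
printf '%s\n' 'START a' x 'END a' 'START b' y > "$unclosed"

# H appends every line of a START..END range to the hold space, h restarts
# the hold space at each START, and the last line prints what was kept.
last_from_start() {
  sed -n "/${P1}/,/${P2}/H; /${P1}/h; \${g;p}" "$1"
}

last_from_start "$closed"
last_from_start "$unclosed"
```

The first call should print START b, y, END b; the second should print START b and y, illustrating the "from last START till EOF" case.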
0

Here is a solution with awk:

awk '
  /EndPattern/   {recording = 0}
  recording > 0  {buffer = buffer $0 "\n"}
  /StartPattern/ {recording += 1; buffer = ""}
  END {
    printf "%s", buffer
    if (recording > 0)
      print "WARNING: missing EndPattern" > "/dev/stderr"
  }'

So, for the following input:

1
StartPattern
2
3
EndPattern
4
5
StartPattern
6
7
EndPattern
8

You would get the following output:

6
7

Replace StartPattern with ^StartPattern$ if you want an exact line match, and likewise for EndPattern. Note that recording+=1 and recording=1 behave the same here, because any EndPattern resets recording to 0 and every StartPattern clears the buffer.

1
  • Welcome to the site, and thank you for your contribution. Please note that as it is currently written, the answer is difficult to understand. You may want to split it into a block containing just the program, then your input example as it would appear in a text file, and then the resulting output. Commented Oct 19, 2023 at 8:35
0

I like Stéphane Chazelas's answer combining sed and tac; however, I don't like the part that uses $!d to delete the rest, as it is quite hard to read and understand.

I prefer the combination below, without $!d, which is easier to read and understand:

tac fileName | sed -n '/EndPattern/,$p; /StartPattern/q' | tac

Which means:

  • It reverses the file content (tac).
  • It finds the first EndPattern, then prints everything from there on (,$p).
  • When it reaches the first StartPattern, it quits (q).
  • It reverses the order again, back to normal.

However, performance-wise, I think Bruno's answer is better, as it doesn't need tac to reverse the content order. (On the other hand, it is a little harder to read, as it requires copying and swapping between the hold space and the pattern space...)
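One caveat worth checking before swapping one pipeline for the other: the two do not seem to behave the same when the last StartPattern is never closed. The sketch below uses a made-up sample file and hypothetical function names (with_delete, with_print) to compare them:

```shell
# Hypothetical sample: one complete block, then an unclosed StartPattern.
f=$(mktemp)
printf '%s\n' 'StartPattern 1' a 'EndPattern 1' 'StartPattern 2' b > "$f"

# Original form: lines before the first EndPattern (in reversed order)
# are deleted, so the q command is never reached for them.
with_delete() {
  tac "$1" | sed '/EndPattern/,$!d;/StartPattern/q' | tac
}

# -n/p form: q is tested on every line, including those seen before
# any EndPattern, so an early StartPattern ends the run with no output.
with_print() {
  tac "$1" | sed -n '/EndPattern/,$p; /StartPattern/q' | tac
}

with_delete "$f"
with_print "$f"
```

With this input the first function should print the earlier complete block (StartPattern 1, a, EndPattern 1), while the second should print nothing. When every StartPattern has a matching EndPattern, both forms give the same result.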
