Capturing multiline region defined by start and end patterns

Question

I want to print an intervening section (between a start pattern and an end pattern) from a file, with particular lines getting coloured.

Here is a sample text in one such file

## Beginning of file

Some text and code

## FAML [ASMB] KEYWORD
##  Some information.
##  Some other text.
##  Blu:
##  Some text in blue.
## END OF FAML [ASMB]

## Other text

More text and code

The text between ## FAML [ASMB] KEYWORD and ## END OF FAML [ASMB] is to be extracted (without the beginning ##) and passed to the function luciferin which will print the multiline text appropriately.

The text between blocks is discarded. Subsequent blocks work the same way: the intervening region is extracted and printed by calling luciferin(rec), which does the output in colour.

The input string for luciferin would be

Some information.
Some other text.
Blu:
Some text in blue.

Here is the awk script that captures the intervening region

BEGIN {
  beg_ere = "## [[:alnum:]]+ [[][[:alnum:]]+[]]"
  end_ere = "## END OF [[:alnum:]]+ [[][[:alnum:]]+[]]"
 }

match($0, beg_ere, paggr) { display = 1 }
$0 ~ end_ere { display = 0 ; next }
display { print }

Here is the luciferin function, which takes a string and outputs it in colour. Here cpt is the colour escape sequence, and astr[i] is line i of the multiline input string.

function luciferin(mstr) {
  cpt = tseq["Grn:"]
  nlines = split(mstr, astr, "\n")
  for (i = 1; i <= nlines; i++) {
    for ( knam in tseq ) {
      if ( knam == astr[i] ) { cpt = tseq[knam] ; break }
     }
    if (knam == str) { print "" } else { print cpt astr[i] rst }
   }

 }

Yes, this is possible. How to do it depends on what you want to do, and the best way to show that is by including an example in your question. Showing both example input and the expected output given that input ensures that we can test our solutions for correctness.
– Kusalananda ♦, Commented Mar 9, 2023 at 11:43
Did you try yourself with some sed '/start-pattern/,/end-pattern/!d' or something?
– Philippos, Commented Mar 9, 2023 at 13:43

Ed Morton · Accepted Answer · 2023-03-10 00:34:18Z

Since there's neither a minimal complete example of code, nor adequate sample input/output to test with, this is obviously just an untested guess but it looks like you should change:

display { print }

to

display { rec = rec $0 ORS }

and

$0 ~ end_ere { display = 0 ; next }

to

$0 ~ end_ere { luciferin(rec); rec = ""; display = 0 ; next }

or similar and tweak luciferin to remove the additional trailing newline from its arg before printing.
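Put together, the two changes could look roughly like the sketch below. It reuses the question's two patterns but tests them with `$0 ~ ...` instead of the three-argument match(), which is a gawk extension. The wrapper name extract_blocks, the added `next` on the start line (so the header itself is not stored), and the placeholder luciferin body that only prefixes each line with "Luci:" are assumptions made for illustration, not the asker's real code.

```shell
# Sketch only: extract_blocks is a hypothetical wrapper name, and this
# luciferin body is a stand-in for the real colouring function.
extract_blocks() {
  awk '
    BEGIN {
      beg_ere = "## [[:alnum:]]+ [[][[:alnum:]]+[]]"
      end_ere = "## END OF [[:alnum:]]+ [[][[:alnum:]]+[]]"
    }

    # End of block: hand the stored text to luciferin, then reset.
    $0 ~ end_ere { luciferin(rec); rec = ""; display = 0; next }

    # Start of block: begin storing from the next line onwards.
    $0 ~ beg_ere { display = 1; next }

    # Inside a block: append the line plus a newline to the record.
    display { rec = rec $0 ORS }

    function luciferin(mstr,    n, i, astr) {
      sub(/\n$/, "", mstr)            # drop the extra trailing newline
      n = split(mstr, astr, "\n")
      for (i = 1; i <= n; i++) print "Luci:", astr[i]
    }
  ' "$@"
}
```

Calling `extract_blocks file` (or piping text into it) prints one "Luci:" line per stored line of each block; text outside the blocks never reaches luciferin.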

Regarding how the question and the OP's questions in general could be improved - here's what a complete, minimal code sample would look like in a question such as this one:

$ cat tst.awk
$2 == "FAML" { display = 1 ; next }
$2 == "END" { display = 0 ; next }
display { print }

function luciferin(mstr) {
    nlines = split(mstr, astr, "\n")
    for (i = 1; i <= nlines; i++) {
        print "Luci:", astr[i]
    }
}

and some sample input to demonstrate your needs and test with:

$ cat input
## Beginning of file

Some text and code

## FAML [ASMB] KEYWORD
##  Some information.
##  Some other text.
## END OF FAML [ASMB]

## Other text

## FAML [ASMB] KEYWORD
##  Some other information.
##  Even more text.
## END OF FAML [ASMB]

More text and code

and the expected output given that input:

Luci: ##  Some information.
Luci: ##  Some other text.
Luci: ##  Some other information.
Luci: ##  Even more text.

The fact that your real code does coloring or whatever else is utterly irrelevant to the problem you want help with, which is simply how to store a block of text and call luciferin() to print it modified in some way.

Given a clear, simple example like that we can very quickly show you a solution, e.g.:

$ cat tst.awk
$2 == "FAML" { display = 1 ; next }
$2 == "END" { luciferin(rec); rec = ""; display = 0 ; next }
display { rec = rec $0 ORS }

function luciferin(mstr) {
    nlines = split(mstr, astr, "\n")
    for (i = 1; i < nlines; i++) {
        print "Luci:", astr[i]
    }
}

$ awk -f tst.awk input
Luci: ##  Some information.
Luci: ##  Some other text.
Luci: ##  Some other information.
Luci: ##  Even more text.

You can then take that away and apply its concepts to your real code.
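The question also asks for the stored text to be passed on without the leading ##. One way that might be added to the script above is to strip the marker from a copy of each line before appending it. This is a sketch under that assumption; the wrapper name strip_and_print is hypothetical, and luciferin is still the "Luci:" placeholder rather than the real colouring function.

```shell
# Sketch only: strip_and_print is a hypothetical wrapper name.
strip_and_print() {
  awk '
    $2 == "FAML" { display = 1; next }
    $2 == "END"  { luciferin(rec); rec = ""; display = 0; next }

    # Work on a copy so $0 and its fields stay untouched.
    display {
      line = $0
      sub(/^## */, "", line)          # remove the leading ## and blanks
      rec = rec line ORS
    }

    function luciferin(mstr,    n, i, astr) {
      n = split(mstr, astr, "\n")
      # i < n skips the empty element left by the trailing newline.
      for (i = 1; i < n; i++) print "Luci:", astr[i]
    }
  ' "$@"
}
```

Run as `strip_and_print input`, this should print the same four lines as before, but with "Luci: Some information." in place of "Luci: ##  Some information." and so on.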

The fact your code continues after the end_ere matches tells me your real input contains multiple blocks to extract but your sample input only has 1 block so we can't test what happens to the text between blocks, or what should separate output blocks, if anything, and you don't show the expected output at all, just the string you want to run your function on which isn't the same as the desired output from your script. Plus even if it was there I'd have to piece together the script fragments and add other code around it before I could test it.
– Ed Morton, Commented Mar 9, 2023 at 23:59
Please look at the minimal, complete code, sample input, and expected output that clearly and simply demonstrate all the requirements associated with the problem you're asking for help to solve. THAT is the kind of example you should provide in every question - minimal, complete, and testable.
– Ed Morton, Commented Mar 10, 2023 at 0:23

J_H · Accepted Answer · 2023-03-09 16:52:06Z

Tackling this in awk is certainly feasible, but it seems you're making this much too hard on yourself. Perl offers language support for such ranges directly, copied from the sed feature that was mentioned in the comments.

Let's color spring months blue.

$ cat months.txt | perl -ane 'print "blue" if /Mar/../May/; print "\t$_"'
        January
        February
blue    March
blue    April
blue    May
        June

Use FAML / ASMB keywords in those regexes to adapt this to your use case.

Even if you wish to do fancier processing than this, it is still a good initial stage in your pipeline.

Now a subsequent stage doesn't have to worry about line ranges; it can use first field to identify whether we're within range or not and then process the rest of the line accordingly.
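As a rough illustration of such a second stage, the sketch below reads lines in the shape the perl command above emits (an optional tag, a tab, then the original line) and colours only the tagged ones. The function name colour_tagged and the ANSI escape sequences for blue and reset are assumptions chosen for the example.

```shell
# Sketch only: colour_tagged is a hypothetical name; the escape codes
# (ESC[34m for blue, ESC[0m for reset) are assumed ANSI sequences.
colour_tagged() {
  awk '
    {
      tab  = index($0, "\t")          # position of the first tab
      tag  = substr($0, 1, tab - 1)   # "blue" inside the range, else ""
      rest = substr($0, tab + 1)      # the original line
      if (tag == "blue") print "\033[34m" rest "\033[0m"
      else               print rest
    }
  '
}

# Example: one line outside the range, one inside it.
printf '\tFebruary\nblue\tMarch\n' | colour_tagged
```

Splitting on the first tab by hand, rather than relying on $1 and $2, keeps any further tabs in the original line intact.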

Ahhhmmm: Why use cat vs perl -ane '...' months.txt?
– drewk, Commented Mar 10, 2023 at 2:17
@drewk, for pedagogical purposes. I was hoping to draw a neophyte into the practice of composing and iterating on pipelines, letting the 2nd stage focus on the transformation rather than the data source.
– J_H, Commented Mar 10, 2023 at 3:05
