Print content between first matching brackets

Question

Input example:

START{
    some text

    {
      more text}
almost there
}
nothing important{
...

Desired output:

START{
    some text

    {
      more text}
almost there
}

First open bracket could be in different positions:

START{...
START {...
START
{...

Start could also contain special characters such as: *

I want to print everything from START through the closing bracket that matches the first { after it (in bash). I was thinking of using a counter that increments on each { and decrements on each }; when it returns to zero, printing stops (the curly brackets are always balanced).

Can the real text represented by your placeholders (some text, more text, almost there, nothing important) include any of START, {, or }? For example, if this were a programming language you were trying to parse, there might be strings (e.g. "{" or "where is START?") or comments (# { or // { or // not really START) that contain any of those but that you would not want to consider when counting.
– Ed Morton, Commented Jan 8, 2021 at 16:16
No, there is only one START. Curly brackets can be nested, but they are always in pairs. Of course there could be more text or more brackets.
– GeoCap, Commented Jan 8, 2021 at 16:21
If START can really be STA*RT or similar, containing regexp metachars or typical delimiters, then show that in your sample input/output rather than just the sunny-day alphabetic chars case.
– Ed Morton, Commented Jan 8, 2021 at 18:28

Ed Morton · Accepted Answer · 2021-01-08 18:22:38Z

3

A simple brute force approach that'll work in any awk in any shell on all Unix boxes:

$ cat tst.awk
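# Once START is found, drop everything before it and start collecting.
s=index($0,"START") { $0=substr($0,s); f=1 }
# Append every line from START onward to one string.
f { rec = rec $0 RS }
END {
    # Walk the collected text one character at a time, tracking brace depth.
    len = length(rec)
    for (i=1; i<=len; i++) {
        char = substr(rec,i,1)
        if ( char == "{" ) {
            ++cnt
        }
        else if ( char == "}" ) {
            # Depth back at zero: this } closes the first {.
            if ( --cnt == 0 ) {
                print substr(rec,1,i)
                exit
            }
        }
    }
}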

$ awk -f tst.awk file
START{
    some text

    {
      more text}
almost there
}
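
Because index() searches for a literal string rather than a regexp, the same approach should carry over to a start word that contains characters such as *. A minimal sketch, assuming the start word is handed to awk through an environment variable; the names START_WORD and print_first_block are made up for this illustration and are not part of the answer above. ENVIRON is used instead of -v because -v would interpret backslash escapes in the value:

```shell
# Hypothetical wrapper: print from the literal start word through the
# closing brace that matches the first opening brace after it.
# $1 is the start word; any further arguments are input files
# (standard input is read when none are given).
print_first_block() {
    _word=$1
    shift
    START_WORD=$_word awk '
        # index() does a literal search, so * and friends are not special.
        !f && (s = index($0, ENVIRON["START_WORD"])) { $0 = substr($0, s); f = 1 }
        f { rec = rec $0 RS }
        END {
            len = length(rec)
            for (i = 1; i <= len; i++) {
                char = substr(rec, i, 1)
                if (char == "{") {
                    ++cnt
                }
                else if (char == "}") {
                    if (--cnt == 0) {
                        print substr(rec, 1, i)
                        exit
                    }
                }
            }
        }
    ' "$@"
}

# Usage on an inline sample instead of a file:
printf '%s\n' 'junk STA*RT {' '  a {b}' '}' 'tail {' | print_first_block 'STA*RT'
```

Like the original script, this relies on the brackets being balanced and on there being a single start word.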

edited Jan 8, 2021 at 18:22

answered Jan 8, 2021 at 17:42

Ed Morton


Stéphane Chazelas · Accepted Answer · 2021-01-08 18:39:21Z

2

With pcregrep:

start_word='START'
pcregrep -Mo "(?s)\Q$start_word\E\h*(\{(?:[^{}]++|(?1))*+\})" < your-file

With zsh builtins:

set -o rematchpcre
start_word='START'
[[ $(<your-file) =~ "(?s)\Q$start_word\E\h*(\{(?:[^{}]++|(?1))*+\})" ]] &&
  print -r -- $MATCH

Those use PCRE's recursive regexp feature, where (?1) above recalls the regexp in the first (...) pair.

If you have neither pcregrep nor zsh, you can always resort to the real thing (perl, the P in PCRE):

perl -l -0777 -sne '
    print $& if /\Q$start_word\E\h*(\{(?:[^{}]++|(?1))*+\})/s
  ' -- -start_word='START' < your-file

(note that all but the perl one assume the $start_word doesn't contain \E).

edited Jan 8, 2021 at 18:39

answered Jan 8, 2021 at 17:25

Stéphane Chazelas


Your solution does work; it's my fault that I forgot to add some details, such as that the first { could begin in a new line.
– GeoCap, Commented Jan 8, 2021 at 17:56
@GeoCap try changing START to START\s* in the regexp to allow for any optional white space between START and {.
– Ed Morton, Commented Jan 8, 2021 at 17:58
@EdMorton, I've now updated it, but used \h instead, which matches horizontal spacing only and so excludes \r and \n, to avoid matching on START<newline>{...}. @GeoCap, replace it with \s if that's actually what you want.
– Stéphane Chazelas, Commented Jan 8, 2021 at 18:14
@StéphaneChazelas I suggested \s because GeoCap specifically said in their comment that the { could be on the line after START: "...first { could begin in a new line.".
– Ed Morton, Commented Jan 8, 2021 at 18:17
Thanks, my last issue is that START can contain characters that need to be escaped, as in STA*RT. I changed the code so I pass in a string /$VAR\s... Is there a way to take the string literally, without needing to write STA*RT in escaped form?
– GeoCap, Commented Jan 8, 2021 at 18:17
