3

I have some text like this:

Sentence #1 (n tokens):
Blah Blah Blah
[...
 ...
 ...]
( #start first set here
 ... (other possible parens and text here)
 ) #end first set here

(...)
(...)

Sentence #2 (n tokens):

I want to extract the second set of parens (including everything in between), i.e.,

(
 ... (other possible parens here)
)

Is there a bash way to do this? I tried the simple

 's/(\(.*\))/\1/'
5
  • Regular expressions cannot handle "matching parentheses" -- they are mathematically incapable of it. Commented Sep 25, 2014 at 21:20
  • I don't think that is the case, because I have extracted the lines above with "[...]". Plus, I am not looking to match the parens, just an aggressive match that skips the blank line after. If this is absolutely not possible with sed, what alternatives do you suggest? Commented Sep 25, 2014 at 21:23
  • Are the opening and closing parens alone on their own lines like you show here? Commented Sep 25, 2014 at 21:26
  • Pretty much, it's like "(ROOT" and "(. .)))". This is a sentence parsed using the Stanford parser. If I can write one for the simpler case, I can modify it for the specific case. Commented Sep 25, 2014 at 21:29
  • @glennjackman There is a complication: things like Perl regular expressions are not regular expressions in the mathematical sense; they can do much more. In most cases your point is true anyway, it's just not that easy to tell. Commented Sep 26, 2014 at 6:10
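
A small check of why the attempt in the question falls short, assuming the expression was run through sed as the comments suggest. sed applies the substitution to one line at a time, so ".*" never spans a newline, and in sed's basic syntax "(" is a literal while "\(...\)" is the capture group. The three-line sample below is made up for illustration:

```shell
# Hypothetical sample: one multi-line group containing a nested group.
sample='(
 a (b)
)'

# sed runs the substitution separately on each of the three lines.
sed_result=$(printf '%s\n' "$sample" | sed 's/(\(.*\))/\1/')
printf '%s\n' "$sed_result"
```

Only the inner "(b)" on the middle line is rewritten, to "b". The lines holding the outer "(" and ")" are left untouched, so the multi-line group is never seen as a unit.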

2 Answers

8

This will do it. There's probably a better way, but this is the first approach that came to mind:

echo 'Sentence #1 (n tokens):
Blah Blah Blah
[...
 ...
 ...]
(
 ... (other possible parens here)
 )

(...)
(...)

Sentence #2 (n tokens):
' | perl -0777 -nE '
    $wanted = 2; 
    $level = 0; 
    $text = ""; 
    for $char (split //) {
        $level++ if $char eq "(";
        $text .= $char if $level > 0;
        if ($char eq ")") {
            if (--$level == 0) {
                if (++$n == $wanted) { 
                    say $text;
                    exit;
                }
                $text="";
            }
        }
    }
'

outputs

(
 ... (other possible parens here)
 )
2
  • I think I should actually sit down and learn Perl now. Thanks, and sorry, I cannot vote up yet! Commented Sep 25, 2014 at 21:58
  • +1 I once wrote a (completely untested) BNF-like Perl grammar for generic parenthetical constructs that might also be relevant. Commented Sep 26, 2014 at 3:30
4

Glenn's answer is good (and probably faster for large input), but for the record, what Glenn proposes is totally possible in bash too. It was a relatively simple matter to port his answer to pure bash in just a few minutes:

s='Sentence #1 (n tokens):
Blah Blah Blah
[...
 ...
 ...]
(
 ... (other possible parens here)
 )

(...)
(...)

Sentence #2 (n tokens):
'
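wanted=2   # which top-level group to print (1-based)
level=0    # current nesting depth
n=0        # number of top-level groups closed so far
text=""
for (( i=0; i<${#s}; i++ )); do
    char="${s:i:1}"
    if [[ $char == "(" ]]; then (( level++ )); fi
    if (( level > 0 )); then text+="$char"; fi
    if [[ $char == ")" ]]; then
        if (( --level == 0 )); then
            if (( ++n == wanted )); then
                echo "$text"
                break   # break rather than exit, so an interactive shell is not closed
            fi
            text=""
        fi
    fi
done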
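
For reuse, the same loop can be wrapped in a function that reads its input from stdin. This is only a sketch: the name nth_paren_group is made up here, and the extra guard that ignores a stray ")" seen at depth 0 is an addition that the loop above does not have.

```shell
# nth_paren_group is a hypothetical helper name, not part of the original answer.
# Usage: nth_paren_group N < file
nth_paren_group() {
    local wanted=$1 level=0 n=0 text="" s char i
    # Slurp all of stdin; read returns non-zero at EOF, which is expected here.
    IFS= read -r -d '' s || true
    for (( i = 0; i < ${#s}; i++ )); do
        char=${s:i:1}
        [[ $char == "(" ]] && (( ++level ))
        (( level > 0 )) && text+=$char
        # Added guard: a ")" seen at depth 0 is ignored instead of driving level negative.
        if [[ $char == ")" ]] && (( level > 0 )); then
            if (( --level == 0 )); then
                if (( ++n == wanted )); then
                    printf '%s\n' "$text"
                    return 0
                fi
                text=""
            fi
        fi
    done
    return 1   # fewer than N top-level groups were found
}
```

Something like "nth_paren_group 2 < parsed.txt" (file name hypothetical) would then print the second top-level group, and the return status tells the caller whether a group was found.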
