
I'm trying to count the occurrences of a pattern that contains recursively nested parenthesized expressions. In my particular case, I want to count, per line or per file, the occurrences of (NP *) (VP *) (NP *). My example file contains the following (line 4 has a recursive case):

$ more mini.example 
    <parse> (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) </parse>
    <parse> (NP (NN opposition)) (XP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) </parse>
    <parse> (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) </parse>
    <parse> (NP (NN opposition)) (VP et) (NP gouvernement (NP (NN opposition)) (VP et) (NP gouvernement))  </parse>
    <parse> (NP (NN opposition)) (VP et) (FP gouvernement) (NP (NN opposition)) (RP et) (NP gouvernement) </parse>
    <parse> (NP (NN opposition)) (VP et) </parse>
    <parse> (VP et) (NP gouvernement) </parse>

I would like to have an output like this:

3 1
2 2
2 3
2 4
0 5
0 6

I tried this:

$ grep -Pon '(?<=\(NP ).*(?=\).*(?<=\(VP ).*(?=\).*(?<=\(NP ).*(?=\))))' mini.example | cut -d : -f 1 | uniq -c | sort -k 1

But the output is:

1 1
1 2
1 4
1 5
1 6

This differs from the desired output. It appears to count only the first part of the pattern, even when the whole pattern does not match, and the nesting is not checked. Thank you for any help.

  • Could you describe the meaning of the output numbers? The left column is presumably the number of parentheses that are more than 1 layer deep - is the right column simply the current line number? Commented Aug 19, 2016 at 22:15
  • That's right @JigglyNaga, in this example the number of occurrences is the first column and the line number is the second column; although the format is not important actually. Commented Aug 19, 2016 at 22:24

1 Answer


Maybe something like:

grep -nPo '(?=(\((?:[^()]++|(?1))*\)) (?=\(VP)(?1) (?=\(NP)(?1))\(NP' mini.example |
 cut -d: -f1 | uniq -c

That is, it matches a (NP only where it is the start of a (NP *) (VP *) (NP *) sequence. The (...) parts are matched with PCRE recursive matching: the (\((?:[^()]++|(?1))*\)) group comes straight from the pcrepattern man page.
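For cross-checking the counts, here is a minimal Python sketch of the same idea. It is not the grep command above: Python's standard `re` module has no recursion, so the balanced `(...)` groups are matched with an explicit depth counter instead. The helper names `balanced_end` and `count_np_vp_np` are made up for this illustration, and `SAMPLE` just repeats the lines of mini.example. One practical difference: `grep -o` prints nothing for lines without a match, so those lines never reach `uniq -c`, whereas this sketch also reports zero counts.

```python
def balanced_end(s, i):
    """Return the index just past the ')' that closes the '(' at s[i], or -1."""
    depth = 0
    for j in range(i, len(s)):
        if s[j] == "(":
            depth += 1
        elif s[j] == ")":
            depth -= 1
            if depth == 0:
                return j + 1
    return -1  # unbalanced: no closing parenthesis found


def count_np_vp_np(line, labels=("NP", "VP", "NP")):
    """Count start positions where balanced (NP ..) (VP ..) (NP ..) groups
    follow each other, separated by exactly one space (as in the regex)."""
    count = 0
    for start in range(len(line)):
        pos = start
        for k, label in enumerate(labels):
            if k > 0:
                # Groups must be separated by a single space.
                if not line.startswith(" ", pos):
                    break
                pos += 1
            if not line.startswith("(" + label, pos):
                break
            pos = balanced_end(line, pos)
            if pos == -1:
                break
        else:
            # All three groups matched starting at this position.
            count += 1
    return count


SAMPLE = [
    "<parse> (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) </parse>",
    "<parse> (NP (NN opposition)) (XP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) </parse>",
    "<parse> (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) </parse>",
    "<parse> (NP (NN opposition)) (VP et) (NP gouvernement (NP (NN opposition)) (VP et) (NP gouvernement))  </parse>",
    "<parse> (NP (NN opposition)) (VP et) (FP gouvernement) (NP (NN opposition)) (RP et) (NP gouvernement) </parse>",
    "<parse> (NP (NN opposition)) (VP et) </parse>",
    "<parse> (VP et) (NP gouvernement) </parse>",
]

if __name__ == "__main__":
    # Print "count line-number", the same column order as the desired output.
    for lineno, line in enumerate(SAMPLE, start=1):
        print(count_np_vp_np(line), lineno)
```

On the sample lines this should give 3, 2, 2, 2, 0, 0, 0 for lines 1 to 7. On line 4 both the outer sequence and the one nested inside the last (NP ...) are counted, which is why it reports 2.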
