Count number of occurrences of a parentheses regex

Question

I'm trying to count the occurrences of a regex that contains recursive (nested) parenthesized expressions. In my particular case I want to count, per line or per file, the occurrences of (NP *) (VP *) (NP *). My example file contains the following (line 4 has a recursive case):

$ more mini.example 
    <parse> (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) </parse>
    <parse> (NP (NN opposition)) (XP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) </parse>
    <parse> (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) </parse>
    <parse> (NP (NN opposition)) (VP et) (NP gouvernement (NP (NN opposition)) (VP et) (NP gouvernement))  </parse>
    <parse> (NP (NN opposition)) (VP et) (FP gouvernement) (NP (NN opposition)) (RP et) (NP gouvernement) </parse>
    <parse> (NP (NN opposition)) (VP et) </parse>
    <parse> (VP et) (NP gouvernement) </parse>

I would like to have an output like this:

I tried this:

$ grep -Pon '(?<=\(NP ).*(?=\).*(?<=\(VP ).*(?=\).*(?<=\(NP ).*(?=\))))' mini.example | cut -d : -f 1 | uniq -c | sort -k 1

But the output is:

This differs from the desired output: it counts only the first part of the pattern, even when the whole pattern does not match, and it cannot check the nesting. Thank you for any help.

Could you describe the meaning of the output numbers? The left column is presumably the number of parentheses that are more than 1 layer deep. Is the right column simply the current line number? – JigglyNaga, Commented Aug 19, 2016 at 22:15
That's right @JigglyNaga, in this example the number of occurrences is the first column and the line number is the second column, although the format is not actually important. – Nacho, Commented Aug 19, 2016 at 22:24

Stéphane Chazelas · Accepted Answer · 2016-08-19 22:11:58Z

Maybe something like:

grep -nPo '(?=(\((?:[^()]++|(?1))*\)) (?=\(VP)(?1) (?=\(NP)(?1))\(NP' mini.example |
 cut -d: -f1 | uniq -c

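That is, it matches a (NP only where that (NP is the start of a (NP *) (VP *) (NP *) sequence. The lookahead uses PCRE recursive matching for each (...) part; the (\((?:[^()]++|(?1))*\)) subpattern comes straight from the pcrepattern man page. Because only the leading (NP is consumed, a sequence nested inside another one (as on line 4) is counted as well. With -n and -o, grep prints each match on its own line prefixed with the line number, so cut and uniq -c reduce that to one count per line; lines without any match produce no output.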
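
To see what the recursive subpattern does in isolation, here is a minimal sketch, assuming GNU grep built with PCRE support (-P). The function names list_groups and count_total are hypothetical helpers, not part of the answer; count_total applies the same pattern to get one total per file rather than a count per line.

```shell
# Hypothetical helper: print every outermost balanced (...) group read from
# stdin, one per line. Group 1 calls itself through (?1) for nested groups.
list_groups() {
  grep -Po '(\((?:[^()]++|(?1))*\))'
}

# Hypothetical helper: total number of (NP *) (VP *) (NP *) occurrences in
# the file named by $1. grep -o prints one line per match, so count lines.
count_total() {
  grep -Po '(?=(\((?:[^()]++|(?1))*\)) (?=\(VP)(?1) (?=\(NP)(?1))\(NP' "$1" |
    wc -l | tr -d ' '
}

printf '%s\n' '(NP (NN opposition)) (VP et) (NP gouvernement)' | list_groups
# (NP (NN opposition))
# (VP et)
# (NP gouvernement)
```

The nested (NN opposition) is not listed separately because it is consumed as part of the enclosing group. Running count_total mini.example would print a single number for the whole file.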

edited Aug 19, 2016 at 22:11

answered Aug 19, 2016 at 22:05

Stéphane Chazelas
