I have a string like this:

s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"

I'm trying to extract the text inside each top-level pair of parentheses, so that the result is:

['test1 or (test2 or test3)', 'test4 and (test6)', 'test7 or test8']

I have tried the following:

result = re.search('%s(.*)%s' % ("(", ")"), s).group(1)
result =(s[s.find("(")+1 : s.find(")")])
result = re.search('((.*))', s)
  • if you change it a bit, it will certainly work. Commented May 16, 2019 at 13:49
  • The re module doesn't support nesting, so it's not the right tool here. Commented May 16, 2019 at 13:52
  • Since or and and are also Python keywords, have a look at ast. Commented May 16, 2019 at 13:53
  • Doesn't look like you're escaping the parentheses, which you might have to do. Does r'\((.*)\).*' work when you do re.findall() with it? Commented May 16, 2019 at 13:54
  • Using re.findall() with r'\((.*)\).*' returns the string between the first and the last parenthesis. Commented May 16, 2019 at 14:00
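
To illustrate the behaviour described in the last two comments, here is a minimal sketch comparing a greedy and a non-greedy pattern on the string from the question. The variable names `greedy` and `lazy` are only for illustration. Neither pattern tracks nesting, so neither yields the desired groups.

```python
import re

s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"

# Greedy: .* runs to the LAST ')' in the string, so there is one big match.
greedy = re.findall(r'\((.*)\)', s)

# Non-greedy: .*? stops at the FIRST ')' after each '(', which cuts
# nested groups short.
lazy = re.findall(r'\((.*?)\)', s)

print(greedy)
# ['test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8']
print(lazy)
# ['test1 or (test2 or test3', 'test4 and (test6', 'test7 or test8']
```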

2 Answers

You have nested parentheses. That calls for a parser or, if you don't want to go that far, a back-to-basics scan: go through the string character by character, track the nesting level, and close a group each time the level returns to 0.

Then a small hack strips whatever precedes each group (the leading and token, if any).

Here is the code I wrote for this. It's not short, but it's not very complex either, and it's self-contained with no extra libraries:

s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"

nesting_level = 0
previous_group_index = 0

def rework_group(group):
    # not the brightest function, but it works; it may need tuning
    # this isn't the core of the algorithm, just simple string operations
    # look for the first opening parenthesis and remove what's before it
    idx = group.find("(")
    if idx!=-1:
        group = group[idx:]
    else:
        # no parentheses: split according to blanks, keep last item
        group = group.split()[-1]
    return group

result = []

for i,c in enumerate(s):
    if c=='(':
        nesting_level += 1
    elif c==')':
        nesting_level -= 1
        if nesting_level == 0:
            result.append(rework_group(s[previous_group_index:i+1]))
            previous_group_index = i+1

result.append(rework_group(s[previous_group_index:]))

result:

>>> result
['(test1 or (test2 or test3))',
 '(test4 and (test6))',
 '(test7 or test8)',
 'test9']
>>> 
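
The output above keeps the outer parentheses and also includes the trailing test9, whereas the question asks only for the contents of the parenthesized groups. A minimal variant of the same nesting-level idea that returns just those contents might look like this. The function name `top_level_groups` is hypothetical, and raising `ValueError` on unbalanced input is an assumption about the desired behaviour.

```python
def top_level_groups(text):
    """Return the contents of each top-level (...) group in text."""
    groups = []
    depth = 0
    start = None  # index just after the '(' that opened the current group
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unmatched ')' at index {i}")
            if depth == 0:
                # back at nesting level 0: the group is complete
                groups.append(text[start:i])
    if depth != 0:
        raise ValueError("unmatched '('")
    return groups


s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"
print(top_level_groups(s))
# ['test1 or (test2 or test3)', 'test4 and (test6)', 'test7 or test8']
```

Text outside any parentheses (here, test9) is simply ignored, so no post-processing of the groups is needed.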

If you did want to make a rough parser for this, it would look something like this.

This uses the scanner method of pattern objects to iterate through the tokens, and appends an element to the list whenever the nesting level is back at 0. The level is tracked through the left and right brackets encountered.

import re

# Token specification
TEST = r'(?P<TEST>test[0-9]*)'
LEFT_BRACKET = r'(?P<LEFT_BRACKET>\()'
RIGHT_BRACKET = r'(?P<RIGHT_BRACKET>\))'
AND = r'(?P<AND> and )'
OR = r'(?P<OR> or )'

master_pat = re.compile('|'.join([TEST, LEFT_BRACKET, RIGHT_BRACKET, AND, OR]))

s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"

def generate_list(pat, text):
    ans = []
    elem = ''
    level = 0
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        # print(m.lastgroup, m.group(), level)
        # keep building elem if nested, or if the token is not one to skip at level 0 or 1
        if (level > 1 or
          (level == 1 and m.lastgroup != 'RIGHT_BRACKET') or
          (level == 0 and m.lastgroup not in ['LEFT_BRACKET', 'AND'])
        ):
            elem += m.group()
        # if at level 0 we can append
        if level == 0 and elem != '':
            ans.append(elem)
            elem = ''
        # set level
        if m.lastgroup == 'LEFT_BRACKET':
            level += 1
        elif m.lastgroup == 'RIGHT_BRACKET':
            level -= 1
    return ans


generate_list(master_pat, s)
# ['test1 or (test2 or test3)', 'test4 and (test6)', 'test7 or test8', 'test9']

To see how scanner behaves:

master_pat = re.compile('|'.join([TEST, LEFT_BRACKET, RIGHT_BRACKET, AND, OR]))
s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"

scanner = master_pat.scanner(s)
scanner.match()
# <re.Match object; span=(0, 1), match='('>
_.lastgroup, _.group()
# ('LEFT_BRACKET', '(')
scanner.match()
# <re.Match object; span=(1, 6), match='test1'>
_.lastgroup, _.group()
# ('TEST', 'test1')
scanner.match()
# <re.Match object; span=(6, 10), match=' or '>
_.lastgroup, _.group()
# ('OR', ' or ')
scanner.match()
# <re.Match object; span=(10, 11), match='('>
_.lastgroup, _.group()
# ('LEFT_BRACKET', '(')
scanner.match()
# <re.Match object; span=(11, 16), match='test2'>
_.lastgroup, _.group()
# ('TEST', 'test2')
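
As far as I can tell, the scanner method is not described in the re module documentation. A rough equivalent can be sketched with the documented `Pattern.match(string, pos)` call instead. The names `TOKEN_PAT` and `tokenize` are hypothetical, and raising an error on an unrecognised character is an assumption (the scanner loop above simply stops at that point).

```python
import re

# Same token specification as above, combined into one pattern.
TOKEN_PAT = re.compile(
    r"(?P<TEST>test[0-9]*)"
    r"|(?P<LEFT_BRACKET>\()"
    r"|(?P<RIGHT_BRACKET>\))"
    r"|(?P<AND> and )"
    r"|(?P<OR> or )"
)


def tokenize(text):
    """Yield (token_name, token_text) pairs, matching from left to right."""
    pos = 0
    while pos < len(text):
        m = TOKEN_PAT.match(text, pos)  # anchored at pos, like scanner.match()
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at index {pos}")
        yield m.lastgroup, m.group()
        pos = m.end()


tokens = list(tokenize("(test1 or (test2 or test3))"))
print(tokens[:5])
# [('LEFT_BRACKET', '('), ('TEST', 'test1'), ('OR', ' or '), ('LEFT_BRACKET', '('), ('TEST', 'test2')]
```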
