Parsing indented text and inserting its data into a tree

Question

My goal is to parse indented text in the style of python and YAML.

This only find the parent of each line.

This bit of code seems to do the trick, but I'm not really satisfied and I wanted to know if you would do this another way.

raw = """animal
    carnivorous
        tiger
        lion
    vegetarian
        cow
        sheep
plant
    algea
    tree
        leaf
        pine
    fungus
        good
        bad
            evil
            mean
    cactus
        big
        small"""

lines = raw.split('\n')
indents = [(0,0,'root')]
for a in raw.split('\n'):
    indent = 0
    while a[indent] == ' ': indent+=1
    if indent % 4:
        print("not multiple of 4")
        break
    indents.append((len(indents), int(indent/4)+1,a.replace('  ','')))
for a in indents: print(a)

stack=[indents[0]]
entries =[indents[0]]

prev_indent = 0
for item in indents[1:]:
    print("#########################")
    id, indent, name = item
    diff = indent - prev_indent
    print(item)
    print("diff",diff, [a[2] for a in stack])

    if diff>0:
        entries.append(item+(stack[-1][2],))
    elif diff<0:
        # entries.append(item+(stack[-diff][2],))
        count = -diff
        while count>-1: stack.pop();count-=1
        entries.append(item+(stack[-1][2],))
    elif diff==0:
        stack.pop()
        entries.append(item+(stack[-1][2],))
    stack.append(item)
    prev_indent = entries[-1][1]

    print("result", entries[-1])
print("########################")


for a in entries:
    if len (a) == 3: continue
    ident, level, name, parent = a
    print(level*' '*4, name, '(', parent, ')')

This results in this (the name in parenthesis is the parent):

animal ( root )
    carnivorous ( animal )
        tiger ( carnivorous )
        lion ( carnivorous )
    vegetarian ( animal )
        cow ( vegetarian )
        sheep ( vegetarian )
plant ( root )
    algea ( plant )
    tree ( plant )
        leaf ( tree )
        pine ( tree )
    fungus ( plant )
        good ( fungus )
        bad ( fungus )
            evil ( bad )
            mean ( bad )
    cactus ( plant )
        big ( cactus )
        small ( cactus )

Is there a way to change this pattern re.compile(r'^(?P<indent>(?: {4})*)(?P<name>\S.*)') you suggest to get rid of empty lines too? I am using it to read a text file and this file has empty lines causing error in the code you sugest. — jorgealmerio
– jorgealmerio, Commented Apr 12, 2020 at 17:22

200_success · Accepted Answer · 2017-09-23 20:22:38Z

You should be able to accomplish this task by making one linear pass through the lines, instead of making one pass to build indents and a second pass to build entries.

It's good practice to package your code into functions, and to write docstrings for them. In particular, if you have code that follows the pattern

outputs = []
for item in inputs:
    outputs.append(…)

… then consider writing a generator instead.

When you detect indentation that is not a multiple of 4 spaces, you print a message and stop building indents, but other than that, you still allow the program to proceed normally. The program should probably abort at that point, and I suggest doing so by raising an exception. Furthermore, I consider indentation that is suddenly excessively deep (e.g. going from 1 level to 3 levels of indentation) to be another kind of error that should be detected.

I don't like the way you handle the special case of the root node. In particular, having a non-uniform tuple length is asking for trouble — it is basically data that is not of the same type. I would avoid making the root node part of the data structure altogether, so that you don't have to write an exclusion for this special case:

for a in entries:
    if len (a) == 3: continue

Analyzing the text one character at a time (using while a[indent] == ' ': indent+=1) feels tedious. I suggest using regular expressions to describe what kind of text you are expecting. For example,

re.compile(r'^(?P<indent>(?: {4})*)(?P<name>\S.*)')

… says that you are looking for indentation at the beginning of the line that is a multiple of four spaces, followed by a name that starts with a non-space character.

Suggested solution

import re

def parse_tree(lines):
    """
    Parse an indented outline into (level, name, parent) tuples.  Each level
    of indentation is 4 spaces.
    """
    regex = re.compile(r'^(?P<indent>(?: {4})*)(?P<name>\S.*)')
    stack = []
    for line in lines:
        match = regex.match(line)
        if not match:
            raise ValueError(
                'Indentation not a multiple of 4 spaces: "{0}"'.format(line)
            )
        level = len(match.group('indent')) // 4
        if level > len(stack):
            raise ValueError('Indentation too deep: "{0}"'.format(line))
        stack[level:] = [match.group('name')]
        yield level, match.group('name'), (stack[level - 1] if level else None)

raw = """…"""    
for level, name, parent in parse_tree(raw.split('\n')):
    print('{0}{1} ( {2} )'.format(' ' * (4 * level), name, parent or 'root'))

Thanks. Please, could you comment or expand on the 2 last line in parse_tree() ? — jokoon
– jokoon, Commented Sep 23, 2017 at 21:20
The penultimate line pops items from the stack as necessary, then pushes one name onto the stack. The last line of code produces the (level, name, parent) tuple result for that line of text. — 200_success
– 200_success, Commented Sep 23, 2017 at 21:23

Stack Exchange Network

Parsing indented text and inserting its data into a tree

1 Answer 1

Suggested solution

You must log in to answer this question.

Hot Network Questions

Parsing indented text and inserting its data into a tree

1 Answer 1

Suggested solution

You must log in to answer this question.

Related

Hot Network Questions