\$\begingroup\$

I often need to parse tab-separated text (usually from a huge file) into records. I wrote a generator to do that for me; is there anything that could be improved in it, in terms of performance, extensibility or generality?

def table_parser(in_stream, types = None, sep = '\t', endl = '\n', comment = None):
  header = next(in_stream).rstrip(endl).split(sep)
  for lineno, line in enumerate(in_stream):
    if line == endl:
      continue # ignore blank lines
    if line[0] == comment:
      continue # ignore comments
    fields = line.rstrip(endl).split(sep)
    try:
      # could have done this outside the loop instead:
      # if types is None: types = {c : (lambda x : x) for c in header}
      # but it nearly doubles the run-time if types actually is None
      if types is None:
        record = {col : fields[no] for no, col in enumerate(header)}
      else:
        record = {col : types[col](fields[no]) for no, col in enumerate(header)}
    except IndexError:
      print('Insufficient columns in line #{}:\n{}'.format(lineno, line))
      raise
    yield record
\$\endgroup\$
  • \$\begingroup\$ @RikPoggi: I asked moderator to move it there. Thank you \$\endgroup\$ Commented Feb 2, 2012 at 9:56

2 Answers

\$\begingroup\$

One thing you could try, to reduce the amount of code in the loop, is to define a function once, outside the loop, that replaces these lines:

  if types is None:
    record = {col : fields[no] for no, col in enumerate(header)}
  else:
    record = {col : types[col](fields[no]) for no, col in enumerate(header)}

Something like this (not tested, but you should get the idea):

def table_parser(in_stream, types = None, sep = '\t', endl = '\n', comment = None):
  header = next(in_stream).rstrip(endl).split(sep)
  # No need to do this every time; stored as a list so it can be reused for every line
  enumheader = list(enumerate(header))
  # Choose the record builder once, instead of testing types on every line
  if types is None:
    def recorder(fields):
      return {col : fields[no] for no, col in enumheader}
  else:
    def recorder(fields):
      return {col : types[col](fields[no]) for no, col in enumheader}

  for lineno, line in enumerate(in_stream):
    if line == endl:
      continue # ignore blank lines
    if line[0] == comment:
      continue # ignore comments
    fields = line.rstrip(endl).split(sep)
    try:
      record = recorder(fields)
    except IndexError:
      print('Insufficient columns in line #{}:\n{}'.format(lineno, line))
      raise
    yield record
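
A caveat when hoisting enumerate(header) out of the loop: enumerate returns a one-shot iterator, so it has to be materialized (for example with list) before it can be reused for each line. A minimal sketch of the difference, using a made-up header that is not from the original post:

```python
header = ['name', 'age']  # hypothetical header, for illustration only

one_shot = enumerate(header)          # an iterator: consumed by the first pass
first = {col: no for no, col in one_shot}
second = {col: no for no, col in one_shot}   # the iterator is already exhausted

reusable = list(enumerate(header))    # a list: can be iterated any number of times
third = {col: no for no, col in reusable}
fourth = {col: no for no, col in reusable}

print(first, second)     # {'name': 0, 'age': 1} {}
print(third == fourth)   # True
```

With the bare iterator, every record after the first would come out empty without raising any error, which is easy to miss.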

EDIT: the following is from my first version (see the comments):

Tiny thing:

    if types is None:

I suggest

    if not types:
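
For reference, the two tests only differ when types is falsy without being None, such as an empty dict. A small sketch of that difference; the helper names are hypothetical and exist only for this illustration:

```python
def uses_is_none(types=None):
    # True only when types is None (nothing passed, or None passed explicitly)
    return types is None

def uses_truthiness(types=None):
    # True for None, but also for {}, [], 0, '' and any other falsy value
    return not types

print(uses_is_none(None), uses_truthiness(None))   # True True
print(uses_is_none({}), uses_truthiness({}))       # False True
```

In table_parser, passing types={} would therefore take the "no conversion" branch with if not types, but would reach types[col] and raise a KeyError with if types is None.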
\$\endgroup\$
  • \$\begingroup\$ In many cases I think if types is None would be preferable. I think I see why you disagree in this case, but I'm curious... could you say more? \$\endgroup\$ Commented Feb 2, 2012 at 9:48
  • \$\begingroup\$ Generally, don't test types unless it's clearly required (duck typing). In this case I can't come up with a specific benefit; perhaps some other type that evaluates to false in a boolean context? Also, is tests identity, so it would be false even for some other None-like object that compares equal to None. Completely academic, yes. \$\endgroup\$ Commented Feb 2, 2012 at 9:57
  • \$\begingroup\$ Well, PEP 8 says "Comparisons to singletons like None should always be done with 'is' or 'is not', never the equality operators." In this particular case, using is makes it possible to distinguish between cases where [] was passed and cases where no variable was passed. \$\endgroup\$ Commented Feb 2, 2012 at 10:03
  • \$\begingroup\$ All right that makes sense. From your link: A Foolish Consistency is the Hobgoblin of Little Minds \$\endgroup\$ Commented Feb 2, 2012 at 10:08
\$\begingroup\$

You could also use the csv module to iterate over your file. Your code would likely be faster, because the module's parser is implemented in C, and cleaner without line.rstrip(endl).split(sep).
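
A minimal sketch of what that might look like. The name table_parser_csv is hypothetical, the sample data is made up, and the behaviour is not identical to the original: csv.reader also interprets double quotes, and this version raises ValueError rather than IndexError on short rows.

```python
import csv
import io

def table_parser_csv(in_stream, types=None, sep='\t', comment=None):
    """Hypothetical csv-based variant of table_parser (a sketch only)."""
    reader = csv.reader(in_stream, delimiter=sep)
    header = next(reader)
    for row in reader:
        if not row:
            continue  # csv.reader yields an empty list for blank lines
        if comment is not None and row[0].startswith(comment):
            continue  # ignore comment lines
        if len(row) < len(header):
            raise ValueError('Insufficient columns in line #{}: {!r}'.format(
                reader.line_num, row))
        if types is None:
            yield dict(zip(header, row))
        else:
            yield {col: types[col](value) for col, value in zip(header, row)}

# Made-up input: a header, a comment, a blank line and two records
sample = io.StringIO('name\tage\n# a comment\n\nalice\t30\nbob\t25\n')
records = list(table_parser_csv(sample, types={'name': str, 'age': int}, comment='#'))
print(records)  # [{'name': 'alice', 'age': 30}, {'name': 'bob', 'age': 25}]
```

When reading from a real file, the csv documentation recommends opening it with newline='' so the reader can handle line endings itself.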

\$\endgroup\$
