Text parser implemented as a generator

Question

I often need to parse tab-separated text (usually from a huge file) into records. I wrote a generator to do that for me; is there anything that could be improved in it, in terms of performance, extensibility or generality?

def table_parser(in_stream, types = None, sep = '\t', endl = '\n', comment = None):
  header = next(in_stream).rstrip(endl).split(sep)
  for lineno, line in enumerate(in_stream):
    if line == endl:
      continue # ignore blank lines
    if line[0] == comment:
      continue # ignore comments
    fields = line.rstrip(endl).split(sep)
    try:
      # could have done this outside the loop instead:
      # if types is None: types = {c : (lambda x : x) for c in header}
      # but it nearly doubles the run-time if types actually is None
      if types is None:
        record = {col : fields[no] for no, col in enumerate(header)}
      else:
        record = {col : types[col](fields[no]) for no, col in enumerate(header)}
    except IndexError:
      print('Insufficient columns in line #{}:\n{}'.format(lineno, line))
      raise
    yield record

@RikPoggi: I asked a moderator to move it there. Thank you.
– max, Commented Feb 2, 2012 at 9:56

Johan Lundberg · Accepted Answer · 2012-02-02 09:44:19Z

1

One thing you could try, to reduce the amount of work done inside the loop, is to define a function once, before the loop, for this part:

  if types is None:
    record = {col : fields[no] for no, col in enumerate(header)}
  else:
    record = {col : types[col](fields[no]) for no, col in enumerate(header)}

Something like this (not tested, but you should get the idea):

def table_parser(in_stream, types = None, sep = '\t', endl = '\n', comment = None):
  header = next(in_stream).rstrip(endl).split(sep)
  # No need to do this every time. enumerate() is a one-shot iterator,
  # so keep the (index, column) pairs in a list to reuse them for every line.
  enumheader = list(enumerate(header))
  # Choose the record builder once, outside the loop
  if types is None:
    def recorder(fields):
      return {col : fields[no] for no, col in enumheader}
  else:
    def recorder(fields):
      return {col : types[col](fields[no]) for no, col in enumheader}

  for lineno, line in enumerate(in_stream):
    if line == endl:
      continue # ignore blank lines
    if line[0] == comment:
      continue # ignore comments
    fields = line.rstrip(endl).split(sep)
    try:
      record = recorder(fields)
    except IndexError:
      print('Insufficient columns in line #{}:\n{}'.format(lineno, line))
      raise
    yield record
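One detail to keep in mind when hoisting enumerate(header) out of the loop: enumerate returns an iterator that can be consumed only once, so the pairs have to be stored (for example in a list) before they can be reused for every line. A minimal sketch of the difference, using made-up sample data:

```python
header = ['name', 'age']

# enumerate() returns an iterator that can be consumed only once
pairs_iter = enumerate(header)
first_pass = [(no, col) for no, col in pairs_iter]
second_pass = [(no, col) for no, col in pairs_iter]  # iterator is already exhausted

# storing the pairs in a list makes them reusable for every line
pairs_list = list(enumerate(header))
rows = [['alice', '30'], ['bob', '25']]  # sample rows, for illustration only
records = [{col: fields[no] for no, col in pairs_list} for fields in rows]
```

Here second_pass comes out empty, while the list-based version builds a complete record for each row.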

EDIT: the following is from my first version (see the comments):

Tiny thing:

    if types is None:

I suggest

    if not types:


In many cases I think if types is None would be preferable. I think I see why you disagree in this case, but I'm curious... could you say more?
– senderle, Commented Feb 2, 2012 at 9:48

Generally, don't test for a specific type unless it's clearly required (duck typing). In this case I can't come up with a specific benefit, other than allowing some other value that evaluates to false in a boolean context. Also, is tests identity, and would be false even if some other None-like object were equal to None. Completely academic, yes.
– Johan Lundberg, Commented Feb 2, 2012 at 9:57

Well, PEP 8 says "Comparisons to singletons like None should always be done with 'is' or 'is not', never the equality operators." In this particular case, using is makes it possible to distinguish between cases where [] was passed and cases where no argument was passed.
– senderle, Commented Feb 2, 2012 at 10:03

All right, that makes sense. From your link: A Foolish Consistency is the Hobgoblin of Little Minds.
– Johan Lundberg, Commented Feb 2, 2012 at 10:08
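The distinction discussed in these comments can be sketched as follows. The helper describe_types is hypothetical and exists only to show how the two tests differ when an empty container is passed:

```python
def describe_types(types=None):
    # hypothetical helper, for illustration only
    if types is None:
        return 'no types given'        # the argument was omitted
    if not types:
        return 'empty mapping given'   # e.g. {} or []: falsy, but not None
    return 'types given'

omitted = describe_types()
empty = describe_types({})
given = describe_types({'age': int})
```

With if not types alone, the first two cases would be indistinguishable. With if types is None, an explicitly passed empty container is still treated as "something was passed".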


San4ez · Accepted Answer · 2012-02-02 19:06:04Z

2

You could also use the csv module to iterate over your file. Your code would be faster thanks to its C implementation, and cleaner without line.rstrip(endl).split(sep).
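A rough sketch of what a csv-based variant might look like. The name table_parser_csv is made up for this example, and the details are assumptions rather than the original author's code: it raises ValueError instead of IndexError on short lines, it treats any line starting with the comment string as a comment, and it uses csv.QUOTE_NONE so that quote characters stay literal, as they do with str.split. Whether it is actually faster on a given file is worth measuring.

```python
import csv
import io

def table_parser_csv(in_stream, types=None, sep='\t', comment=None):
    # hypothetical csv-based variant of table_parser (illustration only)
    # QUOTE_NONE keeps quote characters as literal text, like str.split does
    reader = csv.reader(in_stream, delimiter=sep, quoting=csv.QUOTE_NONE)
    header = next(reader)
    for fields in reader:
        if not fields:
            continue  # csv.reader yields an empty list for blank lines
        if comment is not None and fields[0].startswith(comment):
            continue  # ignore comment lines
        if len(fields) < len(header):
            raise ValueError('Insufficient columns in line #{}: {}'.format(
                reader.line_num, fields))
        if types is None:
            yield dict(zip(header, fields))
        else:
            yield {col: types[col](val) for col, val in zip(header, fields)}

# usage with an in-memory stream standing in for a real file
sample = io.StringIO('name\tage\n# a comment\n\nalice\t30\nbob\t25\n')
records = list(table_parser_csv(sample, types={'name': str, 'age': int}, comment='#'))
```

When reading from a real file, the csv documentation recommends opening it with newline='' so the reader can handle line endings itself.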

