Parse one table out of CSV with multiple titled tables with the Python CSV module

Question

I'm looking to parse CSV files containing multiple tables using Python 3's csv module.

These complex CSVs are not unlike the toy example below. My goal is to make an idiom for picking out any one table using a known header row.

Complex CSV file toy.csv:

lists, fruits, books, forks, rope, gum
4, 2, 3, 0, 2, 2

Manhattan Produce Market
id, fruit, color
1, orange, orange
2, apple, red

Books
id, book, pages
1, Webster’s Dictionary, 1000
2, Tony the Towtruck, 20
3, The Twelfth Night, 144

Rope
id, rope, length, diameter, color
1, hemp, 12-feet, .5, green
2, sisal, 50-feet, .125, brown

Kings County Candy
id, flavor, color, big-league
1, grape, purple, yes
2, mega mango, yellow-orange, no

Each table is preceded by a title (except for a garbage table at the start). I save the previous row, and when I match the correct table header, I add the title as a new column.

import csv, re
header = [] #doesn't need to be list, but I'm thinking ahead
table = [] 

with open('toy.csv', 'r') as blob:
    reader = csv.reader(blob)
    curr = reader.__next__()
    while True: 
        prev = curr
        try:
            curr = reader.__next__()
        except StopIteration:
            break 
        if not ['id', ' book', ' pages'] == curr: 
            continue
        else:             
            header.append(prev) 
            table.append(['title'] + curr)            
            while True: 
                try:
                    curr = reader.__next__()
                    if curr == []:
                        break
                    else:
                        table.append(header[0] + curr)
                except StopIteration:
                    break

The first part is to make an idiom which I can simply repeat for each table I want to extract. Later, I will combine the tables into one super-table filling NANs where the table headers don't match.

[['title', 'id', ' book', ' pages'],
 ['Books', '1', ' Webster’s Dictionary', ' 1000'],
 ['Books', '2', ' Tony the Towtruck', ' 20'],
 ['Books', '3', ' The Twelfth Night', ' 144']]

The code is based on this Stack Overflow post.

Happy to hear your suggestions to make the code more compact, idiomatic, and fit for my goals.

Stephen Rauch · Accepted Answer · 2018-02-18 18:45:41Z

In general this code look very nice and seems to follow style guidelines well. I found your code easy to follow. You asked for:

... suggestions to make the code more compact, idiomatic, and fit for my goals.

I will start with a couple of points for compact and idiomatic, and then add something a bit bigger picture.

Only need one stop iteration:

Your main loop uses this construct twice:

try:
    ....
except StopIteration:
    break

But the net result of these is identical, and the inner one could simply be removed.

Acccessing Dunders is generally uneeded:

In several places your code has something like:

curr = reader.__next__()

The idiomatic way to do this is:

curr = next(reader)

The __next__ (dunder next) method is there because that is how an object publishes its interface for next(). And while you can access this, you probably shouldn't.

In addition your entire loop is using next(). I suggest that this likely indicates your iteration structure could use some reworking. Looking for a more natural iteration element may allow the code to be simplified.

Use natural record breaks:

Now something a little more substantive.

The data structure has a natural break between records as an empty line. By noting the presence of this empty line in the parsing logic, the parsing logic can get quite a bit simpler. In your code you need to have:

prev = curr

Which is needed to reach back a line to find the category name. But that same category was already marked via a blank line before.

I started to recast your code as an example of how to take advantage of that blank line, and found it just as easy to read in the entire file. The example below does not preserve the logic of looking for category heads of:

 ['id', ' book', ' pages']

But is was not super clear why in your requirements this was important. But adding that back into the code below, or taking the resulting data and then finding that record in said data, could also be done.

Recast Parsing Code:

def read_my_csv(filename):
    category = None
    header = None
    record_data = {}
    table = []

    def make_record():
        return [dict(zip(header, x)) for x in table]

    with open(filename, 'rU') as csv_data:
        reader = csv.reader(csv_data)

        for row in reader:
            if not row:
                if category:
                    record_data[category] = make_record()
                table = []
                category = next(reader)[0]
                header = tuple(x.strip() for x in next(reader))
            else:
                table.append([x.strip() for x in row])

    if category:
        record_data[category] = make_record()
    return record_data

Test Code:

import csv
data = read_my_csv('csvfile.csv')
for item in data.items():
    print(item[0])
    for records in item[1]:
        for record in records.items():
            print('   {}'.format(record))
        print()

Results from recast:

Manhattan Produce Market
   ('id', '1')
   ('fruit', 'orange')
   ('color', 'orange')

   ('id', '2')
   ('fruit', 'apple')
   ('color', 'red')

Books
   ('id', '1')
   ('book', 'Websterâ€™s Dictionary')
   ('pages', '1000')

   ('id', '2')
   ('book', 'Tony the Towtruck')
   ('pages', '20')

   ('id', '3')
   ('book', 'The Twelfth Night')
   ('pages', '144')

Rope
   ('id', '1')
   ('rope', 'hemp')
   ('length', '12-feet')
   ('diameter', '.5')
   ('color', 'green')

   ('id', '2')
   ('rope', 'sisal')
   ('length', '50-feet')
   ('diameter', '.125')
   ('color', 'brown')

Kings County Candy
   ('id', '1')
   ('flavor', 'grape')
   ('color', 'purple')
   ('bigleague', 'yes')

   ('id', '2')
   ('flavor', 'mega mango')
   ('color', 'yellow-orange')
   ('bigleague', 'no')

I didn't say why it was important to parse only the table matching the given header row, but that's the gig :) I really didn't want to make the post any longer. Once you run across one of these "Excel" reports spit out as a CSV, you'll have the same first reaction I did--I just want this one table! — xtian
– xtian, Commented Feb 20, 2018 at 3:01

200_success · Accepted Answer · 2018-02-19 06:43:17Z

Iteration

To fetch the next item from an iterator, you should call next(iter). The iter.__next__() method is implemented by the iterator, but that's not how you are supposed to call it.

Due to your prev = curr and the deep nesting of while True, try … except StopIteration, your iteration is clumsy. A challenge that you have is splitting the input into stanzas (delimited by an empty line). For that, I recommend itertools.takewhile().

CSV parsing

Your input is not traditional CSV. Rather, the fields are delimited by a comma followed by a nuisance space. To read that format, you should instantiate the csv.reader with a skipinitialspace=True option.

Rather than look for a stanza with headers 'id', 'book', and 'pages', I suggest parsing all of the stanzas, and using just the one that has 'Books' as the title.

Stack Exchange Network

Parse one table out of CSV with multiple titled tables with the Python CSV module

2 Answers 2

Only need one stop iteration:

Acccessing Dunders is generally uneeded:

Use natural record breaks:

Recast Parsing Code:

Test Code:

Results from recast:

Iteration

CSV parsing

Suggested solution

You must log in to answer this question.

Hot Network Questions

Parse one table out of CSV with multiple titled tables with the Python CSV module

2 Answers 2

Only need one stop iteration:

Acccessing Dunders is generally uneeded:

Use natural record breaks:

Recast Parsing Code:

Test Code:

Results from recast:

Iteration

CSV parsing

Suggested solution

You must log in to answer this question.

Related

Hot Network Questions