Splitting a CSV file with headers

Question

I have been looking for an algorithm for splitting a file into smaller pieces, which satisfy the following requirements:

The first line in the original file is a header, this header must be carried over to the resulting files
The ability to specify the approximate size to split off, for example, I want to split a file to blocks of about 200,000 characters in size.
File must be splitted at line boundaries

More info:

This is part of my web service: the user uploads a CSV file, the web service will see this CSV is a chunk of data--it does not know of any file, just the contents.
The web service then breaks the chunk of data up into several smaller pieces, each will the header (first line of the chunk). It then schedules a task for each piece of data. I don't want to process the whole chunk of data since it might take a few minutes to process all of it. I want to process them in smaller size. This is why I need to split my data.
The final code will not deal with open file for reading nor writing. I only do that to test out the code.
The task to process smaller pieces of data will deal with CSV via csv.DictReader. It does not make sense to use the csv module to break up the original chunk of data into pieces. I have done some timing and my algorithm achieves better performance as opposed to reading/writing line-by-line just to break data into pieces.

Here is an example:

If my original file is:

Header
line1
line2
line3
line4

If I want my approximate block size of 8 characters, then the above will be splitted as followed:

File1:

Header
line1
line2

File2:

Header
line3
line4

In the example above, if I start counting from the beginning of line1 (yes, I want to exclude the header from the counting), then the first file should be:

Header
line1
li

But, since I want the splitting done at line boundary, I included the rest of line2 in File1.

Here is what I have so far. I am looking to turn this code segment into a procedure, but more importantly, I want the code to speed up a bit. Currently, it takes about 1 second to finish. In the final solution, In addition, I am going to do the following:

Change the procedure to return a generator, which returns blocks of data. I don't really need to write them into files. I think I know how to do this, but any comment or suggestion is welcome.
Since my data is in Unicode (Vietnamese text), I have to deal with .encode(), .decode(). Any hint to speed this up would be great.

with open('data.csv') as f:
    buf = f.read().decode('utf-8')

header, chunk = buf.split('\n', 1)

block_size = 200000
block_start = 0
counter = 0
while True:
    counter += 1
    filename = 'part_%03d.txt' % counter
    block_end = chunk.find('\n', block_start + block_size)
    print filename, block_start, block_end

    with open(filename, 'w') as f:
        f.write(header + '\n')
        if block_end == -1:
            f.write(chunk[block_start:].encode('utf-8'))
        else:
            f.write(chunk[block_start:block_end].encode('utf-8'))
        f.write('\n')

    block_start = block_end + 1
    if block_end == -1: break

But » it takes about 1 second to finish« I wouldn't call it that slow. — Thomas Junk
– Thomas Junk, Commented Oct 24, 2013 at 19:44
I agreed, but in context of a web app, a second might be slow. — Hai Vu
– Hai Vu, Commented Oct 24, 2013 at 19:54
On the other hand, there is not much going on in your programm. I think it would be hard to optimize more for performance, though some refactoring for "clean" code would be nice ;) My guess is, that the the I/O-Part is the real bottleneck. Have you done any python profiling yet for hot-spots? — Thomas Junk
– Thomas Junk, Commented Oct 24, 2013 at 20:04
Thank you so much for this code! It worked like magic for me after searching in lots of places. However, rather than setting the chunk size, I want to split into multiple files based on a column value. When I tried doing it in other ways, the header(which is in row 0) would not appear in the resulting files. I want to see the header in all the files generated like yours. Can you help me out by telling how can I divide into chunks based on a column? — Nymeria123
– Nymeria123, Commented Feb 25, 2017 at 7:33
@Nymeria123 Please post a new question (instead of commenting) and put a link back to this one. That way, you will get help quicker. — Hai Vu
– Hai Vu, Commented Mar 6, 2017 at 17:37

Gareth Rees · Accepted Answer · 2013-10-24 20:29:32Z

1. Comments on your code

There's no documentation. What does your program do and how should it be used?
Your program is not split up into functions. This makes it hard to test it from the interactive interpreter.
The name data.csv seems arbitrary. Surely this should be an argument to the program? Similarly for 'part_%03d.txt' and block_size = 200000.
You read the whole of the input file into memory and then use .decode, .split and so on. It would be more efficient to read and write one line at a time, thus using no more memory than is needed to store the longest line in the input.
In Python, a file object is also an iterator that yields the lines of the file. It's often simplest to process the lines in a file using for line in file: loop or line = next(file).
When you have an infinite loop incrementing a counter, like this:
```
counter = 0
while True:
    counter += 1
```
consider using itertools.count, like this:
```
for counter in itertools.count(1):
```
(But you'll see below that it's actually more convenient here to use enumerate.)
Since your data is encoded in UTF-8, and only character you actually look for in the input is the newline character (\n), there is no need for you to decode the input or encode the output: you could work with bytes throughout. (But if you insist on decoding the input and encoding the output, then you should pass the encoding='utf-8' keyword argument to open.)

2. Revised code

Here's how I'd implement your program. The only tricky aspect to it is that I have two different loops that iterate over the same iterator f (the first one using enumerate, the second one using itertools.chain). This works because the iterator state is stored in the object f, so the two loops can independently fetch lines from the iterator and the lines still come out in the right order.

from itertools import chain

def split_file(filename, pattern, size):
    """Split a file into multiple output files.

    The first line read from 'filename' is a header line that is copied to
    every output file. The remaining lines are split into blocks of at
    least 'size' characters and written to output files whose names
    are pattern.format(1), pattern.format(2), and so on. The last
    output file may be short.

    """
    with open(filename, 'rb') as f:
        header = next(f)
        for index, line in enumerate(f, start=1):
            with open(pattern.format(index), 'wb') as out:
                out.write(header)
                n = 0
                for line in chain([line], f):
                    out.write(line)
                    n += len(line)
                    if n >= size:
                        break

if __name__ == '__main__':
    split_file('data.csv', 'part_{0:03d}.txt', 200000)

3. Generating the output blocks instead

Do you really want to process a CSV file in chunks? I can't imagine why you would want to do that. Surely either you have to process the whole file at once, or else you can process it one line at a time?

Processing a file one line at a time is easy:

with open(filename, encoding='utf-8') as f:
    header = next(f)
    for line in f:
        # process line

But given that this is apparently a CSV file, why not use Python's built-in csv module to do whatever you need to do? Like this:

import csv
with open(filename, encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        # row[0] is the first field in this row, and so on.

I think there is an error in this code upgrade. n += len(line) should be n += 1 — metersk
– metersk, Commented May 10, 2018 at 17:51
Ah! Maybe I should read the OP next time ! Regardless, this was very helpful for splitting on lines with my tiny change. Thanks! — metersk
– metersk, Commented May 10, 2018 at 18:12

Stack Exchange Network

Splitting a CSV file with headers

1 Answer 1

1. Comments on your code

2. Revised code

3. Generating the output blocks instead

You must log in to answer this question.

Hot Network Questions

Splitting a CSV file with headers

1 Answer 1

1. Comments on your code

2. Revised code

3. Generating the output blocks instead

You must log in to answer this question.

Related

Hot Network Questions