I have been looking for an algorithm for splitting a file into smaller pieces that satisfies the following requirements:
- The first line of the original file is a header; this header must be carried over to each resulting file.
- I can specify the approximate block size; for example, I want to split a file into blocks of about 200,000 characters.
- The file must be split at line boundaries.
More info:
- This is part of my web service: the user uploads a CSV file, and the web service sees this CSV as a chunk of data. It does not know of any file, just the contents.
- The web service then breaks the chunk of data up into several smaller pieces, each with the header (the first line of the chunk). It then schedules a task for each piece. I don't want to process the whole chunk at once, since that might take a few minutes, so I want to process it in smaller pieces. This is why I need to split my data.
- The final code will not open files for reading or writing. I only do that here to test the code.
- The task that processes each smaller piece will parse the CSV via csv.DictReader. It does not make sense to use the csv module just to break the original chunk into pieces. I have done some timing, and my algorithm performs better than reading and writing line by line just to break the data into pieces.
Here is an example:
If my original file is:
Header
line1
line2
line3
line4
If I want an approximate block size of 8 characters, then the above will be split as follows:
File1:
Header
line1
line2
File2:
Header
line3
line4
In the example above, if I start counting from the beginning of line1 (yes, I want to exclude the header from the counting), then the first file should be:
Header
line1
li
But since I want the splitting done at a line boundary, I included the rest of line2 in File1.
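The boundary rule in this example can be checked directly with str.find: look for the first newline at or after the 8-character mark and cut there. This is only a small sketch; it assumes each entry of the sample sits on its own line, and the variable names are just for illustration.

```python
# Body of the sample data, without the header line.
chunk = "line1\nline2\nline3\nline4\n"
block_size = 8

# Character 8 falls inside line2, so search for the next newline from there.
block_end = chunk.find("\n", block_size)

first_block = chunk[:block_end]     # everything up to the end of line2
rest = chunk[block_end + 1:]        # what is left for the next block
```

Here block_end lands on the newline that ends line2, so the first block holds line1 and line2 and the remainder starts at line3.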
Here is what I have so far. I am looking to turn this code segment into a procedure, but more importantly, I want to speed the code up a bit. Currently, it takes about 1 second to finish. In the final solution, I am also going to do the following:
- Change the procedure to return a generator that yields blocks of data. I don't really need to write them to files. I think I know how to do this, but any comment or suggestion is welcome.
- Since my data is in Unicode (Vietnamese text), I have to deal with .encode() and .decode(). Any hint to speed this up would be great.
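# Python 2: read the raw bytes and decode them to unicode once.
with open('data.csv') as f:
    buf = f.read().decode('utf-8')
# Separate the header (first line) from the rest of the data.
header, chunk = buf.split('\n', 1)
block_size = 200000
block_start = 0
counter = 0
while True:
    counter += 1
    filename = 'part_%03d.txt' % counter
    # First newline at or after the block-size mark; -1 if there is none.
    block_end = chunk.find('\n', block_start + block_size)
    print filename, block_start, block_end
    with open(filename, 'w') as f:
        # header is unicode, so encode it like the body to avoid an
        # implicit ASCII encode when it contains non-ASCII characters.
        f.write((header + '\n').encode('utf-8'))
        if block_end == -1:
            f.write(chunk[block_start:].encode('utf-8'))
        else:
            f.write(chunk[block_start:block_end].encode('utf-8'))
        f.write('\n')
    block_start = block_end + 1
    if block_end == -1: break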
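
One possible way to turn the loop above into a generator is sketched below. The name split_blocks and its parameters are hypothetical, not part of the original code. It searches the data in place instead of first copying the body out with split, and it accepts either text or UTF-8 bytes. Splitting the raw bytes should be safe because the newline byte never occurs inside a multi-byte UTF-8 sequence, so the decode/encode round trip can be skipped; the block size is then measured in bytes rather than characters. The sketch is written for Python 3 and is expected to behave the same on Python 2.7.

```python
def split_blocks(data, block_size):
    """Yield header-prefixed blocks of about block_size units, cut at line ends.

    data may be text or UTF-8 bytes; block_size then counts characters or
    bytes respectively. The header line is not counted toward block_size.
    """
    # Pick a separator of the same type as the data.
    newline = b"\n" if isinstance(data, bytes) else u"\n"

    header_end = data.find(newline)
    if header_end == -1:
        return  # only a header (or nothing at all): no blocks to yield

    header = data[:header_end + 1]  # header including its newline
    start = header_end + 1
    total = len(data)

    while start < total:
        # First newline at or after the block-size mark.
        end = data.find(newline, start + block_size)
        if end == -1:
            end = total   # no further newline: take everything that is left
        else:
            end += 1      # keep the newline inside this block
        yield header + data[start:end]
        start = end


# Usage with the small example from the question.
sample = u"Header\nline1\nline2\nline3\nline4\n"
blocks = list(split_blocks(sample, 8))
```

Because the loop stops as soon as start reaches the end of the data, input that ends with a newline does not produce a trailing block containing only the header. Note that cutting at newlines assumes no CSV field contains an embedded line break.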

