
My program first clusters a big dataset into 100 clusters, then runs a model on each cluster using multiprocessing. My goal is to concatenate all the output values into one big CSV file, i.e. the concatenation of the output data from the 100 fitted models.

For now, I just create 100 CSV files, then loop over the folder containing these files and copy them one by one, line by line, into a big file, roughly as in the sketch below.
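Roughly, the concatenation step currently looks like this (the folder path and file pattern are placeholders):

# Sketch of the current approach; folder and file names are placeholders
import glob

with open('big_output.csv', 'w') as big_file:
    for file_name in sorted(glob.glob('cluster_outputs/part_*.csv')):
        with open(file_name) as part_file:
            for line in part_file:   # copy line by line, as described above
                big_file.write(line)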

My question: is there a smarter method to get this big output file without exporting 100 intermediate files? I use pandas and scikit-learn for data processing, and multiprocessing for parallelization.

  • Do you know for certain that all the CSV files (a) have the same column headers, (b) the column headers are in the same order, and (c) all are completely written to disk? Commented Oct 10, 2015 at 17:23
  • I do not include any header in the CSV files, but they have the same number and types of columns. I am not saying that I can't do it; I am looking for a smarter method, like maybe filling some store progressively and then exporting the final data into a CSV. Commented Oct 10, 2015 at 17:24
  • If you're on Unix and you have no column headers, you don't need Python: cat *partial*.csv > unified.csv. Commented Oct 10, 2015 at 17:27
  • Is there a cleaner way using only Python rather than the shell? Commented Oct 10, 2015 at 17:28
  • Not exactly sure what you want, but look at the pickle library if you want to easily save arrays, models, etc.; a rough sketch follows these comments. Commented Oct 10, 2015 at 17:29
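For reference, the pickle suggestion from the comments would look roughly like this (the file name and the object being saved are placeholders):

# Sketch of the pickle suggestion; the file name and object are placeholders
import pickle

model_outputs = {'cluster_0': [1, 2, 3]}   # any picklable object: arrays, fitted models, ...
with open('outputs.pkl', 'wb') as f:
    pickle.dump(model_outputs, f)

with open('outputs.pkl', 'rb') as f:
    restored = pickle.load(f)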

3 Answers


Have your worker processes return the dataset to the main process rather than writing the CSV files themselves; then, as they hand data back, have the main process write it all to one continuous CSV.

from multiprocessing import Process, Manager

def worker_func(proc_id, results):
    # Do your thing, then store this worker's dataset under its id
    results[proc_id] = ["your dataset from %s" % proc_id]

def convert_dataset_to_csv(dataset):
    # Placeholder example.  I realize what it's doing is ridiculous
    converted_dataset = [','.join(data.split()) for data in dataset]
    return converted_dataset

if __name__ == '__main__':
    # guard keeps child processes from re-running this block on spawn-based platforms
    m = Manager()
    d_results = m.dict()

    worker_count = 100

    jobs = [Process(target=worker_func,
                    args=(proc_id, d_results))
            for proc_id in range(worker_count)]

    for j in jobs:
        j.start()

    for j in jobs:
        j.join()

    with open('somecsv.csv', 'w') as f:
        for d in d_results.values():
            # if the actual conversion function benefits from multiprocessing,
            # you can do that there too instead of here
            for r in convert_dataset_to_csv(d):
                f.write(r + '\n')
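
Since the question already uses pandas, a Pool-based variant of the same idea can skip the Manager and the manual CSV conversion: each worker returns a DataFrame and the parent concatenates them once and writes a single file. This is only a sketch; fit_cluster_model and the toy cluster inputs are placeholders for the question's real model code.

# Sketch only: fit_cluster_model and the cluster inputs below are placeholders
from multiprocessing import Pool
import pandas as pd

def fit_cluster_model(cluster_df):
    # run the model on one cluster and return its output as a DataFrame
    return pd.DataFrame({'prediction': range(len(cluster_df))})

if __name__ == '__main__':
    clusters = [pd.DataFrame({'x': [i]}) for i in range(100)]  # stand-in for the 100 clusters
    with Pool() as pool:
        outputs = pool.map(fit_cluster_model, clusters)
    pd.concat(outputs, ignore_index=True).to_csv('somecsv.csv', index=False)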


If all of your partial CSV files have no headers and share the same number and order of columns, you can concatenate them like this:

with open("unified.csv", "w") as unified_csv_file:
    for partial_csv_name in partial_csv_names:
        with open(partial_csv_name) as partial_csv_file:
            unified_csv_file.write(partial_csv_file.read())
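
If the list of partial file names is not already at hand, it can be built with glob; shutil.copyfileobj then copies each file in chunks rather than reading it fully into memory (the file pattern here is an assumption):

# The "partial_*.csv" pattern is an assumption; point it at the real partial files
import glob
import shutil

with open("unified.csv", "w") as unified_csv_file:
    for partial_csv_name in sorted(glob.glob("partial_*.csv")):
        with open(partial_csv_name) as partial_csv_file:
            shutil.copyfileobj(partial_csv_file, unified_csv_file)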


Pinched the guts of this from http://computer-programming-forum.com/56-python/b7650ebd401d958c.htm; it's a gem.

#!/usr/bin/python
# -*- coding: utf-8 -*-
from glob import glob

file_list = glob('/home/rolf/*.csv')
print("There are {x} files to be concatenated".format(x=len(file_list)))

with open('concatenated.csv', 'w') as concat_file:
    for n, file_name in enumerate(file_list, start=1):
        print("files added {n}".format(n=n))
        with open(file_name, 'r') as partial_file:
            concat_file.write(partial_file.read())
