
My program first clusters a big dataset into 100 clusters, then runs a model on each cluster using multiprocessing. My goal is to concatenate all the output values into one big CSV file, i.e. the concatenation of the output data from the 100 fitted models.

For now, I just create 100 CSV files, then loop over the folder containing these files and copy them one by one, line by line, into a big file, roughly as in the sketch below.
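Roughly, the concatenation step currently looks like this (the folder path and file pattern are placeholders):

# Sketch of the current approach; folder and file names are placeholders
import glob

with open('big_output.csv', 'w') as big_file:
    for file_name in sorted(glob.glob('cluster_outputs/part_*.csv')):
        with open(file_name) as part_file:
            for line in part_file:   # copy line by line, as described above
                big_file.write(line)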

My question: is there a smarter method to get this big output file without exporting 100 intermediate files? I use pandas and scikit-learn for data processing, and multiprocessing for parallelization.

  • Do you know for certain that all the CSV files (a) have the same column headers, (b) the column headers are in the same order, and (c) all are completely written to disk? Commented Oct 10, 2015 at 17:23
  • I do not include any header in the CSV files, but they have the same number and types of columns. I am not saying that I can't do it; I am looking for a smarter method, like maybe filling some store progressively and then exporting the final data into a CSV. Commented Oct 10, 2015 at 17:24
  • If you're on Unix and you have no column headers, you don't need Python: cat *partial*.csv > unified.csv. Commented Oct 10, 2015 at 17:27
  • Is there a cleaner way using only Python rather than the shell? Commented Oct 10, 2015 at 17:28
  • Not exactly sure what you want, but look at the pickle library if you want to easily save arrays, models, etc.; a rough sketch follows these comments. Commented Oct 10, 2015 at 17:29
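For reference, the pickle suggestion from the comments would look roughly like this (the file name and the object being saved are placeholders):

# Sketch of the pickle suggestion; the file name and object are placeholders
import pickle

model_outputs = {'cluster_0': [1, 2, 3]}   # any picklable object: arrays, fitted models, ...
with open('outputs.pkl', 'wb') as f:
    pickle.dump(model_outputs, f)

with open('outputs.pkl', 'rb') as f:
    restored = pickle.load(f)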

3 Answers


Have your worker processes return the dataset to the main process rather than writing the CSV files themselves; then, as they hand data back, have the main process write it all to one continuous CSV.

from multiprocessing import Process, Manager

def worker_func(proc_id, results):
    # Do your thing, then store this worker's dataset under its id
    results[proc_id] = ["your dataset from %s" % proc_id]

def convert_dataset_to_csv(dataset):
    # Placeholder example.  I realize what it's doing is ridiculous
    converted_dataset = [','.join(data.split()) for data in dataset]
    return converted_dataset

if __name__ == '__main__':
    # guard keeps child processes from re-running this block on spawn-based platforms
    m = Manager()
    d_results = m.dict()

    worker_count = 100

    jobs = [Process(target=worker_func,
                    args=(proc_id, d_results))
            for proc_id in range(worker_count)]

    for j in jobs:
        j.start()

    for j in jobs:
        j.join()

    with open('somecsv.csv', 'w') as f:
        for d in d_results.values():
            # if the actual conversion function benefits from multiprocessing,
            # you can do that there too instead of here
            for r in convert_dataset_to_csv(d):
                f.write(r + '\n')
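
Since the question already uses pandas, a Pool-based variant of the same idea can skip the Manager and the manual CSV conversion: each worker returns a DataFrame and the parent concatenates them once and writes a single file. This is only a sketch; fit_cluster_model and the toy cluster inputs are placeholders for the question's real model code.

# Sketch only: fit_cluster_model and the cluster inputs below are placeholders
from multiprocessing import Pool
import pandas as pd

def fit_cluster_model(cluster_df):
    # run the model on one cluster and return its output as a DataFrame
    return pd.DataFrame({'prediction': range(len(cluster_df))})

if __name__ == '__main__':
    clusters = [pd.DataFrame({'x': [i]}) for i in range(100)]  # stand-in for the 100 clusters
    with Pool() as pool:
        outputs = pool.map(fit_cluster_model, clusters)
    pd.concat(outputs, ignore_index=True).to_csv('somecsv.csv', index=False)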


If all of your partial CSV files have no headers and share the same number and order of columns, you can concatenate them like this:

with open("unified.csv", "w") as unified_csv_file:
    for partial_csv_name in partial_csv_names:
        with open(partial_csv_name) as partial_csv_file:
            unified_csv_file.write(partial_csv_file.read())
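
If the list of partial file names is not already at hand, it can be built with glob; shutil.copyfileobj then copies each file in chunks rather than reading it fully into memory (the file pattern here is an assumption):

# The "partial_*.csv" pattern is an assumption; point it at the real partial files
import glob
import shutil

with open("unified.csv", "w") as unified_csv_file:
    for partial_csv_name in sorted(glob.glob("partial_*.csv")):
        with open(partial_csv_name) as partial_csv_file:
            shutil.copyfileobj(partial_csv_file, unified_csv_file)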


Pinched the guts of this from http://computer-programming-forum.com/56-python/b7650ebd401d958c.htm; it's a gem.

#!/usr/bin/python
# -*- coding: utf-8 -*-
from glob import glob

file_list = glob('/home/rolf/*.csv')
print("There are {x} files to be concatenated".format(x=len(file_list)))

with open('concatenated.csv', 'w') as concat_file:
    for n, file_name in enumerate(file_list, start=1):
        print("files added {n}".format(n=n))
        with open(file_name, 'r') as partial_file:
            concat_file.write(partial_file.read())
