I have multiple CSV files that all have more or less the same headers. Some have all of the headers and some are missing a few. I want to use a common CSV file that contains only the headers, and merge all of the files under it.

sample header:

a, b, c, d, e, f,

file 1:

a, b, d,
1, 2, 3,

file 2:

a, b, c, e,
4, 5, 6, 7,

Merged result:

a, b, c, d, e, f,
1, 2,  , 3,  ,  ,
4, 5, 6,  , 7,  ,

So far I have been pointed to csv.DictReader and csv.DictWriter, but I am having trouble merging based on a common header while also keeping the header order. Is there any way I can still use them without the columns being sorted?

I tried the pandas merge function, but it needs a column to sort on, which my data does not contain.

Any help is appreciated. Thank you.

  • Why don't you just use strip(',') and split(', ') to parse input and then use an iterator to write() to the files? Commented Aug 10, 2014 at 1:54
  • @Matt This is not my actual data; I used it only to give an idea of the type of data I am dealing with. My headers look like "010 C03AA01", and the data in the CSV files can be any kind of string. Some CSV files have one row of data and some have multiple rows... Commented Aug 10, 2014 at 2:14
  • @cyrusR Have you looked into csvkit: csvkit.readthedocs.org/en/0.8.0 Commented Aug 10, 2014 at 5:01
  • Just added a simple class that you can use to solve your problem Commented Aug 10, 2014 at 7:01

1 Answer

I wrote a small class to do this. Its with_header method returns a generator that you can iterate over to build your final file.

import csv


class DataFile(object):
    empty = ''  # value used when a column is missing from this file

    def __init__(self, filename):
        f = open(filename, 'r')
        self.reader = csv.reader(f)
        # the first line is the header; next() works on Python 2.6+ and 3
        self.header = [x.strip() for x in next(self.reader)]

    def get_header(self):
        return self.header

    def with_header(self, headers):
        """Return a generator of rows laid out according to `headers`."""
        # map each column name to its position in this file
        header_dict = {name: i for i, name in enumerate(self.header)}

        for line in self.reader:
            li = []
            for h in headers:
                # also guard against rows shorter than the header
                if h in header_dict and header_dict[h] < len(line):
                    li.append(line[header_dict[h]])
                else:
                    li.append(self.empty)
            yield li

You can use it to join file1.csv and file2.csv like this:

>>> one = DataFile('file1.csv')
>>> two = DataFile('file2.csv')
>>> one.get_header()
['a', 'b', 'd', '']
>>> comb = set(one.get_header() + two.get_header())
>>> final = list(one.with_header(comb)) + list(two.with_header(comb))
>>> final
[['1', '', '', ' 2', '', ' 3'], ['4', '', ' 6', ' 5', ' 7', '']]
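
One caveat with the session above: a set has no defined order, so the columns in final come out in an arbitrary order. If the header order matters, an order-preserving union is one option. The following is a minimal sketch, assuming Python 3.7+ (where dict keeps insertion order); ordered_union is a hypothetical helper name and is not part of the class above.

```python
def ordered_union(*headers):
    """Merge header lists, keeping first-seen order and dropping duplicates.

    Empty names (produced by a trailing comma in the file) are skipped.
    """
    merged = dict.fromkeys(
        name for header in headers for name in header if name
    )
    return list(merged)


# Headers as parsed from file1.csv and file2.csv in the example above
comb = ordered_union(['a', 'b', 'd', ''], ['a', 'b', 'c', 'e', ''])
print(comb)  # ['a', 'b', 'd', 'c', 'e']
```

If a common header file defines the desired order, passing its header as the first argument makes that order win, since later lists only contribute names that have not been seen yet.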

You can then use comb and final to build your new CSV file (with csv.writer, for example). You can also write a function that takes multiple files and returns a single generator with all columns from all files. To change the value used when a column is missing from a file, modify the empty attribute. I think the code is easy to follow.
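
Since the question asks about csv.DictReader and csv.DictWriter, the same merge can probably also be done with those two directly. This is only a sketch under some assumptions: the common header file holds the column names on its first row, and read_header and merge_csv are hypothetical names chosen for illustration. DictWriter writes columns in the order of fieldnames, fills missing columns with restval, and with extrasaction='ignore' drops keys that are not in fieldnames (such as the empty name created by a trailing comma).

```python
import csv


def read_header(path):
    """Return the column names from the first row of a header-only CSV file."""
    with open(path, newline='') as f:
        row = next(csv.reader(f, skipinitialspace=True))
    # Drop empty names caused by a trailing comma
    return [name.strip() for name in row if name.strip()]


def merge_csv(header_path, input_paths, output_path):
    """Append the rows of every input file to output_path in header order."""
    fieldnames = read_header(header_path)
    with open(output_path, 'w', newline='') as out:
        writer = csv.DictWriter(
            out, fieldnames=fieldnames, restval='', extrasaction='ignore'
        )
        writer.writeheader()
        for path in input_paths:
            with open(path, newline='') as f:
                for row in csv.DictReader(f, skipinitialspace=True):
                    # Normalise keys; a None key (extra cells) becomes ''
                    # and is then ignored by the writer
                    writer.writerow(
                        {(k or '').strip(): v for k, v in row.items()}
                    )
```

A call such as merge_csv('header.csv', ['file1.csv', 'file2.csv'], 'merged.csv') would then write the merged file; the file names here are placeholders. No sorting is involved, so the columns stay in the order given by the header file.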


1 Comment

Thank you tr33hous ;) I changed the set into a list, since the set was reordering the headers for me...
