I am new to Python and not much of a coder. I have 40+ text files that I want to combine side by side into a 'wide' CSV (as opposed to a 'tall' CSV; that is, I don't want to append the files) and write the result to a new CSV.

Using Pandas (merge) I am able to achieve what I want, but I presume there is a simpler way. Here it is on seven of the files:


import pandas as pd

a = pd.read_csv("c:/pyTest/B01001.txt")
b = pd.read_csv("c:/pyTest/B01002.txt")
c = pd.read_csv("c:/pyTest/B01003.txt")
d = pd.read_csv("c:/pyTest/B02001.txt")
e = pd.read_csv("c:/pyTest/B05001.txt")
f = pd.read_csv("c:/pyTest/B05002.txt")
g = pd.read_csv("c:/pyTest/B05012.txt")

merged = a.merge(b.merge(c.merge(d.merge(e.merge(f.merge(g, on='GEOID'), on='GEOID'), on='GEOID'), on='GEOID'), on='GEOID'), on='GEOID')
merged.to_csv("c:/pytest/fook.csv", index=False)

It would be great if the duplicated column names (e.g. 'GEOID') weren't repeated in the output file too.

Any help from you experts greatly appreciated.

  • Can you show me an example of how two of the files look (just a single row) and how you would like them to end up? I don't follow your 'tall'/'wide' analogy. Commented Oct 7, 2014 at 19:15
  • I think this is very similar to what you want to do? stackoverflow.com/questions/18689453/… Commented Oct 7, 2014 at 19:20

1 Answer

You can apply merge to a list of DataFrames using reduce:

import pandas as pd
import functools

files = ["c:/pyTest/B01001.txt", "c:/pyTest/B01002.txt", "c:/pyTest/B01003.txt",
         "c:/pyTest/B02001.txt", "c:/pyTest/B05001.txt", "c:/pyTest/B05002.txt",
         "c:/pyTest/B05012.txt",]
dfs = [pd.read_csv(filename).set_index('GEOID') for filename in files]
mergefunc = functools.partial(pd.merge, left_index=True, right_index=True)
merged = functools.reduce(mergefunc, dfs)

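# GEOID is now the index, so keep the index when writing;
# index=False would drop the GEOID column from the output.
merged.to_csv("c:/pytest/fook.csv")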

When Pandas merges two DataFrames based on the index (rather than on columns), the resultant DataFrame uses the merged index. Thus you can avoid duplication of GEOID columns by merging on the index.


For example:

In [99]: import numpy as np
In [100]: import pandas as pd
In [101]: import functools

In [102]: dfs = [pd.DataFrame(np.arange(6).reshape(3,2), columns=['A','B{}'.format(i)]).set_index('A') for i in range(3)]

In [103]: mergefunc = functools.partial(pd.merge, left_index=True, right_index=True)    
In [104]: merged = functools.reduce(mergefunc, dfs)

In [105]: merged
Out[105]: 
   B0  B1  B2
A            
0   1   1   1
2   3   3   3
4   5   5   5
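
With 40+ files, typing every path by hand gets tedious. The same approach can be sketched with glob collecting the paths. This is only an illustration under some assumptions: the helper name merge_wide, the "B*.txt" pattern, and the temporary directory with three tiny stand-in files are all made up for the example, and every file is assumed to have a GEOID column.

```python
import functools
import glob
import os
import tempfile

import pandas as pd


def merge_wide(paths, key="GEOID"):
    """Inner-join every CSV in `paths` on `key`, giving one wide table.

    Hypothetical helper: it wraps the reduce/merge approach shown above.
    """
    dfs = [pd.read_csv(path).set_index(key) for path in paths]
    mergefunc = functools.partial(pd.merge, left_index=True, right_index=True)
    return functools.reduce(mergefunc, dfs)


# Create three small stand-in files so the example runs anywhere.
tmpdir = tempfile.mkdtemp()
for i in range(3):
    frame = pd.DataFrame({"GEOID": [10, 20, 30], "B{}".format(i): [1, 3, 5]})
    frame.to_csv(os.path.join(tmpdir, "B0100{}.txt".format(i + 1)), index=False)

# Collect the input paths by pattern instead of listing them by hand.
paths = sorted(glob.glob(os.path.join(tmpdir, "B*.txt")))
merged = merge_wide(paths)

# Keep the index when writing so GEOID appears exactly once in the output.
out_path = os.path.join(tmpdir, "merged.csv")
merged.to_csv(out_path)

result = pd.read_csv(out_path)
print(result.columns.tolist())
```

Sorting the paths keeps the column order predictable. Note that merging on the index performs an inner join by default, so any GEOID missing from one file is dropped from the result.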