
I have multiple text (.txt) files saved in a folder. I'm trying to combine them all into a single dataframe. So far I have been able to combine them, but not in the manner I'd like.

The text files (named yob####.txt where #### is a year) have information that looks like this:

Jennifer,F,58376
Amanda,F,35818
Jessica,F,33923
Melissa,F,31634
Sarah,F,25755
Heather,F,19975
Nicole,F,19917
Amy,F,19834
Elizabeth,F,19529
Michelle,F,19122
Kimberly,F,18499
Angela,F,17970

I'm trying to open each file, add the year to the end of the row, and move on.

def main():
    files = file_paths(FILE_FOLDER) # returns a list of file paths, i.e. ["C:\Images\file.txt","C:\Images\file2.txt", ...]

    df = []
    for file in files:
        year = file.split("\\")[-1][3:7] 
        df.append(pd.read_table(file)+","+year)
    big_df = pd.concat(df, ignore_index=True, axis=1)
    big_df.to_csv("Combined.csv", header=False, index=False)

This almost works...except it takes each file and puts the data in a column, the next file in a second column, next file in a third, etc.

Current output: (screenshot: each file's data appears in its own column, one column per year)

The expected output is the same, except when it opens the 1881 file, it adds the info to the end of 1880. Then 1882 goes after the 1881 data, etc. etc.

  • You are currently concatenating the DFs in columns and not in rows; try big_df = pd.concat(df, ignore_index=True, axis=0) instead. Commented May 11, 2018 at 18:45
  • @Ben.T - I have tried that too... As I have it originally (axis=1), it runs in approx. 4.6 seconds. Doing axis=0 pushes it up to about 38.9 s, shoots the file size from 38MB to 293MB, and it has lots of "empty columns" (screenshot here) Commented May 11, 2018 at 18:54
  • Indeed, it does not look nice... The problem might be that there is no header in your txt files, so the first row is by definition taken as the column names, and none of them match from year to year. Try pd.read_table(file, header=None) and still concatenate with axis=0. Commented May 11, 2018 at 19:04
  • @Ben.T - Aha!! That looks like it does the trick, reduced to 8.2 seconds and ~33MB :D Commented May 11, 2018 at 19:08
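Putting the comment thread together, here is a minimal sketch of the fix (header=None plus axis=0). Since the real yob####.txt data isn't included here, two throwaway stand-in files with hypothetical values are created in a temp folder:

```python
import os
import tempfile

import pandas as pd

# Stand-in files mimicking the yob####.txt layout from the question (hypothetical data)
folder = tempfile.mkdtemp()
samples = {
    "yob1880.txt": "Mary,F,7065\nAnna,F,2604\n",
    "yob1881.txt": "Mary,F,6919\nAnna,F,2698\n",
}
for name, text in samples.items():
    with open(os.path.join(folder, name), "w") as f:
        f.write(text)

dfs = []
for name in sorted(os.listdir(folder)):
    year = name[3:7]
    # header=None keeps pandas from promoting the first data row to column names
    dfs.append(pd.read_table(os.path.join(folder, name), header=None) + "," + year)

big_df = pd.concat(dfs, ignore_index=True, axis=0)  # axis=0 stacks the files row-wise
print(big_df.shape)  # (4, 1): four rows; still one string column, since read_table splits on tabs
```

Note the frames still have a single column because read_table's default separator is a tab; the answers below address that part.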

2 Answers

  1. With read_table, the default separator is a tab (sep='\t'), so each comma-separated line is read as a single column. Change read_table to read_csv, which defaults to sep=','. Alternatively, pass sep=',' to read_table for the same effect.
  2. You're trying to add a new year column, but string concatenation on the DataFrame isn't the right way to do it. Use assign to add it in.
  3. Concatenate vertically (axis=0, the default), not horizontally.

df_list = []
for file in files:
    year = ...
    df_list.append(pd.read_csv(file, header=None).assign(year=year))

big_df = pd.concat(df_list, ignore_index=True)
big_df.to_csv("Combined.csv", header=False, index=False)
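To make point 1 concrete, here is a quick comparison of the two readers on an in-memory copy of the sample rows (io.StringIO stands in for a file):

```python
import io

import pandas as pd

rows = "Jennifer,F,58376\nAmanda,F,35818\n"

# read_table defaults to sep='\t': there are no tabs, so each line stays one cell
one_col = pd.read_table(io.StringIO(rows), header=None)

# read_csv defaults to sep=',': each line splits into three columns
three_cols = pd.read_csv(io.StringIO(rows), header=None)

print(one_col.shape, three_cols.shape)  # (2, 1) (2, 3)
```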

7 Comments

Thanks for the second point, I knew the way I was adding the year wasn't pythonic/was kludgy. When I try yours, I get a MemoryError at the line big_df = pd.concat(df, ignore_index=True)
@BruceWayne Looks like you have a lot of data :D... by the way, did you mean `df_list`? I've changed the variables in my code a bit.
It's a lot of data, but not a crazy amount I wouldn't think. FYI it's the "National Database" from Social Security Admin...I'm surprised it's hitting a memory error? (And thanks yeah, I noticed the variable name changes).
@BruceWayne What is len(df_list)? Also, what is sum(map(len, df_list))?
@BruceWayne AHA! Because your text files have no header. The reason that happened is because concat tries to auto align the concatted DataFrames, resulting in a huge memory blowout with NaNs. Thanks for that, I'll fix my answer to reflect that.
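That alignment blowout is easy to reproduce with a hypothetical two-file example: without header=None, each file's first data row becomes its column names, so concat finds no columns in common and pads everything with NaN:

```python
import io

import pandas as pd

csv_a = "Mary,F,7065\nAnna,F,2604\n"
csv_b = "John,M,9655\nWilliam,M,9532\n"

# Without header=None, the first row of each file is promoted to column names,
# so the two frames share no columns and concat pads every cell it can't align with NaN
bad = pd.concat(
    [pd.read_csv(io.StringIO(s)) for s in (csv_a, csv_b)],
    ignore_index=True,
)
print(bad.shape)  # (2, 6): six distinct columns, half the cells NaN

# With header=None, both frames get columns 0, 1, 2 and stack cleanly
good = pd.concat(
    [pd.read_csv(io.StringIO(s), header=None) for s in (csv_a, csv_b)],
    ignore_index=True,
)
print(good.shape)  # (4, 3)
```

Scaled up to 100+ yearly files with all-distinct "headers", this padding is what produced the memory error.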
|

You can use pd.DataFrame.assign to add a column seamlessly while you iterate.

Note also that it is good practice to use os.path.basename instead of splitting on a hard-coded separator: this ensures your code works across platforms.
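As a small illustration (the yob####.txt naming is taken from the question), os.path.join builds the path with the platform's own separator and os.path.basename strips it back off, so nothing is hard-coded:

```python
import os

# basename splits on the platform's native separator, so there is no need to
# hard-code "\\" (which breaks the moment the code runs on a POSIX system)
path = os.path.join("data", "yob1880.txt")  # built with the native separator
name = os.path.basename(path)               # "yob1880.txt" on any platform
year = name[3:7]
print(year)  # 1880
```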

Updated: add header=None and use pd.read_csv, as discussed in the other answer.

dfs = []
for file in files:
    year = os.path.basename(file)[3:7]
    dfs.append(pd.read_csv(file, header=None).assign(Year=year))

df = pd.concat(dfs, ignore_index=True)

More concisely, you can use a list comprehension:

dfs = [pd.read_csv(file, header=None).assign(Year=os.path.basename(file)[3:7])
       for file in files]

df = pd.concat(dfs, ignore_index=True)

1 Comment

Assign isn't the only problem here. The other problem is the separator and the manner in which concatenation is done (see my answer)
