8

I'm trying to append a pandas DataFrame (single column) to an existing CSV, much like this post, but it's not working! Instead my column is added at the bottom of the csv, and repeated over and over (rows in csv >> size of column). Here's my code:

with open(outputPath, "a") as resultsFile:
    print len(scores)
    scores.to_csv(resultsFile, header=False)
    print resultsFile

Terminal output:4032 <open file '/Users/alavin/nta/NAB/results/numenta/artificialWithAnomaly/numenta_art_load_balancer_spikes.csv', mode 'a' at 0x1088686f0>

Thank you in advance!

6
  • 1
    two additional bits of info would be helpful: 1) what do the existing contents of resultsFile look like? (confirm by inspecting the file manually) and 2) what does the scores DataFrame look like? (scores.head(10) should suffice) Commented Jan 8, 2015 at 18:44
  • 1
    It's appending the scores data frame at the end of the file because that's how the pandas to_csv functionality works. If you want to append scores as a new column on the original csv data frame then you would need to read the csv into a data frame, append the scores column and then write it back to the csv. Commented Jan 8, 2015 at 18:45
  • resultsFile is a csv of 5 columns: 'timestamp', 'value', 'aaa', 'bbb', 'label'. I would like the 6th to be the scores DataFrame. I've verified all columns are the same length. scores has a column header 's'. Commented Jan 8, 2015 at 19:08
  • @aus_lacy I should've clarified, I'm trying to do this without reading in the csv... Commented Jan 8, 2015 at 19:19
  • @alavin89 I don't think it's possible to append the column to the original data frame within the csv without opening the file and parsing the data since python has no way of knowing that there is a data frame in the csv to append to. Commented Jan 8, 2015 at 19:23

2 Answers 2

13

Like what @aus_lacy has already suggested, you just need to read the csv file into a data frame first, concatenate two data frames and write it back to the csv file:

supposed your existing data frame called df:

df_csv = pd.read_csv(outputPath, 'your settings here')

# provided that their lengths match
df_csv['to new column'] = df['from single column']

df_csv.to_csv(outputPath, 'again your settings here')

That's it.

Sign up to request clarification or add additional context in comments.

6 Comments

I'm trying to avoid opening and reading in all that data, but this does work :)
@alavin89, do you have to use python?
@alavin89, then it will be difficult, provided that you still need to open+read each line of your csv to find the linefeed and append your new column. I don't like DiskIO-wise you can have an easy solution
For some reason, to_csv() is adding data in new rows, I want to add the dataframe in new columns. Can you please help? pythonfiddle.com/copy-csv-and-dataframe
@Veronica are you certain the length of two dataframes actually matched?
|
3

I find the solution problematic, if many columns are to be added to a large csv file iteratively.

A solution would be to accept the csv file to store a transposed dataframe. i.e. headers works as indices and vice versa.

The upside is that you don't waste computation power on insidious operations.

Here is operation times for regular appending mode, mode='a', and appending column approach for series with length of 5000 appended 100 times:

enter image description here

The downside is that you have to transpose the dataframe to get the "intended" dataframe when reading the csv for other purposes.

Code for plot:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt

col = []
row = []
N = 100

# Append row approach
for i in range(N):
    t1 = dt.datetime.now()
    data = pd.DataFrame({f'col_{i}':np.random.rand(5000)}).T
    data.to_csv('test_csv_data1.txt',mode='a',header=False,sep="\t")
    t2 = dt.datetime.now()
    row.append((t2-t1).total_seconds())

# Append col approach
pd.DataFrame({}).to_csv('test_csv_data2.txt',header=True,sep="\t")
for i in range(N):
    t1 = dt.datetime.now()
    data = pd.read_csv('test_csv_data2.txt',sep='\t',header=0)
    data[f'col_{i}'] = np.random.rand(5000)
    data.to_csv('test_csv_data2.txt',header=True,sep="\t")
    t2 = dt.datetime.now()
    col.append((t2-t1).total_seconds())
    
t = pd.DataFrame({'N appendices':[i for i in range(N)],'append row':row,'append col':col})
t = t.set_index('N appendices')

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.