1

Need your help please.

I have two dataframes created from csvs and I need to return a new dataframe which will be the difference between the two on a specific field/column. For example, if ID from df1 is not in df2, then df3 should give me all columns and rows from df1 that are not in df2.

Note df1 and df2 columns are not identical i.e df1 could have more or less columns than df2 but the columns in df3 should be as of d1. Also, the ID (from df1) and User ID (df2) values are going to be the reconciling factor, the data in the fields will be the common factor but the actual field names are different.

Apologies in advance as the tables below are not clear. So in the below example, 1st row in df1 is not in df2, df3 should have this row. Once done, I need to save df3 as csv.

DF1

Direction ID Quantity Company Status

Sell - 09 - 32000 - T LTD - Rejected

Buy - 12 - 25000 - G Ltd - Done

Sell - 15 - 35000 - H Ltd - Done

DF2

Direction User ID Quantity Company Status Rating

Buy - 12 - 25000 - G Ltd - Done - Good Rating

Sell - 15 - 35000 - H Ltd - Done - Good Rating

Many thanks in advance

code so far:

import pandas as pd

fileLocationDF1 = "BBG.csv"
fileLocationDF2 = "corp.csv"

createDf1 = pd.read_csv(fileLocationDF1, low_memory = False)
createDf2 = pd.read_csv(fileLocationDF2, engine='python')

I have found the isin method which I think will help but the problem is that the "User ID" column (df2) has a space in the data frame (as is the case in the csv).

createDf1[createDf1.ID.isin(createDf2.columns[2].values)]

and I get the below error when

AttributeError: 'str' object has no attribute 'values'

I passed columns [2] in the isin method as the User ID has a space

Please help address the error and why the data is not being read so that i can get a unique set where the user Id from df2 is not in ID in df1.

See below - the one highligted is the one that is missing in DF2 and I would like this in df3

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
12
  • Which columns are you using for comparison? All columns in DF1? Commented Aug 16, 2018 at 10:07
  • Just the ID column Commented Aug 16, 2018 at 10:09
  • @jpp please see below Commented Aug 16, 2018 at 10:10
  • And which of these columns might be the ID column? The second? Commented Aug 16, 2018 at 10:12
  • yes the second column in DF1 and also the 'User ID' in df2 - the check should be done on the data in these two columns Commented Aug 16, 2018 at 10:13

1 Answer 1

2

I would do:

import pandas as pd

fileLocationDF1 = "BBG.csv"
fileLocationDF2 = "corp.csv"

createDf1 = pd.read_csv(fileLocationDF1, low_memory = False)
createDf2 = pd.read_csv(fileLocationDF2, engine='python')

# df3 will have createDf1 columns with ID's that are not in createDf2
# ~ means 'not' to the filter
# Acces the column via ['COLUMN NAME'] so you can put spaces into it ;)
df3 = createDf1[~createDf1['ID'].isin(createDf2['User ID'])]

I hope this helps!

Sign up to request clarification or add additional context in comments.

9 Comments

Hello - thank you very much for the response - however, the solution doesnt seem to work. There should be in total 5642 different records (checked by doing a simple v lookup). However, using your solution, I still get the same number of records in the df3 as per df2.
I'm trying this in a Jupyter Notebook and it seems to work as I want to. Make your own try with little df's first so you convince yourself that is working. import pandas as pd df1 = pd.DataFrame({'ID': [1,2,3], 'b':[4,5,6], 'c':[7,8,9]}) df2 = pd.DataFrame({'User ID': [7,2,5], 'b':[4,5,6], 'c':[7,8,9]}) df3 = df1[~df1['ID'].isin(df2['User ID'])] print(df1) print(df2) print(df3) Sorry I expected line skips
I am using spyder - the columns contain numbers as ID and User ID - do you think that will have an impact?
I don't think so , It's python anyway. I'm used to python3 btw. Try it with little simple dataframes, it will help you see how that functions work.
ah I have an issue with one of the files that I am using as input, corp.csv. I used engine = 'python' to bypass the original traceback. If I remove the engine, then I get this error which is why I think the df3 is not wokring,
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.