I'm performing data validation in Python using the Pandas module. I have two datasets, source and target, that I need to compare for expected values. I've successfully merged the two dataframes using pd.merge and need to identify the columns that cause rows to be marked left_only or right_only.

Obviously I can find the rows that don't match with ['_merge'] != 'both', but is there a way to output the names or positions of the columns responsible for the != 'both' result? That way I wouldn't have to search through each row to find which column is not working as expected.

For example, let's say these are the two dataframes:

SOURCE

ID    First Name    Last Name
001   John          Doe
002   Roger         Smith
003   Maggie        Adams

TARGET

ID     First Name    Last Name
A001   John          Doe
A002   Roger         Smith
A003   Maggie        Adams

Expected output: ID

In this scenario, the _merge value != 'both' due to the values in the ID column not matching. What command will give me either the position or name of the ID column in either dataframe?

If possible, I would also like to know how to find the exact position (row and column) of each mismatching value.

  • The question likely refers to the indicator parameter of pandas.merge() Commented Sep 10 at 18:46
  • Can you provide the actual parameters you are passing to pandas.merge()? Commented Sep 10 at 19:08
  • These are the parameters passed to pandas.merge(): dataframes_compare = pd.merge(SOURCE, TARGET, how='outer', indicator=True), where how='outer' requests a full outer join and indicator=True adds a column called _merge to the output with the source of each row. Commented Sep 10 at 21:23

2 Answers

I am not sure how efficient my solution is for large dataframes, but basically you can compare two dataframes by corresponding values. Something like this:

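import pandas as pd

df1 = pd.DataFrame({"ID":["001", "002", "003"],
                    "First Name":["John", "Roger", "Maggie"],
                    "Last Name":["Doe", "Smith", "Adams"]})
df2 = pd.DataFrame({"ID":["001", "A002", "A003"],
                    "First Name":["John1", "Roger", "Maggie"],
                    "Last Name":["Doe", "Smith", "Adams"]})

# Element-wise "not equal" gives a boolean frame; stack() turns it into a
# Series indexed by (row label, column name)
res = df1.ne(df2).stack()
# Keep only the True entries, i.e. the mismatching cells
diffs = res[res.eq(True)].index.tolist()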

diffs:

[(0, 'First Name'), (1, 'ID'), (2, 'ID')]

The major issue with this solution is that when your dataframes have different shapes, you additionally have to find out which of them has the extra elements.
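One possible way to handle frames of different shapes is sketched below. It assumes that rows correspond by index label (not by a key column), and the shorter df2 is a made-up example rather than the asker's data. DataFrame.align pads both frames to the union of their labels, so the same ne/stack idea still applies, and Index.difference shows which frame has the extra rows.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": ["001", "002", "003"],
                    "First Name": ["John", "Roger", "Maggie"]})
# Hypothetical target with one row fewer than the source
df2 = pd.DataFrame({"ID": ["001", "A002"],
                    "First Name": ["John1", "Roger"]})

# Row labels that exist in only one of the frames
rows_only_in_df1 = df1.index.difference(df2.index).tolist()
rows_only_in_df2 = df2.index.difference(df1.index).tolist()

# Align both frames on the union of row and column labels (gaps become NaN)
left, right = df1.align(df2, join="outer")

# A cell mismatches if the values differ, unless both sides are missing
mismatch = left.ne(right) & ~(left.isna() & right.isna())
flags = mismatch.stack()
diffs = flags[flags].index.tolist()
```

Here rows_only_in_df1 lists the row labels missing from df2, and diffs contains every (row, column) pair that differs, including the cells of the row that df2 lacks.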


The pandas methods DataFrame.compare and DataFrame.equals should do what you need.

import pandas as pd
df1 = pd.DataFrame({"ID":["001", "002", "003"],
                    "First Name":["John", "Roger", "Maggie"],
                    "Last Name":["Doe", "Smith", "Adams"]})
df2 = pd.DataFrame({"ID":["001", "A002", "A003"],
                    "First Name":["John1", "Roger", "Maggie"],
                    "Last Name":["Doe", "Smith", "Adams"]})

df1.equals(df2)
# returns False

df1.compare(df2)
     ID       First Name
   self other       self  other
0   NaN   NaN       John  John1
1   002  A002        NaN    NaN
2   003  A003        NaN    NaN

Starting with equals will let you know if there are differences. Compare provides the details of where the differences are.

You can also target individual columns to get more concise output.

df1['ID'].compare(df2['ID'])
self other
1 002 A002
2 003 A003
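To get just the column names or positions the question asks for, the result of compare can be reduced further. This is a minimal sketch using the question's sample data; the variable names (source, target, mismatched_columns, column_positions, cells) are made up for illustration, and it assumes both frames share the same index and column labels, which compare requires.

```python
import pandas as pd

source = pd.DataFrame({"ID": ["001", "002", "003"],
                       "First Name": ["John", "Roger", "Maggie"],
                       "Last Name": ["Doe", "Smith", "Adams"]})
target = pd.DataFrame({"ID": ["A001", "A002", "A003"],
                       "First Name": ["John", "Roger", "Maggie"],
                       "Last Name": ["Doe", "Smith", "Adams"]})

# compare() keeps only the columns (and rows) that contain a difference
diff = source.compare(target)

# The top level of the resulting column MultiIndex holds the original names
mismatched_columns = diff.columns.get_level_values(0).unique().tolist()

# Integer positions of those columns in the source frame
column_positions = [source.columns.get_loc(name) for name in mismatched_columns]

# (row label, column name) pairs, using the per-column compare shown above
cells = [(row, name)
         for name in mismatched_columns
         for row in source[name].compare(target[name]).index]
```

With this data, mismatched_columns contains only 'ID', column_positions gives its position in source, and cells lists each row where the ID values differ.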
