
I have a number of tests where a Pandas dataframe output needs to be compared with a static baseline file. My preferred option for the baseline file format is CSV, for its readability and easy maintenance within Git. But if I load the CSV file into a dataframe and use

A.equals(B) 

where A is the output dataframe and B is the dataframe loaded from the CSV file, there will inevitably be mismatches, since the CSV file does not record datatypes and what-nots. So my rather contrived solution is to write dataframe A to a CSV file, load it back the same way as B, and then ask whether the two are equal.
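The round-trip idea described above can be sketched as a small helper (the names `roundtrip_equal` and `baseline_path` are illustrative, not from the original post):

```python
import io

import pandas as pd


def roundtrip_equal(A: pd.DataFrame, baseline_path: str) -> bool:
    """Return True if A, after a CSV round trip, equals the baseline CSV."""
    buf = io.StringIO()
    A.to_csv(buf, index=False)      # write A out the same way the baseline was written
    buf.seek(0)
    A_rt = pd.read_csv(buf)        # ...and read it back, losing dtypes the same way
    B = pd.read_csv(baseline_path)  # the static baseline, read identically
    return A_rt.equals(B)
```

Because both frames pass through the same CSV serialization, dtype differences introduced by the format cancel out.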

Does anyone have a better solution that they have been using for some time without any issues?

  • Try looking at the difference between the two DataFrames, the output DataFrame and the one loaded from CSV, using sum((A != B).any(axis=1)). Let me know if this works; I can't test this myself, as re-creating your situation is not easy. Commented Jul 19, 2017 at 15:56
  • Thanks. Can I ask what exactly sum((A != B).any(axis=1)) does? I am getting an output of 1. Are you doing a row-by-row comparison? Commented Jul 19, 2017 at 17:18

3 Answers


If you are worried about the datatypes of the CSV file, you can load it as a dataframe with specific datatypes as follows:

import pandas as pd
B = pd.read_csv('path_to_csv.csv', dtype={"col1": "int64", "col2": "float64", "col3": "object"})

This will ensure that each column of the CSV is read with the intended data type.

After that you can just compare the dataframes easily by using

A.equals(B)

EDIT:

If you need to compare many pairs, another way would be to compare hash values of the dataframes instead of comparing each row and column of the individual frames:

hashA = hash(A.values.tobytes())
hashB = hash(B.values.tobytes())

Now compare these two hash values, which are just integers, to check whether the original dataframes were the same.

Be careful though: I am not sure whether the data types of the original dataframes would matter here. Be sure to check that. (Note also that Python's built-in hash() of bytes is randomized per interpreter session, so these values are only comparable within a single run.)
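As an alternative to hashing the raw bytes, pandas itself ships a row-wise hashing utility, pd.util.hash_pandas_object, whose values are stable across sessions. A minimal sketch (the helper name `frames_hash_equal` is made up for illustration):

```python
import pandas as pd


def frames_hash_equal(A: pd.DataFrame, B: pd.DataFrame) -> bool:
    """Compare per-row hashes of two dataframes instead of raw values."""
    # hash_pandas_object returns a uint64 Series with one hash per row;
    # index=True folds the index into each row's hash as well.
    return pd.util.hash_pandas_object(A, index=True).equals(
        pd.util.hash_pandas_object(B, index=True)
    )
```

Equal hash Series strongly suggests (though, as with any hashing, does not strictly prove) that the frames hold the same data.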


7 Comments

  • Thanks for your suggestion. I did this in the past, but as the number of files for testing increased, this approach wasn't very efficient.
  • Yes, I have several use cases: 1) testing, and 2) verifying that existing data is the same between the in-memory dataframe and the CSV file.
  • Another way could be to create hashes of the dataframes to be compared and then compare those values instead of comparing the original dataframes.
  • Hmm... how would I go about that? Do you mean a dictionary of sorts? Would there be a performance impact?
  • I have edited the answer to include that. Please take a look.

I came across a solution that does work for my case by making use of Pandas testing utilities.

from pandas.testing import assert_frame_equal

(In older pandas versions this lived at pandas.util.testing, which has since been removed.)

Then call it from within a try/except block with check_dtype set to False:

try:
    assert_frame_equal(A, B, check_dtype=False)
    print("The dataframes are the same.")
except AssertionError:
    print("Please verify data integrity.")



(A != B).any(axis=1) returns a Boolean Series telling you which rows contain at least one unequal cell.

Boolean values are internally represented as 1s and 0s, so you can sum() the result to count how many rows were not equal:

sum((A != B).any(axis=1))

If you get an output of 0, all rows were equal. (One caveat: NaN != NaN evaluates to True, so rows containing NaN in both frames are reported as different.)
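If you want to see which cells differ rather than just count mismatching rows, newer pandas versions (1.1+) offer DataFrame.compare. A small sketch with made-up data:

```python
import pandas as pd

A = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
B = pd.DataFrame({"col1": [1, 2, 4], "col2": ["a", "x", "c"]})

# Count rows with at least one differing cell (the expression discussed above)
n_diff = int((A != B).any(axis=1).sum())
print(n_diff)  # 2

# Show exactly which cells differ, side by side ('self' vs 'other' columns)
diff = A.compare(B)
print(diff)
```

compare() returns only the differing rows and columns, which makes a failing baseline test much easier to diagnose than a bare count.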

