
I have a number of tests where a Pandas dataframe output needs to be compared with a static baseline file. My preferred option for the baseline file format is CSV, for its readability and easy maintenance within Git. But if I load the CSV file into a dataframe and use

A.equals(B) 

where A is the output dataframe and B is the dataframe loaded from the CSV file, there will inevitably be mismatches, since the CSV file does not record datatypes and what-nots. So my rather contrived solution is to write dataframe A to a CSV file, load it back the same way as B, and then ask whether the two are equal.
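The round-trip idea described above can be sketched as a small helper (the names `roundtrip_equal` and `baseline_path` are illustrative, not from the original post):

```python
import io

import pandas as pd


def roundtrip_equal(A: pd.DataFrame, baseline_path: str) -> bool:
    """Return True if A, after a CSV round trip, equals the baseline CSV."""
    buf = io.StringIO()
    A.to_csv(buf, index=False)      # write A out the same way the baseline was written
    buf.seek(0)
    A_rt = pd.read_csv(buf)        # ...and read it back, losing dtypes the same way
    B = pd.read_csv(baseline_path)  # the static baseline, read identically
    return A_rt.equals(B)
```

Because both frames pass through the same CSV serialization, dtype differences introduced by the format cancel out.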

Does anyone have a better solution that they have been using for some time without any issues?

  • Try looking at the difference between the two DataFrames, the output DataFrame and the one loaded from CSV, using sum((A != B).any(axis=1)). Let me know if this works; I can't test this myself, as re-creating your situation is not easy. Commented Jul 19, 2017 at 15:56
  • Thanks. Can I ask what exactly sum((A != B).any(axis=1)) does? I am getting an output of 1. Are you doing a row-by-row comparison? Commented Jul 19, 2017 at 17:18

3 Answers


If you are worried about the datatypes of the CSV file, you can load it as a dataframe with specific datatypes as follows:

import pandas as pd
B = pd.read_csv('path_to_csv.csv', dtype={"col1": "int64", "col2": "float64", "col3": "object"})

This will ensure that each column of the CSV is read with the intended data type.

After that you can just compare the dataframes easily by using

A.equals(B)

EDIT:

If you need to compare many pairs, another way would be to compare hash values of the dataframes instead of comparing each row and column of the individual frames:

hashA = hash(A.values.tobytes())
hashB = hash(B.values.tobytes())

Now compare these two hash values, which are just integers, to check whether the original dataframes were the same.

Be careful though: I am not sure whether the data types of the original dataframes would matter here. Be sure to check that. (Note also that Python's built-in hash() of bytes is randomized per interpreter session, so these values are only comparable within a single run.)
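As an alternative to hashing the raw bytes, pandas itself ships a row-wise hashing utility, pd.util.hash_pandas_object, whose values are stable across sessions. A minimal sketch (the helper name `frames_hash_equal` is made up for illustration):

```python
import pandas as pd


def frames_hash_equal(A: pd.DataFrame, B: pd.DataFrame) -> bool:
    """Compare per-row hashes of two dataframes instead of raw values."""
    # hash_pandas_object returns a uint64 Series with one hash per row;
    # index=True folds the index into each row's hash as well.
    return pd.util.hash_pandas_object(A, index=True).equals(
        pd.util.hash_pandas_object(B, index=True)
    )
```

Equal hash Series strongly suggests (though, as with any hashing, does not strictly prove) that the frames hold the same data.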


7 Comments

  • Thanks for your suggestion. I did this in the past, but as the number of files for testing increased, this approach wasn't very efficient.
  • Yes, I have several use cases: 1) testing, and 2) verifying that existing data is the same between the in-memory dataframe and the CSV file.
  • Another way could be to create hashes of the dataframes to be compared and then compare those values instead of comparing the original dataframes.
  • Hmm... how would I go about that? Do you mean a dictionary of sorts? Would there be a performance impact?
  • I have edited the answer to include that. Please take a look.

I came across a solution that does work for my case by making use of Pandas testing utilities.

from pandas.testing import assert_frame_equal

(In older pandas versions this lived at pandas.util.testing, which has since been removed.)

Then call it from within a try/except block with check_dtype set to False:

try:
    assert_frame_equal(A, B, check_dtype=False)
    print("The dataframes are the same.")
except AssertionError:
    print("Please verify data integrity.")



(A != B).any(axis=1) returns a Boolean Series telling you which rows contain at least one unequal cell.

Boolean values are internally represented as 1s and 0s, so you can sum() the result to count how many rows were not equal:

sum((A != B).any(axis=1))

If you get an output of 0, all rows were equal. (One caveat: NaN != NaN evaluates to True, so rows containing NaN in both frames are reported as different.)
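If you want to see which cells differ rather than just count mismatching rows, newer pandas versions (1.1+) offer DataFrame.compare. A small sketch with made-up data:

```python
import pandas as pd

A = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
B = pd.DataFrame({"col1": [1, 2, 4], "col2": ["a", "x", "c"]})

# Count rows with at least one differing cell (the expression discussed above)
n_diff = int((A != B).any(axis=1).sum())
print(n_diff)  # 2

# Show exactly which cells differ, side by side ('self' vs 'other' columns)
diff = A.compare(B)
print(diff)
```

compare() returns only the differing rows and columns, which makes a failing baseline test much easier to diagnose than a bare count.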

