
I have two fixed-width files like the ones below (the only difference is the Date value starting at position 14).

sample_hash1.txt

GOKULKRISHNA 04/17/2018
ABCDEFGHIJKL 04/17/2018
111111111111 04/17/2018

sample_hash2.txt

GOKULKRISHNA 04/16/2018
ABCDEFGHIJKL 04/16/2018
111111111111 04/16/2018

Using pandas read_fwf, I am reading each file and creating a DataFrame that excludes the Date value by loading only the first 13 characters. My dataframes look like this:

import pandas as pd
df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0,13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0,13)])

df1

   GOKULKRISHNA
0  ABCDEFGHIJKL
1  111111111111
...

df2

   GOKULKRISHNA
0  ABCDEFGHIJKL
1  111111111111
...

Now I am trying to generate a hash value for each DataFrame, but the hashes for df1 and df2 are different. I'm not sure what's wrong here. Can someone shed some light on this, please? I need to identify whether the data has changed between the files (excluding the Date column).

print(hash(df1.values.tostring()))
-3571422965125408226

print(hash(df2.values.tostring()))
5039867957859242153

I am loading these files into a table (each full file is around 2 GB). Every time, we receive full files from the source, and sometimes there is no change in the data (excluding the last column, the Date). My idea is to reject such files: if I can generate a hash of the file and store it somewhere (in a table), next time I can compare the new file's hash against the stored one. I thought this was the right approach, but I got stuck on the hash generation.
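
To make the idea concrete, here is a minimal sketch of the compare-and-reject workflow I have in mind. It streams each file and hashes only the first 13 characters of every line, so a 2 GB file never has to fit in memory; file_digest and the stored-hash comparison are illustrative names, not an existing API. Only the file names and the 13-character width come from the samples above.

import hashlib

def file_digest(path, keep=13):
    # Stream the fixed-width file and hash only the first `keep`
    # bytes of each line, so the Date column is ignored.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for line in f:
            h.update(line[:keep])
    return h.hexdigest()

new_hash = file_digest("sample_hash1.txt")
stored_hash = file_digest("sample_hash2.txt")  # in practice, read from the table
if new_hash == stored_hash:
    print("no change in data, reject the file")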

I checked the post Most efficient property to hash for numpy array, but that is not what I am looking for.

  • The hash will be different for different objects, and the two dataframes are not the same object. Try df1.values.tostring() == df2.values.tostring(); it should be False. If you want the same hash, you need to remove the date from the values before taking the hash. Commented Apr 17, 2018 at 16:37
  • Yes, it is False. Is there any other way I can generate a unique code based on the data in the file (excluding some part of the data)? Commented Apr 17, 2018 at 16:41
  • You can try hash(df1[:-1].values.tostring()) to drop the last row. Commented Apr 17, 2018 at 16:54
  • Possible duplicate of Most efficient property to hash for numpy array Commented Apr 17, 2018 at 17:08
  • @TwistedSim The last column is not in the dataframe anyway; I am loading only the first 13 characters. Commented Apr 17, 2018 at 17:14

4 Answers


You can now use pd.util.hash_pandas_object:

import hashlib
import pandas as pd

hashlib.sha1(pd.util.hash_pandas_object(df).values).hexdigest()

For a dataframe with 50 million rows, this method took me 10 seconds versus over a minute for the to_json() method.
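
As a sanity check on the question's two files (a sketch; df1 and df2 read exactly as in the question), the digests come out equal because the date column was never loaded:

import hashlib
import pandas as pd

df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0, 13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0, 13)])

# hash_pandas_object returns one uint64 per row; hashing that array
# collapses it into a single digest that is stable across runs
digest1 = hashlib.sha1(pd.util.hash_pandas_object(df1).values).hexdigest()
digest2 = hashlib.sha1(pd.util.hash_pandas_object(df2).values).hexdigest()
print(digest1 == digest2)  # True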


Comments

How would you return a single hash for the entire dataframe, though?
This answer worked well for me. pd.util.hash_pandas_object(df) on its own returns a Series (one hash per row of the dataframe), but wrapping it as in the full answer, hashlib.sha1(pd.util.hash_pandas_object(df).values).hexdigest(), gave me a single hash for the whole dataframe.
But this works only on the "content" of the dataframe (and its index), not metadata such as the column names.

Use a string representation of the dataframe.

import hashlib

print(hashlib.sha256(df1.to_json().encode()).hexdigest())
print(hashlib.sha256(df2.to_json().encode()).hexdigest())

or

print(hashlib.sha256(df1.to_csv().encode()).hexdigest())
print(hashlib.sha256(df2.to_csv().encode()).hexdigest())
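
Applied to the question's dataframes (a sketch, assuming df1 and df2 were read as in the question), both digests should match because the date column was never loaded:

import hashlib
import pandas as pd

df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0, 13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0, 13)])

print(hashlib.sha256(df1.to_json().encode()).hexdigest()
      == hashlib.sha256(df2.to_json().encode()).hexdigest())  # True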

Comments

Awesome, this is working. But I think hash generation will be slow on big files?
Do you know why running the example again gives a different hash? Do you know how to get the same hash for the same dataframe across different runs of the code?
This method is slow.

The other answers here forget the column names (the column index) of a dataframe. pd.util.hash_pandas_object() creates a series of hash values, one for each row of the dataframe, including its index (the row names). But the column names don't matter, as you can see here:

>>> from pandas import *
>>> from pandas import util
>>> util.hash_pandas_object(DataFrame({'A': [1,2,3], 'B': [4,5,6]}))
0     580038878669277522
1    2529894495349307502
2    4389717532997776129
dtype: uint64
>>> util.hash_pandas_object(DataFrame({'Foo': [1,2,3], 'Bar': [4,5,6]}))
0     580038878669277522
1    2529894495349307502
2    4389717532997776129
dtype: uint64

My solution:

import hashlib
import pandas

df = pandas.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# one hash per row (the index is included)
row_hashes = pandas.util.hash_pandas_object(df)

# append the hashes of the column names
row_hashes = pandas.concat(
    [row_hashes, pandas.util.hash_pandas_object(df.columns)]
)

# hash the series of hashes into a single digest
digest = hashlib.sha1(row_hashes.values).hexdigest()

print(digest)
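
As a quick check under the same assumptions, renaming the columns now changes the digest, which a bare hash_pandas_object() call would not detect:

df2 = pandas.DataFrame({'Foo': [1, 2, 3], 'Bar': [4, 5, 6]})
row_hashes2 = pandas.concat(
    [pandas.util.hash_pandas_object(df2),
     pandas.util.hash_pandas_object(df2.columns)]
)
print(hashlib.sha1(row_hashes2.values).hexdigest() == digest)  # False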



In addition to the other answers, here is a simple and fast checksum:

import pandas

checksum = pandas.util.hash_pandas_object(df).sum()
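
For the question's use case, a minimal sketch (df1 and df2 read as in the question; the uint64 sum wraps modulo 2**64 and is a weaker fingerprint than a cryptographic digest, but it is cheap):

import pandas as pd

df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0, 13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0, 13)])

# equal when the non-date content (and index) match
print(pd.util.hash_pandas_object(df1).sum() == pd.util.hash_pandas_object(df2).sum())  # True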
