
I have two fixed-width files like the ones below (the only difference is the Date value starting at position 14).

sample_hash1.txt

GOKULKRISHNA 04/17/2018
ABCDEFGHIJKL 04/17/2018
111111111111 04/17/2018

sample_hash2.txt

GOKULKRISHNA 04/16/2018
ABCDEFGHIJKL 04/16/2018
111111111111 04/16/2018

Using pandas read_fwf, I am reading each file and creating a DataFrame that excludes the Date value by loading only the first 13 characters. My dataframes look like this:

import pandas as pd
df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0,13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0,13)])

df1

   GOKULKRISHNA
0  ABCDEFGHIJKL
1  111111111111
...

df2

   GOKULKRISHNA
0  ABCDEFGHIJKL
1  111111111111
...

Now I am trying to generate a hash value for each DataFrame, but the hashes for df1 and df2 are different. I'm not sure what's wrong here. Can someone shed some light on this, please? I need to identify whether the data has changed between the files (excluding the Date column).

print(hash(df1.values.tostring()))
-3571422965125408226

print(hash(df2.values.tostring()))
5039867957859242153

I am loading these files into a table (each full file is around 2 GB). Every time, we receive full files from the source, and sometimes there is no change in the data (excluding the last column, the Date). My idea is to reject such files: if I can generate a hash of the file and store it somewhere (in a table), next time I can compare the new file's hash against the stored one. I thought this was the right approach, but I got stuck on the hash generation.
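
To make the idea concrete, here is a minimal sketch of the compare-and-reject workflow I have in mind. It streams each file and hashes only the first 13 characters of every line, so a 2 GB file never has to fit in memory; file_digest and the stored-hash comparison are illustrative names, not an existing API. Only the file names and the 13-character width come from the samples above.

import hashlib

def file_digest(path, keep=13):
    # Stream the fixed-width file and hash only the first `keep`
    # bytes of each line, so the Date column is ignored.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for line in f:
            h.update(line[:keep])
    return h.hexdigest()

new_hash = file_digest("sample_hash1.txt")
stored_hash = file_digest("sample_hash2.txt")  # in practice, read from the table
if new_hash == stored_hash:
    print("no change in data, reject the file")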

I checked the post Most efficient property to hash for numpy array, but that is not what I am looking for.

  • The hash will be different for different objects, and the two dataframes are not the same object. Try df1.values.tostring() == df2.values.tostring(); it should be False. If you want the same hash, you need to remove the date from the values before taking the hash. Commented Apr 17, 2018 at 16:37
  • Yes, it is False. Is there any other way I can generate a unique code based on the data in the file (excluding some part of the data)? Commented Apr 17, 2018 at 16:41
  • You can try hash(df1[:-1].values.tostring()) to drop the last row. Commented Apr 17, 2018 at 16:54
  • Possible duplicate of Most efficient property to hash for numpy array Commented Apr 17, 2018 at 17:08
  • @TwistedSim The last column is not in the dataframe anyway; I am loading only the first 13 characters. Commented Apr 17, 2018 at 17:14

4 Answers


You can now use pd.util.hash_pandas_object:

import hashlib
import pandas as pd

hashlib.sha1(pd.util.hash_pandas_object(df).values).hexdigest()

For a dataframe with 50 million rows, this method took me 10 seconds versus over a minute for the to_json() method.
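
As a sanity check on the question's two files (a sketch; df1 and df2 read exactly as in the question), the digests come out equal because the date column was never loaded:

import hashlib
import pandas as pd

df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0, 13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0, 13)])

# hash_pandas_object returns one uint64 per row; hashing that array
# collapses it into a single digest that is stable across runs
digest1 = hashlib.sha1(pd.util.hash_pandas_object(df1).values).hexdigest()
digest2 = hashlib.sha1(pd.util.hash_pandas_object(df2).values).hexdigest()
print(digest1 == digest2)  # True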


Comments

How would you return a single hash for the entire dataframe, though?
This answer worked well for me. pd.util.hash_pandas_object(df) on its own returns a Series (one hash per row of the dataframe), but wrapping it as in the full answer, hashlib.sha1(pd.util.hash_pandas_object(df).values).hexdigest(), gave me a single hash for the whole dataframe.
But this works only on the "content" of the dataframe (and its index), not metadata such as the column names.

Use a string representation of the dataframe.

import hashlib

print(hashlib.sha256(df1.to_json().encode()).hexdigest())
print(hashlib.sha256(df2.to_json().encode()).hexdigest())

or

print(hashlib.sha256(df1.to_csv().encode()).hexdigest())
print(hashlib.sha256(df2.to_csv().encode()).hexdigest())
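
Applied to the question's dataframes (a sketch, assuming df1 and df2 were read as in the question), both digests should match because the date column was never loaded:

import hashlib
import pandas as pd

df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0, 13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0, 13)])

print(hashlib.sha256(df1.to_json().encode()).hexdigest()
      == hashlib.sha256(df2.to_json().encode()).hexdigest())  # True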

Comments

Awesome, this is working. But I think hash generation will be slow on big files?
Do you know why running the example again gives a different hash? Do you know how to get the same hash for the same dataframe across different runs of the code?
This method is slow.

The other answers here forget the column names (the column index) of a dataframe. pd.util.hash_pandas_object() creates a series of hash values, one for each row of the dataframe, including its index (the row names). But the column names don't matter, as you can see here:

>>> from pandas import *
>>> from pandas import util
>>> util.hash_pandas_object(DataFrame({'A': [1,2,3], 'B': [4,5,6]}))
0     580038878669277522
1    2529894495349307502
2    4389717532997776129
dtype: uint64
>>> util.hash_pandas_object(DataFrame({'Foo': [1,2,3], 'Bar': [4,5,6]}))
0     580038878669277522
1    2529894495349307502
2    4389717532997776129
dtype: uint64

My solution:

import hashlib
import pandas

df = pandas.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# one hash per row (the index is included)
row_hashes = pandas.util.hash_pandas_object(df)

# append the hashes of the column names
row_hashes = pandas.concat(
    [row_hashes, pandas.util.hash_pandas_object(df.columns)]
)

# hash the series of hashes into a single digest
digest = hashlib.sha1(row_hashes.values).hexdigest()

print(digest)
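
As a quick check under the same assumptions, renaming the columns now changes the digest, which a bare hash_pandas_object() call would not detect:

df2 = pandas.DataFrame({'Foo': [1, 2, 3], 'Bar': [4, 5, 6]})
row_hashes2 = pandas.concat(
    [pandas.util.hash_pandas_object(df2),
     pandas.util.hash_pandas_object(df2.columns)]
)
print(hashlib.sha1(row_hashes2.values).hexdigest() == digest)  # False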



In addition to the other answers, here is a simple and fast checksum:

import pandas

checksum = pandas.util.hash_pandas_object(df).sum()
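
For the question's use case, a minimal sketch (df1 and df2 read as in the question; the uint64 sum wraps modulo 2**64 and is a weaker fingerprint than a cryptographic digest, but it is cheap):

import pandas as pd

df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0, 13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0, 13)])

# equal when the non-date content (and index) match
print(pd.util.hash_pandas_object(df1).sum() == pd.util.hash_pandas_object(df2).sum())  # True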
