
My goal is to get a unique hash value for a DataFrame, which I obtain from a .csv file. The whole point is to get the same hash each time I call hash() on it.

My idea was to create the function

def _get_array_hash(arr):
    # Take the underlying numpy array of the DataFrame
    arr_hashable = arr.values
    # Mark it read-only so its buffer can be hashed
    arr_hashable.flags.writeable = False
    # Hash the raw memory buffer
    hash_ = hash(arr_hashable.data)
    return hash_

which takes the underlying numpy array, sets it to an immutable state, and hashes its buffer.

INLINE UPD.

As of 08.11.2016, this version of the function doesn't work anymore. Instead, you should use

hash(df.values.tobytes())

See the comments on Most efficient property to hash for numpy array.

END OF INLINE UPD.
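
For example, a minimal check (a sketch; note that Python's built-in hash() of bytes is salted per interpreter session unless PYTHONHASHSEED is fixed, so this is only guaranteed stable within a single session):

import pandas as pd

df = pd.DataFrame({'A': [0], 'B': [1]})

# Same bytes give the same hash within one interpreter session
assert hash(df.values.tobytes()) == hash(df.values.tobytes())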

It works for a regular pandas DataFrame:

In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})

In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165

In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165 

But when I apply it to a DataFrame obtained from a .csv file, the hash changes between calls:

In [15]: fpath = 'foo/bar.csv'

In [16]: data_from_file = pd.read_csv(fpath)

In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085

In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730

Can somebody explain to me how that's possible?

I can create a new DataFrame out of it, like

new_data = pd.DataFrame(data=data_from_file.values,
                        columns=data_from_file.columns,
                        index=data_from_file.index)

and it works again:

In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241

In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241

But my goal is to preserve the same hash value for a DataFrame across application launches, in order to retrieve some value from a cache.

5 Comments

  • This might help: github.com/TomAugspurger/engarde/issues/3 Commented Jul 27, 2015 at 15:45
  • I tried the approach of hashing the index, the columns, and the str(data_frame) value. It's slow, and it suffers from the same issues. Commented Jul 28, 2015 at 14:48
  • I'm interested in doing this as well - can I ask why you included "arr_hashable.flags.writeable = False"? Would you expect the hash() function to possibly modify the array otherwise? Commented Nov 8, 2016 at 2:05
  • @MaxPower it was a long time ago, so I don't remember exactly, but I think I was inspired by stackoverflow.com/questions/16589791/…. It worked back then. Now it doesn't, but you can use hash(a.data.tobytes()) instead, and you don't need flags.writeable = False anymore. See the comments on the referred answer. Commented Nov 8, 2016 at 11:21
  • Actually, you don't even need .data; just use hash(a.tobytes()), or hash(df.values.tobytes()) if calling from a DataFrame. I've updated the original question. Commented Nov 8, 2016 at 11:32

5 Answers


As of Pandas 0.20.1 (release notes), you can use pandas.util.hash_pandas_object (docs). It returns one hash value for each row of the dataframe (and works on Series, etc., too).

import pandas as pd
import numpy as np

np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)

print(df)
#      0    1   2    3
# 0   42  foo  42   42
# 1  foo  foo  42  bar
# 2   42   42  42   42

from pandas.util import hash_pandas_object
h = hash_pandas_object(df)

print(h)
# 0     5559921529589760079
# 1    16825627446701693880
# 2     7171023939017372657
# dtype: uint64

If you want an overall hash, consider the following:

import hashlib
int(hashlib.sha256(pd.util.hash_pandas_object(df, index=True).values).hexdigest(), 16)
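
Since the question asks for a value that survives application restarts, note that this digest, unlike Python's built-in hash(), does not depend on the per-process hash seed. A minimal sketch wrapping it in a helper (the name stable_df_hash is just illustrative):

import hashlib
import pandas as pd

def stable_df_hash(df):
    # Per-row uint64 hashes, index included
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    # Digest the raw bytes of the row-hash array into one stable hex string
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()

df = pd.DataFrame({'A': [0], 'B': [1]})
assert stable_df_hash(df) == stable_df_hash(df.copy())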

6 Comments

Not 100% sure, but hashlib.sha256(pd.util.hash_pandas_object(df, index=True).values).hexdigest() will likely be less collision-prone than .sum().
@mathtick indeed; otherwise, reordering rows gives the same hash.
The problem with hash_pandas_object is that it is not serializable, due to circular dependencies; see here: github.com/pandas-dev/pandas/issues/35097 and here: github.com/uqfoundation/dill/issues/374
If the column names are different, will they return different values?
@GrantCulp hash_pandas_object does not hash the column names: the same data with different columns will result in the same hash. To avoid this, you could hash df.reset_index().T instead of df, or add df.columns.values.tobytes() to the hash.

Joblib provides a hashing function optimized for objects containing numpy arrays (e.g. pandas dataframes).

import joblib
joblib.hash(df)
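
A quick sanity check (a sketch; joblib.hash returns an MD5-based hex digest by default, and joblib's own Memory cache relies on it being stable across runs for the same data):

import joblib
import pandas as pd

df1 = pd.DataFrame({'A': [0], 'B': [1]})
df2 = pd.DataFrame({'A': [0], 'B': [1]})

# Equal frames produce equal digests
assert joblib.hash(df1) == joblib.hash(df2)
print(joblib.hash(df1))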

2 Comments

This does not work for me! (df1 == df2).all() is True, but hashes are different.
@JulianWgs Do you have an example? For comparing two Series, if their name field is different, their hashes end up being different while their values are the same, but I couldn't replicate it for any DataFrame.

I had a similar problem: check whether a dataframe has changed. I solved it by hashing the msgpack serialization string, which seems stable across different reloads of the same data.

import pandas as pd
import hashlib
DATA_FILE = 'data.json'

data1 = pd.read_json(DATA_FILE)
data2 = pd.read_json(DATA_FILE)

assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()
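
On newer pandas versions, where to_msgpack has been removed (see the comments below), the same check can be done with a CSV serialization, as one commenter suggests. A sketch, reusing data1 and data2 from the snippet above:

# CSV text is a deterministic serialization of the frame's
# index, columns, and values
assert (hashlib.md5(data1.to_csv().encode('utf-8')).hexdigest()
        == hashlib.md5(data2.to_csv().encode('utf-8')).hexdigest())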

6 Comments

I found .to_msgpack() stable in Python 3.6 but not in 3.5 (not sure why; it might have something to do with dictionaries being ordered in Python 3.6+). Just keep it simple and use .to_csv().encode('utf-8') instead.
Keeping ostrokach's comment (above) in mind: this solution has the out-of-the-box advantage of dealing with unhashable dataframe elements (in contrast with pd.util.hash_pandas_object).
As of pandas 1.0, I think, df.to_msgpack() is deprecated. The recommended alternative to .to_msgpack() in the pandas documentation is pyarrow, but that opens up a whole new can of worms.
Also data1.values.tobytes() might return deterministic values for numeric dataframe contents, but if you have string values in your dataframe, you'll get a different bytestring for different python sessions. Might match within the same python session though
Unfortunately .values fails to capture the uniqueness of a DataFrame, with print(hashlib.md5(pd.DataFrame({'X': []}).values.tobytes()).hexdigest()) and print(hashlib.md5(pd.DataFrame({'Y': []}).values.tobytes()).hexdigest()) producing the same hash.

This function seems to work fine:

from hashlib import sha256
def hash_df(df):
    s = str(df.columns) + str(df.index) + str(df.values)
    return sha256(s.encode()).hexdigest()
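
A quick check that the column names are included in the digest, which hashing .values alone would miss (compare the empty-frame collision noted in the comments on the previous answer); a small sketch:

import pandas as pd

df1 = pd.DataFrame({'X': []})
df2 = pd.DataFrame({'Y': []})

# Different column names yield different digests,
# even though both frames have no values
assert hash_df(df1) != hash_df(df2)

One caveat: str(df.values) abbreviates very large arrays with an ellipsis, so for big frames a full serialization (e.g., to_csv) may be safer.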

2 Comments

This does not really answer the question. If you have a different question, you can ask it by clicking Ask Question. To get notified when this question gets new answers, you can follow this question. Once you have enough reputation, you can also add a bounty to draw more attention to this question. - From Review
Thank you for your advice. I made a separate question here.

Another option is to hash the input CSV directly: since CSV is plain text, it's often easier to hash than Python objects.

In bash, one can run md5 <file> to obtain a hash; or, if you must do it in Python:

import hashlib

def hash_csv_file(file_path, algorithm='sha256'):
    hash_func = getattr(hashlib, algorithm)()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            hash_func.update(chunk)
    return hash_func.hexdigest()

# Example usage
csv_hash = hash_csv_file('your_file.csv')
print(f"SHA-256 hash of CSV file: {csv_hash}")
