
My goal is to get a unique hash value for a DataFrame, which I obtain from a .csv file. The whole point is to get the same hash each time I call hash() on it.

My idea was to create the function

def _get_array_hash(arr):
    # Take the underlying numpy array of the DataFrame
    arr_hashable = arr.values
    # Mark it read-only so its buffer can be hashed
    arr_hashable.flags.writeable = False
    # Hash the raw memory buffer
    hash_ = hash(arr_hashable.data)
    return hash_

which takes the underlying numpy array, sets it to an immutable state, and hashes its buffer.

INLINE UPD.

As of 08.11.2016, this version of the function doesn't work anymore. Instead, you should use

hash(df.values.tobytes())

See the comments on Most efficient property to hash for numpy array.

END OF INLINE UPD.
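
For example, a minimal check (a sketch; note that Python's built-in hash() of bytes is salted per interpreter session unless PYTHONHASHSEED is fixed, so this is only guaranteed stable within a single session):

import pandas as pd

df = pd.DataFrame({'A': [0], 'B': [1]})

# Same bytes give the same hash within one interpreter session
assert hash(df.values.tobytes()) == hash(df.values.tobytes())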

It works for a regular pandas DataFrame:

In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})

In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165

In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165 

But when I apply it to a DataFrame obtained from a .csv file, the hash changes between calls:

In [15]: fpath = 'foo/bar.csv'

In [16]: data_from_file = pd.read_csv(fpath)

In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085

In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730

Can somebody explain to me how that's possible?

I can create a new DataFrame out of it, like

new_data = pd.DataFrame(data=data_from_file.values,
                        columns=data_from_file.columns,
                        index=data_from_file.index)

and it works again:

In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241

In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241

But my goal is to preserve the same hash value for a DataFrame across application launches, in order to retrieve some value from a cache.

5 Comments

  • This might help: github.com/TomAugspurger/engarde/issues/3 Commented Jul 27, 2015 at 15:45
  • I tried the approach of hashing the index, the columns, and the str(data_frame) value. It's slow, and it suffers from the same issues. Commented Jul 28, 2015 at 14:48
  • I'm interested in doing this as well - can I ask why you included "arr_hashable.flags.writeable = False"? Would you expect the hash() function to possibly modify the array otherwise? Commented Nov 8, 2016 at 2:05
  • @MaxPower it was a long time ago, so I don't remember exactly, but I think I was inspired by stackoverflow.com/questions/16589791/…. It worked back then. Now it doesn't, but you can use hash(a.data.tobytes()) instead, and you don't need flags.writeable = False anymore. See the comments on the referred answer. Commented Nov 8, 2016 at 11:21
  • Actually, you don't even need .data; just use hash(a.tobytes()), or hash(df.values.tobytes()) if calling from a DataFrame. I've updated the original question. Commented Nov 8, 2016 at 11:32

5 Answers


As of Pandas 0.20.1 (release notes), you can use pandas.util.hash_pandas_object (docs). It returns one hash value for each row of the dataframe (and works on Series, etc., too).

import pandas as pd
import numpy as np

np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)

print(df)
#      0    1   2    3
# 0   42  foo  42   42
# 1  foo  foo  42  bar
# 2   42   42  42   42

from pandas.util import hash_pandas_object
h = hash_pandas_object(df)

print(h)
# 0     5559921529589760079
# 1    16825627446701693880
# 2     7171023939017372657
# dtype: uint64

If you want an overall hash, consider the following:

import hashlib
int(hashlib.sha256(pd.util.hash_pandas_object(df, index=True).values).hexdigest(), 16)
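
Since the question asks for a value that survives application restarts, note that this digest, unlike Python's built-in hash(), does not depend on the per-process hash seed. A minimal sketch wrapping it in a helper (the name stable_df_hash is just illustrative):

import hashlib
import pandas as pd

def stable_df_hash(df):
    # Per-row uint64 hashes, index included
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    # Digest the raw bytes of the row-hash array into one stable hex string
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()

df = pd.DataFrame({'A': [0], 'B': [1]})
assert stable_df_hash(df) == stable_df_hash(df.copy())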

6 Comments

Not 100% sure, but hashlib.sha256(pd.util.hash_pandas_object(df, index=True).values).hexdigest() will likely be less collision-prone than .sum().
@mathtick indeed; otherwise, reordering rows gives the same hash.
The problem with hash_pandas_object is that it is not serializable, due to circular dependencies; see here: github.com/pandas-dev/pandas/issues/35097 and here: github.com/uqfoundation/dill/issues/374
If the column names are different, will they return different values?
@GrantCulp hash_pandas_object does not hash the column names: the same data with different columns will result in the same hash. To avoid this, you could hash df.reset_index().T instead of df, or add df.columns.values.tobytes() to the hash.

Joblib provides a hashing function optimized for objects containing numpy arrays (e.g. pandas dataframes).

import joblib
joblib.hash(df)
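
A quick sanity check (a sketch; joblib.hash returns an MD5-based hex digest by default, and joblib's own Memory cache relies on it being stable across runs for the same data):

import joblib
import pandas as pd

df1 = pd.DataFrame({'A': [0], 'B': [1]})
df2 = pd.DataFrame({'A': [0], 'B': [1]})

# Equal frames produce equal digests
assert joblib.hash(df1) == joblib.hash(df2)
print(joblib.hash(df1))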

2 Comments

This does not work for me! (df1 == df2).all() is True, but hashes are different.
@JulianWgs Do you have an example? For comparing two Series, if their name field is different, their hashes end up being different while their values are the same, but I couldn't replicate it for any DataFrame.

I had a similar problem: check whether a dataframe has changed. I solved it by hashing the msgpack serialization string, which seems stable across different reloads of the same data.

import pandas as pd
import hashlib
DATA_FILE = 'data.json'

data1 = pd.read_json(DATA_FILE)
data2 = pd.read_json(DATA_FILE)

assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()
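
On newer pandas versions, where to_msgpack has been removed (see the comments below), the same check can be done with a CSV serialization, as one commenter suggests. A sketch, reusing data1 and data2 from the snippet above:

# CSV text is a deterministic serialization of the frame's
# index, columns, and values
assert (hashlib.md5(data1.to_csv().encode('utf-8')).hexdigest()
        == hashlib.md5(data2.to_csv().encode('utf-8')).hexdigest())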

6 Comments

I found .to_msgpack() stable in Python 3.6 but not in 3.5 (not sure why; it might have something to do with dictionaries being ordered in Python 3.6+). Just keep it simple and use .to_csv().encode('utf-8') instead.
Keeping ostrokach's comment (above) in mind: this solution has the out-of-the-box advantage of dealing with unhashable dataframe elements (in contrast with pd.util.hash_pandas_object).
As of pandas 1.0, I think, df.to_msgpack() is deprecated. The recommended alternative to .to_msgpack() in the pandas documentation is pyarrow, but that opens up a whole new can of worms.
Also data1.values.tobytes() might return deterministic values for numeric dataframe contents, but if you have string values in your dataframe, you'll get a different bytestring for different python sessions. Might match within the same python session though
Unfortunately .values fails to capture the uniqueness of a DataFrame, with print(hashlib.md5(pd.DataFrame({'X': []}).values.tobytes()).hexdigest()) and print(hashlib.md5(pd.DataFrame({'Y': []}).values.tobytes()).hexdigest()) producing the same hash.

This function seems to work fine:

from hashlib import sha256
def hash_df(df):
    s = str(df.columns) + str(df.index) + str(df.values)
    return sha256(s.encode()).hexdigest()
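
A quick check that the column names are included in the digest, which hashing .values alone would miss (compare the empty-frame collision noted in the comments on the previous answer); a small sketch:

import pandas as pd

df1 = pd.DataFrame({'X': []})
df2 = pd.DataFrame({'Y': []})

# Different column names yield different digests,
# even though both frames have no values
assert hash_df(df1) != hash_df(df2)

One caveat: str(df.values) abbreviates very large arrays with an ellipsis, so for big frames a full serialization (e.g., to_csv) may be safer.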

2 Comments

This does not really answer the question. If you have a different question, you can ask it by clicking Ask Question. To get notified when this question gets new answers, you can follow this question. Once you have enough reputation, you can also add a bounty to draw more attention to this question. - From Review
Thank you for your advice. I made a separate question here.

Another option is to hash the input CSV directly: since CSV is plain text, it's often easier to hash than Python objects.

In bash, one can run md5 <file> to obtain a hash; or, if you must do it in Python:

import hashlib

def hash_csv_file(file_path, algorithm='sha256'):
    hash_func = getattr(hashlib, algorithm)()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            hash_func.update(chunk)
    return hash_func.hexdigest()

# Example usage
csv_hash = hash_csv_file('your_file.csv')
print(f"SHA-256 hash of CSV file: {csv_hash}")
