
I have asked a similar question in R about creating a hash value for each row of data. I know that I can use something like hashlib.md5(b'Hello World').hexdigest() to hash a string, but how do I hash a row of a DataFrame?

update 01

I have drafted my code as below:

for index, row in course_staff_df.iterrows():
    temp_df.loc[index, 'hash'] = hashlib.md5(str(row[['cola', 'colb']].values).encode('utf-8')).hexdigest()

It doesn't seem very Pythonic to me; is there a better solution?

5 Answers


Or simply:

df.apply(lambda x: hash(tuple(x)), axis = 1)

As an example:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(3,5))
print(df)
df.apply(lambda x: hash(tuple(x)), axis = 1)

     0         1         2         3         4
0  0.728046  0.542013  0.672425  0.374253  0.718211
1  0.875581  0.512513  0.826147  0.748880  0.835621
2  0.451142  0.178005  0.002384  0.060760  0.098650

0    5024405147753823273
1    -798936807792898628
2   -8745618293760919309
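
Since the question asks about selected columns, the same pattern works on a column subset; here the first two columns of the example frame above (for the question's frame it would be df[['cola', 'colb']]):

# hash only a subset of columns; labels 0 and 1 exist in the example frame
df[[0, 1]].apply(lambda x: hash(tuple(x)), axis=1)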

2 Comments

For all future readers: be warned that using int64 values for later hash comparison in pandas might not be the best idea, as column dtypes in pandas get shuffled around a lot. Instead of comparing int64 to int64 you might accidentally compare against object or even float, which results in false negatives. It might be better to use df["rowhash"] = df.apply(lambda x: hashlib.md5(str(tuple(x)).encode('utf-8')).hexdigest(), axis=1). Note: tuple speeds up the function.
Moreover, the built-in hash() function is not deterministic across runs: it depends on PYTHONHASHSEED, which takes a random value for each process when it is not set.
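
Putting the two comments above together, a minimal sketch of a process-stable, string-valued row hash:

import hashlib
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(3, 5))
# md5 of the stringified row tuple: hexdigest() is a plain string, so dtype
# coercion can't silently break comparisons, and it is stable across processes
df["rowhash"] = df.apply(lambda x: hashlib.md5(str(tuple(x)).encode("utf-8")).hexdigest(), axis=1)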

This is now available in pandas.util.hash_pandas_object:

pandas.util.hash_pandas_object(df)
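
To hash only selected columns and keep the index out of the hash, pass a column subset and index=False (cola and colb stand in for the question's column names); the result is a uint64 Series aligned with df, so it can be assigned straight to a new column:

pandas.util.hash_pandas_object(df[['cola', 'colb']], index=False)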

3 Comments

This doesn't answer the question ("Create hash value for each row of data with selected columns in DataFrame in Python Pandas"): a row is not semantically a pandas object in the first place. The docs say the function you gave will "Return a data hash of the Index/Series/DataFrame", and none of these are "rows".
Yeah, the documentation is not great.
Please do note that, while the documentation is not great, for DataFrames it hashes each column element by element and then combines those hashes element-wise, resulting in one hash per row (see github.com/pandas-dev/pandas/blob/…). It is done this way for performance reasons, since access patterns by column are typically faster.

Create hash value for each row of data with selected columns in dataframe in python pandas

These solutions work for the life of the Python process.

If order matters, one method would be to coerce the row (a Series object) to a tuple:

>>> hash(tuple(df.iloc[1]))
-4901655572611365671

This demonstrates that order matters for tuple hashing:

>>> hash((1,2,3))
2528502973977326415
>>> hash((3,2,1))
5050909583595644743

To do this for every row and append the result as a new column:

>>> df = df.drop('hash', axis=1) # lose the old hash
>>> df['hash'] = pd.Series((hash(tuple(row)) for _, row in df.iterrows()))
>>> df
           y  x0                 hash
0  11.624345  10 -7519341396217622291
1  10.388244  11 -6224388738743104050
2  11.471828  12 -4278475798199948732
3  11.927031  13 -1086800262788974363
4  14.865408  14  4065918964297112768
5  12.698461  15  8870116070367064431
6  17.744812  16 -2001582243795030948
7  16.238793  17  4683560048732242225
8  18.319039  18 -4288960467160144170
9  18.750630  19  7149535252257157079

[10 rows x 3 columns]

If order does not matter, use the hash of frozensets instead of tuples:

>>> hash(frozenset((3,2,1)))
-272375401224217160
>>> hash(frozenset((1,2,3)))
-272375401224217160
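
Applied per row, in the same style as above (keep in mind that a frozenset drops duplicate values and ignores which column a value came from, so distinct rows can collide):

>>> df['hash'] = pd.Series((hash(frozenset(row)) for _, row in df.iterrows()))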

Avoid summing the hashes of all of the elements in the row, as this could be cryptographically insecure and lead to hashes that fall outside the range of the original.

(You could use modulo to constrain the range, but this amounts to rolling your own hash function, and the best practice is not to.)

You can also make permanent, cryptographic-quality hashes, for example with sha256, using the hashlib module.
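
For example, a sketch of a sha256 row digest over the two columns of the frame above (the values are stringified with repr first, which assumes that representation is an acceptable canonical form):

>>> import hashlib
>>> df['hash'] = [hashlib.sha256(repr(tuple(row)).encode('utf-8')).hexdigest()
...               for _, row in df[['y', 'x0']].iterrows()]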

There is some discussion of the API for cryptographic hash functions in PEP 452.

Thanks to users Jamie Marshal and Discrete Lizard for their comments.

Comments


I've come up with this adaptation of the code provided in the question:

new_df2 = df.copy()
key_combination = ['col1', 'col2', 'col3', 'col4']
# sha1 over the '-'-joined key columns (str() guards against non-string values)
new_df2.index = [hashlib.sha1('-'.join(str(v) for v in row).encode('utf-8')).hexdigest()
                 for row in new_df2[key_combination].values]

Comments

df.set_index(pd.util.hash_pandas_object(df), drop=False, inplace=True)
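
In other words, this computes one uint64 hash per row with pd.util.hash_pandas_object and makes those hashes the DataFrame's index via set_index(..., inplace=True), so each row becomes addressable by its hash, e.g. df.loc[some_hash]. To hash only selected columns, a subset such as pd.util.hash_pandas_object(df[['cola', 'colb']], index=False) should work the same way.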

1 Comment

Please don't post only code as an answer; also provide an explanation of what your code does and how it solves the problem in the question. Answers with an explanation are usually more helpful, of better quality, and more likely to attract upvotes.
