
I'm looking to maintain a (Postgres) SQL database that collects data from third parties. Most of the data is static: although I get a full dump every day, I only want to store what is new. For example, every day I receive 100K records with roughly 300 columns, and about 95K of those rows are identical to the previous day's. To do this efficiently, I was thinking of inserting a hash of each record (coming from a Pandas dataframe or a Python dict) alongside the data, together with some metadata such as when it was loaded into the database. Before inserting, I could then hash the incoming records and check whether each one is already in the database, instead of having to compare all 300 columns.

My question: which hash function should I pick, given that I'm in Python and would prefer a fast, solid solution that requires little coding on my side while handling all kinds of data (ints, floats, strings, datetimes, etc.)?

For options two and three, if you recommend them, how can I implement this for arbitrary dicts and Pandas rows? I have had little success keeping this simple: for strings I needed to explicitly define the encoding, and the order of the fields in a record should not change the hash.
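For concreteness, the kind of thing I'm after for dicts is something like the sketch below. It is only a rough attempt under my own assumptions (UTF-8 encoding, SHA-256, and coercing non-JSON types such as datetimes to strings via default=str), not something I'm sure is the right approach:

import hashlib
import json

def record_hash(record: dict) -> str:
    # Sort keys so field order does not change the digest, and coerce
    # non-JSON types (datetimes, Decimals, numpy scalars) to strings.
    canonical = json.dumps(record, sort_keys=True, default=str)
    # Encode explicitly so the digest is stable across platforms.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()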

Edit: I just realized that it might be tricky to depend on Python for this; if I ever change programming languages, I might end up with different hashes. Tying the hash to the database seems like the more sensible choice.
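Something along these lines is what I have in mind for the database-side option. This is only a rough sketch: the table names (incoming, records), the column names, and the connection string are placeholders, and I'm assuming Postgres's md5() over a row cast to text is an acceptable digest:

import psycopg2

insert_new_rows = """
    -- hash each incoming row inside Postgres, so the digest does not
    -- depend on Python; insert only rows whose hash is not stored yet
    INSERT INTO records (col_a, col_b, row_hash, loaded_at)
    SELECT i.col_a, i.col_b, md5(CAST((i.*) AS text)), now()
    FROM incoming AS i
    WHERE md5(CAST((i.*) AS text)) NOT IN (SELECT row_hash FROM records);
"""

with psycopg2.connect("dbname=mydb user=me") as conn:
    with conn.cursor() as cur:
        cur.execute(insert_new_rows)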

1 Answer


Have you tried pandas.util.hash_pandas_object?

Not sure how efficient this is, but maybe you could use it like this:

df.apply(lambda row: pd.util.hash_pandas_object(row), axis=1)

This will at least get you a pandas Series of hashes for each row in the df.


3 Comments

It works for Pandas dataframes (although I needed to call it as pd.util.hash_pandas_object(data), which returns a hash per row), but not for dicts or lists of dicts, unfortunately. Moreover, the column ordering matters for the hash, while from a data perspective it should not. The more I think about it, the more I'm leaning towards a database-side solution.
To clarify, my thought was to use the df.apply example above to create a column in the database holding the hash of each row. Then, when new data arrives, load it into a temporary dataframe and use the same function to compare its row hashes against the existing ones (a sketch combining these pieces follows after the comments).
That should indeed work, although it does tie you to a dependency on Pandas.
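For completeness, a rough sketch of that comparison step, under the assumptions discussed in this thread (columns sorted before hashing so field order doesn't matter, the row index excluded, and existing_hashes read from the database's hash column):

import pandas as pd

def new_rows(df: pd.DataFrame, existing_hashes: set) -> pd.DataFrame:
    # Sort columns so the digest does not depend on field order.
    ordered = df[sorted(df.columns)]
    # One uint64 hash per row; index=False keeps the row index out of it.
    hashes = pd.util.hash_pandas_object(ordered, index=False)
    # Keep only rows whose hash is not already stored in the database.
    mask = ~hashes.isin(existing_hashes)
    return df.loc[mask].assign(row_hash=hashes[mask].values)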
