
I'm looking to maintain a (Postgres) SQL database that collects data from third parties. Most of the data is static: although I get a full dump every day, I only want to store what is new. For example, every day I receive 100K records with roughly 300 columns, and about 95K of those rows are identical to the previous day's. To do this efficiently, I was thinking of inserting a hash of each record (coming from a Pandas dataframe or a Python dict) alongside the data, together with some metadata such as when it was loaded into the database. Before inserting, I could then hash the incoming records and check whether each one is already in the database, instead of having to compare all 300 columns.

My question: which hash function should I pick, given that I'm in Python and would prefer a fast, solid solution that requires little coding on my side while handling all kinds of data (ints, floats, strings, datetimes, etc.)?

For options two and three, if you recommend them, how can I implement this for arbitrary dicts and Pandas rows? I have had little success keeping this simple: for strings I needed to explicitly define the encoding, and the order of the fields in a record should not change the hash.
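For concreteness, the kind of thing I'm after for dicts is something like the sketch below. It is only a rough attempt under my own assumptions (UTF-8 encoding, SHA-256, and coercing non-JSON types such as datetimes to strings via default=str), not something I'm sure is the right approach:

import hashlib
import json

def record_hash(record: dict) -> str:
    # Sort keys so field order does not change the digest, and coerce
    # non-JSON types (datetimes, Decimals, numpy scalars) to strings.
    canonical = json.dumps(record, sort_keys=True, default=str)
    # Encode explicitly so the digest is stable across platforms.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()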

Edit: I just realized that it might be tricky to depend on Python for this; if I ever change programming languages, I might end up with different hashes. Tying the hash to the database seems like the more sensible choice.
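Something along these lines is what I have in mind for the database-side option. This is only a rough sketch: the table names (incoming, records), the column names, and the connection string are placeholders, and I'm assuming Postgres's md5() over a row cast to text is an acceptable digest:

import psycopg2

insert_new_rows = """
    -- hash each incoming row inside Postgres, so the digest does not
    -- depend on Python; insert only rows whose hash is not stored yet
    INSERT INTO records (col_a, col_b, row_hash, loaded_at)
    SELECT i.col_a, i.col_b, md5(CAST((i.*) AS text)), now()
    FROM incoming AS i
    WHERE md5(CAST((i.*) AS text)) NOT IN (SELECT row_hash FROM records);
"""

with psycopg2.connect("dbname=mydb user=me") as conn:
    with conn.cursor() as cur:
        cur.execute(insert_new_rows)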

1 Answer


Have you tried pandas.util.hash_pandas_object?

Not sure how efficient this is, but maybe you could use it like this:

df.apply(lambda row: pd.util.hash_pandas_object(row), axis=1)

This will at least get you a pandas Series of hashes for each row in the df.


3 Comments

It works for Pandas dataframes (although I needed to call it as pd.util.hash_pandas_object(data), which returns a hash per row), but not for dicts or lists of dicts, unfortunately. Moreover, the column ordering matters for the hash, while from a data perspective it should not. The more I think about it, the more I'm leaning towards a database-side solution.
To clarify, my thought was to use the df.apply example above to create a column in the database holding the hash of each row. Then, when new data arrives, load it into a temporary dataframe and use the same function to compare its row hashes against the existing ones (a sketch combining these pieces follows after the comments).
That should indeed work, although it does tie you to a dependency on Pandas.
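For completeness, a rough sketch of that comparison step, under the assumptions discussed in this thread (columns sorted before hashing so field order doesn't matter, the row index excluded, and existing_hashes read from the database's hash column):

import pandas as pd

def new_rows(df: pd.DataFrame, existing_hashes: set) -> pd.DataFrame:
    # Sort columns so the digest does not depend on field order.
    ordered = df[sorted(df.columns)]
    # One uint64 hash per row; index=False keeps the row index out of it.
    hashes = pd.util.hash_pandas_object(ordered, index=False)
    # Keep only rows whose hash is not already stored in the database.
    mask = ~hashes.isin(existing_hashes)
    return df.loc[mask].assign(row_hash=hashes[mask].values)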
