I have two fixed-width files like the ones below (the only difference is the Date value starting at position 14).
sample_hash1.txt
GOKULKRISHNA 04/17/2018
ABCDEFGHIJKL 04/17/2018
111111111111 04/17/2018
sample_hash2.txt
GOKULKRISHNA 04/16/2018
ABCDEFGHIJKL 04/16/2018
111111111111 04/16/2018
Using pandas read_fwf, I read each file into a DataFrame, excluding the Date value by loading only the first 13 characters. My DataFrames look like this:
import pandas as pd
df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0,13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0,13)])
df1
GOKULKRISHNA
0 ABCDEFGHIJKL
1 111111111111
...
df2
GOKULKRISHNA
0 ABCDEFGHIJKL
1 111111111111
...
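One side note worth checking before hashing: in the output above, the first record (GOKULKRISHNA) has been promoted to the column name, because read_fwf infers a header from the first line by default. If every line of the file is data, passing header=None keeps the first record in the frame. A minimal sketch (StringIO stands in for the file on disk):

```python
import pandas as pd
from io import StringIO

# Inline copy of sample_hash1.txt for illustration.
sample = ("GOKULKRISHNA 04/17/2018\n"
          "ABCDEFGHIJKL 04/17/2018\n"
          "111111111111 04/17/2018\n")

# header=None stops read_fwf from consuming the first record as the header.
df = pd.read_fwf(StringIO(sample), colspecs=[(0, 13)], header=None)
print(len(df))  # 3 -- all three records kept as data
```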
Now I am trying to generate a hash value for each DataFrame, but the hashes for df1 and df2 are different. I'm not sure what's wrong. Can someone throw some light on this, please? I have to identify whether there is any change in the data between the files (excluding the Date column).
print(hash(df1.values.tostring()))
-3571422965125408226
print(hash(df2.values.tostring()))
5039867957859242153
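A likely explanation for the mismatch: with string columns, df.values is an object-dtype array, and tostring() (deprecated in favor of tobytes()) serializes the raw object pointers rather than the string contents, so two arrays holding equal strings still produce different bytes. A minimal sketch of a content-based digest instead, with small stand-in frames for df1 and df2:

```python
import hashlib
import pandas as pd

# Stand-ins for the frames read from the two files; the data is equal.
df1 = pd.DataFrame({"GOKULKRISHNA": ["ABCDEFGHIJKL", "111111111111"]})
df2 = pd.DataFrame({"GOKULKRISHNA": ["ABCDEFGHIJKL", "111111111111"]})

def frame_digest(df):
    # to_csv gives a canonical text representation of the *contents*,
    # unlike .values.tostring(), which serializes object pointers.
    return hashlib.md5(df.to_csv(index=False).encode()).hexdigest()

print(frame_digest(df1) == frame_digest(df2))  # True -- equal data, equal hash
```

pandas also ships pd.util.hash_pandas_object for per-row hashing, which avoids building the full CSV string in memory.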
I am loading these files into a table (each full file is around 2 GB in size). We always receive full files from the source, and sometimes there is no change in the data (excluding the last column, the Date). My idea is to reject such files: if I can generate a hash of the file and store it somewhere (in a table), next time I can compare the new file's hash against the stored one. I thought this was the right approach, but I got stuck on the hash generation.
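For 2 GB files it may be simpler to skip pandas for the comparison and hash the raw file directly, keeping only the first 13 characters of each line so the Date never enters the digest. A sketch, streaming line by line in constant memory (the sample files are recreated inline for illustration):

```python
import hashlib

# Recreate the two sample files from the question.
with open("sample_hash1.txt", "w") as f:
    f.write("GOKULKRISHNA 04/17/2018\nABCDEFGHIJKL 04/17/2018\n111111111111 04/17/2018\n")
with open("sample_hash2.txt", "w") as f:
    f.write("GOKULKRISHNA 04/16/2018\nABCDEFGHIJKL 04/16/2018\n111111111111 04/16/2018\n")

def file_digest(path, width=13):
    # Hash only the first `width` bytes of each line, so the trailing
    # Date field is excluded.  Streams the file, so a 2 GB input is fine.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for line in f:
            h.update(line[:width])
    return h.hexdigest()

# Dates differ but the first 13 characters match, so the digests agree.
print(file_digest("sample_hash1.txt") == file_digest("sample_hash2.txt"))  # True
```

The hex digest can then be stored in your table and compared against the next delivery; on a match, reject the file without loading it.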
I checked this post Most efficient property to hash for numpy array but that is not what I am looking for.
df1.values.tostring() == df2.values.tostring() should be False. If you want the hashes to match, you need to remove the Date data from the values before taking the hash. Note that hash(df1[:-1].values.tostring()) drops the last row, not the last column; to drop the last column, use df1.iloc[:, :-1].
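To make the row-versus-column distinction in the answer concrete, here is a sketch assuming both fields were read (i.e. the frame still contains a Date column; column names are illustrative):

```python
import pandas as pd

# Hypothetical frame where the Date column was *not* excluded by colspecs.
df = pd.DataFrame({"name": ["ABCDEFGHIJKL", "111111111111"],
                   "date": ["04/17/2018", "04/17/2018"]})

last_row_dropped = df[:-1]              # slicing drops the last ROW
without_date = df.iloc[:, :-1]          # iloc drops the last COLUMN

print(list(without_date.columns))       # ['name']
print(len(last_row_dropped))            # 1
```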