I am having trouble using duplicated() or drop_duplicates() to find and remove all of the duplicates from a DataFrame.
My data looks something like this; the real dataset is about 52k rows long.
import pandas as pd

data = {'inventory number':['WL-SMART-INWALL',
'WL-NMDISH-22',
'WL-MPS546-MESH',
'WAS-WG-500P',
'UKS/99757/69975',
'UKS/99757/69975',
'UKS/99750/S26361F2293L10',
'UKS/99750/S26361F2293L10',
'UKS/99733/69973',
'UKS/99733/69973',
'UKS/99727/AHD6502TU3CBK',
'UKS/99727/AHD6502TU3CBK',
'UKS/99725/PMK01',
'UKS/99725/PMK01',
'UKS/99294/A3L791R15MS',
'UKS/99294/A3L791R15MS',
'UKS/98865/58018251',
'UKS/98865/58018251',
'UKS/98509/90Q653AN1N0N2UA0',
'UKS/98509/90Q653AN1N0N2UA0',
'UKS/97771/FIBLCSC2',
'UKS/97771/FIBLCSC2',
'UKS/97627/FIBLCLC1',
'UKS/97627/FIBLCLC1'],
'minimum price': ['36.85',
'55.45',
'361.29',
'265.0',
'22.46',
'22.46',
'15.0',
'15.0',
'26.71',
'26.71',
'104.0',
'104.0',
'32.3',
'32.3',
'22.51',
'22.51',
'13.0',
'13.0',
'9.59',
'9.59',
'15.0',
'15.0',
'15.0',
'15.0'],
'cost':['26.11',
'39.23',
'254.99',
'187.09',
'16.0',
'16.0',
'10.7',
'10.7',
'19.0',
'19.0',
'73.46',
'73.46',
'23.0',
'23.0',
'16.0',
'16.0',
'9.29',
'9.29',
'7.0',
'7.0',
'10.7',
'10.7',
'10.7',
'10.7']
}
df = pd.DataFrame(data=data)
I generated my DataFrame by appending last week's catalog to the bottom of this week's. I only want to act on the 'inventory number's that have changed; in other words, I want the delta. I figured I could append the two, make sure they were the same data type, re-index, and drop the duplicates, but when I write the result to CSV for QA there are still thousands of duplicates.
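To make the goal concrete, this is roughly what I'd expect the delta to look like on the small sample above (a minimal sketch):
# sketch of the result I'm after: rows whose full contents appear exactly once
expected_delta = df.drop_duplicates(keep=False)
print(expected_delta)
# on the sample this should leave only the four WL-/WAS- rows,
# since every UKS row appears twice with identical price and cost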
Here is my actual code:
# make sure the key column has the same representation in both frames
_import['inventory number'] = _import['inventory number'].str.encode('utf-8')
ts_data['inventory number'] = ts_data['inventory number'].str.encode('utf-8')
# stack the two catalogs on top of each other
overlap = overlap.append(ts_data, ignore_index=True)
# keep only rows that are not part of any duplicate group across the three columns
overlap_dedupe = overlap[overlap.duplicated(['inventory number','minimum price','cost'],keep=False)==False]
I have also tried:
overlap_dedupe = overlap.drop_duplicates(keep=False)
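(Not tried yet, but I was also going to restrict the check to an explicit subset of columns, in case columns I don't care about differ between the two weeks and block the full-row match; the column names below are the ones from the sample:)
# only compare the columns that define a catalog entry, not every column in the frame
overlap_dedupe = overlap.drop_duplicates(
    subset=['inventory number', 'minimum price', 'cost'],
    keep=False)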
So, I know I'm running into some kind of encoding issue, because now I'm getting no duplicates at all.
combined.head(50).duplicated()
returns:
42736 False
32567 False
43033 False
33212 False
46592 False
46023 False
32568 False
33520 False
32756 False
26741 False
46133 False
42737 False
42480 False
40227 False
40562 False
49623 False
27712 False
31848 False
49794 False
27296 False
38198 False
35674 False
27907 False
22210 False
40563 False
18025 False
49624 False
18138 False
19357 False
43698 False
24398 False
50566 False
22276 False
38382 False
20507 False
43550 False
18150 False
29968 False
19247 False
47706 False
19248 False
43955 False
20731 False
38199 False
44168 False
17580 False
15944 False
44891 False
28327 False
16027 False
dtype: bool
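Every row above comes back False. Before blaming the encoding outright, this is the check I'm planning to run next (a rough sketch; combined is the appended frame from above, and I'm assuming its key column is still the byte-encoded one from the code earlier):
# undo the byte-encoding and strip stray whitespace so the keys compare cleanly
check = combined.copy()
check['inventory number'] = (check['inventory number']
                             .str.decode('utf-8')
                             .str.strip())

# how many rows sit in a duplicate group once the keys are cleaned up?
row_dupes = check.duplicated(['inventory number', 'minimum price', 'cost'], keep=False)
print(row_dupes.sum())

# and how many inventory numbers repeat even though price/cost changed between weeks?
key_dupes = check.duplicated('inventory number', keep=False)
print((key_dupes & ~row_dupes).sum())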


Why isn't duplicated() flagging any of these rows as duplicates?