
I am having trouble using duplicated or drop_duplicates to find and remove all duplicates from a dataframe.

My data looks something like this; the real dataframe, however, is 52k rows long.

import pandas as pd

data = {'inventory number':['WL-SMART-INWALL',
                         'WL-NMDISH-22',
                         'WL-MPS546-MESH',
                         'WAS-WG-500P',
                         'UKS/99757/69975',
                         'UKS/99757/69975',
                         'UKS/99750/S26361F2293L10',
                         'UKS/99750/S26361F2293L10',
                         'UKS/99733/69973',
                         'UKS/99733/69973',
                         'UKS/99727/AHD6502TU3CBK',
                         'UKS/99727/AHD6502TU3CBK',
                         'UKS/99725/PMK01',
                         'UKS/99725/PMK01',
                         'UKS/99294/A3L791R15MS',
                         'UKS/99294/A3L791R15MS',
                         'UKS/98865/58018251',
                         'UKS/98865/58018251',
                         'UKS/98509/90Q653AN1N0N2UA0',
                         'UKS/98509/90Q653AN1N0N2UA0',
                         'UKS/97771/FIBLCSC2',
                         'UKS/97771/FIBLCSC2',
                         'UKS/97627/FIBLCLC1',
                         'UKS/97627/FIBLCLC1'],
        'minimum price': ['36.85',
                         '55.45',
                         '361.29',
                         '265.0',
                         '22.46',
                         '22.46',
                         '15.0',
                         '15.0',
                         '26.71',
                         '26.71',
                         '104.0',
                         '104.0',
                         '32.3',
                         '32.3',
                         '22.51',
                         '22.51',
                         '13.0',
                         '13.0',
                         '9.59',
                         '9.59',
                         '15.0',
                         '15.0',
                         '15.0',
                         '15.0'],
        'cost': ['26.11',
                         '39.23',
                         '254.99',
                         '187.09',
                         '16.0',
                         '16.0',
                         '10.7',
                         '10.7',
                         '19.0',
                         '19.0',
                         '73.46',
                         '73.46',
                         '23.0',
                         '23.0',
                         '16.0',
                         '16.0',
                         '9.29',
                         '9.29',
                         '7.0',
                         '7.0',
                         '10.7',
                         '10.7',
                         '10.7',
                         '10.7']
   }
df = pd.DataFrame(data=data)

I generated my dataframe by appending last week's catalog to the bottom of this week's. I only want to do something with the 'inventory number' rows that have changed; in other words, I want the delta. I figured I could append the two, make sure they were the same data type, re-index, and drop duplicates, but when I write to CSV to QA the result, there are still thousands of duplicates.
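Conceptually, the delta I'm after looks like this (a rough sketch; last_week and this_week are placeholders for my two catalog frames):

import pandas as pd

# Stack last week's catalog under this week's, then drop every row that
# occurs more than once. keep=False removes ALL copies of a duplicated
# row, so only the rows unique to one week -- the delta -- survive.
combined = pd.concat([last_week, this_week], ignore_index=True)
delta = combined.drop_duplicates(keep=False)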

Here is my actual code:

_import['inventory number'] = _import['inventory number'].str.encode('utf-8')
ts_data['inventory number'] = ts_data['inventory number'].str.encode('utf-8')
overlap = overlap.append(ts_data, ignore_index=True)
overlap_dedupe = overlap[overlap.duplicated(['inventory number','minimum price','cost'],keep=False)==False]

I have also tried:

overlap_dedupe = overlap.drop_duplicates(keep=False)

So I know I'm running into some kind of encoding issue, because now I'm getting no duplicates at all (see the diagnostic sketch after the output below).

combined.head(50).duplicated()

returns:

42736    False
32567    False
43033    False
33212    False
46592    False
46023    False
32568    False
33520    False
32756    False
26741    False
46133    False
42737    False
42480    False
40227    False
40562    False
49623    False
27712    False
31848    False
49794    False
27296    False
38198    False
35674    False
27907    False
22210    False
40563    False
18025    False
49624    False
18138    False
19357    False
43698    False
24398    False
50566    False
22276    False
38382    False
20507    False
43550    False
18150    False
29968    False
19247    False
47706    False
19248    False
43955    False
20731    False
38199    False
44168    False
17580    False
15944    False
44891    False
28327    False
16027    False
dtype: bool
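One quick way to check whether invisible characters are defeating the comparison is to look at the raw representation and lengths of values that should match (a diagnostic sketch, using the combined frame from above):

# repr() exposes non-printing characters such as trailing spaces or '\xa0'
print(combined['inventory number'].map(repr).head(10))

# values that print identically but have different lengths are a giveaway
print(combined['inventory number'].str.len().value_counts().head())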
  • Can you give an example of a small dataframe that fools duplicated()? Commented Dec 23, 2016 at 23:32
  • @DYZ I've edited the sample I provided to be the top 25 rows of what my code produces with drop_duplicates. However, I ran drop_duplicates on this sample and it worked perfectly. Commented Dec 24, 2016 at 0:03
  • @Yale Newman Then it's not helpful, I'm afraid. Before a problem can be solved, we should be able to reproduce it. Commented Dec 24, 2016 at 0:10
  • I've read that perhaps it's a memory issue? But 50k rows by 3 columns shouldn't be too much. Is there a more efficient way than drop_duplicates? Should I do this in batches? Commented Dec 24, 2016 at 0:12
  • 50k is nothing; it is not a memory issue. Why can't you just show an example of duplicates not removed? Commented Dec 24, 2016 at 0:18

1 Answer


drop_duplicates and duplicated are sister functions that work well together; see the pandas documentation for both.

Using your df:

df = pd.read_json(
    ''.join(
        ['[[26.11,"WL-SMART-INWALL",36.85],[39.23,"WL-NMDISH-22",55.45',
         '],[73.46,"UKS\\/99727\\/AHD6502TU3CBK",104.0],[73.46,"UKS\\/997',
         '27\\/AHD6502TU3CBK",104.0],[23.0,"UKS\\/99725\\/PMK01",32.3],[2',
         '3.0,"UKS\\/99725\\/PMK01",32.3],[16.0,"UKS\\/99294\\/A3L791R15MS',
         '",22.51],[16.0,"UKS\\/99294\\/A3L791R15MS",22.51],[9.29,"UKS\\/',
         '98865\\/58018251",13.0],[9.29,"UKS\\/98865\\/58018251",13.0],[7',
         '.0,"UKS\\/98509\\/90Q653AN1N0N2UA0",9.59],[7.0,"UKS\\/98509\\/90',
         'Q653AN1N0N2UA0",9.59],[254.99,"WL-MPS546-MESH",361.29],[10.7',
         ',"UKS\\/97771\\/FIBLCSC2",15.0],[10.7,"UKS\\/97771\\/FIBLCSC2",1',
         '5.0],[10.7,"UKS\\/97627\\/FIBLCLC1",15.0],[10.7,"UKS\\/97627\\/F',
         'IBLCLC1",15.0],[187.09,"WAS-WG-500P",265.0],[16.0,"UKS\\/9975',
         '7\\/69975",22.46],[16.0,"UKS\\/99757\\/69975",22.46],[10.7,"UKS',
         '\\/99750\\/S26361F2293L10",15.0],[10.7,"UKS\\/99750\\/S26361F229',
         '3L10",15.0],[19.0,"UKS\\/99733\\/69973",26.71],[19.0,"UKS\\/997',
         '33\\/69973",26.71]]']
    )
)

We can clearly see that there are duplicates with

df.duplicated()

0     False
1     False
2     False
3      True
4     False
5      True
6     False
7      True
8     False
9      True
10    False
11     True
12    False
13    False
14     True
15    False
16     True
17    False
18    False
19     True
20    False
21     True
22    False
23     True
dtype: bool

Because we did not pass the keep parameter, we got the default, which is keep='first'. This means that each True in this series marks a row that is a duplicate of an earlier row whose duplicated status is False.
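To see how keep changes which rows get flagged, here is a tiny made-up illustration:

s = pd.Series(['a', 'a', 'b'])

s.duplicated()              # keep='first' (default): only the 2nd 'a' is True
s.duplicated(keep='last')   # flags the 1st 'a' instead
s.duplicated(keep=False)    # flags both 'a' rows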

We can shortcut this and just ask whether any duplicates exist at all with

df.duplicated().any()

True

We can verify that drop_duplicates actually removes the duplicates by chaining our handy test after invoking it

df.drop_duplicates().duplicated().any()

False

Great! It worked.
The result can be preserved with

df = df.drop_duplicates()
df

         0                           1       2
0    26.11             WL-SMART-INWALL   36.85
1    39.23                WL-NMDISH-22   55.45
2    73.46     UKS/99727/AHD6502TU3CBK  104.00
4    23.00             UKS/99725/PMK01   32.30
6    16.00       UKS/99294/A3L791R15MS   22.51
8     9.29          UKS/98865/58018251   13.00
10    7.00  UKS/98509/90Q653AN1N0N2UA0    9.59
12  254.99              WL-MPS546-MESH  361.29
13   10.70          UKS/97771/FIBLCSC2   15.00
15   10.70          UKS/97627/FIBLCLC1   15.00
17  187.09                 WAS-WG-500P  265.00
18   16.00             UKS/99757/69975   22.46
20   10.70    UKS/99750/S26361F2293L10   15.00
22   19.00             UKS/99733/69973   26.71

Just to make sure

df.duplicated().any()

False

Conclusion
It worked fine for me. Hopefully this demonstration will help you clear up whatever is going wrong for you.


4 Comments

When I start with my own example, it all works perfectly fine, which makes me think it has something to do with my data. I'm reading the data in from a query and importing data from FTP to get my two starting dataframes, then I'm appending them together and trying to get rid of all duplicates in order to be left with the delta. It just isn't working. Do you have any suggestions for how I could debug this? Can you show me how to do this with groupby (see the sketch after these comments)?
If you run the test df.duplicated().any() and it comes back True, then you have duplicates. However, you saying it doesn't work implies that df.drop_duplicates().duplicated().any() returns True, which is what I don't believe. Unless! You're telling me that it "looks" like there are duplicates. Are you just eye-balling the data and telling me it looks like there are still duplicates? If so, that leads me to believe you think there are duplicates when in fact there are not, because some non-printed character is making them different.
No, I completely agree: there must be some kind of encoding difference or some non-printed character that's causing them to be different. In Excel you can do something like TRIM. Is there a way for me to clean my data and see if that works? Or what steps should I take to find non-printed characters? (A cleaning sketch follows after these comments.)
I'm definitely running into some kind of encoding issue, because I just switched my code to encode the strings to unicode before running the append and drop_duplicates, and now I don't get any duplicates where before I would get a few thousand.
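For reference, here is a minimal cleaning and groupby sketch along the lines discussed in these comments. It assumes the three columns still hold plain strings; NFKC normalization plus strip() is one reasonable combination for flushing out non-printed characters, not the only one.

import unicodedata

def clean(series):
    # Normalize the unicode representation (e.g. composed vs decomposed
    # characters, non-breaking spaces) and trim surrounding whitespace --
    # roughly the pandas analogue of Excel's TRIM.
    return series.map(lambda x: unicodedata.normalize('NFKC', x).strip())

for col in ['inventory number', 'minimum price', 'cost']:
    overlap[col] = clean(overlap[col])

overlap_dedupe = overlap.drop_duplicates(keep=False)

# An equivalent groupby formulation of keep=False, as asked above:
# keep only rows whose (inventory number, minimum price, cost)
# combination occurs exactly once in the appended frame.
delta = overlap.groupby(['inventory number', 'minimum price', 'cost']).filter(lambda g: len(g) == 1)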
