Can't remove duplicates from DataFrame with drop_duplicates

Question

I am using pandas DataFrames in Python.

The DataFrames I will be referring to were created as follows:

from pandas import DataFrame, merge

search = DataFrame([[262, 'ny', '20'], [515, 'paris', '19'], [669, 'ldn', '10'], [669, 'ldn', 10], [669, 'ldn', 5]], columns=['subscriber_id', 'location', 'radius'])

title = DataFrame([[262, 'director'], [515, 'artist'], [669, 'scientist']], columns=['subscriber_id', 'title'])

The title and search DataFrames are then merged:

mergedTable = merge(title, search, on='subscriber_id', how='outer')

This produces the following DataFrame:

   subscriber_id      title location radius
0            262   director       ny     20
1            515     artist    paris     19
2            669  scientist      ldn     10
3            669  scientist      ldn     10
4            669  scientist      ldn      5

As we can see, the merge worked correctly, so a subscriber now has one row per search.

I do not want to remove subscribers that have multiple rows with different values, but I do want to remove rows that are exact duplicates.

This is the desired final result:

   subscriber_id      title location radius
0            262   director       ny     20
1            515     artist    paris     19
2            669  scientist      ldn     10
4            669  scientist      ldn      5

Row 3, a duplicate of row 2, is removed.

I have been researching this, and it seems that drop_duplicates() should work, i.e.:

mergedTable.drop_duplicates()

But this doesn't work: no rows are removed. Any tips or solutions?

I can't see why this was downvoted; I've reached my daily vote limit, so I can't upvote. The question, despite being the result of some inattentiveness, seems good to me: it has a valid test case, which sadly is not the most common thing on SO.
– alko, commented Dec 2, 2013 at 18:53

alko · Accepted Answer · 2013-12-02 18:40:44Z

3

Your radius column has dtype object because some of its values are strings, e.g. [669,'ldn','10']. Since '10' != 10, rows 2 and 3 are not considered duplicates. Converting the column to integer will do the trick:

>>> mergedTable.radius = mergedTable.radius.astype(int)
>>> mergedTable.drop_duplicates()
   subscriber_id      title location  radius
0            262   director       ny      20
1            515     artist    paris      19
2            669  scientist      ldn      10
4            669  scientist      ldn       5
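
To see how a mixed-type column like this can be spotted before calling drop_duplicates, here is a minimal sketch that rebuilds the DataFrames from the question. The names merged, type_counts and deduped are introduced here for illustration only, and using pd.to_numeric instead of astype(int) is a swapped-in alternative, not part of the original answer.

```python
import pandas as pd

# Rebuild the frames from the question; note the mix of '10' (str) and 10 (int).
search = pd.DataFrame(
    [[262, "ny", "20"], [515, "paris", "19"], [669, "ldn", "10"],
     [669, "ldn", 10], [669, "ldn", 5]],
    columns=["subscriber_id", "location", "radius"],
)
title = pd.DataFrame(
    [[262, "director"], [515, "artist"], [669, "scientist"]],
    columns=["subscriber_id", "title"],
)
merged = pd.merge(title, search, on="subscriber_id", how="outer")

# An object dtype on a column that should be numeric hints at mixed types.
print(merged.dtypes)

# Count the Python types actually stored in the column.
type_counts = merged["radius"].map(type).value_counts()
print(type_counts)

# Normalize the column to a numeric dtype, then drop exact duplicate rows.
merged["radius"] = pd.to_numeric(merged["radius"])
deduped = merged.drop_duplicates()
print(deduped)
```

Note that drop_duplicates returns a new DataFrame and leaves the original unchanged, so outside an interactive session the result needs to be assigned, as with deduped above.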
