I am using pandas DataFrames in Python.
The DataFrames I refer to below were created as follows:
from pandas import DataFrame, merge
search = DataFrame([[262, 'ny', '20'], [515, 'paris', '19'], [669, 'ldn', '10'], [669, 'ldn', 10], [669, 'ldn', 5]], columns=['subscriber_id', 'location', 'radius'])
title = DataFrame([[262, 'director'], [515, 'artist'], [669, 'scientist']], columns=['subscriber_id', 'title'])
The title and search DataFrames are then merged:
mergedTable = merge(title, search, on='subscriber_id', how='outer')
This produces the following DataFrame:
   subscriber_id      title location radius
0            262   director       ny     20
1            515     artist    paris     19
2            669  scientist      ldn     10
3            669  scientist      ldn     10
4            669  scientist      ldn      5
As we can see, the merge worked correctly, so each subscriber now has one row per search.
I do not want to remove subscribers that have multiple rows with different values, but I do want to remove rows that are exact duplicates.
This is the desired final result:
   subscriber_id      title location radius
0            262   director       ny     20
1            515     artist    paris     19
2            669  scientist      ldn     10
4            669  scientist      ldn      5
Row 3, a duplicate of row 2, is removed.
From what I have read, drop_duplicates() should do exactly this, i.e.
mergedTable.drop_duplicates()
But this doesn't work: no rows are removed. Any tips or solutions?
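One thing that may be worth checking is the data itself: in the search DataFrame above, the radius values are a mix of strings ('20', '19', '10') and integers (10, 5), so rows 2 and 3 look identical when printed but may not compare as equal. The following is a minimal, self-contained sketch of that check, not a definitive fix. It assumes the usual `import pandas as pd` alias, and the names `merged`, `radius_types`, `deduped_raw`, `normalized` and `deduped` are introduced here purely for illustration. Converting radius to int is also an assumption about what the column is meant to hold.

```python
import pandas as pd

# Same data as in the question, including the mixed str/int radius values.
search = pd.DataFrame(
    [[262, 'ny', '20'], [515, 'paris', '19'], [669, 'ldn', '10'],
     [669, 'ldn', 10], [669, 'ldn', 5]],
    columns=['subscriber_id', 'location', 'radius'],
)
title = pd.DataFrame(
    [[262, 'director'], [515, 'artist'], [669, 'scientist']],
    columns=['subscriber_id', 'title'],
)
merged = pd.merge(title, search, on='subscriber_id', how='outer')

# Inspect the Python type of every radius value; a mix of str and int
# means '10' and 10 are treated as different values.
radius_types = merged['radius'].map(type).tolist()
print(radius_types)

# drop_duplicates() on the raw data: compare the row count before and after.
deduped_raw = merged.drop_duplicates()
print(len(merged), len(deduped_raw))

# Assumption: radius is meant to be an integer, so normalize it first.
normalized = merged.assign(radius=merged['radius'].astype(int))
deduped = normalized.drop_duplicates()
print(deduped)
```

Note that drop_duplicates() returns a new DataFrame rather than modifying the original, so the result has to be assigned to a name (or the call made with inplace=True) for the change to persist.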