0

Given the following DataFrame:

import pandas as pd

df = pd.DataFrame({'month': [2, 2, 1, 1, 2, 10],
                   'year': [2017, 2017, 2020, 2020, 2018, 2019],
                   'sale': [60, 45, 90, 20, 28, 36],
                   'title': ['Ones', 'Twoes', 'Three', 'Four', 'Five', 'Six']})

I am trying to get the duplicates in the month column.

df[df.duplicated(subset=['month'])]

By default, keep="first".

But this returns two rows for month 2:

   month  year  sale  title
1      2  2017    45  Twoes
3      1  2020    20   Four
4      2  2018    28   Five

I'm confused by the output. Am I missing something here?

1
  • The output is the set of rows flagged as duplicates in your DataFrame, not the rows left after dropping the duplicates. Commented Jul 6, 2021 at 7:13

2 Answers

2

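The output contains every duplicated row except the first occurrence in each group. Month 2 appears three times, so two of its rows are flagged; month 1 appears twice, so one of its rows is flagged.

If you need only the first row of each duplicated group, invert that mask and combine it with a second mask that uses keep=False, which flags every row belonging to a duplicated group: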

df1 = df[~df.duplicated(subset=['month']) & df.duplicated(subset=['month'], keep=False)]
print(df1)
   month  year  sale  title
0      2  2017    60   Ones
2      1  2020    90  Three
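
To see why the two masks combine this way, it can help to put the boolean result of each keep setting side by side. This is only an illustrative sketch; the `masks` frame and its column names are made up here for display and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'month': [2, 2, 1, 1, 2, 10],
                   'year': [2017, 2017, 2020, 2020, 2018, 2019],
                   'sale': [60, 45, 90, 20, 28, 36],
                   'title': ['Ones', 'Twoes', 'Three', 'Four', 'Five', 'Six']})

# Hypothetical helper frame: one boolean column per keep setting.
masks = pd.DataFrame({
    'month': df['month'],
    # True for every repeat after the first occurrence
    'keep_first': df.duplicated(subset=['month'], keep='first'),
    # True for every repeat before the last occurrence
    'keep_last': df.duplicated(subset=['month'], keep='last'),
    # True for every row whose month occurs more than once
    'keep_False': df.duplicated(subset=['month'], keep=False),
})
print(masks)

# Rows that are NOT repeats (first occurrences) AND belong to a duplicated group
first_of_dupes = df[~masks['keep_first'] & masks['keep_False']]
print(first_of_dupes)
```

Month 10 is False in every column because it occurs only once, which is why it is excluded from the filtered result.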

2

The output is the set of rows flagged as duplicates in your DataFrame, not the rows left after dropping the duplicates. If you want one row per month (the first occurrence of each), use:

df.drop_duplicates(subset=['month'])

which gives:

   month  year  sale  title
0      2  2017    60   Ones
2      1  2020    90  Three
5     10  2019    36    Six

You can set keep to 'first', 'last', or False (the boolean, not the string 'None') depending on your requirement.
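
As a rough sketch of how the keep setting changes what drop_duplicates returns on the question's data (the variable names `first`, `last` and `unique_only` are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'month': [2, 2, 1, 1, 2, 10],
                   'year': [2017, 2017, 2020, 2020, 2018, 2019],
                   'sale': [60, 45, 90, 20, 28, 36],
                   'title': ['Ones', 'Twoes', 'Three', 'Four', 'Five', 'Six']})

# Keep the first row seen for each month
first = df.drop_duplicates(subset=['month'], keep='first')

# Keep the last row seen for each month
last = df.drop_duplicates(subset=['month'], keep='last')

# Drop every row whose month occurs more than once
unique_only = df.drop_duplicates(subset=['month'], keep=False)

print(first)
print(last)
print(unique_only)
```

With keep=False only the month 10 row survives, since months 1 and 2 both occur more than once.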
