R: how to remove duplicate rows by column [duplicate]

Question

df <- data.frame(id = c(1, 1, 1, 2, 2),
                 gender = c("Female", "Female", "Male", "Female", "Male"),
                 variant = c("a", "b", "c", "d", "e"))
> df
  id gender variant
1  1 Female       a
2  1 Female       b
3  1   Male       c
4  2 Female       d
5  2   Male       e

I want to remove duplicate rows from my data.frame based on the gender column. I know a similar question has been asked (here), but the difference is that I would like to remove duplicates within each subset of the data, where each subset is defined by a unique id.

My desired result is this:

  id gender variant
1  1 Female       a
3  1   Male       c
4  2 Female       d
5  2   Male       e

I've tried the following and it works, but I'm wondering if there's a cleaner, more efficient way of doing this?

out = list()
for(i in 1:2){
  df2 <- subset(df, id == i)
  out[[i]] <- df2[!duplicated(df2$gender), ]
}
do.call(rbind.data.frame, out)

Possible duplicate of Remove duplicated rows using dplyr OR Removing duplicate rows with ddply
– Ronak Shah, Commented Oct 10, 2017 at 1:39

stevec · Accepted Answer · 2020-05-02 05:27:48Z

29

df[!duplicated(df[, c("id", "gender")]), ]

#     id  gender  variant
#  1   1  Female     a
#  3   1   Male      c
#  4   2  Female     d
#  5   2   Male      e

Another way to do this is with subset:

subset(df, !duplicated(subset(df, select=c(id, gender))))

#   id  gender variant
# 1  1  Female     a
# 3  1    Male     c
# 4  2  Female     d
# 5  2    Male     e
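Both versions above rely on duplicated() returning one logical value per row, marking every repeat of an (id, gender) pair after its first occurrence. As a small sketch of that behaviour (the names key and last_kept are only illustrative), base R's fromLast argument can be used if the last row of each pair should be kept instead of the first:

```r
df <- data.frame(id = c(1, 1, 1, 2, 2),
                 gender = c("Female", "Female", "Male", "Female", "Male"),
                 variant = c("a", "b", "c", "d", "e"))

# The columns that define what counts as a duplicate
key <- df[, c("id", "gender")]

# TRUE marks a row whose (id, gender) pair already appeared in an earlier row
duplicated(key)

# fromLast = TRUE scans from the bottom, so the last row of each pair is kept
last_kept <- df[!duplicated(key, fromLast = TRUE), ]
last_kept
```

With the example data only the (1, Female) pair is repeated, so the two approaches differ only in whether variant a or b survives.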

edited May 2, 2020 at 5:27

stevec


answered Oct 10, 2017 at 0:23

Santosh M.


markdly · Accepted Answer · 2017-10-10 01:00:03Z

4

Here's a dplyr-based solution in case you are interested (edited to include Gregor's suggestions):

library(dplyr)
group_by(df, id, gender) %>% slice(1)

#> # A tibble: 4 x 3
#> # Groups:   id, gender [4]
#>      id gender variant
#>   <dbl> <fctr>  <fctr>
#> 1     1 Female       a
#> 2     1   Male       c
#> 3     2 Female       d
#> 4     2   Male       e

Depending on which values of variant should be removed, it might also be worth calling arrange first, since slice(1) keeps whichever row comes first within each group.
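As a rough sketch of the arrange idea: sort so that the row you want to keep comes first within each (id, gender) group, then take the first row. The rule used here (keep the alphabetically last variant) is only an assumption for illustration, not something the question asks for.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2),
                 gender = c("Female", "Female", "Male", "Female", "Male"),
                 variant = c("a", "b", "c", "d", "e"))

# Assumed rule for illustration: keep the alphabetically last variant per group.
# Sorting by desc(variant) puts that row first, so slice(1) picks it up.
res <- df %>%
  arrange(id, gender, desc(variant)) %>%
  group_by(id, gender) %>%
  slice(1) %>%
  ungroup()

res
```

For id 1 / Female this keeps variant b rather than a; the other groups have only one row each, so they are unchanged.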

edited Oct 10, 2017 at 1:00

answered Oct 10, 2017 at 0:46

markdly


1 Comment

r2evans Over a year ago

Newer versions of dplyr can do this directly with dplyr::distinct(df, id, gender, .keep_all = TRUE); run on iris, this is about 6-7x faster than group_by/slice.
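For reference, a minimal sketch of the distinct call from the comment above, applied to the data from the question. The speed figure is the commenter's and is not re-measured here.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2),
                 gender = c("Female", "Female", "Male", "Female", "Male"),
                 variant = c("a", "b", "c", "d", "e"))

# Keep the first row for each (id, gender) combination;
# .keep_all = TRUE retains the remaining columns (here, variant).
res <- distinct(df, id, gender, .keep_all = TRUE)

res
```

Unlike group_by/slice, this returns an ungrouped result, so no ungroup() step is needed afterwards.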
