2

I have a large dataset that was built by combining data from multiple sources. Hence, there are a number of rows that are duplicates. I know how to remove duplicates using dplyr and distinct but I would like to have it always keep the row based on a specific value in a cell (source file). Essentially we have a ranking of which sources we prefer. Below is a very simplified dataset to use as an example:

mydata = data.frame (species =c ('myli','myli','myli','myli','myli','stili','stili','stili'),
                     count = c (10,10,15,15,12,10,10,10),
                     year =c(2020,2020,2021,2021,2019,2017,2017,2018),
                     source =c('zd','steam','ted','steam','zd','steam','ted','steam'))
    
    
    mydata

  species count year source
1    myli    10 2020     zd
2    myli    10 2020  steam
3    myli    15 2021    ted
4    myli    15 2021  steam
5    myli    12 2019     zd
6   stili    10 2017  steam
7   stili    10 2017    ted
8   stili    10 2018  steam

I do the following to remove the duplicates:

library(dplyr)
 
# Remove duplicate rows of the dataframe using 'species', 'count', and 'year' variables
distinct(mydata, species, count, year, .keep_all= TRUE)

  species count year source
1    myli    10 2020     zd
2    myli    15 2021    ted
3    myli    12 2019     zd
4   stili    10 2017  steam
5   stili    10 2018  steam

However, I want to ensure that the rows that are kept when there are duplicates prioritize the 'source' in the following order: zd > ted > steam so the final table looks like:

  species count year source
1    myli    10 2020     zd
2    myli    15 2021    ted
3    myli    12 2019     zd
4   stili    10 2017    ted
5   stili    10 2018  steam

So essentially the original rows '1', '3','5', '7' and '8' are kept and the duplicate rows '2','4', and '6' are dropped.

How to do that last step to prioritize which original row to keep of the duplicated rows?

2 Answers 2

3

Since your prioritization happens to be in reverse alphabetical order, in this case you can simply arrange(desc(source)) prior to your distinct() call

mydata %>% 
  arrange(desc(source)) %>% 
  distinct(species,count,year,.keep_all = T)

Output

  species count year source
1    myli    10 2020     zd
2    myli    12 2019     zd
3    myli    15 2021    ted
4   stili    10 2017    ted
5   stili    10 2018  steam
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you very much! Just because this was a simplified example and my real dataset is not in alphabetical order (I did not even mean to do that) I did the following which worked great: mydata %>% arrange(match(source, c("zd", "ted", "steam"))) %>% distinct(species, count, year, .keep_all= TRUE)
1

Distinct respect the ordering. So as your criteria is alphabetically ordered* you can do it as simple as this:

mydata |>
  arrange(desc(source)) |>
  distinct(species, count, year, .keep_all= TRUE)

.* In other cases you'd need to make a variable with the order.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.