Filter a Dataframe by Another Dataframe

Question

Supposedly this question has already been answered, but the user who flagged it did not test the cited solution, which does not work for my problem.

I have found questions on how to filter a dataframe using a list, but nothing that shows how to filter a dataframe using another dataframe.

I have two dataframes. The first one can be thought of as a key of IDs and dates.

   id       date
1 id1 2016-06-23
2 id2 2016-06-25
3 id3 2016-06-23
4 id4 2016-06-25
5 id5 2016-06-27

structure(list(id = structure(1:5, .Label = c("id1", "id2", "id3", 
"id4", "id5"), class = "factor"), date = structure(c(16975, 16977, 
16975, 16977, 16979), class = "Date")), .Names = c("id", "date"
), row.names = c(NA, -5L), class = "data.frame")

I then have a second dataframe with IDs and dates. I would like to filter the second dataframe so that it only returns rows whose date is after the date for the same ID in the first dataframe.

Here is the second dataframe:

   id       date
1 id1 2016-06-20
2 id1 2016-06-23
3 id1 2016-06-24
4 id2 2016-06-23
5 id3 2016-06-27

structure(list(id = structure(c(1L, 1L, 1L, 2L, 3L), .Label = c("id1", 
"id2", "id3"), class = "factor"), date = structure(c(16972, 16975, 
16976, 16975, 16979), class = "Date")), .Names = c("id", "date"
), row.names = c(NA, -5L), class = "data.frame")

And this is what the results would look like:

   id       date
1 id1 2016-06-24
2 id3 2016-06-27

Possible duplicate of Join two datasets based on an inequality condition
– C8H10N4O2, Commented Sep 26, 2017 at 17:53
Did you read this answer? "I'll suppose you have a constant variable in each case called 'dummy' (or alternatively, it can be another variable to join by)" -- in your case the "alternatively"
– C8H10N4O2, Commented Sep 26, 2017 at 18:17

Community · Accepted Answer · 2020-06-20 09:12:55Z

3

Use a non-equi join in `data.table`:

library(data.table)

setDT(df1)
setDT(df2)

setnames(df1, 'date','date1') # disambiguate for conditional join

df1[df2, on=.(id, date1<date), nomatch=0]

Returns:

    id      date1
1: id1 2016-06-24
2: id3 2016-06-27

On large datasets I expect this approach to be faster than any approach which uses dplyr and/or a cartesian join followed by a filter.
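Note that `setDT()` and `setnames()` modify their arguments by reference, so the snippet above also renames the column in `df1` itself. A minimal sketch of the same non-equi join on copies, assuming the question's data rebuilt with `data.frame()`; the names `key_dt`, `obs_dt`, and `result` are illustrative and not from the answer:

```r
library(data.table)

# The question's two data frames, rebuilt by hand (ids as character, not factor).
df1 <- data.frame(
  id   = c("id1", "id2", "id3", "id4", "id5"),
  date = as.Date(c("2016-06-23", "2016-06-25", "2016-06-23",
                   "2016-06-25", "2016-06-27"))
)
df2 <- data.frame(
  id   = c("id1", "id1", "id1", "id2", "id3"),
  date = as.Date(c("2016-06-20", "2016-06-23", "2016-06-24",
                   "2016-06-23", "2016-06-27"))
)

# as.data.table() returns a copy, so df1 and df2 stay untouched.
key_dt <- as.data.table(df1)
obs_dt <- as.data.table(df2)
setnames(key_dt, "date", "date1")  # disambiguate for the conditional join

# Equi-join on id, non-equi join on date1 < date; drop unmatched rows.
# nomatch = NULL is the newer spelling of nomatch = 0.
result <- key_dt[obs_dt, on = .(id, date1 < date), nomatch = NULL]

# In a non-equi join the join column carries the values from obs_dt,
# so it can be renamed back to "date".
setnames(result, "date1", "date")
print(result)
```

Working on copies costs extra memory, so on very large tables the by-reference version in the answer may still be preferable.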

edited Jun 20, 2020 at 9:12

CommunityBot


answered Sep 26, 2017 at 18:01

C8H10N4O2


1 Comment

C8H10N4O2 Over a year ago

Why the downvote? This is probably the most efficient solution offered...

halfer · Accepted Answer · 2017-09-27 15:03:43Z

3

Thank god there is dplyr. The following code joins df1, which has unique identifiers, onto df2 and keeps only the rows (filter) that match the condition date > date.2.

Be careful: by default, when both data.frames share identical column names, dplyr joins by all of them. So we have to specify the by parameter and add a suffix to distinguish the identically named columns.

library(dplyr)
library(magrittr)

df2 %>%
  left_join(df1, by = "id", suffix = c("", ".2")) %>%
  filter(date > date.2) %>%
  select(-date.2)

#    id       date
# 1 id1 2016-06-24
# 2 id3 2016-06-27
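To illustrate the caution about default join columns, here is a small sketch, assuming the question's data rebuilt with `data.frame()`; the names `exact_matches` and `joined` are illustrative only:

```r
library(dplyr)

# The question's two data frames, rebuilt by hand (ids as character, not factor).
df1 <- data.frame(
  id   = c("id1", "id2", "id3", "id4", "id5"),
  date = as.Date(c("2016-06-23", "2016-06-25", "2016-06-23",
                   "2016-06-25", "2016-06-27"))
)
df2 <- data.frame(
  id   = c("id1", "id1", "id1", "id2", "id3"),
  date = as.Date(c("2016-06-20", "2016-06-23", "2016-06-24",
                   "2016-06-23", "2016-06-27"))
)

# Without `by`, dplyr joins on every shared column (here id AND date),
# so only rows that are identical in both columns match.
exact_matches <- inner_join(df2, df1)

# With by = "id", both date columns survive and the suffix tells them apart.
joined <- inner_join(df2, df1, by = "id", suffix = c("", ".2"))

print(exact_matches)
print(names(joined))
```

The first call also prints a message naming the columns dplyr chose, which is a quick way to spot an unintended join key.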

edited Sep 27, 2017 at 15:03

halfer


answered Sep 26, 2017 at 17:44

GoGonzo


2 Comments

pogibas Over a year ago

This result differs from OPs expected output

C8H10N4O2 Over a year ago

replace >= with > because OP specifies "after" not "on or after"

pogibas · Accepted Answer · 2017-09-26 17:58:51Z

1

Solution using data.table:

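library(data.table)
setDT(d1)  # d1: the key table of ids and dates (the first dataframe)
setDT(d2)  # d2: the table to filter (the second dataframe)
# equi-join on id, then keep rows where d2's date is after d1's date
merge(d1, d2, "id")[date.y > date.x, .(id, date = date.y)]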

    id       date
1: id1 2016-06-24
2: id3 2016-06-27

answered Sep 26, 2017 at 17:58

pogibas


2 Comments

pogibas Over a year ago

@C8H10N4O2 can you give a link?

C8H10N4O2 Over a year ago

There is an example here that is conditional only (no equi-join on id). See my answer below for example w/ both. I still think this question is a dupe though.

kpress · Accepted Answer · 2017-09-26 18:55:29Z

0

So your first dataframe is basically an index. Assuming that index is called df1, and your second dataframe that you want to filter is df2, I would do this using dplyr:

library(dplyr)

df.result <- left_join(df2, df1, by = "id") %>% 
   filter(date.x > date.y) %>% 
   select(-date.y)

eta: this would be the result:

   id     date.x
 1 id1 2016-06-24
 2 id3 2016-06-27

answered Sep 26, 2017 at 18:55

kpress


2 Comments

C8H10N4O2 Over a year ago

This is a duplicate of Gonzo's answer

kpress Over a year ago

Well I didn't see that before I posted mine, but it's not an exact duplicate. Mine is simpler.

LMc · Accepted Answer · 2023-07-12 16:30:54Z

0

join_by was added in dplyr 1.1.0 for more advanced join specifications:

library(dplyr)

inner_join(df2, df1, by = join_by(id == id, date > date)) |>
  select(id, date = date.x)

Note that the equality and inequality conditions each use the same variable name on both sides. The two sides are distinguished by the order of the data frames: the LHS of each condition refers to the first data frame (df2 in this case) and the RHS to the second data frame (df1 in this case).
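If relying on argument order feels too implicit, `join_by()` also accepts `x$` and `y$` prefixes to state which table each column comes from. A minimal sketch, assuming dplyr 1.1.0 or later, R 4.1 or later for `|>`, and the question's data rebuilt with `data.frame()`; the name `result` is illustrative:

```r
library(dplyr)

# The question's two data frames, rebuilt by hand (ids as character, not factor).
df1 <- data.frame(
  id   = c("id1", "id2", "id3", "id4", "id5"),
  date = as.Date(c("2016-06-23", "2016-06-25", "2016-06-23",
                   "2016-06-25", "2016-06-27"))
)
df2 <- data.frame(
  id   = c("id1", "id1", "id1", "id2", "id3"),
  date = as.Date(c("2016-06-20", "2016-06-23", "2016-06-24",
                   "2016-06-23", "2016-06-27"))
)

# x$ refers to the first table (df2), y$ to the second (df1).
# A bare `id` is shorthand for id == id.
result <- inner_join(df2, df1, by = join_by(id, x$date > y$date)) |>
  select(id, date = date.x)  # keep df2's date, drop df1's

print(result)
```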

answered Jul 12, 2023 at 16:30

LMc
