3

Supposedly this question has already been answered. But the user who flagged my question failed to test the solution and the cited question does not work for my problem.

I have found questions on how to filter a dataframe using another list but I have not found something that shows how to filter a dataframe using another dataframe.

I have two dataframes and the first one can be thought of as a key of ID's and dates.

   id       date
1 id1 2016-06-23
2 id2 2016-06-25
3 id3 2016-06-23
4 id4 2016-06-25
5 id5 2016-06-27

structure(list(id = structure(1:5, .Label = c("id1", "id2", "id3", 
"id4", "id5"), class = "factor"), date = structure(c(16975, 16977, 
16975, 16977, 16979), class = "Date")), .Names = c("id", "date"
), row.names = c(NA, -5L), class = "data.frame")

I then have a second dataframe with ID's and dates and I would like to filter the second dataframe to only return rows that are after the date for the ID in the first row.

Here is the second dataframe:

   id       date
1 id1 2016-06-20
2 id1 2016-06-23
3 id1 2016-06-24
4 id2 2016-06-23
5 id3 2016-06-27

structure(list(id = structure(c(1L, 1L, 1L, 2L, 3L), .Label = c("id1", 
"id2", "id3"), class = "factor"), date = structure(c(16972, 16975, 
16976, 16975, 16979), class = "Date")), .Names = c("id", "date"
), row.names = c(NA, -5L), class = "data.frame")

And this is what the results would look like:

   id       date
1 id1 2016-06-24
2 id3 2016-06-27
2
  • 3
    Possible duplicate of Join two datasets based on an inequality condition Commented Sep 26, 2017 at 17:53
  • Did you read this answer? "I'll suppose you have a constant variable in each case called 'dummy' (or alternatively, it can be another variable to join by)" -- in your case the "alternatively" Commented Sep 26, 2017 at 18:17

5 Answers 5

3

Use non-equi-join in data.table

library(data.table)

setDT(df1)
setDT(df2)

setnames(df1, 'date','date1') # disambiguate for conditional join

df1[df2, on=.(id, date1<date), nomatch=0]

Returns:

  id      date1
1: id1 2016-06-24
2: id3 2016-06-27

On large datasets I expect this approach to be faster than any approach which uses dplyr and/or a cartesian join followed by a filter.

Sign up to request clarification or add additional context in comments.

1 Comment

Why the downvote? This is probably the most efficient solution offered...
3

Thank god there is dplyr. The following code joins df1 which has unique identifiers, and keeps only these rows (filter) which matches condition date >= date.1.

Be careful, because by default when you have identical column names in both data.frames, dplyr will join by all of them. Then we have to specify by parameter and add suffix to variables names to differ identical column names.

library(dplyr)
library(magrittr)

df2 %>%
 left_join(df1, by = "id", suffix=c("",".2") ) %>%
 filter( date > date.2) %>%
 select( -date.2 )

#  id       date
# 1 id1 2016-06-23
# 2 id1 2016-06-24
# 3 id3 2016-06-27

2 Comments

This result differs from OPs expected output
replace >= with > because OP specifies "after" not "on or after"
1

Solution using data.table:

library(data.table)
setDT(d1)
setDT(d2)
merge(d1, d2, "id")[date.y > date.x, .(id, date = date.y)]

    id       date
1: id1 2016-06-24
2: id3 2016-06-27

2 Comments

@C8H10N4O2 can you give a link?
There is an example here that is conditional only (no equi-join on id). See my answer below for example w/ both. I still think this question is a dupe though.
0

So your first dataframe is basically an index. Assuming that index is called df1, and your second dataframe that you want to filter is df2, I would do this using dplyr:

library(dplyr)

df.result <- left_join(df2, df1, by = "id") %>% 
   filter(date.x > date.y) %>% 
   select(-date.y) 

eta: this would be the result:

   id     date.x
 1 id1 2016-06-24
 2 id3 2016-06-27

2 Comments

This is a duplicate of Gonzo's answer
Well I didn't see that before I posted mine, but it's not an exact duplicate. Mine is simpler.
0

join_by was added in dplyr 1.1.0 for more advanced join specifications:

library(dplyr)

inner_join(df2, df1, by = join_by(id == id, date > date)) |>
  select(id, date = date.x)

Note the equality condition and inequality conditions both have the same variable names. They are distinguished by the order of the data frames. The LHS refers to the first data frame (df2 in this case) and the RHS to the second data frame (df1 in this case).

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.