Merge two data frame where one column is matched based on condition

Question

Mock data:

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))

df1 contains the country-year of interest. I want to get the myvar values for this country-year AND 3 years before and after.

In other words, the merging is done based on condition that df2$country==df1$country AND df2$year > df1$year - 3 & df2$year < df1$year + 3

EDIT: My (working, not elegant) solution is to pad df1 to create all the country-years that I'm interested in, then merge with df2 the regular way.

library(plyr)
ddply(df1, c("country", "year"), 
  function(df) data.frame(rep(df$country, 7), (df$year-3):(df$year+3)))

produces

   country year rep.df.country..7. X.df.year...3...df.year...3.
1       UK 2003                 UK                         2000
2       UK 2003                 UK                         2001
3       UK 2003                 UK                         2002
4       UK 2003                 UK                         2003
5       UK 2003                 UK                         2004
6       UK 2003                 UK                         2005
7       UK 2003                 UK                         2006
8       US 2000                 US                         1997
9       US 2000                 US                         1998
10      US 2000                 US                         1999
11      US 2000                 US                         2000
12      US 2000                 US                         2001
13      US 2000                 US                         2002
14      US 2000                 US                         2003

Don't have time for a full answer now, but will post one later if I manage. Check out the new function foverlaps from the package data.table if you use them. Used to answer this question — ilir
– ilir, Commented Oct 19, 2014 at 21:01

rawr · Accepted Answer · 2014-10-20 00:19:36Z

Where does merging fit in? This just sounds like a subsetting issue unless I misunderstood the question (as I oft admittedly do)

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))


f <- lapply(df1$country, function(x) {
  tmp <- df2[df2$country == x, ]
  tmp[abs(tmp$year - df1[df1$country == x, 'year']) <= 3, ]
})


do.call(rbind, f)

#    country year       myvar
# 1       US 2000 -0.62645381
# 3       US 2002 -0.83562861
# 11      US 2000  1.51178117
# 13      US 2002 -0.62124058
# 2       UK 2001  0.18364332
# 4       UK 2003  1.59528080
# 6       UK 2005 -0.82046838
# 12      UK 2001  0.38984324
# 14      UK 2003 -2.21469989
# 16      UK 2005 -0.04493361

EDIT

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003, 2009, 2009))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))

f <- lapply(seq_len(nrow(df1)), function(x) {
  y <- df1[x, 'country']
  tmp <- df2[df2$country == y, ]
  tmp[abs(tmp$year - df1[x, 'year']) <= 3, ]
})


do.call(rbind, f)

#    country year       myvar
# 1       US 2000 -0.62645381
# 3       US 2002 -0.83562861
# 11      US 2000  1.51178117
# 13      US 2002 -0.62124058
# 2       UK 2001  0.18364332
# 4       UK 2003  1.59528080
# 6       UK 2005 -0.82046838
# 12      UK 2001  0.38984324
# 14      UK 2003 -2.21469989
# 16      UK 2005 -0.04493361
# 7       US 2006  0.48742905
# 9       US 2008  0.57578135
# 17      US 2006 -0.01619026
# 19      US 2008  0.82122120
# 8       UK 2007  0.73832471
# 10      UK 2009 -0.30538839
# 18      UK 2007  0.94383621
# 20      UK 2009  0.59390132

This works for the mock data, but not in a more general setting where df1$country is not unique (because of this part df1[df1$country == x, 'year']). Note that my working solution works in that general condition (I think). (It's my bad for posting oversimplified mock data ofc).
are there different years for each non unique country? if not, you can just use unique(df1[...)

KFB · Accepted Answer · 2014-10-20 12:09:27Z

1

A trial using foverlaps in data.table

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003, 2009, 2009))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))
library(data.table)
setDT(df1); setDT(df2) # convert to data table
df1[, c("start", "end") := list(year-2, year+2)]
setkey(df1, country, start, end)
setkey(df2[, year2:=year], country, year, year2)
foverlaps(df1, df2, type="any")[,4:7:=NULL][]
    country year       myvar
 1:      UK 2001  0.18364332
 2:      UK 2001  0.38984324
 3:      UK 2003  1.59528080
 4:      UK 2003 -2.21469989
 5:      UK 2005 -0.82046838
 6:      UK 2005 -0.04493361
 7:      UK 2007  0.73832471
 8:      UK 2007  0.94383621
 9:      UK 2009 -0.30538839
10:      UK 2009  0.59390132
11:      US 2000 -0.62645381
12:      US 2000  1.51178117
13:      US 2002 -0.83562861
14:      US 2002 -0.62124058
15:      US 2008  0.57578135
16:      US 2008  0.82122120

edited Oct 20, 2014 at 12:09

answered Oct 19, 2014 at 22:07

KFB

3,5013 gold badges17 silver badges18 bronze badges

7 Comments

Arun Over a year ago

I think it must be year-2 and year+2 (since foverlaps works on closed intervals)? Now I see the reason for your earlier Q about foverlaps. Will fix. Thanks.

Arun Over a year ago

Small note: If keys are set for both tables appropriately, then by.x and by.y are not necessary.

KFB Over a year ago

Removed by.x by.y.; y-3 and year+3 should be correct to include concerned year.

Arun Over a year ago

Again, I don't think so. The condition is df1$year - 3 < df2$year < df1$year+3; i.e., (df1$year-3, df1$year+3). But foverlaps assumes closed intervals - [df1$year-3, df1$year+3] and would therefore also match on the border. I don't think that's required. Also IIUC the foverlaps command must be foverlaps(df2, df1, type="within", nomatch=0L) - check if df2 interval lies within df1's interval, df1 is the lookup.

rawr Over a year ago

geez that's ugly. I thought data table was supposed to be elegant :) @Arun

|

Arun · Accepted Answer · 2016-05-12 12:19:12Z

A perhaps simple solution using data.table

library(data.table) # v1.9.7 (devel version)
# go here for install instructions
# https://github.com/Rdatatable/data.table/wiki/Installation    

# convert datasets into data.table
  setDT(df1)
  setDT(df2)


# create conditional columns in df1
  df1[, yearplus3  :=  year +3 ][, yearminus3 := year - 3 ]

# merge
    output <- df1[df2, on = .(country = country ,                # condition 1
                              yearminus3 < year,                 # condition 2
                              yearplus3  > year), nomatch = 0 ,  # condition 3
                  .(country, year,  myvar )]  # indicate columns in the output


output 
 >   country year       myvar
 >1:      US 2000 -0.62645381
 >2:      UK 2003  0.18364332
 >3:      US 2000 -0.83562861
 >4:      UK 2003  1.59528080
 >5:      UK 2003 -0.82046838
 >6:      US 2000  1.51178117
 >7:      UK 2003  0.38984324
 >8:      US 2000 -0.62124058

ps. note that the argument on = is still in development version of data.table as of today (12 May 2016)

You should probably mention that this is currently in development and is subjected to changes.. (non-equi joins part).

Collectives™ on Stack Overflow

Merge two data frame where one column is matched based on condition

3 Answers 3

3 Comments

7 Comments

1 Comment

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

3 Comments

7 Comments

1 Comment

Linked

Related