4

Mock data:

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))

df1 contains the country-year of interest. I want to get the myvar values for this country-year AND 3 years before and after.

In other words, the merging is done based on condition that df2$country==df1$country AND df2$year > df1$year - 3 & df2$year < df1$year + 3

EDIT: My (working, not elegant) solution is to pad df1 to create all the country-years that I'm interested in, then merge with df2 the regular way.

library(plyr)
ddply(df1, c("country", "year"), 
  function(df) data.frame(rep(df$country, 7), (df$year-3):(df$year+3)))

produces

   country year rep.df.country..7. X.df.year...3...df.year...3.
1       UK 2003                 UK                         2000
2       UK 2003                 UK                         2001
3       UK 2003                 UK                         2002
4       UK 2003                 UK                         2003
5       UK 2003                 UK                         2004
6       UK 2003                 UK                         2005
7       UK 2003                 UK                         2006
8       US 2000                 US                         1997
9       US 2000                 US                         1998
10      US 2000                 US                         1999
11      US 2000                 US                         2000
12      US 2000                 US                         2001
13      US 2000                 US                         2002
14      US 2000                 US                         2003
1
  • 1
    Don't have time for a full answer now, but will post one later if I manage. Check out the new function foverlaps from the package data.table if you use them. Used to answer this question Commented Oct 19, 2014 at 21:01

3 Answers 3

1

Where does merging fit in? This just sounds like a subsetting issue unless I misunderstood the question (as I oft admittedly do)

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))


f <- lapply(df1$country, function(x) {
  tmp <- df2[df2$country == x, ]
  tmp[abs(tmp$year - df1[df1$country == x, 'year']) <= 3, ]
})


do.call(rbind, f)

#    country year       myvar
# 1       US 2000 -0.62645381
# 3       US 2002 -0.83562861
# 11      US 2000  1.51178117
# 13      US 2002 -0.62124058
# 2       UK 2001  0.18364332
# 4       UK 2003  1.59528080
# 6       UK 2005 -0.82046838
# 12      UK 2001  0.38984324
# 14      UK 2003 -2.21469989
# 16      UK 2005 -0.04493361

EDIT

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003, 2009, 2009))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))

f <- lapply(seq_len(nrow(df1)), function(x) {
  y <- df1[x, 'country']
  tmp <- df2[df2$country == y, ]
  tmp[abs(tmp$year - df1[x, 'year']) <= 3, ]
})


do.call(rbind, f)

#    country year       myvar
# 1       US 2000 -0.62645381
# 3       US 2002 -0.83562861
# 11      US 2000  1.51178117
# 13      US 2002 -0.62124058
# 2       UK 2001  0.18364332
# 4       UK 2003  1.59528080
# 6       UK 2005 -0.82046838
# 12      UK 2001  0.38984324
# 14      UK 2003 -2.21469989
# 16      UK 2005 -0.04493361
# 7       US 2006  0.48742905
# 9       US 2008  0.57578135
# 17      US 2006 -0.01619026
# 19      US 2008  0.82122120
# 8       UK 2007  0.73832471
# 10      UK 2009 -0.30538839
# 18      UK 2007  0.94383621
# 20      UK 2009  0.59390132
Sign up to request clarification or add additional context in comments.

3 Comments

This works for the mock data, but not in a more general setting where df1$country is not unique (because of this part df1[df1$country == x, 'year']). Note that my working solution works in that general condition (I think). (It's my bad for posting oversimplified mock data ofc).
are there different years for each non unique country? if not, you can just use unique(df1[...)
Yep, different years for non unique country is the concern
1

A trial using foverlaps in data.table

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003, 2009, 2009))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))
library(data.table)
setDT(df1); setDT(df2) # convert to data table
df1[, c("start", "end") := list(year-2, year+2)]
setkey(df1, country, start, end)
setkey(df2[, year2:=year], country, year, year2)
foverlaps(df1, df2, type="any")[,4:7:=NULL][]
    country year       myvar
 1:      UK 2001  0.18364332
 2:      UK 2001  0.38984324
 3:      UK 2003  1.59528080
 4:      UK 2003 -2.21469989
 5:      UK 2005 -0.82046838
 6:      UK 2005 -0.04493361
 7:      UK 2007  0.73832471
 8:      UK 2007  0.94383621
 9:      UK 2009 -0.30538839
10:      UK 2009  0.59390132
11:      US 2000 -0.62645381
12:      US 2000  1.51178117
13:      US 2002 -0.83562861
14:      US 2002 -0.62124058
15:      US 2008  0.57578135
16:      US 2008  0.82122120

7 Comments

I think it must be year-2 and year+2 (since foverlaps works on closed intervals)? Now I see the reason for your earlier Q about foverlaps. Will fix. Thanks.
Small note: If keys are set for both tables appropriately, then by.x and by.y are not necessary.
Removed by.x by.y.; y-3 and year+3 should be correct to include concerned year.
Again, I don't think so. The condition is df1$year - 3 < df2$year < df1$year+3; i.e., (df1$year-3, df1$year+3). But foverlaps assumes closed intervals - [df1$year-3, df1$year+3] and would therefore also match on the border. I don't think that's required. Also IIUC the foverlaps command must be foverlaps(df2, df1, type="within", nomatch=0L) - check if df2 interval lies within df1's interval, df1 is the lookup.
geez that's ugly. I thought data table was supposed to be elegant :) @Arun
|
0

A perhaps simple solution using data.table

library(data.table) # v1.9.7 (devel version)
# go here for install instructions
# https://github.com/Rdatatable/data.table/wiki/Installation    

# convert datasets into data.table
  setDT(df1)
  setDT(df2)


# create conditional columns in df1
  df1[, yearplus3  :=  year +3 ][, yearminus3 := year - 3 ]

# merge
    output <- df1[df2, on = .(country = country ,                # condition 1
                              yearminus3 < year,                 # condition 2
                              yearplus3  > year), nomatch = 0 ,  # condition 3
                  .(country, year,  myvar )]  # indicate columns in the output


output 
 >   country year       myvar
 >1:      US 2000 -0.62645381
 >2:      UK 2003  0.18364332
 >3:      US 2000 -0.83562861
 >4:      UK 2003  1.59528080
 >5:      UK 2003 -0.82046838
 >6:      US 2000  1.51178117
 >7:      UK 2003  0.38984324
 >8:      US 2000 -0.62124058

ps. note that the argument on = is still in development version of data.table as of today (12 May 2016)

1 Comment

You should probably mention that this is currently in development and is subjected to changes.. (non-equi joins part).

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.