
Is there a way to merge (left outer join) data frames by multiple columns, but with an OR condition?

Example: There are two data frames, df1 and df2, each with columns x, y, num. I would like a data frame with all rows from df1, joined with only those rows from df2 that satisfy the condition: df1$x == df2$x OR df1$y == df2$y.

Here are sample data:

df1 <- data.frame(x = LETTERS[1:5],
                  y = 1:5,
                  num = rnorm(5), stringsAsFactors = F)
df1
  x y       num
1 A 1 0.4209480
2 B 2 0.4687401
3 C 3 0.3018787
4 D 4 0.0669793
5 E 5 0.9231559

df2 <- data.frame(x = LETTERS[3:7],
                  y = 3:7,
                  num = rnorm(5), stringsAsFactors = F)
df2$x[2] <- NA
df2$y[1] <- NA
df2
     x  y        num
1    C NA -0.7160824
2 <NA>  4 -0.3283618
3    E  5 -1.8775298
4    F  6 -0.9821082
5    G  7  1.8726288

Then, the result is expected to be:

  x y       num    x  y        num
1 A 1 0.4209480 <NA> NA         NA
2 B 2 0.4687401 <NA> NA         NA
3 C 3 0.3018787    C NA -0.7160824
4 D 4 0.0669793 <NA>  4 -0.3283618
5 E 5 0.9231559    E  5 -1.8775298

The most obvious solution is to use the sqldf package:

mergedData <- sqldf::sqldf("SELECT * FROM df1
                           LEFT OUTER JOIN df2
                           ON df1.x = df2.x
                           OR df1.y = df2.y")

Unfortunately, this simple solution is extremely slow, and it takes ages to merge data frames with more than 100k rows each.
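A quick, illustrative way to check the timing on your own data is to wrap the call in system.time() (a measurement sketch only, not a fix):

system.time(
  sqldf::sqldf("SELECT * FROM df1
                LEFT OUTER JOIN df2
                ON df1.x = df2.x
                OR df1.y = df2.y")
)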

Another option is to split the right data frame and merge it in parts, but is there a more elegant or even "out of the box" solution?
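For illustration, here is a minimal base R sketch of that "merge in parts" idea; it assumes each row of df1 has at most one match in df2 (as in the example above), so it is not a general solution:

ix_x <- match(df1$x, df2$x, incomparables = NA) # first df2 row matching on x
ix_y <- match(df1$y, df2$y, incomparables = NA) # first df2 row matching on y
ix   <- ifelse(is.na(ix_x), ix_y, ix_x)         # prefer the x match, else the y match
cbind(df1, df2[ix, ])                           # df2[NA, ] yields an all-NA row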

Comments:

  • I didn't downvote it, but it sounds like you have a working solution and you just want to figure out how to make it faster. This being the case, this question would be more appropriate for codereview.stackexchange.com. (Jul 26, 2016 at 13:39)
  • @Hack-R Disagree: Code Review isn't (primarily) for improving performance, it's for improving code quality. The question, as is, is perfectly suited for Stack Overflow: there's a technical problem that needs solving. (Jul 26, 2016 at 13:40)
  • @KonradRudolph I'd respectfully disagree and say that optimizing code is identically equal to improving performance. It's not a technical problem so much as the question of "how do I make this code better", which is the exact distinction of Code Review vs. SO. (Jul 26, 2016 at 13:43)
  • @Hack-R Code Review is best thought of as general advice on improving code. Stack Overflow, on the other hand, is for specific programming questions. In cases where somebody has a sufficiently scoped specific performance query, like here, SO is absolutely the appropriate place for it. (Jul 26, 2016 at 13:45)
  • Three notes on your data: 1. use set.seed to make it reproducible. 2. pay attention to the construction of the NAs in df2 and paste them into the result. 3. do df1 and df2 really have the same variable "num"? Or would it make more sense to give them different names? (Jul 26, 2016 at 14:01)

1 Answer


Here's one approach using data.table. For each column, we perform a join, but only extract the indices (as opposed to materialising the entire join). Then, we combine these indices from all the columns (this part would need some changes if there can be multiple matches).

require(data.table)
setDT(df1)
setDT(df2)

foo <- function(dx, dy, cols) {
    ix = lapply(cols, function(col) {
        dy[dx, on=col, which=TRUE] # for each row in dx, get matching indices of dy
                                   # by matching on column specified in "col"
    })
    do.call(function(...) pmax(..., na.rm=TRUE), ix) # combine the per-column indices
                                                     # (elementwise max, ignoring NAs)
}
ix = foo(df1, df2, c("x", "y")) # obtain matching indices of df2 for each row in df1
df1[, paste0("col", 1:3) := df2[ix]] # update df1 by reference
df1
#    x y         num col1 col2       col3
# 1: A 1  2.09611034   NA   NA         NA
# 2: B 2 -1.06795571   NA   NA         NA
# 3: C 3  1.38254433    C    3  1.0173476
# 4: D 4 -0.09367922    D    4 -0.6379496
# 5: E 5  0.47552072    E   NA -0.1962038

You can use setDF(df1) to convert it back to a data.frame, if necessary.
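If several rows of df2 can match the same row of df1, the pmax() step above needs to change. One possible sketch for that case (an extension under that assumption, not part of the answer as given) is to collect (df1 row, df2 row) index pairs from each single-column join and take their union:

df1[, i1 := .I]   # helper row indices, added by reference
df2[, i2 := .I]
pairs <- funion(
    df2[df1, on = "x", nomatch = NULL][, .(i1, i2)],  # all pairs matching on x
    df2[df1, on = "y", nomatch = NULL][, .(i1, i2)]   # all pairs matching on y
)
# `pairs` lists every matching row combination; a subsequent left join on i1
# would bring back the df1 rows that have no match at all.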


1 Comment

Thanks for the solution, it is indeed much faster than sqldf. It took 23.59 sec for sqldf to merge two data frames of 10k rows, while your solution finished in 0.011 sec.
