Using setdiff on dataframes with list columns

Question

Is there an R function to get the rows that are in one data.frame but not in another, if the data.frames contain list-columns? I know dplyr::setdiff will work on regular data.frames, but if I apply it to a data.frame with a list-column, I get an error.

list_df1 <- data.frame(x = c(1, 2, 2))
list_df1$y <- list(c("A", "B"), c("C"), c("B", "C"))
list_df2 <- data.frame(x = c(2, 3))
list_df2$y <- list(c("C"), c("D", "E"))
dplyr::setdiff(list_df1, list_df2)
#> Error: Can't join on 'y' x 'y' because of incompatible types (list / list)

Currently I've been using a loop over the rows in both data.frames and directly comparing if the rows are equal:

in_df2 <- rep(FALSE, nrow(list_df1))
for (row_ind1 in seq_len(nrow(list_df1))) {
  for (row_ind2 in seq_len(nrow(list_df2))) {
    rows_equal <- all.equal(list_df1[row_ind1, ], 
                            list_df2[row_ind2, ], 
                            check.attributes = FALSE)
    if (isTRUE(rows_equal)) {
      in_df2[row_ind1] <- TRUE
      break
    }
  }
}
list_df1[!in_df2, ]
#>   x    y
#> 1 1 A, B
#> 3 2 B, C

And while gives the result I'm looking for, I'm sure there must be a better or more efficient solution.

dplyr::anti_join is also a possible solution, if the non-list columns uniquely identify the results. But in this case, I want to remove rows only if all entries are identical between the two data.frames. If we apply anti_join on just column x we don't get the results I'm looking for:

dplyr::anti_join(list_df1, list_df2, by = "x")
#>   x    y
#> 1 1 A, B

And applying it to all columns gives an error, just list set_diff

dplr::anti_join(list_df1, list_df2)
#> Error: Can't join on 'y' x 'y' because of incompatible types (list / list)

Welcome to Stack Overflow! What you're trying to do is known as merge (base R) or join (dplyr or SQL). To find rows that are not in another table is an anti join. One solution is found in this Stack Overflow question — akraf
– akraf, Commented Nov 26, 2019 at 16:38
Thanks for pointing me to that question, it looks like it has some ideasI can work with. But dplyr::anti_join gives the same error Error: Can't join on 'y' x 'y' because of incompatible types (list / list) — Paul PR
– Paul PR, Commented Nov 26, 2019 at 16:57
See the linked SO question in my previous comment. It uses non-list columns created from the list columns as key columns for merging. I mentioned the terms to help your search in case you don't want to follow that solution — akraf
– akraf, Commented Nov 26, 2019 at 17:32
I get the exact desired result with anti_join(list_df1, list_df2, by = 'x') - please clarify your problem — tjebo
– tjebo, Commented Nov 26, 2019 at 22:31

Paul PR · Accepted Answer · 2019-12-08 17:11:01Z

The short answer is that the ability to apply setdiff to data.frames with list-columns is not built into R or dplyr at the moment. However, it is possible to implement a solution that's significantly faster than the naive approach of looping over both data.frames.

Option 1: If the list-columns aren't needed to identify the different rows, then dplyr::anti_join() is a great solution. To join by all non-list columns you could use (assuming list_df1 and list_df2 have the same column names):

nonlist_setdiff <- function(list_df1, list_df2) {
  nonlist_vars <- list_df1 %>%
    select_if(function(x) !is.list(x)) %>%
    colnames(.)
  dplyr::anti_join(list_df1, list_df2, by = nonlist_vars)
}

Option 2: If the list-columns are need to identify the different rows, then anti_join() is no longer an option. Instead, you can create a hash of each row of both data.frames, turning them into character vectors. Then you can apply setdiff to the character vectors. This approach uses the digest, dplyr, and purrr packages:

digest_setdiff <- function(list_df1, list_df2) {
  hash1 <- list_df1 %>%
    purrr::transpose() %>% # Transposing helps speed up the calculations
    purrr::map_chr(digest::digest, algo = "xxhash64") #use xxhash64 for speed
  hash2 <- list_df2 %>%
    purrr::transpose() %>%
    purrr::map_chr(digest::digest, algo = "xxhash64")
  only_in_df1 <- setdiff(hash1, hash2)
  list_df1[hash1 %in% only_in_df1, ]
}

This approach should work for all types of data.frames, no matter how complicated, and it's significantly faster than the naive approach. However, it is still way slower than the dplyr::setdiff and dplyr::anti_join commands, so don't use it if you don't have to.

Running digest_setdiff on your test input data gives the desired result:

digest_setdiff(list_df1, list_df2)
#>   x    y
#> 1 1 A, B
#> 3 2 B, C

Collectives™ on Stack Overflow

Using setdiff on dataframes with list columns

1 Answer 1

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Linked

Related