Subsetting Data Frame Based on Contents of a "Column" List

Question

Set-Up

I have a list matrix, where one of the "columns" is a list (I realize it's an odd dataset to work with, but I find it useful for other operations). Each entry of the list is either (1) empty (integer(0)), (2) a single integer, or (3) a vector of integers.

E.g. the R object "d.f", with d.f$ID an index vector and d.f$Basket_List the list.

ID <- c(1,2,3,4,5,6,7,8,9)
Basket_List <- list(integer(0),c(123,987),c(123,123),456,
                    c(456,123),456,c(123,987),c(987,123),987)
d.f <- data.frame(ID)
d.f$Basket_List <- Basket_List

My Question

Issue 1

I'd like to create a new dataset that's a subset of the initial one, based on whether or not "Basket_List" contains certain value(s). E.g. a subset of all the rows in d.f such that Basket_List has "123" or "123" & "987" -- or other more complicated conditions.

I've tried every variation of the following, but to no avail.

d.f2 <- subset(d.f, 123 %in% Basket_List)
d.f2 <- subset(d.f, 123 == any(Basket_List))
d.f2 <- d.f[which(123 %in% d.f$Basket_List),]
# should return the subset, with rows 2,3,5,7 & 8

Issue 2

My other issue is that I'll be running this operation over many millions of rows (it's transaction data), so I'd like to optimize it as much as possible for speed (I have a complicated for loop now, but it takes too much time).

Alternative Set-Up of Data

If you think it might be useful, the data might also be set-up as the following:

ID <- c(1,2,2,3,3,4,5,5,6,7,7,8,8,9)
Basket <- c(NA,123,987,123,123,456,456,123,456,123,987,987,123,987)
alt.d.f <- data.frame(ID,Basket)

Ari B. Friedman · Accepted Answer · 2013-04-28 21:01:56Z

7

You can use sapply for this:

ID <- c(1,2,3,4,5,6,7,8,9)
Basket_List <- list(integer(0),c(123,987),c(123,123),456,
                    c(456,123),456,c(123,987),c(987,123),987)
d.f <- data.frame(ID)

sel <- sapply( Basket_List, function(bl,searchItem) {
  any(searchItem %in% bl)
}, searchItem=c(123) )

> sel
[1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE

> d.f[sel,,drop=FALSE]
  ID
2  2
3  3
5  5
7  7
8  8
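
The same pattern should extend to the compound conditions mentioned in the question, such as "123" & "987". Below is a minimal sketch, not part of the original answer: has_any and has_all are hypothetical helper names, and vapply is swapped in for sapply only because it declares the return type up front.

```r
# Same data as in the question
ID <- c(1,2,3,4,5,6,7,8,9)
Basket_List <- list(integer(0),c(123,987),c(123,123),456,
                    c(456,123),456,c(123,987),c(987,123),987)
d.f <- data.frame(ID)
d.f$Basket_List <- Basket_List

# Hypothetical helpers (names are assumptions, chosen for illustration)
has_any <- function(basket, items) any(items %in% basket)  # at least one item present
has_all <- function(basket, items) all(items %in% basket)  # every item present

# vapply() behaves like sapply() but insists on one logical value per basket
sel_any  <- vapply(d.f$Basket_List, has_any, logical(1), items = c(123, 987))
sel_both <- vapply(d.f$Basket_List, has_all, logical(1), items = c(123, 987))

d.f[sel_both, ]             # baskets holding both 123 and 987
d.f[sel_any & !sel_both, ]  # baskets holding exactly one of the two
```

Because the selectors are plain logical vectors, more complicated conditions can be built by combining them with &, | and ! before subsetting.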

Please be careful with your terminology. A data.frame is not a matrix. It's a type of list.

Speed-wise, sapply is not the fastest way to build the selector, but the subsetting step itself will be very fast since it is vectorized. If you need more speed, it's data.table time.
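
For what the data.table route might look like, here is a rough sketch on the long-format data from the question. It assumes the data.table package is installed; alt.dt, hits, ids_with_123 and ids_with_both are hypothetical names, and no timings are claimed.

```r
library(data.table)

# Long-format data from the question, one row per (ID, item) pair
ID <- c(1,2,2,3,3,4,5,5,6,7,7,8,8,9)
Basket <- c(NA,123,987,123,123,456,456,123,456,123,987,987,123,987)
alt.dt <- data.table(ID, Basket)   # alt.dt is a hypothetical name

# Keying sorts the table by Basket so that lookups can use binary search
setkey(alt.dt, Basket)

# All rows whose Basket is 123, then the distinct IDs that own such a row
hits <- alt.dt[.(123)]
ids_with_123 <- sort(unique(hits$ID))

# IDs whose baskets hold both 123 and 987: count distinct matching items per ID
ids_with_both <- sort(
  alt.dt[Basket %in% c(123, 987), .(n_items = uniqueN(Basket)), by = ID][n_items == 2, ID]
)
```

The key only pays off when the same table is queried repeatedly; for a one-off filter, alt.dt[Basket == 123] without a key is a reasonable starting point.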

2 Comments

Simon O'Hanlon Over a year ago

+1 nice solution. I had almost the exact same initial thought. I think the OP's second setup of data will be faster to subset than the first?

Ari B. Friedman Over a year ago

@SimonO101 I agree that the second setup of data will likely be much faster than the first. It's also much more amenable to the use of data.table, which will increase the speed substantially on a large dataset.

Simon O'Hanlon · Accepted Answer · 2013-04-29 07:54:08Z

An approach similar to @AriB's is to use the any function, applying it across rows, like so:

d.f[ apply( d.f , 1 , function(x) any(unlist(x) %in% 123) ) , ]
#  ID Basket_List
#2  2    123, 987
#3  3    123, 123
#5  5    456, 123
#7  7    123, 987
#8  8    987, 123

With the second set up of your data I imagine that it would be very fast, because you could simply subset like so:

df <- alt.d.f   # the long-format data from the question
df[ df$Basket %in% 123 , ]
#   ID Basket
#2   2    123
#4   3    123
#5   3    123
#8   5    123
#10  7    123
#13  8    123

And if you only want the first row for each ID that contains the Basket value, you can subsequently use match with the unique IDs, as match returns the first match of its first argument in its second:

df2 <- df[ df$Basket %in% 123 , ]
df2[ match( unique(df2$ID) , df2$ID),]
#   ID Basket
#2   2    123
#4   3    123
#8   5    123
#10  7    123
#13  8    123
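
For the "123" & "987" case from the question, the long format needs one extra step, because the two items sit on different rows. A minimal base R sketch, with ids_123, ids_987, ids_both, both and first_123 as hypothetical names:

```r
# Long-format data from the question
ID <- c(1,2,2,3,3,4,5,5,6,7,7,8,8,9)
Basket <- c(NA,123,987,123,123,456,456,123,456,123,987,987,123,987)
alt.d.f <- data.frame(ID, Basket)

# IDs that have at least one row for each item of interest
ids_123 <- unique(alt.d.f$ID[alt.d.f$Basket %in% 123])
ids_987 <- unique(alt.d.f$ID[alt.d.f$Basket %in% 987])
ids_both <- intersect(ids_123, ids_987)

# Every row belonging to a basket that holds both items
both <- alt.d.f[alt.d.f$ID %in% ids_both, ]

# duplicated() is an alternative to the match()/unique() idiom:
# keep only the first matching row per ID
with_123 <- alt.d.f[alt.d.f$Basket %in% 123, ]
first_123 <- with_123[!duplicated(with_123$ID), ]
```

Both steps stay vectorised, so they should scale in the same way as the single-item subset.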

The second setup of your data will be far faster than the first I think. In fact, let's do a rough benchmark with it on a 1 million row table:

DF <- data.frame( ID = sample(ID , 1e6 , replace = TRUE) , Basket = sample(Basket , 1e6 , replace = TRUE) )
df <- DF

system.time({
  df2 <- df[ df$Basket %in% 123 , ]
  df2[ match( unique(df2$ID) , df2$ID),]
})
#   user  system elapsed 
#   0.16    0.00    0.16 

nrow(df)
#[1] 1000000
nrow(df2)
#[1] 428187
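
To compare the two layouts on the same data, one option is to build the list form from the long form with split() and check that both select the same IDs. This is only a sketch: the seed, the sizes and the names long, baskets, ids_long and ids_list are arbitrary assumptions, and the timings will vary by machine, so none are shown.

```r
set.seed(1)   # arbitrary seed so the sketch is reproducible
n <- 1e5      # smaller than the 1e6 above, to keep the sketch quick

long <- data.frame(ID     = sample(1:2e4, n, replace = TRUE),
                   Basket = sample(c(123, 456, 987), n, replace = TRUE))

# Long format: a single vectorised comparison
t_long <- system.time(
  ids_long <- sort(unique(long$ID[long$Basket %in% 123]))
)

# List format: split() builds one basket vector per ID, then each is tested
baskets <- split(long$Basket, long$ID)
t_list <- system.time(
  sel <- vapply(baskets, function(b) 123 %in% b, logical(1))
)
ids_list <- as.integer(names(baskets)[sel])

# Both layouts should pick out exactly the same IDs
identical(ids_long, ids_list)
```

Printing t_long and t_list gives a rough feel for how much the per-basket function calls cost relative to the vectorised test.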

Nice. Wasn't sure what would happen when you apply'd across a list row, and was too lazy to try. Now I know :-)

Noel · Accepted Answer · 2017-05-30 07:27:21Z

1

A slightly more readable solution using the purrr & dplyr libraries (and the magrittr pipe operator) would be:

library(dplyr)
library(purrr)

# TRUE for each basket that includes 123; %in% works whether the
# baskets are stored as integer or double
d.f %>% filter(map_lgl(Basket_List, ~ 123 %in% .x))

edited May 30, 2017 at 7:27

answered May 30, 2017 at 7:14
