Subsetting based on values of a different data frame in R

Question

I want to subset a data frame, keeping a row only if every value in it is greater than the corresponding value in the same row of a different data frame. I also need to skip some top rows. These previous questions are related, but they did not help me:

Subsetting a data frame based on contents of another data frame

Subset data using information from a different data frame [r]

> A
     name1 name2
cond   trt  ctrl
hour     0     3
A        1     1
B       10     1
C        1     1
D        1     1
E       10    10
> B
     name1 name2
cond   trt  ctrl
hour     0     3
A        1     1
B        1    10
C        1     1
D        1     1
E        1     1

I want this: only the rows where ALL values in A are greater than in B:

     name1 name2
cond   trt  ctrl
hour     0     3
E       10    10

I've tried these 3 lines:

subset(A, TRUE, select=(A[3:7,] > B[3:7,]))
subset(A, A > B)
A[A[3:7,] > B[3:7,]]

Thanks so much. Here is the code to generate the data:

A <- structure(list(name1 = c("trt", "0", "1", "10", "1", "1", "10"
), name2 = c("ctrl", "3", "1", "1", "1", "1", "10")), .Names = c("name1", 
"name2"), row.names = c("cond", "hour", "A", "B", "C", "D", "E"
), class = "data.frame")
B <- structure(list(name1 = c("trt", "0", "1", "1", "1", "1", "1"), 
    name2 = c("ctrl", "3", "1", "10", "1", "1", "1")), .Names = c("name1", 
"name2"), row.names = c("cond", "hour", "A", "B", "C", "D", "E"
), class = "data.frame")

############# Follow-up question asked 2/28/13

Error when subsetting based on adjusted values of different data frame in R

The 'hour' row values are NOT greater in A than in B. Do you want to ignore that row?
– Carl Witthoft, Commented Feb 26, 2013 at 19:38

redmode · Accepted Answer · 2013-02-26 19:32:21Z

5

N <- nrow(A)
cond <- sapply(3:N, function(i) sum(A[i,] > B[i,])==2)
rbind(A[1:2,], subset(A[3:N,], cond))
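Every column of A and B is stored as character (the first two rows hold labels), so A[i,] > B[i,] compares strings. That gives the intended answer for this sample, but it would not for a pair such as "9" and "10". A vectorised sketch of the same idea follows, assuming rows 3 onward hold only numbers; to_num, Anum and Bnum are names introduced here for illustration, not part of the answer:

```r
A <- data.frame(name1 = c("trt", "0", "1", "10", "1", "1", "10"),
                name2 = c("ctrl", "3", "1", "1", "1", "1", "10"),
                row.names = c("cond", "hour", "A", "B", "C", "D", "E"),
                stringsAsFactors = FALSE)
B <- data.frame(name1 = c("trt", "0", "1", "1", "1", "1", "1"),
                name2 = c("ctrl", "3", "1", "10", "1", "1", "1"),
                row.names = c("cond", "hour", "A", "B", "C", "D", "E"),
                stringsAsFactors = FALSE)

# Hypothetical helper: turn the data rows of a data frame into a numeric matrix.
to_num <- function(df, rows = 3:nrow(df)) {
  chr <- as.matrix(df[rows, , drop = FALSE])   # character matrix, dimnames kept
  matrix(as.numeric(chr), nrow = nrow(chr), dimnames = dimnames(chr))
}
Anum <- to_num(A)
Bnum <- to_num(B)

# TRUE for rows where every column of A exceeds the same column of B.
keep <- rowSums(Anum > Bnum) == ncol(Anum)

# Put the two label rows back on top of the surviving data rows.
result <- rbind(A[1:2, ], A[3:nrow(A), ][keep, ])
print(result)
```

Comparing against ncol(Anum) rather than a literal 2 means the filter does not need changing if more columns are added.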


6 Comments

chimpsarehungry Over a year ago

This is very good. I see that you're including a function to apply to those values, but I am confused about the i. Does sapply use a loop or something? What I really like about this is that I can now change sum==2 to 1 if I want 50% of the values in A to be greater than in B.

agstudy Over a year ago

Side note: the use of sum is good, but it is a little bit tricky. What if you have a condition like A.name1 > A.name2 and A.name1 < B.name2?

chimpsarehungry Over a year ago

Very true agstudy. That is actually where I plan on going with this. Speed seems alright, redmode's took 20 seconds with my real data set. But you're right about a problem when adding more logic.

chimpsarehungry Over a year ago

When I do something like sum(A[i,] > 0.95*B[i,])==2, I get the error "Error in FUN(left, right) : non-numeric argument to binary operator". How can I require the values in A to be greater than 95% of those in B?

redmode Over a year ago

@chimpsarehungry: sapply is a loop-like operator: it passes each value from 3 to N to the anonymous function, which returns one logical value. All returned values are combined into a vector. i is the formal argument of the anonymous function; on each call it holds the value (from 3 to N) passed to the function.
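The "non-numeric argument to binary operator" error in the comment above arises because the columns of B are character vectors, so 0.95*B[i,] has nothing numeric to multiply. Here is a sketch of one way around it, assuming the two label rows are set aside and the remaining rows are converted to numeric; Anum and Bnum are illustrative names:

```r
A <- data.frame(name1 = c("trt", "0", "1", "10", "1", "1", "10"),
                name2 = c("ctrl", "3", "1", "1", "1", "1", "10"),
                row.names = c("cond", "hour", "A", "B", "C", "D", "E"),
                stringsAsFactors = FALSE)
B <- data.frame(name1 = c("trt", "0", "1", "1", "1", "1", "1"),
                name2 = c("ctrl", "3", "1", "10", "1", "1", "1"),
                row.names = c("cond", "hour", "A", "B", "C", "D", "E"),
                stringsAsFactors = FALSE)

# Numeric copies of the data rows; the two label rows are left out.
Anum <- A[-(1:2), ]
Anum[] <- lapply(Anum, as.numeric)
Bnum <- B[-(1:2), ]
Bnum[] <- lapply(Bnum, as.numeric)

# Same sapply pattern as the answer: i runs over the row indices,
# and each call returns one TRUE/FALSE for that row.
N <- nrow(Anum)
cond <- sapply(1:N, function(i) sum(Anum[i, ] > 0.95 * Bnum[i, ]) == 2)

# Label rows first, then the data rows that passed the 95% threshold.
result <- rbind(A[1:2, ], A[-(1:2), ][cond, ])
print(result)
```

Replacing sum(...) == 2 with mean(...) >= 0.5 would express the "at least half of the columns" variant mentioned earlier without hard-coding the column count.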


agstudy · Accepted Answer · 2013-02-26 19:50:49Z

3

I think it is better to use SQL for this kind of inter-table filtering. It is clean and readable (you keep the rule logic visible).

library(sqldf)
sqldf('SELECT DISTINCT A.*
       FROM A, B
       WHERE A.name1 > B.name1
       AND   A.name2 > B.name2')
  name1 name2
1   trt  ctrl
2    10    10
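Note that FROM A,B without a join condition pairs every row of A with every row of B, and that > on character columns is a string comparison, which is why the trt/ctrl row appears in the output. The sketch below does the same join-then-filter on matching rows only. It is written with base R's merge so that it runs without extra packages, and it assumes the label rows are dropped and the row name is carried as an explicit key; prep, Ad, Bd and id are illustrative names:

```r
A <- data.frame(name1 = c("trt", "0", "1", "10", "1", "1", "10"),
                name2 = c("ctrl", "3", "1", "1", "1", "1", "10"),
                row.names = c("cond", "hour", "A", "B", "C", "D", "E"),
                stringsAsFactors = FALSE)
B <- data.frame(name1 = c("trt", "0", "1", "1", "1", "1", "1"),
                name2 = c("ctrl", "3", "1", "10", "1", "1", "1"),
                row.names = c("cond", "hour", "A", "B", "C", "D", "E"),
                stringsAsFactors = FALSE)

# Hypothetical helper: keep the numeric data rows and add the row name as a key.
prep <- function(df) {
  d <- df[-(1:2), ]
  d[] <- lapply(d, as.numeric)
  d$id <- rownames(d)
  d
}
Ad <- prep(A)
Bd <- prep(B)

# Join on id so each row of A meets only its same-named row of B.
AB <- merge(Ad, Bd, by = "id", suffixes = c(".A", ".B"))

# The rule stays visible: one named condition per column.
result <- AB[AB$name1.A > AB$name1.B & AB$name2.A > AB$name2.B,
             c("id", "name1.A", "name2.A")]
print(result)
```

An SQL version of this would add the same kind of id column to both tables and join on it before applying the WHERE conditions.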


5 Comments

Carl Witthoft Over a year ago

"Clean and readable"? What kind of software jock are you anyway? :-)

agstudy Over a year ago

@CarlWitthoft sorry for my English. By "clean" I mean that this solution doesn't hide the original condition, the way using sum or prod does in the 2 other solutions. I also think it is faster than sapply (maybe slower than the data.table solution, which has the same logic). Anyway, I am a .NET programmer who is trying to learn R these days. (I am not sure I understand "software jock" very well.)

chimpsarehungry Over a year ago

He's just joking, saying your solution is too human-readable. I like it though. I will have to get that library.

agstudy Over a year ago

@CarlWitthoft oops! Thanks. I can't learn R and "English second-level jokes" at the same time :)

Carl Witthoft Over a year ago

@agstudy no offense taken. In fact, I think you've inspired me to take a look at sqldf

Ricardo Saporta · Accepted Answer · 2013-02-26 19:56:23Z

3

requisite data.table solution:

library(data.table)

# just to preserve the order, non-alphabetically
idsA <- factor(rownames(A), levels=rownames(A))
idsB <- factor(rownames(B), levels=rownames(B))

# convert to data.table with id
ADT <- data.table(id=idsA, A, key="id")
BDT <- data.table(id=idsB, B, key="id")

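# filter as needed
# (current data.table versions prefix the clashing columns from BDT with i.;
#  older versions named them name1.1 and name2.1)
ADT[BDT][name1 > i.name1 & name2 > i.name2, list(id, name1, name2)]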


3 Comments

chimpsarehungry Over a year ago

I love when I get such a variety of solutions. Thanks Ricardo

chimpsarehungry Over a year ago

Is there a way to make the filtering part not rely on the names of the columns, and just filter based on the respective positions of the columns in A and B?

Ricardo Saporta Over a year ago

Yep, you can filter as you would a data.frame, but at the end add , with=FALSE]

Carl Witthoft · Accepted Answer · 2013-02-26 19:41:33Z

2

If I rename your matrices amat and bmat, then

amat[which(sapply(1:nrow(amat),function(x) prod(amat[x,]>bmat[x,]))==1),]
[1] 10 10

And you can paste the 'hours' row back on if desired.
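prod() over a logical vector acts as a logical AND here, since any FALSE contributes a zero. Below is a sketch of the same filter written with all(), keeping the row names. amat and bmat are built from the numeric rows of the question's data, which is an assumption about how the matrices were set up:

```r
amat <- matrix(c(1, 10, 1, 1, 10,
                 1,  1, 1, 1, 10),
               ncol = 2,
               dimnames = list(c("A", "B", "C", "D", "E"), c("name1", "name2")))
bmat <- matrix(c(1,  1, 1, 1, 1,
                 1, 10, 1, 1, 1),
               ncol = 2, dimnames = dimnames(amat))

# One logical per row: TRUE only if every column of amat exceeds bmat.
keep <- apply(amat > bmat, 1, all)

# drop = FALSE stops R from collapsing a single matching row to a bare vector,
# so the row name survives.
result <- amat[keep, , drop = FALSE]
print(result)
```

With the row names intact, the surviving rows can be looked up in the original data frame by name, for example when adding the label rows back on.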


2 Comments

Carl Witthoft Over a year ago

Which is basically the same thing @redmode did, so I consider myself ninja'd in time but not in obfuscation.

chimpsarehungry Over a year ago

You lost my alphabet Carl.
