I want to subset a data frame, keeping a row only if every value in it is greater than the corresponding value in the same row of a different data frame. I also need to skip some top rows. These previous questions are related, but they did not help me:

Subsetting a data frame based on contents of another data frame

Subset data using information from a different data frame [r]

> A
     name1 name2
cond   trt  ctrl
hour     0     3
A        1     1
B       10     1
C        1     1
D        1     1
E       10    10
> B
     name1 name2
cond   trt  ctrl
hour     0     3
A        1     1
B        1    10
C        1     1
D        1     1
E        1     1

I want this. Only rows where ALL values were greater in A than B:

     name1 name2
cond   trt  ctrl
hour     0     3
E       10    10

I've tried these 3 lines:

subset(A, TRUE, select=(A[3:7,] > B[3:7,]))
subset(A, A > B)
A[A[3:7,] > B[3:7,]]

Thanks so much. Here is the code to generate the data:

A <- structure(list(name1 = c("trt", "0", "1", "10", "1", "1", "10"
), name2 = c("ctrl", "3", "1", "1", "1", "1", "10")), .Names = c("name1", 
"name2"), row.names = c("cond", "hour", "A", "B", "C", "D", "E"
), class = "data.frame")
B <- structure(list(name1 = c("trt", "0", "1", "1", "1", "1", "1"), 
    name2 = c("ctrl", "3", "1", "10", "1", "1", "1")), .Names = c("name1", 
"name2"), row.names = c("cond", "hour", "A", "B", "C", "D", "E"
), class = "data.frame")
############# Follow-up question asked 2/28/13

Error when subsetting based on adjusted values of different data frame in R

2
  • the 'hours' row values are NOT greater than. Do you want to ignore that row? Commented Feb 26, 2013 at 19:38
  • Yes I want to ignore the hour and cond categories Commented Feb 26, 2013 at 19:39

4 Answers

5
N <- nrow(A)
cond <- sapply(3:N, function(i) sum(A[i,] > B[i,])==2)
rbind(A[1:2,], subset(A[3:N,], cond))

6 Comments

This is very good. I see that you're including a function to apply to those values, but I am confused about the i. Does sapply use a loop or something? What I really like about this is that I can now change sum==2 to 1 if I want 50% of the values in A to be greater than B.
Side note: the use of sum is good, but it is a little bit tricky. What if you have a condition like A.name1 > A.name2 and A.name1 < B.name2?
Very true agstudy. That is actually where I plan on going with this. Speed seems alright, redmode's took 20 seconds with my real data set. But you're right about a problem when adding more logic.
When I do something like sum(A[i,] > 0.95*B[i,])==2, I get the error: Error in FUN(left, right) : non-numeric argument to binary operator. How can I require the values in A to be greater than 95% of B?
@chimpsarehungry: sapply is a loop-like operator. It passes each value from 3 to N to the anonymous function, which returns one logical value, and all returned values are combined into a vector. i is the formal argument of the anonymous function; on each call it holds the value (from 3 to N) passed to the function.
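On the 0.95 error above: every column in A and B is a character vector (the cond and hour rows force that), so multiplying by 0.95 fails, and a plain > compares strings rather than numbers. Here is a minimal sketch of one way around it, assuming rows 3 onward contain only numeric text. The helper row_exceeds and its frac argument are made-up names for illustration, not part of the answer above.

```r
# Rebuild the question's data; every column is a character vector.
A <- data.frame(name1 = c("trt", "0", "1", "10", "1", "1", "10"),
                name2 = c("ctrl", "3", "1", "1", "1", "1", "10"),
                row.names = c("cond", "hour", "A", "B", "C", "D", "E"),
                stringsAsFactors = FALSE)
B <- data.frame(name1 = c("trt", "0", "1", "1", "1", "1", "1"),
                name2 = c("ctrl", "3", "1", "10", "1", "1", "1"),
                row.names = c("cond", "hour", "A", "B", "C", "D", "E"),
                stringsAsFactors = FALSE)

N <- nrow(A)

# Hypothetical helper: compare row i of A against frac * row i of B,
# after converting both rows from character to numeric.
row_exceeds <- function(i, frac = 1) {
  a <- as.numeric(unlist(A[i, ]))
  b <- as.numeric(unlist(B[i, ]))
  sum(a > frac * b) == length(a)
}

# Extra arguments to sapply() are passed on to the function.
cond <- sapply(3:N, row_exceeds, frac = 0.95)

# Keep the two header rows, then the data rows that passed the test.
result <- rbind(A[1:2, ], A[3:N, ][cond, ])
print(result)
```

With frac = 1 this reduces to the strict "greater than" test from the question. Comparing against length(a) instead of a hard-coded 2 keeps the test valid if more columns are added.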
3

I think it is better to use SQL for this kind of inter-table filtering. It is clean and readable (you keep the logic of the rules visible).

library(sqldf)
sqldf('SELECT DISTINCT A.*
        FROM A,B
        WHERE A.name1   > B.name1
        AND    A.name2  > B.name2')
  name1 name2
1   trt  ctrl
2    10    10

5 Comments

"Clean and readable"? What kind of software jock are you anyway? :-)
@CarlWitthoft sorry for my English. By clean I mean that this solution doesn't hide the original condition, the way using sum or prod does in the 2 other solutions. I also think it is faster than sapply (maybe slower than the data.table solution, which has the same logic). Anyway, I am a .NET programmer who is trying to learn R these days. (I am not sure I understand "software jock" very well.)
He's just joking, saying your solution is too human-readable. I like it though. I will have to get that library.
@CarlWitthoft Oops! Thanks. I can't learn R and "English second-level jokes" at the same time :)
@agstudy no offense taken. In fact, I think you've inspired me to take a look at sqldf
3

requisite data.table solution:

library(data.table)

# just to preserve the order, non-alphabetically
idsA <- factor(rownames(A), levels=rownames(A))
idsB <- factor(rownames(B), levels=rownames(B))

# convert to data.table with id
ADT <- data.table(id=idsA, A, key="id")
BDT <- data.table(id=idsB, B, key="id")

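# filter as needed
# (current data.table versions prefix the joined-in B columns with "i.";
#  older versions named them name1.1 and name2.1)
ADT[BDT][name1 > i.name1 & name2 > i.name2, list(id, name1, name2)]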

3 Comments

I love when I get such a variety of solutions. Thanks Ricardo
Is there a way to make that filtering part not rely on having the names of the columns. Just filter based on respective location of columns in A and B.
Yep, you can filter as you would a data.frame, but at the end add , with=FALSE]
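On filtering by column position rather than by name: here is a rough base R sketch (not data.table, and not the answerer's method) that pairs the columns of A and B by position with Map and combines the per-column tests with Reduce. It assumes both data frames have the same number of columns in the same order, and that rows 3 onward hold numeric text. The names rows, col_tests and keep are only illustrative.

```r
# Rebuild the question's data; every column is a character vector.
A <- data.frame(name1 = c("trt", "0", "1", "10", "1", "1", "10"),
                name2 = c("ctrl", "3", "1", "1", "1", "1", "10"),
                row.names = c("cond", "hour", "A", "B", "C", "D", "E"),
                stringsAsFactors = FALSE)
B <- data.frame(name1 = c("trt", "0", "1", "1", "1", "1", "1"),
                name2 = c("ctrl", "3", "1", "10", "1", "1", "1"),
                row.names = c("cond", "hour", "A", "B", "C", "D", "E"),
                stringsAsFactors = FALSE)

rows <- 3:nrow(A)

# Map() walks the columns of both data frames in parallel, so the
# 1st column of A is tested against the 1st column of B, and so on.
col_tests <- Map(function(a, b) as.numeric(a) > as.numeric(b),
                 A[rows, , drop = FALSE],
                 B[rows, , drop = FALSE])

# A row is kept only if it passed the test in every column.
keep <- Reduce(`&`, col_tests)

result <- rbind(A[1:2, ], A[rows[keep], ])
print(result)
```

No column name appears anywhere in the filter, so the same lines should work unchanged for wider tables as long as the column order matches.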
2

If I rename your matrices amat and bmat, then

amat[which(sapply(1:nrow(amat),function(x) prod(amat[x,]>bmat[x,]))==1),]
[1] 10 10

And you can paste the 'hours' row back on if desired.
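The answer does not show how amat and bmat are built, so here is a small sketch that assumes they are numeric matrices holding only the data rows A to E. It swaps in apply(..., 1, all) for the prod trick, and uses drop = FALSE so that a single matching row keeps its row name instead of collapsing to a bare vector.

```r
# Assumed construction: numeric matrices of the data rows only.
# matrix() fills column by column, so the first five values are name1.
amat <- matrix(c(1, 10, 1, 1, 10,
                 1,  1, 1, 1, 10),
               ncol = 2,
               dimnames = list(c("A", "B", "C", "D", "E"),
                               c("name1", "name2")))
bmat <- matrix(c(1,  1, 1, 1, 1,
                 1, 10, 1, 1, 1),
               ncol = 2,
               dimnames = list(c("A", "B", "C", "D", "E"),
                               c("name1", "name2")))

# TRUE for rows where every value in amat exceeds the value in bmat.
keep <- apply(amat > bmat, 1, all)

# drop = FALSE keeps the result a matrix even when one row matches,
# so the row name survives.
kept <- amat[keep, , drop = FALSE]
print(kept)
```

Without drop = FALSE, a single-row result is simplified to a vector and the row label is lost.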

2 Comments

Which is basically the same thing @redmode did, so I consider myself ninja'd in time but not in obfuscation.
You lost my alphabet Carl.
