Subsetting dataframes based on column values in R

Question

Given a dataframe, for example:

a <- c(1:3,4:6)
b <- c(2:4,3,2,1)
c <- cbind(a,b)

I would like to subset the dataframe by removing rows that contain the same pair of values in a different order (e.g., row 3: 3,4 is the same as row 4: 4,3), keeping only one of each such pair.

c is a function in R and should never be used as a variable name.
– dayne, Commented Sep 19, 2013 at 20:58
Can you share what you've tried thus far? This is a fairly basic question that has probably been answered before. You'll find you get much better answers if you not only give some data, but also share the steps you've taken toward solving the problem on your own.
– Justin, Commented Sep 19, 2013 at 21:01
Sorry. My dataframe is huge (90M rows). I used the following steps to subset the dataframe.
– Ram, Commented Sep 19, 2013 at 21:12
@dayne, I agree with your comment, but df is also a function name, as is data and many other names that are commonly used as variables. To my knowledge, all the answers here would work whether the object is named "c" or your other favorite letter of the alphabet. That said, Ram should heed your warning if only because by using "c", he will also be making his code less readable.
– A5C1D2H2I1M1N2O1R2T1, Commented Sep 20, 2013 at 5:35
Thank you for all your suggestions. I will stop using function names as variable names in my R scripts.
– Ram, Commented Sep 20, 2013 at 14:07

dayne · Accepted Answer · 2013-09-19 21:05:45Z

3

a <- c(1:3,4:6)
b <- c(2:4,3,2,1)
d <- cbind(a,b)
e <- t(apply(d,1,function(x){x[order(x)]}))
d <- d[!duplicated(e),]

> d
     a b
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 5 2
[5,] 6 1


Ari B. Friedman · Accepted Answer · 2013-09-20 10:30:59Z

Assuming d is your matrix, not c:

e <- unique(apply(d,1,function(x) paste(sort(x),collapse="~")))
> t(sapply(strsplit(e,"~"),as.numeric))
     [,1] [,2]
[1,]    1    2
[2,]    2    3
[3,]    3    4
[4,]    2    5
[5,]    1    6

Breaking it down:

First line

apply(d,1,function(x) ... ) takes each row of d and passes it as a vector x to the anonymous function whose body I've called ... here.

The function body is paste(sort(x),collapse="~"), which sorts the vector and then turns it into a length-one vector with each element separated by a ~.

So the apply call overall is going to return a character vector where each element used to be a row of the matrix.

Then unique keeps only the unique elements. The sorting ensures that this does what we want it to.
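To make the first line concrete, here is a small sketch that stores the intermediate result before calling unique. It assumes d is built from the question's a and b vectors; the name keys is only illustrative and does not appear in the answer.

```r
a <- c(1:3, 4:6)
b <- c(2:4, 3, 2, 1)
d <- cbind(a, b)

# One string per row: the row's values sorted, then joined with "~"
keys <- apply(d, 1, function(x) paste(sort(x), collapse = "~"))
keys
# "1~2" "2~3" "3~4" "3~4" "2~5" "1~6"

# Rows 3 and 4 now share the key "3~4", so unique() drops one of them
e <- unique(keys)
length(e)
# 5
```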

Second line

strsplit(e,"~") splits our character vector back into a separated form. In this case, it's a list where each element is a character vector of the numbers that comprise each row.

sapply(...,as.numeric) applies as.numeric() to each element of the list. So we convert the character vector back to a numeric vector. Since the s in sapply stands for "simplify," it will create a matrix from this.

But it's the wrong direction (2x5 instead of 5x2)! t() transposes the matrix to the original form.
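The second line can also be run one step at a time. This is only a sketch: e is typed in by hand here to match the result of the first line, and parts, m and res are illustrative names.

```r
e <- c("1~2", "2~3", "3~4", "2~5", "1~6")

parts <- strsplit(e, "~")        # list of 5 character vectors, each of length 2
m <- sapply(parts, as.numeric)   # simplified to a numeric matrix, one column per key
dim(m)
# 2 5

res <- t(m)                      # transpose back to one row per pair
dim(res)
# 5 2
```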

+1 but you should really explain what this does step by step because it is probably not obvious for someone who's not very familiar with R
Just a (possibly) minor thing. This works but does change the order of a and b if a>b... which may be unwanted.
You're right. It also assumes everything's numeric. To avoid the first problem, you can use !duplicated instead of unique (because then you can use the logical vector to select out of the original matrix). I believe this is what @dayne's solution now does.
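A minimal sketch of the !duplicated variant mentioned in the last comment, assuming the same example matrix d. The names keys and keep are illustrative; the point is that the logical vector indexes the original matrix, so the surviving rows keep their original column order and column names.

```r
d <- cbind(a = c(1:3, 4:6), b = c(2:4, 3, 2, 1))

# Order-independent key per row, as in the answer
keys <- apply(d, 1, function(x) paste(sort(x), collapse = "~"))

# TRUE for the first occurrence of each key, FALSE for later repeats
keep <- !duplicated(keys)

# Select from the original matrix; drop = FALSE keeps a matrix even if one row remains
res <- d[keep, , drop = FALSE]
res
```

With this form, the row 5,2 stays as 5,2 rather than being returned as 2,5.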

wotuzu17 · Accepted Answer · 2013-09-19 21:22:17Z

1

In your example, c is not a data.frame but a matrix. c shouldn't be used as a variable name, as others have stated.

In one line, you can do:
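A quick way to check the matrix versus data.frame point, as a sketch. The names m and df are illustrative and chosen to avoid reusing c.

```r
a <- c(1:3, 4:6)
b <- c(2:4, 3, 2, 1)

m <- cbind(a, b)        # cbind() on two vectors gives a matrix
is.matrix(m)
# TRUE
is.data.frame(m)
# FALSE

df <- data.frame(a, b)  # an actual data frame with the same contents
is.data.frame(df)
# TRUE
```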

a <- c(1:3,4:6)
b <- c(2:4,3,2,1)
cc <- cbind(a,b)
cc[!duplicated(t(apply(cc,1,sort))), ]
     a b
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 5 2
[5,] 6 1
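The question mentions a data frame and, in a comment, about 90M rows. As a hedged alternative that is not part of any answer here: when there are exactly two numeric columns, pmin and pmax can build the sorted pair without a row-wise apply. This is a sketch only; df, lo, hi and keep are hypothetical names, and no timing on large data is claimed.

```r
# Hypothetical data frame built from the question's two vectors
df <- data.frame(a = c(1:3, 4:6), b = c(2:4, 3, 2, 1))

# Order-independent pair: smaller value first, larger value second
lo <- pmin(df$a, df$b)
hi <- pmax(df$a, df$b)

# Keep the first occurrence of each unordered pair
keep <- !duplicated(data.frame(lo, hi))
res <- df[keep, ]
res
```

As with the one-liner above, the rows are taken from the original object, so a and b keep their original order within each row.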
