Given a data frame, for example:

a <- c(1:3,4:6)
b <- c(2:4,3,2,1)
c <- cbind(a,b)

I would like to subset the data frame by removing rows that contain the same pair of values in a different order (e.g. row 3: 3,4 is the same as row 4: 4,3), keeping only one row from each such pair.

  • c is a function in R and should never be used as a variable name. Commented Sep 19, 2013 at 20:58
  • Can you share what you've tried thus far? This is a fairly basic question that has probably been answered before. You'll find you get much better answers if you not only give some data, but also share the steps you've taken toward solving the problem on your own. Commented Sep 19, 2013 at 21:01
  • Sorry. My dataframe is huge (90M rows). I used the following steps to subset the dataframe. Commented Sep 19, 2013 at 21:12
  • @dayne, I agree with your comment, but df is also a function name, as is data and many other names that are commonly used as variables. To my knowledge, all the answers here would work whether the object is named "c" or your other favorite letter of the alphabet. That said, Ram should heed your warning if only because by using "c", he will also be making his code less readable. Commented Sep 20, 2013 at 5:35
  • Thank you for all your suggestions. I will stop using function names as variable names in my R scripts. Commented Sep 20, 2013 at 14:07

3 Answers

3
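a <- c(1:3,4:6)
b <- c(2:4,3,2,1)
d <- cbind(a,b)
# sort the values within each row, so (3,4) and (4,3) become identical
e <- t(apply(d,1,function(x){x[order(x)]}))
# keep only the first occurrence of each sorted row
d <- d[!duplicated(e),]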

> d
     a b
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 5 2
[5,] 6 1

2

Assuming d is your matrix, not c:

e <- unique(apply(d,1,function(x) paste(sort(x),collapse="~")))
> t(sapply(strsplit(e,"~"),as.numeric))
     [,1] [,2]
[1,]    1    2
[2,]    2    3
[3,]    3    4
[4,]    2    5
[5,]    1    6

Breaking it down:

First line

apply(d,1,function(x) ... ) takes each row of d and passes it as a vector x to the anonymous function whose body I've called ... here.

The function body is paste(sort(x),collapse="~"), which sorts the vector and then turns it into a length-one vector with each element separated by a ~.

So the apply call overall is going to return a character vector where each element used to be a row of the matrix.

Then unique keeps only the unique elements. The sorting ensures that this does what we want it to.

Second line

strsplit(e,"~") splits our character vector back into a separated form. In this case, it's a list where each element is a character vector of the numbers that comprise each row.

sapply(...,as.numeric) applies as.numeric() to each element of the list. So we convert the character vector back to a numeric vector. Since the s in sapply stands for "simplify," it will create a matrix from this.

But it's the wrong direction (2x5 instead of 5x2)! t() transposes the matrix to the original form.
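
The breakdown above can be followed step by step with the intermediate values kept in their own variables. This is only a sketch on the question's six-row example; the names keys, parts, m and result are introduced here for illustration and do not appear in the answer.

```r
# Rebuild the example matrix from the question (named d, as in this answer).
a <- c(1:3, 4:6)
b <- c(2:4, 3, 2, 1)
d <- cbind(a, b)

# First line, step 1: one sorted, "~"-joined key per row.
keys <- apply(d, 1, function(x) paste(sort(x), collapse = "~"))
print(keys)
# Rows 3 and 4 both become "3~4", which is what lets unique() merge them.

# First line, step 2: drop the repeated keys.
e <- unique(keys)

# Second line: split each key, convert back to numbers, then transpose.
parts  <- strsplit(e, "~")          # list of character vectors
m      <- sapply(parts, as.numeric) # 2 x 5 matrix, one column per key
result <- t(m)                      # 5 x 2 matrix, one row per unique pair
print(result)
```

Note that result holds the sorted pairs rather than the original rows, so the row that was 5,2 in d comes back as 2,5.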

4 Comments

+1 but you should really explain what this does step by step because it is probably not obvious for someone who's not very familiar with R
Just a (possibly) minor thing. This works but does change the order of a and b if a>b... which may be unwanted.
You're right. It also assumes everything's numeric. To avoid the first problem, you can use !duplicated instead of unique (because then you can use the logical vector to select out of the original matrix). I believe this is what @dayne's solution now does.
@nico Went and step-by-stepped it.
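
The !duplicated suggestion from the comments might look like the following. This is a sketch rather than the answerer's own code: the names keys and kept are made up for the example, and it has only been reasoned through on the six-row matrix from the question.

```r
# Example matrix from the question.
a <- c(1:3, 4:6)
b <- c(2:4, 3, 2, 1)
d <- cbind(a, b)

# Same sorted, "~"-joined key per row as in the answer above.
keys <- apply(d, 1, function(x) paste(sort(x), collapse = "~"))

# duplicated() flags every repeat of an earlier key; negating it gives a
# logical vector that selects rows straight out of the original matrix.
kept <- d[!duplicated(keys), , drop = FALSE]
print(kept)
```

Because the rows are taken from d itself, the values stay in their original columns (5,2 stays 5,2) and no as.numeric() conversion is needed.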
1

In your example, c is not a data.frame but a matrix. Also, c shouldn't be used as a variable name, as others have stated.

In one line, you can do:

a <- c(1:3,4:6)
b <- c(2:4,3,2,1)
cc <- cbind(a,b)
cc[!duplicated(t(apply(cc,1,sort))), ]
     a b
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 5 2
[5,] 6 1
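
For the two-column case specifically, the per-row sort can be replaced by pmin() and pmax(), which work on whole columns at once. This is an alternative sketch, not part of the answer above; the names lo, hi and res are illustrative, and whether it is fast enough for the 90M rows mentioned in the comments has not been tested here.

```r
# Example matrix from the question.
a <- c(1:3, 4:6)
b <- c(2:4, 3, 2, 1)
cc <- cbind(a, b)

# Element-wise smaller and larger value of each row (two columns only).
lo <- pmin(cc[, "a"], cc[, "b"])
hi <- pmax(cc[, "a"], cc[, "b"])

# duplicated() on a matrix compares whole rows, so (lo, hi) pairs that
# repeat mark rows of cc that are the same pair in a different order.
res <- cc[!duplicated(cbind(lo, hi)), , drop = FALSE]
print(res)
```

On the example this selects the same rows as the apply(cc,1,sort) version, but it only generalizes to more columns if the sort step is brought back.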
