Subset dataframe by multiple logical conditions of rows to remove

Question

I would like to subset (filter) a dataframe by specifying which rows not (!) to keep in the new dataframe. Here is a simplified sample dataframe:

data
v1 v2 v3 v4
a  v  d  c
a  v  d  d
b  n  p  g
b  d  d  h    
c  k  d  c    
c  r  p  g
d  v  d  x
d  v  d  c
e  v  d  b
e  v  d  c

For example, if a row of column v1 has a "b", "d", or "e", I want to get rid of that row of observations, producing the following dataframe:

v1 v2 v3 v4
a  v  d  c
a  v  d  d
c  k  d  c    
c  r  p  g

I have been successful at subsetting based on one condition at a time. For example, here I remove rows where v1 contains a "b":

sub.data <- data[data[ , 1] != "b", ]

However, I have many, many such conditions, so doing it one at a time is not desirable. I have not been successful with the following:

sub.data <- data[data[ , 1] != c("b", "d", "e"), ]

or

sub.data <- subset(data, data[ , 1] != c("b", "d", "e"))

I've tried some other things as well, like !%in%, but that doesn't seem to exist. Any ideas?

chl · Accepted Answer · 2011-06-05 16:37:11Z

49

Try this

subset(data, !(v1 %in% c("b","d","e")))
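The question mentions that `!%in%` does not exist. One common workaround is to define a negated membership operator yourself. The following is a minimal sketch on the question's sample data; the operator name `%notin%` is an assumption chosen here for illustration.

```r
# Sample data from the question
data <- data.frame(
  v1 = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
  v2 = c("v", "v", "n", "d", "k", "r", "v", "v", "v", "v"),
  v3 = c("d", "d", "p", "d", "d", "p", "d", "d", "d", "d"),
  v4 = c("c", "d", "g", "h", "c", "g", "x", "c", "b", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (name chosen for illustration): logical negation of %in%
`%notin%` <- Negate(`%in%`)

# Keep only the rows whose v1 is not one of the unwanted values
kept <- subset(data, v1 %notin% c("b", "d", "e"))
kept$v1
```

This behaves the same as writing `!(v1 %in% ...)` directly; the helper only saves the extra parentheses.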

answered Jun 5, 2011 at 16:37

chl

3 Comments

Jota Over a year ago

Nice and simple, thanks. I'm not sure which solution I like better, this one or the one provided by Andrie. They are both easy and effective. All three solutions work for me, and I have never used which() before. So, it was nice to be introduced to that function.

Andrie Over a year ago

If it helps you to make up your mind as to whether to use subset or [, have a look at the warning in the help for ?subset: "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences."

chl Over a year ago

@Andrie Thanks for adding clarification.
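To illustrate the warning quoted from ?subset, here is a sketch of how the same filter might look inside a function using `[` rather than subset. The function name `drop_rows` and its arguments are hypothetical, introduced only for this example.

```r
# Hypothetical helper: drop rows whose value in column `col` is in `values`.
# The column is passed as a string, so no non-standard evaluation is involved.
drop_rows <- function(df, col, values) {
  df[!(df[[col]] %in% values), , drop = FALSE]
}

df <- data.frame(v1 = c("a", "b", "c", "d"), v2 = 1:4,
                 stringsAsFactors = FALSE)

res <- drop_rows(df, "v1", c("b", "d"))
res
```

Passing the column name as a string and indexing with `[[` avoids relying on how subset looks up variable names.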

Andrie · Accepted Answer · 2011-06-05 17:30:20Z

42

The ! should be around the outside of the statement:

data[!(data$v1 %in% c("b", "d", "e")), ]

  v1 v2 v3 v4
1  a  v  d  c
2  a  v  d  d
5  c  k  d  c
6  c  r  p  g

edited Jun 5, 2011 at 17:30

answered Jun 5, 2011 at 16:42

Andrie

Jota · Accepted Answer · 2014-02-01 18:03:16Z

10

You can also accomplish this by writing a separate logical condition for each value and joining the conditions with &.

subset(my.df, my.df$v1 != "b" & my.df$v1 != "d" & my.df$v1 != "e")

This is not elegant and takes more code but might be more readable to newer R users. As pointed out in a comment above, subset is a "convenience" function that is best used when working interactively.

edited Feb 1, 2014 at 18:03

Jota

answered Dec 8, 2011 at 17:44

N Brouwer

4 Comments

Ben Bolker Over a year ago

shouldn't those be | rather than & ?

Jota Over a year ago

@BenBolker If you change to |, you get the same data as were put in.

coip Over a year ago

@Frank Can you explain the logic of & paired with != here? Like Ben, it seems like | should be used, but you're right that it shouldn't. I'm especially confused about subsetting multiple columns that way. For example, using Herman's sample data above, to remove all cases of "b" from v1 and all of "n" from v2, I would think that my.df[my.df$v1 != "b" & my.df$v2 != "n",] would only remove cases that met both of those criteria (i.e. only Row 3), rather than either of those criteria (i.e. both Row 3 and Row 4). In fact, using | with != does what I expect & to do, but I don't get why.

Jota Over a year ago

With | a single TRUE result among any of the conditions will cause the whole statement to evaluate to TRUE. All the conditions must evaluate to FALSE for the statement to evaluate to FALSE. With & a single FALSE condition will make the whole statement evaluate to FALSE. If you want to use or, you can use exclusive or: xor like so: subset(my.df, xor(xor(my.df$v1 != "b", my.df$v1 != "d"), my.df$v1 != "e")).
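The & versus | question in these comments comes down to De Morgan's law: "not (b or d or e)" is the same as "not b and not d and not e". A small sketch on a plain character vector (the variable names are illustrative only):

```r
v1 <- c("a", "b", "c", "d", "e")

# TRUE for the values we want to remove
drop_any <- v1 == "b" | v1 == "d" | v1 == "e"

# TRUE for the values we want to keep: every != condition must hold
keep_and <- v1 != "b" & v1 != "d" & v1 != "e"

# Always TRUE: no single value can equal "b", "d" and "e" at once,
# so at least one of the != conditions holds for every element
keep_or <- v1 != "b" | v1 != "d" | v1 != "e"

identical(keep_and, !drop_any)  # the two formulations agree
all(keep_or)                    # the | version filters nothing
```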

Sacha Epskamp · Accepted Answer · 2011-06-05 17:21:07Z

This answer is meant to explain why rather than how. The '==' operator in R is vectorized in the same way as the '+' operator: it compares the elements on the left side with the elements on the right side, element by element. For example:

> 1:3 == 1:3
[1] TRUE TRUE TRUE

Here the first test is 1==1, which is TRUE, the second is 2==2, and the third is 3==3. Notice that the following returns FALSE for the first and third elements because the order differs:

> 3:1 == 1:3
[1] FALSE  TRUE FALSE

Now, if one object is shorter than the other, the shorter object is repeated (recycled) as many times as it takes to match the longer one. If the length of the longer object is not a multiple of the length of the shorter object, you get a warning. For example:

>  1:2 == 1:3
[1]  TRUE  TRUE FALSE
Warning message:
In 1:2 == 1:3 :
  longer object length is not a multiple of shorter object length

Here the first match is 1==1, then 2==2, and finally 1==3 (FALSE) because the left side is smaller. If one of the sides is only one element then that is repeated:

> 1:3 == 1
[1]  TRUE FALSE FALSE

The correct operator to test whether an element is in a vector is indeed '%in%', which is vectorized only over its left operand: each element of the left vector is tested for membership in the right vector.

Alternatively, you can use '&' to combine two logical statements. '&' takes two logical vectors and checks elementwise whether both are TRUE:

> 1:3 == 1 & 1:3 != 2
[1]  TRUE FALSE FALSE
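Applied to the question, the recycling rule explains why comparing the column against c("b", "d", "e") with != does not work: the comparison is positional, not a membership test. A sketch using the v1 column from the question (variable names are illustrative):

```r
v1  <- c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e")
bad <- c("b", "d", "e")

# What != effectively compares against: `bad` recycled to the length of v1
recycled <- rep(bad, length.out = length(v1))

# suppressWarnings() hides the "not a multiple" warning (10 is not a multiple of 3)
positional <- suppressWarnings(v1 != bad)

identical(positional, v1 != recycled)  # same result as the explicit recycling

v1[positional]       # filtered by position, so some unwanted values survive
v1[!(v1 %in% bad)]   # membership test gives the intended values
```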

Dason · Accepted Answer · 2012-09-04 03:14:40Z

5

data <- data[-which(data[,1] %in% c("b","d","e")),]

edited Sep 4, 2012 at 3:14

Dason

answered Sep 4, 2012 at 2:46

paul c

1 Comment

A5C1D2H2I1M1N2O1R2T1 Over a year ago

-which is evil and will yield unexpected results in cases where none of the values in the vector to match against are in the source vector.
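The pitfall described in this comment can be reproduced with a small data frame in which none of the unwanted values occur. This is a sketch; the data frame and variable names are made up for illustration.

```r
# No row has v1 equal to "b", "d" or "e"
df <- data.frame(v1 = c("a", "a", "c"), v2 = 1:3,
                 stringsAsFactors = FALSE)

idx <- which(df$v1 %in% c("b", "d", "e"))  # integer(0): nothing matches

# -integer(0) is still integer(0), which selects no rows at all
wrong <- df[-idx, ]

# A logical index keeps every row, as intended
right <- df[!(df$v1 %in% c("b", "d", "e")), ]

nrow(wrong)
nrow(right)
```

Negating the logical vector with ! avoids the problem because the index always has one entry per row.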

Roman Luštrik · Accepted Answer · 2011-06-05 16:39:51Z

3

my.df <- read.table(textConnection("
v1 v2 v3 v4
a  v  d  c
a  v  d  d
b  n  p  g
b  d  d  h    
c  k  d  c    
c  r  p  g
d  v  d  x
d  v  d  c
e  v  d  b
e  v  d  c"), header = TRUE)

my.df[which(my.df$v1 != "b" & my.df$v1 != "d" & my.df$v1 != "e" ), ]

  v1 v2 v3 v4
1  a  v  d  c
2  a  v  d  d
5  c  k  d  c
6  c  r  p  g

answered Jun 5, 2011 at 16:39

Roman Luštrik

Toribio · Accepted Answer · 2014-09-10 01:16:50Z

1

sub.data<-data[ data[,1] != "b"  & data[,1] != "d" & data[,1] != "e" , ]

Longer, but simple to understand (I guess), and it can be extended to conditions on multiple columns, or combined with !is.na(data[,1]).
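The mention of !is.na() matters because != and %in% treat missing values differently. A minimal sketch, using a made-up data frame with one NA in v1:

```r
df <- data.frame(v1 = c("a", "b", NA, "c"), v2 = 1:4,
                 stringsAsFactors = FALSE)

# NA != "b" is NA, and an NA in a logical row index produces an all-NA row
by_ne <- df[df$v1 != "b", ]

# Adding !is.na() drops the row with the missing value explicitly
by_ne_no_na <- df[!is.na(df$v1) & df$v1 != "b", ]

# %in% returns FALSE for NA here, so the NA row is kept unchanged
by_in <- df[!(df$v1 %in% "b"), ]

by_ne$v2
by_ne_no_na$v2
by_in$v2
```

Which behaviour is right depends on whether rows with a missing v1 should survive the filter.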

edited Sep 10, 2014 at 1:16

Toribio

answered Sep 10, 2014 at 0:28

Hernan

Joe · Accepted Answer · 2018-02-22 12:57:25Z

1

And also

library(dplyr)
data %>% filter(!v1 %in% c("b", "d", "e"))

or

data %>% filter(v1 != "b" & v1 != "d" & v1 != "e")

or

data %>% filter(v1 != "b", v1 != "d", v1 != "e")

The last form works because filter() combines comma-separated conditions with &.

answered Feb 22, 2018 at 12:57

Joe
