Remove columns from data frame if any row contains a specific string

Question

I would like to remove columns which contain the string -- in any row.

Number  138 139 140 141 143 144 147 148 149 150 151 152 14  15  N…  
nm4804  A   B   --  A   B   A   A   --  A   A   A   A   A   --  A  
nm7574  B   A   A   A   A   A   A   A   A   A   A   A   A   --  A
nm8723  B   --  B   B   B   --  A   --  B   B   B   B   --  --  A
N…      B   A   A   A   A   B   A   --  A   A   B   --  --  --  A

I would like to count how often -- occurs in each column; if more than 50% of a column's values are --, that column should be removed.

Desired result:

Number  138 140 141 143 147 149 150 151 N…
nm4804  A   A   --  B   A   A   A   A   A
nm7574  B   A   A   A   A   A   A   A   A
nm8723  B   B   A   B   --  B   B   B   A
N…      B   A   A   A   A   A   A   B   A

Data (thanks bgoldst)

df <- data.frame(
  Number = c('nm4804','nm7574','nm8723','N…'),
  `138` = c('A','B','B','B'), `139` = c('B','A','--','A'), `140` = c('--','A','B','A'),
  `141` = c('A','A','B','A'), `143` = c('B','A','B','A'), `144` = c('A','A','--','B'),
  `147` = c('A','A','A','A'), `148` = c('--','A','--','--'), `149` = c('A','A','B','A'),
  `150` = c('A','A','B','A'), `151` = c('A','A','B','B'), `152` = c('A','A','B','--'),
  `14` = c('A','A','--','--'), `15` = c('--','--','--','--'), `N…` = c('A','A','A','A'),
  check.names = FALSE, stringsAsFactors = FALSE
)

It appears that your data uses -- to indicate a missing value. See ?read.table and the argument na.strings.
– Matthew Lundberg, Commented Jun 25, 2016 at 2:34
I'm going to assume you want to remove the columns and not count anything, and will edit. Roll back if that is not what you want.
– Matthew Lundberg, Commented Jun 25, 2016 at 2:38
You should give the problem a try first, then let us know where it is failing.
– Alos, Commented Jun 25, 2016 at 3:23
@Alos That is not necessary. Clearly the community thinks that this problem is not valuable by the vote count, but an attempt by the questioner isn't going to help with that.
– Matthew Lundberg, Commented Jun 25, 2016 at 3:24
I would like to count the number of -- because if one of the columns has more than 50% of --, that column will be removed, but I don't know how to do it. Thank you for your help.
– Peter Chung, Commented Jun 25, 2016 at 9:18
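
The na.strings suggestion from the first comment can be sketched as follows. This is a minimal illustration, not the asker's actual import code: the inline text stands in for the real file, only a few columns are reproduced, and the row label nm9999 is a hypothetical placeholder for the elided fourth row.

```r
# Hypothetical stand-in for the real input file (a subset of the question's columns).
txt <- "Number 138 139 148 15
nm4804 A B -- --
nm7574 B A A --
nm8723 B -- -- --
nm9999 B A -- --"

# Treat "--" as missing while reading, and keep the numeric column names as-is.
df_na <- read.table(text = txt, header = TRUE, na.strings = "--",
                    check.names = FALSE, stringsAsFactors = FALSE)

# Fraction of missing values per column.
na_frac <- colMeans(is.na(df_na))

# Keep columns where at most half of the values are missing.
kept <- df_na[, na_frac <= 0.5, drop = FALSE]
names(kept)
```

Once the marker is read as NA, the usual missing-value tools (is.na(), complete.cases(), and so on) apply without any string comparison.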

Matthew Lundberg · Accepted Answer · 2016-06-25 03:34:09Z

6

Use colSums():

df[,colSums(df=='--')==0]
##   Number 138 141 143 147 149 150 151 N…
## 1 nm4804   A   A   B   A   A   A   A  A
## 2 nm7574   B   A   A   A   A   A   A  A
## 3 nm8723   B   B   B   A   B   B   B  A
## 4     N…   B   A   A   A   A   A   B  A
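
To see why this one-liner works, it may help to look at the intermediate values. The sketch below uses a three-column subset of the question's data; the names df_small, is_dash, n_dash and keep are introduced here only for illustration.

```r
# Small subset of the question's data (first three rows, three columns).
df_small <- data.frame(
  Number = c("nm4804", "nm7574", "nm8723"),
  `138`  = c("A", "B", "B"),
  `140`  = c("--", "A", "B"),
  `148`  = c("--", "A", "--"),
  check.names = FALSE, stringsAsFactors = FALSE
)

is_dash <- df_small == "--"   # logical matrix, TRUE where a cell equals "--"
n_dash  <- colSums(is_dash)   # number of "--" cells in each column
keep    <- n_dash == 0        # TRUE for columns with no "--" at all

# drop = FALSE keeps the result a data frame even if only one column survives.
result <- df_small[, keep, drop = FALSE]
names(result)
```

Without drop = FALSE, a selection that leaves a single column would be simplified to a plain vector, which is an easy thing to trip over when the threshold is strict.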

edited Jun 25, 2016 at 3:34

Matthew Lundberg

answered Jun 25, 2016 at 2:27

bgoldst

1 Comment

talat Over a year ago

And to keep columns with less than 50% of -- values, it would be df[, colMeans(df == "--") < 0.5]
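
One detail worth checking is the boundary: "< 0.5" also drops a column that is exactly 50% "--", whereas the question asks to remove columns with more than 50%. A short sketch of the difference, using a few of the question's columns (the row label nm0000 is a hypothetical placeholder for the elided fourth row):

```r
df_small <- data.frame(
  Number = c("nm4804", "nm7574", "nm8723", "nm0000"),  # last label is hypothetical
  `138`  = c("A", "B", "B", "B"),      # no "--"
  `14`   = c("A", "A", "--", "--"),    # exactly half "--"
  `148`  = c("--", "A", "--", "--"),   # three quarters "--"
  check.names = FALSE, stringsAsFactors = FALSE
)

# Fraction of "--" cells per column.
frac <- colMeans(df_small == "--")

strict    <- df_small[, frac <  0.5, drop = FALSE]  # drops the 50% column too
inclusive <- df_small[, frac <= 0.5, drop = FALSE]  # removes only columns above 50%

names(strict)
names(inclusive)
```

Which comparison is appropriate depends on how "more than 50%" is meant to treat a column that sits exactly on the threshold.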

akrun · Accepted Answer · 2016-06-25 09:15:58Z

4

We can also use Filter

Filter(function(x) !any(x=="--"), df1)
#    Number X138 X141 X143 X147 X149 X150 X151 N…
#1 nm4804    A    A    B    A    A    A    A  A
#2 nm7574    B    A    A    A    A    A    A  A
#3 nm8723    B    B    B    A    B    B    B  A
#4     N…    B    A    A    A    A    A    B  A

If we need to remove the columns with more than 50% of --

Filter(function(x) mean(x == '--') <= 0.5, df1)

NOTE: Based on the OP's example, all the columns will be retained here.
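
Filter() treats a data frame as a list of columns and keeps those for which the predicate returns TRUE. Since df1 is not defined in this answer, the sketch below assumes it holds the same values as a few columns of the df posted in the question (with the X-prefixed names that the default check.names = TRUE produces); the row label row4 is a placeholder.

```r
# Assumed contents of df1: a subset of the question's columns.
df1 <- data.frame(
  Number = c("nm4804", "nm7574", "nm8723", "row4"),  # "row4" is a placeholder label
  X138 = c("A", "B", "B", "B"),       # 0% "--"
  X14  = c("A", "A", "--", "--"),     # 50% "--"
  X148 = c("--", "A", "--", "--"),    # 75% "--"
  X15  = c("--", "--", "--", "--"),   # 100% "--"
  stringsAsFactors = FALSE
)

# Keep a column when at most half of its values are "--".
kept <- Filter(function(x) mean(x == "--") <= 0.5, df1)
names(kept)
```

With these values the 75% and 100% columns are dropped and the 50% column is kept, so whether every column survives depends on the data actually loaded into df1.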

edited Jun 25, 2016 at 9:15

answered Jun 25, 2016 at 3:20

akrun

4 Comments

Peter Chung Over a year ago

I would like to count the number of -- because if one of the columns has more than 50% of --, that column will be removed, but I don't know how to do it. Thank you for your help.

akrun Over a year ago

@PeterChung Updated with an option for the new case.

Peter Chung Over a year ago

Almost there. What is function(x)?

akrun Over a year ago

It is an anonymous function.
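
An anonymous function is simply a function that is defined where it is used, without being assigned a name first. The following sketch shows that passing a named function to Filter() gives the same result; no_dash and cols are hypothetical names chosen for this illustration.

```r
# A named predicate: TRUE when a column contains no "--".
no_dash <- function(x) !any(x == "--")

# A tiny list of columns to filter.
cols <- list(a = c("A", "B"), b = c("A", "--"))

# Same predicate, once by name and once written inline (anonymous).
named     <- Filter(no_dash, cols)
anonymous <- Filter(function(x) !any(x == "--"), cols)

identical(named, anonymous)
names(named)
```

R 4.1 and later also accept the shorthand \(x) in place of function(x).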

r2evans · Accepted Answer · 2016-06-25 02:26:23Z

1

Since it is unclear in the question, I'm assuming that nm4804 and such are row names, and 138..152 are column names, not actual data. With that, I'm guessing that this is a character matrix. Your data:

dat <- structure(c("A", "B", "B", "B", "B", "A", "--", "A", "--", "A", 
"B", "A", "A", "A", "B", "A", "B", "A", "B", "A", "A", "A", "--", 
"B", "A", "A", "A", "A", "--", "A", "--", "--", "A", "A", "B", 
"A", "A", "A", "B", "A", "A", "A", "B", "B", "A", "A", "B", "--", 
"A", "A", "--", "--", "--", "--", "--", "--", "A", "A", "A", 
"A"), .Dim = c(4L, 15L), .Dimnames = list(c("nm4804", "nm7574", 
"nm8723", "N..."), c("138", "139", "140", "141", "142", "143", 
"144", "145", "146", "147", "148", "149", "150", "151", "152"
)))

Try this:

dat[,! apply(dat, 2, `%in%`, x = "--")]
#        138 141 142 144 146 147 148 152
# nm4804 "A" "A" "B" "A" "A" "A" "A" "A"
# nm7574 "B" "A" "A" "A" "A" "A" "A" "A"
# nm8723 "B" "B" "B" "A" "B" "B" "B" "A"
# N...   "B" "A" "A" "A" "A" "A" "B" "A"
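
The call above can look cryptic. Because x is passed by name, each column handed over by apply() is matched to the second argument of `%in%`, so every call is effectively "--" %in% column. A small sketch with an explicit function, using a hypothetical 2 x 3 matrix m:

```r
# Hypothetical character matrix; columns are c1 = (A, B), c2 = (--, A), c3 = (A, A).
m <- matrix(c("A", "B", "--", "A", "A", "A"), nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

# Explicit version: does the column contain "--"?
has_dash <- apply(m, 2, function(col) "--" %in% col)

# Compact version from the answer; x = "--" is fixed, the column fills `table`.
has_dash_compact <- apply(m, 2, `%in%`, x = "--")

# Drop the flagged columns, keeping the matrix shape.
m_clean <- m[, !has_dash, drop = FALSE]
colnames(m_clean)
```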

answered Jun 25, 2016 at 2:26

r2evans

10 Comments

Matthew Lundberg Over a year ago

Clearly nm4804 and such are not row names, as there is a column name: Number. Also, by the way the results are printed, my first guess would be that it is a data frame of factors/strings.

r2evans Over a year ago

Clearly? Based on many questions with hand-typed tabulated numbers/letters, I wasn't certain if this was hand-made or not. Additionally, the code providing the data was added late in the game (I didn't have it available). It is certainly non-standard (though possible) to have numbers as column names, so I made a leap. Regardless, my code still works without change, so your comment was intended for what purpose?

Matthew Lundberg Over a year ago

Your code coerces to a matrix, where a matrix may not be the input. If so, that's an unnecessary copy. As for the purpose of my comment, it is to prod you to improve your answer.

r2evans Over a year ago

Nope, it sure doesn't. Because of my assumptions, I started with a matrix, so my answer happily kept it as such. If you run it with a data.frame going in, it happily keeps it a data.frame. (Don't mistake the coercion going on inside the square brackets for what results outside of the square brackets.)

Matthew Lundberg Over a year ago

From the source of apply(): if (is.object(X)) X <- if (dl == 2L) as.matrix(X) else as.array(X). That is coercion to a matrix. I didn't say the result was a matrix! I said you were making an unnecessary copy.
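
Both points in this exchange can be checked directly. The sketch below uses a hypothetical mixed-type data frame, df_mixed, to show that apply() sees every column as character (the as.matrix() step), while the subsetting result is still a data frame. It also shows a vapply() variant that works column by column without the matrix conversion; that variant is offered as an alternative, not as something proposed in the thread.

```r
# Hypothetical data frame with one integer and one character column.
df_mixed <- data.frame(id = 1:2, flag = c("A", "--"), stringsAsFactors = FALSE)

# Inside apply(), the data frame has been converted to a character matrix.
seen_by_apply <- apply(df_mixed, 2, class)

# The logical index comes from the matrix copy, but `[` acts on the data frame.
has_dash_apply <- apply(df_mixed, 2, `%in%`, x = "--")
kept <- df_mixed[, !has_dash_apply, drop = FALSE]

# Alternative that iterates over the columns directly, with no as.matrix() step.
has_dash_vapply <- vapply(df_mixed, function(col) "--" %in% col, logical(1))

class(kept)
```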

Cricri's · Accepted Answer · 2022-04-12 12:23:13Z

Here is a dplyr approach, using the 'dat' matrix from @r2evans's answer:

dat <- structure(c("A", "B", "B", "B", "B", "A", "--", "A", "--", "A", 
"B", "A", "A", "A", "B", "A", "B", "A", "B", "A", "A", "A", "--", 
"B", "A", "A", "A", "A", "--", "A", "--", "--", "A", "A", "B", 
"A", "A", "A", "B", "A", "A", "A", "B", "B", "A", "A", "B", "--", 
"A", "A", "--", "--", "--", "--", "--", "--", "A", "A", "A", 
"A"), .Dim = c(4L, 15L), .Dimnames = list(c("nm4804", "nm7574", 
"nm8723", "N..."), c("138", "139", "140", "141", "142", "143", 
"144", "145", "146", "147", "148", "149", "150", "151", "152"
)))

The following removes all columns in which more than 50% of the values are '--':

library(dplyr)

dat %>%
  as.data.frame() %>%
  select_if(~ !(sum(. == "--") / length(.) > 0.5))
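
Since dplyr 1.0.0, select_if() is superseded in favour of select() combined with where(). A minimal sketch of the same filter in that style, assuming dplyr 1.0.0 or later is installed and using a hypothetical 3 x 3 matrix dat_small rather than the full dat:

```r
library(dplyr)

# Hypothetical matrix: "138" and "140" are one third "--", "139" is two thirds "--".
dat_small <- matrix(c("A", "B", "--",
                      "--", "--", "A",
                      "A", "A", "--"),
                    nrow = 3,
                    dimnames = list(NULL, c("138", "139", "140")))

# Keep columns whose share of "--" is at most 50%.
kept <- dat_small %>%
  as.data.frame() %>%
  select(where(~ mean(.x == "--") <= 0.5))

names(kept)
```

Here mean(.x == "--") expresses the same ratio as sum(. == "--") / length(.) in the code above.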
