2

I would like to remove columns which contain the string -- in any row.

Number  138 139 140 141 143 144 147 148 149 150 151 152 14  15  N…  
nm4804  A   B   --  A   B   A   A   --  A   A   A   A   A   --  A  
nm7574  B   A   A   A   A   A   A   A   A   A   A   A   A   --  A
nm8723  B   --  B   B   B   --  A   --  B   B   B   B   --  --  A
N…      B   A   A   A   A   B   A   --  A   A   B   --  --  --  A

I would like to count the frequency of -- in each column; any column in which more than 50% of the values are -- should be removed.

Desired result:

Number  138 140 141 143 147 149 150 151 N…
nm4804  A   A   --  B   A   A   A   A   A
nm7574  B   A   A   A   A   A   A   A   A
nm8723  B   B   A   B   --  B   B   B   A
N…      B   A   A   A   A   A   A   B   A

Data (thanks bgoldst)

df <- data.frame(Number=c('nm4804','nm7574','nm8723','N…'),`138`=c('A','B','B','B'),`139`=c(
'B','A','--','A'),`140`=c('--','A','B','A'),`141`=c('A','A','B','A'),`143`=c('B','A','B','A'
),`144`=c('A','A','--','B'),`147`=c('A','A','A','A'),`148`=c('--','A','--','--'),`149`=c('A',
'A','B','A'),`150`=c('A','A','B','A'),`151`=c('A','A','B','B'),`152`=c('A','A','B','--'),
`14`=c('A','A','--','--'),`15`=c('--','--','--','--'),`N…`=c('A','A','A','A'),check.names=F,
stringsAsFactors=F);
6
  • It appears that your data uses -- to indicate a missing value. See ?read.table and the argument na.strings. Commented Jun 25, 2016 at 2:34
  • I'm going to assume you want to remove the columns and not count anything, and will edit. Roll back if that is not what you want. Commented Jun 25, 2016 at 2:38
  • You should give the problem a try first, then let us know where it is failing. Commented Jun 25, 2016 at 3:23
  • @Alos That is not necessary. Clearly the community thinks that this problem is not valuable by the vote count, but an attempt by the questioner isn't going to help with that. Commented Jun 25, 2016 at 3:24
  • I would like to count the number of -- values, because any column in which more than 50% of the values are -- should be removed, but I don't know how to do it. Thank you for your help. Commented Jun 25, 2016 at 9:18
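The na.strings suggestion from the first comment could look roughly like the sketch below. It assumes the data is read from text with read.table(); the `txt` string is a cut-down copy of the question's table (three rows, four of the columns) used only for illustration, and the names `d`, `na_count`, `na_share` and `kept` are made up here.

```r
# Cut-down copy of the question's table: three rows and four of the
# columns, chosen for illustration only.
txt <- "Number 138 139 148 15
nm4804 A B -- --
nm7574 B A A --
nm8723 B -- -- --"

# na.strings = "--" turns every -- into NA while reading;
# check.names = FALSE keeps names such as 138 instead of X138.
d <- read.table(text = txt, header = TRUE, na.strings = "--",
                check.names = FALSE, stringsAsFactors = FALSE)

na_count <- colSums(is.na(d))   # number of missing values per column
na_share <- colMeans(is.na(d))  # proportion of missing values per column

# Keep the columns in which at most half of the values are missing.
kept <- d[, na_share <= 0.5, drop = FALSE]
names(kept)
```

Once the -- markers are real NA values, the usual tools (is.na(), colMeans(), complete.cases()) apply directly, which is the main appeal of this route.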

4 Answers

6

Use colSums():

df[,colSums(df=='--')==0]
##   Number 138 141 143 147 149 150 151 N…
## 1 nm4804   A   A   B   A   A   A   A  A
## 2 nm7574   B   A   A   A   A   A   A  A
## 3 nm8723   B   B   B   A   B   B   B  A
## 4     N…   B   A   A   A   A   A   B  A

1 Comment

And to keep columns with less than 50% of -- values, it would be df[, colMeans(df == "--") < 0.5]
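To make the counting explicit, here is a minimal sketch of the colSums()/colMeans() idea on a reduced version of the question's data. The label "rowN" is a hypothetical stand-in for the elided fourth row name, and `dash_count`, `dash_share`, `keep_lt` and `keep_le` are names invented for this example. Note that `< 0.5` and `<= 0.5` differ for a column that is exactly half --, such as 14; since the question says to remove columns with more than 50%, `<= 0.5` is probably the closer match.

```r
# Reduced version of the question's data; "rowN" is a hypothetical
# stand-in for the elided fourth row label.
df <- data.frame(
  Number = c("nm4804", "nm7574", "nm8723", "rowN"),
  `138`  = c("A", "B", "B", "B"),
  `139`  = c("B", "A", "--", "A"),
  `148`  = c("--", "A", "--", "--"),
  `14`   = c("A", "A", "--", "--"),
  `15`   = c("--", "--", "--", "--"),
  check.names = FALSE, stringsAsFactors = FALSE
)

dash_count <- colSums(df == "--")   # how many -- each column holds
dash_share <- colMeans(df == "--")  # the same, as a proportion of rows

# "Less than 50%" also drops column 14, which is exactly half --.
keep_lt <- df[, dash_share < 0.5, drop = FALSE]

# "Not more than 50%" keeps column 14.
keep_le <- df[, dash_share <= 0.5, drop = FALSE]

names(keep_lt)
names(keep_le)
```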
4

We can also use Filter

Filter(function(x) !any(x=="--"), df1)
#    Number X138 X141 X143 X147 X149 X150 X151 N…
#1 nm4804    A    A    B    A    A    A    A  A
#2 nm7574    B    A    A    A    A    A    A  A
#3 nm8723    B    B    B    A    B    B    B  A
#4     N…    B    A    A    A    A    A    B  A

If we need to remove the columns with more than 50% of --

Filter(function(x) mean(x == '--') <= 0.5, df1)

NOTE: With the data shown in the question, this drops the columns 148 (75% --) and 15 (100% --); column 14, at exactly 50%, is kept.

4 Comments

I would like to count the number of -- values, because any column in which more than 50% of the values are -- should be removed, but I don't know how to do it. Thank you for your help.
@PeterChung Updated with an option for the new case.
Almost there. What is function(x)?
It is an anonymous function.
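To illustrate the anonymous-function point: function(x) ... creates a function without giving it a name, and Filter() calls it once per column. The sketch below shows that a named function does the same job. The small `df1` is a stand-in built for this example (three rows, three columns of the question's data), and `few_dashes`, `by_name` and `by_anon` are made-up names.

```r
# Small stand-in for df1, built only for this example.
df1 <- data.frame(
  Number = c("nm4804", "nm7574", "nm8723"),
  `138`  = c("A", "B", "B"),
  `148`  = c("--", "A", "--"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# A named predicate: TRUE when at most half of the values are --.
few_dashes <- function(x) mean(x == "--") <= 0.5

# Filter() applies the predicate to each column and keeps the TRUE ones.
by_name <- Filter(few_dashes, df1)

# The same predicate written inline, without a name.
by_anon <- Filter(function(x) mean(x == "--") <= 0.5, df1)

identical(by_name, by_anon)
names(by_name)
```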
1

Since it is unclear in the question, I'm assuming that nm4804 and such are row names, and 138..152 are column names, not actual data. With that, I'm guessing that this is a character matrix. Your data:

dat <- structure(c("A", "B", "B", "B", "B", "A", "--", "A", "--", "A", 
"B", "A", "A", "A", "B", "A", "B", "A", "B", "A", "A", "A", "--", 
"B", "A", "A", "A", "A", "--", "A", "--", "--", "A", "A", "B", 
"A", "A", "A", "B", "A", "A", "A", "B", "B", "A", "A", "B", "--", 
"A", "A", "--", "--", "--", "--", "--", "--", "A", "A", "A", 
"A"), .Dim = c(4L, 15L), .Dimnames = list(c("nm4804", "nm7574", 
"nm8723", "N..."), c("138", "139", "140", "141", "142", "143", 
"144", "145", "146", "147", "148", "149", "150", "151", "152"
)))

Try this:

dat[,! apply(dat, 2, `%in%`, x = "--")]
#        138 141 142 144 146 147 148 152
# nm4804 "A" "A" "B" "A" "A" "A" "A" "A"
# nm7574 "B" "A" "A" "A" "A" "A" "A" "A"
# nm8723 "B" "B" "B" "A" "B" "B" "B" "A"
# N...   "B" "A" "A" "A" "A" "A" "B" "A"
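The one-liner above relies on argument matching: apply() passes each column as the first unnamed argument, and because x = "--" is already taken by name, the column ends up as the table argument of %in%. A more explicit equivalent is sketched below on a small made-up matrix (three rows and three columns taken from `dat`); `has_dash_short`, `has_dash_long` and `clean` are names invented here. Adding drop = FALSE is a precaution so that the result stays a matrix even when only one column survives.

```r
# Small illustrative matrix; matrix() fills column by column.
dat <- matrix(c("A",  "B", "B",
                "--", "A", "--",
                "A",  "A", "B"),
              nrow = 3,
              dimnames = list(c("nm4804", "nm7574", "nm8723"),
                              c("138", "145", "146")))

# Compact form: the column is matched to the `table` argument of %in%.
has_dash_short <- apply(dat, 2, `%in%`, x = "--")

# Explicit form: ask, per column, whether "--" occurs in it.
has_dash_long <- apply(dat, 2, function(col) "--" %in% col)

identical(has_dash_short, has_dash_long)

# drop = FALSE keeps a matrix even if a single column remains.
clean <- dat[, !has_dash_long, drop = FALSE]
colnames(clean)
```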

10 Comments

Clearly nm4804 and such are not row names, as there is a column name: Number. Also, by the way the results are printed, my first guess would be that it is a data frame of factors/strings.
Clearly? Based on many questions with hand-typed tabulated numbers/letters, I wasn't certain if this was hand-made or not. Additionally, the code providing the data was added late in the game (I didn't have it available). It is certainly non-standard (though possible) to have numbers as column names, so I made a leap. Regardless, my code still works without change, so your comment was intended for what purpose?
Your code coerces to a matrix, where a matrix may not be the input. If so, that's an unnecessary copy. As for the purpose of my comment, it is to prod you to improve your answer.
Nope, it sure doesn't. Because of my assumptions, I started with a matrix, so my answer happily kept it as such. If you run it with a data.frame going in, it happily keeps it a data.frame. (Don't mistake the coercion going in inside the square brackets for what results outside of the square brackets.)
if (is.object(X)) X <- if (dl == 2L) as.matrix(X) else as.array(X) That is coercion to a matrix. I didn't say the result was a matrix! I said you were making an unnecessary copy.
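If the input is a data frame and the as.matrix() step inside apply() is a concern, one possible alternative is to loop over the columns directly with vapply(), which treats the data frame as a list of columns. This is only a sketch on a small made-up data frame; `no_dash` and `clean` are illustrative names, and whether the saved copy matters in practice depends on the size of the data.

```r
# Small illustrative data frame (three rows of the question's data).
df <- data.frame(
  Number = c("nm4804", "nm7574", "nm8723"),
  `138`  = c("A", "B", "B"),
  `145`  = c("--", "A", "--"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# vapply() visits each column as a vector; no matrix is built.
no_dash <- vapply(df, function(col) !any(col == "--"), logical(1))

# List-style indexing always returns a data frame.
clean <- df[no_dash]
names(clean)
```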
0

Here is a proposal using dplyr, with the 'dat' matrix from @r2evans's answer:

dat <- structure(c("A", "B", "B", "B", "B", "A", "--", "A", "--", "A", 
"B", "A", "A", "A", "B", "A", "B", "A", "B", "A", "A", "A", "--", 
"B", "A", "A", "A", "A", "--", "A", "--", "--", "A", "A", "B", 
"A", "A", "A", "B", "A", "A", "A", "B", "B", "A", "A", "B", "--", 
"A", "A", "--", "--", "--", "--", "--", "--", "A", "A", "A", 
"A"), .Dim = c(4L, 15L), .Dimnames = list(c("nm4804", "nm7574", 
"nm8723", "N..."), c("138", "139", "140", "141", "142", "143", 
"144", "145", "146", "147", "148", "149", "150", "151", "152"
)))

This removes all columns in which more than 50% of the values are '--':

library(dplyr)

dat %>% 
  as.data.frame() %>% 
  select_if(~!(sum(.=="--") / length(.) > 0.5))
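In recent dplyr versions select_if() is superseded, and the same selection can be written with select() and where(). The sketch below assumes dplyr 1.0.0 or later and uses a small made-up matrix in place of the full `dat`; `kept` is an illustrative name.

```r
library(dplyr)

# Small illustrative matrix; matrix() fills column by column.
dat <- matrix(c("A",  "B", "B",
                "--", "A", "--",
                "A",  "A", "B"),
              nrow = 3,
              dimnames = list(c("nm4804", "nm7574", "nm8723"),
                              c("138", "145", "146")))

# Keep the columns in which at most half of the values are "--".
kept <- dat %>%
  as.data.frame() %>%
  select(where(~ mean(.x == "--") <= 0.5))

names(kept)
```

Using mean() on the logical vector gives the proportion directly, so the separate sum() / length() step is not needed.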
