
I have a data frame with 80k rows and 874 columns. Some of these columns are entirely empty. I use sum(is.na) in a for loop to find the indices of the empty columns: since no row index is missing, a column is empty when its count of NA values equals the number of rows.

for (i in 1:ncol(loans)) {
  if (sum(is.na(loans[[i]])) == nrow(loans)) {
    print(i)
  }
}

Now that I know the indices of the empty columns, I want to drop them from the data. I thought about storing those indices in a vector and dropping them one at a time in a loop, but I don't think that will work: each deletion shifts the remaining columns left, so later indices would point at columns that still contain data. How can I drop them?
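To make the concern concrete on a toy data frame: collecting the indices first and then dropping them all in a single subsetting call sidesteps the shifting problem entirely, since nothing moves until the one deletion happens (a minimal sketch; column names here are made up):

```r
loans <- data.frame(a = c(NA, NA), b = 1:2, c = c(NA, NA), d = 3:4)

empty <- c()
for (i in 1:ncol(loans)) {
  if (sum(is.na(loans[[i]])) == nrow(loans)) {
    empty <- c(empty, i)
  }
}

# Drop everything in one subsetting call. Guard against empty being NULL:
# loans[-NULL] is an error, not a no-op.
if (length(empty) > 0) {
  loans <- loans[-empty]
}
names(loans)   # "b" "d"
```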

4 Answers


You should try to provide a toy dataset for your question.

loans <- data.frame(
  a = c(NA, NA, NA),
  b = c(1,2,3),
  c = c(1,2,3),
  d = c(1,2,3),
  e = c(NA, NA, NA)
)


loans[!sapply(loans, function(col) all(is.na(col)))]

sapply loops over columns of loans and applies the anonymous function checking if all elements are NA. It then coerces the output to a vector, in this case logical.
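With the toy data frame above, the intermediate logical vector produced by the sapply step looks like this (a quick sanity check of the approach):

```r
loans <- data.frame(
  a = c(NA, NA, NA),
  b = c(1, 2, 3),
  c = c(1, 2, 3),
  d = c(1, 2, 3),
  e = c(NA, NA, NA)
)

# Named logical vector: TRUE marks a column to keep.
keep <- !sapply(loans, function(col) all(is.na(col)))
keep
#     a     b     c     d     e
# FALSE  TRUE  TRUE  TRUE FALSE

loans[keep]   # only columns b, c, d remain
```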

The tidyverse option:

loans[!purrr::map_lgl(loans, ~all(is.na(.x)))]




Does this work:

df <- data.frame(col1 = rep(NA, 5),
                 col2 = 1:5,
                 col3 = rep(NA,5),
                 col4 = 6:10)
df
  col1 col2 col3 col4
1   NA    1   NA    6
2   NA    2   NA    7
3   NA    3   NA    8
4   NA    4   NA    9
5   NA    5   NA   10
df[,which(colSums(df, na.rm = TRUE) == 0)] <- NULL
df
  col2 col4
1    1    6
2    2    7
3    3    8
4    4    9
5    5   10

Another approach:

df[!apply(df, 2, function(x) all(is.na(x)))]
  col2 col4
1    1    6
2    2    7
3    3    8
4    4    9
5    5   10
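One caveat worth noting: apply() first coerces the data frame to a matrix, so with mixed column types everything becomes character. all(is.na(x)) still behaves correctly after that coercion, but iterating over the columns as a list avoids it entirely; a sketch using vapply, which also guarantees a logical(1) result per column:

```r
df <- data.frame(col1 = rep(NA, 5), col2 = 1:5,
                 col3 = rep(NA, 5), col4 = 6:10)

# vapply walks the list of columns directly (no matrix coercion) and
# type-checks that each result is a single logical value.
df[!vapply(df, function(x) all(is.na(x)), logical(1))]
```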

3 Comments

Wouldn't colSums(df, na.rm = TRUE) also evaluate to 0 for a column in which every value is 0?
@zerz, it would, but considering the OP's data frame has 874 columns and 80k rows, the probability of that happening is very remote. I have added another approach to address the same.
Also: df[-which(colSums(df, na.rm = TRUE) == 0)]
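A caveat on the negative-indexing variant suggested here: when no column matches, which() returns integer(0), and df[-integer(0)] selects zero columns rather than all of them, silently emptying the data frame. A guard avoids that edge case (a minimal sketch):

```r
df <- data.frame(col2 = 1:5, col4 = 6:10)   # no all-NA (or all-zero) columns

drop <- which(colSums(df, na.rm = TRUE) == 0)
# drop is integer(0) here; df[-drop] would select ZERO columns,
# so only subset when there is actually something to remove.
if (length(drop) > 0) df <- df[-drop]
ncol(df)   # 2
```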

A dplyr solution:

df %>%
  select_if(!colSums(., na.rm = TRUE) == 0)
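In recent dplyr (1.0 and later), select_if() is superseded; the same idea is usually written with select(where(...)). A sketch, assuming the dplyr package is installed (the data frame here is a made-up toy):

```r
library(dplyr)

df <- data.frame(a = c(NA, NA), b = 1:2, c = c(NA, NA))

# where() takes a predicate applied to each column; the purrr-style
# lambda keeps columns that are not entirely NA.
df %>%
  select(where(~ !all(is.na(.x))))
```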



You can try to use fundamental skills like if/else statements and for loops for almost any problem, although a drawback is that it will be slower.

# Evaluate each column; if it is entirely NA, remove it.
# Loop from the last column to the first: deleting a column shifts the
# later columns left, so a forward loop would skip columns and eventually
# index past the end of the shrunken data frame.
for (i in ncol(loans):1){
  if (sum(is.na(loans[[i]])) == nrow(loans)){
    loans[[i]] <- NULL
  }
}

3 Comments

The problem here is that when an empty column is deleted, the next column shifts into its place. Therefore, I guess, in each iteration there is a chance that you could delete columns with data.
@VolkanDemir Well, I get your concerns. But the "if" statement decides whether a column should be removed, so no matter what, the columns with data won't be affected.
@VolkanDemir As for your concern that the next column's data replaces the removed one: actually, I never thought about this before. If you test my approach on a small sample data set, you will see it actually works. But you raised a good point; I may post a question about that. Thanks!
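For anyone testing the concern raised in this thread: with two adjacent empty columns, a forward loop that deletes in place really does skip one, because every deletion slides the later columns left while the loop counter keeps advancing. Iterating from the last column backwards avoids the shift, since deletions then only affect columns already visited (a minimal sketch):

```r
loans <- data.frame(a = c(NA, NA), b = c(NA, NA), c = 1:2)

# Forward loop: after "a" is deleted, "b" slides into position 1,
# but i moves on to 2, so "b" is never checked. (The i <= ncol(bad)
# guard is needed because 1:ncol(bad) is fixed at loop entry.)
bad <- loans
for (i in 1:ncol(bad)) {
  if (i <= ncol(bad) && all(is.na(bad[[i]]))) bad[[i]] <- NULL
}
names(bad)   # "b" "c" -- the adjacent empty column survived

# Backward loop: deletions never shift the columns still to be visited.
good <- loans
for (i in ncol(good):1) {
  if (all(is.na(good[[i]]))) good[[i]] <- NULL
}
names(good)  # "c"
```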
