Suppose you have a data.frame like this:
x <- data.frame(v1=1:20,v2=1:20,v3=1:20,v4=letters[1:20])
How would you select only those columns in x that are numeric?
EDIT: updated to avoid use of ill-advised sapply.
Since a data frame is a list we can use the list-apply functions:
nums <- unlist(lapply(x, is.numeric), use.names = FALSE)
Then standard subsetting
x[ , nums]
## don't use sapply, even though it's less code
## nums <- sapply(x, is.numeric)
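One concrete reason sapply() is fragile here: its return type depends on the input. On a data frame with no columns it returns an empty list rather than a logical vector. A minimal sketch using vapply(), which always returns a logical vector (the names `empty` and `nums_safe` are only for illustration):

```r
x <- data.frame(v1 = 1:20, v2 = 1:20, v3 = 1:20, v4 = letters[1:20])
empty <- data.frame()

# sapply() simplifies when it can, so the result type varies with the input
class(sapply(x, is.numeric))      # "logical"
class(sapply(empty, is.numeric))  # "list"

# vapply() enforces exactly one logical value per column
nums_safe <- vapply(x, is.numeric, logical(1))
class(vapply(empty, is.numeric, logical(1)))  # "logical"

x[, nums_safe]
```

vapply() is a little more typing than sapply(), but the declared result type means a surprising input fails loudly instead of silently changing shape.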
For a more idiomatic modern R I'd now recommend
x[ , purrr::map_lgl(x, is.numeric)]
Less code, less dependent on R's particular quirks, more straightforward, and robust to use on database-backed tibbles:
dplyr::select_if(x, is.numeric)
Newer versions of dplyr also support the following syntax:
x %>% dplyr::select(where(is.numeric))
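One caveat for the base-R form x[ , nums]: if only a single column is numeric, the two-index form drops to a plain vector. A small sketch with a hypothetical data frame `y` that has exactly one numeric column:

```r
# Hypothetical data frame with a single numeric column
y <- data.frame(id = letters[1:3], value = c(1.5, 2.5, 3.5))
nums <- unlist(lapply(y, is.numeric), use.names = FALSE)

is.data.frame(y[, nums])                # FALSE: the result is a numeric vector
is.data.frame(y[, nums, drop = FALSE])  # TRUE
is.data.frame(y[nums])                  # TRUE: list-style indexing never drops
```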
x[nums] or x[sapply(x, is.numeric)] works as well, and both always return a data.frame. Compare x[1] with x[, 1]: the first is a data.frame, the second is a vector. To prevent the conversion, use x[, 1, drop = FALSE].
"undefined columns selected": how do you avoid it?
Use tryCatch() to deal with this. Please consider opening a new question.
The dplyr package's select_if() function is an elegant solution:
library("dplyr")
select_if(x, is.numeric)
Filter() from the base package is the perfect function for this use case. You simply write:
Filter(is.numeric, x)
It is also much faster than select_if():
library(microbenchmark)
microbenchmark(
dplyr::select_if(mtcars, is.numeric),
Filter(is.numeric, mtcars)
)
This returns (on my computer) a median of 60 microseconds for Filter() and 21,000 microseconds for select_if(), so Filter() is 350x faster.
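Filter() returns a new data frame. When the goal is to modify the numeric columns of the original, a logical index can be used on both sides of the assignment instead. A minimal sketch on a copy of iris (the names `df` and `num` are illustrative):

```r
df <- iris
num <- vapply(df, is.numeric, logical(1))

# Halve every numeric column; the factor column Species is left untouched
df[num] <- 0.5 * df[num]

head(df, 3)
```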
Filter() doesn't work for replacement here, e.g. Filter(is.numeric, iris) <- 0.5 * Filter(is.numeric, iris) won't work.
iris %>% dplyr::select(where(is.numeric)) # as per most recent updates
Another option with purrr is to negate the predicate passed to discard():
iris %>% purrr::discard(~!is.numeric(.))
If you want the names of the numeric columns, you can add names or colnames:
iris %>% purrr::discard(~!is.numeric(.)) %>% names
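The double negative can be avoided with purrr::keep(), which retains the elements for which the predicate is TRUE. A short sketch, assuming purrr is installed (the name `numeric_iris` is illustrative):

```r
library(purrr)

# keep() is the complement of discard(): it keeps columns where is.numeric() is TRUE
numeric_iris <- keep(iris, is.numeric)
names(numeric_iris)
```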
discard() with a negated predicate is pretty much the same as using keep().
This is an alternative to the code in the other answers:
x[, sapply(x, class) %in% c("numeric", "integer")]  # integer columns have class "integer", so test for both
With a data.table:
x[, lapply(x, is.numeric) == TRUE, with = FALSE]
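A note on the class-based comparison: class() distinguishes "integer" from "numeric", while is.numeric() is TRUE for both. A quick check on the example data frame from the question:

```r
x <- data.frame(v1 = 1:20, v2 = 1:20, v3 = 1:20, v4 = letters[1:20])

sapply(x, class)       # v1, v2, v3 are "integer", not "numeric"
sapply(x, is.numeric)  # TRUE for v1, v2, v3

# Number of columns each test selects
sum(sapply(x, class) == "numeric")  # 0
sum(sapply(x, is.numeric))          # 3
```

For this reason is.numeric() is usually the safer test when integer columns should count as numeric.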
The PCAmixdata library has a function, splitmix, that splits a given data frame "YourDataframe" into its quantitative (numerical) and qualitative (categorical) columns, as shown below:
install.packages("PCAmixdata")
library(PCAmixdata)
split <- splitmix(YourDataframe)
X1 <- split$X.quanti  # numerical columns in the dataset
X2 <- split$X.quali   # categorical columns in the dataset
If you have many factor variables, you can use the select_if() function from the dplyr package, which must be installed first. It is one of several dplyr functions that select data satisfying a condition, and you can set the condition yourself.
Use it like this:
categorical <- select_if(df, is.factor)
str(categorical)
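The same predicate-based selection is available in base R through Filter(). A minimal sketch using iris, whose only factor column is Species:

```r
# Keep only the columns for which is.factor() is TRUE
categorical <- Filter(is.factor, iris)
str(categorical)
names(categorical)  # "Species"
```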
Another way could be as follows:
# extracting numeric columns from the iris dataset
(iris[sapply(iris, is.numeric)])
Numerical_variables <- which(sapply(df, is.numeric))
# then extract column names
Names <- names(Numerical_variables)
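which() on a named logical vector returns the positions of the TRUE entries and keeps their names, which is why names() works on the result. Applied to iris as a sketch (the name `numeric_idx` is illustrative):

```r
numeric_idx <- which(sapply(iris, is.numeric))
numeric_idx         # named integer vector holding positions 1 to 4
names(numeric_idx)  # the four measurement columns

# The positions can also be used to subset directly
head(iris[numeric_idx], 3)
```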
This doesn't directly answer the question but can be very useful, especially if you want something like all the numeric columns except for your id column and dependent variable.
numeric_cols <- sapply(dataframe, is.numeric) %>% which %>%
names %>% setdiff(., c("id_variable", "dep_var"))
dataframe %<>% dplyr::mutate_at(numeric_cols, function(x) your_function(x))
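A base-R sketch of the same idea, without the pipe operators. The data frame, the column names id_variable and dep_var, and the function your_function are all placeholders carried over from the snippet above; here your_function is arbitrarily defined as mean-centering.

```r
# Hypothetical data for illustration only
dataframe <- data.frame(
  id_variable = 1:4,
  dep_var     = c(10, 20, 30, 40),
  a           = c(1, 2, 3, 4),
  b           = c(2, 4, 6, 8),
  label       = c("p", "q", "r", "s")
)
your_function <- function(v) v - mean(v)  # placeholder transformation

# Numeric columns, minus the id and the dependent variable
is_num <- vapply(dataframe, is.numeric, logical(1))
numeric_cols <- setdiff(names(dataframe)[is_num], c("id_variable", "dep_var"))

# Apply the function to the selected columns only
dataframe[numeric_cols] <- lapply(dataframe[numeric_cols], your_function)
```

Assigning a list back into dataframe[numeric_cols] replaces those columns one by one and leaves every other column as it was.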