-2

I am trying to write a function that converts selected columns of a data frame to a categorical data type (factor) before running a regression analysis.

How do I select a particular column of a data frame using a string (character)?

Example:

  strColumnNames <- "Admit,Rank"
  strDelimiter <- ","
  strSplittedColumnNames <- strsplit(strColumnNames, strDelimiter)
  for( strColName in strSplittedColumnNames[[1]] ){
    dfData$as.name(strColName) <- factor(dfData$get(strColName))
  }

Tried:

dfData$as.name()
dfData$get(as.name())
dfData$get()

Error Msg: Error: attempt to apply non-function

Any help would be greatly appreciated! Thank you!!!

1
  • 1
    I didn't know about the tick, and thanks for your guidance. It seems the tick is very important to users - pretty scary here. Commented Oct 9, 2016 at 3:59

2 Answers

3

You need to change

dfData$as.name(strColName) <- factor(dfData$get(strColName))

to

dfData[[strColName]] <- factor(dfData[[strColName]])

You may read ?"[[" for more.

In your case the column names are generated programmatically, so [[ is the way to go: $ treats whatever follows it as a literal column name and does not evaluate it. This example illustrates the problem with $:

dat <- data.frame(x = 1:5, y = 2:6)
z <- "x"

dat$z
# [1] NULL

dat[[z]]
# [1] 1 2 3 4 5
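
Putting this together, the loop from the question can be wrapped in a small helper. This is only a sketch: the function name convert_to_factor, its arguments, and the sample data are made up for illustration, and fixed = TRUE is assumed so that the delimiter is treated literally rather than as a regular expression.

```r
# Hypothetical helper: convert the named columns of a data frame to factors.
convert_to_factor <- function(df, col_string, delimiter = ",") {
  # strsplit() returns a list; take the first element for a single input string
  col_names <- trimws(strsplit(col_string, delimiter, fixed = TRUE)[[1]])

  # Fail early if a requested column does not exist
  missing_cols <- setdiff(col_names, names(df))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }

  for (col_name in col_names) {
    # [[ evaluates col_name, so the string it holds is used as the column name
    df[[col_name]] <- factor(df[[col_name]])
  }
  df
}

# Made-up sample data
dfData <- data.frame(Admit = c(0, 1, 1, 0),
                     Rank  = c(1, 2, 3, 2),
                     Gre   = c(380, 660, 800, 640))

dfData <- convert_to_factor(dfData, "Admit,Rank")
str(dfData)
```

The function returns the modified data frame instead of changing dfData in place, so the result has to be assigned back as shown.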

Regarding the other answer

apply does not work here, because the function being applied is as.factor or factor. apply always works on a matrix (if you pass it a data frame, it converts it to a matrix first) and returns a matrix, and a matrix cannot hold factor columns. Consider this example:

x <- data.frame(x1 = letters[1:4], x2 = LETTERS[1:4], x3 = 1:4, stringsAsFactors = FALSE)
x[, 1:2] <- apply(x[, 1:2], 2, as.factor)

str(x)
#'data.frame':  4 obs. of  3 variables:
# $ x1: chr  "a" "b" "c" "d"
# $ x2: chr  "A" "B" "C" "D"
# $ x3: int  1 2 3 4

Note that x1 and x2 are still character variables rather than factors. We have to use lapply instead:

x[1:2] <- lapply(x[1:2], as.factor)

str(x)
#'data.frame':  4 obs. of  3 variables:
# $ x1: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
# $ x2: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# $ x3: int  1 2 3 4

Now we see the factor class in x1 and x2.

Using apply on a data frame is not a good idea. If you read the source code of apply, you will find:

    dl <- length(dim(X))
    if (is.object(X)) 
    X <- if (dl == 2L) 
        as.matrix(X)
    else as.array(X)

You can see that a data frame (which has two dimensions) is coerced to a matrix first. This is very slow. If the columns of your data frame have different classes, the resulting matrix has only one, so values may be silently converted to a common type.
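
The coercion can be checked directly with as.matrix. The data frame below is made up for illustration; it mixes a character column with an integer column.

```r
# Made-up data frame with one character column and one integer column
df <- data.frame(id = c("a", "b", "c"), n = 1:3, stringsAsFactors = FALSE)

m <- as.matrix(df)   # the same coercion apply() performs on a data frame
is.character(m)      # TRUE: the integer column was converted to character
m[, "n"]             # the numbers are now strings

# Working column by column keeps each column's own class
sapply(df, class)
```

Because a matrix has a single type, the integer column n does not survive the conversion, which is why column-wise functions such as lapply are the safer choice for data frames.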

Moreover, apply is written in R, not C, and uses an ordinary for loop:

 for (i in 1L:d2) {
        tmp <- forceAndCall(1, FUN, newX[, i], ...)
        if (!is.null(tmp)) 
            ans[[i]] <- tmp

so it is no better than an explicit for loop you write yourself.


0

I would use a different method.

Create a vector of column names you want to change to factors:

factorCols <- c("Admit", "Rank")

Then find the positions of these columns:

myCols <- which(names(dfData) %in% factorCols)

Finally, use lapply to change these columns to factors:

dfData[myCols] <- lapply(dfData[myCols], as.factor)
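
The index lookup with which() can also be skipped, since a character vector can subset a data frame directly. A minimal sketch, with made-up sample data standing in for dfData:

```r
# Made-up sample data
dfData <- data.frame(Admit = c(0, 1, 1, 0),
                     Rank  = c(1, 2, 3, 2),
                     Gre   = c(380, 660, 800, 640))
factorCols <- c("Admit", "Rank")

# Single-bracket indexing by name returns a data frame (a list of columns),
# so lapply() visits each column and the results are assigned back by name
dfData[factorCols] <- lapply(dfData[factorCols], as.factor)

sapply(dfData, class)
```

Indexing with a single bracket and no comma keeps the result a data frame even when factorCols holds only one name, so lapply still works column by column.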

4 Comments

hey greghk, is there any reason behind the choice of this method vs that proposed by Zheyuan? thank you!
@ZheyuanLi you're right, lapply would be better for reproducibility in the future. My point is that the code could be made more concise and easy to understand by using an *apply function rather than a for loop
@ZheyuanLi I've edited to reflect your points about lapply
hey greghk! Thanks for your inputs as well! Let me test both codes out over the coming weekend! really appreciate both your help!
