I want to apply mean encoding to a high-cardinality variable using 5-fold cross-validation. My code is:

library(dplyr)

df <- data.frame(sample(c(1,2,3,4,5), 1000, replace=TRUE), sample(c(1,0), 1000, replace=TRUE))
colnames(df) <- c("var", "target")

encode <- function(df, target_var, column_var){

set.seed(520)
df$group <- as.factor(sample(c(1,2,3,4,5), nrow(df), replace=T, prob=c(0.2,0.2,0.2,0.2,0.2)))

var.enc <- df %>% 
               select_("group", column_var, target_var) %>% 
               group_by_("group", column_var) %>% 
               mutate(var_encoded = mean(target_var)) %>% 
               ungroup() %>% 
               select_(column_var, "var_encoded") %>% 
               distinct() %>% 
               group_by_(column_var) %>% 
               mutate(var.enc = mean(var_encoded)) %>% 
               distinct()

return(var.enc)
}

encoding <- encode(df = df, column_var = "var", target_var = "target")

When I run the code above, I get a warning:

In mean.default(target_var) : argument is not numeric or logical: returning NA

So, how can I pass the argument correctly to mean() inside my function? I have tried as.name(), but that does not work either. I have also tried mean(df[[target_var]]), but then the group_by has no effect and I get the global mean as the result.

EDIT: I have added a reproducible example.

I have added an example. Commented Aug 20, 2018 at 14:07

1 Answer

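Because the column names are passed as character strings, convert each one to a symbol with rlang::sym() and then unquote it with !! inside the dplyr verbs, so that it is evaluated as a column of the (grouped) data: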

library(dplyr)

encode <- function(df, target_var, column_var){

  set.seed(520)
  # Randomly assign each row to one of 5 folds
  df$group <- as.factor(sample(c(1,2,3,4,5), nrow(df),
         replace=TRUE, prob=c(0.2,0.2,0.2,0.2,0.2)))

  # Convert the column names from strings to symbols
  column_var <- rlang::sym(column_var)
  target_var <- rlang::sym(target_var)

  df %>% 
    select(group, !!column_var, !!target_var) %>% 
    group_by(group, !!column_var) %>% 
    mutate(var_encoded = mean(!!target_var)) %>%   # mean per fold and level
    ungroup() %>% 
    select(!!column_var, var_encoded) %>% 
    distinct() %>% 
    group_by(!!column_var) %>% 
    mutate(var.enc = mean(var_encoded)) %>%        # average of the fold means
    distinct()
}

Checking:

encoding <- encode(df = df, target_var = "target", column_var = "var")
encoding
# A tibble: 25 x 3
# Groups:   var [5]
#     var var_encoded var.enc
#   <dbl>       <dbl>   <dbl>
# 1     5       0.462   0.497
# 2     5       0.553   0.497
# 3     4       0.585   0.493
# 4     2       0.543   0.536
# 5     3       0.364   0.453
# 6     4       0.46    0.493
# 7     1       0.465   0.476
# 8     3       0.474   0.453
# 9     5       0.529   0.497
#10     1       0.417   0.476
# ... with 15 more rows
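
To see the pattern in isolation, here is a minimal sketch on a tiny data frame. The names group_mean, toy, avg and avg_pronoun are made up for illustration and are not part of the original code. It also shows the .data pronoun, which looks a column up by its string name within the current group; that is why it respects group_by, whereas df[[target_var]] always refers to the whole data frame and therefore returns the global mean.

```r
library(dplyr)

# Hypothetical helper: mean of `value_col` within each level of `group_col`,
# where both column names arrive as character strings.
group_mean <- function(df, group_col, value_col) {
  group_sym <- rlang::sym(group_col)   # string -> symbol
  value_sym <- rlang::sym(value_col)

  df %>%
    group_by(!!group_sym) %>%
    summarise(
      avg         = mean(!!value_sym),          # unquoted symbol
      avg_pronoun = mean(.data[[value_col]])    # .data pronoun, group-aware
    )
}

# Toy data: level 1 has targets 0 and 1, level 2 has targets 1 and 1
toy <- data.frame(var = c(1, 1, 2, 2), target = c(0, 1, 1, 1))
res <- group_mean(toy, "var", "target")
res
```

Both columns should hold the per-level means (0.5 for var = 1 and 1 for var = 2), not the global mean of 0.75.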