I want to apply mean encoding to a high-cardinality variable using 5-fold cross-validation. My code is:
library(dplyr)

# Reproducible example: a 5-level variable and a binary target
df <- data.frame(sample(c(1,2,3,4,5), 1000, replace=TRUE), sample(c(1,0), 1000, replace=TRUE))
colnames(df) <- c("var", "target")

encode <- function(df, target_var, column_var){
  set.seed(520)
  # Randomly assign each row to one of 5 folds
  df$group <- as.factor(sample(c(1,2,3,4,5), nrow(df), replace=TRUE, prob=c(0.2,0.2,0.2,0.2,0.2)))
  var.enc <- df %>%
    select_("group", column_var, target_var) %>%
    group_by_("group", column_var) %>%
    # Mean of the target per fold and level (this is the line that raises the warning)
    mutate(var_encoded = mean(target_var)) %>%
    ungroup() %>%
    select_(column_var, "var_encoded") %>%
    distinct() %>%
    group_by_(column_var) %>%
    # Average the per-fold means for each level
    mutate(var.enc = mean(var_encoded)) %>%
    distinct()
  return(var.enc)
}

encoding <- encode(df = df, column_var = "var", target_var = "target")
When I run the code above, I get a warning:
In mean.default(target_var) : argument is not numeric or logical: returning NA
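As far as I can tell, the warning can be reproduced outside the function. Since the data has no column called target_var, the name resolves to the function argument, which is just the string "target", so mean() receives a character vector instead of the column. A minimal base R sketch of that situation (the variable name mirrors the function argument; nothing else is assumed):

```r
# target_var holds the column *name*, not the column values
target_var <- "target"

# mean() on a character vector warns and returns NA;
# the warning is suppressed here only to keep the output clean
res <- suppressWarnings(mean(target_var))
print(res)  # NA
```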
So how can I pass the argument correctly to mean() inside my function? I have tried as.name(), but that does not work either. I have also tried mean(df[[target_var]]), but then the grouping from group_by is ignored and I get the global mean as the result.
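For reference, this is a sketch of the behaviour I am after, not a confirmed solution. It assumes dplyr 1.0 or later, where the .data pronoun, across() with all_of(), and the .groups argument of summarise() are available. The function name encode_sketch and the parameter k are made up for illustration. Unlike df[[target_var]], .data[[target_var]] should refer to the column within the current group.

```r
library(dplyr)

set.seed(520)
df <- data.frame(
  var = sample(1:5, 1000, replace = TRUE),
  target = sample(c(1, 0), 1000, replace = TRUE)
)

# Hypothetical helper: column names are passed as strings
encode_sketch <- function(df, target_var, column_var, k = 5) {
  # Randomly assign each row to one of k folds
  df$group <- sample(seq_len(k), nrow(df), replace = TRUE)

  df %>%
    # Group by fold and by the column named in column_var
    group_by(across(all_of(c("group", column_var)))) %>%
    # .data[[...]] looks the string up as a column of the current group
    summarise(var_encoded = mean(.data[[target_var]]), .groups = "drop") %>%
    # Average the per-fold means for each level
    group_by(across(all_of(column_var))) %>%
    summarise(var.enc = mean(var_encoded), .groups = "drop")
}

enc <- encode_sketch(df, target_var = "target", column_var = "var")
print(enc)
```

The result has one row per level of var, with the average of the per-fold target means in var.enc.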
EDIT: I have added a reproducible example.