Filter different columns of a dataframe in R based on another dataframe

Question

I have some semi-complex filtering I need to do:

Identifier <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5)
item1 <- c("a", "b", "c", "a", "b", "c", "d", "a", "b", "d", "b", "a", "c")
item2 <- c("x", "y", "z", "z", "x", "y", "z", "y", "z", "x", "y", "x", "y")
item3 <- c("p", "q", "r", "p", "q", "r", "p", "q", "r", "p", "q", "r", "p")
df1 <- data.frame(Identifier, item1, item2, item3)
df1

header <- c("Identifier","item1","item2","item3")
values <- c("1","b","y","p")
needed <- c("yes","yes","yes","no")
df2 <- data.frame(header, values, needed)
df2

I then want to use df2 to apply multiple filters to df1. So, based on df2, I want to:

Filter for "1" in df1$Identifier
Filter for "b" in df1$item1
Filter for "y" in df1$item2
Remove "p" in df1$item3

The goal is to keep df2 as a CSV file (editable in Excel), so that users can specify which columns they would like filtered and for what value. The filters would then stay dynamic, without anyone needing to edit the R code.

akrun · Accepted Answer · 2022-10-12 16:48:08Z

We may use Map in base R. Because each column is filtered on its own (keeping or removing elements), the resulting columns can have different lengths, so the result is a list rather than a data.frame.

Map(function(x, nm) {
  i1 <- match(nm, df2$header)  # row of df2 holding the rule for this column
  if (df2$needed[i1] == "yes") x[x == df2$values[i1]]  # keep matching elements
  else x[x != df2$values[i1]]  # drop matching elements
}, df1, names(df1))

-output

$Identifier
[1] 1 1 1

$item1
[1] "b" "b" "b" "b"

$item2
[1] "y" "y" "y" "y" "y"

$item3
[1] "q" "r" "q" "r" "q" "r" "q" "r"
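
If the goal is instead to keep whole rows of df1 that satisfy every rule at once, one option is to build a logical mask per rule and combine the masks with Reduce. The following is only a sketch under that assumption: filter_rows() is a hypothetical helper name (not from any package), and values are compared as character strings because df2$values is character.

```r
# Example data from the question
df1 <- data.frame(
  Identifier = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5),
  item1 = c("a", "b", "c", "a", "b", "c", "d", "a", "b", "d", "b", "a", "c"),
  item2 = c("x", "y", "z", "z", "x", "y", "z", "y", "z", "x", "y", "x", "y"),
  item3 = c("p", "q", "r", "p", "q", "r", "p", "q", "r", "p", "q", "r", "p")
)
df2 <- data.frame(
  header = c("Identifier", "item1", "item2", "item3"),
  values = c("1", "b", "y", "p"),
  needed = c("yes", "yes", "yes", "no")
)

# Hypothetical helper: keep the rows of `data` that satisfy every rule in `rules`
filter_rows <- function(data, rules) {
  # Ignore rules that name a column `data` does not have
  rules <- rules[rules$header %in% names(data), , drop = FALSE]
  # One logical vector per rule: TRUE where the row passes that rule
  masks <- Map(function(col, value, needed) {
    hit <- as.character(data[[col]]) == value
    if (needed == "yes") hit else !hit
  }, as.character(rules$header), as.character(rules$values),
     as.character(rules$needed))
  # A row is kept only if it passes all rules
  keep <- Reduce(`&`, masks, rep(TRUE, nrow(data)))
  # which() drops rows where the test is NA
  data[which(keep), , drop = FALSE]
}

result <- filter_rows(df1, df2)
result
#   Identifier item1 item2 item3
# 2          1     b     y     q
```

Columns of df1 that have no rule in df2 are left unfiltered in this sketch, since only the headers listed in df2 contribute a mask.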

If we want to keep the data as a data.frame, it may be better to replace the values that don't conform to the logic with NA

library(dplyr)
df1 %>%
  mutate(across(everything(), ~ {
    i1 <- match(cur_column(), df2$header)
    case_when((df2$needed[i1] == "yes" & .x == df2$values[i1]) |
      (df2$needed[i1] == "no" & .x != df2$values[i1]) ~ .x)
  }))

-output

   Identifier item1 item2 item3
1           1  <NA>  <NA>  <NA>
2           1     b     y     q
3           1  <NA>  <NA>     r
4          NA  <NA>  <NA>  <NA>
5          NA     b  <NA>     q
6          NA  <NA>     y     r
7          NA  <NA>  <NA>  <NA>
8          NA  <NA>     y     q
9          NA     b  <NA>     r
10         NA  <NA>  <NA>  <NA>
11         NA     b     y     q
12         NA  <NA>  <NA>     r
13         NA  <NA>     y  <NA>

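If we need a single row, we can take the first non-NA value of each column. Note that the values are picked per column, so they need not all come from the same row of df1.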

df1 %>%
  mutate(across(everything(), ~ {
    i1 <- match(cur_column(), df2$header)
    case_when((df2$needed[i1] == "yes" & .x == df2$values[i1]) |
      (df2$needed[i1] == "no" & .x != df2$values[i1]) ~ .x)
  })) %>%
  summarise(across(everything(), ~ .x[complete.cases(.x)][1]))

-output

   Identifier item1 item2 item3
1          1     b     y     q

thank you, I basically want the code to return just 1 record, (happy to not have NAs): 1 b y q
hi it does not work, it says one of the fields (not present in df2) "must be length 0 or one, not 36870."
@jkfirewood not sure. I am using your data and all the solutions are working for that data
