I have a large data frame with a column of alphanumeric codes, something like this:
| ID | death_cause |
|---|---|
| 1 | K703 |
| 2 | N19X |
| 3 | C069 |
| 4 | C07X |
| 5 | D181 |
| 6 | R99X |
| 7 | D371 |
| 8 | E117 |
| 9 | D489 |
| 10 | D500 |
I need to keep all codes starting with the letter C, plus codes starting with the letter D whose two-digit number is between 00 and 48 (e.g. D00, D10, D20, D48). Codes from D49 onwards are not needed.
I have managed to keep the C codes, since it is easy to detect strings starting with the letter C using dplyr and stringr:
df_filtered <- df %>%
  filter(str_detect(death_cause, "^C"))
However, I need to keep the specific D codes as well. One idea I had was to create a vector of the D codes:
D_codes <- paste("D", 00:48, sep = "")
My question is how to filter for those additional patterns alongside the C codes using dplyr and stringr (or tidyverse functions in general).
I tried:
df_filtered <- df %>%
  filter(str_detect(death_cause, "^C") | str_detect(death_cause, D_codes))
I would appreciate any help you can give me.
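You can combine a prefix match with a string comparison, assuming every code is a letter followed by two digits:
df %>%
  filter(grepl("^C|^D", death_cause), death_cause < "D49")
Note that `<` on strings follows the collation order of the current locale.
Alternatively, with stringr, keep all C codes and compare the two digits after the D numerically:
df %>%
  filter(str_detect(death_cause, "^C") |
           (str_detect(death_cause, "^D") &
              between(as.numeric(str_sub(death_cause, 2, 3)), 0, 48)))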
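Another option is a single regular expression, which avoids both the numeric conversion and the string comparison. The sketch below rebuilds the example data from the question and assumes every code is one letter followed by at least two digits. The names `keep_pattern`, `vector_pattern` and `df_filtered_2` are made up for illustration. It also reuses the `D_codes` idea: `paste("D", 00:48, sep = "")` produces "D0", "D1", ... because `00:48` is just `0:48`, so the codes are zero-padded with `sprintf()` instead. The vector is then collapsed into one pattern, since `str_detect()` matches a pattern vector element by element against the strings rather than testing "any of these".

```r
library(dplyr)
library(stringr)

# Example data copied from the question
df <- data.frame(
  ID = 1:10,
  death_cause = c("K703", "N19X", "C069", "C07X", "D181",
                  "R99X", "D371", "E117", "D489", "D500"),
  stringsAsFactors = FALSE
)

# Keep every C code, plus D codes whose two digits are 00-39 or 40-48
keep_pattern <- "^(C|D([0-3][0-9]|4[0-8]))"

df_filtered <- df %>%
  filter(str_detect(death_cause, keep_pattern))

# Variant that reuses the vector idea: zero-pad with sprintf(),
# then collapse all codes into one alternation pattern
D_codes <- sprintf("D%02d", 0:48)
vector_pattern <- paste0("^(C|", paste(D_codes, collapse = "|"), ")")

df_filtered_2 <- df %>%
  filter(str_detect(death_cause, vector_pattern))
```

On the example data both versions keep the rows with C069, C07X, D181, D371 and D489. The explicit regex is shorter, while the `sprintf()` variant may be easier to adapt if the range of D codes changes.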