
I have a large dataframe, in which there is a column with number and letter codes. Something like this:

ID death_cause
1 K703
2 N19X
3 C069
4 C07X
5 D181
6 R99X
7 D371
8 E117
9 D489
10 D500

I need to keep all codes starting with the letter C, plus the codes starting with the letter D whose number is between 00 and 48 (e.g. D00, D10, D20, D48). Codes from D49 onwards are not needed.

I have managed to keep the letter C codes, since it is easy to select the values starting with the letter C with dplyr and stringr.

df_filtered <- df %>% 
  filter(str_detect(death_cause, "^C"))

However, I need to keep the specific D-codes as well. One idea I had is to create a vector of the D-code prefixes:

D_codes <- paste("D", 00:48, sep = "")

My question is how to filter those other character patterns next to the C codes with dplyr and stringr (tidyverse, in general) functions.

I tried:

df_filtered <- df %>% 
  filter(str_detect(death_cause, "^C") | str_detect(death_cause, D_codes))

I would appreciate any help you can give me.

  • I think you can probably get away with df %>% filter(grepl("^C|^D", death_cause), death_cause < "D49"). Commented Oct 7, 2023 at 1:53
  • alternatively df %>% filter(str_detect(death_cause, '^C') | (str_detect(death_cause, '^D') & between(as.numeric(str_sub(death_cause, 2, 3)), 0, 48))) Commented Oct 7, 2023 at 3:51
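
The rule can also be written as one anchored regular expression. The following is a minimal sketch, not a definitive solution: it assumes every D code has exactly two digits right after the letter, and the sample tibble and the name df_regex are only illustrative, built from the data shown in the question.

```r
library(dplyr)
library(stringr)

# Sample data copied from the question (assumed representative of the real df)
df <- tibble(
  ID = 1:10,
  death_cause = c("K703", "N19X", "C069", "C07X", "D181",
                  "R99X", "D371", "E117", "D489", "D500")
)

# ^C                      any code starting with C
# ^D([0-3][0-9]|4[0-8])   D followed by 00-39 or 40-48
pattern <- "^(C|D([0-3][0-9]|4[0-8]))"

df_regex <- df %>%
  filter(str_detect(death_cause, pattern))
```

On the sample data this keeps C069, C07X, D181, D371 and D489, and drops D500.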

1 Answer


You’re on the right track. You’ll want to pad the single digit numerals for your D codes:

library(stringr)
library(dplyr)

D_codes <- str_c("D", str_pad(0:48, 2, pad = "0"))

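And just use %in% rather than str_detect(). Since the values in death_cause have four characters and the entries in D_codes have three, compare only the first three characters:

df %>% 
  filter(str_starts(death_cause, "C") | str_sub(death_cause, 1, 3) %in% D_codes)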

(Also note str_starts() as an alternative to str_detect() in this case.)
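
For reference, here is a self-contained sketch of this approach run on the sample data from the question. It assumes the codes are always a letter followed by at least two digits, so that the first three characters identify the category. The name df_filtered comes from the question, and the tibble is only a reconstruction of the data shown there.

```r
library(dplyr)
library(stringr)

# Sample data copied from the question
df <- tibble(
  ID = 1:10,
  death_cause = c("K703", "N19X", "C069", "C07X", "D181",
                  "R99X", "D371", "E117", "D489", "D500")
)

# Zero-padded prefixes "D00", "D01", ..., "D48"
D_codes <- str_c("D", str_pad(0:48, 2, pad = "0"))

df_filtered <- df %>%
  # keep every C code, plus D codes whose 3-character prefix is in D_codes
  filter(str_starts(death_cause, "C") | str_sub(death_cause, 1, 3) %in% D_codes)

df_filtered
```

On this sample the rows kept are C069, C07X, D181, D371 and D489. D500 is dropped because its prefix "D50" is not in D_codes.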
