filter/subset/delete rows that contain character in middle of string in R

Question

I've got a dataframe with a column containing peptide sequences and I want to keep only rows that have no internal "R" or "K" in their string.

df1 <- data.frame(
    Peptide = c("ABCOIIJUHFSAUJHR", "AOFIAUKOAISDFUK", 'ASOIRDFHAOHFKK'))


df1 #check output

As output I would like to keep only the first row (i.e. "ABCOIIJUHFSAUJHR").

I have tried using filter (dplyr) and str_locate_all from the stringr package and length but couldn't figure it out.

Any help would be much appreciated.

Thanks Moe

akrun · Accepted Answer · 2018-04-20 03:59:21Z

5

We can skip the first and last characters (^. and .$), require that everything in between is zero or more characters that are not R or K ([^RK]*), and pass that pattern to grepl to subset the dataset:

df1[grepl("^.[^RK]*.$", df1$Peptide), , drop = FALSE]
#           Peptide
#1 ABCOIIJUHFSAUJHR
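
To see why the anchored pattern works, it can help to compare it against a more literal version of the same idea: cut off the first and last character, then look for R or K in what is left. The sketch below is only an illustration; the names peptides, internal and has_internal_rk are not from the original post.

```r
peptides <- c("ABCOIIJUHFSAUJHR", "AOFIAUKOAISDFUK", "ASOIRDFHAOHFKK")

# Drop the first and last character, leaving only the internal residues
internal <- substr(peptides, 2, nchar(peptides) - 1)

# TRUE where an internal R or K is present
has_internal_rk <- grepl("[RK]", internal)

# The anchored pattern from the answer keeps exactly the rows
# that have no internal R or K
keep <- grepl("^.[^RK]*.$", peptides)
stopifnot(identical(keep, !has_internal_rk))
```

For these three peptides both approaches agree; the single regex is simply the more compact way to write it.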


3 Comments

Frank Over a year ago

An edge case: single-character string: grepl("^.[^RK]*.$", "A")

akrun Over a year ago

@Frank that is a good one. I think we have to add an alternation (|) to the pattern for those edge cases
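
One possible way to write such an alternation is sketched below. It assumes a single-character string should be kept (it has no internal characters at all); that is an assumption, not something stated in the thread, and the pattern itself is hypothetical.

```r
# Hypothetical variant: either a lone character, or
# first character + R/K-free middle + last character
pattern <- "^(.|.[^RK]*.)$"

x <- c("A", "AR", "AKR", "ABCOIIJUHFSAUJHR")
res <- grepl(pattern, x)

# The original pattern rejects the single-character string
orig <- grepl("^.[^RK]*.$", x)
```

Here res is TRUE for every element except "AKR", while orig is additionally FALSE for "A".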

Moe Over a year ago

"^[^R|K]*.$" is what I ended up using. I removed the leading . (full stop) because I realized that I actually also wanted to filter out K or R at the beginning.
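
A small note on that pattern: inside square brackets | is a literal character, not "or", so [^R|K] also excludes a literal pipe. For peptide strings this makes no practical difference, but [^RK] is enough. A minimal check, using a made-up string that contains a pipe:

```r
peptides <- c("ABCOIIJUHFSAUJHR", "AOFIAUKOAISDFUK", "ASOIRDFHAOHFKK")

# On the example data both patterns select the same rows
with_pipe    <- grepl("^[^R|K]*.$", peptides)
without_pipe <- grepl("^[^RK]*.$", peptides)
stopifnot(identical(with_pipe, without_pipe))

# They only differ on input containing a literal "|" (made-up example)
pipe_in_class  <- grepl("^[^R|K]*.$", "AB|CR")  # the | is excluded as well
pipe_not_class <- grepl("^[^RK]*.$", "AB|CR")
```

On the made-up string, pipe_in_class is FALSE and pipe_not_class is TRUE.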

Melissa Key · Accepted Answer · 2018-04-20 04:12:12Z

3

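Here's the dplyr solution: str_detect (from stringr) is the tidyverse equivalent of grepl, so the code looks like this:

library(dplyr)
library(stringr)

# keep rows whose Peptide has no R or K between the first and last character
df2 <- df1 %>%
  filter(Peptide %>% str_detect("^.[^RK]*.$"))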


2 Comments

Moe Over a year ago

Thanks! I ended up using dplyr::filter(Peptide %>% str_detect("^[^R|K]*.$")), because I also wanted to filter out R or K at the beginning of the string (see my comment above).

Melissa Key Over a year ago

Having some familiarity with trypsin myself, I figured that was what you wanted, but it was accepting all 3 peptides when I didn't include ^. at the beginning. Glad you got it working!
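
One plausible reading of "accepting all 3 peptides" is that the ^ anchor was dropped together with the dot. Without ^, the match may start anywhere, so [^RK]*.$ is satisfied by the last character alone. The sketch below uses base grepl to stay dependency-free; these simple patterns behave the same way when passed to str_detect.

```r
peptides <- c("ABCOIIJUHFSAUJHR", "AOFIAUKOAISDFUK", "ASOIRDFHAOHFKK")

# No ^ anchor: the match can start at the last character, so every string passes
unanchored <- grepl("[^RK]*.$", peptides)

# With ^ the run of non-R/K characters must start at the beginning of the string
anchored <- grepl("^[^RK]*.$", peptides)
```

With the anchor in place, only the first peptide is kept; without it, all three are.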
