I have a data frame with a column of peptide sequences, and I want to keep only the rows whose sequence has no internal "R" or "K" (an "R" or "K" as the last character is fine).

df1 <- data.frame(
    Peptide = c("ABCOIIJUHFSAUJHR", "AOFIAUKOAISDFUK", 'ASOIRDFHAOHFKK'))


df1 #check output

As output I would like to keep only the first row (i.e. "ABCOIIJUHFSAUJHR").

I tried combining dplyr's filter() with stringr's str_locate_all() and length(), but couldn't figure it out.

Any help would be much appreciated.

Thanks, Moe

2 Answers

We can skip the first and last characters (^. and .$) and require everything in between to be zero or more characters that are not an R or K ([^RK]*). Passing that pattern to grepl gives a logical vector we can use to subset the dataset:

df1[grepl("^.[^RK]*.$", df1$Peptide), , drop = FALSE]
#           Peptide
#1 ABCOIIJUHFSAUJHR
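
One caveat with this pattern: because it consumes both a first and a last character, it needs at least two characters and therefore rejects a one-character string. A minimal base R sketch of one way to relax that is shown here. The variant pattern `^.([^RK]*.)?$` and the vector name `peptides` are assumptions made for illustration, not part of the original answer.

```r
# Example peptides from the question, plus a one-character string (edge case).
peptides <- c("ABCOIIJUHFSAUJHR", "AOFIAUKOAISDFUK", "ASOIRDFHAOHFKK", "A")

# Original pattern: needs at least two characters, so "A" is rejected.
strict <- grepl("^.[^RK]*.$", peptides)

# Assumed variant: the group "([^RK]*.)" is optional, so a lone character
# also matches; longer strings still may not contain an internal R or K.
relaxed <- grepl("^.([^RK]*.)?$", peptides)

print(data.frame(peptides, strict, relaxed))
```

Whether a one-character peptide should be kept at all depends on the data, so treat the relaxed pattern as an option rather than a fix.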

3 Comments

An edge case: single-character string: grepl("^.[^RK]*.$", "A")
@Frank That is a good one. I think we have to add an alternation (|) for those edge cases.
"^[^R|K]*.$" is what I ended up using. I removed the leading . (full stop) because I realized that I actually also wanted to filter out K|R at the beginning.

Here's the dplyr solution: str_detect (from stringr) is the tidyverse counterpart to grepl, so the code looks like this:

library(dplyr)   # filter() and %>%
library(stringr) # str_detect()
df2 <- df1 %>%
  filter(Peptide %>% str_detect("^.[^RK]*.$"))

2 Comments

Thanks! I ended up using dplyr::filter(Peptide %>% str_detect("^[^R|K]*.$")), because I also wanted to filter out the R|K at the beginning of the string (see my comment above).
Having some familiarity with trypsin myself, I figured that was what you wanted, but it was accepting all 3 peptides when I didn't include ^. at the beginning. Glad you got it working!
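
A side note on the pattern from the comments: inside a bracket expression the `|` is a literal character, not an "or", so `[^R|K]` excludes R, K and `|`. The base R sketch below compares it with `[^RK]`. The extra test strings `"RABCD"` and `"AB|CD"` are hypothetical inputs added for illustration.

```r
# Peptides from the question, plus one that starts with R (hypothetical).
peptides <- c("ABCOIIJUHFSAUJHR", "AOFIAUKOAISDFUK", "ASOIRDFHAOHFKK", "RABCD")

# Inside [...] the "|" is literal, so [^R|K] excludes R, K and "|".
with_bar    <- grepl("^[^R|K]*.$", peptides)
without_bar <- grepl("^[^RK]*.$", peptides)

# The two patterns agree on these peptides, none of which contains a "|".
same_on_peptides <- identical(with_bar, without_bar)

# They differ only on input that contains a literal "|".
bar_with    <- grepl("^[^R|K]*.$", "AB|CD")
bar_without <- grepl("^[^RK]*.$", "AB|CD")

print(data.frame(peptides, with_bar, without_bar))
```

For peptide strings either pattern should give the same rows; `[^RK]` simply states the intent without the stray character.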
