How to remove part of string using R regex with boundary

Question

I have these 3 example strings:

x <- "AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer(0.989)More Information | Similar Motifs Found"
y <- "NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer(0.828)More Information | Similar Motifs Found"
z <- "SPIB/MA0081.1/Jaspar(0.753)More Information | Similar Motifs Found"

What I want to do is remove everything that comes after the first word following the last / delimiter, resulting in:

AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer
NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer
SPIB/MA0081.1/Jaspar

I tried this but it doesn't give what I want:

> sub("\\(.*?\\)More Information | Similar Motifs Found","",x)
[1] "AP-1| Similar Motifs Found"

What's the right way to do it?

akuiper · Accepted Answer · 2017-11-17 01:26:43Z

You can use the greedy pattern (.*/\\w+).* to match up to the word that follows the last /, then keep only that captured group with a backreference:

v <- c(
  "AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer(0.989)More Information | Similar Motifs Found",
  "NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer(0.828)More Information | Similar Motifs Found",
  "SPIB/MA0081.1/Jaspar(0.753)More Information | Similar Motifs Found"
)

sub("(.*/\\w+).*", "\\1", v)
# [1] "AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer"          "NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer"
# [3] "SPIB/MA0081.1/Jaspar"

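In (.*/\\w+).*, the first .* is greedy and matches as many characters as possible, so it backtracks only as far as the last / that is followed by a word (matched by \\w+). The second .* matches the rest of the string, and replacing the whole match with \\1 leaves only the captured part.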
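
To see why the greedy quantifier matters, here is a minimal sketch that contrasts it with a lazy one on the third example string. The helper name keep_through_last_slash_word is made up for illustration, the ^ and $ anchors are an optional addition to the answer's pattern, and perl = TRUE is used for the lazy variant only so that the lazy quantifier follows PCRE semantics.

```r
# Hypothetical helper (the name is illustrative, not part of the answer):
# keep everything up to and including the word after the last "/".
keep_through_last_slash_word <- function(s) {
  # Greedy .* runs to the end of the string, then backtracks to the last "/"
  # that is followed by at least one word character.
  sub("^(.*/\\w+).*$", "\\1", s)
}

z <- "SPIB/MA0081.1/Jaspar(0.753)More Information | Similar Motifs Found"

greedy <- keep_through_last_slash_word(z)

# Lazy variant for contrast: .*? stops at the FIRST "/" followed by a word.
lazy <- sub("^(.*?/\\w+).*$", "\\1", z, perl = TRUE)

# A string with no "/" does not match, so sub() returns it unchanged.
no_slash <- keep_through_last_slash_word("no delimiter here")

print(greedy)    # "SPIB/MA0081.1/Jaspar"
print(lazy)      # "SPIB/MA0081"
print(no_slash)  # "no delimiter here"
```

The lazy version cuts at the first / rather than the last one, which is why the greedy form is the one that fits this question. Note that \\w+ stops at the first non-word character, so a final segment containing - or . would be shortened as well.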
