Extracting specific parts of an input string using stringr package in R

Question

Basically, this is my input:

"a ~ b c d*e !r x"
"a ~ b c"
"a ~ b c d1 !r y"
"a ~ b c D !r z"
"a~b c d*e!r z"

and I would like this as my result:

"b c d*e"
"b c"
"b c d1"
"b c D"
"b c d*e"

The input represents (mixed) models that are built up of three groups: the dependent part (ending at the ~), the fixed part, and the random part (introduced by !r). I thought capture groups would make this easy enough (example). The difficulty is that the random part is not always present.

I tried different things, as you can see below, and of course it is possible to do this in two steps. However, I would like a (robust) regex one-liner; I feel that should be possible. I also used these sources for inspiration: non-capturing groups, string replacing and string removal.

library(stringr)
txt <- c("a ~ b c d*e !r x",
         "a ~ b c",
         "a ~ b c d1 !r y",
         "a ~ b c D !r z",
         "a~b c d*e!r z")

# Different tries with capture groups
str_replace(txt, "^.*~ (.*) !r.*$", "\\1")
> [1] "b c d*e"       "a ~ b c"       "b c d1"        "b c D"        
> [5] "a~b c d*e!r z"
str_replace(txt, "^(.*~ )(.*)( !r.*)$", "\\2")
> [1] "b c d*e"       "a ~ b c"       "b c d1"        "b c D"        
> [5] "a~b c d*e!r z"
str_replace(txt, "^(.*~)(.*)(!r.*|\n)$", "\\1\\2")
> [1] "a ~ b c d*e " "a ~ b c"      "a ~ b c d1 "  "a ~ b c D "  
> [5] "a~b c d*e"
str_replace(txt, "^(.*) ~ (.*)!r.*($)", "\\2")
> [1] "b c d*e "      "a ~ b c"       "b c d1 "       "b c D "       
> [5] "a~b c d*e!r z"
str_replace(txt, "^.* ~ (.*)(!r.*|\n)$", "\\1")
> [1] "b c d*e "      "a ~ b c"       "b c d1 "       "b c D "       
> [5] "a~b c d*e!r z"


# Multiple steps
step1 <- str_replace(txt, "^.*~\\s*", "")
step2 <- str_replace(step1, "\\s*!r.*$", "")
step2
> "b c d*e" "b c"     "b c d1"  "b c D"   "b c d*e"

EDIT: After posting I kept playing around and found something that worked for my particular case.

# My (probably non-robust) solution/monstrosity
str_replace(txt, "(^.*~\\s*(.*)\\s*!r.*$|^.*~\\s*(.*)$)", "\\2\\3")
> "b c d*e " "b c"      "b c d1 "  "b c D "   "b c d*e"

Wiktor Stribiżew · Accepted Answer · 2018-08-03 16:10:41Z

3

I suggest removing everything from the start of the string up to and including the first tilde (plus any whitespace after it), and everything from the first !r (matched as a whole word) to the end of the string:

gsub("^[^~]+~\\s*|\\s*!r\\b.*", "", txt)

See the regex demo

Details

^ - start of string
[^~]+ - 1+ chars other than ~
~ - a ~ char
\\s* - 0+ whitespaces
| - or
\\s* - 0+ whitespaces
!r - !r substring
\\b - word boundary
.* - the rest of the string.

R demo:

txt <- c("a ~ b c d*e !r x",
         "a ~ b c",
         "a ~ b c d1 !r y",
         "a ~ b c D !r z",
         "a~b c d*e!r z")
gsub("^[^~]+~\\s*|\\s*!r\\b.*", "", txt)
## => [1] "b c d*e" "b c"     "b c d1"  "b c D"   "b c d*e"
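
Since the question asks about stringr, the same pattern can probably be used unchanged with str_remove_all(), which deletes every match much like gsub(pattern, "", x). This is a minimal sketch and not part of the original answer; stringr uses the ICU regex engine rather than base R's, so it is worth re-checking on real inputs. The name fixed_part is only an illustrative choice.

```r
library(stringr)

txt <- c("a ~ b c d*e !r x",
         "a ~ b c",
         "a ~ b c d1 !r y",
         "a ~ b c D !r z",
         "a~b c d*e!r z")

# Same alternation as in the gsub() call above:
#   left branch  removes the dependent part up to and including "~" and spaces
#   right branch removes the random part, starting at optional spaces before "!r"
# fixed_part is a hypothetical variable name used for illustration.
fixed_part <- str_remove_all(txt, "^[^~]+~\\s*|\\s*!r\\b.*")
fixed_part
```

str_remove_all() is needed rather than str_remove(), because a string with a random part contains two matches (one per branch of the alternation).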

answered Aug 3, 2018 at 16:10


1 Comment

tstev Over a year ago

I ended up using this for my final solution. Hence, this was chosen as answer.

s_baldur · Answer · 2018-08-03 15:45:06Z

3

What about str_extract() using a positive lookbehind?

str_extract(st, "(?<=~)[^!]+") %>% trimws()
[1] "b c d*e" "b c"     "b c d1"  "b c D"   "b c d*e"

My try to rephrase in English:

We are looking for something that is preceded by a ~ (?<=~) and is a sequence of one or more characters that are not ! [^!]+. Once we have found something that fits these criteria, we stop searching that string (otherwise use str_extract_all()). Finally, if what we extracted has any spaces at the start or end of the string, we remove them with trimws().

Data:

st <- c(
  'a ~ b c d*e !r x',
  'a ~ b c',
  'a ~ b c d1 !r y',
  'a ~ b c D !r z',
  'a~b c d*e!r z'
)
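
To make the two steps in the explanation above easier to follow, here is a small sketch that keeps the intermediate result before trimming. It uses the same pattern as the answer; the variable names raw and trimmed are only illustrative.

```r
library(stringr)

st <- c(
  'a ~ b c d*e !r x',
  'a ~ b c',
  'a ~ b c d1 !r y',
  'a ~ b c D !r z',
  'a~b c d*e!r z'
)

# Step 1: take everything after the first "~" up to (not including) the first "!".
# The lookbehind (?<=~) requires a "~" just before the match without consuming it.
# raw and trimmed are hypothetical names used for illustration.
raw <- str_extract(st, "(?<=~)[^!]+")
raw
# [1] " b c d*e " " b c"      " b c d1 "  " b c D "   "b c d*e"

# Step 2: strip the leading/trailing whitespace left over from step 1.
trimmed <- trimws(raw)
trimmed
# [1] "b c d*e" "b c"     "b c d1"  "b c D"   "b c d*e"
```

Because [^!]+ stops at any !, this approach assumes that ! never appears inside the fixed part.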

EDIT

Few updates already as examples of inputs grow. Will not update again.

edited Aug 3, 2018 at 15:45

answered Aug 3, 2018 at 14:48


6 Comments

tstev Over a year ago

Interesting, will play around with this. After posting I came up with my own regex monstrosity that seems to work (also works on more cases); str_replace(st, "(^.*~\\s*(.*)\\s*!r.*$|^.*~\\s*(.*)$)", "\\2\\3"). It's a shame that with the first string there is an extra space at the end. Nothing str_trim can't handle, but still...

tstev Over a year ago

you mind if I throw some more cases at your solution that might "break" it?

s_baldur Over a year ago

Sure throw them in - but better that you asked.

tstev Over a year ago

My question is an oversimplification of my actual problem. For example, I used a and b but in actual fact this can also be b1 or X. So additional cases like st <- c(st, "a ~ b c d1 !r y", "a ~ b c D ! r z") won't get desired result. EDIT: str_extract(txt, "(?<=~\\s)[a-zA-Z0-9*\\s]+(?=\\b)") improves it already...

s_baldur Over a year ago

@tstev Made another one and a final update. The current one only has a lookbehind which is (?<=~) and means has to be preceded by ~.


Michał Turczyn · Answer · 2018-08-03 16:18:32Z

1

This pattern lets you extract the text you want with the first capturing group: ~ ?([\w\*\-\+\/ ]+)(!r)?

First capturing group: [\w\*\-\+\/ ]+ matches any word character \w, or *, +, -, / or a space, one or more times (+). It is terminated before the second capturing group (!r)?, if present.

Demo
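
In R with stringr, one way to read out a capturing group is str_match(), which returns a character matrix with the full match in column 1 and each group in the following columns. The sketch below applies the pattern above with R's doubled backslashes; the str_trim() call is my addition (the character class includes the space, so the group can end in one), and fixed_part is just an illustrative name.

```r
library(stringr)

txt <- c("a ~ b c d*e !r x",
         "a ~ b c",
         "a ~ b c d1 !r y",
         "a ~ b c D !r z",
         "a~b c d*e!r z")

# Column 1 of the result is the whole match, column 2 is the first capturing
# group (the fixed part), column 3 is the optional "!r" group (NA when absent).
m <- str_match(txt, "~ ?([\\w\\*\\-\\+\\/ ]+)(!r)?")

# The group may carry a trailing space (e.g. "b c d*e "), so trim it.
# fixed_part is a hypothetical variable name used for illustration.
fixed_part <- str_trim(m[, 2])
fixed_part
```

Note that this relies on the fixed part containing only word characters, spaces and the listed operators; any other character would cut the match short.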

edited Aug 3, 2018 at 16:18

answered Aug 3, 2018 at 15:51


1 Comment

tstev Over a year ago

Thanks for the explanation! However I can't seem to get this to work in R with the stringr package. i.e. it didn't remove the characters before the ~ or after the !r so I edited to: str_replace(txt, ".*~ ?([\\w\\*\\-\\+\\/ ]*)(!r.*)?", "\\1") and this seems to work for my cases. Perhaps you meant to use in a different way?
