Say I have a vector of strings like the following:
vector <- c("hi, how are you doing?",
            "what time is it?",
            "the sky is blue",
            "hi, how are you doing today? You seem tired.",
            "walk the dog",
            "the grass is green",
            "the sky is blue during the day")
vector
[1] "hi, how are you doing?"
[2] "what time is it?"
[3] "the sky is blue"
[4] "hi, how are you doing today? You seem tired."
[5] "walk the dog"
[6] "the grass is green"
[7] "the sky is blue during the day"
How can I identify all entries whose first 4 words match and then keep only the longest string in each matching group? I would like the result to look like the following vector:
vector
[1] "what time is it?"
[2] "hi, how are you doing today? You seem tired."
[3] "walk the dog"
[4] "the grass is green"
[5] "the sky is blue during the day"
Ideally I'd like a solution using stringr so I can feed it into a pipe.
UPDATE: Robustness check with different values:
The solution from @Wimpel is brilliant but, as @Wimpel pointed out, it doesn't quite work in all scenarios. See for example:
vector <- c("hi, how are you doing?",
            "what time is it?",
            "the sky is blue",
            "hi, how are you doing today? You seem tired.",
            "walk the dog",
            "the grass is green",
            "the sky is blue during the day",
            "12/7/2018",
            "8/12/2018",
            "9/9/2016 ")
library(dplyr)
library(stringr)

df <- data.frame(text = vector, stringsAsFactors = FALSE)
# Group entries by their first 4 words
df$group_id <- df %>% group_indices(stringr::word(text, start = 1, end = 4))
df %>%
  mutate(length = str_count(text, " ") + 1,  # approximate word count
         row_id = row_number()) %>%          # remember the original order
  group_by(group_id) %>%
  arrange(-length) %>%
  slice(1) %>%                               # keep the longest entry per group
  ungroup() %>%
  arrange(row_id) %>%
  select(text)
1 what time is it?
2 hi, how are you doing today? You seem tired.
3 walk the dog
4 the grass is green
5 the sky is blue during the day
In the above example, the three date strings are also dropped, even though they do not match any other entry.
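As far as I can tell, the likely cause is that stringr::word() returns NA when a string has fewer words than the requested end. Every short entry ("walk the dog" and the three dates) would then share a single NA group, and only the longest of them survives. The sketch below checks this and tries one possible workaround: falling back to the full string as the grouping key. The column names key and n_words are made up for illustration, and slice_max() assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(stringr)

vector <- c("hi, how are you doing?",
            "what time is it?",
            "the sky is blue",
            "hi, how are you doing today? You seem tired.",
            "walk the dog",
            "the grass is green",
            "the sky is blue during the day",
            "12/7/2018",
            "8/12/2018",
            "9/9/2016 ")

# TRUE for entries with fewer than 4 words: word() gives them an NA key
short_entry <- is.na(word(vector, start = 1, end = 4))

result <- tibble(text = vector) %>%
  mutate(key = word(text, start = 1, end = 4),
         key = coalesce(key, text),         # short entries use their own text as key
         n_words = str_count(text, " ") + 1,
         row_id = row_number()) %>%
  group_by(key) %>%
  slice_max(n_words, n = 1, with_ties = FALSE) %>%  # longest entry per group
  ungroup() %>%
  arrange(row_id) %>%                       # restore the original order
  pull(text)

result
```

With this fallback, identical short strings would still collapse into one entry, which may or may not be what is wanted.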