Remove partial string based on regular expression in R

Question

Say I have a vector of strings like the following:

vector<-c("hi, how are you doing?", 
           "what time is it?", 
           "the sky is blue", 
           "hi, how are you doing today? You seem tired.", 
           "walk the dog", 
           "the grass is green", 
           "the sky is blue during the day")

vector
[1] "hi, how are you doing?"                      
[2] "what time is it?"                            
[3] "the sky is blue"                             
[4] "hi, how are you doing today? You seem tired."
[5] "walk the dog"                                
[6] "the grass is green"                          
[7] "the sky is blue during the day"

How can I identify all entries whose first 4 words match and then keep only the longest string from each set of matches? I am looking for a result like the following vector:

vector                    
[1] "what time is it?"                                                        
[2] "hi, how are you doing today? You seem tired."
[3] "walk the dog"                                
[4] "the grass is green"                          
[5] "the sky is blue during the day"

Ideally I'd like a solution using stringr so I can feed it into a pipe.

UPDATE: Robustness check with different values:

The solution from @Wimpel is brilliant but, as @Wimpel pointed out, it doesn't quite work in all scenarios. See for example:

vector<-c("hi, how are you doing?", 
          "what time is it?", 
          "the sky is blue", 
          "hi, how are you doing today? You seem tired.", 
          "walk the dog", 
          "the grass is green", 
          "the sky is blue during the day", 
          "12/7/2018", 
          "8/12/2018", 
          "9/9/2016 ")

df <- data.frame( text = vector, stringsAsFactors = FALSE ) 
df$group_id <- df %>% group_indices( stringr::word( text, start = 1, end = 4) ) 
df %>%
    mutate( length = str_count( text, " ") + 1,
            row_id = row_number() ) %>%
    group_by( group_id ) %>%
    arrange( -length ) %>%
    slice(1) %>%
    ungroup() %>%
    arrange( row_id ) %>%
    select( text )

1 what time is it?                            
2 hi, how are you doing today? You seem tired.
3 walk the dog                                
4 the grass is green                          
5 the sky is blue during the day

In the above example, the date strings are also dropped, even though none of them shares its first 4 words with another entry.
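A likely reason, shown as a minimal sketch (this snippet is an illustration, not part of the original post): `stringr::word()` returns `NA` when a string has fewer words than the requested `end`, so every short string gets the same `NA` grouping key, and `slice(1)` then keeps only one row from that shared group.

```r
library(stringr)

x <- c("the sky is blue during the day", "walk the dog", "12/7/2018", "8/12/2018")

# Grouping key as used in the pipeline above: the first four words
key <- word(x, start = 1, end = 4)

# Strings with fewer than four words have no fourth word, so their key is NA
short <- is.na(key)
print(data.frame(text = x, key = key, short = short))
```

All rows flagged as `short` would fall into one group, which would explain why only "walk the dog" survives among them.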

Wimpel · Accepted Answer · 2019-01-28 08:10:58Z

Using the updated sample data:

vec <- c("hi, how are you doing?", 
          "what time is it?", 
          "the sky is blue", 
          "hi, how are you doing today? You seem tired.", 
          "walk the dog", 
          "the grass is green", 
          "the sky is blue during the day", 
          "12/7/2018", 
          "8/12/2018", 
          "9/9/2016")

code

library( tidyverse )

df <- data.frame( text = vec, stringsAsFactors = FALSE ) 
#create group indices from the first 4 words of each string
df$group_id <- df %>% group_indices( stringr::word( text, start = 1, end = 4) ) 

df %>%
  #create some helping variables
  mutate( length = str_count( text, " ") + 1,
          row_id = row_number() ) %>%
  #now group on id
  group_by( group_id ) %>%
  #arrange by group on length (descending)
  arrange( -length ) %>%
  #keep only the first row (of every group ), also keep all strings shorter than 4 words
  filter( (row_number() == 1L & length >= 4) | length < 4 ) %>%
  ungroup() %>%
  #set back to the original order
  arrange( row_id ) %>%
  select( text )

output

# # A tibble: 8 x 1
# text                                        
#   <chr>                                       
# 1 what time is it?                            
# 2 hi, how are you doing today? You seem tired.
# 3 walk the dog                                
# 4 the grass is green                          
# 5 the sky is blue during the day              
# 6 12/7/2018                                   
# 7 8/12/2018  
# 8 9/9/2016
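Since the question asks for something that can be fed into a pipe, a vector-in, vector-out variant may also be useful. The following is a sketch under stated assumptions, not part of the original answer: `keep_longest` is a hypothetical helper name, words are assumed to be separated by single spaces (as in the code above), and ties within a group are resolved by keeping the first string.

```r
library(stringr)

# Hypothetical helper: for strings sharing the same first `n` words, keep only
# the one with the most words. Strings with fewer than `n` words are always kept.
keep_longest <- function(x, n = 4) {
  if (length(x) == 0) return(x)
  n_words <- str_count(x, " ") + 1        # same word count as in the answer
  key     <- word(x, start = 1, end = n)  # NA when a string has fewer than n words
  short   <- is.na(key)

  # One group id per distinct key; every short string gets its own unique id
  grp <- match(key, unique(key))
  grp[short] <- length(x) + which(short)

  # Index of the longest string in each group (first one wins on ties)
  idx <- tapply(seq_along(x), grp, function(i) i[which.max(n_words[i])])

  # Restore the original order
  x[sort(as.vector(idx))]
}

vec <- c("hi, how are you doing?", "what time is it?", "the sky is blue",
         "hi, how are you doing today? You seem tired.", "walk the dog",
         "the grass is green", "the sky is blue during the day",
         "12/7/2018", "8/12/2018", "9/9/2016")

result <- vec %>% keep_longest()
print(result)
```

Giving each short string its own group id avoids relying on a `length < 4` filter, at the cost of a small helper function instead of a pure dplyr chain.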

Just noticed it will fail when two (or more) strings are <4 words (it keeps only one of them). Will adjust the answer shortly...
Updated my question accordingly. Thanks so much for your help!
