
Say I have a vector of strings like the following:

vector<-c("hi, how are you doing?", 
           "what time is it?", 
           "the sky is blue", 
           "hi, how are you doing today? You seem tired.", 
           "walk the dog", 
           "the grass is green", 
           "the sky is blue during the day")

vector
[1] "hi, how are you doing?"                      
[2] "what time is it?"                            
[3] "the sky is blue"                             
[4] "hi, how are you doing today? You seem tired."
[5] "walk the dog"                                
[6] "the grass is green"                          
[7] "the sky is blue during the day" 

How can I identify all entries whose first 4 words match and then keep only the longest matching string? I would like the result to look like the following vector:

vector                    
[1] "what time is it?"                                                        
[2] "hi, how are you doing today? You seem tired."
[3] "walk the dog"                                
[4] "the grass is green"                          
[5] "the sky is blue during the day"                          

Ideally I'd like a solution using stringr so I can feed it into a pipe.

UPDATE: Robustness check with different values:

The solution from @Wimpel is brilliant but, as @Wimpel pointed out, it doesn't quite work in all scenarios. For example:

vector<-c("hi, how are you doing?", 
          "what time is it?", 
          "the sky is blue", 
          "hi, how are you doing today? You seem tired.", 
          "walk the dog", 
          "the grass is green", 
          "the sky is blue during the day", 
          "12/7/2018", 
          "8/12/2018", 
          "9/9/2016 ")

df <- data.frame( text = vector, stringsAsFactors = FALSE ) 
df$group_id <- df %>% group_indices( stringr::word( text, start = 1, end = 4) ) 
df %>%
    mutate( length = str_count( text, " ") + 1,
            row_id = row_number() ) %>%
    group_by( group_id ) %>%
    arrange( -length ) %>%
    slice(1) %>%
    ungroup() %>%
    arrange( row_id ) %>%
    select( text )

1 what time is it?                            
2 hi, how are you doing today? You seem tired.
3 walk the dog                                
4 the grass is green                          
5 the sky is blue during the day  

In the example above, the dates are also dropped even though they do not match one another.
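A quick check suggests why this happens: `stringr::word()` returns `NA` when a string has fewer words than the requested `end`, so every string shorter than 4 words gets the same (missing) grouping key. This is a minimal sketch; the names `samples` and `first_four` are only for illustration.

```r
library(stringr)

# Three strings from the example: four words, three words, one word.
samples <- c("the sky is blue", "walk the dog", "12/7/2018")

# word() gives NA when a string has fewer words than `end` asks for,
# so all short strings share one NA key and fall into a single group.
first_four <- word(samples, start = 1, end = 4)

is.na(first_four)
```

Only the first string should get a real key here; the other two should come back as `NA`, which is why `slice(1)` keeps just one of the short strings.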

1 Answer

Using the updated sample data:

vec <- c("hi, how are you doing?", 
          "what time is it?", 
          "the sky is blue", 
          "hi, how are you doing today? You seem tired.", 
          "walk the dog", 
          "the grass is green", 
          "the sky is blue during the day", 
          "12/7/2018", 
          "8/12/2018", 
          "9/9/2016")

code

library( tidyverse )

df <- data.frame( text = vec, stringsAsFactors = FALSE ) 
#create group_indices
df$group_id <- df %>% group_indices( stringr::word( text, start = 1, end = 4) ) 

df %>%
  #create some helping variables
  mutate( length = str_count( text, " ") + 1,
          row_id = row_number() ) %>%
  #now group on id
  group_by( group_id ) %>%
  #arrange by group on length (descending)
  arrange( -length ) %>%
  #keep only the first row (of every group ), also keep all strings shorter than 4 words
  filter( (row_number() == 1L & length >= 4) | length < 4 ) %>%
  ungroup() %>%
  #set back to the original order
  arrange( row_id ) %>%
  select( text )

output

# # A tibble: 8 x 1
# text                                        
#   <chr>                                       
# 1 what time is it?                            
# 2 hi, how are you doing today? You seem tired.
# 3 walk the dog                                
# 4 the grass is green                          
# 5 the sky is blue during the day              
# 6 12/7/2018                                   
# 7 8/12/2018  
# 8 9/9/2016  
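
Since the question asks for something that can be fed into a pipe, the same idea can also be sketched as a plain function on a character vector. This is not part of the answer above: `keep_longest_by_prefix` and its `n_words` argument are hypothetical names, and the word count uses the same space-counting approximation as the code above.

```r
library(stringr)

# Hypothetical helper: keep the longest string per shared n-word prefix,
# and keep every string that has fewer than n_words words.
keep_longest_by_prefix <- function(x, n_words = 4) {
  key     <- word(x, start = 1, end = n_words)  # NA for short strings
  n_total <- str_count(x, " ") + 1              # approximate word count
  keep    <- is.na(key)                         # short strings always stay
  idx     <- which(!keep)
  # For each prefix, pick the index of the string with the most words.
  longest <- vapply(split(idx, key[idx]),
                    function(i) i[which.max(n_total[i])],
                    integer(1))
  keep[longest] <- TRUE
  x[keep]                                       # original order preserved
}

vec <- c("hi, how are you doing?", "what time is it?", "the sky is blue",
         "hi, how are you doing today? You seem tired.", "walk the dog",
         "the grass is green", "the sky is blue during the day",
         "12/7/2018", "8/12/2018", "9/9/2016")

result <- keep_longest_by_prefix(vec)
result
```

Because the function takes the vector as its first argument, it should also work at the end of a pipe, for example `vec %>% keep_longest_by_prefix()`.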

4 Comments

Just noticed it will fail when two (or more) strings are <4 words (it keeps only one of them). Will adjust the answer shortly...
Updated my question accordingly. Thanks so much for your help!
Brilliant. That did it perfectly.
answer is updated to handle strings <4 words correctly
