R stringr with multiple regexes

Question

I'm trying to filter a character vector created with pdf_ocr_text using multiple regular expressions. Specifically, I want to select elements that start with either (1) a digit or (2) two spaces followed by a digit. I also want to keep the leading spaces in the strings. Here's a reproducible example.

df <- c("  065074                         10/1/91   10/1/96 8 10 5  ", 
"060227                          10/1/93   10/1/93 9 5 5  ", 
"  060178                  10/1/95   10/1/98 8 10 5  ", "060294                      10/1/91   10/1/98 8 10 5  ", 
"060212                 10/1/91   10/1/93 8 10 5   ", "  060228                   10/1/92   10/1/92 9 5 5  ", 
"  060257                        10/1/92   10/1/92 9 5 5   ", 
"060348                     10/1/91   10/1/93 8 10 5  ", "  080379                    10/1/91   10/1/96 6 20 5   ", 
"  060239                 10/1/91   10/1/98 8 10 5  ", "  060012                      10/1/92   10/1/92 9 5 5  ", 
"  060360                    10/1/96   10/1/96 9 5 5  ", "   060035                     10/1/95   10/1/95 9 5 5  ", 
"  060243                     10/1/92   10/1/93 8 10 5  ", "  060262                   10/1/92 ; 10/1/94 7 15 5  ", 
"            =          =          ", "                                    40097       2      4 40097 _"
)

I've tried the following but it doesn't seem to work. However, if I use only one of the two conditions, it works.

df[df %>% str_detect(., "^\\s{2}\\d | ^\\d")] # This fails
df[df %>% str_detect(., "^\\d")] # With only one condition, it works
[1] "060227                          10/1/93   10/1/93 9 5 5  " "060294                      10/1/91   10/1/98 8 10 5  "   
[3] "060212                 10/1/91   10/1/93 8 10 5   "        "060348                     10/1/91   10/1/93 8 10 5  "

How can I use two regex expressions as a pattern?

jared_mamrot · Accepted Answer · 2020-06-15 04:13:49Z

Using your existing approach, drop the spaces surrounding the pipe char:

df[df %>% str_detect("^\\s{2}\\d|^\\d")]
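The reason this helps: outside of free-spacing mode, a space inside a regex is a literal character. With the spaces kept, the first branch requires a space right after the first digit, and the second branch requires a space before the start-of-string anchor, which cannot match. A minimal sketch using base R's grepl (chosen here only to avoid a package dependency; str_detect should treat these patterns the same way) on a shortened, hypothetical vector x:

```r
# Shortened, hypothetical sample in the same shape as the question's data
x <- c("  065074   10/1/91", "060227   10/1/93",
       "   060035   10/1/95", "    =    =  ")

# Spaces around | are literal: the first branch needs a space right after
# the first digit, and the second needs a space before the start anchor
with_spaces <- grepl("^\\s{2}\\d | ^\\d", x)

# Without the spaces, each branch is anchored at the start as intended
without_spaces <- grepl("^\\s{2}\\d|^\\d", x)

print(with_spaces)
print(without_spaces)
print(x[without_spaces])
```

The element with three leading spaces is still dropped by the corrected pattern, since neither branch allows a third whitespace character before the digit.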

1 Comment

qnp1521 Over a year ago

Thank you! It's such a simple solution!

Ronak Shah · Accepted Answer · 2020-06-15 04:09:58Z

Using grep:

grep('^\\s{2}\\d|^\\d', df, value = TRUE)

# [1] "  065074                         10/1/91   10/1/96 8 10 5  "
# [2] "060227                          10/1/93   10/1/93 9 5 5  "  
# [3] "  060178                  10/1/95   10/1/98 8 10 5  "       
# [4] "060294                      10/1/91   10/1/98 8 10 5  "     
# [5] "060212                 10/1/91   10/1/93 8 10 5   "         
# [6] "  060228                   10/1/92   10/1/92 9 5 5  "       
# [7] "  060257                        10/1/92   10/1/92 9 5 5   " 
# [8] "060348                     10/1/91   10/1/93 8 10 5  "      
# [9] "  080379                    10/1/91   10/1/96 6 20 5   "    
#[10] "  060239                 10/1/91   10/1/98 8 10 5  "        
#[11] "  060012                      10/1/92   10/1/92 9 5 5  "    
#[12] "  060360                    10/1/96   10/1/96 9 5 5  "      
#[13] "  060243                     10/1/92   10/1/93 8 10 5  "    
#[14] "  060262                   10/1/92 ; 10/1/94 7 15 5  "

Or, if you prefer stringr, you can use str_subset with the same pattern:

stringr::str_subset(df, '^\\s{2}\\d|^\\d')

You can also combine the two patterns into one by making the two-character whitespace prefix an optional group:

grep('^(\\s{2})?\\d', df, value = TRUE)
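As a quick sanity check, the alternation and the optional-group pattern can be compared directly. The sketch below uses a shortened, hypothetical vector x rather than the full data, and only calls str_subset if stringr happens to be installed:

```r
# Shortened, hypothetical sample in the same shape as the question's data
x <- c("  065074   10/1/91", "060227   10/1/93",
       "   060035   10/1/95", "    =    =  ")

alternation <- grep("^\\s{2}\\d|^\\d", x, value = TRUE)
optional    <- grep("^(\\s{2})?\\d", x, value = TRUE)

# Both patterns keep the elements with zero or two leading whitespace
# characters before the first digit, and drop the three-space element
print(identical(alternation, optional))

# The stringr comparison is skipped when the package is not available
if (requireNamespace("stringr", quietly = TRUE)) {
  print(identical(stringr::str_subset(x, "^(\\s{2})?\\d"), optional))
}
```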

Thanks for a variety of options. Good to learn that I need to drop the spaces.

Tim Biegeleisen · Accepted Answer · 2020-06-15 04:16:43Z

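Try using grep here with the pattern ^(\\s{2})?\\d. As written, the call returns the indices of the matching elements; add value = TRUE to get the strings themselves:

grep('^(\\s{2})?\\d', df)

Here is an explanation of the regex pattern:

^         from the start of the string
(\s{2})?  match 2 whitespace characters, zero or one times (read: match two spaces, or no spaces)
\d        match a single digit

The parentheses matter: without them, \s{2}? is a lazy quantifier that still requires exactly two whitespace characters, so strings that start directly with a digit would not match.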
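When a ? directly follows a counted quantifier such as {2}, regex engines generally read it as a laziness modifier rather than as "optional", so a pattern of this kind is worth checking on small inputs. A rough sketch with hypothetical inputs; perl = TRUE is a choice made for this example so that PCRE rules apply:

```r
# Hypothetical inputs: no leading whitespace, then exactly two spaces
x <- c("060227", "  065074")

# {2}? is a lazy "exactly two", so two whitespace characters are still required
lazy <- grepl("^\\s{2}?\\d", x, perl = TRUE)

# Wrapping the quantified part in a group makes the whole prefix optional
optional <- grepl("^(\\s{2})?\\d", x, perl = TRUE)

print(lazy)
print(optional)
```

Only the grouped form accepts both the bare-digit element and the two-space element.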

1 Comment

Tim Biegeleisen Over a year ago

@qnp1521 Check my answer which now has a helpful edit.
