str_extract specific patterns

Question

I'm trying to extract strings that follow the same pattern from this text:

The Tragedy of Romeo and Juliet by William Shakespeare

library(readr)

txt <- read_file('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')

Text example:

Scene I.\r\nVerona. A public place.\r\n\r\nEnter Sampson and Gregory (with swords and bucklers) of the house\r\nof Capulet.
...
Scene II.\r\nA Street.\r\n\r\nEnter Capulet, County Paris, and [Servant] -the Clown.\r\n\r\n\r\n Cap.

I want to extract:

Verona. A public place.
A Street.

I tried with

library(stringr)

str_extract(txt, "Scene\\s[IV]+\\.\\s\\s\\b[A-Z]+\\b")

It didn't work.

Thank you in advance for your advice.

Onyambu · Accepted Answer · 2018-06-10 17:33:07Z

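# Join each "Scene ..." line to the line that follows it (the first \r\n after
# "Scene" becomes a space), then extract from "Scene" to the end of that line.
str_extract_all(gsub("(Scene.*?)\r\n","\\1 ",txt),"Scene.*")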
[[1]]
 [1] "Scene I. Verona. A public place."                                    
 [2] "Scene II. A Street."                                                 
 [3] "Scene III. Capulet's house."                                         
 [4] "Scene IV. A street."                                                 
 [5] "Scene V. Capulet's house."                                           
 [6] "Scene I. A lane by the wall of Capulet's orchard."                   
 [7] "Scene II. Capulet's orchard."                                        
 [8] "Scene III. Friar Laurence's cell."                                   
 [9] "Scene IV. A street."                                                 
[10] "Scene V. Capulet's orchard."                                         
[11] "Scene VI. Friar Laurence's cell."                                    
[12] "Scene I. A public place."                                            
[13] "Scene II. Capulet's orchard."                                        
[14] "Scene III. Friar Laurence's cell."                                   
[15] "Scene IV. Capulet's house"                                           
[16] "Scene V. Capulet's orchard."                                         
[17] "Scene I. Friar Laurence's cell."                                     
[18] "Scene II. Capulet's house."                                          
[19] "Scene III. Juliet's chamber."                                        
[20] "Scene IV. Capulet's house."                                          
[21] "Scene V. Juliet's chamber."                                          
[22] "Scene I. Mantua. A street."                                          
[23] "Scene II. Verona. Friar Laurence's cell."                            
[24] "Scene III. Verona. A churchyard; in it the monument of the Capulets."
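
If only the location text is wanted, without the "Scene I." prefix shown in the output above, one possible variant is to capture the line after each heading with str_match_all. This is a sketch rather than part of the original answer: the short txt string is a stand-in built from the two excerpts in the question, and the pattern assumes every heading is "Scene", a Roman numeral, a period, and a \r\n line break.

```r
library(stringr)

# Stand-in for the downloaded text, built from the question's two excerpts
# (assumption: the real file uses the same \r\n layout around scene headings).
txt <- paste0(
  "Scene I.\r\nVerona. A public place.\r\n\r\nEnter Sampson and Gregory\r\n",
  "Scene II.\r\nA Street.\r\n\r\nEnter Capulet, County Paris\r\n"
)

# Group 1 captures the line that follows each "Scene <numeral>." heading.
m <- str_match_all(txt, "Scene [IVX]+\\.\\r\\n([^\\r\\n]+)")[[1]]

# Column 1 is the full match, column 2 is the captured location line.
locations <- m[, 2]
locations
```

With the real text, txt would come from read_file() as in the question. Headings that deviate from the assumed layout would be skipped.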

Thank you for your answer, Onyambu! What about when the text is read with txt <- readLines('http://www.gutenberg.org/cache/epub/1112/pg1112.txt', encoding='UTF-8')? In that case txt is a vector of lines, so your code doesn't work.
With this one, you have read the file as lines. You therefore have to pick only the lines that contain "Scene" and the line that follows each of them. E.g. t(matrix(txt[c(rbind(grep("Scene",txt),grep("Scene",txt)+1))],2)) should work, but it will give two columns. You can of course paste the columns together.
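
The line-based approach from the comment above could be sketched like this in base R. The small txt vector is a made-up stand-in for what readLines() returns (assumption: each scene heading sits on its own line, directly followed by the location line), and the anchored pattern is my own narrowing of the plain "Scene" search.

```r
# Stand-in for the readLines() result: one element per line, no line endings.
txt <- c(
  "Scene I.", "Verona. A public place.", "", "Enter Sampson and Gregory",
  "Scene II.", "A Street.", "", "Enter Capulet, County Paris"
)

# Indices of lines that look like scene headings.
idx <- grep("^Scene [IVX]+\\.", txt)

# The location is the line right after each heading.
locations <- txt[idx + 1]

# Paste heading and location together, as the comment suggests.
headings <- paste(txt[idx], txt[idx + 1])
headings
```

If a heading happened to be the very last line, idx + 1 would run past the end of the vector and yield NA, so that case would need a check.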
