
I'm trying to extract strings that follow the same pattern from this text:

The Tragedy of Romeo and Juliet by William Shakespeare

library(readr)

txt <- read_file('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')

Text example:

Scene I.\r\nVerona. A public place.\r\n\r\nEnter Sampson and Gregory (with swords and bucklers) of the house\r\nof Capulet.
...
Scene II.\r\nA Street.\r\n\r\nEnter Capulet, County Paris, and [Servant] -the Clown.\r\n\r\n\r\n Cap.

I want to extract

Verona. A public place.
A Street.

I tried with

library(stringr)

str_extract(txt, "Scene\\s[IV]+\\.\\s\\s\\b[A-Z]+\\b")

It didn't work.

Thank you in advance for your advice.

1 Answer

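# Join each "Scene ..." line with the line that follows it, then extract from "Scene" to the end of that line
str_extract_all(gsub("(Scene.*?)\r\n", "\\1 ", txt), "Scene.*")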
[[1]]
 [1] "Scene I. Verona. A public place."                                    
 [2] "Scene II. A Street."                                                 
 [3] "Scene III. Capulet's house."                                         
 [4] "Scene IV. A street."                                                 
 [5] "Scene V. Capulet's house."                                           
 [6] "Scene I. A lane by the wall of Capulet's orchard."                   
 [7] "Scene II. Capulet's orchard."                                        
 [8] "Scene III. Friar Laurence's cell."                                   
 [9] "Scene IV. A street."                                                 
[10] "Scene V. Capulet's orchard."                                         
[11] "Scene VI. Friar Laurence's cell."                                    
[12] "Scene I. A public place."                                            
[13] "Scene II. Capulet's orchard."                                        
[14] "Scene III. Friar Laurence's cell."                                   
[15] "Scene IV. Capulet's house"                                           
[16] "Scene V. Capulet's orchard."                                         
[17] "Scene I. Friar Laurence's cell."                                     
[18] "Scene II. Capulet's house."                                          
[19] "Scene III. Juliet's chamber."                                        
[20] "Scene IV. Capulet's house."                                          
[21] "Scene V. Juliet's chamber."                                          
[22] "Scene I. Mantua. A street."                                          
[23] "Scene II. Verona. Friar Laurence's cell."                            
[24] "Scene III. Verona. A churchyard; in it the monument of the Capulets."
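
If only the location is wanted, as in the question, the "Scene N." prefix can be dropped afterwards. The following is a minimal sketch in base R, not a definitive solution: it runs on a small hand-made sample rather than the full play, assumes every scene heading is followed directly by a `\r\n` line break, and the names `pattern`, `headings`, and `locations` are only illustrative.

```r
# Small sample mimicking the text in the question (not the full play).
txt <- paste0(
  "Scene I.\r\nVerona. A public place.\r\n\r\nEnter Sampson and Gregory\r\n",
  "Scene II.\r\nA Street.\r\n\r\nEnter Capulet, County Paris\r\n"
)

# Match a scene heading, its line break, and the rest of the following line.
# [^\r\n]* is used instead of . so the match cannot run past the line end.
pattern <- "Scene [IVX]+\\.\r\n[^\r\n]*"
headings <- regmatches(txt, gregexpr(pattern, txt))[[1]]

# Remove the "Scene N." prefix and the line break, keeping only the location.
locations <- sub("^Scene [IVX]+\\.\r\n", "", headings)
print(locations)
```

On the real file, the pattern may need adjusting if headings carry extra whitespace or a different line-break style.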

2 Comments

Thank you for your answer, Onyambu! What about txt <- readLines('http://www.gutenberg.org/cache/epub/1112/pg1112.txt', encoding='UTF-8')? In that case txt is a vector of lines, so your code doesn't work.
With this one, you have read the file as a vector of lines. You therefore have to pick only the lines containing "Scene" and the line that follows each of them, e.g. t(matrix(txt[c(rbind(grep("Scene",txt),grep("Scene",txt)+1))],2)) should work. It returns a two-column matrix, but you can of course paste the columns together.
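
The line-based approach from the comment above can be sketched as follows. This is only an illustration on a hand-made vector shaped like `readLines()` output; the anchored pattern `^Scene [IVX]+\\.$` and the names `lines`, `idx`, `locations`, and `combined` are assumptions for the example, and the real file may need a looser pattern.

```r
# Sample lines as readLines() would return them (line breaks already removed).
lines <- c("Scene I.", "Verona. A public place.", "", "Enter Sampson and Gregory",
           "Scene II.", "A Street.", "", "Enter Capulet, County Paris")

# Indices of the scene headings; anchoring avoids matching "Scene" mid-line.
idx <- grep("^Scene [IVX]+\\.$", lines)

# The location is the line right after each heading.
locations <- lines[idx + 1]

# Heading and location pasted together into one string per scene.
combined <- paste(lines[idx], locations)
print(combined)
```

Indexing with `idx + 1` gives the locations directly as a character vector, so no matrix reshaping is needed when only that line is wanted.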
