I have a DataFrame with a column like this:

Title
"Over the Hill,to the Poorhouse"
"Wilson"                                    
"Darling Lili"                            
"The Ten Commandments"                      
"12 Angry Men"                              
"Twelve Monkeys"                            
"1776"                                      
"1941"                                                                                
"Chacun sa nuit"                                                            
"2001: A Space Odyssey"                                            
"20,000 Leagues Under the Sea"                             
"20,000 Leagues Under the Sea"                             
"24,7: Twenty Four Seven"                                       
"Twin Falls Idaho"                                                        
"Three Kingdoms: Resurrection of the Dragon"
.......
.......

I would like to transform this column into an array column like this:

[Over, the, Hill, to, the, Poorhouse] 
[Wilson] 
[Darling, Lili]                                   
[The, Ten, Commandments]  
[12, Angry, Men]
[Twelve, Monkeys] 
[1776]   
[1941] 
[Chacun, sa, nuit]   
[2001, , A, Space, Odyssey] 
[20, 000, Leagues, Under, the, Sea]
[20, 000, Leagues, Under, the, Sea]
[24, 7, , Twenty, Four, Seven]
[Twin, Falls, Idaho]
[Three, Kingdoms, , Resurrection, of, the, Dragon]

so that I end up with these two columns:

Title                     Title_Words
Over the Hill,to the Poorhouse            [Over, the, Hill, to, the, Poorhouse]
Wilson                                    [Wilson]                                          
Darling Lili                              [Darling, Lili]                                   
The Ten Commandments                      [The, Ten, Commandments]                          
12 Angry Men                              [12, Angry, Men]                                  
Twelve Monkeys                            [Twelve, Monkeys]                                 
1776                                      [1776]                                            
1941                                      [1941]                                            
Chacun sa nuit                            [Chacun, sa, nuit]                                
2001: A Space Odyssey                     [2001, , A, Space, Odyssey]                       
20,000 Leagues Under the Sea              [20, 000, Leagues, Under, the, Sea]               
20,000 Leagues Under the Sea              [20, 000, Leagues, Under, the, Sea]               
24,7: Twenty Four Seven                   [24, 7, , Twenty, Four, Seven]
Twin Falls Idaho                          [Twin, Falls, Idaho]                              
Three Kingdoms: Resurrection of the Dragon [Three, Kingdoms, , Resurrection, of, the, Dragon]

The problem is that a string can contain several different separators: spaces, commas, and colons.

How could it be done?

1 Answer

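Try this. `split` takes a regular expression, so one pattern can cover all the separators: `\\s+` matches a run of whitespace and `[,:]` matches a single comma or colon.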

df.withColumn("Title_Words", split(col("Title"), "\\s+|[,:]"))
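To see what that pattern does to the sample titles without starting Spark, here is a minimal sketch in plain Python. It assumes Python's `re` module behaves like Spark's Java regex engine for this simple pattern, and `split_title` is a hypothetical helper name used only for illustration.

```python
import re

# Same pattern as in the answer. A raw string is used here, so the
# backslash does not need to be doubled.
PATTERN = r"\s+|[,:]"


def split_title(title):
    """Split on a run of whitespace, or on a single comma or colon."""
    return re.split(PATTERN, title)


# A few titles from the question that contain more than one kind of separator.
titles = [
    "Over the Hill,to the Poorhouse",
    "2001: A Space Odyssey",
    "20,000 Leagues Under the Sea",
    "24,7: Twenty Four Seven",
]

for title in titles:
    print(split_title(title))
```

The empty string in results such as `[2001, , A, Space, Odyssey]` appears because the colon and the space after it are matched as two separate separators, which is what the expected output in the question shows. A pattern like `[\s,:]+` would treat them as one separator and leave no empty entry.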