
I need to extract part of a string from an address variable. My data looks like this:

[45] "Matara Road, Habaraduwa | Talpe, Unawatuna, Galle GL 80630, Sri Lanka "                                   
[46] "Jungle Beach Road, Buonavista | Rumassala, Unawatuna, Galle 80600, Sri Lanka "                            
[47] "10 Church Street | inside the Fort, Galle, Sri Lanka "                                                    
[48] "78 Mile Post Matara Road Mihiripenna, Unawatuna, Galle 80615, Sri Lanka "                                 
[49] "No: 288 Galle Road | Dadella, Galle 80000, Sri Lanka "                                                    
[50] "Matara Road, Koggala, Galle, Sri Lanka "  

I want to extract the city from each string, which in this case should be "Galle". The only pattern I can think of is that it appears just before "Sri Lanka", i.e. the city is between a "," and ", Sri Lanka". Here is the code that I used:

gsub("\\.s*|(, Sri Lanka).*", "", a)

However, using this code I get the following results:

[45] "Matara Road, Habaraduwa | Talpe, Unawatuna, Galle GL 80630"                                   
[46] "Jungle Beach Road, Buonavista | Rumassala, Unawatuna, Galle 80600"                            
[47] "10 Church Street | inside the Fort, Galle"                                                    
[48] "78 Mile Post Matara Road Mihiripenna, Unawatuna, Galle 80615"                                 
[49] "No: 288 Galle Road | Dadella, Galle 80000"                                                    
[50] "Matara Road, Koggala, Galle" 

Is there any way to keep only the city?


3 Answers

n <- c(
     "Matara Road, Habaraduwa | Talpe, Unawatuna, Galle GL 80630, Sri Lanka "       ,
     "Jungle Beach Road, Buonavista | Rumassala, Unawatuna, Galle 80600, Sri Lanka ",
     "10 Church Street | inside the Fort, Galle, Sri Lanka "                        ,
     "78 Mile Post Matara Road Mihiripenna, Unawatuna, Galle 80615, Sri Lanka "     ,
     "No: 288 Galle Road | Dadella, Galle 80000, Sri Lanka "                        ,
     "Matara Road, Koggala, Galle, Sri Lanka " )

First, extract the city name together with the optional state abbreviation and the optional zip code:

m <- sub('.*, (.*), Sri Lanka *$', '\\1', n)

m is now:

[1] "Galle GL 80630" "Galle 80600" "Galle" "Galle 80615" "Galle 80000" "Galle"

Next, remove the zip codes:

l <- sub(' \\d{5} *$', '', m )

l is:

[1] "Galle GL" "Galle" "Galle" "Galle" "Galle" "Galle"

Finally, remove the state abbreviation:

sub('( \\w{2})$', '', l)

[1] "Galle" "Galle" "Galle" "Galle" "Galle" "Galle"
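The three steps above could also be folded into a single substitution. The following is only a sketch: it assumes the state abbreviation is always two uppercase letters and the zip code always five digits, and the names `addresses`, `pattern`, and `cities` are made up for this illustration.

```r
addresses <- c(
  "Matara Road, Habaraduwa | Talpe, Unawatuna, Galle GL 80630, Sri Lanka ",
  "Jungle Beach Road, Buonavista | Rumassala, Unawatuna, Galle 80600, Sri Lanka ",
  "10 Church Street | inside the Fort, Galle, Sri Lanka ",
  "78 Mile Post Matara Road Mihiripenna, Unawatuna, Galle 80615, Sri Lanka ",
  "No: 288 Galle Road | Dadella, Galle 80000, Sri Lanka ",
  "Matara Road, Koggala, Galle, Sri Lanka "
)

# Group 1 captures the city lazily; the optional groups after it swallow
# an assumed two-letter state code and an assumed five-digit zip code.
pattern <- ".*, ([^,]*?)( [A-Z]{2})?( \\d{5})? *, Sri Lanka *$"

# perl = TRUE selects the PCRE engine for the lazy quantifier and \d
cities <- sub(pattern, "\\1", addresses, perl = TRUE)
cities
```

Keeping the steps separate, as in the answer, is easier to debug; the one-liner is mainly useful once the address format is known to be stable.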


I would use strsplit instead:

line  <- "Matara Road, Habaraduwa | Talpe, Unawatuna, Galle GL 80630, Sri Lanka "
parts <- strsplit(line, ",")[[1]]    # split the address at each comma
city  <- parts[length(parts) - 1]    # second-to-last piece: " Galle GL 80630"

Try it!

To get rid of the numbers, just take city and remove them with gsub. Hope it helps!
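As a rough sketch of that clean-up step, the snippet below strips the digits and surrounding whitespace from the second-to-last comma-separated piece. Removing a trailing two-letter state code is an extra assumption on top of the answer, and the variable names are illustrative only.

```r
line  <- "Matara Road, Habaraduwa | Talpe, Unawatuna, Galle GL 80630, Sri Lanka "
parts <- strsplit(line, ",", fixed = TRUE)[[1]]
city  <- parts[length(parts) - 1]          # " Galle GL 80630"

city <- gsub("[0-9]+", "", city)           # drop the zip-code digits
city <- trimws(city)                       # strip leading/trailing spaces
city <- sub(" [A-Z]{2}$", "", city)        # assumed: optional 2-letter state code
city
```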


You can write a function that splits the string at commas and takes the second-to-last element, which is usually the city name.

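myfunction <- function(x) {
    parts <- strsplit(x, ",")[[1]]           # split the address at commas
    city  <- parts[length(parts) - 1]        # second-to-last piece holds the city
    city  <- gsub("[[:digit:]]", "", city)   # drop the zip-code digits
    trimws(city)                             # strip surrounding spaces and return
}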

This function does the job. Additionally, it removes any digits. Note that a state abbreviation such as "GL" is left in place.

Now use it with lapply to get the desired output:

lapply(x,myfunction)
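Since lapply returns a list, a plain character vector may be more convenient. The sketch below uses vapply for that; `extract_city` and `addresses` are hypothetical names for this example, and the final step that drops a two-letter state code is an assumption about the data rather than part of the answer.

```r
addresses <- c(
  "Matara Road, Habaraduwa | Talpe, Unawatuna, Galle GL 80630, Sri Lanka ",
  "10 Church Street | inside the Fort, Galle, Sri Lanka ",
  "No: 288 Galle Road | Dadella, Galle 80000, Sri Lanka "
)

# Hypothetical helper: take the second-to-last comma-separated piece,
# then remove digits, surrounding spaces, and an assumed state code.
extract_city <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  city  <- parts[length(parts) - 1]
  city  <- trimws(gsub("[[:digit:]]", "", city))
  sub(" [A-Z]{2}$", "", city)
}

# vapply checks that every result is a single string and returns a vector
cities <- vapply(addresses, extract_city, character(1), USE.NAMES = FALSE)
cities
```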
