2

I have a pandas DataFrame that looks like this:

date     | location              | occurance
------------------------------------------------
somedate | united_kingdom_london | 5
somedate | united_state_newyork  | 5

I want to transform it into:

date     | country        | city    | occurance
------------------------------------------------
somedate | united kingdom | london  | 5
somedate | united state   | newyork | 5

I am new to Python, and after some research I have written the following code, but it does not manage to extract the country and city:

df.location= df.location.replace({'-': ' '}, regex=True)
df.location= df.location.replace({'_': ' '}, regex=True)

temp_location = df['location'].str.split(' ').tolist() 

location_data = pd.DataFrame(temp_location, columns=['country', 'city'])

I appreciate your response.

1
  • Thanks guys for your responses. With the given context, all of your solutions work fine, but the actual dataset I am working with is quite complicated, so I have not been able to work it out yet. In my snippet above, after replacing '-' and '_', I am doing: for item in temp: if str(item) == 'United': frames = [temp[0], temp[2].str.partition(" ", expand=True)]; result = pd.concat(frames); print result. But this does not seem to work. Commented Aug 9, 2016 at 14:29

5 Answers

3

Starting with this:

df = pd.DataFrame({'Date': ['somedate', 'somedate'],
                   'location': ['united_kingdom_london', 'united_state_newyork'],
                   'occurence': [5, 5]})

Try this:

df['Country'] = df['location'].str.rpartition('_')[0].str.replace("_", " ")
df['City']    = df['location'].str.rpartition('_')[2]
df[['Date','Country', 'City', 'occurence']]

      Date        Country      City  occurence
0  somedate  united kingdom   london          5
1  somedate    united state  newyork          5

Borrowing an idea from @MaxU:

df[['Country'," " , 'City']] = (df.location.str.replace('_',' ').str.rpartition(' ', expand= True ))
df[['Date','Country', 'City','occurence' ]]

      Date        Country      City  occurence
0  somedate  united kingdom   london          5
1  somedate    united state  newyork          5

2 Comments

But you will have an empty column name in the second method.
@shivsn, yes, it's not used.
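
For what it's worth, the unused separator column mentioned in these comments can be avoided by keeping the result of rpartition in a temporary frame and picking out only the pieces that are needed. This is just a sketch on the same sample data; the name `parts` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['somedate', 'somedate'],
                   'location': ['united_kingdom_london', 'united_state_newyork'],
                   'occurence': [5, 5]})

# rpartition returns three columns: 0 (before), 1 (the separator), 2 (after).
parts = df['location'].str.replace('_', ' ', regex=False).str.rpartition(' ')

# Assign only the two useful pieces, so no " " column is created on df.
df['Country'] = parts[0]
df['City'] = parts[2]

result = df[['Date', 'Country', 'City', 'occurence']]
print(result)
```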
0

Consider splitting the column's string value using rfind()

import pandas as pd

df = pd.DataFrame({'Date': ['somedate', 'somedate'],
                   'location': ['united_kingdom_london', 'united_state_newyork'],
                   'occurence': [5, 5]})

df['country'] = df['location'].apply(lambda x: x[0:x.rfind('_')])
df['city'] = df['location'].apply(lambda x: x[x.rfind('_')+1:])

df = df[['Date', 'country', 'city', 'occurence']]
print(df)

#        Date         country     city  occurence
# 0  somedate  united_kingdom   london          5
# 1  somedate    united_state  newyork          5


0

Try this:

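# Assumes '_' and '-' in df['location'] were already replaced with spaces,
# as in the question's code.
splits = df['location'].str.split(' ')

# Use the .str accessor to slice each row's list of words;
# plain splits[0:-1] would slice the rows of the Series instead.
temp_location = {}
temp_location['country'] = splits.str[:-1].str.join(' ').tolist()
temp_location['city'] = splits.str[-1].tolist()

location_data = pd.DataFrame(temp_location)

If you want it back in the original df:

df['country'] = splits.str[:-1].str.join(' ')
df['city'] = splits.str[-1]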


0

Something like this works:

import pandas as pd

df = pd.DataFrame({'Date': ['somedate', 'somedate'],
                   'location': ['united_kingdom_london', 'united_state_newyork'],
                   'occurence': [5, 5]})

# Reverse each string, replace the first "_" (the last one in the original), reverse back
df.location = df.location.str[::-1].str.replace("_", " ", n=1).str[::-1]
newcols = pd.DataFrame(df.location.str.split(" ").tolist(),
                       columns=["country", "city"])
newcols.country = newcols.country.str.replace("_", " ")
df = pd.concat([df, newcols], axis=1)
df.drop("location", axis=1, inplace=True)
print(df)

         Date  occurence         country     city
  0  somedate          5  united kingdom   london
  1  somedate          5    united state  newyork

You could use a regex in the replace for a more complicated pattern, but if it's just the word after the last _, I find it easier to reverse the string twice as a hack rather than fiddling around with regular expressions.
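
For comparison, here is a rough sketch of two alternatives to the double reverse, on the same sample data. The first uses a regex with a lookahead that matches only the last underscore; the second uses .str.rsplit with n=1 to split once from the right. The "|" marker and the names `last_sep`, `parts` and `out` are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['somedate', 'somedate'],
                   'location': ['united_kingdom_london', 'united_state_newyork'],
                   'occurence': [5, 5]})

# Option 1: a regex that matches only the last underscore,
# i.e. an underscore followed by no further underscores up to the end.
last_sep = df['location'].str.replace(r'_(?=[^_]*$)', '|', regex=True)

# Option 2: split once from the right, no regex needed.
parts = df['location'].str.rsplit('_', n=1, expand=True)
parts.columns = ['country', 'city']
parts['country'] = parts['country'].str.replace('_', ' ', regex=False)

out = pd.concat([df[['Date', 'occurence']], parts], axis=1)
print(out)
```

Both options still assume the city is always the single word after the last underscore.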


0

I would use the .str.extract() method:

In [107]: df
Out[107]:
       Date               location  occurence
0  somedate  united_kingdom_london          5
1  somedate   united_state_newyork          5
2  somedate         germany_munich          5

In [108]: df[['country','city']] = (df.location.str.replace('_',' ')
   .....:                             .str.extract(r'(.*)\s+([^\s]*)', expand=True))

In [109]: df
Out[109]:
       Date               location  occurence         country     city
0  somedate  united_kingdom_london          5  united kingdom   london
1  somedate   united_state_newyork          5    united state  newyork
2  somedate         germany_munich          5         germany   munich

In [110]: df = df.drop('location', axis=1)

In [111]: df
Out[111]:
       Date  occurence         country     city
0  somedate          5  united kingdom   london
1  somedate          5    united state  newyork
2  somedate          5         germany   munich

PS: please be aware that it is not possible to reliably distinguish rows containing a two-word country + one-word city from rows containing a one-word country + two-word city (unless you have a full list of countries to check each row against).
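
To illustrate the country-list idea, here is a minimal sketch. It assumes a hand-written list `KNOWN_COUNTRIES` (hypothetical, and far from complete) and a made-up helper `split_location`; the row `united_state_new_york` is added only to show a two-word city. Real data would need a full list of countries.

```python
import pandas as pd

# Hypothetical list of known countries, written with underscores as in the data.
KNOWN_COUNTRIES = ["united_kingdom", "united_state", "germany"]

def split_location(location, countries=KNOWN_COUNTRIES):
    """Return (country, city), trying the longest known country prefix first."""
    for country in sorted(countries, key=len, reverse=True):
        prefix = country + "_"
        if location.startswith(prefix):
            city = location[len(prefix):]
            return country.replace("_", " "), city.replace("_", " ")
    # Fallback: treat the last word as the city, as the other answers do.
    country, _, city = location.rpartition("_")
    return country.replace("_", " "), city.replace("_", " ")

df = pd.DataFrame({
    "Date": ["somedate"] * 3,
    "location": ["united_kingdom_london", "united_state_new_york", "germany_munich"],
    "occurence": [5, 5, 5],
})

pairs = [split_location(loc) for loc in df["location"]]
df["country"] = [country for country, _ in pairs]
df["city"] = [city for _, city in pairs]
print(df[["Date", "country", "city", "occurence"]])
```

Rows whose country is not in the list fall back to the last-underscore rule, so they can still be split incorrectly.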

