Add labels to Categorical Data in Dataframe

Question

I am trying to convert survey data on marital status, which looks as follows:

df['d11104'].value_counts()

[1] Married        1    250507
[2] Single         2     99131
[4] Divorced       4     32817
[3] Widowed        3     24839
[5] Separated      5      8098
[-1] keine Angabe         2571
Name: d11104, dtype: int64

So far, I did df['marstat'] = df['d11104'].cat.codes.astype('category'), yielding

df['marstat'].value_counts()
1    250507
2     99131
4     32817
3     24839
5      8098
0      2571
Name: marstat, dtype: int64

Now, I'd like to add labels to the column marstat such that the numerical values are maintained, i.e. I'd like to identify people by the condition df['marstat'] == 1 while at the same time having the labels ['Married','Single','Divorced','Widowed'] attached to this variable. How can this be done?

EDIT: Thanks to jpp's answer, I simply created a new variable and defined the labels by hand:

df['marstat_lb'] = df['marstat'].map({1: 'Married', 2: 'Single', 3: 'Widowed', 4: 'Divorced', 5: 'Separated'})

jpp · Accepted Answer · 2018-04-27 09:09:52Z

You can convert your result to a dataframe and include both the category code and name in the output.

A dictionary of category mapping can be extracted via enumerating the categories. Minimal example below.

import pandas as pd

df = pd.DataFrame({'A': ['M', 'M', 'S', 'D', 'W', 'M', 'M', 'S',
                         'S', 'S', 'M', 'W']}, dtype='category')

print(df.A.cat.categories)

# Index(['D', 'M', 'S', 'W'], dtype='object')

res = df.A.cat.codes.value_counts().to_frame('count')

cat_map = dict(enumerate(df.A.cat.categories))

res['A'] = res.index.map(cat_map.get)

print(res)

#    count  A
# 1      5  M
# 2      4  S
# 3      2  W
# 0      1  D

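For example, you can access the "M" row of res by either res['A'] == 'M' or res.index == 1. On the original data, the equivalent conditions are df['A'] == 'M' and df['A'].cat.codes == 1.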
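To make the row-level case concrete, here is a minimal sketch that selects individual rows either by label or by code. It assumes the column already has category dtype, as in the example above. Note that the codes are the positions pandas assigns within cat.categories, not the original survey codes.

```python
import pandas as pd

df = pd.DataFrame({'A': ['M', 'M', 'S', 'D', 'W', 'M', 'M', 'S',
                         'S', 'S', 'M', 'W']}, dtype='category')

# Select rows by their label ...
by_label = df[df['A'] == 'M']

# ... or by the integer code pandas stores for that label
# (categories are sorted: D=0, M=1, S=2, W=3).
by_code = df[df['A'].cat.codes == 1]

# Both conditions pick out the same rows.
print(by_label.equals(by_code))
```

Keeping the data as a single categorical column means the label and the code never get out of sync, since the code is derived from the categories on demand.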

A more straightforward solution is to apply value_counts directly and then add an extra column for the codes:

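# rename_axis('index') pins the column name; pandas 2.0 and later
# would otherwise name it 'A' after reset_index()
res = df.A.value_counts().rename_axis('index').to_frame('count').reset_index()

res['code'] = res['index'].cat.codes

print(res)

#   index  count  code
# 0     M      5     1
# 1     S      4     2
# 2     W      2     3
# 3     D      1     0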

Thanks. Maybe I wasn't specific enough. I don't want to address items in the frequency table, but in the individual data. Slightly altering your approach, I tried df['marstat'] = df['d11104'].cat.codes and labels = dict(enumerate(df['d11104'].cat.categories)). However, df['marstat_lb'] = df['marstat'].index.map(labels.get) gives me None for every value of df['marstat']. Is there no way to attach a set of labels (maps (?) in Python terminology) 'along' the categorical data? Coming from Stata, it is pretty common there.
Wouldn't you need to do df['marstat_lb'] = df['marstat'].map(labels) in your example? Then you shouldn't get None.
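The suggestion in the comment above can be sketched as follows. The data here is a small made-up stand-in for the survey column d11104, so the codes are whatever pandas assigns to the sorted categories, not the original survey codes. The key point is that .map is called on the column's values, whereas .index.map would look up the row labels instead.

```python
import pandas as pd

# Hypothetical stand-in for the survey column d11104
df = pd.DataFrame({'d11104': ['Married', 'Single', 'Divorced',
                              'Married', 'Widowed']}, dtype='category')

# Integer code for each individual row
df['marstat'] = df['d11104'].cat.codes

# code -> label dictionary, built from the category order
labels = dict(enumerate(df['d11104'].cat.categories))

# Map the *values* of marstat (not its index) to their labels
df['marstat_lb'] = df['marstat'].map(labels)

print(df)
```

With this, a condition such as df['marstat'] == 1 and the readable label in marstat_lb are both available on every row.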
