DataFrames: iterating over set-values to create multiple boolean columns?

Question

The column terms stores a set of a few strings (out of a fixed set of ~1000 strings).

df = pd.DataFrame([[{'city', 'mouse'}], 
                   [{'mouse'}], 
                   [{'blue'}]], 
                  columns=['terms'])

Out[1]
           terms
0  {mouse, city}
1        {mouse}
2         {blue}

I want to iterate over the rows and count occurrences of each unique term per row, so I plan to create a boolean column for each term found. Something like:

           terms  has_mouse  has_city  has_blue
0  {mouse, city}          1         1         0
1        {mouse}          1         0         0
2         {blue}          0         0         1

I tried this:

def count_terms_in_row(row):
    for term in row['terms']:
        row['has_{}'.format(term)] = 1

df.apply(count_terms_in_row, axis=1)

However, that didn't work as planned. What's the right approach here?

Thank you, please see edit - need to count each term separately. – Adam B, Commented Apr 27, 2020 at 14:30

user13417995 · Accepted Answer · 2020-04-27 14:46:21Z

You can do the following:

import pandas as pd
import numpy as np

df = pd.DataFrame([[{'city', 'mouse'}], 
                   [{'mouse'}], 
                   [{'blue'}]], 
                  columns=['terms'])


all_terms = set()
for idx, data in df.iterrows():
  all_terms = all_terms.union(data["terms"])

# find out all new columns
new_columns = []
term2idx = {}
for idx, term in enumerate(all_terms):
  new_columns.append("has_term_{}".format(term))
  term2idx[term] = idx

# add new data per new column
new_data = []
for idx, data in df.iterrows():
  _row = [0] * len(new_columns)
  for term in data["terms"]:
    _row[term2idx[term]] = 1
  new_data.append(_row)

# add new data to existing DataFrame
new_data = np.asarray(new_data)
for idx in range(len(new_columns)):
  df[new_columns[idx]] = new_data[:,idx]

print(df.head())

This results in:

           terms  has_term_city  has_term_blue  has_term_mouse
0  {city, mouse}              1              0               1
1        {mouse}              0              0               1
2         {blue}              0              1               0
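
The same idea (collect every term, then flag membership per row) can be written more compactly. The following is only a sketch under a few assumptions: the terms are sorted so the column order is reproducible, and the names `flags` and `result` are introduced here for illustration and do not appear in the answer above.

```python
import pandas as pd

df = pd.DataFrame([[{'city', 'mouse'}],
                   [{'mouse'}],
                   [{'blue'}]],
                  columns=['terms'])

# Union of all sets in the column; sorted so the column order is deterministic.
all_terms = sorted(set().union(*df['terms']))

# Build all indicator columns at once: 1 if the term is in the row's set, else 0.
flags = pd.DataFrame(
    {'has_term_{}'.format(term): [int(term in terms) for terms in df['terms']]
     for term in all_terms},
    index=df.index,
)

# Attach the indicator columns to the original frame (aligned on the index).
result = df.join(flags)
print(result)
```

Building the indicator frame in one step and joining it avoids inserting columns one at a time, which may matter when there are on the order of 1000 terms.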

Quang Hoang · Accepted Answer · 2020-04-27 14:34:23Z

This is essentially get_dummies:

# .sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df.join(pd.get_dummies(df.terms.apply(list).explode())
          .groupby(level=0).sum()
          .add_prefix('has_')
       )

Output:

           terms  has_blue  has_city  has_mouse
0  {mouse, city}         0         1          1
1        {mouse}         0         0          1
2         {blue}         1         0          0
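
The question mentions a fixed vocabulary of about 1000 strings. If every vocabulary term should get a column, including terms that never occur in the data, one option is to reindex the dummy columns against that vocabulary. This is a sketch, not part of the answer above: the `vocabulary` list (with the extra term `'red'`) and the names `dummies` and `result` are hypothetical and only serve the illustration.

```python
import pandas as pd

df = pd.DataFrame([[{'city', 'mouse'}],
                   [{'mouse'}],
                   [{'blue'}]],
                  columns=['terms'])

# Hypothetical stand-in for the fixed vocabulary; 'red' never occurs in df.
vocabulary = ['blue', 'city', 'mouse', 'red']

dummies = (
    pd.get_dummies(df['terms'].apply(list).explode())  # one row per (row, term) pair
      .groupby(level=0).sum()                          # collapse back to one row per original index
      .reindex(columns=vocabulary, fill_value=0)       # add all-zero columns for unseen terms
      .astype(int)                                     # normalize bool/uint8 dummies to plain ints
      .add_prefix('has_')
)

result = df.join(dummies)
print(result)
```

Reindexing also fixes the column order to that of the vocabulary, so the frame has the same layout regardless of which terms happen to be present.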

NYC Coder · Accepted Answer · 2020-04-27 14:24:43Z

You can try this:

df['count'] = df['terms'].str.len()
print(df)

           terms  count
0  {mouse, city}      2
1        {mouse}      1
2         {blue}      1
