1

The column terms stores a set of a few strings per row (drawn from a fixed set of ~1000 strings).

df = pd.DataFrame([[{'city', 'mouse'}], 
                   [{'mouse'}], 
                   [{'blue'}]], 
                  columns=['terms'])

Out[1]
           terms
0  {mouse, city}
1        {mouse}
2         {blue}

I want to iterate over the rows and record which terms each row contains, so I plan to create a 0/1 indicator column for each term found. Something like:

           terms  has_mouse  has_city  has_blue
0  {mouse, city}          1         1         0
1        {mouse}          1         0         0
2         {blue}          0         0         1

I tried this:

def count_terms_in_row(row):
    for term in row['terms']:
        row['has_{}'.format(term)] = 1

df.apply(count_terms_in_row, axis=1)

However, that didn't work as planned. What's the right approach here?

2
  • df.terms.apply(len)? Commented Apr 27, 2020 at 14:15
  • Thank you, please see edit - need to count each term separately. Commented Apr 27, 2020 at 14:30

3 Answers

2

You can do the following:

import pandas as pd
import numpy as np

df = pd.DataFrame([[{'city', 'mouse'}], 
                   [{'mouse'}], 
                   [{'blue'}]], 
                  columns=['terms'])


all_terms = set()
for idx, data in df.iterrows():
  all_terms = all_terms.union(data["terms"])

# find out all new columns
new_columns = []
term2idx = {}
for idx, term in enumerate(all_terms):
  new_columns.append("has_term_{}".format(term))
  term2idx[term] = idx

# add new data per new column
new_data = []
for idx, data in df.iterrows():
  _row = [0] * len(new_columns)
  for term in data["terms"]:
    _row[term2idx[term]] = 1
  new_data.append(_row)

# add new data to existing DataFrame
new_data = np.asarray(new_data)
for idx in range(len(new_columns)):
  df[new_columns[idx]] = new_data[:,idx]

print(df.head())

This results in:

           terms  has_term_city  has_term_blue  has_term_mouse
0  {city, mouse}              1              0               1
1        {mouse}              0              0               1
2         {blue}              0              1               0
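
The same idea can be written more compactly. The following is only a sketch: the names vocabulary, flags and result are illustrative, and sorting the terms is an assumption added here so that the column order does not depend on set iteration order.

```python
import pandas as pd

df = pd.DataFrame({"terms": [{"city", "mouse"}, {"mouse"}, {"blue"}]})

# Collect every term that occurs anywhere; sorting gives a stable column order.
vocabulary = sorted(set().union(*df["terms"]))

# One row of 0/1 flags per original row, one column per term.
flags = pd.DataFrame(
    [[int(term in terms) for term in vocabulary] for terms in df["terms"]],
    columns=[f"has_term_{term}" for term in vocabulary],
    index=df.index,
)

result = df.join(flags)
print(result)
```

If the full list of ~1000 allowed strings is known up front, it could be used as vocabulary directly, which would also create all-zero columns for terms that never occur.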

1

This is essentially get_dummies:

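# Series.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df.join(pd.get_dummies(df.terms.apply(list).explode())
          .groupby(level=0).sum()
          .add_prefix('has_')
       )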

Output:

           terms  has_blue  has_city  has_mouse
0  {mouse, city}         0         1          1
1        {mouse}         0         0          1
2         {blue}         1         0          0
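
To make the chain easier to follow, here is a sketch that splits it into steps. The intermediate names (exploded, dummies, per_row, result) are illustrative, and the final astype(int) is an addition to get plain integer flags regardless of the dtype get_dummies produces in a given pandas version.

```python
import pandas as pd

df = pd.DataFrame({"terms": [{"city", "mouse"}, {"mouse"}, {"blue"}]})

# 1. One row per (original index, term); the index repeats for multi-term rows.
exploded = df["terms"].apply(list).explode()

# 2. One indicator column per distinct term.
dummies = pd.get_dummies(exploded)

# 3. Collapse the repeated index values back to one row per original row.
per_row = dummies.groupby(level=0).sum().astype(int)

# 4. Prefix the new columns and attach them to the original frame.
result = df.join(per_row.add_prefix("has_"))
print(result)
```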

0

You can try this:

df['count'] = df['terms'].str.len()
print(df)

           terms  count
0  {mouse, city}      2
1        {mouse}      1
2         {blue}      1

1 Comment

Thank you, please see edit - need to count each term separately.
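
For per-term flags rather than a single size per row, the apply-based attempt from the question can probably be repaired. That function has no return statement, so apply collects None for every row and the assignments to row never reach df. A minimal sketch that returns the flags instead (flag_terms, flags and result are illustrative names, and every row is assumed to hold a non-empty set):

```python
import pandas as pd

df = pd.DataFrame({"terms": [{"city", "mouse"}, {"mouse"}, {"blue"}]})

def flag_terms(terms):
    # Return the flags for this row instead of assigning into the row.
    return pd.Series({f"has_{term}": 1 for term in terms})

# Applying a Series-returning function yields a DataFrame; terms missing
# from a row come back as NaN, so fill them with 0 and cast to int.
flags = df["terms"].apply(flag_terms).fillna(0).astype(int)

result = df.join(flags)
print(result)
```

This builds one small Series per row, so it is likely slower than the get_dummies approach on large frames.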
