How to create a new dataframe from existing dataframes?

Question

I have the following 2 dataframes:

df1

product_ID         tags
100         chocolate, sprinkles
101         chocolate, filled
102         glazed

df2

customer   product_ID
A            100
A            101
B            101
C            100
C            102
B            101
A            100
C            102

I should be able to create a new dataframe like this.

| customer | chocolate | sprinkles | filled | glazed |
|----------|-----------|-----------|--------|--------|
| A        | ?         | ?         | ?      | ?      |
| B        | ?         | ?         | ?      | ?      |
| C        | ?         | ?         | ?      | ?      |

Where the contents of cells represent the count of occurrences of product attribute.

I've used merge and got the following result

df3 = pd.merge(df2, df1)
df3.drop(['product'], axis = 1)

customer       tags
A        chocolate, sprinkles
C        chocolate, sprinkles
A        chocolate, sprinkles
A        chocolate, filled
B        chocolate, filled
B        chocolate, filled
C        glazed
C        glazed

How do we get to the final result from here? Thanks in advance!

BENY · Accepted Answer · 2018-06-08 01:02:15Z

3

Using get_dummies

df.set_index('customer').tags.str.get_dummies(sep=',').sum(level=0)
Out[593]: 
          chocolate  filled  glazed  sprinkles
customer                                      
A                 3       1       0          2
C                 1       0       2          1
B                 2       2       0          0

answered Jun 8, 2018 at 1:02

BENY

324k22 gold badges176 silver badges250 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

jpp · Accepted Answer · 2018-06-08 00:03:27Z

You can do this in 2 steps:

Expand / flatten your dataframe with a series of comma separated strings.
Use pandas.crosstab to tabulate your counts.

Here's an example assuming you have performed your merge and the result is df:

import numpy as np
from itertools import chain

# split by comma to form series of lists
tag_split = df['tags'].str.split(',')

# create expanded dataframe
df_full = pd.DataFrame({'customer': np.repeat(df['customer'], tag_split.map(len)),
                        'tags': list(chain.from_iterable(tag_split))})

# use pd.crosstab for result
res = pd.crosstab(df_full['customer'], df_full['tags'])

print(res)

tags       filled   sprinkles  chocolate  glazed
customer                                        
A               1           2          3       0
B               2           0          2       0
C               0           1          1       2

worked like a charm! Could you explain the code you've done under the section #create expanded dataframes?
Sure, np.repeat(df['customer'], tag_split.map(len)) repeats each value in the customer column by the respective length of tags. list(chain.from_iterable(tag_split)) creates a single list of all the comma-separated elements in tags. You can read up the docs for np.repeat and itertools.chain.from_iterable.

Collectives™ on Stack Overflow

How to create a new dataframe from existing dataframes?

2 Answers 2

Comments

2 Comments

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

2 Comments

Related