
I'm looking to create category-specific columns based on the corresponding category for some of the columns.

I've accomplished this in a roundabout way by (1) slicing the 2 categories into two separate dataframes, (2) merging the two dataframes on the date, (3) deleting redundant columns, (4) creating new columns (category agnostic), and (5) deleting the category-specific columns. Do you know of a more efficient way to do this transformation? My code is below the example input/output.

Input:

    wk start    car            rims  color   Autopilot$  Sunroof$
0   2018-09-09  tesla model x  17    black   3000        0
1   2018-09-16  tesla model x  14    yellow  3000        0
2   2018-09-23  tesla model x  13    white   3000        0
3   2018-09-09  tesla model 3  19    grey    0           2000
4   2018-09-16  tesla model 3  21    pink    0           2000

Ideal Output:

    wk          rims-mod3  rims-modx  color-mod3  color-modx  Auto$  roof$
0   2018-09-09  17         0          black       grey        3000   2000
1   2018-09-16  14         19         yellow      pink        3000   2000
2   2018-09-23  13         21         white       NaN         3000   0

My code:

import pandas as pd
df = pd.DataFrame({'wk start': ['2018-09-09', '2018-09-16', '2018-09-23','2018-09-09', '2018-09-16'], 
    'car': [ 'tesla model x', 'tesla model x', 'tesla model x','tesla model 3','tesla model 3'],
    'rims': [17,14,13,19,21],
    'color':['black','yellow','white','grey','pink'],
    'Autopilot$':[3000,3000, 3000,0,0],
    'Sunroof$':[0,0,0,2000,2000]})
# (1) slice the two categories into separate dataframes
model3 = df[df['car'] == 'tesla model 3']
modelx = df[df['car'] == 'tesla model x']
# (2) merge the two dataframes on the date
example = model3.merge(modelx, how='outer', left_on='wk start', right_on='wk start', suffixes=('_model3', '_modelx'))
# (3) delete the redundant car columns
del example['car_model3']
del example['car_modelx']
# (4) create new category-agnostic columns
example['AUTOPILOT'] = example['Autopilot$_model3'] + example['Autopilot$_modelx']
example['SUNROOF'] = example['Sunroof$_model3'] + example['Sunroof$_modelx']
# (5) delete the category-specific columns
del example['Autopilot$_model3']
del example['Autopilot$_modelx']
del example['Sunroof$_model3']
del example['Sunroof$_modelx']

Other resources used are question1, question2

1 Answer

Use:

df = df.set_index(['wk start','car']).unstack()
df.columns = df.columns.map('_'.join)

df = df.reset_index()

df = df.loc[:, df.fillna(0).ne(0).any()]
print (df)
     wk start  rims_tesla model 3  rims_tesla model x color_tesla model 3  \
0  2018-09-09                19.0                17.0                grey   
1  2018-09-16                21.0                14.0                pink   
2  2018-09-23                 NaN                13.0                 NaN   

  color_tesla model x  Autopilot$_tesla model x  Sunroof$_tesla model 3  
0               black                    3000.0                  2000.0  
1              yellow                    3000.0                  2000.0  
2               white                    3000.0                     NaN  

Explanation:

  1. Reshape by set_index with unstack
  2. Flatten the MultiIndex in columns by map and join
  3. Convert the index to a column by DataFrame.reset_index
  4. Last, remove the all-zero columns by boolean indexing with loc
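
If the shorter column labels from the ideal output (mod3, modx, Auto$, and so on) are also wanted, one option is to rename the car values before reshaping. A minimal sketch of that idea, assuming the mod3/modx naming is what you are after:

import pandas as pd

df = pd.DataFrame({'wk start': ['2018-09-09', '2018-09-16', '2018-09-23', '2018-09-09', '2018-09-16'],
                   'car': ['tesla model x', 'tesla model x', 'tesla model x', 'tesla model 3', 'tesla model 3'],
                   'rims': [17, 14, 13, 19, 21],
                   'color': ['black', 'yellow', 'white', 'grey', 'pink'],
                   'Autopilot$': [3000, 3000, 3000, 0, 0],
                   'Sunroof$': [0, 0, 0, 2000, 2000]})

# shorten the category labels before reshaping (assumed mapping)
df['car'] = df['car'].map({'tesla model 3': 'mod3', 'tesla model x': 'modx'})

out = df.set_index(['wk start', 'car']).unstack()
out.columns = out.columns.map('_'.join)        # e.g. 'rims_mod3', 'color_modx'
out = out.reset_index()
out = out.loc[:, out.fillna(0).ne(0).any()]    # drop the all-zero / all-NaN columns
print(out)

The reshape itself is unchanged; only the labels that end up in the flattened column names differ.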

EDIT:

Can you explain this line a bit: df.loc[:, df.fillna(0).ne(0).any()]? I can't figure out what it does. There aren't any NaN values.

If you use unstack, it is possible to get some missing values, like in this sample:

print (df)
     wk start  rims_tesla model 3  rims_tesla model x color_tesla model 3  \
0  2018-09-09                19.0                17.0                grey   
1  2018-09-16                21.0                14.0                pink   
2  2018-09-23                 NaN                13.0                 NaN   

  color_tesla model x  Autopilot$_tesla model 3  Autopilot$_tesla model x  \
0               black                       0.0                    3000.0   
1              yellow                       0.0                    3000.0   
2               white                       NaN                    3000.0   

   Sunroof$_tesla model 3  Sunroof$_tesla model x  
0                  2000.0                     0.0  
1                  2000.0                     0.0  
2                     NaN                     0.0  

So we need to return True for columns which do not contain all zeros, or all zeros mixed with NaNs (which is the reason for using fillna(0)):

print (df.fillna(0).ne(0))
   wk start  rims_tesla model 3  rims_tesla model x  color_tesla model 3  \
0      True                True                True                 True   
1      True                True                True                 True   
2      True               False                True                False   

   color_tesla model x  Autopilot$_tesla model 3  Autopilot$_tesla model x  \
0                 True                     False                      True   
1                 True                     False                      True   
2                 True                     False                      True   

   Sunroof$_tesla model 3  Sunroof$_tesla model x  
0                    True                   False  
1                    True                   False  
2                   False                   False  

Check if there is at least one True per column with any:

print (df.fillna(0).ne(0).any())
wk start                     True
rims_tesla model 3           True
rims_tesla model x           True
color_tesla model 3          True
color_tesla model x          True
Autopilot$_tesla model 3    False
Autopilot$_tesla model x     True
Sunroof$_tesla model 3       True
Sunroof$_tesla model x      False
dtype: bool
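
To see the filtering step in isolation, here is a small self-contained sketch (the toy column names are invented for illustration):

import numpy as np
import pandas as pd

toy = pd.DataFrame({'keep': [1, 2, 3],
                    'all_zero': [0, 0, 0],
                    'zero_and_nan': [0, np.nan, 0]})

# treat NaN as 0, test each cell for "not equal to 0", then ask per column
# whether any cell passed the test
mask = toy.fillna(0).ne(0).any()
print(mask)
# keep             True
# all_zero        False
# zero_and_nan    False
# dtype: bool

# boolean indexing on the columns axis keeps only the True columns
print(toy.loc[:, mask])

So df.loc[:, df.fillna(0).ne(0).any()] keeps every column that has at least one non-zero, non-NaN value.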

1 Comment

jezrael, can you explain this line a bit: df.loc[:, df.fillna(0).ne(0).any()]? I can't figure out what it does. There aren't any NaN values.
