Converting specific column values to header and header to column values

Question

I have csv like this,

Market_id   Metric  New_Mar_19  New_Feb_19  tot_Mar_19  tot_Feb_19  
7528            1       9           7           6.5         8           
7528            2       6           3.2         2.2         3

I want to convert the above csv to like this below,

  Month         Nrx_count       Nrx_quantity    Trx_count       Trx_quantity

    Mar_19      9                   6               6.5             2.2
    Feb_19      7                   3.2             8               3

I have tried using pivot and pd.melt but I am not able to like above.

I am thinking about Nrx and Trx, what is logic for assign to output columns? — jezrael
– jezrael, Commented May 21, 2019 at 13:49
@jezrael, Basically my metric, market_id and other column will be having repeating values, its throwing error 'Index contains duplicate entries, cannot reshape' Could you help ? — venkat
– venkat, Commented May 21, 2019 at 14:46
@jezrael, Is it possible to do this whole transformation is SQL ? — venkat
– venkat, Commented May 22, 2019 at 1:06
I think is it possible, but it is something similar like pivot, what is really complicated in SQL. — jezrael
– jezrael, Commented May 22, 2019 at 5:00

jezrael · Accepted Answer · 2019-05-22 05:14:33Z

Idea is convert columns without _ to MultiIndex, then split column by _ and reshape by DataFrame.stack, then by DataFrame.unstack:

df1 = df.set_index(['Market_id','Metric'])
df1.columns = df1.columns.str.split('_', n=1, expand=True)
df1 = df1.stack()
print (df1)
                         New  tot
Market_id Metric                 
7528      1      Feb_19  7.0  8.0
                 Mar_19  9.0  6.5
          2      Feb_19  3.2  3.0
                 Mar_19  6.0  2.2

#second level, here Metric is converted to columns
df1 = df1.unstack(1)
df1.columns = [f'{a}_{b}' for a, b in df1.columns]
df1 = df1.reset_index().rename(columns={'level_1':'y'})
print (df1)
   Market_id       y  New_1  New_2  tot_1  tot_2
0       7528  Feb_19    7.0    3.2    8.0    3.0
1       7528  Mar_19    9.0    6.0    6.5    2.2

#first level, here Market_id is converted to columns
df1 = df1.unstack(0)
df1.columns = [f'{a}_{b}' for a, b in df1.columns]
df1 = df1.reset_index().rename(columns={'level_1':'y'})
print (df1)
   Metric       y  New_7528  tot_7528
0       1  Feb_19       7.0       8.0
1       1  Mar_19       9.0       6.5
2       2  Feb_19       3.2       3.0
3       2  Mar_19       6.0       2.2

EDIT:

If get error:

ValueError: Index contains duplicate entries, cannot reshape

it means there are duplicates. Solution is same, only is necessary add aggregation - e.g. by mean, sum...

#change data for duplicates in column for Multiindex - Market_id, Metric
print (df)
   Market_id  Metric  New_Mar_19  New_Feb_19  tot_Mar_19  tot_Feb_19
0       7528       1           2         5.0         5.0           8
1       7528       1          10        20.0         2.5          16
2       7528       2           6         3.2         2.2           3  

df1 = df.set_index(['Market_id','Metric'])
df1.columns = df1.columns.str.split('_', n=1, expand=True)
df1 = df1.stack()
print (df1)
                          New   tot
Market_id Metric                   
7528      1      Feb_19   5.0   8.0
                 Mar_19   2.0   5.0
                 Feb_19  20.0  16.0
                 Mar_19  10.0   2.5
          2      Feb_19   3.2   3.0
                 Mar_19   6.0   2.2

#number of levels passes to groupby - here 3 levels -> 0,1,2
print (df1.index.nlevels)
3

#grouping by all levels and aggregate mean, sum,max...
df1 = df1.groupby(level=[0,1,2]).mean().unstack(1)
print (df1)
                   New         tot     
Metric               1    2      1    2
Market_id                              
7528      Feb_19  12.5  3.2  12.00  3.0
          Mar_19   6.0  6.0   3.75  2.2

df1.columns = [f'{a}_{b}' for a, b in df1.columns]
df1 = df1.reset_index().rename(columns={'level_1':'y'})
print (df1)
   Market_id       y  New_1  New_2  tot_1  tot_2
0       7528  Feb_19   12.5    3.2  12.00    3.0
1       7528  Mar_19    6.0    6.0   3.75    2.2

sorry I want the market_id column also in output Could you also help me with that ?
Will this kind of approach works if i have few other columns like market_id ?
@venkat - Change df1 = df.set_index(['Market_id','Metric']) to df1 = df.set_index(['Market_id','Metric', 'col without _', 'another col without_'])
Basically my metric, market_id and other column will be having repeating values, its throwing error Could you help ?
@venkat: it looks like you have a new question. If you have done sufficient research on the problem, then ask a new question. You can invite someone to take a look by providing the link to them, but tread carefully - if someone had the time to help you on this occasion, it does not mean they will always have the time to help you.

Collectives™ on Stack Overflow

Converting specific column values to header and header to column values

1 Answer 1

7 Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

7 Comments

Related