Pandas DataFrame Calculations

Question

I have a fairly complex dataframe that looks like this :

df = pd.DataFrame({'0': {('Total Number of End Points', '0.01um', '0hr'): 12,
  ('Total Number of End Points', '0.1um', '0hr'): 8,
  ('Total Number of End Points', 'Control', '0hr'): 4,
  ('Total Number of End Points', '0.01um', '24hr'): 18,
  ('Total Number of End Points', '0.1um', '24hr'): 12,
  ('Total Number of End Points', 'Control', '24hr'): 6,
  ('Total Vessel Length', '0.01um', '0hr'): 12,
  ('Total Vessel Length', '0.1um', '0hr'): 8,
  ('Total Vessel Length', 'Control', '0hr'): 4,
  ('Total Vessel Length', '0.01um', '24hr'): 18,
  ('Total Vessel Length', '0.1um',  '24hr'): 12,
  ('Total Vessel Length', 'Control',  '24hr'): 6},
  '1': {('Total Number of End Points', '0.01um', '0hr'): 12,
  ('Total Number of End Points', '0.1um', '0hr'): 8,
  ('Total Number of End Points', 'Control', '0hr'): 4,
  ('Total Number of End Points', '0.01um', '24hr'): 18,
  ('Total Number of End Points', '0.1um', '24hr'): 12,
  ('Total Number of End Points', 'Control', '24hr'): 6,
  ('Total Vessel Length', '0.01um', '0hr'): 12,
  ('Total Vessel Length', '0.1um', '0hr'): 8,
  ('Total Vessel Length', 'Control', '0hr'): 4,
  ('Total Vessel Length', '0.01um', '24hr'): 18,
  ('Total Vessel Length', '0.1um',  '24hr'): 12,
  ('Total Vessel Length', 'Control',  '24hr'): 6},
  '2': {('Total Number of End Points', '0.01um', '0hr'): 12,
  ('Total Number of End Points', '0.1um', '0hr'): 8,
  ('Total Number of End Points', 'Control', '0hr'): 4,
  ('Total Number of End Points', '0.01um', '24hr'): 18,
  ('Total Number of End Points', '0.1um', '24hr'): 12,
  ('Total Number of End Points', 'Control', '24hr'): 6,
  ('Total Vessel Length', '0.01um', '0hr'): 12,
  ('Total Vessel Length', '0.1um', '0hr'): 8,
  ('Total Vessel Length', 'Control', '0hr'): 4,
  ('Total Vessel Length', '0.01um', '24hr'): 18,
  ('Total Vessel Length', '0.1um',  '24hr'): 12,
  ('Total Vessel Length', 'Control',  '24hr'): 6}})

print(df)
                                                 0   1   2
        Total Number of End Points 0.01um  0hr   12  12  12
                                           24hr  18  18  18
                                   0.1um   0hr    8   8   8
                                           24hr  12  12  12
                                   Control 0hr    4   4   4
                                           24hr   6   6   6
        Total Vessel Length        0.01um  0hr   12  12  12
                                           24hr  18  18  18
                                   0.1um   0hr    8   8   8
                                           24hr  12  12  12
                                   Control 0hr    4   4   4
                                           24hr   6   6   6

I'm trying to divide each value by the average of the columns in the corresponding control level. I tried the following but it didn't work.

df2 = df.divide(df.xs('Control', level=1).mean(axis=1), axis='index')

I'm pretty new to python and pandas so I tend to think in MS Excel terms on this problem.

If it were in Excel the formula for A1 ('Total Number of End Points', '0.01um', '0hr', 0) would look be :

=A1 / AVERAGE($A$5:$C$5)

B1 ('Total Number of End Points', '0.01um', '0hr', 1) would be :

=B1 / AVERAGE($A$5:$C$5)

and A2 ('Total Number of End Points', '0.01um', '24hr', 0) would be

=A1 / AVERAGE($A$6:$C$6)

The desired result of this example would be :

                                                 0  1  2
        Total Number of End Points 0.01um  0hr   3  3  3
                                           24hr  3  3  3
                                   0.1um   0hr   2  2  2
                                           24hr  2  2  2
                                   Control 0hr   1  1  1
                                           24hr  1  1  1
        Total Vessel Length        0.01um  0hr   3  3  3
                                           24hr  3  3  3
                                   0.1um   0hr   2  2  2
                                           24hr  2  2  2
                                   Control 0hr   1  1  1
                                           24hr  1  1  1

Note : There are many indexes and columns in the real data.

When I put you data, at the top of your question into a DataFrame it differs from what you get with print(df). The df=... and the print(df) are two different DataFrames. Your print(df) is unrelated to the code above. Your input columns as ['a', 'b'], but your printed columns are [0, 1, 2]. Can you make it all consistent. Thanks. — Mark Graph
– Mark Graph, Commented Apr 18, 2015 at 0:24
In Pandas the data is organized internally by columns, and so it is easiest to extract or calculate a column. Could you perhaps reorganize the data so that all the control values are in their own column? — Paul
– Paul, Commented Apr 18, 2015 at 0:32

unutbu · Accepted Answer · 2015-04-18 10:33:39Z

It helps to have the Control values in their own columns. You can do that using unstack:

df.index.names = ['field', 'type', 'time']
df2 = df.unstack(['type']).swaplevel(0, 1, axis=1)

# type                            0.01um 0.1um Control 0.01um 0.1um Control  \
#                                      0     0       0      1     1       1   
# field                      time                                             
# Total Number of End Points 0hr      12     8       4     12     8       4   
#                            24hr     18    12       6     18    12       6   
# Total Vessel Length        0hr      12     8       4     12     8       4   
#                            24hr     18    12       6     18    12       6   

# type                            0.01um 0.1um Control  
#                                      2     2       2  
# field                      time                       
# Total Number of End Points 0hr      12     8       4  
#                            24hr     18    12       6  
# Total Vessel Length        0hr      12     8       4  
#                            24hr     18    12       6

Now find the average of each Control:

ave = df2['Control'].mean(axis=1)
# field                       time
# Total Number of End Points  0hr     4
#                             24hr    6
# Total Vessel Length         0hr     4
#                             24hr    6
# dtype: float64

As you expected, you can use df2.divide to compute the desired result. Be sure to use axis=0 to tell Pandas to match values (in df2 and ave) based on the row index.

result = df2.divide(ave, axis=0)
# type                            0.01um 0.1um Control 0.01um 0.1um Control  \
#                                      0     0       0      1     1       1   
# field                      time                                             
# Total Number of End Points 0hr       3     2       1      3     2       1   
#                            24hr      3     2       1      3     2       1   
# Total Vessel Length        0hr       3     2       1      3     2       1   
#                            24hr      3     2       1      3     2       1   

# type                            0.01um 0.1um Control  
#                                      2     2       2  
# field                      time                       
# Total Number of End Points 0hr       3     2       1  
#                            24hr      3     2       1  
# Total Vessel Length        0hr       3     2       1  
#                            24hr      3     2       1

There are essentially the values that you are after. However, if you want to rearrange the DataFrame to look exactly as you posted, then:

result = result.stack(['type'])
result = result.reorder_levels(['field','type','time'], axis=0)
result = result.reindex(df.index)

yields

                                         0  1  2
field                      type    time         
Total Number of End Points 0.01um  0hr   3  3  3
                                   24hr  3  3  3
                           0.1um   0hr   2  2  2
                                   24hr  2  2  2
                           Control 0hr   1  1  1
                                   24hr  1  1  1
Total Vessel Length        0.01um  0hr   3  3  3
                                   24hr  3  3  3
                           0.1um   0hr   2  2  2
                                   24hr  2  2  2
                           Control 0hr   1  1  1
                                   24hr  1  1  1

Putting it all together:

df.index.names = ['field', 'type', 'time']
df2 = df.unstack(['type']).swaplevel(0, 1, axis=1)
ave = df2['Control'].mean(axis=1)
result = df2.divide(ave, axis=0)
result = result.stack(['type'])
result = result.reorder_levels(['field','type','time'], axis=0)
result = result.reindex(df.index)

Interesting. I hadn't noticed that indexes could be tuples and had all these associated methods.

Paul · Accepted Answer · 2015-04-18 01:26:02Z

The issue here is that pandas is organized to easily calculate over columns, and the question requires an average over a row to be deducted from other rows. Pandas isn't designed to work that way.

However, you can easily switch rows and columns with the transpose .T, and then it may be more tractable, and in fact the control mean is a one liner.

>>> df.T[(u'Total Vessel Length', u'Control', u'0hr')].mean()
4.0

This 4.0 comes from two 4.0 values in the original data:

>>> df.T[(u'Total Vessel Length', u'Control', u'0hr')]
a    4
b    4

At this point it looks like for loops will take care of the issue.

Untested:

for primary in (u'Total Vessel Length',u'Total Number of End Points'):
     for um in (u'0.01um',u'0.1um'):
         for hours in (u'0hr',u'24hr'):
             df.T[(primary,um,hours)]=df.T[(primary,um,hours)]/df.T[(primary, u'Control', hours)].mean()

Note that this doesn't divide the non-control columns, but it is easy to include 'control' into the um loop.

UPDATE That doesn't work, somehow it is not modifying the dataframe in place. Right now, I'm not sure why.

But you can construct a new data frame by calling pd.DataFrame on a dict comprehension.

This seems to be working...

import pandas as pd

df = pd.DataFrame({'0': {('Total Number of End Points', '0.01um', '0hr'): 12,
  ('Total Number of End Points', '0.1um', '0hr'): 8,
  ('Total Number of End Points', 'Control', '0hr'): 4,
  ('Total Number of End Points', '0.01um', '24hr'): 18,
  ('Total Number of End Points', '0.1um', '24hr'): 12,
  ('Total Number of End Points', 'Control', '24hr'): 6,
  ('Total Vessel Length', '0.01um', '0hr'): 12,
  ('Total Vessel Length', '0.1um', '0hr'): 8,
  ('Total Vessel Length', 'Control', '0hr'): 4,
  ('Total Vessel Length', '0.01um', '24hr'): 18,
  ('Total Vessel Length', '0.1um',  '24hr'): 12,
  ('Total Vessel Length', 'Control',  '24hr'): 6},
  '1': {('Total Number of End Points', '0.01um', '0hr'): 12,
  ('Total Number of End Points', '0.1um', '0hr'): 8,
  ('Total Number of End Points', 'Control', '0hr'): 4,
  ('Total Number of End Points', '0.01um', '24hr'): 18,
  ('Total Number of End Points', '0.1um', '24hr'): 12,
  ('Total Number of End Points', 'Control', '24hr'): 6,
  ('Total Vessel Length', '0.01um', '0hr'): 12,
  ('Total Vessel Length', '0.1um', '0hr'): 8,
  ('Total Vessel Length', 'Control', '0hr'): 4,
  ('Total Vessel Length', '0.01um', '24hr'): 18,
  ('Total Vessel Length', '0.1um',  '24hr'): 12,
  ('Total Vessel Length', 'Control',  '24hr'): 6},
  '2': {('Total Number of End Points', '0.01um', '0hr'): 12,
  ('Total Number of End Points', '0.1um', '0hr'): 8,
  ('Total Number of End Points', 'Control', '0hr'): 4,
  ('Total Number of End Points', '0.01um', '24hr'): 18,
  ('Total Number of End Points', '0.1um', '24hr'): 12,
  ('Total Number of End Points', 'Control', '24hr'): 6,
  ('Total Vessel Length', '0.01um', '0hr'): 12,
  ('Total Vessel Length', '0.1um', '0hr'): 8,
  ('Total Vessel Length', 'Control', '0hr'): 4,
  ('Total Vessel Length', '0.01um', '24hr'): 18,
  ('Total Vessel Length', '0.1um',  '24hr'): 12,
  ('Total Vessel Length', 'Control',  '24hr'): 6}})

print df

df2 = pd.DataFrame({(primary,um,hours):df.T[(primary,um,hours)]/df.T[(primary,u'Control',hours)].mean() for primary in (u'Total Vessel Length',u'Total Number of End Points') for um in (u'0.01um',u'0.1um') for hours in (u'0hr',u'24hr')})

print df2.T

OUTPUT

paul@home:~/SO$ python ./r.py 
                                              0   1   2
(Total Number of End Points, 0.01um, 0hr)    12  12  12
(Total Number of End Points, 0.01um, 24hr)   18  18  18
(Total Number of End Points, 0.1um, 0hr)      8   8   8
(Total Number of End Points, 0.1um, 24hr)    12  12  12
(Total Number of End Points, Control, 0hr)    4   4   4
(Total Number of End Points, Control, 24hr)   6   6   6
(Total Vessel Length, 0.01um, 0hr)           12  12  12
(Total Vessel Length, 0.01um, 24hr)          18  18  18
(Total Vessel Length, 0.1um, 0hr)             8   8   8
(Total Vessel Length, 0.1um, 24hr)           12  12  12
(Total Vessel Length, Control, 0hr)           4   4   4
(Total Vessel Length, Control, 24hr)          6   6   6

[12 rows x 3 columns]
                                            0  1  2
(Total Number of End Points, 0.01um, 0hr)   3  3  3
(Total Number of End Points, 0.01um, 24hr)  3  3  3
(Total Number of End Points, 0.1um, 0hr)    2  2  2
(Total Number of End Points, 0.1um, 24hr)   2  2  2
(Total Vessel Length, 0.01um, 0hr)          3  3  3
(Total Vessel Length, 0.01um, 24hr)         3  3  3
(Total Vessel Length, 0.1um, 0hr)           2  2  2
(Total Vessel Length, 0.1um, 24hr)          2  2  2

[8 rows x 3 columns]

I'm getting the same result out as in. Is there an inplace=True needed somewhere?
maybe related. still looking. stackoverflow.com/questions/17995328/…

Collectives™ on Stack Overflow

Pandas DataFrame Calculations

2 Answers 2

1 Comment

3 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

3 Comments

Linked

Related