Pandas: creating dataframe rows from other dataframe information

Question

I'm working with aggregated data, which I need to dis-aggregate in order to process it further. The original df contains a value 'no. of students' per row and I need one row in the new df per student:

Original df:

                faculty A   faculty B   faculty x
male students           2           7       ...
female students         4           3       ...

New df:

 No.           gender  faculty   ...
 1             m       A
 2             m       A
 3             f       A

and so on. The original df contains some more information (like nationality and regional info), but that could be dealt with the same way as with gender, etc. Obviously I'd start by transposing (df.T), but then the fun begins... I'm quite the beginner, any pointer would be very welcome.

unutbu · Accepted Answer · 2015-06-14 12:59:11Z

I think the easiest way to "disaggregate" the data is to use a generator expression to simply enumerate all the desired rows:

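(key for key, val in series.items() for _ in range(val))

import re
import pandas as pd

# Aggregated counts: one row per gender, one column per faculty
df = pd.DataFrame({'faculty A': [2, 4], 'faculty B': [7, 3]},
                  index=['male students', 'female students'])
# Shorten the labels: 'faculty A' -> 'A', 'male students' -> 'm'
df.columns = [re.sub(r'faculty ', '', col) for col in df.columns]
df.index = ['m', 'f']
# stack() returns a Series keyed by (gender, faculty) pairs
series = df.stack()
# Emit each (gender, faculty) key once per counted student
df = pd.DataFrame(
    (key for key, val in series.items() for _ in range(val)),
    columns=['gender', 'faculty'])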

yields

   gender faculty
0       m       A
1       m       A
2       m       B
3       m       B
4       m       B
5       m       B
6       m       B
7       m       B
8       m       B
9       f       A
10      f       A
11      f       A
12      f       A
13      f       B
14      f       B
15      f       B
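
If the counts get large, the Python-level loop in the generator can become slow. A possible vectorized variant of the same idea is sketched below. It assumes the labels are already shortened to 'm'/'f' and 'A'/'B', and the names `counts`, `n`, and `students` are only illustrative. It repeats each row label with `Index.repeat` instead of iterating:

```python
import pandas as pd

# Same aggregated data as above, with the labels already shortened.
df = pd.DataFrame({'A': [2, 4], 'B': [7, 3]}, index=['m', 'f'])

# Long format: one row per (gender, faculty) pair, with its count in column 'n'.
counts = df.stack().rename_axis(['gender', 'faculty']).reset_index(name='n')

# Repeat each row label n times, select those rows, and drop the count column.
students = (counts.loc[counts.index.repeat(counts['n']), ['gender', 'faculty']]
            .reset_index(drop=True))
```

This should produce the same 16 rows in the same order as the generator version. The cost in memory is the same, since every student still gets a row.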

PS. The above shows it is possible to "disaggregate" the data, but are you sure you want to do that? Disaggregation seems rather inefficient. If one of the values is a million, then you would end up with a million duplicate rows...

Instead of disaggregating, you might be better off finding a way to perform your computation on the aggregated data.
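
As a rough sketch of what working on the aggregated data could look like: keep one row per (gender, faculty) combination with a count column, and sum or divide the counts directly. The column name `n` and the "share of women per faculty" calculation are assumptions chosen for illustration, not something from the question:

```python
import pandas as pd

# Aggregated data in long format: one row per combination, count in 'n'.
counts = pd.DataFrame({
    'gender':  ['m', 'm', 'f', 'f'],
    'faculty': ['A', 'B', 'A', 'B'],
    'n':       [2, 7, 4, 3],
})

# Students per faculty, summed straight from the counts.
per_faculty = counts.groupby('faculty')['n'].sum()

# Share of women in each faculty, again without creating one row per student.
female = counts[counts['gender'] == 'f'].set_index('faculty')['n']
female_share = female / per_faculty
```

A table in this shape can also be joined to another aggregated table on the shared key columns with `DataFrame.merge`, which may cover the cross-referencing case without expanding the rows.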

Thank you, works for me. I'm not sure if dis-aggregation is the right step here, but it should in the end allow me to cross-reference with another 'aggregate' dataset. You are right about the 'huge' values of course.
