I'm working with aggregated data, which I need to dis-aggregate in order to process it further. The original df contains a value 'no. of students' per row and I need one row in the new df per student:

Original df:

                faculty A   faculty B   faculty x
male students           2           7       ...
female students         4           3       ...

New df:

 No.           gender  faculty   ...
 1             m       A
 2             m       A
 3             f       A

and so on. The original df contains some more information (like nationality and regional info), but that could be handled the same way as gender. Obviously I'd start by transposing (df.T), but then the fun begins... I'm quite the beginner, so any pointers would be very welcome.

1 Answer

I think the easiest way to "disaggregate" the data is to use a generator expression to simply enumerate all the desired rows:

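(key for key, val in series.items() for i in range(val))

import re
import pandas as pd

df = pd.DataFrame({'faculty A': [2,4], 'faculty B':[7,3]}, 
                  index=['male students', 'female students'])
# Shorten the labels: 'faculty A' -> 'A', 'male students' -> 'm'
df.columns = [re.sub(r'faculty ', '', col) for col in df.columns]
df.index = ['m', 'f']
# stack() gives a Series keyed by (gender, faculty) with the counts as values
series = df.stack()
# Emit each (gender, faculty) key once per student counted.
# Series.items() replaces iteritems(), which was removed in pandas 2.0.
df = pd.DataFrame(
    (key for key, val in series.items() for i in range(val)),
    columns=['gender','faculty'])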

yields

   gender faculty
0       m       A
1       m       A
2       m       B
3       m       B
4       m       B
5       m       B
6       m       B
7       m       B
8       m       B
9       f       A
10      f       A
11      f       A
12      f       A
13      f       B
14      f       B
15      f       B
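
If the Python-level loop becomes a bottleneck, the same expansion can probably be done with pandas' own repeat machinery. This is only a sketch of an alternative technique, not the method used above; the column name `count` and the variable names `long` and `students` are my own choices for illustration:

```python
import pandas as pd

# Same toy data as above, with the labels already shortened.
df = pd.DataFrame({'A': [2, 4], 'B': [7, 3]}, index=['m', 'f'])
df.index.name = 'gender'
df.columns.name = 'faculty'

# Long format: one row per (gender, faculty) pair plus its count.
# 'count' is an illustrative column name, not something pandas requires.
long = df.stack().rename('count').reset_index()

# Repeat each row label `count` times, select those rows,
# and keep only the descriptive columns.
students = (
    long.loc[long.index.repeat(long['count']), ['gender', 'faculty']]
        .reset_index(drop=True)
)
```

Extra attributes such as nationality would simply become additional columns in `long` and carry through the repeat unchanged.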

PS. The above shows it is possible to "disaggregate" the data, but are you sure you want to do that? Disaggregation seems rather inefficient. If one of the values is a million, then you would end up with a million duplicate rows...

Instead of disaggregating, you might be better off finding a way to perform your computation on the aggregated data.
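
As a rough illustration of what working on the aggregated data could look like, here is a sketch that keeps one row per (gender, faculty) pair and carries the count as a column. The `capacity` table and its `seats` column are invented purely for the example, as are the names `counts`, `per_faculty` and `female_share`:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 4], 'B': [7, 3]}, index=['m', 'f'])
df.index.name = 'gender'
df.columns.name = 'faculty'

# One row per (gender, faculty) pair; the count stays a column
# instead of being expanded into duplicate rows.
counts = df.stack().rename('n_students').reset_index()

# Totals per faculty, computed directly from the counts (A: 6, B: 10).
per_faculty = counts.groupby('faculty')['n_students'].sum()

# Share of female students per faculty.
female_share = df.loc['f'] / df.sum()

# Hypothetical second aggregate table, joined on the shared key.
capacity = pd.DataFrame({'faculty': ['A', 'B'], 'seats': [10, 12]})
merged = counts.merge(capacity, on='faculty', how='left')
```

Whether this is enough depends on the computation; anything that only needs sums, shares or joins on the grouping keys should not require one row per student.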

1 Comment

Thank you, works for me. I'm not sure if dis-aggregation is the right step here, but it should in the end allow me to cross-reference with another 'aggregate' dataset. You are right about the 'huge' values of course.
