I am working on a dataset which is in the following dataframe.

#print(old_df)
   col1 col2 col3
0   1   10  1.5
1   1   11  2.5
2   1   12  5,6
3   2   10  7.8
4   2   24  2.1
5   3   10  3.2
6   4   10  22.1
7   4   11  1.3
8   4   89  0.5
9   4   91  3.3

I am trying to generate another dataframe that uses selected col1 values as the index and selected col2 values as the columns, with each cell holding the corresponding col3 value.

For example:

selected_col1 = [1,2]
selected_col2 = [10,11,24]

The new dataframe should look like this:

#print(selected_df)
     10     11     24
1    1.5    2.5    NaN
2    7.8    NaN    2.1

I have tried the following method:

selected_col1 = [1,2]
selected_col2 = [10,11,24]
selected_df = pd.DataFrame(index=selected_col1, columns=selected_col2)
for col1_value in selected_col1:
    for col2_value in selected_col2:
        # One full query over old_df per (col1, col2) pair
        qry = 'col1 == {} & col2 == {}'.format(col1_value, col2_value)
        col3_value = old_df.query(qry).col3.values
        if len(col3_value) > 0:
            selected_df.at[col1_value, col2_value] = col3_value[0]

But because my dataframe has around 20 million rows, this brute-force method takes a long time. Is there a better way?

1 Answer

First filter the rows with Series.isin on both columns, combining the two masks with & (bitwise AND), and then use DataFrame.pivot:

df = df[df['col1'].isin(selected_col1) & df['col2'].isin(selected_col2)]

df = df.pivot(index='col1', columns='col2', values='col3')
print (df)
col2   10   11   24
col1               
1     1.5  2.5  NaN
2     7.8  NaN  2.1
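
For reference, here is a minimal end-to-end sketch of the filter-then-pivot approach on the sample data from the question. Two things in it are assumptions rather than part of the original posts: the "5,6" entry is written as 5.6, and a final reindex step is added so that every selected label appears in the result, in the order of the selection lists, even when no row matches it.

```python
import pandas as pd

# Sample data from the question ("5,6" is assumed to mean 5.6).
old_df = pd.DataFrame({
    "col1": [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    "col2": [10, 11, 12, 10, 24, 10, 10, 11, 89, 91],
    "col3": [1.5, 2.5, 5.6, 7.8, 2.1, 3.2, 22.1, 1.3, 0.5, 3.3],
})
selected_col1 = [1, 2]
selected_col2 = [10, 11, 24]

# Keep only rows whose col1 AND col2 values are both in the selections.
mask = old_df["col1"].isin(selected_col1) & old_df["col2"].isin(selected_col2)

selected_df = (
    old_df[mask]
    .pivot(index="col1", columns="col2", values="col3")
    # Assumption: reindex so labels with no matching rows still show up as NaN.
    .reindex(index=selected_col1, columns=selected_col2)
)
print(selected_df)
```

The filtering is a single vectorized pass over the data, which avoids running one query per (col1, col2) pair.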

If some col1/col2 pairs may be duplicated after filtering, use DataFrame.pivot_table instead:

df = df.pivot_table(index='col1',columns='col2',values='col3', aggfunc='mean')
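
To illustrate the difference, the sketch below uses a small made-up frame (dup_df is a hypothetical name, not data from the question) in which the pair (1, 10) occurs twice. DataFrame.pivot raises a ValueError on such duplicates, while pivot_table aggregates them, here with the mean.

```python
import pandas as pd

# Hypothetical data: the (col1=1, col2=10) pair appears twice.
dup_df = pd.DataFrame({
    "col1": [1, 1, 2],
    "col2": [10, 10, 10],
    "col3": [1.0, 3.0, 7.8],
})

# pivot cannot decide which of the duplicated values to keep, so it raises.
try:
    dup_df.pivot(index="col1", columns="col2", values="col3")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table collapses the duplicates with the chosen aggregation.
averaged = dup_df.pivot_table(
    index="col1", columns="col2", values="col3", aggfunc="mean"
)
print(pivot_failed)
print(averaged)
```

Whether the mean is the right aggregation depends on the data; 'first', 'max' or 'sum' may be a better fit in other cases.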

EDIT:

If you use | (bitwise OR) instead, you get a different output:

df = df[df['col1'].isin(selected_col1) | df['col2'].isin(selected_col2)]

df = df.pivot(index='col1', columns='col2', values='col3')
print (df)
col2    10   11   12   24
col1                     
1      1.5  2.5  5,6  NaN
2      7.8  NaN  NaN  2.1
3      3.2  NaN  NaN  NaN
4     22.1  1.3  NaN  NaN

8 Comments

I am getting the following error: ValueError: Unstacked DataFrame is too big, causing int32 overflow. By the way, I am using "|" instead of "&" when building the new dataframe.
@SatheeshK - Do you need | for OR rather than & for AND?
@SatheeshK - Unfortunately, the error means the data is very large. What are the lengths of the selected_col1 and selected_col2 lists?
I am using | for OR only. len(selected_col1) = 1894, len(selected_col2) = 8546
@SatheeshK - If I understand correctly, the filtering removes only a few rows, so the DataFrame is still very large, and that is the reason for the error. One more thing: you can try upgrading to the latest pandas version; it may help.
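
One way to see why | and & behave so differently here is to estimate the size of the pivoted result before reshaping. The sketch below is only an approximation of the condition behind the error (the exact check inside pandas may differ); pivot_cell_count is a hypothetical helper name, and the data is the sample from the question with "5,6" assumed to mean 5.6. With &, the result can have at most len(selected_col1) * len(selected_col2) cells, which for the lengths quoted above is 1894 * 8546 = 16,186,124. With |, the row and column labels are not limited to the selection lists, so the result can be far larger.

```python
import numpy as np
import pandas as pd

INT32_MAX = np.iinfo(np.int32).max  # 2147483647


def pivot_cell_count(df, index_col, columns_col):
    """Estimate the number of cells a pivot on these two columns would create."""
    return df[index_col].nunique() * df[columns_col].nunique()


# Sample data from the question ("5,6" is assumed to mean 5.6).
old_df = pd.DataFrame({
    "col1": [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    "col2": [10, 11, 12, 10, 24, 10, 10, 11, 89, 91],
    "col3": [1.5, 2.5, 5.6, 7.8, 2.1, 3.2, 22.1, 1.3, 0.5, 3.3],
})
selected_col1 = [1, 2]
selected_col2 = [10, 11, 24]

in_col1 = old_df["col1"].isin(selected_col1)
in_col2 = old_df["col2"].isin(selected_col2)

cells_and = pivot_cell_count(old_df[in_col1 & in_col2], "col1", "col2")
cells_or = pivot_cell_count(old_df[in_col1 | in_col2], "col1", "col2")

print(cells_and, cells_or)
print(cells_or > INT32_MAX)  # rough indicator that the reshape is too large
```

If the estimate for the | filter is far above the int32 range on the real data, switching to & or narrowing the selection lists is likely to matter more than the pandas version.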