Fill missing data and transform rows to column in Python Pandas

Question

I have a dataframe like this,

import numpy as np
import pandas as pd

df_nba = pd.DataFrame({'col1': ['name', np.nan,np.nan,'course','eca','pages',
                                 'name', np.nan,np.nan,'course','pages',
                                 'name', np.nan,np.nan,'course','eca','pages',
                                 'name', np.nan,np.nan,'course','eca','pages',
                                 'name', np.nan,np.nan,'course','pages',
                                 'name', np.nan,np.nan,'course','eca','pages',
                               ], 
                        'col2': ['jim', 'California','M','Biology','Biology Club',1,
                                 'jim', 'California','M','Physics',2,
                                 'greg', 'Arizona','M','Geography','Jazz Band',3,
                                 'greg', 'Arizona','M','Physics','Photography',4,
                                 'jesse', 'Washington','F','Economics',5,
                                 'jesse', 'Washington','F','Literature','Photography',6,
                     ]})

    col1    col2
0   name    jim
1   NaN     California
2   NaN     M
3   course  Biology
4   eca     Biology Club
5   pages   1
6   name    jim
7   NaN     California
8   NaN     M
9   course  Physics
10  pages   2
11  name    greg
12  NaN     Arizona
13  NaN     M
14  course  Geography
15  eca     Jazz Band
16  pages   3
17  name    greg
18  NaN     Arizona
19  NaN     M
20  course  Physics
21  eca     Photography
22  pages   4
23  name    jesse
24  NaN     Washington
25  NaN     F
26  course  Economics
27  pages   5
28  name    jesse
29  NaN     Washington
30  NaN     F
31  course  Literature
32  eca     Photography
33  pages   6

For each person, the two consecutive rows after the name row are always missing their label in col1. Can I first fill those labels in as states and gender, and then transpose the data into a column-wise view?

The expected output would look like this:

    name    states      gender  course      eca           pages
0   jim     California  M       Biology     Biology Club  1
1   jim     California  M       Physics     NaN           2
2   greg    Arizona     M       Geography   Jazz Band     3
3   greg    Arizona     M       Physics     Photography   4
4   jesse   Washington  F       Economics   NaN           5
5   jesse   Washington  F       Literature  Photography   6

Are there always two fields missing after each name? No other variations?
– Henry Yik, Commented Oct 30, 2020 at 15:22

Ben.T · Accepted Answer · 2020-10-30 15:25:04Z

You can build a boolean mask that is True where col1 equals "name", then shift it by one and by two rows to fill in the missing labels in col1. To reshape, call set_index with two levels (the cumsum of the mask, which increases by one at every "name" row, and col1 itself), then unstack.

# mask that is True on every row where col1 is "name"
mask = df_nba['col1'].eq('name')

# fill the two following NaN labels with the right values
df_nba.loc[mask.shift(1,fill_value=False), 'col1'] = 'states'
df_nba.loc[mask.shift(2,fill_value=False), 'col1'] = 'gender'

# reshape: one row per record, one column per label
df_ = (df_nba.set_index([mask.cumsum(),
                         df_nba['col1'].to_numpy()])
             ['col2'].unstack()
             .rename_axis(None) # cosmetic
             [['name','states','gender','course','eca','pages']] # reorder the columns
      )

print(df_)
    name      states gender      course           eca pages
1    jim  California      M     Biology  Biology Club     1
2    jim  California      M     Physics           NaN     2
3   greg     Arizona      M   Geography     Jazz Band     3
4   greg     Arizona      M     Physics   Photography     4
5  jesse  Washington      F   Economics           NaN     5
6  jesse  Washington      F  Literature   Photography     6

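Alternatively, you can do the reshape part with pivot: df_nba.assign(group=mask.cumsum()).pivot(index="group", columns="col1", values="col2"). The arguments are passed by keyword because pandas 2.0 and later no longer accept them positionally.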
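
To see why the cumulative sum works as a record key, here is a minimal self-contained sketch on a shortened two-record frame. The frame df and the names record_id and wide are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical two-record sample in the same layout as the question
df = pd.DataFrame({
    "col1": ["name", np.nan, np.nan, "course", "eca", "pages",
             "name", np.nan, np.nan, "course", "pages"],
    "col2": ["jim", "California", "M", "Biology", "Biology Club", 1,
             "jim", "California", "M", "Physics", 2],
})

# True on every row that starts a new record
mask = df["col1"].eq("name")

# Label the two rows that follow each "name" row
df.loc[mask.shift(1, fill_value=False), "col1"] = "states"
df.loc[mask.shift(2, fill_value=False), "col1"] = "gender"

# The running count of "name" rows gives every row its record number
record_id = mask.cumsum()
print(record_id.tolist())

# One row per record, one column per label (keyword arguments only)
wide = (df.assign(group=record_id)
          .pivot(index="group", columns="col1", values="col2"))
print(wide[["name", "states", "gender", "course", "eca", "pages"]])
```

Every row between two "name" rows shares the same record_id, so a record that has no eca row simply gets NaN in that column instead of shifting the other values.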

Hi @HenryYik and Ben. Thanks for the answers! Do you know how to avoid duplicates when using the pivot method? I re-ran both of your solutions and got ValueError: Index contains duplicate entries, cannot reshape. It seems that when the values are grouped, the index from pages is not taken into account. Ben's solution works well on this test dataset, but it gives the same duplicates error when reshaping the dataframe built from my file.

Thank you so much! I finally found out why it didn't work: some records in my file are missing the name row, which made the unstack function fail. Thanks again, the code is perfect!
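
The situation described in the comment above (a record without its name row) can be checked for before reshaping. The following is only a sketch on made-up data, with dupes and labelled as illustrative names: it lists the rows whose (record, label) pair occurs more than once, which is what makes unstack and pivot raise the duplicate-entries error.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second record has lost its "name" row
df = pd.DataFrame({
    "col1": ["name", np.nan, np.nan, "course", "pages",
             np.nan, np.nan, "course", "pages"],
    "col2": ["jim", "California", "M", "Biology", 1,
             "Arizona", "M", "Physics", 2],
})

mask = df["col1"].eq("name")
df.loc[mask.shift(1, fill_value=False), "col1"] = "states"
df.loc[mask.shift(2, fill_value=False), "col1"] = "gender"
record_id = mask.cumsum()

# Labelled rows that repeat within one record would break unstack/pivot
labelled = df.assign(group=record_id).dropna(subset=["col1"])
dupes = labelled[labelled.duplicated(["group", "col1"], keep=False)]
print(dupes)
```

Any rows printed here point at records that were merged into their predecessor because the cumulative sum did not advance.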

Rajan · Accepted Answer · 2020-10-30 15:43:48Z

This is not an efficient solution, but it does what you want, provided you supply col1 and col2 as lists.

# to fill missing values in col1
for i in range(1,len(col1)):
    if(col1[i-1] == "name"):
       col1[i] = "states"
    if(col1[i-1] == "states"):
       col1[i] = "gender"

# to create list of dictionaries for each record
data=[]
temp={}
for i in range(len(col1)):
    temp[col1[i]]=col2[i]
    if(col1[i]=="pages"):
        data.append(temp)
        temp={}

pd.DataFrame(data)

jlb_gouveia · Accepted Answer · 2020-10-30 15:45:19Z

You can do the following:

name_index = df_nba.loc[df_nba['col1']=='name'].index
for i in name_index:
    df_nba.loc[i+1:i+2, 'col1'] = ['states', 'gender']

Now to get the transposed table:

pivot = df_nba.pivot(columns = 'col1')
pivot_nba = pd.DataFrame()
for col in pivot['col2']:
    pivot_nba[col] = pivot['col2'][col].dropna().reset_index(drop = True)
pivot_nba

    course        eca               gender  name    pages   states
0   Biology       Biology Club      M       jim     1       California
1   Physics       Jazz Band         M       jim     2       California
2   Geography     Photography       M       greg    3       Arizona
3   Physics       Photography       M       greg    4       Arizona
4   Economics     NaN               F       jesse   5       Washington
5   Literature    NaN               F       jesse   6       Washington
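
One thing to watch in the output above: each column is compacted separately with dropna(), so a record without an eca row pulls the later eca values up, and "Jazz Band" ends up next to jim's Physics record. A possible sketch that keeps the rows aligned, assuming every record starts with a "name" row, groups the pivoted frame by a record number and takes the first non-null value per column. The names df, record_id and aligned are illustrative, and the data is a shortened copy of the question's.

```python
import numpy as np
import pandas as pd

# Shortened three-record sample; the second record has no "eca" row
df = pd.DataFrame({
    "col1": ["name", np.nan, np.nan, "course", "eca", "pages",
             "name", np.nan, np.nan, "course", "pages",
             "name", np.nan, np.nan, "course", "eca", "pages"],
    "col2": ["jim", "California", "M", "Biology", "Biology Club", 1,
             "jim", "California", "M", "Physics", 2,
             "greg", "Arizona", "M", "Geography", "Jazz Band", 3],
})

# Fill the two labels after each "name" row, as in the answer
name_index = df.index[df["col1"] == "name"]
for i in name_index:
    df.loc[i + 1:i + 2, "col1"] = ["states", "gender"]

# Record number: increases by one at every "name" row
record_id = df["col1"].eq("name").cumsum()

# Pivot as in the answer, then collapse each record to a single row
pivot = df.pivot(columns="col1")["col2"]
aligned = pivot.groupby(record_id).first()
print(aligned)
```

Here the missing eca stays in the row of the record it belongs to, which matches the layout the question asks for.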
