I have a dataframe like this,

import numpy as np
import pandas as pd

df_nba = pd.DataFrame({'col1': ['name', np.nan, np.nan, 'course', 'eca', 'pages',
                                 'name', np.nan, np.nan, 'course', 'pages',
                                 'name', np.nan, np.nan, 'course', 'eca', 'pages',
                                 'name', np.nan, np.nan, 'course', 'eca', 'pages',
                                 'name', np.nan, np.nan, 'course', 'pages',
                                 'name', np.nan, np.nan, 'course', 'eca', 'pages',
                               ],
                        'col2': ['jim', 'California', 'M', 'Biology', 'Biology Club', 1,
                                 'jim', 'California', 'M', 'Physics', 2,
                                 'greg', 'Arizona', 'M', 'Geography', 'Jazz Band', 3,
                                 'greg', 'Arizona', 'M', 'Physics', 'Photography', 4,
                                 'jesse', 'Washington', 'F', 'Economics', 5,
                                 'jesse', 'Washington', 'F', 'Literature', 'Photography', 6,
                               ]})

col1    col2
0   name    jim
1   NaN California
2   NaN M
3   course  Biology
4   eca Biology Club
5   pages   1
6   name    jim
7   NaN California
8   NaN M
9   course  Physics
10  pages   2
11  name    greg
12  NaN Arizona
13  NaN M
14  course  Geography
15  eca Jazz Band
16  pages   3
17  name    greg
18  NaN Arizona
19  NaN M
20  course  Physics
21  eca Photography
22  pages   4
23  name    jesse
24  NaN Washington
25  NaN F
26  course  Economics
27  pages   5
28  name    jesse
29  NaN Washington
30  NaN F
31  course  Literature
32  eca Photography
33  pages   6

For each person, the two rows that follow the name row always have a missing label in col1. Can I fill those labels with states and gender first, and then reshape the data into a column-wise view?

The expected output would look like this:

    name   states      gender  course      eca           pages
0   jim    California  M       Biology     Biology Club  1
1   jim    California  M       Physics     NaN           2
2   greg   Arizona     M       Geography   Jazz Band     3
3   greg   Arizona     M       Physics     Photography   4
4   jesse  Washington  F       Economics   NaN           5
5   jesse  Washington  F       Literature  Photography   6
1
  • Are there always exactly two fields missing after each name, with no other variations? Commented Oct 30, 2020 at 15:22

3 Answers

2

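You can build a boolean mask that is True where col1 equals "name", then shift it by one and two rows to locate the two missing labels and fill them in. To reshape, call set_index with two keys, the cumsum of the mask (a counter that increases by one at every "name" row, so it identifies each record) and col1 itself, and then unstack col1 into columns.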

# mask that is True where col1 holds 'name'
mask = df_nba['col1'].eq('name')

# fill the two NaN labels that follow each 'name' row with the right value
df_nba.loc[mask.shift(1, fill_value=False), 'col1'] = 'states'
df_nba.loc[mask.shift(2, fill_value=False), 'col1'] = 'gender'

# reshape: index by (record number, col1), then unstack col1 into columns
df_ = (df_nba.set_index([mask.cumsum(),
                         df_nba['col1'].to_numpy()])
             ['col2'].unstack()
             .rename_axis(None)  # cosmetic
             [['name', 'states', 'gender', 'course', 'eca', 'pages']]  # reorder the columns
      )

print(df_)
    name      states gender      course           eca pages
1    jim  California      M     Biology  Biology Club     1
2    jim  California      M     Physics           NaN     2
3   greg     Arizona      M   Geography     Jazz Band     3
4   greg     Arizona      M     Physics   Photography     4
5  jesse  Washington      F   Economics           NaN     5
6  jesse  Washington      F  Literature   Photography     6
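
To make the intermediate steps easier to follow, here is a minimal sketch on a shortened, two-record version of the data. The names df, after_1, after_2, record_id and steps are illustrative and not part of the answer; the sketch only prints the helper columns that the mask, the two shifts and the cumsum produce.

```python
import numpy as np
import pandas as pd

# Shortened sample (an assumption for illustration): two records,
# the second one has no 'eca' row.
df = pd.DataFrame({
    'col1': ['name', np.nan, np.nan, 'course', 'eca', 'pages',
             'name', np.nan, np.nan, 'course', 'pages'],
    'col2': ['jim', 'California', 'M', 'Biology', 'Biology Club', 1,
             'jim', 'California', 'M', 'Physics', 2],
})

mask = df['col1'].eq('name')                # True on every 'name' row
after_1 = mask.shift(1, fill_value=False)   # True one row below each 'name'
after_2 = mask.shift(2, fill_value=False)   # True two rows below each 'name'
record_id = mask.cumsum()                   # increases by one at every 'name' row

# Put the helper columns side by side to inspect them.
steps = pd.DataFrame({'col1': df['col1'], 'is_name': mask,
                      'after_1': after_1, 'after_2': after_2,
                      'record_id': record_id})
print(steps)
```

Because record_id is constant within a record, it can serve as the row key when col1 is unstacked, which is why a record without an eca row simply gets NaN in that column.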

3 Comments

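Alternatively, for the reshape step you can pivot with df_nba.assign(group=mask.cumsum()).pivot(index="group", columns="col1", values="col2"). The keyword form is used because recent pandas versions no longer accept these arguments positionally.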
Hi @HenryYik and Ben. Thanks for the answers! Do you know how to avoid duplicates when using the pivot method? I re-ran both of your solutions and got ValueError: Index contains duplicate entries, cannot reshape. It seems that when the values are grouped, the index from pages is not taken into account. Ben's solution works well on this test dataset, but it raises the same duplicates error on my file when reshaping the dataframe.
Thank you so much! I finally found why it didn't work: some records in my file are missing the name row, which made unstack fail. Thanks again, the code is perfect!
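
The duplicate-entries error discussed in these comments can be diagnosed before reshaping. The following is a rough sketch under an assumed, made-up sample in which the second record has lost its name row; the names df, keyed and dupes are hypothetical. It lists every row whose (record_id, col1) pair occurs more than once, since those are the rows that make unstack or pivot fail.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second record has no 'name' row,
# so its rows fall into the first record's group.
df = pd.DataFrame({
    'col1': ['name', np.nan, np.nan, 'course', 'pages',
             'course', 'pages'],
    'col2': ['jim', 'California', 'M', 'Biology', 1,
             'Physics', 2],
})

mask = df['col1'].eq('name')
df.loc[mask.shift(1, fill_value=False), 'col1'] = 'states'
df.loc[mask.shift(2, fill_value=False), 'col1'] = 'gender'

keyed = df.assign(record_id=mask.cumsum())

# Rows whose (record_id, col1) pair is not unique cannot be reshaped;
# keep=False marks every member of a duplicated pair, not only the repeats.
dupes = keyed[keyed.duplicated(['record_id', 'col1'], keep=False)]
print(dupes)
```

An empty dupes frame suggests the reshape should go through; otherwise the listed rows point at the records whose name row is missing.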
1

This is not an efficient solution, but it does what you want, provided you supply col1 and col2 as lists.

# col1 and col2 as plain lists (taken from the dataframe here)
col1 = df_nba['col1'].tolist()
col2 = df_nba['col2'].tolist()

# fill the missing labels in col1
for i in range(1, len(col1)):
    if col1[i-1] == "name":
        col1[i] = "states"
    if col1[i-1] == "states":
        col1[i] = "gender"

# build one dictionary per record; a record ends at its "pages" row
data = []
temp = {}
for i in range(len(col1)):
    temp[col1[i]] = col2[i]
    if col1[i] == "pages":
        data.append(temp)
        temp = {}

pd.DataFrame(data)

1

You can do the following:

name_index = df_nba.loc[df_nba['col1']=='name'].index
for i in name_index:
    df_nba.loc[i+1:i+2, 'col1'] = ['states', 'gender']

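Now to get the transposed table. Note that each column is compacted independently with dropna, so values in a column that is absent from some records (here eca) shift upward and no longer line up with their original record: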

pivot = df_nba.pivot(columns = 'col1')
pivot_nba = pd.DataFrame()
for col in pivot['col2']:
    pivot_nba[col] = pivot['col2'][col].dropna().reset_index(drop = True)
pivot_nba

    course        eca               gender  name    pages   states
0   Biology       Biology Club      M       jim     1       California
1   Physics       Jazz Band         M       jim     2       California
2   Geography     Photography       M       greg    3       Arizona
3   Physics       Photography       M       greg    4       Arizona
4   Economics     NaN               F       jesse   5       Washington
5   Literature    NaN               F       jesse   6       Washington
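
In the output above, the eca values have moved up (Jazz Band sits in jim's Physics row), because each column was compacted on its own. As a hedged alternative, not the answerer's method, the sketch below keeps the same filling loop but keys every row by a record counter before pivoting; the shortened sample and the names df, record_id and wide are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Shortened sample (an assumption): the first record has no 'eca' row.
df = pd.DataFrame({
    'col1': ['name', np.nan, np.nan, 'course', 'pages',
             'name', np.nan, np.nan, 'course', 'eca', 'pages'],
    'col2': ['jim', 'California', 'M', 'Physics', 2,
             'greg', 'Arizona', 'M', 'Geography', 'Jazz Band', 3],
})

# Label the two rows after each 'name' row, as in the loop above.
for i in df.index[df['col1'] == 'name']:
    df.loc[i + 1:i + 2, 'col1'] = ['states', 'gender']

# record_id increases by one at every 'name' row, so all rows of a
# record share one key and a missing 'eca' stays NaN in its own row.
record_id = df['col1'].eq('name').cumsum()
wide = (df.assign(record_id=record_id)
          .pivot(index='record_id', columns='col1', values='col2')
          .reset_index(drop=True))
print(wide)
```

This relies on every record starting with a name row; if some records lack one, their rows are merged into the previous record and the pivot raises a duplicate-entries error.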
