How can I pivot a dataframe? [closed]

Question

Closed. This question needs to be more focused. It is not currently accepting answers.

Want to improve this question? Guide the asker to update the question so it focuses on a single, specific problem. Narrowing the question will help others answer the question concisely. You may edit the question if you feel you can improve it yourself. If edited, the question will be reviewed and might be reopened.

Closed 8 months ago.

The community reviewed whether to reopen this question 4 months ago and left it closed:

Needs more focus Guide the asker to update the question so it focuses on a single, specific problem. Narrowing the question will help others answer the question concisely. You may edit the question if you feel you can improve it yourself. If edited, the question will be reviewed and might be reopened.

Improve this question

How do I pivot the pandas dataframe below such that the col values become columns, row values become the index, and mean of val0 becomes the values? (In some cases this is called transforming from long-format to wide-format.)

Consider a dataframe df with columns 'key', 'row', 'item', 'col', and random float values 'val0', 'val1'. I conspicuously named the columns and relevant column values to correspond with how I want to pivot them. (Setup code at bottom.)

     key   row   item   col  val0  val1
0   key0  row3  item1  col3  0.81  0.04
1   key1  row2  item1  col2  0.44  0.07
2   key1  row0  item1  col0  0.77  0.01
3   key0  row4  item0  col2  0.15  0.59
4   key1  row0  item2  col1  0.81  0.64
5   key1  row2  item2  col4  0.13  0.88
6   key2  row4  item1  col3  0.88  0.39
7   key1  row4  item1  col1  0.10  0.07
8   key1  row0  item2  col4  0.65  0.02
9   key1  row2  item0  col2  0.35  0.61
10  key2  row0  item2  col1  0.40  0.85
11  key2  row4  item1  col2  0.64  0.25
12  key0  row2  item2  col3  0.50  0.44
13  key0  row4  item1  col4  0.24  0.46
14  key1  row3  item2  col3  0.28  0.11
15  key0  row3  item1  col1  0.31  0.23
16  key0  row0  item2  col3  0.86  0.01
17  key0  row4  item0  col3  0.64  0.21
18  key2  row2  item2  col0  0.13  0.45
19  key0  row2  item0  col4  0.37  0.70

Subquestions

How to avoid getting ValueError: Index contains duplicate entries, cannot reshape?

How do I pivot df such that the col values become columns, row values become the index, and mean of val0 are the values?

col   col0   col1   col2   col3  col4
row
row0  0.77  0.605    NaN  0.860  0.65
row2  0.13    NaN  0.395  0.500  0.25
row3   NaN  0.310    NaN  0.545   NaN
row4   NaN  0.100  0.395  0.760  0.24

How do I pivot...

... so that missing values are 0?

col   col0   col1   col2   col3  col4
row
row0  0.77  0.605  0.000  0.860  0.65
row2  0.13  0.000  0.395  0.500  0.25
row3  0.00  0.310  0.000  0.545  0.00
row4  0.00  0.100  0.395  0.760  0.24

... to do an aggregate function other than mean, like sum?

col   col0  col1  col2  col3  col4
row
row0  0.77  1.21  0.00  0.86  0.65
row2  0.13  0.00  0.79  0.50  0.50
row3  0.00  0.31  0.00  1.09  0.00
row4  0.00  0.10  0.79  1.52  0.24

... to do more that one aggregation at a time?

       sum                          mean
col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
row
row0  0.77  1.21  0.00  0.86  0.65  0.77  0.605  0.000  0.860  0.65
row2  0.13  0.00  0.79  0.50  0.50  0.13  0.000  0.395  0.500  0.25
row3  0.00  0.31  0.00  1.09  0.00  0.00  0.310  0.000  0.545  0.00
row4  0.00  0.10  0.79  1.52  0.24  0.00  0.100  0.395  0.760  0.24

... to aggregate over multiple 'value' columns?

      val0                             val1
col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
row
row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46

... to subdivide by multiple columns? (item0,item1,item2..., col0,col1,col2...)

item item0             item1                         item2
col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
row
row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00

... to subdivide by multiple rows: (key0,key1... row0,row1,row2...)

item      item0             item1                         item2
col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
key  row
key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
     row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
     row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
     row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
     row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
     row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
     row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
     row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
     row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00

... to aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

col   col0  col1  col2  col3  col4
row
row0     1     2     0     1     1
row2     1     0     2     1     2
row3     0     1     0     2     0
row4     0     1     2     2     1

... to convert a DataFrame from long-to-wide by pivoting on ONLY two columns? Given:

np.random.seed([3, 1415])
df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)})
df2
   A   B
0  a   0
1  a  11
2  a   2
3  a  11
4  b  10
5  b  10
6  b  14
7  c   7

The expected should look something like

      a     b    c
0   0.0  10.0  7.0
1  11.0  10.0  NaN
2   2.0  14.0  NaN
3  11.0   NaN  NaN

How do I flatten the multi-index to single index after pivot?

From:

To:

   1|1  2|1  2|2
a    2    1    1
b    2    1    0
c    1    0    0

Setup

import numpy as np
import pandas as pd
from numpy.core.defchararray import add

np.random.seed([3,1415])
n = 20

cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)

df = pd.DataFrame(
    add(cols, arr1), columns=cols
).join(
    pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)

Why is this question not a duplicate? and more useful than the following autosuggestions:

How to pivot a dataframe in Pandas? only covers the specific case of 'Country' to row-index, values of 'Indicator' for 'Year' to multiple columns and no aggregation of values.
When pivoting a Pandas dataframe, how do I make the column names the same as in R? (flat, label in column names) asks how to pivot in pandas like in R, i.e. autogenerate an individual column for each value of strength...
pandas pivoting a dataframe, duplicate rows asks about the syntax for pivoting multiple columns, without needing to list them all.

None of the existing questions and answers are comprehensive, so this is an attempt at a canonical question and answer that encompasses all aspects of pivoting.

Very helpful question! A small suggestion: would it not be more suitable to split these question up into several posts? I had a problem similar to question 8, but did not find it here after a short a glance. Only after I created a (now marked as duplicate) question I was redirected here again and found the solution I needed. — Viktor
– Viktor, Commented Oct 24, 2022 at 12:01
IMHO, this is too broad to be a good canonical question, and it should be broken up. I'm not a Pandas expert, but my intuition is that questions 2-6 should be kept here, while questions 1, 7-8, 9, 10, and 11 should all be separate. But by all means use the same example data and link them to each other. I'm open to discussing this on Meta. — wjandrea
– wjandrea, Commented Nov 30, 2022 at 3:04
@wjandrea the question shouldn't be preceded by lengthy meta commentary on the need for having such a canonical. Ideally, all of this would happen on Meta, but you can't get SMEs to congregate there and have a discussion; plus the format is not suited to that kind of discussion. We really need some kind of environment where people can collaborate on a Markdown document in real time while also chatting. — Karl Knechtel
– Karl Knechtel, Commented Jan 2, 2023 at 17:31
@SeaBean The problem is that "all aspects" is too broad. An overview would be fine, like Qs 2-6 as I mentioned before, but as it stands it's too broad to be generally useful as a duplicate target. But also, there's no problem with it being closed - I mean, it's still available to read and it's not like it's going to be deleted. — wjandrea
– wjandrea, Commented Jul 8 at 12:22
@wjandrea An example would be your comment over a year ago on the selected answer mentioning that "pivot_table() and crosstab() can take string function names now...". If we allow new answers to include such enhanced functionalities, it would be good to improve the quality of the overall total answers in this post. Also to enhance this post to make it more up-to-date. — SeaBean
– SeaBean, Commented Jul 8 at 17:48

congusbongus · Accepted Answer · 2023-07-19 03:58:33Z

Here is a list of idioms we can use to pivot

pd.DataFrame.pivot_table
- A glorified version of groupby with more intuitive API. For many people, this is the preferred approach. And it is the intended approach by the developers.
- Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
pd.DataFrame.groupby + pd.DataFrame.unstack
- Good general approach for doing just about any type of pivot
- You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you unstack the levels that you want to be in the column index.
pd.DataFrame.set_index + pd.DataFrame.unstack
- Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
- Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
pd.DataFrame.pivot
- Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values for index, columns, values.
- Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
pd.crosstab
- This a specialized version of pivot_table and in its purest form is the most intuitive way to perform several tasks.
pd.factorize + np.bincount
- This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
pd.get_dummies + pd.DataFrame.dot
- I use this for cleverly performing cross tabulation.

Question 1

Why do I get ValueError: Index contains duplicate entries, cannot reshape

This occurs because pandas is attempting to reindex either a columns or index object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys on which it is being asked to pivot. For example: Consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

df.duplicated(['row', 'col']).any()

True

So when I pivot using

df.pivot(index='row', columns='col', values='val0')

I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:

df.set_index(['row', 'col'])['val0'].unstack()

Examples

What I'm going to do for each subsequent question is to answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.

Questions 2 and 3

How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?

pd.DataFrame.pivot_table

df.pivot_table(
    values='val0', index='row', columns='col',
    aggfunc='mean')

col   col0   col1   col2   col3  col4
row                                  
row0  0.77  0.605    NaN  0.860  0.65
row2  0.13    NaN  0.395  0.500  0.25
row3   NaN  0.310    NaN  0.545   NaN
row4   NaN  0.100  0.395  0.760  0.24

aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.

How do I make it so that missing values are 0?

pd.DataFrame.pivot_table

fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0.

df.pivot_table(
    values='val0', index='row', columns='col',
    fill_value=0, aggfunc='mean')

col   col0   col1   col2   col3  col4
row
row0  0.77  0.605  0.000  0.860  0.65
row2  0.13  0.000  0.395  0.500  0.25
row3  0.00  0.310  0.000  0.545  0.00
row4  0.00  0.100  0.395  0.760  0.24

pd.DataFrame.groupby

df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)

pd.crosstab

pd.crosstab(
    index=df['row'], columns=df['col'],
    values=df['val0'], aggfunc='mean').fillna(0)

Question 4

Can I get something other than mean, like maybe sum?

pd.DataFrame.pivot_table

df.pivot_table(
    values='val0', index='row', columns='col',
    fill_value=0, aggfunc='sum')

col   col0  col1  col2  col3  col4
row
row0  0.77  1.21  0.00  0.86  0.65
row2  0.13  0.00  0.79  0.50  0.50
row3  0.00  0.31  0.00  1.09  0.00
row4  0.00  0.10  0.79  1.52  0.24

pd.DataFrame.groupby

df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)

pd.crosstab

pd.crosstab(
    index=df['row'], columns=df['col'],
    values=df['val0'], aggfunc='sum').fillna(0)

Question 5

Can I do more that one aggregation at a time?

Notice that for pivot_table and crosstab I needed to pass list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.

pd.DataFrame.pivot_table

df.pivot_table(
    values='val0', index='row', columns='col',
    fill_value=0, aggfunc=[np.size, np.mean])

     size                      mean
col  col0 col1 col2 col3 col4  col0   col1   col2   col3  col4
row
row0    1    2    0    1    1  0.77  0.605  0.000  0.860  0.65
row2    1    0    2    1    2  0.13  0.000  0.395  0.500  0.25
row3    0    1    0    2    0  0.00  0.310  0.000  0.545  0.00
row4    0    1    2    2    1  0.00  0.100  0.395  0.760  0.24

pd.DataFrame.groupby

df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)

pd.crosstab

pd.crosstab(
    index=df['row'], columns=df['col'],
    values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')

Question 6

Can I aggregate over multiple value columns?

pd.DataFrame.pivot_table we pass values=['val0', 'val1'] but we could've left that off completely

df.pivot_table(
    values=['val0', 'val1'], index='row', columns='col',
    fill_value=0, aggfunc='mean')

      val0                             val1
col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
row
row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46

pd.DataFrame.groupby

df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)

Question 7

Can I subdivide by multiple columns?

pd.DataFrame.pivot_table

df.pivot_table(
    values='val0', index='row', columns=['item', 'col'],
    fill_value=0, aggfunc='mean')

item item0             item1                         item2
col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
row
row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00

pd.DataFrame.groupby

df.groupby(
    ['row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)

Question 8

Can I subdivide by multiple columns?

pd.DataFrame.pivot_table

df.pivot_table(
    values='val0', index=['key', 'row'], columns=['item', 'col'],
    fill_value=0, aggfunc='mean')

item      item0             item1                         item2
col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
key  row
key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
     row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
     row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
     row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
     row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
     row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
     row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
     row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
     row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00

pd.DataFrame.groupby

df.groupby(
    ['key', 'row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)

pd.DataFrame.set_index because the set of keys are unique for both rows and columns

df.set_index(
    ['key', 'row', 'item', 'col']
)['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)

Question 9

Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

pd.DataFrame.pivot_table

df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')

col   col0  col1  col2  col3  col4
row
row0     1     2     0     1     1
row2     1     0     2     1     2
row3     0     1     0     2     0
row4     0     1     2     2     1

pd.DataFrame.groupby

df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)

pd.crosstab
```
pd.crosstab(df['row'], df['col'])
```

pd.factorize + np.bincount

# get integer factorization `i` and unique values `r`
# for column `'row'`
i, r = pd.factorize(df['row'].values)
# get integer factorization `j` and unique values `c`
# for column `'col'`
j, c = pd.factorize(df['col'].values)
# `n` will be the number of rows
# `m` will be the number of columns
n, m = r.size, c.size
# `i * m + j` is a clever way of counting the
# factorization bins assuming a flat array of length
# `n * m`.  Which is why we subsequently reshape as `(n, m)`
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
# BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
pd.DataFrame(b, r, c)

      col3  col2  col0  col1  col4
row3     2     0     0     1     0
row2     1     2     1     0     2
row0     1     0     1     2     1
row4     2     2     0     1     1

pd.get_dummies

pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))

      col0  col1  col2  col3  col4
row0     1     2     0     1     1
row2     1     0     2     1     2
row3     0     1     0     2     0
row4     0     1     2     2     1

Question 10

How do I convert a DataFrame from long to wide by pivoting on ONLY two columns?

DataFrame.pivot

The first step is to assign a number to each row - this number will be the row index of that value in the pivoted result. This is done using GroupBy.cumcount:

df2.insert(0, 'count', df2.groupby('A').cumcount())
df2

   count  A   B
0      0  a   0
1      1  a  11
2      2  a   2
3      3  a  11
4      0  b  10
5      1  b  10
6      2  b  14
7      0  c   7

The second step is to use the newly created column as the index to call DataFrame.pivot.

df2.pivot(*df2)
# df2.pivot(index='count', columns='A', values='B')

A         a     b    c
count
0       0.0  10.0  7.0
1      11.0  10.0  NaN
2       2.0  14.0  NaN
3      11.0   NaN  NaN

DataFrame.pivot_table

Whereas DataFrame.pivot only accepts columns, DataFrame.pivot_table also accepts arrays, so the GroupBy.cumcount can be passed directly as the index without creating an explicit column.
```
df2.pivot_table(index=df2.groupby('A').cumcount(), columns='A', values='B')

A         a     b    c
0       0.0  10.0  7.0
1      11.0  10.0  NaN
2       2.0  14.0  NaN
3      11.0   NaN  NaN
```

Question 11

How do I flatten the multiple index to single index after pivot

If columns type object with string join

df.columns = df.columns.map('|'.join)

else format

df.columns = df.columns.map('{0[0]}|{0[1]}'.format)

pivot_table() and crosstab() can take string function names now, though I'm not sure when it changed since it's not documented very clearly. I'm using Pandas 1.4.4.

Ch3steR · Accepted Answer · 2020-06-05 20:59:49Z

To extend @piRSquared's answer another version of Question 10

Question 10.1

DataFrame:

d = data = {'A': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 5},
 'B': {0: 'a', 1: 'b', 2: 'c', 3: 'a', 4: 'b', 5: 'a', 6: 'c'}}
df = pd.DataFrame(d)

   A  B
0  1  a
1  1  b
2  1  c
3  2  a
4  2  b
5  3  a
6  5  c

Output:

   0     1     2
A
1  a     b     c
2  a     b  None
3  a  None  None
5  c  None  None

Using df.groupby and pd.Series.tolist

t = df.groupby('A')['B'].apply(list)
out = pd.DataFrame(t.tolist(),index=t.index)
out
   0     1     2
A
1  a     b     c
2  a     b  None
3  a  None  None
5  c  None  None

Or A much better alternative using pd.pivot_table with df.squeeze.

t = df.pivot_table(index='A',values='B',aggfunc=list).squeeze()
out = pd.DataFrame(t.tolist(),index=t.index)

Mykola Zotko · Accepted Answer · 2022-12-01 13:04:44Z

19

To better understand how the function pivot works you can look at the example from Pandas documentation. However pivot will fail if you have repeating index-columns (foo-bar) combinations (like df in the second example):

pivot

In opposite to pivot the function pivot_table supports data aggregation using the mean function by default. Here is an example with the sum aggregation function:

edited Dec 1, 2022 at 13:04

answered Feb 17, 2021 at 7:42

Mykola Zotko

18.1k6 gold badges87 silver badges90 bronze badges

Comments

cottontail · Accepted Answer · 2023-05-05 22:15:03Z

Call `reset_index()` (along with `add_suffix()`)

Oftentimes, reset_index() is needed after you call pivot_table or pivot. For example, to make the following transformation (where one column become column names)

you use the following code, where after pivot, you add prefix to the newly created column names and convert the index (in this case "movies") back into a column and remove the name of the axis name:

df.pivot(index='movie', columns='week', values='sales').add_prefix('week_').reset_index().rename_axis(columns=None)

As the other answers mentioned, "pivot" may refer to 2 different operations:

Unstacked aggregation (i.e. make the results of groupby.agg wider.)
Reshaping (similar to pivot in Excel, reshape in numpy or pivot_wider in R)

1. Aggregation

pivot_table or crosstab are simply unstacked results of groupby.agg operation. In fact, the source code shows that, under the hood, the following are true:

pivot_table = groupby + unstack (read here for more info.)
crosstab = pivot_table

N.B. You can use list of column names as index, columns and values arguments.

df.groupby(rows+cols)[vals].agg(aggfuncs).unstack(cols)
# equivalently,
df.pivot_table(vals, rows, cols, aggfuncs)

1.1. `crosstab` is a special case of `pivot_table`; thus of `groupby` + `unstack`

The following are equivalent:

pd.crosstab(df['colA'], df['colB'])
df.pivot_table(index='colA', columns='colB', aggfunc='size', fill_value=0)
df.groupby(['colA', 'colB']).size().unstack(fill_value=0)

Note that pd.crosstab has a significantly larger overhead, so it's significantly slower than both pivot_table and groupby + unstack. In fact, as noted here, pivot_table is slower than groupby + unstack as well.

2. Reshaping

pivot is a more limited version of pivot_table where its purpose is to reshape a long dataframe into a long one.

df.set_index(rows+cols)[vals].unstack(cols)
# equivalently, 
df.pivot(index=rows, columns=cols, values=vals)

2.1. Augment rows/columns as in Question 10

You can also apply the insight from Question 10 to multi-column pivot operation as well. There are two cases:

"long-to-long": reshape by augmenting the indices

Code:

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2], 'B': [*'xxyyzz'], 
                   'C': [*'CCDCDD'], 'E': [100, 200, 300, 400, 500, 600]})
rows, cols, vals = ['A', 'B'], ['C'], 'E'

# using pivot syntax
df1 = (
    df.assign(ix=df.groupby(rows+cols).cumcount())
    .pivot(index=[*rows, 'ix'], columns=cols, values=vals)
    .fillna(0, downcast='infer')
    .droplevel(-1).reset_index().rename_axis(columns=None)
)

# equivalently, using set_index + unstack syntax
df1 = (
    df
    .set_index([*rows, df.groupby(rows+cols).cumcount(), *cols])[vals]
    .unstack(fill_value=0)
    .droplevel(-1).reset_index().rename_axis(columns=None)
)

"long-to-wide": reshape by augmenting the columns

Code:

df1 = (
    df.assign(ix=df.groupby(rows+cols).cumcount())
    .pivot(index=rows, columns=[*cols, 'ix'])[vals]
    .fillna(0, downcast='infer')
)
df1 = df1.set_axis([f"{c[0]}_{c[1]}" for c in df1], axis=1).reset_index()

# equivalently, using the set_index + unstack syntax
df1 = (
    df
    .set_index([*rows, df.groupby(rows+cols).cumcount(), *cols])[vals]
    .unstack([-1, *range(-2, -len(cols)-2, -1)], fill_value=0)
)
df1 = df1.set_axis([f"{c[0]}_{c[1]}" for c in df1], axis=1).reset_index()

minimum case using the set_index + unstack syntax:

Code:

df1 = df.set_index(['A', df.groupby('A').cumcount()])['E'].unstack(fill_value=0).add_prefix('Col').reset_index()

^{¹ pivot_table() aggregates the values and unstacks it. Specifically, it creates a single flat list out of index and columns, calls groupby() with this list as the grouper and aggregates using the passed aggregator methods (the default is mean). Then after aggregation, it calls unstack() by the list of columns. So internally, pivot_table = groupby + unstack. Moreover, if fill_value is passed, fillna() is called.
In other words, the method that produces pv_1 is the same as the method that produces gb_1 in the example below.
pv_1 = df.pivot_table(index=rows, columns=cols, values=vals, aggfunc=aggfuncs, fill_value=0)
# internal operation of `pivot_table()`
gb_1 = df.groupby(rows+cols)[vals].agg(aggfuncs).unstack(cols).fillna(0, downcast="infer")
pv_1.equals(gb_1) # True

² crosstab() calls pivot_table(), i.e., crosstab = pivot_table. Specifically, it builds a DataFrame out of the passed arrays of values, filters it by the common indices and calls pivot_table(). It's more limited than pivot_table() because it only allows a one-dimensional array-like as values, unlike pivot_table() that can have multiple columns as values.}

wjandrea · Accepted Answer · 2025-07-02 14:44:20Z

2

The pivot function in pandas has the same functionality as the pivot operation in Excel. We can transform a dataset from a long format to a wide format.

Lets have a example

We want to convert the dataset into a form such that each country becomes a column and the new confirmed cases as values corresponding to the countries. We can perform this data manipulation using the pivot function.

Pivot the dataset

pivot_df = pd.pivot(df, index='Date', columns='Country', values='NewConfirmed')
# renaming the columns
pivot_df.columns = df['Country'].sort_values().unique()

We can bring the new columns to the same level as the index column Data by resetting the index.

# reset the index to modify the column levels
pivot_df = pivot_df.reset_index()

edited Jul 2 at 14:44

wjandrea

33.8k10 gold badges69 silver badges105 bronze badges

answered Oct 25, 2022 at 7:27

Huzefa Khan

1531 silver badge5 bronze badges

4 Comments

prashanth manohar Over a year ago

I get index Date contains duplicate values error. How to handle that?

wjandrea Jul 2 at 14:40

Please don't post pictures of data (the example input and output). Instead, you can use print(df) and put it in code formatting or print(df.to_markdown()) which gives you table formatting.

wjandrea Jul 2 at 15:29

Why are you doing pivot_df.columns = df['Country'].sort_values().unique()? That happens automatically. Effectively all that's doing is removing the axis name, i.e. pivot_df.columns.name = None.

wjandrea Jul 2 at 15:31

Also, why are you using pd.pivot(df, ...) instead of the shorter df.pivot(...)?

Collectives™ on Stack Overflow

How can I pivot a dataframe? [closed]

Subquestions

Setup

5 Answers 5

Question 1

Examples

Questions 2 and 3

Question 4

Question 5

Question 6

Question 7

Question 8

Question 9

Question 10

Question 11

1 Comment

Question 10.1

Comments

Comments

Call `reset_index()` (along with `add_suffix()`)

1. Aggregation

1.1. `crosstab` is a special case of `pivot_table`; thus of `groupby` + `unstack`

2. Reshaping

2.1. Augment rows/columns as in Question 10

Comments

Pivot the dataset

4 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

Subquestions

Setup

5 Answers 5

Question 1

Examples

Questions 2 and 3

Question 4

Question 5

Question 6

Question 7

Question 8

Question 9

Question 10

Question 11

1 Comment

Question 10.1

Comments

Comments

Call reset_index() (along with add_suffix())

1. Aggregation

1.1. crosstab is a special case of pivot_table; thus of groupby + unstack

2. Reshaping

2.1. Augment rows/columns as in Question 10

Comments

Pivot the dataset

4 Comments

Linked

Related

Call `reset_index()` (along with `add_suffix()`)

1.1. `crosstab` is a special case of `pivot_table`; thus of `groupby` + `unstack`