I want to generate a dataframe that is made up of separate dataframes generated in a for loop. Each individual dataframe consists of a name column, a range of integers, and a column identifying the category to which each integer belongs (e.g. quintile 1 to 5). If I generate each dataframe individually and then append one to the other to create a 'master' dataframe, there are no problems. However, when I use a loop to create each individual dataframe (as I will need to do in my real-life situation), trying to append a dataframe to the master dataframe results in:

ValueError: incompatible categories in categorical concat

I've written a simplified loop to illustrate:

import numpy as np
import pandas as pd

# Define column names
colNames = ('a','b','c')

# Define a dataframe with the required column names
masterDF = pd.DataFrame(columns = colNames)

# A list of the group names
names = ['Group1','Group2','Group3']

# Create a dataframe for each group
for i in names:
    tempDF = pd.DataFrame(columns = colNames)
    tempDF['a'] = np.arange(1,11,1)
    tempDF['b'] = i
    tempDF['c'] = pd.cut(np.arange(1,11,1),
                        bins = np.linspace(0,10,6),
                        labels = [1,2,3,4,5])
    print(tempDF)
    print('\n')

    # Try to append temporary DF to master DF
    masterDF = masterDF.append(tempDF,ignore_index=True)

print(masterDF)

I would expect a dataframe that looked like:

     a       b  c
 0   1  Group1  1
 1   2  Group1  1
 2   3  Group1  2
 3   4  Group1  2
 4   5  Group1  3
 5   6  Group1  3
 6   7  Group1  4
 7   8  Group1  4
 8   9  Group1  5
 9  10  Group1  5
10   1  Group2  1
11   2  Group2  1
12   3  Group2  2
13   4  Group2  2
...
28   9  Group3  5
29  10  Group3  5

It seems that a partial solution can be obtained by typecasting the categories as they are added to the tempDF as follows:

tempDF['c'] = pd.cut(np.arange(1,11,1),
                     bins = np.linspace(0,10,6),
                     labels = [1,2,3,4,5]).astype('int')

However, in this case the categories (column 'c') are displayed as 1.0, 2.0, etc. rather than 1, 2, etc., so this is not ideal.
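
One possible way to get plain integer codes without a cast, offered as a sketch rather than a confirmed fix for the error above: `pd.cut` accepts `labels=False`, which returns zero-based integer bin indices instead of a categorical. Adding 1 gives the 1 to 5 quintile codes. The variable names `values` and `quintile` are only illustrative.

```python
import numpy as np
import pandas as pd

values = np.arange(1, 11, 1)

# labels=False makes pd.cut return zero-based integer bin indices
# instead of a Categorical, so column 'c' is a plain integer column.
quintile = pd.cut(values, bins=np.linspace(0, 10, 6), labels=False) + 1

tempDF = pd.DataFrame({'a': values, 'b': 'Group1', 'c': quintile})
print(tempDF.dtypes)
```

The trade-off is that column 'c' is no longer a categorical, so any ordering or category metadata would have to be re-attached later if it is needed.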

Can anyone please explain why this happens and suggest a more satisfactory solution?

1 Answer

You can first append all the DataFrames to a list, dfs, and then combine them with pd.concat:

dfs = []
# Create a dataframe for each group
for i in names:
    tempDF = pd.DataFrame(columns = colNames)
    tempDF['a'] = np.arange(1,11,1)
    tempDF['b'] = i
    tempDF['c'] = pd.cut(np.arange(1,11,1),
                        bins = np.linspace(0,10,6),
                        labels = [1,2,3,4,5])
    print(tempDF)
    print('\n')

    # Collect the temporary DF in the list instead of appending to a master DF
    dfs.append(tempDF)

masterDF = pd.concat(dfs, ignore_index=True)
print(masterDF)

     a       b  c
0    1  Group1  1
1    2  Group1  1
2    3  Group1  2
3    4  Group1  2
4    5  Group1  3
5    6  Group1  3
6    7  Group1  4
7    8  Group1  4
8    9  Group1  5
9   10  Group1  5
10   1  Group2  1
11   2  Group2  1
12   3  Group2  2
13   4  Group2  2
14   5  Group2  3
15   6  Group2  3
16   7  Group2  4
17   8  Group2  4
18   9  Group2  5
19  10  Group2  5
20   1  Group3  1
21   2  Group3  1
22   3  Group3  2
23   4  Group3  2
24   5  Group3  3
25   6  Group3  3
26   7  Group3  4
27   8  Group3  4
28   9  Group3  5
29  10  Group3  5
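
The same collect-then-concat pattern can be written more compactly, as a minimal sketch. The helper `make_group` is hypothetical and simply rebuilds one group's frame the way the loop above does. `pd.concat` is also the route that remains available in pandas 2.0 and later, where `DataFrame.append` was removed.

```python
import numpy as np
import pandas as pd

names = ['Group1', 'Group2', 'Group3']

def make_group(name):
    # Hypothetical helper: one group's frame, built like the loop body above.
    values = np.arange(1, 11, 1)
    return pd.DataFrame({
        'a': values,
        'b': name,
        'c': pd.cut(values, bins=np.linspace(0, 10, 6), labels=[1, 2, 3, 4, 5]),
    })

# Build every frame first, then concatenate once.
masterDF = pd.concat([make_group(n) for n in names], ignore_index=True)

# Every frame uses identical categories, so 'c' should remain categorical.
print(masterDF['c'].dtype)
```

Concatenating once at the end also avoids copying the growing master frame on every iteration.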

4 Comments

Thanks for the lightning-fast response! That's a great solution. My only concern is the size of the list of dataframes if the individual dataframes are large. How are multiple dataframes stored in a list? In my real-life situation, the dataframes contain 40K+ rows of data. Is this likely to cause any performance problems?
It depends on the type of data... But loops in pandas are generally slow, so it may be better to find a solution that uses pandas functions.
I've tested this with dataframes containing up to 40k rows and it works perfectly. Thanks for the solution. I've marked this as the answer. Did my original attempt fail because of a misunderstanding on my part, or is it a bug?
I think it is a bug. I think the concat function is used more widely, so it has fewer bugs.
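
To illustrate the "pandas functions instead of loops" suggestion from the comments, here is a rough sketch that builds the simplified example without a Python loop. It assumes every group holds the same range of integers, as in the question's toy data; that may well not hold for the real 40K-row frames.

```python
import numpy as np
import pandas as pd

names = ['Group1', 'Group2', 'Group3']
values = np.arange(1, 11, 1)

# Repeat the value range once per group, and each group name once per value.
masterDF = pd.DataFrame({
    'a': np.tile(values, len(names)),
    'b': np.repeat(names, len(values)),
})

# Bin the whole column in a single call instead of once per group.
masterDF['c'] = pd.cut(masterDF['a'],
                       bins=np.linspace(0, 10, 6),
                       labels=[1, 2, 3, 4, 5])
print(masterDF)
```

Because `pd.cut` runs once over the full column, there is only one categorical and no concatenation of categoricals at all.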