
I have the following problem. I have two dataframes: one filled with 0's, whose column names are the attributes (which I know from a separate text file), and one whose values are those attribute names, with NaN's padding each row. Now I want to set 1's in the dataframe of 0's wherever a row of the second dataframe contains the attribute.

[Screenshots: the second data frame (rows of attribute codes padded with NaN's), the first data frame (all zeros, one column per attribute), and the desired result (the first data frame with 1's set where a row's codes match the columns).]
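For illustration, a miniature (hypothetical) version of the setup might look like this:

import numpy as np
import pandas as pd

# second dataframe: each row lists the attributes that apply, padded with NaN's
df = pd.DataFrame([['fl', 'nc', np.nan],
                   ['hi', np.nan, np.nan],
                   ['ct', 'fl', 'hi']])

# first dataframe: all zeros, one column per known attribute
codes = ['ct', 'fl', 'hi', 'nc']
Final_Df = pd.DataFrame(0, index=df.index, columns=codes)

# desired result after filling in the 1's:
#    ct  fl  hi  nc
# 0   0   1   0   1
# 1   0   0   1   0
# 2   1   1   1   0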

for index, row in df.iterrows():
    for element in row:
        if pd.isnull(element):
            break
        # row `index` holds the value `element`, so set the column
        # named `element` from 0 to 1 in that row
        Final_Df.at[index, element] = 1

This is the code I am using to achieve that: df is the second dataframe with the NaN values, and Final_Df is the first dataframe with the 0's. Is there a way to achieve this faster, without iterrows, since the dataset is large? Any help will be appreciated, and sorry if the question is badly worded. Thanks in advance!

4 Answers


The idea is to create a dictionary for each row in a list comprehension, pass the result to the DataFrame constructor, replace missing values with 0, and finally use DataFrame.reindex to remove the NaN column, fix the column order, and add columns for codes that never occur, filled with 0:

import numpy as np
import pandas as pd

codes = ['ca', 'ct', 'dc', 'fl', 'hi', 'il', 'ky', 'la', 'md', 'mi', 'ms', 'nc', 'pr']

Final_Df = (pd.DataFrame([dict.fromkeys(x, 1) for x in df.to_numpy()])
              .fillna(0)
              .astype(np.int8)
              .reindex(codes, axis=1, fill_value=0))
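For intuition, dict.fromkeys(x, 1) maps every value in a row (including the NaN padding) to 1; the NaN key then becomes a NaN-named column, which the final reindex drops. A minimal sketch on one hypothetical row:

import numpy as np

row = np.array(['fl', 'nc', np.nan], dtype=object)  # hypothetical row from df.to_numpy()
print(dict.fromkeys(row, 1))
# {'fl': 1, 'nc': 1, nan: 1}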

Another solution:

Use get_dummies, take the max per group of duplicate columns (so the values are always 0 or 1), and then use DataFrame.reindex to fix the column order and add columns for codes that never occur, filled with 0:

codes = ['ca', 'ct', 'dc', 'fl', 'hi', 'il', 'ky', 'la', 'md', 'mi', 'ms', 'nc', 'pr']

df = (pd.get_dummies(df, prefix='', prefix_sep='')
        .max(axis=1, level=0)
        .reindex(codes, axis=1, fill_value=0))
print(df)
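As an aside, the level argument of DataFrame aggregations such as max is deprecated in newer pandas releases (1.3 and later), so on a recent version the same collapse of duplicate one-hot columns can be written with groupby; a sketch of the equivalent chain:

df = (pd.get_dummies(df, prefix='', prefix_sep='')
        .T.groupby(level=0).max().T       # same effect as .max(axis=1, level=0)
        .reindex(codes, axis=1, fill_value=0))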

1 Comment

Thanks, the second solution helped me a lot and is easier to understand. It also reduced my execution time from 6 seconds to 1.

To test my solution, I used the following DataFrame with a smaller number of codes:

    0   1    2    3    4    5    6    7    8    9
0  fl  nc  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1  fl  nc  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2  ct  dc   fl   hi   il   ky   la   md   mi   ms
3  ct  dc   fl   il   ky   la   md   mi   ms   nc
4  hi  pr  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
5  hi  il   ky   md   mi   ms  NaN  NaN  NaN  NaN
6  hi  il   ky   la   mi   ms  NaN  NaN  NaN  NaN
7  ct  la   md   ms   nc  NaN  NaN  NaN  NaN  NaN
8  dc  md   mi   ms   nc  NaN  NaN  NaN  NaN  NaN
9  dc  md   mi   nc  NaN  NaN  NaN  NaN  NaN  NaN

To create Final_Df I started with a list of codes:

import numpy as np
import pandas as pd

codes = ['ca', 'ct', 'dc', 'fl', 'hi', 'il', 'ky', 'la', 'md', 'mi', 'ms', 'nc', 'pr']

and created Final_Df (full of zeroes) the following way:

Final_Df = pd.DataFrame(0, index=df.index, columns=codes)

I also need a dictionary to translate codes into column numbers, with -1 for NaN (these values will be omitted):

codeToInd = { code: ind for ind, code in enumerate(codes) }
codeToInd[np.nan] = -1

The first step of the actual computation is to translate df into ind, a NumPy array:

ind = np.vectorize(codeToInd.get)(df)

The result is:

array([[ 3, 11, -1, -1, -1, -1, -1, -1, -1, -1],
       [ 3, 11, -1, -1, -1, -1, -1, -1, -1, -1],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 1,  2,  3,  5,  6,  7,  8,  9, 10, 11],
       [ 4, 12, -1, -1, -1, -1, -1, -1, -1, -1],
       [ 4,  5,  6,  8,  9, 10, -1, -1, -1, -1],
       [ 4,  5,  6,  7,  9, 10, -1, -1, -1, -1],
       [ 1,  7,  8, 10, 11, -1, -1, -1, -1, -1],
       [ 2,  8,  9, 10, 11, -1, -1, -1, -1, -1],
       [ 2,  8,  9, 11, -1, -1, -1, -1, -1, -1]])

One more preparatory step is to extract the underlying NumPy array from Final_Df:

finDfVal = Final_Df.values

The actual processing (setting 1's in the proper cells) is then performed with the following loop:

for r, c in np.argwhere(ind >= 0):
    finDfVal[r, ind[r, c]] = 1

After that Final_Df contains:

   ca  ct  dc  fl  hi  il  ky  la  md  mi  ms  nc  pr
0   0   0   0   1   0   0   0   0   0   0   0   1   0
1   0   0   0   1   0   0   0   0   0   0   0   1   0
2   0   1   1   1   1   1   1   1   1   1   1   0   0
3   0   1   1   1   0   1   1   1   1   1   1   1   0
4   0   0   0   0   1   0   0   0   0   0   0   0   1
5   0   0   0   0   1   1   1   0   1   1   1   0   0
6   0   0   0   0   1   1   1   1   0   1   1   0   0
7   0   1   0   0   0   0   0   1   1   0   1   1   0
8   0   0   1   0   0   0   0   0   1   1   1   1   0
9   0   0   1   0   0   0   0   0   1   1   0   1   0
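As a side note, the Python-level loop over np.argwhere can be replaced by a single fancy-indexing assignment, which avoids iterating pair by pair; a sketch assuming the same ind and finDfVal arrays as above:

rows, cols = np.nonzero(ind >= 0)      # coordinates of the valid (non-NaN) codes
finDfVal[rows, ind[rows, cols]] = 1    # set all the 1's in one vectorized step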

Execution speed

Using %timeit, I compared the execution time of my code with yours, and even on this very limited data sample mine ran about 7 times faster.

I think that on a bigger DataFrame the difference should be even greater. Let me know the execution times of your code and mine.

In case of an error

To check what is going on, try to create a DataFrame counterpart of ind, just for display:

df.applymap(lambda x: codeToInd[x])

If everything is OK, a DataFrame with translated codes should be printed.

But in case of any missing value in codeToInd a KeyError exception is raised, showing the missing value. Add this missing value to codes and repeat the whole procedure.
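Note also that two distinct NaN objects do not compare equal (nan != nan), so a NaN that is not the identical np.nan object used as the dictionary key fails the lookup; codeToInd.get then returns None, which makes np.vectorize fail with a TypeError. A more defensive sketch that maps anything unknown (including such stray NaN's) to -1, at the price of silently masking genuinely missing codes:

ind = np.vectorize(lambda x: codeToInd.get(x, -1))(df)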

4 Comments

On the np.vectorize line I get this error: TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
It actually doesn't convert the last column with NaN's into -1, but it does for the rest; I can see that after using applymap.
Start with a test run on my test data. I suppose there is some disorder in your source data, e.g. NaN stored in the DataFrame as a string instead of a "true" NaN (which is actually a special case of float).
Actually, they are true NaN's; I checked and the type shows as float, but I honestly don't know what's going on. I will try your code with your test data and let you know. Thank you ^^

I would suggest using Pandas vectorization. This tutorial is a good starting point:

https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
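For this particular problem, one possible vectorized formulation (a sketch, assuming the df with code values and the codes list from the other answers) builds each 0/1 output column with isin, so no Python-level row loop is needed:

import pandas as pd

# one column per code: does the code appear anywhere in the row?
Final_Df = pd.DataFrame({c: df.isin([c]).any(axis=1).astype('int8')
                         for c in codes})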

1 Comment

I checked this but didn't find a good way to implement any of this.

If it is not important that rows or columns that are all NaN are included, you can also try the following:

  • Melt your dataframe:
# >>> df
#      0    1    2    3    4
# 0  NaN  NaN   ct  NaN  NaN
# 1  NaN  NaN  NaN  NaN  NaN
# 2   ta  NaN  NaN   ga  NaN
# 3  NaN  NaN  NaN  NaN  NaN
# 4  NaN  NaN  NaN  NaN  NaN

import pandas as pd

molten = pd.melt(df.T)

# >>> molten
#     variable value
# 0          0   NaN
# 1          0   NaN
# 2          0    ct
# 3          0   NaN
# 4          0   NaN
# 5          1   NaN 
  • Use pandas.crosstab to tabulate the entries:
tab = pd.crosstab(molten["variable"], molten["value"])

# >>> tab
# value     ct  ga  ta
# variable
# 0          1   0   0
# 2          0   1   1

1 Comment

Thanks for your answer, but I do need the NaN values as 0's.
