How to avoid iterrows for this pandas dataframe processing

Question

I need some help in converting the following code to a more efficient one without using iterrows().

for index, row in df.iterrows():
    alist = row['index_vec'].strip("[] ").split(",")
    blist = [int(i) for i in alist]
    for col in blist:
        df.loc[index, str(col)] = df.loc[index, str(col)] + 1

The code above reads the string in the 'index_vec' column, parses it into integers, and then increments the column named after each integer by one. An example of the output is shown below:

Take the 0th row as an example. Its string value is "[370, 370, -1]". So the above code increments column "370" by 2 and column "-1" by 1. The output display is truncated so that only "-10" to "17" columns are shown.
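
The per-row step described above can be sketched in isolation. This is only an illustration of the parse-and-count logic for one string; `count_tokens` is a hypothetical helper name, not part of the original code.

```python
from collections import Counter


def count_tokens(index_vec: str) -> Counter:
    """Parse a string like "[370, 370, -1]" and count each integer in it."""
    # Same parsing as the loop in the question: drop brackets, split on commas.
    tokens = index_vec.strip("[] ").split(",")
    # int() tolerates the leading space left behind by the split.
    return Counter(int(t) for t in tokens)


counts = count_tokens("[370, 370, -1]")
print(counts)  # Counter({370: 2, -1: 1})
```

So for the 0th row, column "370" should go up by 2 and column "-1" by 1.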

Using iterrows() is very slow on a large dataframe. I'd like some help speeding it up. Thank you.

Anna Nevison · Accepted Answer · 2020-05-25 00:38:47Z

1

You can also use apply with axis = 1 to go row-wise, and pass a custom function to apply:

Example starting df:

      index_vec  1201  370  -1
0  [370, -1, -1]     0    0   1
1   [1201, 1201]     0    1   1

import pandas as pd 

df = pd.DataFrame({'index_vec': ["[370, -1, -1]", "[1201, 1201]"], '1201': [0, 0], '370': [0, 1], '-1': [1, 1]})

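def add_counts(x):
  # Count how often each column label appears in this row's 'index_vec' string.
  counts = pd.Series(x['index_vec'].strip("[]").split(", ")).value_counts()
  # Add the counts to the matching columns (labels align on the Series index).
  x[counts.index] = x[counts.index] + counts
  return x

# apply returns a new DataFrame, so assign the result back.
df = df.apply(add_counts, axis = 1)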

print(df)

Outputs:

      index_vec  1201  370  -1
0  [370, -1, -1]     0    1   3
1   [1201, 1201]     2    1   1

edited May 25, 2020 at 0:38

answered May 24, 2020 at 23:56

Anna Nevison


4 Comments

David293836 Over a year ago

This works. Thanks, Anna. The 'index_vec' does not have all the values I need. So I first manually created these columns. Then I added your code next. The manual column creation code is:

David293836 Over a year ago

for i in range(neg_index, pos_index):
    df[str(i)] = 0
    df[str(i)] = df[str(i)].astype(np.int16)

Anna Nevison Over a year ago

@David293836 hmm I think I have an idea of how you wouldn't need to do that manually & make it much faster. If you want to post that as a new question with the full code I can take a look at it.

David293836 Over a year ago

Great. A new question has been posted here: stackoverflow.com/questions/61994503/… Thank you again, Anna.
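
As a side note on the column-creation loop quoted in the comments above, the zero-filled columns could probably be built in one step rather than one at a time. This is a minimal sketch, not the approach from the follow-up question; the bounds and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical bounds and data, chosen only for illustration.
neg_index, pos_index = -2, 3
df = pd.DataFrame({"index_vec": ["[1, 1, -2]", "[0]"]})

# Build all zero-filled int16 count columns at once, labelled "-2" .. "2".
labels = [str(i) for i in range(neg_index, pos_index)]
zeros = pd.DataFrame(0, index=df.index, columns=labels, dtype=np.int16)

# Attach them next to the existing columns in a single concat.
df = pd.concat([df, zeros], axis=1)
print(df.dtypes)
```

Building the block once and concatenating avoids inserting columns into the frame one by one.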

BENY · Accepted Answer · 2020-05-25 00:44:11Z

1

Let us do

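# One token per row after explode; strip the space left by the split.
a = df['index_vec'].str.strip("[] ").str.split(",").explode().str.strip()
# Per-row counts, aligned to the existing count columns.
cols = df.columns.drop('index_vec')
s = pd.crosstab(a.index, a).reindex(index=df.index, columns=cols, fill_value=0)
df[cols] = df[cols].add(s)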
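
To check the explode-and-crosstab idea against the original loop, here is a self-contained sketch. It borrows the two-row example frame from the apply-based answer, and the extra `.str.strip()` on the tokens and the explicit `reindex` are my assumptions about what is needed for the labels to line up.

```python
import pandas as pd

# Example frame borrowed from the apply-based answer (illustrative data only).
df = pd.DataFrame({
    "index_vec": ["[370, -1, -1]", "[1201, 1201]"],
    "1201": [0, 0],
    "370": [0, 1],
    "-1": [1, 1],
})

# Reference result from the original row-by-row loop.
expected = df.copy()
for index, row in expected.iterrows():
    for token in row["index_vec"].strip("[] ").split(","):
        col = str(int(token))
        expected.loc[index, col] = expected.loc[index, col] + 1

# Vectorized version: one token per row after explode, original index kept.
tokens = df["index_vec"].str.strip("[] ").str.split(",").explode().str.strip()
count_cols = df.columns.drop("index_vec")
# Rows are the original index, columns are tokens, cells are occurrence counts.
counts = pd.crosstab(tokens.index, tokens).reindex(
    index=df.index, columns=count_cols, fill_value=0
)
result = df.copy()
result[count_cols] = df[count_cols].add(counts)

print(result)
```

On this small frame both versions give the same counts. Any token without a matching column is dropped by the `reindex`, so the columns still have to exist beforehand.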

edited May 25, 2020 at 0:44

answered May 24, 2020 at 23:09

BENY


5 Comments

David293836 Over a year ago

The first line generated the following error message: AttributeError: 'Series' object has no attribute 'split'

BENY Over a year ago

@David293836 add str before split

David293836 Over a year ago

Thanks. The 2nd line has a problem with the fill_value argument in reindex_like(). The error message is: TypeError: reindex_like() got an unexpected keyword argument 'fill_value'

David293836 Over a year ago

Note that 'index_vec' does not have all the numbers. So I had to manually create all the columns from the lowest to the highest. (e.g., -10 to the upper limit).

David293836 Over a year ago

Is the 2nd line supposed to be s = pd.crosstab(a.index,a).reindex_like(df).fillna(0) or a = pd.crosstab(a.index,a).reindex_like(df).fillna(0)?
