
I have seen a few questions like these:

Vectorized alternative to iterrows, Faster alternative to iterrows, Pandas: Alternative to iterrow loops, for loop using iterrows in pandas, python: using .iterrows() to create columns, Iterrows performance. But each of them seems to address a unique case rather than offer a generalized approach.

My question is, once again, about .iterrows.

I am trying to pass the first and second column of each row to a function and create a list from the results.

What I have:

I have a pandas DataFrame with two columns that look like this.

         I.D         Score
1         11          26
3         12          26
5         13          26
6         14          25

What I did:

Here, Points is a function I defined earlier.

my_points = [Points(int(row[0]),row[1]) for index, row in score.iterrows()]

What I am trying to do:

Find a faster, vectorized form of the above.

3 Comments

  • So you want to apply a function on values in a DataFrame, and return a list? Try DataFrame.apply - pandas.pydata.org/pandas-docs/stable/generated/…. Commented Nov 29, 2018 at 9:25
  • Yes, that looks like the solution! Thanks. Commented Nov 29, 2018 at 9:26
  • The way you wrote the sentence actually made me understand my question more. Commented Nov 29, 2018 at 9:27

3 Answers


Try a list comprehension over the zipped columns:

score = pd.concat([score] * 1000, ignore_index=True)

def Points(a,b):
    return (a,b)

In [147]: %timeit [Points(int(a),b) for a, b in zip(score['I.D'],score['Score'])]
1.3 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [148]: %timeit [Points(int(row[0]),row[1]) for index, row in score.iterrows()]
259 ms ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [149]: %timeit [Points(int(row[0]),row[1]) for row in score.itertuples()]
3.64 ms ± 80.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
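
For anyone who wants to reproduce the zip approach outside of IPython, here is a minimal sketch. It assumes the sample data from the question and uses the same stand-in Points as above (the asker's real function was not posted, so this one is only illustrative).

```python
import pandas as pd


# Hypothetical stand-in for the asker's Points function (not shown in the question).
def Points(a, b):
    return (a, b)


# Sample data copied from the question.
score = pd.DataFrame(
    {"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]},
    index=[1, 3, 5, 6],
)

# Walk the two columns in lockstep instead of building a Series for every row,
# which is what makes iterrows slow.
my_points = [Points(int(a), b) for a, b in zip(score["I.D"], score["Score"])]
print(my_points)
```

The same pattern extends to more columns by adding them to the zip call.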

5 Comments

This reduced my processing time from 21 minutes to 31 seconds. Thank you.
@Matthew - yes, try apply, but in my opinion it should be slower, because of some extra checks.
It seems like itertuples is also very close to the performance of this one, in my case.
.apply is not applicable in my case.
@Matthew - yes, only a bit slower - added to the timings in my answer.

Have you ever tried the method .itertuples()?

my_points = [Points(int(row[0]),row[1]) for row in score.itertuples(index=False)]

It is a faster way to iterate over a pandas DataFrame.

I hope it helps.
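
One detail worth checking with itertuples: by default the first field of each tuple is the index, so the column values start at position 1 unless index=False is passed. A small sketch of both variants, assuming the sample data from the question and a hypothetical stand-in for Points:

```python
import pandas as pd


# Hypothetical stand-in for the asker's Points function (not shown in the question).
def Points(a, b):
    return (a, b)


score = pd.DataFrame(
    {"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]},
    index=[1, 3, 5, 6],
)

# Default: each tuple is (index, I.D, Score), so the columns sit at positions 1 and 2.
with_index = [Points(int(row[1]), row[2]) for row in score.itertuples()]

# index=False drops the index field; name=None yields plain tuples that unpack directly.
without_index = [
    Points(int(a), b) for a, b in score.itertuples(index=False, name=None)
]

print(with_index == without_index)
```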

2 Comments

This one also fits very well! Thank you
The jezrael answer seems to be the fastest @alejandro but thank you for your time!

The question is actually not about how you iterate through a DataFrame and return a list, but rather how you can apply a function on values in a DataFrame by column.

You can use pandas.DataFrame.apply with axis set to 1:

df.apply(func, axis=1)

To put the result in a list (the details depend on what your function returns), you could do:

df.apply(Points, axis=1).tolist()

If you want to apply it to only some columns:

df[['Score', 'I.D']].apply(Points, axis=1)

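If you want to apply a function that takes multiple arguments, you can use numpy.vectorize, which is usually faster than apply with axis=1 (although internally it is still essentially a Python loop, not true vectorization):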

np.vectorize(Points)(df['Score'], df['I.D'])

Or a lambda:

df.apply(lambda x: Points(x['Score'], x['I.D']), axis=1).tolist()
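
As a rough, self-contained sketch of the lambda and numpy.vectorize variants, assuming the sample data from the question and a hypothetical Points that returns a 2-tuple (the real function was not posted). Note that when the wrapped function returns a tuple, numpy.vectorize treats it as multiple outputs and hands back one array per output, so the pairs have to be zipped back together:

```python
import numpy as np
import pandas as pd


# Hypothetical stand-in for the asker's Points function (not shown in the question).
def Points(a, b):
    return (a, b)


df = pd.DataFrame(
    {"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]},
    index=[1, 3, 5, 6],
)

# Row-wise apply: the lambda receives each row as a Series and picks out two columns.
via_apply = df.apply(lambda x: Points(int(x["I.D"]), x["Score"]), axis=1).tolist()

# np.vectorize: because this Points returns a 2-tuple, the result is a tuple of
# two arrays (all first elements, all second elements), not an array of pairs.
ids, scores = np.vectorize(Points)(df["I.D"], df["Score"])
via_vectorize = list(zip(ids.tolist(), scores.tolist()))

print(via_apply == via_vectorize)
```

If the real function returns a single scalar instead, the np.vectorize call gives one array and the zip step is unnecessary.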

3 Comments

This does not work, as the function needs to take in 2 values, and it needs to take them from some columns rather than from all of them.
It wasn't obvious from your question that your DataFrame contained more columns than ID and Score so I don't think that is a valid point. But, you can just apply by selecting the columns you want first. This can be used in a function that takes multiple values, but it depends on how the function is written - you didn't post it in your question.
My apologies, I wanted to thank you for actually rephrasing my question correctly.
