Vectorized way for applying a function to a dataframe to create lists

Question

I have seen a few questions like these:

Vectorized alternative to iterrows, Faster alternative to iterrows, Pandas: Alternative to iterrow loops, for loop using iterrows in pandas, python: using .iterrows() to create columns, Iterrows performance. But it seems like each one is a unique case rather than a generalized approach.

My question is, again, about .iterrows.

I am trying to pass the first and second column of each row to a function and create a list from the results.

What I have:

I have a pandas DataFrame with two columns that look like this.

         I.D         Score
1         11          26
3         12          26
5         13          26
6         14          25

What I did:

Here, Points is a function I defined earlier:

my_points = [Points(int(row[0]),row[1]) for index, row in score.iterrows()]

What I am trying to do:

The faster and vectorized form of the above.

So you want to apply a function on values in a DataFrame, and return a list? Try DataFrame.apply - pandas.pydata.org/pandas-docs/stable/generated/…. – user3471881, commented Nov 29, 2018 at 9:25
The way you wrote the sentence actually made me understand my question more. – PolarBear10, commented Nov 29, 2018 at 9:27

jezrael · Accepted Answer · 2018-11-29 10:00:16Z

1

Try a list comprehension over the zipped columns:

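import pandas as pd

# Repeat the question's 4-row sample 1000 times for a more meaningful benchmark.
score = pd.concat([score] * 1000, ignore_index=True)

# Stand-in for the asker's Points function, which was not posted.
def Points(a,b):
    return (a,b)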

In [147]: %timeit [Points(int(a),b) for a, b in zip(score['I.D'],score['Score'])]
1.3 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [148]: %timeit [Points(int(row[0]),row[1]) for index, row in score.iterrows()]
259 ms ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [149]: %timeit [Points(int(row[1]),row[2]) for row in score.itertuples()]
3.64 ms ± 80.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
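
As a quick sanity check, the following sketch rebuilds the four-row sample from the question and confirms that the zip-based comprehension produces the same list as the iterrows version. The tuple-returning Points is only a placeholder (the real function was not posted), and the iterrows loop uses label access (row["I.D"]) rather than row[0], which is an adjustment made here for illustration.

```python
import pandas as pd

# Sample data copied from the question.
score = pd.DataFrame(
    {"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]},
    index=[1, 3, 5, 6],
)


def Points(a, b):
    # Placeholder: the asker's real Points function is unknown.
    return (a, b)


# Row-by-row version: builds a Series object for every row.
slow = [Points(int(row["I.D"]), row["Score"]) for _, row in score.iterrows()]

# zip-based version: walks the two columns in parallel.
fast = [Points(int(a), b) for a, b in zip(score["I.D"], score["Score"])]

print(fast)
```

Both lists hold the same (I.D, Score) pairs; only the iteration cost differs.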

edited Nov 29, 2018 at 10:00

answered Nov 29, 2018 at 9:30

jezrael


5 Comments

PolarBear10 Over a year ago

This reduced my processing time from 21 minutes to 31 seconds. Thank you.

jezrael Over a year ago

@Matthew - yes, try apply, but in my opinion it should be slower because of some additional checks.

PolarBear10 Over a year ago

it seems like itertuples is also very close to the performance of this one, in my case.

PolarBear10 Over a year ago

.apply is not applicable in my case.

jezrael Over a year ago

@Matthew - yes, it is only a bit slower. I added it to the timings in my answer.

Alejandro · Answer · 2018-11-29 09:45:26Z

1

Have you ever tried the method .itertuples()?

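my_points = [Points(int(row[0]),row[1]) for row in score.itertuples(index=False)]

It is a faster way to iterate over a pandas DataFrame than .iterrows(). Passing index=False leaves the index out of each tuple, so row[0] is I.D and row[1] is Score.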
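
A minimal sketch of the itertuples approach on the question's sample data follows. It assumes a placeholder Points that returns a tuple, and it passes name=None so each row comes back as a plain tuple. That choice is made here, not in the original answer, because the column label I.D is not a valid Python identifier for a namedtuple field.

```python
import pandas as pd

# Sample data copied from the question.
score = pd.DataFrame(
    {"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]},
    index=[1, 3, 5, 6],
)


def Points(a, b):
    # Placeholder: the asker's real Points function is unknown.
    return (a, b)


# index=False drops the index from each row tuple;
# name=None yields plain tuples instead of namedtuples.
my_points = [
    Points(int(a), b)
    for a, b in score.itertuples(index=False, name=None)
]

print(my_points)
```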

I hope it helps.

answered Nov 29, 2018 at 9:45

Alejandro


2 Comments

PolarBear10 Over a year ago

This one also fits very well! Thank you

PolarBear10 Over a year ago

The jezrael answer seems to be the fastest @alejandro but thank you for your time!

user3471881 · Answer · 2018-11-29 10:18:06Z

1

The question is actually not about how you iterate through a DataFrame and return a list, but rather how you can apply a function to values in a DataFrame by column.

You can use pandas.DataFrame.apply with axis set to 1:

df.apply(func, axis=1)

To put the result in a list, call .tolist() on it. Whether this works depends on your function: it must accept a single row (a Series) and return one value per row:

df.apply(Points, axis=1).tolist()

If you want to apply it to only some columns, select them first:

df[['I.D', 'Score']].apply(Points, axis=1)

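If your function takes multiple arguments, you can use numpy.vectorize. It is usually faster than apply with axis=1, although NumPy documents it as a convenience wrapper around a loop rather than true vectorization:

np.vectorize(Points)(df['I.D'], df['Score'])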

Or a lambda:

df.apply(lambda x: Points(int(x['I.D']), x['Score']), axis=1).tolist()
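
The two multi-argument variants above can be sketched on the question's sample data as follows. Points is again a placeholder that returns a tuple, since the real function was not posted. The otypes=[object] argument is an addition made here: without it, numpy.vectorize treats a tuple return value as multiple outputs and returns one array per tuple element instead of one tuple per row.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]},
    index=[1, 3, 5, 6],
)


def Points(a, b):
    # Placeholder: the asker's real Points function is unknown.
    return (a, b)


# Row-wise apply: the lambda unpacks the two columns Points needs.
via_apply = df.apply(
    lambda row: Points(int(row["I.D"]), row["Score"]), axis=1
).tolist()

# numpy.vectorize: passes the columns element by element.
# otypes=[object] keeps each returned tuple as a single result.
vectorized_points = np.vectorize(Points, otypes=[object])
via_vectorize = vectorized_points(df["I.D"], df["Score"]).tolist()

print(via_apply)
print(via_vectorize)
```

If Points returned a scalar instead of a tuple, the otypes argument would not be needed.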

edited Nov 29, 2018 at 10:18

answered Nov 29, 2018 at 9:30

user3471881


3 Comments

PolarBear10 Over a year ago

This does not work, as the function needs to take in 2 values, and it needs to take them from some columns and not from everything.

user3471881 Over a year ago

It wasn't obvious from your question that your DataFrame contained more columns than ID and Score so I don't think that is a valid point. But, you can just apply by selecting the columns you want first. This can be used in a function that takes multiple values, but it depends on how the function is written - you didn't post it in your question.

PolarBear10 Over a year ago

My apologies, I wanted to thank you for actually rephrasing my question correctly.
