
I've been trying for a long time to perform a groupBy and a count() on a Spark DataFrame, but it takes forever to process...

The aggregation below takes about 13 seconds to run. That seems like too much time to me, but I don't know how to reduce the processing time.

from pyspark.sql.functions import count

matched.limit(100).groupBy('Date', 'Period').agg(count("*").alias('cnt')).show()

I'm running on Spark 2.4 with the following configuration: driver: 2 vCPU / 8 GB RAM; 10 executors: 2 vCPU / 8 GB RAM each.
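
For reference, assuming the job is started from PySpark, those resources would roughly map onto a session configuration like the sketch below (the app name and the way the settings are applied are my own illustration, not taken from the actual job):

from pyspark.sql import SparkSession

# Sketch of a session mirroring the resources described above:
# 10 executors with 2 cores / 8 GB each, plus an 8 GB driver.
# Note: spark.driver.memory normally has to be set before the driver JVM
# starts (e.g. on the spark-submit command line), so setting it here may
# have no effect in client mode.
spark = (
    SparkSession.builder
    .appName("groupby-count")  # hypothetical app name
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "8g")
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)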

Can anyone give me a hint on how to solve this issue?

  • How many rows do you have? Commented Aug 14, 2020 at 10:58
  • @Lamanus I can't tell exactly, as I have never let the process finish. I tried df.count() but it got stuck. Commented Aug 14, 2020 at 11:26
  • Oh... that's too bad. Is it possible to split the file? How big is the file? Commented Aug 14, 2020 at 11:27
  • I have never tried, but I know the number of rows should be somewhere below 70 million, and I don't think a count() should take that much time. Commented Aug 14, 2020 at 11:34
  • Yeah, everything is working fine. I have filtered the data based on location and it takes no more than 30 seconds... Commented Aug 14, 2020 at 12:51

1 Answer


That is the correct way to do it, I think. The time it takes will depend on how many rows there are.

df.groupBy('Date', 'Period').count().show(10, False)
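
For illustration, here is a minimal self-contained sketch of the same aggregation on toy data (the schema and values are made up; only the Date and Period column names come from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.appName("groupby-count-demo").getOrCreate()

# Toy stand-in for the real DataFrame; values are purely illustrative.
matched = spark.createDataFrame(
    [("2020-08-01", "AM"), ("2020-08-01", "PM"), ("2020-08-02", "AM")],
    ["Date", "Period"],
)

# Both forms run the same shuffle-based aggregation and should perform alike.
matched.groupBy("Date", "Period").count().show(10, False)
matched.groupBy("Date", "Period").agg(count("*").alias("cnt")).show()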