
I've been trying for a long time to perform a groupBy and a count() on a Spark DataFrame, but it takes forever to process...

The aggregation below takes about 13 seconds to run. That seems like too much time to me, but I don't know how to reduce the processing time.

from pyspark.sql.functions import count

matched.limit(100).groupBy('Date', 'Period').agg(count("*").alias('cnt')).show()

I'm running on Spark 2.4 with the following configuration: driver: 2 vCPU / 8 GB RAM; 10 executors: 2 vCPU / 8 GB RAM each.
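
For reference, assuming the job is started from PySpark, those resources would roughly map onto a session configuration like the sketch below (the app name and the way the settings are applied are my own illustration, not taken from the actual job):

from pyspark.sql import SparkSession

# Sketch of a session mirroring the resources described above:
# 10 executors with 2 cores / 8 GB each, plus an 8 GB driver.
# Note: spark.driver.memory normally has to be set before the driver JVM
# starts (e.g. on the spark-submit command line), so setting it here may
# have no effect in client mode.
spark = (
    SparkSession.builder
    .appName("groupby-count")  # hypothetical app name
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "8g")
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)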

Can anyone give me a hint on how to solve this issue?

  • How many rows do you have? Commented Aug 14, 2020 at 10:58
  • @Lamanus I can't tell exactly, as I have never let the process finish. I tried df.count() but it got stuck. Commented Aug 14, 2020 at 11:26
  • Oh... that's too bad. Is it possible to split the file? How big is the file? Commented Aug 14, 2020 at 11:27
  • I have never tried, but I know the number of rows should be somewhere below 70 million, and I don't think a count() should take that much time. Commented Aug 14, 2020 at 11:34
  • Yeah, everything is working fine. I have filtered the data based on location and it takes no more than 30 seconds... Commented Aug 14, 2020 at 12:51

1 Answer


That is the correct way to do it, I think. The time it takes will depend on how many rows there are.

df.groupBy('Date', 'Period').count().show(10, False)
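
For illustration, here is a minimal self-contained sketch of the same aggregation on toy data (the schema and values are made up; only the Date and Period column names come from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.appName("groupby-count-demo").getOrCreate()

# Toy stand-in for the real DataFrame; values are purely illustrative.
matched = spark.createDataFrame(
    [("2020-08-01", "AM"), ("2020-08-01", "PM"), ("2020-08-02", "AM")],
    ["Date", "Period"],
)

# Both forms run the same shuffle-based aggregation and should perform alike.
matched.groupBy("Date", "Period").count().show(10, False)
matched.groupBy("Date", "Period").agg(count("*").alias("cnt")).show()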