Spark SQL .distinct() performance

Question

I want to source few hundreds of gigabytes from a database via JDBC and then process it using Spark SQL. Currently I am doing some partitioning at that data and process is by batches of milion records. The thing is that I would like also to apply some deduplication to my dataframes and I was going to leave that idea of separated batches processing and try to process those hundreds of gigabytes using a one dataframe partitioned accordingly.

The main concern is: how will .distinct() work in such case? Will Spark SQL firstly try to load ALL the data into the RAM and then apply deduplication involving many shuffles and repartitioning? Do I have to ensure that a cluster has enough of RAM to contain that raw data or maybe it will be able to help itself with HDD storage (thus killing the performance)?

Or maybe I should do it without Spark - move the data to the target storage and there apply distinct counts and detect duplicates and get rid off them?

suj1th · Accepted Answer · 2018-03-04 17:58:20Z

4

Spark SQL does NOT use predicate pushdown for distinct queries; meaning that the processing to filter out duplicate records happens at the executors, rather than at the database. So, your assumption regarding shuffles happening over at the executors to process distinct is correct.

Inspite of this, I would still advise you to go ahead and perform the de-duplication on Spark, rather than build a separate arrangement for it. My personal experience with distinct has been more than satisfactory. It has always been the joins which push my buttons.

answered Mar 4, 2018 at 17:58

suj1th

1,8112 gold badges15 silver badges22 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Spark SQL .distinct() performance

1 Answer 1

Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Related