1

I want to source few hundreds of gigabytes from a database via JDBC and then process it using Spark SQL. Currently I am doing some partitioning at that data and process is by batches of milion records. The thing is that I would like also to apply some deduplication to my dataframes and I was going to leave that idea of separated batches processing and try to process those hundreds of gigabytes using a one dataframe partitioned accordingly.

The main concern is: how will .distinct() work in such case? Will Spark SQL firstly try to load ALL the data into the RAM and then apply deduplication involving many shuffles and repartitioning? Do I have to ensure that a cluster has enough of RAM to contain that raw data or maybe it will be able to help itself with HDD storage (thus killing the performance)?

Or maybe I should do it without Spark - move the data to the target storage and there apply distinct counts and detect duplicates and get rid off them?

1 Answer 1

4

Spark SQL does NOT use predicate pushdown for distinct queries; meaning that the processing to filter out duplicate records happens at the executors, rather than at the database. So, your assumption regarding shuffles happening over at the executors to process distinct is correct.

Inspite of this, I would still advise you to go ahead and perform the de-duplication on Spark, rather than build a separate arrangement for it. My personal experience with distinct has been more than satisfactory. It has always been the joins which push my buttons.

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.