I want to source a few hundred gigabytes from a database via JDBC and then process them using Spark SQL. Currently I partition the data and process it in batches of a million records. The thing is that I would also like to apply some deduplication to my DataFrames, so I am considering abandoning the separate-batch approach and processing those hundreds of gigabytes as one DataFrame, partitioned accordingly.
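For context, this is roughly what I had in mind: a single partitioned JDBC read followed by a deduplication step. This is only a sketch; the connection URL, table name, partition column ("id"), bounds, and output path are placeholders for my actual setup.

```scala
import org.apache.spark.sql.SparkSession

object JdbcDedup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-dedup")
      .getOrCreate()

    // Read the whole table as one DataFrame, letting Spark split the read
    // into parallel JDBC partitions on a numeric column ("id" is a placeholder).
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb") // placeholder URL
      .option("dbtable", "events")                          // placeholder table
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "500000000")
      .option("numPartitions", "200")
      .load()

    // Deduplicate across the whole dataset in one go instead of per batch.
    val deduped = df.distinct() // or df.dropDuplicates(Seq("business_key"))

    deduped.write.parquet("hdfs:///target/events_deduped") // placeholder path
    spark.stop()
  }
}
```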
The main concern is: how will .distinct() work in such a case? Will Spark SQL first try to load ALL the data into RAM and only then apply deduplication, involving many shuffles and repartitionings? Do I have to ensure that the cluster has enough RAM to hold the raw data, or will it be able to spill to HDD storage (thus killing performance)?
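What I can see so far is only the physical plan, which at least confirms the shuffle I am worried about. The output in the comments below is abbreviated and only indicative of the plan shape, not copied from my cluster:

```scala
// df is the JDBC DataFrame from the snippet above.
df.distinct().explain()
// == Physical Plan == (abbreviated, shape only)
// HashAggregate(keys=[all columns], functions=[])
// +- Exchange hashpartitioning(all columns, 200)          <- the shuffle
//    +- HashAggregate(keys=[all columns], functions=[])   <- partial, map-side
//       +- Scan JDBCRelation(events)
```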
Or maybe I should do it without Spark: move the data to the target storage and apply distinct counts there to detect duplicates and get rid of them?