I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following, which is very slow and crashes often:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such queries in order to produce a plot.
How can I perform such a query efficiently?
Calling dropDuplicates() before the join might already reduce the workload, since then no user_id has to be matched against multiple records in the second dataframe. So calling dropDuplicates(['user_id']) on both dfA and dfB before the join can help.
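A minimal sketch of that idea, assuming both dataframes have a user_id column. The helper name count_shared_user_ids and its key parameter are made up for illustration and are not part of the question. It also drops all other columns before deduplicating, which is an extra assumption that only the ids matter for the count:

```python
def count_shared_user_ids(df_a, df_b, key="user_id"):
    """Count the distinct `key` values that appear in both DataFrames.

    Hypothetical helper: df_a and df_b are expected to be PySpark
    DataFrames that both contain the column named by `key`.
    """
    # Keep only the key column and deduplicate each side before joining,
    # so the join matches at most one row per key on either side.
    ids_a = df_a.select(key).dropDuplicates()
    ids_b = df_b.select(key).dropDuplicates()

    # Both inputs now hold unique keys, so the inner join result contains
    # each shared key once and can be counted without another dropDuplicates().
    return ids_a.join(ids_b, [key], "inner").count()
```

It would be called as count_shared_user_ids(dfA, dfB). Whether this is actually faster depends on the data: the deduplication itself needs a shuffle, so it probably pays off mainly when user_id repeats a lot in either dataframe.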