I'm using the code below to join two dataframes and drop the duplicated columns between them. However, I get the error:

AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans...Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;

My df1 has 15 columns and my df2 has 50+ columns. How can I join on multiple columns without hardcoding the column names to join on?
def join(dataset_standardFalse, dataset, how='left'):
    # No join keys are passed here, which is what triggers the implicit cartesian product
    final_df = dataset_standardFalse.join(dataset, how=how)
    # Drop the columns that appear in both dataframes, keeping the left-hand copy
    repeated_columns = [c for c in dataset_standardFalse.columns if c in dataset.columns]
    for col in repeated_columns:
        final_df = final_df.drop(dataset[col])
    return final_df
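For reference, this is roughly how I call it (the dataframe names here are just placeholders for my real ones); the exception is raised on the join call:

result = join(df1, df2, how='left')
# -> AnalysisException: Detected implicit cartesian product for LEFT OUTER join ...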
To give a specific example: when comparing the columns of the two dataframes, they will have multiple columns in common. Can I join on that list of columns? I need to avoid hard-coding the names since the columns vary from case to case.
cols = set(dataset_standardFalse.columns) & set(dataset.columns)
print(cols)
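What I have in mind is something like the sketch below (untested; it assumes the shared column names computed above really are the intended join keys), passing the common columns directly as the on argument so Spark joins on them instead of producing a cartesian product:

def join(dataset_standardFalse, dataset, how='left'):
    # Join keys: every column name the two dataframes share (assumed to be the intended keys)
    common_cols = [c for c in dataset_standardFalse.columns if c in dataset.columns]
    # Joining on a list of column names keeps a single copy of each key column,
    # so no duplicate-dropping loop should be needed afterwards
    return dataset_standardFalse.join(dataset, on=common_cols, how=how)

Is this a reasonable way to do it, or is there a better approach?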