pySpark join dataframe on multiple columns

Question

I'm using the code below to join two dataframes and drop the duplicated columns. However, I get the error AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans...Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;

My df1 has 15 columns and my df2 has 50+ columns. How can I join on multiple columns without hardcoding the columns to join on?

def join(dataset_standardFalse, dataset,  how='left'):
    final_df = dataset_standardFalse.join(dataset,  how=how)
    repeated_columns = [c for c in dataset_standardFalse.columns if c in dataset.columns]
    for col in repeated_columns:
        final_df = final_df.drop(dataset[col])
    return final_df

As a specific example: when I compare the columns of the two dataframes, they have multiple columns in common. Can I join on that list of columns? I need to avoid hard-coding names, since the columns vary from case to case.

cols = set(dataset_standardFalse.columns) & (set(dataset.columns))
print(cols)

Take a look at this Spark JIRA issue: issues.apache.org/jira/browse/SPARK-21380. It might be helpful.
– Som, commented Jun 8, 2020 at 4:49

Shubham Jain · Accepted Answer · 2020-06-08 06:07:00Z

IIUC, you can join on multiple columns directly, as long as they are present in both dataframes.

# This gives you the list of columns common to both dataframes
cols = list(set(dataset_standardFalse.columns) & set(dataset.columns))

# Modify your function to pass that list of columns as the join condition
def join(dataset_standardFalse, dataset, how='left'):
    cols = list(set(dataset_standardFalse.columns) & set(dataset.columns))
    # Joining on a list of names keeps a single copy of each join column,
    # so the repeated columns do not need to be dropped afterwards
    final_df = dataset_standardFalse.join(dataset, cols, how=how)
    return final_df

When you pass a list of column names as the join condition, every column in the list must be present in both dataframes. If a column is named differently on one side, either rename it in a preprocessing step or build the join condition dynamically.
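A minimal sketch of the point above, assuming the join keys are exactly the columns the two dataframes share. The helper names `common_columns` and `join_on_common` are hypothetical, not part of PySpark; the only library calls used are `DataFrame.columns` and `DataFrame.join(other, on, how)`. It keeps the left dataframe's column order (a plain `set` intersection guarantees no order) and fails early when there is nothing to join on.

```python
def common_columns(left_cols, right_cols):
    """Return the names present in both lists, in the order of left_cols."""
    right = set(right_cols)
    return [c for c in left_cols if c in right]


def join_on_common(left_df, right_df, how="left"):
    """Join two dataframes on every column name they share (hypothetical helper)."""
    cols = common_columns(left_df.columns, right_df.columns)
    if not cols:
        # Without shared names there is no join condition to build.
        raise ValueError("the dataframes have no column names in common")
    # Passing a list of names keeps one copy of each join column in the result.
    return left_df.join(right_df, on=cols, how=how)
```

With the dataframes from the question, this would be called as `final_df = join_on_common(dataset_standardFalse, dataset)`.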

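For dynamic column names, use this:

# columnDf1 and columnDf2 are lists of the join column names from df1 and df2, in matching order
# Referencing the columns through df1[...] and df2[...] avoids ambiguity when a name exists on both sides
df = df1.join(df2, [df1[c1] == df2[c2] for c1, c2 in zip(columnDf1, columnDf2)], how='left')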
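One hedged way to obtain the two name lists without hardcoding them, assuming the names on the two sides differ only by some known rule. Here the rule is a made-up `std_` prefix on the right-hand side; `strip_prefix`, `pair_columns`, `join_on_pairs`, and the prefix itself are all hypothetical and stand in for whatever relation your column names actually have.

```python
def strip_prefix(name, prefix="std_"):
    """Hypothetical naming rule: 'std_id' on one side matches 'id' on the other."""
    return name[len(prefix):] if name.startswith(prefix) else name


def pair_columns(left_cols, right_cols, normalize=strip_prefix):
    """Return (left_name, right_name) pairs whose normalized names are equal."""
    right_by_key = {normalize(c): c for c in right_cols}
    return [(c, right_by_key[normalize(c)])
            for c in left_cols if normalize(c) in right_by_key]


def join_on_pairs(df1, df2, pairs, how="left"):
    """Join on df1[c1] == df2[c2] for each pair; needs at least one pair."""
    if not pairs:
        raise ValueError("no matching column pairs found")
    condition = [df1[c1] == df2[c2] for c1, c2 in pairs]
    return df1.join(df2, condition, how=how)
```

A call would look like `join_on_pairs(df1, df2, pair_columns(df1.columns, df2.columns))`. Unlike a join on a list of names, a join on column expressions keeps the key columns from both sides, so drop the right-hand ones afterwards if they are not needed.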

7 Comments

jgtrz Over a year ago

For the function, I'm not able to display my final_df. When using display(final_df), I get NameError: name 'final_df' is not defined. So I ended up removing the function and simply using it as a list comprehension. And for dynamic column names, how do I define columnDf1? Would I need to hardcode a column name? I'm trying both of your suggestions to see which one better suits my case. @Shubham Jain

Shubham Jain Over a year ago

If your join column names are different, then you have to map the columns of df1 to those of df2 somehow: either by hardcoding the mapping or, if there is some relation between the column names, by deriving it dynamically.

jgtrz Over a year ago

Quick question: I'm using final_df = dataset_standardFalse.join(dataset_comb2, cols_comb3, how='left') to join the dataframes, and it does drop the duplicate columns. Now I've noticed that in some cases my dataframes will end up with 4 or more 'duplicate column names', in theory, but the row-by-row values will not be duplicated. Right now my code only takes duplicate columns into account, so all duplicate columns are dropped without accounting for ... unique rows within the duplicate columns. Is there a way to tweak it? @Shubham Jain

Shubham Jain Over a year ago

One way to do it: before dropping the column, compare the two columns. If all the values are the same, drop the extra column; otherwise keep it or rename it.
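A rough sketch of that compare-then-drop idea, assuming you join on a known list of key columns and that other same-named columns may or may not hold the same values. `rename_collisions`, `join_keep_differing`, and the `_right` suffix are hypothetical names chosen for illustration; the Spark calls used are `withColumnRenamed`, `join`, `filter`, `limit`, `count`, `drop`, and `Column.eqNullSafe`.

```python
def rename_collisions(left_cols, right_cols, keys, suffix="_right"):
    """Map each non-key right column that also exists on the left to a suffixed name."""
    left = set(left_cols)
    return {c: c + suffix for c in right_cols if c in left and c not in keys}


def join_keep_differing(left_df, right_df, keys, how="left", suffix="_right"):
    """Join on keys; drop a right-hand duplicate only if it equals the left column."""
    renames = rename_collisions(left_df.columns, right_df.columns, keys, suffix)
    # Rename first so both versions survive the join under distinct names.
    for old, new in renames.items():
        right_df = right_df.withColumnRenamed(old, new)
    joined = left_df.join(right_df, on=list(keys), how=how)
    for old, new in renames.items():
        # Null-safe comparison: two nulls count as equal.
        mismatch = joined.filter(~joined[old].eqNullSafe(joined[new]))
        if mismatch.limit(1).count() == 0:
            joined = joined.drop(new)
    return joined
```

Two caveats: each comparison triggers its own Spark job, and with a left join any row without a match on the right has nulls in the right-hand columns, so it counts as differing whenever the left value is not null.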

jgtrz Over a year ago

Yes, I think I can do it with a union and then a drop? @Shubham Jain
