In Spark 3.3.*, joining on the column name (on="id") deduplicates the join key by default:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "value1"])
df2 = spark.createDataFrame([(1, "X"), (2, "Y"), (4, "Z")], ["id", "value2"])
joined_df = df1.join(df2, on="id", how="inner")  # string key -> single "id" column
joined_df.show()
+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  2|     B|     Y|
+---+------+------+
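For contrast, duplicate key columns do appear when the join condition is a column expression rather than a name. A minimal sketch reusing the same df1 and df2 (dup_df is an illustrative name, not from the original):

# An expression join keeps the key from both sides, so the result
# has duplicate columns: id, value1, id, value2
dup_df = df1.join(df2, df1["id"] == df2["id"], how="inner")
dup_df.show()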
If you are using an older Spark version (or an expression join) that leaves duplicate columns, deduplicate them manually:
- Get the column names with `df.columns`
- Deduplicate them with a Python `set` (the snippet below uses `dict.fromkeys`, which also preserves the column order)
- Pass the unique names to a `select` expression
# A plain set() would lose the column order; dict.fromkeys keeps
# the first occurrence of each name in its original position
unique_cols = list(dict.fromkeys(joined_df.columns))
joined_df.select(*unique_cols).show()
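Note that when the duplicates share a name, as in the expression join above, selecting that name raises an ambiguity error. A hedged alternative in that case, reusing dup_df from the sketch above, is to drop the right-hand key by its DataFrame reference:

# Dropping the key via df2's column reference avoids the name
# ambiguity that select("id") would hit on dup_df
dup_df.drop(df2["id"]).show()  # columns: id, value1, value2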