
I'm trying to remove one column when there are multiple columns with the same name in a Spark DataFrame after a join operation.

df.printSchema()
 |-- Id: string
 |-- Name: string
 |-- Country: string
 |-- Id: string

I would like to remove the second occurrence of the Id column using PySpark.

  • Obtain the list of column names and find the index of the column you want to drop, then drop that column using the index (a sketch of this is below). Commented Mar 11 at 14:25
  • I did this, but it's dropping both columns even though the index is different for the same column name. Commented Mar 11 at 18:54
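
A minimal sketch of the index-based approach from the comment above (the index 3 and the _c{i} placeholder names are assumptions for illustration): dropping by a name string removes every column with that name, so one option is to temporarily rename all columns to unique placeholders, drop the unwanted position, and restore the original names.

# Assumption: df is the joined DataFrame and the duplicate "Id" sits at index 3.
cols = df.columns                                  # ['Id', 'Name', 'Country', 'Id']
drop_idx = 3                                       # position of the column to remove

# Give every column a temporary unique name so the drop is unambiguous.
tmp = df.toDF(*[f"_c{i}" for i in range(len(cols))])
keep = [i for i in range(len(cols)) if i != drop_idx]

# Keep the surviving positions and restore their original names.
result = tmp.select([f"_c{i}" for i in keep]).toDF(*[cols[i] for i in keep])
result.printSchema()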

2 Answers


In Spark 3.3.* this is done by default when you join on the column name:

df1 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "value1"])
df2 = spark.createDataFrame([(1, "X"), (2, "Y"), (4, "Z")], ["id", "value2"])

joined_df = df1.join(df2, on="id", how="inner")
joined_df.show()
+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  2|     B|     Y|
+---+------+------+

However, if you are using an older Spark version, you can do the following:

  1. Get the columns with df.columns
  2. Use a Python set to retrieve the unique column names
  3. Use the unique columns in a select expression
unique_cols = set(joined_df.columns)

joined_df.select(*unique_cols).show()
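
Note that a Python set does not preserve column order; if order matters, one sketch (assuming the same joined_df) is to deduplicate with dict.fromkeys, which keeps the first occurrence of each name in the original order:

# Order-preserving deduplication of column names.
unique_cols = list(dict.fromkeys(joined_df.columns))
joined_df.select(*unique_cols).show()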


This is an example where you get several columns with the same name after a join:

df1 = spark.createDataFrame([(1, 'value_df1')], ['id', 'other_col'])
df2 = spark.createDataFrame([(1, 'value_df2')], ['id', 'other_col'])
df = df1.join(df2, 'id')
df.show()
# +---+---------+---------+
# | id|other_col|other_col|
# +---+---------+---------+
# |  1|value_df1|value_df2|
# +---+---------+---------+

Removing specified columns can be done in several ways.

  • drop and/or select before the join:

    df1.drop('other_col').join(df2.select('id', 'other_col'), 'id').show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df2|
    # +---+---------+
    
    df1.select('id', 'other_col').join(df2.drop('other_col'), 'id').show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df1|
    # +---+---------+
    
  • drop after the join:

    df = df1.join(df2, 'id')
    df.drop(df2.other_col).show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df1|
    # +---+---------+
    
    df.drop(df1.other_col).show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df2|
    # +---+---------+
    
  • select after the join:

    df = df1.join(df2, 'id')
    df.select('id', df1.other_col).show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df1|
    # +---+---------+
    
    df.select('id', df2.other_col).show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df2|
    # +---+---------+
    
  • Renaming columns before the join. After renaming, you can unambiguously specify which columns to drop, but this is probably more practical when you need the "duplicate" columns for some other future use:

    df = df1.join(df2.toDF(*[c if c in {'id'} else f'df2_{c}' for c in df2.columns]), 'id')
    df.show()
    # +---+---------+-------------+
    # | id|other_col|df2_other_col|
    # +---+---------+-------------+
    # |  1|value_df1|    value_df2|
    # +---+---------+-------------+
    df.selectExpr("id", "array(other_col, df2_other_col) as others").show(truncate=0)
    # +---+----------------------+
    # |id |others                |
    # +---+----------------------+
    # |1  |[value_df1, value_df2]|
    # +---+----------------------+
    

