
I'm trying to remove one column when there are multiple columns with the same name in a Spark DataFrame after a join operation.

df.printSchema()
 |-- Id: string
 |-- Name: string
 |-- Country: string
 |-- Id: string

I would like to remove the second occurrence of the Id column using PySpark.

  • Obtain the list of column names and find the index of the column you want to drop, then drop that column using the index (a sketch of this is below). Commented Mar 11 at 14:25
  • I did this, but it's dropping both columns even though the index is different for the same column name. Commented Mar 11 at 18:54
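
A minimal sketch of the index-based approach from the comment above (the index 3 and the _c{i} placeholder names are assumptions for illustration): dropping by a name string removes every column with that name, so one option is to temporarily rename all columns to unique placeholders, drop the unwanted position, and restore the original names.

# Assumption: df is the joined DataFrame and the duplicate "Id" sits at index 3.
cols = df.columns                                  # ['Id', 'Name', 'Country', 'Id']
drop_idx = 3                                       # position of the column to remove

# Give every column a temporary unique name so the drop is unambiguous.
tmp = df.toDF(*[f"_c{i}" for i in range(len(cols))])
keep = [i for i in range(len(cols)) if i != drop_idx]

# Keep the surviving positions and restore their original names.
result = tmp.select([f"_c{i}" for i in keep]).toDF(*[cols[i] for i in keep])
result.printSchema()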

2 Answers


In Spark 3.3.* this is done by default when you join on the column name:

df1 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "value1"])
df2 = spark.createDataFrame([(1, "X"), (2, "Y"), (4, "Z")], ["id", "value2"])

joined_df = df1.join(df2, on="id", how="inner")
joined_df.show()
+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  2|     B|     Y|
+---+------+------+

However, if you are using an older Spark version, you can do the following:

  1. Get the columns with df.columns
  2. Use a Python set to retrieve the unique column names
  3. Use the unique columns in a select expression
unique_cols = set(joined_df.columns)

joined_df.select(*unique_cols).show()
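
Note that a Python set does not preserve column order; if order matters, one sketch (assuming the same joined_df) is to deduplicate with dict.fromkeys, which keeps the first occurrence of each name in the original order:

# Order-preserving deduplication of column names.
unique_cols = list(dict.fromkeys(joined_df.columns))
joined_df.select(*unique_cols).show()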


This is an example where you get several columns with the same name after a join:

df1 = spark.createDataFrame([(1, 'value_df1')], ['id', 'other_col'])
df2 = spark.createDataFrame([(1, 'value_df2')], ['id', 'other_col'])
df = df1.join(df2, 'id')
df.show()
# +---+---------+---------+
# | id|other_col|other_col|
# +---+---------+---------+
# |  1|value_df1|value_df2|
# +---+---------+---------+

Removing specified columns can be done in several ways.

  • drop and/or select before the join:

    df1.drop('other_col').join(df2.select('id', 'other_col'), 'id').show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df2|
    # +---+---------+
    
    df1.select('id', 'other_col').join(df2.drop('other_col'), 'id').show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df1|
    # +---+---------+
    
  • drop after the join:

    df = df1.join(df2, 'id')
    df.drop(df2.other_col).show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df1|
    # +---+---------+
    
    df.drop(df1.other_col).show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df2|
    # +---+---------+
    
  • select after the join:

    df = df1.join(df2, 'id')
    df.select('id', df1.other_col).show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df1|
    # +---+---------+
    
    df.select('id', df2.other_col).show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df2|
    # +---+---------+
    
  • Renaming columns before the join. After renaming, you can unambiguously specify which columns to drop, but this is probably more practical when you need the "duplicate" columns for some other future use:

    df = df1.join(df2.toDF(*[c if c in {'id'} else f'df2_{c}' for c in df2.columns]), 'id')
    df.show()
    # +---+---------+-------------+
    # | id|other_col|df2_other_col|
    # +---+---------+-------------+
    # |  1|value_df1|    value_df2|
    # +---+---------+-------------+
    df.selectExpr("id", "array(other_col, df2_other_col) as others").show(truncate=0)
    # +---+----------------------+
    # |id |others                |
    # +---+----------------------+
    # |1  |[value_df1, value_df2]|
    # +---+----------------------+
    

