Data:
Name1            Name2            Name3(Expected)
RR Industries    null             RR Industries
RR Industries    RR Industries    RR IndustriesRR Industries

Code:

.withColumn("Name3", F.concat(F.trim(F.col("Name1")), F.trim(F.col("Name2"))))

Actual result: whenever one of the columns is null, the concatenated value is null. I want the output to be as seen in the Name3(Expected) column.

I think the issue occurs after joining the tables. The name column is present in both df2 and df3; before the join, neither contains null values.

Issue: after joining, since PySpark doesn't drop the common columns, we have two name1 columns from the two tables. I tried replacing the nulls with an empty string, but it didn't work and throws an error.

How do I replace null values with an empty string after joining tables?

df = df1 \
    .join(df2, "code", how='left') \
    .join(df3, "id", how='left') \
    .join(df4, "id", how='left') \
    .withColumn('name1', F.when(df2['name1'].isNull(), '').otherwise(df2['name1'])) \
    .withColumn('name1', F.when(df3['name1'].isNull(), '').otherwise(df3['name1'])) \
    .withColumn("Name1", F.concat(F.trim(df2['name1']), F.trim(df3['name1'])))
  • If any of the columns in your concat statement are null, the result of the concat is null; that's how it works. Use coalesce to replace the null values with an empty string, and use that for your concat. Commented Jun 1, 2020 at 15:06
  • df.fillna is not working... any other examples which I can try? Commented Jun 1, 2020 at 17:35
  • df = df.withColumn('name2', F.when(F.col('name2').isNull(), ' ').otherwise(F.col('name2'))) doesn't work either Commented Jun 1, 2020 at 17:40

2 Answers


Try this:

It can be adapted to Python with minimal changes (the snippet below is Scala).

    import org.apache.spark.sql.functions._
    import spark.implicits._ // needed for .toDS()

    val data =
      """
        |Name1         |   Name2
        |RR Industries |
        |RR Industries |   RR Industries
      """.stripMargin

    // split the raw string into CSV lines, trimming each cell
    val stringDS = data.split(System.lineSeparator())
      .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
      .toSeq.toDS()
    val df = spark.read
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .csv(stringDS)
    df.show(false)
    df.printSchema()

    /**
      * +-------------+-------------+
      * |Name1        |Name2        |
      * +-------------+-------------+
      * |RR Industries|null         |
      * |RR Industries|RR Industries|
      * +-------------+-------------+
      *
      * root
      * |-- Name1: string (nullable = true)
      * |-- Name2: string (nullable = true)
      */
    df.withColumn("Name3(Expected)", concat_ws("", df.columns.map(col).map(c => coalesce(c, lit(""))): _*))
      .show(false)

    /**
      * +-------------+-------------+--------------------------+
      * |Name1        |Name2        |Name3(Expected)           |
      * +-------------+-------------+--------------------------+
      * |RR Industries|null         |RR Industries             |
      * |RR Industries|RR Industries|RR IndustriesRR Industries|
      * +-------------+-------------+--------------------------+
      */
    df.withColumn("Name3(Expected)", concat_ws("", df.columns.map(col): _*))
      .show(false)

    /**
      * +-------------+-------------+--------------------------+
      * |Name1        |Name2        |Name3(Expected)           |
      * +-------------+-------------+--------------------------+
      * |RR Industries|null         |RR Industries             |
      * |RR Industries|RR Industries|RR IndustriesRR Industries|
      * +-------------+-------------+--------------------------+
      */

2 Comments

Do I need to download any libraries for map? I am using PySpark and .map(col) gives an error.
No need to download any libraries. I'm converting all the column names to columns; here col means functions.col.

You can try this approach in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName('practice') \
    .getOrCreate()

sc = spark.sparkContext

df = sc.parallelize([
    ("RR Industries", None),
    ("RR Industries", "RR Industries")]).toDF(["Name1", "Name2"])

# concat_ws skips null values, so no extra null handling is needed
df.withColumn("Name3", F.concat_ws("", F.col("Name1"),
    F.col("Name2"))).show(truncate=False)

+-------------+-------------+--------------------------+
|Name1        |Name2        |Name3                     |
+-------------+-------------+--------------------------+
|RR Industries|null         |RR Industries             |
|RR Industries|RR Industries|RR IndustriesRR Industries|
+-------------+-------------+--------------------------+

