Data:
Name1            Name2            Name3(Expected)
RR Industries    null             RR Industries
RR Industries    RR Industries    RR IndustriesRR Industries

Code:

.withColumn("Name3", F.concat(F.trim(F.col("Name1")), F.trim(F.col("Name2"))))

Actual result: whenever one of the columns is null, the concatenated value is null. I want the output to be as seen in the Name3(Expected) column.

I think the issue occurs after joining the tables. The name column is present in both df2 and df3; before the join, neither contains null values.

Issue: after joining, since PySpark doesn't drop the common columns, we have two name1 columns from the two tables. I tried replacing the nulls with an empty string, but it didn't work and throws an error.

How do I replace null values with an empty string after joining tables?

df = df1 \
    .join(df2, "code", how='left') \
    .join(df3, "id", how='left') \
    .join(df4, "id", how='left') \
    .withColumn('name1', F.when(df2['name1'].isNull(), '').otherwise(df2['name1'])) \
    .withColumn('name1', F.when(df3['name1'].isNull(), '').otherwise(df3['name1'])) \
    .withColumn("Name1", F.concat(F.trim(df2['name1']), F.trim(df3['name1'])))
  • If any of the columns in your concat statement are null, the result of the concat is null; that's how it works. Use coalesce to replace the null values with an empty string, and use that for your concat. Commented Jun 1, 2020 at 15:06
  • df.fillna is not working... any other examples which I can try? Commented Jun 1, 2020 at 17:35
  • df = df.withColumn('name2', F.when(F.col('name2').isNull(), ' ').otherwise(F.col('name2'))) doesn't work either Commented Jun 1, 2020 at 17:40

2 Answers


Try this:

It can be adapted to Python with minimal changes (the snippet below is Scala).

    import org.apache.spark.sql.functions._
    import spark.implicits._ // needed for .toDS()

    val data =
      """
        |Name1         |   Name2
        |RR Industries |
        |RR Industries |   RR Industries
      """.stripMargin

    // split the raw string into CSV lines, trimming each cell
    val stringDS = data.split(System.lineSeparator())
      .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
      .toSeq.toDS()
    val df = spark.read
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .csv(stringDS)
    df.show(false)
    df.printSchema()

    /**
      * +-------------+-------------+
      * |Name1        |Name2        |
      * +-------------+-------------+
      * |RR Industries|null         |
      * |RR Industries|RR Industries|
      * +-------------+-------------+
      *
      * root
      * |-- Name1: string (nullable = true)
      * |-- Name2: string (nullable = true)
      */
    df.withColumn("Name3(Expected)", concat_ws("", df.columns.map(col).map(c => coalesce(c, lit(""))): _*))
      .show(false)

    /**
      * +-------------+-------------+--------------------------+
      * |Name1        |Name2        |Name3(Expected)           |
      * +-------------+-------------+--------------------------+
      * |RR Industries|null         |RR Industries             |
      * |RR Industries|RR Industries|RR IndustriesRR Industries|
      * +-------------+-------------+--------------------------+
      */
    df.withColumn("Name3(Expected)", concat_ws("", df.columns.map(col): _*))
      .show(false)

    /**
      * +-------------+-------------+--------------------------+
      * |Name1        |Name2        |Name3(Expected)           |
      * +-------------+-------------+--------------------------+
      * |RR Industries|null         |RR Industries             |
      * |RR Industries|RR Industries|RR IndustriesRR Industries|
      * +-------------+-------------+--------------------------+
      */

2 Comments

Do I need to download any libraries for map? I am using PySpark and .map(col) gives an error.
No need to download any libraries. I'm converting all the column names to columns; here col means functions.col.

You can try this approach in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName('practice') \
    .getOrCreate()

sc = spark.sparkContext

df = sc.parallelize([
    ("RR Industries", None),
    ("RR Industries", "RR Industries")]).toDF(["Name1", "Name2"])

# concat_ws skips null values, so no extra null handling is needed
df.withColumn("Name3", F.concat_ws("", F.col("Name1"),
    F.col("Name2"))).show(truncate=False)

+-------------+-------------+--------------------------+
|Name1        |Name2        |Name3                     |
+-------------+-------------+--------------------------+
|RR Industries|null         |RR Industries             |
|RR Industries|RR Industries|RR IndustriesRR Industries|
+-------------+-------------+--------------------------+

