1

I have a dataframe with following schema:

id         : int,
emp_details: Array(String)

Some sample data:

1, Array(empname=xxx,city=yyy,zip=12345)
2, Array(empname=bbb,city=bbb,zip=22345)

This data is there in a dataframe and I need to read emp_details from the array and assign it to new columns as below or if I can split this array to multiple columns with column names as empname,city and zip:

.withColumn("empname", xxx)
.withColumn("city", yyy)
.withColumn("zip", 12345)

Could you please guide how we can achieve this by using Spark (1.6) Scala.

Really appreciate your help...

Thanks a lot

0

1 Answer 1

7

You can use withColumn and split to get the required data

df1.withColumn("empname", split($"emp_details" (0), "=")(1))
  .withColumn("city", split($"emp_details" (1), "=")(1))
  .withColumn("zip", split($"emp_details" (2), "=")(1)) 

Output:

+---+----------------------------------+-------+----+-----+
|id |emp_details                       |empname|city|zip  |
+---+----------------------------------+-------+----+-----+
|1  |[empname=xxx, city=yyy, zip=12345]|xxx    |yyy |12345|
|2  |[empname=bbb, city=bbb, zip=22345]|bbb    |bbb |22345|
+---+----------------------------------+-------+----+-----+

UPDATE:
If you don't have fixed sequence of data in array then you can use UDF to convert to map and use it as

val getColumnsUDF = udf((details: Seq[String]) => {
  val detailsMap = details.map(_.split("=")).map(x => (x(0), x(1))).toMap
  (detailsMap("empname"), detailsMap("city"),detailsMap("zip"))
})

Now use the udf

df1.withColumn("emp",getColumnsUDF($"emp_details"))
 .select($"id", $"emp._1".as("empname"), $"emp._2".as("city"), $"emp._3".as("zip"))
 .show(false)

Output:

+---+-------+----+---+
|id |empname|city|zip|
+---+-------+----+---+
|1  |xxx    |xxx |xxx|
|2  |bbb    |bbb |bbb|
+---+-------+----+---+

Hope this helps!

Sign up to request clarification or add additional context in comments.

13 Comments

Thank you for quick response and I really appreciate the help. I have one more observation, if the values in an array always not in same order. For example, I may get one array as (city,zip,empname), other record I may get as (empname,city,zip). if these positions are changing dynamically, how can we read it? is there any option to read based on the name..
@Bab please check the updated answer. Now it wont matter how the data is in array.
Hi, For continuing above question, some times some key value pair may not come in the column(array of string) in a dataframe. How can we handle it dynamically. For example, in some records zip will be there or may not be there. Do we have any option to handle such situation. Thanks in Advance
That depends on you, You can return null or empty string by checking in the map.
If you don't mind, can you give example to return null if key is not present.
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.