
I have a Pyspark dataframe that looks like this:

[screenshot: input dataframe with a nested dictionary "dic" column]

I would like to extract the nested dictionaries in the "dic" column and transform them into PySpark dataframe columns, like this:

[screenshot: desired flattened output dataframe]

Also, the keys vary from row to row, i.e., some rows may have fields that other rows don't. I would like to include all the fields, and if a record doesn't have a certain field/key, its value can be shown as null.

Please let me know how I can achieve this.

Thanks!

  • Are the keys in the dic column always the same? Does dic have the same structure for every row of data? Commented Jul 21, 2020 at 21:16
  • @Powers I believe there are some variations in these rows Commented Jul 21, 2020 at 21:22
  • cool, feel free to update the question with a representative set of the variations that the solution should be able to handle. Commented Jul 21, 2020 at 21:26
  • @Powers I just made the update. Commented Jul 21, 2020 at 21:33
  • Does this answer your question? Transform nested dictionary key values to pyspark dataframe Commented Jul 22, 2020 at 6:23

1 Answer


Here's some code that'll help you get started:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ("hi", {"Name": "David", "Age": "25", "Location": "New York", "Height": "170", "fields": {"Color": "Blue", "Shape": "Round", "Hobby": {"Dance": "1", "Singing": "2"}, "Skills": {"Coding": "2", "Swimming": "4"}}}, "bye"),
    ("hi", {"Name": "Helen", "Age": "28", "Location": "New York", "Height": "160", "fields": {"Color": "Blue", "Shape": "Round", "Hobby": {"Dance": "5", "Singing": "6"}}}, "bye"),
]
df = spark.createDataFrame(data, ["greeting", "dic", "farewell"])
res = df.select(
    F.col("dic").getItem("Name").alias("Name"),  # look up a key in the map column
    F.col("dic")["Age"].alias("Age"),            # equivalent bracket syntax
)
res.show()

+-----+---+
| Name|Age|
+-----+---+
|David| 25|
|Helen| 28|
+-----+---+
res.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)

Spark maps require all values to share a single type, whereas a plain Python dictionary can mix value types freely.

We can run df.printSchema() to see how PySpark is interpreting the dictionary values:

root
 |-- greeting: string (nullable = true)
 |-- dic: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- farewell: string (nullable = true)

Your example dataset has a mix of string and dictionary values. Run df.select(F.col("dic").getItem("fields")).printSchema() to see:

root
 |-- dic[fields]: string (nullable = true)

There might be some way to parse that string back into a map, but that'd be costly. Can you add the printSchema() output to your question? You might need to restructure your data so the answer is a little easier ;)


2 Comments

I have made an update to the sample data. Sorry, I was writing it in haste.
They should all be dictionary values. Sorry that I can't provide real data.
