
I have a Pyspark dataframe that looks like this:

[screenshot: input dataframe with a nested dictionary "dic" column]

I would like to extract the nested dictionaries in the "dic" column and transform them into PySpark dataframe columns, like this:

[screenshot: desired flattened output dataframe]

Also, the keys vary from row to row, i.e., some rows may have fields that other rows don't. I would like to include all the fields, and if a record doesn't have a certain field/key, its value can be shown as null.

Please let me know how I can achieve this.

Thanks!

  • Are the keys in the dic column always the same? Does dic have the same structure for every row of data? Commented Jul 21, 2020 at 21:16
  • @Powers I believe there are some variations in these rows Commented Jul 21, 2020 at 21:22
  • cool, feel free to update the question with a representative set of the variations that the solution should be able to handle. Commented Jul 21, 2020 at 21:26
  • @Powers I just made the update. Commented Jul 21, 2020 at 21:33
  • Does this answer your question? Transform nested dictionary key values to pyspark dataframe Commented Jul 22, 2020 at 6:23

1 Answer


Here's some code that'll help you get started:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ("hi", {"Name": "David", "Age": "25", "Location": "New York", "Height": "170", "fields": {"Color": "Blue", "Shape": "Round", "Hobby": {"Dance": "1", "Singing": "2"}, "Skills": {"Coding": "2", "Swimming": "4"}}}, "bye"),
    ("hi", {"Name": "Helen", "Age": "28", "Location": "New York", "Height": "160", "fields": {"Color": "Blue", "Shape": "Round", "Hobby": {"Dance": "5", "Singing": "6"}}}, "bye"),
]
df = spark.createDataFrame(data, ["greeting", "dic", "farewell"])
res = df.select(
    F.col("dic").getItem("Name").alias("Name"),  # look up a key in the map column
    F.col("dic")["Age"].alias("Age"),            # equivalent bracket syntax
)
res.show()

+-----+---+
| Name|Age|
+-----+---+
|David| 25|
|Helen| 28|
+-----+---+
res.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)

Spark maps require all values to share a single type, whereas a plain Python dictionary can mix value types freely.

We can run df.printSchema() to see how PySpark is interpreting the dictionary values:

root
 |-- greeting: string (nullable = true)
 |-- dic: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- farewell: string (nullable = true)

Your example dataset has a mix of string and dictionary values. Run df.select(F.col("dic").getItem("fields")).printSchema() to see:

root
 |-- dic[fields]: string (nullable = true)

There might be some way to parse that string back into a map, but that'd be costly. Can you add the printSchema() output to your question? You might need to restructure your data so the answer is a little easier ;)


2 Comments

I have made an update to the sample data. Sorry, I was writing it in haste.
They should all be dictionary values. Sorry that I can't provide real data.
