Here's some code that'll help you get started:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ("hi", {"Name": "David", "Age": "25", "Location": "New York", "Height": "170", "fields": {"Color": "Blue", "Shape": "Round", "Hobby": {"Dance": "1", "Singing": "2"}, "Skills": {"Coding": "2", "Swimming": "4"}}}, "bye"),
    ("hi", {"Name": "Helen", "Age": "28", "Location": "New York", "Height": "160", "fields": {"Color": "Blue", "Shape": "Round", "Hobby": {"Dance": "5", "Singing": "6"}}}, "bye"),
]
df = spark.createDataFrame(data, ["greeting", "dic", "farewell"])
res = df.select(
    F.col("dic").getItem("Name").alias("Name"),
    F.col("dic")["Age"].alias("Age"),
)
res.show()
+-----+---+
| Name|Age|
+-----+---+
|David| 25|
|Helen| 28|
+-----+---+
res.printSchema()
root
|-- Name: string (nullable = true)
|-- Age: string (nullable = true)
Spark maps require all values to share a single type, whereas a regular Python dictionary can hold keys and values of mixed types.
We can run df.printSchema() to see how PySpark is interpreting the dictionary values:
root
|-- greeting: string (nullable = true)
|-- dic: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- farewell: string (nullable = true)
Your example dataset mixes string and dictionary values, so Spark's schema inference falls back to map<string,string> and coerces the nested dictionaries to their string representations. Run df.select(F.col("dic").getItem("fields")).printSchema() to see:
root
|-- dic[fields]: string (nullable = true)
There might be some way to parse that string and convert it back to a map, but it'd be costly. Can you add a printSchema of your actual DataFrame to your question? You might need to restructure your data so the answer is a little easier ;)
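If restructuring is an option, one approach is to pass an explicit schema to createDataFrame so the nested dictionary keeps its structure instead of being coerced to a string. This is just a sketch against the sample data above; the field names and types are my guess at your real layout (Skills is missing for some rows, so it comes back null):

from pyspark.sql.types import StructType, StructField, StringType, MapType

# Explicit schema: "fields" stays a struct, and the innermost
# dictionaries become proper maps instead of stringified blobs.
schema = StructType([
    StructField("greeting", StringType()),
    StructField("dic", StructType([
        StructField("Name", StringType()),
        StructField("Age", StringType()),
        StructField("Location", StringType()),
        StructField("Height", StringType()),
        StructField("fields", StructType([
            StructField("Color", StringType()),
            StructField("Shape", StringType()),
            StructField("Hobby", MapType(StringType(), StringType())),
            StructField("Skills", MapType(StringType(), StringType())),
        ])),
    ])),
    StructField("farewell", StringType()),
])

df2 = spark.createDataFrame(data, schema)
# Nested fields are now directly addressable with dot notation:
df2.select("dic.Name", "dic.fields.Hobby").show(truncate=False)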
Is the `dic` column always the same? Does `dic` have the same structure for every row of data?