
I have a PySpark DataFrame of students with the following schema:

Id: string
 |-- School: array
 |    |-- element: struct
 |    |    |-- Subject: string
 |    |    |-- Classes: string
 |    |    |-- Score: array
 |    |    |    |-- element: struct
 |    |    |    |    |-- ScoreID: string
 |    |    |    |    |-- Value: string

I want to extract a few fields from the DataFrame and normalize them so that I can feed them into a database. The relational schema I expect consists of the fields Id, School, Subject, ScoreId, and Value. How can I do this efficiently?


1 Answer


Use explode on the nested arrays to flatten the data, then select the required columns.

Example:
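If you want to reproduce the example end to end, here is a minimal sketch that builds an equivalent DataFrame (this assumes an active SparkSession; the values are taken from the df.show() output below, and the struct field order follows the printSchema() output):

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# nested schema mirroring the question
score_type = StructType([
    StructField("ScoreID", StringType()),
    StructField("Value", StringType()),
])
school_type = StructType([
    StructField("Classes", StringType()),
    StructField("Score", ArrayType(score_type)),
    StructField("Subject", StringType()),
])
schema = StructType([
    StructField("Id", StringType()),
    StructField("School", ArrayType(school_type)),
])

# one student with a single School entry and two scores
data = [("1", [("b", [("A", "3"), ("B", "4")], "a")])]
df = spark.createDataFrame(data, schema)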

df.show(10,False)
#+---+--------------------------+
#|Id |School                    |
#+---+--------------------------+
#|1  |[[b, [[A, 3], [B, 4]], a]]|
#+---+--------------------------+

df.printSchema()
#root
# |-- Id: string (nullable = true)
# |-- School: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- Classes: string (nullable = true)
# |    |    |-- Score: array (nullable = true)
# |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |-- ScoreID: string (nullable = true)
# |    |    |    |    |-- Value: string (nullable = true)
# |    |    |-- Subject: string (nullable = true)

df.selectExpr("Id","explode(School)").\
selectExpr("Id","col.*","explode(col.Score)").\
selectExpr("Id","Classes","Subject","col.*").\
show()
#+---+-------+-------+-------+-----+
#| Id|Classes|Subject|ScoreID|Value|
#+---+-------+-------+-------+-----+
#|  1|      b|      a|      A|    3|
#|  1|      b|      a|      B|    4|
#+---+-------+-------+-------+-----+
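The same transformation can also be written with the DataFrame API instead of SQL expressions. A sketch, assuming the df built above (the intermediate alias names school and score are arbitrary):

from pyspark.sql import functions as F

# explode the outer School array, pull the struct fields up,
# then explode the inner Score array and flatten it
df_flat = (
    df.select("Id", F.explode("School").alias("school"))
      .select(
          "Id",
          F.col("school.Classes").alias("Classes"),
          F.col("school.Subject").alias("Subject"),
          F.explode("school.Score").alias("score"),
      )
      .select("Id", "Classes", "Subject", "score.ScoreID", "score.Value")
)
df_flat.show()

The flattened result can then be persisted to a relational database with the usual writers, e.g. df_flat.write.jdbc(...) with your connection URL and target table.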