
I have a data frame with the following structure and values:

root
 |-- Row_number: integer (nullable = false)
 |-- Factor: string (nullable = false)
 |-- Country: string (nullable = false)
 |-- Date: date (nullable = false)
 |-- Amount: integer (nullable = false)

+----------+---------------------+
|Row_number|Factor               |
+----------+---------------------+
|         1|[EN2_1, EN2_2, EN3_3]|
|         2|[EN2_1, EN2_2, EN3_3]|
|         3|[EN2_1, EN2_2, EN3_3]|
+----------+---------------------+

I want to convert it into the following data frame:

1, EN2_1
1, EN2_2
1, EN2_3
2, EN2_1
2, EN2_2
2, EN2_3
3, EN2_1
3, EN2_2
3, EN2_3

I tried to read the column as ArrayType, but it gives an error because the column is actually a string.

2 Answers


A combination of split and explode should work:

import pyspark.sql.functions as F

df2 = df.withColumn(
    "New_Factor",
    F.explode(F.split(F.regexp_replace(F.col("Factor"), r"(^\[)|(\]$)", ""), ", "))
)
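The string transformations that regexp_replace and split perform can be illustrated in plain Python (outside Spark) on a single cell value from the Factor column shown above:

```python
import re

# One cell of the string-typed Factor column
cell = "[EN2_1, EN2_2, EN3_3]"

# Equivalent of F.regexp_replace(col, r"(^\[)|(\]$)", ""):
# remove a leading '[' and a trailing ']'
stripped = re.sub(r"(^\[)|(\]$)", "", cell)

# Equivalent of F.split(col, ", "): split on comma + space
items = stripped.split(", ")
print(items)  # ['EN2_1', 'EN2_2', 'EN3_3']
```

explode then turns each element of that array into its own row, duplicating Row_number.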



You can first remove the square brackets using trim(both '[]' from Factor), then split on ', ', and explode the resulting array into rows:

import pyspark.sql.functions as F

df2 = df.withColumn(
    'Factor',
    F.explode(F.split(F.expr("trim(both '[]' from Factor)"), ', '))
)

df2.show()
+----------+------+
|Row_number|Factor|
+----------+------+
|         1| EN2_1|
|         1| EN2_2|
|         1| EN3_3|
|         2| EN2_1|
|         2| EN2_2|
|         2| EN3_3|
|         3| EN2_1|
|         3| EN2_2|
|         3| EN3_3|
+----------+------+
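The trim(both '[]' from Factor) expression behaves like Python's str.strip, which removes any of the listed characters from both ends of the string only; a plain-Python sketch of the same stripping and splitting:

```python
# str.strip("[]") mirrors SQL's trim(both '[]' from col):
# it removes '[' and ']' characters from both ends, not the middle
cell = "[EN2_1, EN2_2, EN3_3]"
items = cell.strip("[]").split(", ")
print(items)  # ['EN2_1', 'EN2_2', 'EN3_3']
```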

