
I have a data frame with the following structure and values:

root
 |-- Row_number: integer (nullable = false)
 |-- Factor: string (nullable = false)
 |-- Country: string (nullable = false)
 |-- Date: date (nullable = false)
 |-- Amount: integer (nullable = false)

+----------+---------------------+
|Row_number|Factor               |
+----------+---------------------+
|         1|[EN2_1, EN2_2, EN3_3]|
|         2|[EN2_1, EN2_2, EN3_3]|
|         3|[EN2_1, EN2_2, EN3_3]|
+----------+---------------------+

I want to convert it into the following data frame:

1, EN2_1
1, EN2_2
1, EN2_3
2, EN2_1
2, EN2_2
2, EN2_3
3, EN2_1
3, EN2_2
3, EN2_3

I tried to read the column as ArrayType, but it gives an error because the column is actually a string.

2 Answers


A combination of split and explode should work:

import pyspark.sql.functions as F

df2 = df.withColumn(
    "New_Factor",
    F.explode(F.split(F.regexp_replace(F.col("Factor"), r"(^\[)|(\]$)", ""), ", "))
)
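The string transformations that regexp_replace and split perform can be illustrated in plain Python (outside Spark) on a single cell value from the Factor column shown above:

```python
import re

# One cell of the string-typed Factor column
cell = "[EN2_1, EN2_2, EN3_3]"

# Equivalent of F.regexp_replace(col, r"(^\[)|(\]$)", ""):
# remove a leading '[' and a trailing ']'
stripped = re.sub(r"(^\[)|(\]$)", "", cell)

# Equivalent of F.split(col, ", "): split on comma + space
items = stripped.split(", ")
print(items)  # ['EN2_1', 'EN2_2', 'EN3_3']
```

explode then turns each element of that array into its own row, duplicating Row_number.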



You can first remove the square brackets using trim(both '[]' from Factor), then split on ', ', and explode the resulting array into rows:

import pyspark.sql.functions as F

df2 = df.withColumn(
    'Factor',
    F.explode(F.split(F.expr("trim(both '[]' from Factor)"), ', '))
)

df2.show()
+----------+------+
|Row_number|Factor|
+----------+------+
|         1| EN2_1|
|         1| EN2_2|
|         1| EN3_3|
|         2| EN2_1|
|         2| EN2_2|
|         2| EN3_3|
|         3| EN2_1|
|         3| EN2_2|
|         3| EN3_3|
+----------+------+
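The trim(both '[]' from Factor) expression behaves like Python's str.strip, which removes any of the listed characters from both ends of the string only; a plain-Python sketch of the same stripping and splitting:

```python
# str.strip("[]") mirrors SQL's trim(both '[]' from col):
# it removes '[' and ']' characters from both ends, not the middle
cell = "[EN2_1, EN2_2, EN3_3]"
items = cell.strip("[]").split(", ")
print(items)  # ['EN2_1', 'EN2_2', 'EN3_3']
```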

