
I have a dataset like the table below. (Within each ID, the different columns contain the same number of elements, but the number of elements varies from ID to ID.)

+---+-------+-------+---------+
| id|   col1|   col2|     col3|
+---+-------+-------+---------+
|  x|  1,2,3|  a,b,c|    4,5,6|
|  y|1,2,X,7|a,b,F,8|4,5,100,9|
|  z|    1,2|    T,b|      4,5|
+---+-------+-------+---------+

And I would like to transform this dataset into the table below. That is, I want to 'explode'/expand the cell values per ID into multiple rows while preserving the other columns.

+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
|  x|   1|   a|   4|
|  x|   2|   b|   5|
|  x|   3|   c|   6|
|  y|   1|   a|   4|
|  y|   2|   b|   5|
|  y|   X|   F| 100|
|  y|   7|   8|   9|
|  z|   1|   T|   4|
|  z|   2|   b|   5|
+---+----+----+----+

Now I have tried to explode the columns with the following script:

from pyspark.sql import functions as F

df = df.withColumn("col1", F.explode(F.split("col1", ",")))\
       .withColumn("col2", F.explode(F.split("col2", ",")))\
       .withColumn("col3", F.explode(F.split("col3", ",")))

However, this code repeats the first column (col1) n times, the second column (col2) m times and the third column (col3) p times, so I end up with n*m*p rows instead of only n rows.
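To illustrate on a tiny made-up example (a single ID with two elements per column), each explode cross-joins the rows produced so far with the elements of the next column:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical data: 2 elements x 2 elements -> 4 rows instead of 2
demo = spark.createDataFrame([("x", "1,2", "a,b")], ["ID", "col1", "col2"])
(demo
 .withColumn("col1", F.explode(F.split("col1", ",")))
 .withColumn("col2", F.explode(F.split("col2", ",")))
 .show())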

I have also tried to split the dataset into one part per column, explode each part, drop duplicates within each part, and join the parts back together afterwards. However, the join is very memory-inefficient and this never actually finished executing:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("ID").orderBy("ID")
part1 = (df.select(F.explode(F.split("col1", ",")).alias("col1"),
                   F.concat("ID", F.row_number().over(w).cast("string")).alias("ID"))
           .dropDuplicates())
part2 = (df.select(F.explode(F.split("col2", ",")).alias("col2"),
                   F.concat("ID", F.row_number().over(w).cast("string")).alias("ID"))
           .dropDuplicates())
df = part1.join(part2, part1.ID == part2.ID).select(part1.ID, part1.col1, part2.col2)
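A position-based variant of this idea (a rough sketch using posexplode, so each exploded part can be joined back on ID plus element position) avoids the window/concat trickery, but it still relies on joins:

from pyspark.sql import functions as F

# sketch: keep each element's position, then join the parts on (ID, pos)
p1 = df.select("ID", F.posexplode(F.split("col1", ",")).alias("pos", "col1"))
p2 = df.select("ID", F.posexplode(F.split("col2", ",")).alias("pos", "col2"))
p3 = df.select("ID", F.posexplode(F.split("col3", ",")).alias("pos", "col3"))
df_joined = (p1.join(p2, ["ID", "pos"])
               .join(p3, ["ID", "pos"])
               .drop("pos"))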

I have also come across the following one-liner (for one column) for the same problem:

from pyspark.sql import functions as F

df = df.select(F.expr("stack(col1)"))

However, the error I receive is the following: there is a data type mismatch in "stack(col1)": stack requires at least 2 arguments.
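(For reference, stack(n, expr1, ..., exprk) is an unpivot-style generator that separates a fixed set of expressions into n rows, which is why it needs at least the row count plus one expression; a minimal sketch on hypothetical one-row data:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

demo = spark.createDataFrame([("x", "1", "a", "4")], ["ID", "col1", "col2", "col3"])
# turns the three columns into three rows of a single column
demo.selectExpr("ID", "stack(3, col1, col2, col3)").show()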

Is there any way I could achieve my desired output?

Any feedback would be much appreciated. Thanks.

1 Answer


Try zipping the arrays of those columns (after split) with arrays_zip, then explode the resulting array of structs using inline:

from pyspark.sql import functions as F

df = (df
      .selectExpr("id",
                  "split(col1, ',') col1",
                  "split(col2, ',') col2",
                  "split(col3, ',') col3")
      .withColumn("arr", F.expr("arrays_zip(col1, col2, col3)"))
      .selectExpr("id", "inline(arr)"))
df.show()

# +---+----+----+----+
# | id|col1|col2|col3|
# +---+----+----+----+
# |  x|   1|   a|   4|
# |  x|   2|   b|   5|
# |  x|   3|   c|   6|
# |  y|   1|   a|   4|
# |  y|   2|   b|   5|
# |  y|   X|   F| 100|
# |  y|   7|   8|   9|
# |  z|   1|   T|   4|
# |  z|   2|   b|   5|
# +---+----+----+----+
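If you prefer to stay in the DataFrame API, an equivalent sketch starting from the original df (assuming Spark 2.4+; when arrays_zip is given column references, the struct fields take those columns' names) would be:

from pyspark.sql import functions as F

exploded = (df
            .withColumn("col1", F.split("col1", ","))
            .withColumn("col2", F.split("col2", ","))
            .withColumn("col3", F.split("col3", ","))
            .withColumn("arr", F.explode(F.arrays_zip("col1", "col2", "col3")))
            .select("id", "arr.col1", "arr.col2", "arr.col3"))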

3 Comments

Hi AdibP, thanks a million for your solution. I have tried this now and it works perfectly well.
Hi AdibP, do you think you could extend this solution to a scenario where there are null values in the initial table?
To preserve rows with null values, try using inline_outer instead of inline. If that fails, post a new question so we can get a better look at your data.
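For illustration, a quick sketch of that suggestion on made-up data (an id whose columns are all null): split of a null string gives null, arrays_zip of nulls gives null, and inline_outer keeps that row as nulls where inline would drop it.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

demo = spark.createDataFrame(
    [("x", "1,2", "a,b", "4,5"), ("y", None, None, None)],
    ["id", "col1", "col2", "col3"])

(demo
 .selectExpr("id", "split(col1, ',') col1", "split(col2, ',') col2", "split(col3, ',') col3")
 .withColumn("arr", F.expr("arrays_zip(col1, col2, col3)"))
 .selectExpr("id", "inline_outer(arr)")  # id 'y' is kept as a row of nulls
 .show())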
