
I have two DataFrames and I'm using collect_set() in agg() after a groupby(). What's the best way to flatten the resulting nested arrays after aggregating?

from pyspark.sql.functions import collect_set

schema = ['col1', 'col2', 'col3', 'col4']

a = [[1, [23, 32], [11, 22], [9989]]]

df1 = spark.createDataFrame(a, schema=schema)

b = [[1, [34], [43, 22], [888, 777]]]

df2 = spark.createDataFrame(b, schema=schema)

df = df1.union(
        df2
    ).groupby(
        'col1'
    ).agg(
        collect_set('col2').alias('col2'),
        collect_set('col3').alias('col3'),
        collect_set('col4').alias('col4')
    )

df.collect()

I'm getting this as output:

[Row(col1=1, col2=[[34], [23, 32]], col3=[[11, 22], [43, 22]], col4=[[9989], [888, 777]])]

But I want this as output:

[Row(col1=1, col2=[23, 32, 34], col3=[11, 22, 43], col4=[9989, 888, 777])]

2 Answers


You can use udf:

from itertools import chain
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

# Flatten the nested arrays and deduplicate, to match the desired output
flatten = udf(lambda x: list(set(chain.from_iterable(x))), ArrayType(IntegerType()))

df.withColumn('col2_flat', flatten('col2'))
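
To produce the full desired row, the same UDF can be applied to each aggregated column in turn. A minimal sketch, assuming the df from the question:

for c in ['col2', 'col3', 'col4']:
    df = df.withColumn(c, flatten(c))

df.collect()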



Without a UDF, this should work (Spark 2.4+, where flatten and array_distinct are available):

from pyspark.sql.functions import array_distinct, flatten

df.withColumn('col2_flat', array_distinct(flatten('col2')))

It flattens the nested arrays and then removes duplicates.
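
Applied to all three aggregated columns at once, a minimal sketch (again assuming the df from the question) could be:

from pyspark.sql.functions import array_distinct, flatten

df.select(
    'col1',
    *[array_distinct(flatten(c)).alias(c) for c in ['col2', 'col3', 'col4']]
).collect()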

