
This is my current dataset:

from pyspark.sql import Window
import pyspark.sql.functions as psf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

df = spark.createDataFrame([("2", "1", 1),
                            ("3", "1", 2)],
                           schema=StructType([StructField("Data",   StringType()),
                                              StructField("Source", StringType()),
                                              StructField("Date",   IntegerType())]))


display(df.withColumn("Result",psf.collect_set("Data").over(Window.partitionBy("Source").orderBy("Date"))))

Output:

Data  Source  Date  Result
2     1       1     ["2"]
3     1       2     ["2","3"]

Why am I missing the value 3 in the first row of the Result column when using collect_set over a Window that is ordered?

I have tried collect_list as well, but I get the same result.

My desired output is:

Data  Source  Date  Result
2     1       1     ["2","3"]
3     1       2     ["2","3"]

where the order of the values in Result is preserved: the first comes from the row where Date = 1 and the second from the row where Date = 2.


1 Answer


You need to use a Window with Window.unboundedPreceding and Window.unboundedFollowing as the frame boundaries:

Window.partitionBy("Source").orderBy("Date") \
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

By default Spark uses rangeBetween(Window.unboundedPreceding, Window.currentRow) when you have an orderBy, so each row's frame only reaches up to the current row. That is why the first row only sees ["2"].
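
For example, applying the full frame to the DataFrame above (a minimal sketch: the explicit SparkSession setup is assumed for running outside Databricks, where spark is predefined, and collect_list is used instead of collect_set because collect_set deduplicates and gives no ordering guarantee, while collect_list returns values in the frame's row order):

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as psf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()  # assumed; predefined in Databricks

df = spark.createDataFrame([("2", "1", 1),
                            ("3", "1", 2)],
                           schema=StructType([StructField("Data",   StringType()),
                                              StructField("Source", StringType()),
                                              StructField("Date",   IntegerType())]))

# Full frame: every row in the partition sees all values,
# not just the ones up to the current row.
w = (Window.partitionBy("Source")
           .orderBy("Date")
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

# collect_list keeps the frame's row order (ascending Date);
# collect_set would deduplicate and does not guarantee order.
df.withColumn("Result", psf.collect_list("Data").over(w)).show()
# +----+------+----+------+
# |Data|Source|Date|Result|
# +----+------+----+------+
# |   2|     1|   1|[2, 3]|
# |   3|     1|   2|[2, 3]|
# +----+------+----+------+

If set semantics are also needed, psf.array_distinct can be applied on top of the collected list to deduplicate while keeping the first-seen order.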


2 Comments

"By default Spark uses ...": I have not found a specification for this, do you know where this behaviour is defined?
@Beryllium You can read the note here pyspark.sql.Window
