
This is my current dataset:

from pyspark.sql import Window
import pyspark.sql.functions as psf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

df = spark.createDataFrame([("2", "1", 1),
                            ("3", "1", 2)],
                           schema=StructType([StructField("Data",   StringType()),
                                              StructField("Source", StringType()),
                                              StructField("Date",   IntegerType())]))


display(df.withColumn("Result",psf.collect_set("Data").over(Window.partitionBy("Source").orderBy("Date"))))

Output:

Data  Source  Date  Result
2     1       1     ["2"]
3     1       2     ["2","3"]

Why am I missing the value 3 in the first row of the Result column when using collect_set over a Window that is ordered?

I have tried collect_list as well, but I get the same result.

My desired output is:

Data  Source  Date  Result
2     1       1     ["2","3"]
3     1       2     ["2","3"]

where the order of the values in Result is preserved: the first comes from the row where Date = 1 and the second from the row where Date = 2.


1 Answer


You need to use a Window with Window.unboundedPreceding and Window.unboundedFollowing as the frame boundaries:

Window.partitionBy("Source").orderBy("Date") \
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

By default Spark uses rangeBetween(Window.unboundedPreceding, Window.currentRow) when you have an orderBy, so each row's frame only reaches up to the current row. That is why the first row only sees ["2"].
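
For example, applying the full frame to the DataFrame above (a minimal sketch: the explicit SparkSession setup is assumed for running outside Databricks, where spark is predefined, and collect_list is used instead of collect_set because collect_set deduplicates and gives no ordering guarantee, while collect_list returns values in the frame's row order):

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as psf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()  # assumed; predefined in Databricks

df = spark.createDataFrame([("2", "1", 1),
                            ("3", "1", 2)],
                           schema=StructType([StructField("Data",   StringType()),
                                              StructField("Source", StringType()),
                                              StructField("Date",   IntegerType())]))

# Full frame: every row in the partition sees all values,
# not just the ones up to the current row.
w = (Window.partitionBy("Source")
           .orderBy("Date")
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

# collect_list keeps the frame's row order (ascending Date);
# collect_set would deduplicate and does not guarantee order.
df.withColumn("Result", psf.collect_list("Data").over(w)).show()
# +----+------+----+------+
# |Data|Source|Date|Result|
# +----+------+----+------+
# |   2|     1|   1|[2, 3]|
# |   3|     1|   2|[2, 3]|
# +----+------+----+------+

If set semantics are also needed, psf.array_distinct can be applied on top of the collected list to deduplicate while keeping the first-seen order.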


2 Comments

"By default Spark uses ...": I have not found a specification for this, do you know where this behaviour is defined?
@Beryllium You can read the note here pyspark.sql.Window
