This is my current dataset:
from pyspark.sql import Window
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import pyspark.sql.functions as psf

df = spark.createDataFrame(
    [("2", "1", 1),
     ("3", "1", 2)],
    schema=StructType([StructField("Data", StringType()),
                       StructField("Source", StringType()),
                       StructField("Date", IntegerType())]))

display(df.withColumn("Result", psf.collect_set("Data").over(Window.partitionBy("Source").orderBy("Date"))))
Output:
| Data | Source | Date | Result |
|---|---|---|---|
| 2 | 1 | 1 | ["2"] |
| 3 | 1 | 2 | ["2","3"] |
Why is the value 3 missing from the first row of the Result column when I use collect_set over a Window that is ordered?
I have tried collect_list as well, but I get the same result.
My desired output is:
| Data | Source | Date | Result |
|---|---|---|---|
| 2 | 1 | 1 | ["2","3"] |
| 3 | 1 | 2 | ["2","3"] |
where the order of values in Result is preserved: the first element corresponds to Date = 1 and the second to Date = 2.