
My input df:

+-------------------+------------+
|        windowStart|      nodeId|
+-------------------+------------+
|2022-03-11 14:00:00|1           |
|2022-03-11 15:00:00|2           |
|2022-03-11 16:00:00|3           |
+-------------------+------------+

I would like to duplicate each row, with the copy taking the windowStart value of the subsequent row, so the output should look like this:

+-------------------+------------+
|        windowStart|      nodeId|
+-------------------+------------+
|2022-03-11 14:00:00|1           |
|2022-03-11 15:00:00|1           |
|2022-03-11 15:00:00|2           |
|2022-03-11 16:00:00|2           |
|2022-03-11 16:00:00|3           |
+-------------------+------------+

How can I achieve that? Thanks!

1 Answer

df = spark.createDataFrame(
    [
        ('2022-03-11 14:00:00', '1'),
        ('2022-03-11 15:00:00', '2'),
        ('2022-03-11 16:00:00', '3')
    ], ['windowStart', 'nodeId'])

from pyspark.sql import Window as W
from pyspark.sql import functions as F

# No partitionBy here, so Spark pulls all rows into a single partition;
# fine for a small example, but add a partition key on real data.
w = W.orderBy('windowStart')

# For each row, take the next row's windowStart (F.lead); the last row
# has no successor, so its null is filtered out.
df_lead = df\
    .withColumn('lead', F.lead(F.col('windowStart'), 1).over(w))\
    .select(F.col('lead').alias('windowStart'), 'nodeId')\
    .filter(F.col('windowStart').isNotNull())

df.union(df_lead)\
    .orderBy('windowStart', 'nodeId')\
    .show()
+-------------------+------+
|        windowStart|nodeId|
+-------------------+------+
|2022-03-11 14:00:00|     1|
|2022-03-11 15:00:00|     1|
|2022-03-11 15:00:00|     2|
|2022-03-11 16:00:00|     2|
|2022-03-11 16:00:00|     3|
+-------------------+------+
