
My input df:

+-------------------+------------+
|        windowStart|      nodeId|
+-------------------+------------+
|2022-03-11 14:00:00|1           |
|2022-03-11 15:00:00|2           |
|2022-03-11 16:00:00|3           |
+-------------------+------------+

I would like to duplicate each row, with the copy taking the windowStart value of the subsequent row, so the output should look like this:

+-------------------+------------+
|        windowStart|      nodeId|
+-------------------+------------+
|2022-03-11 14:00:00|1           |
|2022-03-11 15:00:00|1           |
|2022-03-11 15:00:00|2           |
|2022-03-11 16:00:00|2           |
|2022-03-11 16:00:00|3           |
+-------------------+------------+

How can I achieve that? Thanks!

1 Answer

df = spark.createDataFrame(
    [
        ('2022-03-11 14:00:00', '1'),
        ('2022-03-11 15:00:00', '2'),
        ('2022-03-11 16:00:00', '3')
    ], ['windowStart', 'nodeId'])

from pyspark.sql import Window as W
from pyspark.sql import functions as F

# No partitionBy here, so Spark pulls all rows into a single partition;
# fine for a small example, but add a partition key on real data.
w = W.orderBy('windowStart')

# For each row, take the next row's windowStart (F.lead); the last row
# has no successor, so its null is filtered out.
df_lead = df\
    .withColumn('lead', F.lead(F.col('windowStart'), 1).over(w))\
    .select(F.col('lead').alias('windowStart'), 'nodeId')\
    .filter(F.col('windowStart').isNotNull())

df.union(df_lead)\
    .orderBy('windowStart', 'nodeId')\
    .show()
+-------------------+------+
|        windowStart|nodeId|
+-------------------+------+
|2022-03-11 14:00:00|     1|
|2022-03-11 15:00:00|     1|
|2022-03-11 15:00:00|     2|
|2022-03-11 16:00:00|     2|
|2022-03-11 16:00:00|     3|
+-------------------+------+
