I have two Spark DataFrames. The first contains information about Events, as below:
| Id | User_id | Date |
|---|---|---|
| 1 | 1 | 2020-12-01 |
| 2 | 2 | 2021-10-10 |
The second DataFrame contains information related to Purchases, as follows:
| Id | User_id | Date | Value |
|---|---|---|---|
| 1 | 1 | 2020-11-10 | 50 |
| 2 | 1 | 2020-10-10 | 25 |
| 3 | 2 | 2020-09-15 | 100 |
I want to join both DataFrames and create a column containing the last Value, another column with the difference between the past two Values, and a column with the date difference in days between the past two Purchases, as below:
| Id | User_id | Date | Last_Value | Diff_Value | Diff_Date |
|---|---|---|---|---|---|
| 1 | 1 | 2020-12-01 | 50 | 25 | 30 |
| 2 | 2 | 2021-10-10 | 100 | null | null |
To join the DataFrames I'm using the following code:

```python
from pyspark.sql import functions as F, Window as W

(Events.join(Purchase,
             on=[Events.User_id == Purchase.User_id,
                 Events.Date >= Purchase.Date],
             how="left")
 .withColumn('rank_date',
             F.rank().over(W.partitionBy(Events['Id'])
                            .orderBy(Purchase['Date'].desc()))))
```
With this code I can see which Purchases happened prior to each Event, ordered by Date, but how can I access the values of the previous rows and create columns based on them?
Note that the Ids in Events are not related to those in Purchase. For user_id 1, I read the expected output as: the last purchase before the event on 2020-12-01 had a value of 50.