Currently I'm working with a large database in PySpark and I'm stuck on the problem of how to correctly set row numbers depending on a condition.
My dataframe is:
id_company  id_client  id_loan  date
c1          id1        m1       2024-10-15
c2          id1        m2       2024-10-16
c3          id2        m3       2024-10-18
c3          id2        m4       2024-10-18
c3          id2        m5       2024-10-18
c4          id3        m6       2024-10-19
c4          id3        m7       2024-10-20
c4          id3        m8       2024-10-20
c5          id4        m9       2024-10-30
c6          id4        m10      2024-10-31
c6          id4        m11      2024-10-31
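For reproducibility, the sample dataframe can be recreated like this (I'm assuming the id columns are strings and that date should be a proper DateType column; adjust if your schema differs):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.getOrCreate()

data = [
    ("c1", "id1", "m1", "2024-10-15"),
    ("c2", "id1", "m2", "2024-10-16"),
    ("c3", "id2", "m3", "2024-10-18"),
    ("c3", "id2", "m4", "2024-10-18"),
    ("c3", "id2", "m5", "2024-10-18"),
    ("c4", "id3", "m6", "2024-10-19"),
    ("c4", "id3", "m7", "2024-10-20"),
    ("c4", "id3", "m8", "2024-10-20"),
    ("c5", "id4", "m9", "2024-10-30"),
    ("c6", "id4", "m10", "2024-10-31"),
    ("c6", "id4", "m11", "2024-10-31"),
]
df = spark.createDataFrame(data, ["id_company", "id_client", "id_loan", "date"])
# parse the string column into a DateType column
df = df.withColumn("date", to_date("date"))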
My goal is to number the rows within each unique id_client, depending on the loan date and on the id_company.
For instance, if there are two loans within one unique id_client, and they have different dates and different id_companies, they should be marked as 1 and 2.
If there are two or more loans within one unique id_client, but they have the same date within one id_company, they should be marked with the same number.
My expected result:
id_company  id_client  id_loan  date        row_number
c1          id1        m1       2024-10-15  1
c2          id1        m2       2024-10-16  2
c3          id2        m3       2024-10-18  1
c3          id2        m4       2024-10-18  1
c3          id2        m5       2024-10-18  1
c4          id3        m6       2024-10-19  1
c4          id3        m7       2024-10-20  2
c4          id3        m8       2024-10-20  2
c5          id4        m9       2024-10-30  1
c6          id4        m10      2024-10-31  2
c6          id4        m11      2024-10-31  2
The code I use:
from pyspark.sql import Window
from pyspark.sql.functions import row_number

# number rows per client, ordered by date
df1 = df.withColumn("row_num", row_number().over(Window.partitionBy("id_client").orderBy("date")))
But it does not consider the id_company condition: for example, the three id2 loans (all dated 2024-10-18) get numbers 1, 2, 3 in an arbitrary order, where I expect 1, 1, 1.
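One idea I had is to use dense_rank instead of row_number and order by both date and id_company, so that rows tying on (date, id_company) share a number, but I'm not sure this expresses the condition correctly, so treat it as an unverified sketch:

from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

w = Window.partitionBy("id_client").orderBy("date", "id_company")
# dense_rank gives equal rows the same rank and leaves no gaps,
# so loans sharing (date, id_company) within a client get the same number
df1 = df.withColumn("row_number", dense_rank().over(w))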
Any help is highly appreciated!