Currently I'm working with a large database in PySpark and I'm stuck on the problem of how to correctly set row numbers depending on a condition.
My dataframe is:
id_company  id_client  id_loan  date
c1          id1        m1       2024-10-15
c2          id1        m2       2024-10-16
c3          id2        m3       2024-10-18
c3          id2        m4       2024-10-18
c3          id2        m5       2024-10-18
c4          id3        m6       2024-10-19
c4          id3        m7       2024-10-20
c4          id3        m8       2024-10-20
c5          id4        m9       2024-10-30
c6          id4        m10      2024-10-31
c6          id4        m11      2024-10-31
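For reproducibility, the sample dataframe can be recreated like this (I'm assuming the id columns are strings and that date should be a proper DateType column; adjust if your schema differs):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.getOrCreate()

data = [
    ("c1", "id1", "m1", "2024-10-15"),
    ("c2", "id1", "m2", "2024-10-16"),
    ("c3", "id2", "m3", "2024-10-18"),
    ("c3", "id2", "m4", "2024-10-18"),
    ("c3", "id2", "m5", "2024-10-18"),
    ("c4", "id3", "m6", "2024-10-19"),
    ("c4", "id3", "m7", "2024-10-20"),
    ("c4", "id3", "m8", "2024-10-20"),
    ("c5", "id4", "m9", "2024-10-30"),
    ("c6", "id4", "m10", "2024-10-31"),
    ("c6", "id4", "m11", "2024-10-31"),
]
df = spark.createDataFrame(data, ["id_company", "id_client", "id_loan", "date"])
# parse the string column into a DateType column
df = df.withColumn("date", to_date("date"))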
My goal is to number the rows within each unique id_client, depending on the loan date and on the id_company.
For instance, if there are two loans within one unique id_client, and they have different dates and different id_companies, they should be marked as 1 and 2.
If there are two or more loans within one unique id_client, but they have the same date within one id_company, they should be marked with the same number.
My expected result:
id_company  id_client  id_loan  date        row_number
c1          id1        m1       2024-10-15  1
c2          id1        m2       2024-10-16  2
c3          id2        m3       2024-10-18  1
c3          id2        m4       2024-10-18  1
c3          id2        m5       2024-10-18  1
c4          id3        m6       2024-10-19  1
c4          id3        m7       2024-10-20  2
c4          id3        m8       2024-10-20  2
c5          id4        m9       2024-10-30  1
c6          id4        m10      2024-10-31  2
c6          id4        m11      2024-10-31  2
The code I use:
from pyspark.sql import Window
from pyspark.sql.functions import row_number

# number rows per client, ordered by date
df1 = df.withColumn("row_num", row_number().over(Window.partitionBy("id_client").orderBy("date")))
But it does not consider the id_company condition: for example, the three id2 loans (all dated 2024-10-18) get numbers 1, 2, 3 in an arbitrary order, where I expect 1, 1, 1.
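One idea I had is to use dense_rank instead of row_number and order by both date and id_company, so that rows tying on (date, id_company) share a number, but I'm not sure this expresses the condition correctly, so treat it as an unverified sketch:

from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

w = Window.partitionBy("id_client").orderBy("date", "id_company")
# dense_rank gives equal rows the same rank and leaves no gaps,
# so loans sharing (date, id_company) within a client get the same number
df1 = df.withColumn("row_number", dense_rank().over(w))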
Any help is highly appreciated!