
Currently I'm working with a large database using PySpark and I'm stuck on how to correctly assign row numbers depending on a condition.

My dataframe is:

id_company id_client id_loan date
c1         id1       m1      2024-10-15
c2         id1       m2      2024-10-16

c3         id2       m3      2024-10-18
c3         id2       m4      2024-10-18
c3         id2       m5      2024-10-18

c4         id3       m6      2024-10-19
c4         id3       m7      2024-10-20
c4         id3       m8      2024-10-20

c5         id4       m9      2024-10-30
c6         id4       m10     2024-10-31
c6         id4       m11     2024-10-31

My goal is to number the rows within each unique id_client, based on the loan date and the id_company.

For instance, if one id_client has two loans with different dates and different id_companies, they should be numbered 1 and 2.

If one id_client has two or more loans with the same date within one id_company, they should all get the same number.

My expected result:

id_company id_client id_loan date       row_number
c1         id1       m1      2024-10-15 1
c2         id1       m2      2024-10-16 2

c3         id2       m3      2024-10-18 1
c3         id2       m4      2024-10-18 1
c3         id2       m5      2024-10-18 1

c4         id3       m6      2024-10-19 1
c4         id3       m7      2024-10-20 2
c4         id3       m8      2024-10-20 2

c5         id4       m9      2024-10-30 1
c6         id4       m10     2024-10-31 2
c6         id4       m11     2024-10-31 2

The code I use:

from pyspark.sql import Window
from pyspark.sql.functions import row_number

df1 = df.withColumn("row_num", row_number().over(Window.partitionBy("id_subject").orderBy('date')))

But it does not take the id_company condition into account.

Any help is highly appreciated!

1 Answer

Using dense_rank() instead of row_number() should give you the expected output. Also, partitionBy("id_subject") looks like a typo: the column in your data is id_client, so it should be partitionBy("id_client").

Updated lines of code (note that dense_rank also needs to be imported):

from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

df1 = df.withColumn("row_num", dense_rank().over(Window.partitionBy("id_client").orderBy('date')))
df1.show()
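To see why dense_rank fits here, the difference between the two functions can be sketched in pure Python, without a Spark session. The example below reimplements both numbering schemes for a single partition (the id3 group from the question), assuming the rows are already sorted by date, as Window.orderBy('date') would do; the function names simply mirror the Spark functions and are not the Spark API itself.

```python
# Rows for one partition (id_client = id3), already sorted by date.
rows = [
    ("c4", "id3", "m6", "2024-10-19"),
    ("c4", "id3", "m7", "2024-10-20"),
    ("c4", "id3", "m8", "2024-10-20"),
]

def row_number_like(dates):
    # row_number: every row gets a new number, even when dates tie
    return list(range(1, len(dates) + 1))

def dense_rank_like(dates):
    # dense_rank: rows with the same order key share a rank, and the
    # next distinct key gets the next consecutive rank (no gaps)
    ranks, rank, prev = [], 0, object()
    for d in dates:
        if d != prev:
            rank += 1
            prev = d
        ranks.append(rank)
    return ranks

dates = [r[3] for r in rows]
print(row_number_like(dates))  # [1, 2, 3]  -- not what the question wants
print(dense_rank_like(dates))  # [1, 2, 2]  -- matches the expected output
```

This is why the ties on 2024-10-20 end up sharing the number 2 under dense_rank, while row_number would have split them into 2 and 3.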