Let's say I have a PySpark DataFrame with the following columns:
user, score, country, risky/safe, payment_id
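
For reference, here is a toy version of the data (the rows and the 'risky'/'safe' label values are made up, but the schema matches):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ('u1', 15, 'A', 'risky', 'p1'),
        ('u1',  5, 'A', 'safe',  'p2'),
        ('u2', 25, 'B', 'risky', 'p3'),
        ('u3', 35, 'B', 'safe',  'p4'),
    ],
    ['user', 'score', 'country', 'risky/safe', 'payment_id'],
)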
I made a list of thresholds: [10, 20, 30]
Now I want to make two new columns for each threshold:
- % of risky payments with a score above the threshold, out of all payments (risky and safe)
- % of distinct risky users with at least one score above the threshold, out of all distinct users (risky and safe)

Both of them should be grouped by country; a single-threshold sketch of what I mean follows this list.
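
For one threshold, this is roughly what I mean (just a sketch; I'm assuming the labels in risky/safe are 'risky' and 'safe', and relying on countDistinct skipping the nulls that when() without otherwise() produces):

from pyspark.sql import functions as F

thresh = 10  # one example threshold

per_country = df.groupBy('country').agg(
    # risky payments with score >= thresh, out of ALL payments
    (F.sum(F.when((F.col('risky/safe') == 'risky') & (F.col('score') >= thresh), 1)
           .otherwise(0))
     / F.count('payment_id') * 100).alias('% payments thresh 10'),
    # distinct risky users with at least one score >= thresh, out of ALL distinct users
    (F.countDistinct(F.when((F.col('risky/safe') == 'risky') & (F.col('score') >= thresh),
                            F.col('user')))
     / F.countDistinct('user') * 100).alias('% users thresh 10'),
)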
The result should be something like this:
country | % payments thresh 10 | % users thresh 10 | % payments thresh 20 | ...
A       | ...
B       | ...
C       | ...
I was able to make it work with an external for loop, but I want it all in one DataFrame.
from pyspark.sql import functions as F

thresholds = [10, 20, 30]
per_thresh = []  # one aggregated DataFrame per threshold; these still need to be joined
for thresh in thresholds:
    per_thresh.append(
        df.groupBy('country')
          .agg((F.sum(F.when((F.col('risky/safe') == 'risky')
                             & (F.col('score') >= thresh), 1).otherwise(0))
                / F.count('payment_id') * 100)
               .alias(f'% payments thresh {thresh}')))

Should the denominator be F.count('payment_id') to get the % of payments over the threshold for every country? And how can I compute all of the thresholds in a single DataFrame?
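
What I'm imagining is building all of the aggregate expressions up front and passing them to one agg call, something like this sketch (same assumption as above that the risky/safe labels are 'risky' and 'safe'):

from pyspark.sql import functions as F

thresholds = [10, 20, 30]

agg_exprs = []
for thresh in thresholds:
    # risky rows with a score above this threshold, reused by both metrics
    risky_above = (F.col('risky/safe') == 'risky') & (F.col('score') >= thresh)
    agg_exprs.append(
        (F.sum(F.when(risky_above, 1).otherwise(0))
         / F.count('payment_id') * 100).alias(f'% payments thresh {thresh}'))
    agg_exprs.append(
        (F.countDistinct(F.when(risky_above, F.col('user')))
         / F.countDistinct('user') * 100).alias(f'% users thresh {thresh}'))

result = df.groupBy('country').agg(*agg_exprs)

Is this the right approach?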