Pyspark, writing a loop to create multiple new columns based on different conditions

Question

Lets say I have a Pyspark DataFrame with the following columns:

user, score, country, risky/safe, payment_id

I made a list of thresholds: [10, 20, 30]

Now I want to make a new columns for each threshold:

% of risky payments with score above the threshold out of all payments (risky and safe)
% of risky distinct users with at least one score above the threshold out of all users (risky and safe)

both of them should be grouped by country.

The result should be something like this:

Country | % payments thresh 10 | % users thresh 10 | % payments thresh 20 ... 
A
B
C

I was able to make it work with an external for loop but I want it to be all in one dataframe.

thresholds = [10, 20, 30]


for thresh in thresholds:

    
df = (df
     .select('country', 'risk/safe', 'user', 'payment')
     .where(F.col('risk\safe') == 'risk')
     .groupBy('country').agg(F.sum(F.when(
         (F.col('score') >= thresh),1 
           )) / F.count('country').alias('% payments'))

Shouldn't you divide by F.count('payment') to get the % of payments over the threshold for every country? — viggnah
– viggnah, Commented Aug 9, 2022 at 6:55

samkart · Accepted Answer · 2022-08-09 06:52:21Z

Use a list comprehension within the agg().

pay_aggs = [(func.sum((func.col('score')>=thresh).cast('int'))/func.count('country')).alias('% pay '+str(thresh)) for thresh in thresholds]
user_aggs = [(func.countDistinct(func.when(func.col('score')>=thresh, func.col('user')))/func.countDistinct('user')).alias('% user '+str(thresh)) for thresh in thresholds]

df. \
    select('country', 'risk/safe', 'user', 'payment'). \
    where(func.col('risk\safe') == 'risk'). \
    groupBy('country'). \
    agg(*pay_aggs, *user_aggs)

The pay_aggs list will generate the following aggregations (you can easily print the list)

# [Column<'(sum(CAST((score >= 10) AS INT)) / count(country)) AS `% pay 10`'>,
#  Column<'(sum(CAST((score >= 20) AS INT)) / count(country)) AS `% pay 20`'>,
#  Column<'(sum(CAST((score >= 30) AS INT)) / count(country)) AS `% pay 30`'>]

Collectives™ on Stack Overflow

Pyspark, writing a loop to create multiple new columns based on different conditions

1 Answer 1

Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Related