Suppose I have the following data frame :
+----------+-----+----+-------+
|display_id|ad_id|prob|clicked|
+----------+-----+----+-------+
| 123| 989| 0.9| 0|
| 123| 990| 0.8| 1|
| 123| 999| 0.7| 0|
| 234| 789| 0.9| 0|
| 234| 777| 0.7| 0|
| 234| 769| 0.6| 1|
| 234| 798| 0.5| 0|
+----------+-----+----+-------+
I then perform the following operations to get a final data set (shown below the code) :
# Add a new column with the clicked ad_id if clicked == 1, 0 otherwise
df_adClicked = df.withColumn("ad_id_clicked", when(df.clicked==1, df.ad_id).otherwise(0))
# DF -> RDD with tuple : (display_id, (ad_id, prob), clicked)
df_blah = df_adClicked.rdd.map(lambda x : (x[0],(x[1],x[2]),x[4])).toDF(["display_id", "ad_id","clicked_ad_id"])
# Group by display_id and create column with clicked ad_id and list of tuples : (ad_id, prob)
df_blah2 = df_blah.groupby('display_id').agg(F.collect_list('ad_id'), F.max('clicked_ad_id'))
# Define function to sort list of tuples by prob and create list of only ad_ids
def sortByRank(ad_id_list):
sortedVersion = sorted(ad_id_list, key=itemgetter(1), reverse=True)
sortedIds = [i[0] for i in sortedVersion]
return(sortedIds)
# Sort the (ad_id, prob) tuples by using udf/function and create new column ad_id_sorted
sort_ad_id = udf(lambda x : sortByRank(x), ArrayType(IntegerType()))
df_blah3 = df_blah2.withColumn('ad_id_sorted', sort_ad_id('collect_list(ad_id)'))
# Function to change clickedAdId into an array of size 1
def createClickedSet(clickedAdId):
setOfDocs = [clickedAdId]
return setOfDocs
clicked_set = udf(lambda y : createClickedSet(y), ArrayType(IntegerType()))
df_blah4 = df_blah3.withColumn('ad_id_set', clicked_set('max(clicked_ad_id)'))
# Select the necessary columns
finalDF = df_blah4.select('display_id', 'ad_id_sorted','ad_id_set')
+----------+--------------------+---------+
|display_id|ad_id_sorted |ad_id_set|
+----------+--------------------+---------+
|234 |[789, 777, 769, 798]|[769] |
|123 |[989, 990, 999] |[990] |
+----------+--------------------+---------+
Is there a more efficient way of doing this? Doing this set of transformations in the way that I am seems to be the bottle neck in my code. I would greatly appreciate any feedback.