I have a master dataframe and a secondary dataframe. I want to go through the secondary dataframe row by row, filter the master dataframe based on the values in that row, run a function on the filtered master dataframe, and save the output. The output could be saved either in a separate dataframe or in a new column of the secondary dataframe.
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Master DF
df = pd.DataFrame({"Name": ["Mike", "Bob", "Steve", "Jim", "Dan"], "Age": [22, 44, 66, 22, 66], "Job": ["Doc", "Cashier", "Fireman", "Doc", "Fireman"]})
# Secondary DF
df1 = pd.DataFrame({"Age": [22, 66], "Job": ["Doc", "Fireman"]})
df = spark.createDataFrame(df)
+-----+---+-------+
| Name|Age| Job|
+-----+---+-------+
| Mike| 22| Doc|
| Bob| 44|Cashier|
|Steve| 66|Fireman|
| Jim| 22| Doc|
| Dan| 66|Fireman|
+-----+---+-------+
df1 = spark.createDataFrame(df1)
+---+-------+
|Age| Job|
+---+-------+
| 22| Doc|
| 66|Fireman|
+---+-------+
# Filter by values in first row of secondary DF
df_filt = df.filter(
    (F.col("Age") == 22) &
    (F.col("Job") == "Doc")
)
# Run the filtered DF through my function
def my_func(df_filt):
    # Collect the Name column to the driver and join it into one string
    my_list = df_filt.select("Name").rdd.flatMap(lambda x: x).collect()
    return "-".join(my_list)
# Output of function
my_func(df_filt)
'Mike-Jim'
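As an aside, I believe the same string can be built without the RDD round-trip by aggregating directly on the DataFrame (my own sketch, only tested against this toy example; note that collect_list makes no ordering guarantee):

# DataFrame-API equivalent of my_func (my assumption; collect_list
# does not guarantee the order of the collected names)
def my_func_agg(df_filt):
    return df_filt.agg(F.concat_ws("-", F.collect_list("Name"))).first()[0]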
# Filter by values in second row of secondary DF
df_filt = df.filter(
    (F.col("Age") == 66) &
    (F.col("Job") == "Fireman")
)
# Output of function
my_func(df_filt)
'Steve-Dan'
# Desired output at the end of the iterations
new_df1 = pd.DataFrame({"Age": [22, 66], "Job": ["Doc", "Fireman"], "Returned_value": ['Mike-Jim', 'Steve-Dan']})
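For concreteness, the only way I can see to do this today is a plain Python loop over the rows of df1 (a minimal sketch; this per-row loop is exactly what I'd like to improve on):

# Naive row-by-row iteration over the secondary DF
results = []
for row in df1.collect():
    filtered = df.filter(
        (F.col("Age") == row["Age"]) &
        (F.col("Job") == row["Job"])
    )
    results.append(my_func(filtered))

new_df1 = df1.toPandas()
new_df1["Returned_value"] = results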
Basically, I want to filter my master DF in a certain way, run an algorithm on the filtered dataset, record the output for that filter, and then move on to the next filter and repeat.
What is the best way to go about this?
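For the toy example above I suspect the loop can be avoided entirely with a join plus groupBy (again my own sketch; my real function may not reduce to a single aggregate expression like this):

# Loop-free alternative for this particular function: restrict the
# master DF to the (Age, Job) pairs present in df1, then aggregate per pair
out = (
    df.join(df1, on=["Age", "Job"], how="inner")
      .groupBy("Age", "Job")
      .agg(F.concat_ws("-", F.collect_list("Name")).alias("Returned_value"))
)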