I have a master dataframe and a secondary dataframe. I want to go through the secondary dataframe row by row, filter the master dataframe based on the values in that row, run a function on the filtered master dataframe, and save the output. The output could be saved either in a separate dataframe or in a new column of the secondary dataframe.
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Master DF
df = pd.DataFrame({"Name": ["Mike", "Bob", "Steve", "Jim", "Dan"], "Age": [22, 44, 66, 22, 66], "Job": ["Doc", "Cashier", "Fireman", "Doc", "Fireman"]})
# Secondary DF
df1 = pd.DataFrame({"Age": [22, 66], "Job": ["Doc", "Fireman"]})
df = spark.createDataFrame(df)
+-----+---+-------+
| Name|Age| Job|
+-----+---+-------+
| Mike| 22| Doc|
| Bob| 44|Cashier|
|Steve| 66|Fireman|
| Jim| 22| Doc|
| Dan| 66|Fireman|
+-----+---+-------+
df1 = spark.createDataFrame(df1)
+---+-------+
|Age| Job|
+---+-------+
| 22| Doc|
| 66|Fireman|
+---+-------+
# Filter by values in first row of secondary DF
df_filt = df.filter(
    (F.col("Age") == 22) &
    (F.col("Job") == "Doc")
)
# Run the filtered DF through my function
def my_func(df_filt):
    # Collect the Name column to the driver and join it into one string
    my_list = df_filt.select("Name").rdd.flatMap(lambda x: x).collect()
    return "-".join(my_list)
# Output of function
my_func(df_filt)
'Mike-Jim'
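As an aside, I believe the same string can be built without the RDD round-trip by aggregating directly on the DataFrame (my own sketch, only tested against this toy example; note that collect_list makes no ordering guarantee):

# DataFrame-API equivalent of my_func (my assumption; collect_list
# does not guarantee the order of the collected names)
def my_func_agg(df_filt):
    return df_filt.agg(F.concat_ws("-", F.collect_list("Name"))).first()[0]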
# Filter by values in second row of secondary DF
df_filt = df.filter(
    (F.col("Age") == 66) &
    (F.col("Job") == "Fireman")
)
# Output of function
my_func(df_filt)
'Steve-Dan'
# Desired output at the end of the iterations
new_df1 = pd.DataFrame({"Age": [22, 66], "Job": ["Doc", "Fireman"], "Returned_value": ['Mike-Jim', 'Steve-Dan']})
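For concreteness, the only way I can see to do this today is a plain Python loop over the rows of df1 (a minimal sketch; this per-row loop is exactly what I'd like to improve on):

# Naive row-by-row iteration over the secondary DF
results = []
for row in df1.collect():
    filtered = df.filter(
        (F.col("Age") == row["Age"]) &
        (F.col("Job") == row["Job"])
    )
    results.append(my_func(filtered))

new_df1 = df1.toPandas()
new_df1["Returned_value"] = results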
Basically, I want to filter my master DF in a certain way, run an algorithm on the filtered dataset, record the output for that filter, and then move on to the next filter and repeat.
What is the best way to go about this?
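For the toy example above I suspect the loop can be avoided entirely with a join plus groupBy (again my own sketch; my real function may not reduce to a single aggregate expression like this):

# Loop-free alternative for this particular function: restrict the
# master DF to the (Age, Job) pairs present in df1, then aggregate per pair
out = (
    df.join(df1, on=["Age", "Job"], how="inner")
      .groupBy("Age", "Job")
      .agg(F.concat_ws("-", F.collect_list("Name")).alias("Returned_value"))
)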