
I have a piece of code that I want to translate into a Pandas UDF in PySpark, but I'm having trouble understanding whether or not you can use conditional statements inside one.

def is_pass_in(df):
    # pull the column out as a plain Python list
    x = list(df["string"])
    result = []
    for i in x:
        # substring check: is the word "pass" in this sentence?
        if "pass" in i:
            result.append("YES")
        else:
            result.append("NO")

    # attach the labels as a new column
    df["result"] = result

    return df

The code is super simple: all I'm trying to do is iterate through a column in which each row contains a sentence. I want to check if the word "pass" is in that sentence and, if so, append "YES" to a list that will later become a column right next to the df["string"] column. I've tried to do this using a Pandas UDF, but the error messages I'm getting are something I don't understand because I'm new to Spark. Could someone point me in the correct direction?
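For reference, conditional logic is allowed inside a pandas UDF; the usual idiom is to vectorise it over the whole pd.Series rather than loop row by row. A minimal sketch, assuming Spark 3.x and a Spark DataFrame sdf with the same "string" column:

import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def is_pass_in(s: pd.Series) -> pd.Series:
    # vectorised substring test over the whole batch
    return s.str.contains("pass").map({True: "YES", False: "NO"})

sdf = sdf.withColumn("result", is_pass_in(col("string")))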

1 Answer


There is no need to use a UDF. This can be done in PySpark as follows. Even in pandas, I would advise against what you have done; use np.where() instead.

from pyspark.sql.functions import when, col

df.withColumn('result', when(col('store') == 'target', 'YES').otherwise('NO')).show()
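On the pandas side, the np.where() version would look something like this (a sketch, assuming the same df and "string" column as in the question):

import numpy as np

def is_pass_in(df):
    # vectorised substring test: no Python-level loop
    df["result"] = np.where(df["string"].str.contains("pass"), "YES", "NO")
    return df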

2 Comments

From what I've read in the docs, how would you use regex to find a substring in the column, as opposed to matching the exact value of the column? I'm trying to find a single word inside a string.
I added the rlike function and got what I needed: df.withColumn('new_column', when(col('text_column').rlike('pass$'), 'pass step').otherwise('failed step')). Thanks for the help.
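One caveat on that pattern: 'pass$' only matches sentences that end with "pass", because $ anchors the match to the end of the string. Since rlike() already matches anywhere in the string, a word-boundary pattern is closer to "find this word in the sentence" (a sketch reusing the commenter's column names):

from pyspark.sql.functions import when, col

df.withColumn('new_column',
              when(col('text_column').rlike(r'\bpass\b'), 'pass step')
              .otherwise('failed step'))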
