1

I am using spark 2.1 and usage is pyscripting

Problem Statement: Have an scenario where there is an need to pass multiple columns as input and return one column as output below is my input dataframe of 3 columns

a b c

S S S

S NS NS

S NS S

S S NS

NS S NS

my output has to be as below

a b c d

S S S S

S NS NS NS

S NS S S

S S NS NS

NS S NS NS

I am trying to register an UDF to pass these 3 columns[a,b,c] as input and return d column as output here a,b,c,d are the column names

I am finding difficult to get the output below is the syntax used

def return_string(x):
      if [x.a=='s' & x.b=='S' & x.c=='s']
          return 'S'
      else if[x.a=='s' & x.b=='NS' & x.c=='s']
          return 'S'
      else if[x.a=='s' & x.b=='S' & x.c=='NS']
          return 'NS;

func= udf(returnstring,types.StringType())

Can anyone please help me in completing this logic.

3
  • 1
    Are all three columns important? For this sample output it seems to only depend on C. Commented Aug 23, 2017 at 19:28
  • Possible duplicate of Pyspark: Pass multiple columns in UDF Commented Aug 23, 2017 at 19:30
  • Yes all three are important as missed one more logic if x.a=='NS' & x.b=='S' | x.c=='NS' return 'NS' but what you have mentioned is right for this sample output other columns can alone be considered Commented Aug 23, 2017 at 19:33

2 Answers 2

6

I was trying to do it using the built-in withColumn and when functions:

from pyspark.sql.functions import col, when, lit

df.withColumn('d', when(
     ((col('A') == 'S') & (col('B') == 'S') & (col('C')=='S'))
   | ((col('A') == 'S') & (col('B') == 'NS') & (col('C')=='S'))
 , lit('S')
 ).otherwise(lit('NS'))
).show()

This is also assuming that the two values are mutually exclusive (hence the otherwise)

Sign up to request clarification or add additional context in comments.

Comments

5

It should be:

@udf
def return_string(a, b, c):
    if a == 's' and b == 'S' and c == 's':
        return 'S'
    if a == 's' and b == 'NS' and c == 's':
        return 'S'
    if a == 's' and b == 'S' and c == 'NS':
        return 'NS'

df = sc.parallelize([('s', 'S', 'NS'), ('?', '?', '?')]).toDF(['a', 'b', 'c'])

df.withColumn('result', return_string('a', 'b', 'c')).show()
## +---+---+---+------+
## |  a|  b|  c|result|
## +---+---+---+------+
## |  s|  S| NS|    NS|
## |  ?|  ?|  ?|  null|
## +---+---+---+------+
  • All arguments should be listed (unless you pass data as struct).
  • You should use and not & (you evaluate logical expressions not SQL expressions).
  • Conditions should be expressions not lists (non-empty list are always truthy).

Personally I'd skip all the ifs and use simple dict:

@udf
def return_string(a, b, c):
    mapping = {
        ('s', 'S', 's'): 'S',
        ('s', 'NS' 's'): 'S',
        ('s', 'S', 'NS'): 'NS',
    }
    return mapping.get((a, b, c))

Adjust conditions according to your requirements.

Overall you should prefer SQL expressions as shown in the excellent answer provided by Steven Laan (you can chain multiple conditions with when(..., ...).when(..., ...)).

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.