
I'm trying to add a new column Timezone to my PySpark dataframe:

from timezonefinder import TimezoneFinder
from pyspark.sql.functions import col

tf = TimezoneFinder()
df = df.withColumn("longitude", col("longitude").cast("float"))
df = df.withColumn("Latitude", col("Latitude").cast("float"))
# This line fails: timezone_at() is given Column expressions, not float values
df = df.withColumn("timezone", tf.timezone_at(lng=col("longitude"), lat=col("Latitude")))

I'm getting the error below.

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

The timezonefinder library finds a timezone from a pair of geocoordinates:

Latitude, longitude = 20.5061, 50.358
tf.timezone_at(lng=longitude, lat=Latitude)
# 'Asia/Riyadh'

1 Answer


You need a UDF to pass column values to a plain Python function. tf.timezone_at expects float arguments, but col("longitude") is a Column expression, which is what triggers the ValueError. Wrapped in a UDF, the function is called once per row with the actual values:

import pyspark.sql.functions as F

@F.udf('string')
def tfUDF(lng, lat):
    # Import and construct inside the UDF so it is available on the executors
    # (note: this builds a new TimezoneFinder instance on every call)
    from timezonefinder import TimezoneFinder
    tf = TimezoneFinder()
    return tf.timezone_at(lng=lng, lat=lat)

df = df.withColumn("longitude", F.col("longitude").cast("float"))
df = df.withColumn("Latitude", F.col("Latitude").cast("float"))
df = df.withColumn("timezone", tfUDF(F.col("longitude"), F.col("Latitude")))

df.show()
+--------+---------+-----------+
|Latitude|longitude|   timezone|
+--------+---------+-----------+
| 20.5061|   50.358|Asia/Riyadh|
+--------+---------+-----------+

2 Comments

Thanks, it worked. I'd just like to know if there's any alternative to using a UDF, and how a UDF performs. I'm an absolute beginner with PySpark.
You probably need a UDF for any function that is not available natively in Spark SQL. Its performance is generally worse than that of native Spark functions, but you could consider using a pandas UDF with PyArrow to improve it.
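
To illustrate that suggestion, here is a minimal pandas UDF sketch (not from the original answer; it assumes Spark 3.x with pyarrow installed and timezonefinder available on the executors, and the name tzPandasUDF is made up for this example). Batching rows into pandas Series amortizes serialization overhead and lets TimezoneFinder be built once per batch instead of once per row:

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf

@pandas_udf('string')
def tzPandasUDF(lng: pd.Series, lat: pd.Series) -> pd.Series:
    from timezonefinder import TimezoneFinder
    tf = TimezoneFinder()  # one instance per batch of rows, not one per row
    return pd.Series([tf.timezone_at(lng=x, lat=y) for x, y in zip(lng, lat)])

df = df.withColumn("timezone", tzPandasUDF(F.col("longitude"), F.col("Latitude")))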
