
I'm trying to add a new column Timezone to my PySpark dataframe:

from timezonefinder import TimezoneFinder
from pyspark.sql.functions import col

tf = TimezoneFinder()
df = df.withColumn("longitude", col("longitude").cast("float"))
df = df.withColumn("Latitude", col("Latitude").cast("float"))
# This line fails: timezone_at() is given Column expressions, not float values
df = df.withColumn("timezone", tf.timezone_at(lng=col("longitude"), lat=col("Latitude")))

I'm getting the error below.

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

The timezonefinder library finds a timezone from a pair of geocoordinates:

Latitude, longitude = 20.5061, 50.358
tf.timezone_at(lng=longitude, lat=Latitude)
# 'Asia/Riyadh'

1 Answer


You need a UDF to pass column values to a plain Python function. tf.timezone_at expects float arguments, but col("longitude") is a Column expression, which is what triggers the ValueError. Wrapped in a UDF, the function is called once per row with the actual values:

import pyspark.sql.functions as F

@F.udf('string')
def tfUDF(lng, lat):
    # Import and construct inside the UDF so it is available on the executors
    # (note: this builds a new TimezoneFinder instance on every call)
    from timezonefinder import TimezoneFinder
    tf = TimezoneFinder()
    return tf.timezone_at(lng=lng, lat=lat)

df = df.withColumn("longitude", F.col("longitude").cast("float"))
df = df.withColumn("Latitude", F.col("Latitude").cast("float"))
df = df.withColumn("timezone", tfUDF(F.col("longitude"), F.col("Latitude")))

df.show()
+--------+---------+-----------+
|Latitude|longitude|   timezone|
+--------+---------+-----------+
| 20.5061|   50.358|Asia/Riyadh|
+--------+---------+-----------+

2 Comments

Thanks, it worked. I'd just like to know if there's any alternative to using a UDF, and how a UDF performs. I'm an absolute beginner with PySpark.
You probably need a UDF for any function that is not available natively in Spark SQL. Its performance is generally worse than that of native Spark functions, but you could consider using a pandas UDF with PyArrow to improve it.
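
To illustrate that suggestion, here is a minimal pandas UDF sketch (not from the original answer; it assumes Spark 3.x with pyarrow installed and timezonefinder available on the executors, and the name tzPandasUDF is made up for this example). Batching rows into pandas Series amortizes serialization overhead and lets TimezoneFinder be built once per batch instead of once per row:

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf

@pandas_udf('string')
def tzPandasUDF(lng: pd.Series, lat: pd.Series) -> pd.Series:
    from timezonefinder import TimezoneFinder
    tf = TimezoneFinder()  # one instance per batch of rows, not one per row
    return pd.Series([tf.timezone_at(lng=x, lat=y) for x, y in zip(lng, lat)])

df = df.withColumn("timezone", tzPandasUDF(F.col("longitude"), F.col("Latitude")))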
