
I am trying to generate an additional column in a DataFrame with auto-incrementing values based on a global variable. However, all the rows are generated with the same value; the value is not incrementing.

Here is the code:

def autoIncrement():
    global rec
    if rec == 0:
        rec = 1
    else:
        rec = rec + 1
    return int(rec)

rec = 14

Registering the UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

autoIncrementUDF = udf(autoIncrement, IntegerType())


df1 = hiveContext.sql("select id,name,location,state,datetime,zipcode from demo.target")

df1.withColumn("id2", autoIncrementUDF()).show()

Here is the resulting DataFrame:

+---+------+--------+----------+-------------------+-------+---+
| id|  name|location|     state|           datetime|zipcode|id2|
+---+------+--------+----------+-------------------+-------+---+
| 20|pankaj| Chennai| TamilNadu|2018-03-26 11:00:00|   NULL| 15|
| 10|geetha| Newyork|New Jersey|2018-03-27 10:00:00|   NULL| 15|
| 25| pawan| Chennai| TamilNadu|2018-03-27 11:25:00|   NULL| 15|
| 30|Manish| Gurgoan|   Gujarat|2018-03-27 11:00:00|   NULL| 15|
+---+------+--------+----------+-------------------+-------+---+

But I am expecting the result below:

+---+------+--------+----------+-------------------+-------+---+
| id|  name|location|     state|           datetime|zipcode|id2|
+---+------+--------+----------+-------------------+-------+---+
| 20|pankaj| Chennai| TamilNadu|2018-03-26 11:00:00|   NULL| 15|
| 10|geetha| Newyork|New Jersey|2018-03-27 10:00:00|   NULL| 16|
| 25| pawan| Chennai| TamilNadu|2018-03-27 11:25:00|   NULL| 17|
| 30|Manish| Gurgoan|   Gujarat|2018-03-27 11:00:00|   NULL| 18|
+---+------+--------+----------+-------------------+-------+---+

Any help is appreciated.

  • Since a UDF can be executed on different workers, a Python global variable makes no sense, as globals are bound to processes. Commented May 4, 2018 at 12:02

1 Answer


Global variables are bound to a single Python process. A UDF may be executed in parallel on different workers across the cluster, and it should be deterministic.

You should use the monotonically_increasing_id() function from the pyspark.sql.functions module.

Check the docs for more info.
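A minimal sketch, assuming the same df1 as in the question:

from pyspark.sql.functions import monotonically_increasing_id

# Generates a unique 64-bit id per row; values increase but are not consecutive.
df1.withColumn("id2", monotonically_increasing_id()).show()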

You should be careful, because this function is dynamic and not sticky: the generated ids are unique and increasing, but not consecutive, and they can change when the DataFrame is recomputed. See:

How do I add a persistent column of row ids to Spark DataFrame?
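If you really need consecutive values starting at a fixed offset (15 in the expected output), one workaround is row_number() over a Window. Note that a window without partitionBy moves all rows to a single partition, so this does not scale to large DataFrames. A sketch, assuming the existing id column defines the intended row order (the question does not say which order is meant):

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

# Caution: no partitionBy, so Spark collects every row into one partition.
w = Window.orderBy("id")

# row_number() starts at 1; adding 14 yields 15, 16, 17, ...
df1.withColumn("id2", row_number().over(w) + 14).show()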


1 Comment

Your first link is broken.
