
I am trying to generate an additional column in a DataFrame with auto-incrementing values based on a global variable. However, all the rows are generated with the same value; the value is not incrementing.

Here is the code:

def autoIncrement():
    global rec
    if rec == 0:
        rec = 1
    else:
        rec = rec + 1
    return int(rec)

rec = 14

Registering the UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

autoIncrementUDF = udf(autoIncrement, IntegerType())


df1 = hiveContext.sql("select id,name,location,state,datetime,zipcode from demo.target")

df1.withColumn("id2", autoIncrementUDF()).show()

Here is the resulting DataFrame:

+---+------+--------+----------+-------------------+-------+---+
| id|  name|location|     state|           datetime|zipcode|id2|
+---+------+--------+----------+-------------------+-------+---+
| 20|pankaj| Chennai| TamilNadu|2018-03-26 11:00:00|   NULL| 15|
| 10|geetha| Newyork|New Jersey|2018-03-27 10:00:00|   NULL| 15|
| 25| pawan| Chennai| TamilNadu|2018-03-27 11:25:00|   NULL| 15|
| 30|Manish| Gurgoan|   Gujarat|2018-03-27 11:00:00|   NULL| 15|
+---+------+--------+----------+-------------------+-------+---+

But I am expecting the result below:

+---+------+--------+----------+-------------------+-------+---+
| id|  name|location|     state|           datetime|zipcode|id2|
+---+------+--------+----------+-------------------+-------+---+
| 20|pankaj| Chennai| TamilNadu|2018-03-26 11:00:00|   NULL| 15|
| 10|geetha| Newyork|New Jersey|2018-03-27 10:00:00|   NULL| 16|
| 25| pawan| Chennai| TamilNadu|2018-03-27 11:25:00|   NULL| 17|
| 30|Manish| Gurgoan|   Gujarat|2018-03-27 11:00:00|   NULL| 18|
+---+------+--------+----------+-------------------+-------+---+

Any help is appreciated.

  • Since a UDF can be executed on different workers, a Python global variable makes no sense, as globals are bound to processes. Commented May 4, 2018 at 12:02

1 Answer


Global variables are bound to a single Python process. A UDF may be executed in parallel on different workers across the cluster, and it should be deterministic.

You should use the monotonically_increasing_id() function from the pyspark.sql.functions module.

Check the docs for more info.
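A minimal sketch, assuming the same df1 as in the question:

from pyspark.sql.functions import monotonically_increasing_id

# Generates a unique 64-bit id per row; values increase but are not consecutive.
df1.withColumn("id2", monotonically_increasing_id()).show()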

You should be careful, because this function is dynamic and not sticky: the generated ids are unique and increasing, but not consecutive, and they can change when the DataFrame is recomputed. See:

How do I add a persistent column of row ids to Spark DataFrame?
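If you really need consecutive values starting at a fixed offset (15 in the expected output), one workaround is row_number() over a Window. Note that a window without partitionBy moves all rows to a single partition, so this does not scale to large DataFrames. A sketch, assuming the existing id column defines the intended row order (the question does not say which order is meant):

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

# Caution: no partitionBy, so Spark collects every row into one partition.
w = Window.orderBy("id")

# row_number() starts at 1; adding 14 yields 15, 16, 17, ...
df1.withColumn("id2", row_number().over(w) + 14).show()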


1 Comment

Your first link is broken.
