
I have a data frame like the one below in PySpark:

df = sqlContext.createDataFrame(
[(1,'Y','Y',0,0,0,2,'Y','N','Y','Y'),
(2,'N','Y',2,1,2,3,'N','Y','Y','N'),
(3,'Y','N',3,1,0,0,'N','N','N','N'),
(4,'N','Y',5,0,1,0,'N','N','N','Y'),
(5,'Y','N',2,2,0,1,'Y','N','N','Y'),
(6,'Y','Y',0,0,3,6,'Y','N','Y','N'),
(7,'N','N',1,1,3,4,'N','Y','N','Y'),
(8,'Y','Y',1,1,2,0,'Y','Y','N','N')
],
('id', 'compatible', 'product', 'ios', 'pc', 'other', 'devices', 'customer', 'subscriber', 'circle', 'smb')
)

Now I want to create a new column bt_string in the data frame by concatenating some strings. I have done it like below:

import pyspark.sql.functions as f
from datetime import datetime
from time import strftime
from pyspark.sql import Window

# the below values will change as per requirement
job_id = '123'
sess_id = '99'
batch_id = '1'
time_now = datetime.now().strftime('%Y%m%d%H%M%S')

con_string = job_id + sess_id + batch_id + time_now + '000000000000000'

df1 = df.withColumn('bt_string', f.lit(con_string))
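For reference, with these values the base string already lands at exactly 35 characters (3 + 2 + 1 + 14 + 15), assuming time_now keeps its 14-character %Y%m%d%H%M%S format:

# quick sanity check on the base string length (plain Python)
print(len(job_id), len(sess_id), len(batch_id), len(time_now))  # 3 2 1 14
print(len(con_string))                                          # 35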

Now I want to assign a unique number to each row of the data frame. I applied the row_number function like below:

df2 = df1.withColumn("row_id",f.row_number().over(Window.partitionBy()))

The output is below

df2.show()  

+---+----------+-------+---+---+-----+-------+--------+----------+------+---+--------------------+------+
| id|compatible|product|ios| pc|other|devices|customer|subscriber|circle|smb|           bt_string|row_id|
+---+----------+-------+---+---+-----+-------+--------+----------+------+---+--------------------+------+
|  1|         Y|      Y|  0|  0|    0|      2|       Y|         N|     Y|  Y|12399120210301120...|     1|
|  2|         N|      Y|  2|  1|    2|      3|       N|         Y|     Y|  N|12399120210301120...|     2|
|  3|         Y|      N|  3|  1|    0|      0|       N|         N|     N|  N|12399120210301120...|     3|
|  4|         N|      Y|  5|  0|    1|      0|       N|         N|     N|  Y|12399120210301120...|     4|
|  5|         Y|      N|  2|  2|    0|      1|       Y|         N|     N|  Y|12399120210301120...|     5|
|  6|         Y|      Y|  0|  0|    3|      6|       Y|         N|     Y|  N|12399120210301120...|     6|
|  7|         N|      N|  1|  1|    3|      4|       N|         Y|     N|  Y|12399120210301120...|     7|
|  8|         Y|      Y|  1|  1|    2|      0|       Y|         Y|     N|  N|12399120210301120...|     8|
+---+----------+-------+---+---+-----+-------+--------+----------+------+---+--------------------+------+

Now I want to add the row_id value to the bt_string column. I mean like below:

If the bt_string of the 1st row is

1239912021030112091500000000000000, then add the corresponding row_id value.
In the case of the first row the result will be 1239912021030112091500000000000001.

The new column created should have values like below:

1239912021030112091500000000000001
1239912021030112091500000000000002
1239912021030112091500000000000003
1239912021030112091500000000000004
1239912021030112091500000000000005
1239912021030112091500000000000006
1239912021030112091500000000000007
1239912021030112091500000000000008

I also need to make sure that the length of the column is always 35 characters.

The string below should never exceed 35 characters in length:

con_string = job_id + sess_id + batch_id + time_now + '000000000000000' 

If it would exceed 35 characters, then the number of zeros added in the above statement needs to be trimmed.
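In other words, the zero padding should only ever bring the base string up to 35 characters, roughly like this plain-Python sketch of the rule:

base = job_id + sess_id + batch_id + time_now
con_string = base.ljust(35, '0')  # pads with zeros up to 35 characters; adds none once the base alone reaches 35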

How can I achieve this on the Spark data frame?

  • Perhaps something like: `df2['new_column'] = df.apply(lambda row: str(int(row['bt_string']) + row['row_id']))`? That is, convert to integers, add them, and then convert back to a string? Commented Mar 1, 2021 at 20:45

1 Answer


Follow the steps below to achieve your result:

# import necessary functions
import pyspark.sql.functions as f
from datetime import datetime
from pyspark.sql import Window

# assign variables as per requirement 
job_id = '123'
sess_id = '99'
batch_id = '1'
time_now = datetime.now().strftime('%Y%m%d%H%M%S')

# concatenate the variables to build the base string
con_string = job_id + sess_id + batch_id + time_now

# number of zeros needed to pad the base string to the 35-character limit
zero_to_add = 35 - len(con_string)

# append that many zeros to the base string
new_bt_string = con_string + zero_to_add * '0'

# add the new column, cast it to decimal(35,0) so it can be added to, then apply row_number
# (if your Spark version complains that row_number() needs an ordered window,
#  add an orderBy, e.g. Window.orderBy(f.monotonically_increasing_id()))
df1 = df.withColumn('bt_string', f.lit(new_bt_string).cast('decimal(35,0)'))\
    .withColumn("row_id", f.row_number().over(Window.partitionBy()))

# bt_id = bt_string + row_id; the decimal addition keeps the 35-digit width
df2 = df1.withColumn('bt_id', f.expr('bt_string + row_id'))
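If you need the final value back as a fixed-width 35-character string rather than a decimal, one possible follow-up (the column name bt_id_str is just an illustrative choice):

# optional: render bt_id as a zero-padded 35-character string
df3 = df2.withColumn('bt_id_str', f.lpad(f.col('bt_id').cast('string'), 35, '0'))
df3.select('bt_id', 'bt_id_str').show(truncate=False)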