I have a data frame like the one below in PySpark:
df = sqlContext.createDataFrame(
[(1,'Y','Y',0,0,0,2,'Y','N','Y','Y'),
(2,'N','Y',2,1,2,3,'N','Y','Y','N'),
(3,'Y','N',3,1,0,0,'N','N','N','N'),
(4,'N','Y',5,0,1,0,'N','N','N','Y'),
(5,'Y','N',2,2,0,1,'Y','N','N','Y'),
(6,'Y','Y',0,0,3,6,'Y','N','Y','N'),
(7,'N','N',1,1,3,4,'N','Y','N','Y'),
(8,'Y','Y',1,1,2,0,'Y','Y','N','N')
],
('id', 'compatible', 'product', 'ios', 'pc', 'other', 'devices', 'customer', 'subscriber', 'circle', 'smb')
)
Now I want to create a new column, bt_string, in the data frame by concatenating some strings. This is what I have done so far:
import pyspark.sql.functions as f
from datetime import datetime
from pyspark.sql import Window
# the below values will change as per requirement
job_id = '123'
sess_id = '99'
batch_id = '1'
time_now = datetime.now().strftime('%Y%m%d%H%M%S')
con_string = job_id + sess_id + batch_id + time_now + '000000000000000'
df1 = df.withColumn('bt_string', f.lit(con_string))
Next, I want to assign a unique number to each row of the data frame. I applied the row_number function as follows:
# row_number() needs an ordered window, so the rows are ordered by id here
df2 = df1.withColumn("row_id", f.row_number().over(Window.orderBy("id")))
The output is:
df2.show()
+---+----------+-------+---+---+-----+-------+--------+----------+------+---+--------------------+------+
| id|compatible|product|ios| pc|other|devices|customer|subscriber|circle|smb|           bt_string|row_id|
+---+----------+-------+---+---+-----+-------+--------+----------+------+---+--------------------+------+
|  1|         Y|      Y|  0|  0|    0|      2|       Y|         N|     Y|  Y|12399120210301120...|     1|
|  2|         N|      Y|  2|  1|    2|      3|       N|         Y|     Y|  N|12399120210301120...|     2|
|  3|         Y|      N|  3|  1|    0|      0|       N|         N|     N|  N|12399120210301120...|     3|
|  4|         N|      Y|  5|  0|    1|      0|       N|         N|     N|  Y|12399120210301120...|     4|
|  5|         Y|      N|  2|  2|    0|      1|       Y|         N|     N|  Y|12399120210301120...|     5|
|  6|         Y|      Y|  0|  0|    3|      6|       Y|         N|     Y|  N|12399120210301120...|     6|
|  7|         N|      N|  1|  1|    3|      4|       N|         Y|     N|  Y|12399120210301120...|     7|
|  8|         Y|      Y|  1|  1|    2|      0|       Y|         Y|     N|  N|12399120210301120...|     8|
+---+----------+-------+---+---+-----+-------+--------+----------+------+---+--------------------+------+
Now I want to add the row_id value to the bt_string value of the same row. For example, if the bt_string of the first row is
1239912021030112091500000000000000
then adding the corresponding row_id should give
1239912021030112091500000000000001
The new column should have values like these:
1239912021030112091500000000000001
1239912021030112091500000000000002
1239912021030112091500000000000003
1239912021030112091500000000000004
1239912021030112091500000000000005
1239912021030112091500000000000006
1239912021030112091500000000000007
1239912021030112091500000000000008
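One possible way to get values like these is sketched below. It assumes bt_string stays a string (a 35-digit number does not fit in a 64-bit integer), so instead of adding numbers it appends row_id, left-padded with zeros, to the non-zero part of the string. The names `padded_id`, `with_bt_id` and `prefix` are hypothetical and not part of my code above.

```python
def padded_id(prefix, row_id, total_len=35):
    """Pure-Python model of the per-row result: prefix + zero-padded row_id."""
    width = total_len - len(prefix)
    return prefix + str(row_id).zfill(width)


def with_bt_id(df, prefix, total_len=35):
    """Overwrite bt_string with prefix + row_id left-padded with zeros (a sketch)."""
    # Imported here so padded_id() can be tried without a Spark installation.
    import pyspark.sql.functions as f

    # Assumes prefix is shorter than total_len, so width is positive.
    width = total_len - len(prefix)
    return df.withColumn(
        'bt_string',
        f.concat(
            f.lit(prefix),
            # lpad pads row_id on the left with '0' up to `width` characters
            f.lpad(f.col('row_id').cast('string'), width, '0'),
        ),
    )
```

With the variables from the question, this would be called as `df3 = with_bt_id(df2, job_id + sess_id + batch_id + time_now)`, i.e. with the prefix before any zeros are appended.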
I also need to make sure that the value in this column is always exactly 35 characters long.
The string below must never exceed 35 characters:
con_string = job_id + sess_id + batch_id + time_now + '000000000000000'
If it would exceed 35 characters, the number of zeros appended in the statement above needs to be reduced.
How can I achieve this?
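To make the length requirement concrete, here is a minimal sketch of how the number of zeros could be computed from the other parts instead of being hard-coded. `build_con_string` and `TOTAL_LEN` are hypothetical names, and raising an error when the parts alone are already too long is an assumption about the desired behaviour.

```python
TOTAL_LEN = 35  # required length of the final string


def build_con_string(job_id, sess_id, batch_id, time_now, total_len=TOTAL_LEN):
    """Join the parts and append only as many zeros as fit in total_len."""
    prefix = job_id + sess_id + batch_id + time_now
    n_zeros = total_len - len(prefix)
    if n_zeros < 0:
        # Assumption: trimming zeros cannot help here, so fail loudly.
        raise ValueError('parts are already longer than %d characters' % total_len)
    return prefix + '0' * n_zeros
```

For example, `build_con_string('123', '99', '1', '20210301120915')` appends 15 zeros, while a 5-character job_id would leave room for only 13. The zeros that remain also have to cover the digits of the largest row_id.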