0

I have a DF as below:

Name    city       starttime               endtime
user1   London      2019-08-02 03:34:45   2019-08-02 03:52:03
user2   Boston      2019-08-13 13:34:10   2019-08-13 15:02:10

I would like to check the endtime and if it crosses into the next hour then update the current record with the last minute/second of current hour and append another row or rows with similar data as shown below(user2). Do I use flapmap or convert the DF to RDD and use map or is another way?

Name    city     starttime               endtime
user1   London   2019-08-02 03:34:45   2019-08-02 03:52:03
user2   Boston   2019-08-13 13:34:10   2019-08-13 13:59:59
user2   Boston   2019-08-13 14:00:00   2019-08-13 14:59:59
user2   Boston   2019-08-13 15:00:00   2019-08-13 15:02:10

Thanks

1

1 Answer 1

1
 >>> from pyspark.sql.functions  import *
 >>> df.show()
    +-----+------+-------------------+-------------------+
    | Name|  city|          starttime|            endtime|
    +-----+------+-------------------+-------------------+
    |user1|London|2019-08-02 03:34:45|2019-08-02 03:52:03|
    |user2|Boston|2019-08-13 13:34:10|2019-08-13 15:02:10|
    +-----+------+-------------------+-------------------+

>>> df1 = df.withColumn("diff", ((hour(col("endtime")) - hour(col("starttime")))).cast("Int"))
            .withColumn("loop", expr("split(repeat(':', diff),':')"))
            .select(col("*"), posexplode(col("loop")).alias("pos", "value"))
            .drop("value", "loop")

>>> df1.withColumn("starttime", when(col("pos") == 0, col("starttime")).otherwise(from_unixtime(unix_timestamp(col("starttime")) + (col("pos") * 3600) - minute(col("starttime"))*60 - second(col("starttime")))))
       .withColumn("endtime", when((col("diff") - col("pos")) == 0, col("endtime")).otherwise(from_unixtime(unix_timestamp(col("endtime")) - ((col("diff") - col("pos")) * 3600) - minute(col("endtime"))*60 - second(col("endtime")) + lit(59) * lit(60) + lit(59))))
       .drop("diff", "pos")
       .show()
    +-----+------+-------------------+-------------------+
    | Name|  city|          starttime|            endtime|
    +-----+------+-------------------+-------------------+
    |user1|London|2019-08-02 03:34:45|2019-08-02 03:52:03|
    |user2|Boston|2019-08-13 13:34:10|2019-08-13 13:59:59|
    |user2|Boston|2019-08-13 14:00:00|2019-08-13 14:59:59|
    |user2|Boston|2019-08-13 15:00:00|2019-08-13 15:02:10|
    +-----+------+-------------------+-------------------+
Sign up to request clarification or add additional context in comments.

3 Comments

Thanks. I am still learning use of expr, split, repeat and posexplode. Will research and try to comprehend how the logic constructed.
ok, I ran the code and it works great. 1) What is point in using expr(split...........). It doesn't seem to be adding any value. 2) In the 1st line, why are you casting to decimal(14,2)? Why not just Integer? Just trying to understand the thinking process.
@NetRocks Yes you can cast it to Int, actually in my first solution i was checking of minute difference which was coming in Decimal after it I had changed logic but forgot to replace cast. I am using expr(split... instead of array_repeat, array_repeat also work in python but not with dynamic count so I had used repeat which give simple string and using split to get list so that we can explode it. You can not explode string. : -)

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.