
I'm preparing a dataset to train a machine learning model using PySpark. The DataFrame I'm working on contains thousands of records about presences registered in rooms of different buildings and cities on different days. The presences are in this format:

+----+--------+----+---+-----+------+--------+-------+---------+
|room|building|city|day|month|inHour|inMinute|outHour|outMinute|
+----+--------+----+---+-----+------+--------+-------+---------+
|   1|       1|   1|  9|   11|     8|      27|     13|       15|
|   1|       1|   1|  9|   11|     8|      28|     13|        5|
|   1|       1|   1|  9|   11|     8|      32|     13|        7|
|   1|       1|   1|  9|   11|     8|      32|      8|       50|
|   1|       1|   1|  9|   11|     8|      32|      8|       48|
+----+--------+----+---+-----+------+--------+-------+---------+

inHour and inMinute stand for the hour and minute of access; likewise, outHour and outMinute refer to the time of exit. Hours are in 0-23 format. All columns contain only integer values.

What I'm missing is the target value for my machine learning model, which is the number of people for each combination of room, building, city, day, month and a time interval. To explain better: the first row refers to a presence with access time 8:27 and exit time 13:15, so it should be counted in the records for the intervals 8-9, 9-10, 10-11, 11-12, 12-13 and also 13-14. What I want to accomplish is something like the following:

+----+--------+----+---+-----+------+-------+-----+
|room|building|city|day|month|timeIn|timeOut|count|
+----+--------+----+---+-----+------+-------+-----+
|   1|       1|   1|  9|   11|     8|      9|    X|
|   1|       1|   1|  9|   11|     9|     10|    X|
|   1|       1|   1|  9|   11|    10|     11|    X|
|   1|       1|   1|  9|   11|    11|     12|    X|
|   1|       1|   1|  9|   11|    12|     13|    X|
+----+--------+----+---+-----+------+-------+-----+

So the 4th row of the first table should be counted in the 1st row of this table and so on...
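Here's a minimal snippet to reproduce the sample data above (assuming a local SparkSession; values taken from the first table):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample presence records from the first table
df = spark.createDataFrame(
    [
        (1, 1, 1, 9, 11, 8, 27, 13, 15),
        (1, 1, 1, 9, 11, 8, 28, 13, 5),
        (1, 1, 1, 9, 11, 8, 32, 13, 7),
        (1, 1, 1, 9, 11, 8, 32, 8, 50),
        (1, 1, 1, 9, 11, 8, 32, 8, 48),
    ],
    ['room', 'building', 'city', 'day', 'month',
     'inHour', 'inMinute', 'outHour', 'outMinute'],
)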

1 Answer

You can explode a sequence of hours (e.g. the first row would produce [8, 9, 10, 11, 12, 13]), group by the hour (and the other columns), and take the count for each group. Here hour refers to timeIn. It's not necessary to include timeOut in the result DataFrame because it's always timeIn + 1.

import pyspark.sql.functions as F

df2 = df.withColumn(
    'hour',
    # one row per hour interval the person was present in,
    # e.g. inHour=8, outHour=13 -> [8, 9, 10, 11, 12, 13]
    F.explode(F.sequence('inHour', 'outHour'))
).groupBy(
    'room', 'building', 'city', 'day', 'month', 'hour'
).count().orderBy('hour')

df2.show()
+----+--------+----+---+-----+----+-----+
|room|building|city|day|month|hour|count|
+----+--------+----+---+-----+----+-----+
|   1|       1|   1|  9|   11|   8|    5|
|   1|       1|   1|  9|   11|   9|    3|
|   1|       1|   1|  9|   11|  10|    3|
|   1|       1|   1|  9|   11|  11|    3|
|   1|       1|   1|  9|   11|  12|    3|
|   1|       1|   1|  9|   11|  13|    3|
+----+--------+----+---+-----+----+-----+
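
If you do need the exact timeIn/timeOut schema from the question, timeOut can be derived afterwards, since it's always timeIn + 1. A possible sketch on top of df2 (column names as in the question):

df3 = (
    df2.withColumnRenamed('hour', 'timeIn')
    .withColumn('timeOut', F.col('timeIn') + 1)  # each interval ends one hour after it starts
    .select('room', 'building', 'city', 'day', 'month',
            'timeIn', 'timeOut', 'count')
)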

