I have a PySpark dataframe which looks like:
+-----+-------------------+
|port |timestamp          |
+-----+-------------------+
|9200 |2020-06-19 02:12:41|
|9200 |2020-06-19 03:54:23|
|51   |2020-06-19 05:32:11|
|22   |2020-06-20 06:07:43|
|22   |2020-06-20 01:11:12|
|51   |2020-06-20 07:38:49|
+-----+-------------------+
I'm trying to find the number of times each distinct port is used per day.
For example, the resulting dataframe should look like this:
+-----------+----------------+
|window     |ports           |
+-----------+----------------+
|2020-06-19 |{9200: 2, 51: 1}|
|2020-06-20 |{22: 2, 51: 1}  |
+-----------+----------------+
The result definitely does not need to be stored in a dictionary; I'm just not sure how the output should be structured to capture all of the ports for each day.
I've currently tried the following:
from pyspark.sql.functions import window, count

df.groupBy(window(df['timestamp'], "1 day")).agg(count('port'))
which results in:
+-----------+-----------+
|window     |count(port)|
+-----------+-----------+
|2020-06-19 |3          |
|2020-06-20 |3          |
+-----------+-----------+
This is not what I'm looking for, as it only counts the total number of port entries per day and does not break the count down by distinct port.
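My rough guess is that I first need to group by both the day and the port, and then collect those per-port counts into a map per day. Something like this untested sketch (assuming Spark 2.4+ for map_from_entries, and using to_date instead of window since I only need calendar days), but I'm not sure it's the idiomatic approach:

from pyspark.sql import functions as F

# Step 1: count how many times each port appears on each day
per_port = (
    df.groupBy(F.to_date("timestamp").alias("window"), "port")
      .agg(F.count("*").alias("uses"))
)

# Step 2: collapse the per-port rows into a single map column per day
result = (
    per_port.groupBy("window")
            .agg(F.map_from_entries(
                F.collect_list(F.struct("port", "uses"))).alias("ports"))
)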
Either a PySpark or a pandas answer would work for me.
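For comparison, in pandas I'd imagine something like the following untested sketch (assuming the data fits in driver memory), though staying in PySpark would be preferred:

import pandas as pd

pdf = df.toPandas()
pdf["day"] = pd.to_datetime(pdf["timestamp"]).dt.date

# One {port: count} dict per day
counts = pdf.groupby("day")["port"].apply(lambda s: s.value_counts().to_dict())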