I have a DataFrame of connection-log events with two columns, Time and targetIP. Each record is one connection event to a system: targetIP is the target IP address of that connection, and Time is when it happened. Sample values:
| Time | targetIP |
|---|---|
| 1 | 192.163.0.1 |
| 2 | 192.163.0.2 |
| 3 | 192.163.0.1 |
| 5 | 192.163.0.1 |
| 6 | 192.163.0.2 |
| 7 | 192.163.0.2 |
| 8 | 192.163.0.2 |
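For reference, the sample DataFrame above can be constructed like this (assuming a running SparkSession):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample connection-log data from the table above
df = spark.createDataFrame(
    [(1, "192.163.0.1"), (2, "192.163.0.2"), (3, "192.163.0.1"),
     (5, "192.163.0.1"), (6, "192.163.0.2"), (7, "192.163.0.2"),
     (8, "192.163.0.2")],
    ["Time", "targetIP"],
)
```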
I want to add a new column: for each row, the count of connections to that row's target IP among the past 3 connections (by event order, not by Time value). So the result DataFrame should be:
| Time | targetIP | count |
|---|---|---|
| 1 | 192.163.0.1 | 0 |
| 2 | 192.163.0.2 | 0 |
| 3 | 192.163.0.1 | 1 |
| 5 | 192.163.0.1 | 2 |
| 6 | 192.163.0.2 | 1 |
| 7 | 192.163.0.2 | 1 |
| 8 | 192.163.0.2 | 2 |
For example, at Time=8 the targetIP is 192.163.0.2. The past 3 connections are Time=5, Time=6, and Time=7, and of those, Time=6 and Time=7 also have targetIP 192.163.0.2. So the count for Time=8 is 2.
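To make the rule precise (note that the window is the previous 3 rows, not a range of Time values, since Time=4 is missing), here is a plain-Python sketch that reproduces the expected count column from the sample rows:

```python
rows = [(1, "192.163.0.1"), (2, "192.163.0.2"), (3, "192.163.0.1"),
        (5, "192.163.0.1"), (6, "192.163.0.2"), (7, "192.163.0.2"),
        (8, "192.163.0.2")]

counts = []
for i, (_, ip) in enumerate(rows):
    # the previous 3 connection events, by position in the log
    window = rows[max(0, i - 3):i]
    counts.append(sum(1 for _, prev_ip in window if prev_ip == ip))

print(counts)  # [0, 0, 1, 2, 1, 1, 2]
```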
My idea was to add a sequential ID column to this DataFrame:
| ID | Time | targetIP |
|---|---|---|
| 1 | 1 | 192.163.0.1 |
| 2 | 2 | 192.163.0.2 |
| 3 | 3 | 192.163.0.1 |
| 4 | 5 | 192.163.0.1 |
| 5 | 6 | 192.163.0.2 |
| 6 | 7 | 192.163.0.2 |
| 7 | 8 | 192.163.0.2 |
and then use a window function:
```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# For each row, count rows with the same targetIP whose ID falls
# in [current ID - 3, current ID - 1], i.e. the past 3 connections.
w = Window.partitionBy("targetIP").orderBy(F.col("ID").cast("int")).rangeBetween(-3, -1)
df1 = df.withColumn("count", F.count("*").over(w)).orderBy("ID")
```
The problem is that monotonically_increasing_id() only guarantees monotonically increasing values, not consecutive ones, so rangeBetween(-3, -1) does not reliably cover the past 3 rows. Is there a way to do this without relying on an ID column?
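For completeness, the only workaround I can think of is to generate a truly consecutive ID with row_number() over an un-partitioned window. This sketch gives the right answer on the sample data, but the un-partitioned window pulls all rows into a single partition, which is exactly what I would like to avoid on large data:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# row_number() over an un-partitioned window yields consecutive IDs 1..N,
# but forces all data into one partition (Spark will warn about this).
w_id = Window.orderBy("Time")
df_id = df.withColumn("ID", F.row_number().over(w_id))

# With a consecutive ID, the rangeBetween approach from above works.
w = Window.partitionBy("targetIP").orderBy("ID").rangeBetween(-3, -1)
result = df_id.withColumn("count", F.count("*").over(w)).orderBy("ID")
```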
Thank you.