How can a Bayesian network be used to perform attack analysis of web traffic? I read an interesting article on its application, "Web Application Defense with Bayesian Attack Analysis", but I was not clear on its methodology.
2 Answers
Bayesian networks are a form of probabilistic model, in which a set of observed conditions is used to estimate the probability that some assertion is true or false.
For example, let's suppose there are two conditions we are using to predict whether a disk is dying.
- Disk writes are slow.
- Traffic is abnormally high.
We might formulate the probabilities like this:
Slow Writes | High Traffic | P(Failing = T) | P(Failing = F)
------------|--------------|----------------|---------------
F           | F            | 0.05           | 0.95
F           | T            | 0.01           | 0.99
T           | F            | 0.90           | 0.10
T           | T            | 0.45           | 0.55
This can be interpreted as follows:
- If writes are not slow, and traffic is not high, there is a 0.95 probability that the disk is not failing.
- If writes are not slow, and traffic is high, there is a 0.99 probability that the disk is not failing.
- If writes are slow, and traffic is not high, there is a 0.90 probability that the disk is failing.
- If writes are slow, and traffic is high, there is a 0.55 probability that the disk is not failing.
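The table above is just a lookup from evidence to a probability. A minimal sketch in Python (the variable names are my own, not from any particular library):

```python
# Conditional probability table from the example above,
# keyed by (slow_writes, high_traffic) -> P(disk failing).
cpt = {
    (False, False): 0.05,
    (False, True):  0.01,
    (True,  False): 0.90,
    (True,  True):  0.45,
}

def p_disk_failing(slow_writes: bool, high_traffic: bool) -> float:
    """Return P(failing | evidence); P(not failing) is the complement."""
    return cpt[(slow_writes, high_traffic)]

print(p_disk_failing(True, False))   # 0.90
print(1 - p_disk_failing(False, True))  # 0.99, i.e. probably not failing
```

In a real Bayesian network each node has a table like this conditioned on its parent nodes, and inference combines them; the lookup here is the degenerate single-node case.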
We can take this type of model and apply self-learning to it. We take a large set of data where the outcomes are known, and use this to build a probability model.
For example, in a network we might have tests such as:
- >10Mbps TCP traffic coming from internet?
- >10Mbps UDP traffic coming from internet?
- >5 failed logins to RDP in last 10 minutes?
- >5 failed logins to SSH in last 10 minutes?
- Number of TCP connections to SQL server is >5?
- Number of TCP connections to HTTP server is >500?
- Firewall has logged >100 events in the last minute?
- Time is currently within office (9-5) hours?
- etc.
We run these tests over a set of known traffic on the network, and inform the model of times where a breach has or has not been attempted. It can then check which tests were most likely to correlate with a particular type of target event, and build a probability model like we did above.
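Building the probability table from labelled traffic reduces to counting: for each combination of test outcomes, how often did a breach occur? A hedged sketch with made-up observations (the data and function names are illustrative, not from the article):

```python
from collections import Counter

# Hypothetical labelled history: (tuple of test outcomes, was it a breach?).
# Here each tuple is just (high_tcp_traffic, many_failed_ssh_logins).
observations = [
    ((True, False), True),
    ((True, False), True),
    ((True, False), False),
    ((False, False), False),
    ((False, False), False),
    ((False, True), True),
]

def estimate_cpt(observations):
    """Estimate P(breach | test outcomes) by counting co-occurrences."""
    totals, breaches = Counter(), Counter()
    for tests, breach in observations:
        totals[tests] += 1
        if breach:
            breaches[tests] += 1
    return {tests: breaches[tests] / totals[tests] for tests in totals}

cpt = estimate_cpt(observations)
print(cpt[(True, False)])  # 2 breaches in 3 observations -> ~0.67
```

With more tests and more data the same counting idea applies, though in practice smoothing is needed for rare combinations of test outcomes.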
When we detect further breaches, we tell the model "this was a breach", and it can attempt to improve its model. We can also tell it when it falsely alerted us to a breach.
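That feedback loop can be sketched by keeping running counts and updating them as each confirmed breach or false alarm is reported. This is my own minimal illustration, with add-one (Laplace) smoothing so unseen test combinations do not divide by zero:

```python
class OnlineCPT:
    """Running counts of breaches per test-outcome combination,
    updated incrementally as analyst feedback arrives."""

    def __init__(self):
        self.totals = {}    # times each combination was seen
        self.breaches = {}  # times it coincided with a breach

    def update(self, tests, was_breach):
        """Record feedback: confirm a breach, or correct a false alert."""
        self.totals[tests] = self.totals.get(tests, 0) + 1
        if was_breach:
            self.breaches[tests] = self.breaches.get(tests, 0) + 1

    def p_breach(self, tests):
        # Laplace smoothing: unseen combinations get 1/2, not 0/0.
        return (self.breaches.get(tests, 0) + 1) / (self.totals.get(tests, 0) + 2)

model = OnlineCPT()
model.update((True, True), True)    # confirmed breach
model.update((True, False), False)  # false alert, corrected
```

Each correction shifts the estimate, so the model gradually tracks what actually predicts breaches on this particular network.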
These models can become extremely complex when dealing with large numbers of questions and huge datasets, especially when the model contains test questions that are fed from sub-models, or other forms of analysis. As such, they can provide an excellent pattern-matching approach to intrusion detection.
It classifies the requests / packets / messages into good and bad ones. The classification is based on a database built during a training process, so you need to classify the initial batch yourself; after that it continues self-learning.
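One common concrete form of such a classifier (as in Bayesian spam filtering; the article may use something more elaborate) is naive Bayes over tokens of the request. A self-contained sketch, with hand-labelled training requests as the "initial batch":

```python
import math
from collections import Counter

class NaiveBayes:
    """Naive Bayes classifier over whitespace-separated request tokens."""

    def __init__(self):
        self.token_counts = {"good": Counter(), "bad": Counter()}
        self.class_counts = Counter()

    def train(self, request, label):
        """Add one hand-labelled request ('good' or 'bad') to the database."""
        self.class_counts[label] += 1
        self.token_counts[label].update(request.split())

    def classify(self, request):
        """Return the label whose tokens best explain the request."""
        scores = {}
        for label in ("good", "bad"):
            total = sum(self.token_counts[label].values())
            # Work in log space with add-one smoothing to avoid zeros.
            score = math.log(self.class_counts[label] + 1)
            for tok in request.split():
                score += math.log((self.token_counts[label][tok] + 1) / (total + 2))
            scores[label] = score
        return max(scores, key=scores.get)

nb = NaiveBayes()
nb.train("GET /index.html", "good")
nb.train("GET /home", "good")
nb.train("GET /admin.php?id=1' OR '1'='1", "bad")
print(nb.classify("GET /admin.php?id=1' OR '1'='1"))  # bad
```

Once the initial batch is labelled, each newly classified (and analyst-confirmed) request can be fed back through `train`, which is the self-learning the answer describes.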