I use a Bayesian scoring approach for a lot of my corporate sites. Some of the measures are a bit extreme, but I've found over time that most of the bot writers cut corners. That's not to say they haven't spent some time trying to figure out what's going on with my code.
I keep a score per user and a score per message. On some sites I'll also keep a score for an IP range by looking up the IP's registrant (via ARIN or a similar registry).
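Very roughly, the scoring is just a running sum of weighted evidence per signal. A minimal sketch of that idea (the signal names and probabilities below are made up for illustration; the real numbers come from watching your own logs):

```python
from math import log

# Illustrative only: P(signal | bot) vs P(signal | human) for a few signals.
SIGNAL_ODDS = {
    "name_has_digits":     (0.60, 0.02),
    "no_mx_record":        (0.70, 0.01),
    "country_ip_mismatch": (0.30, 0.10),   # weak signal, don't over-weight it
    "posted_too_fast":     (0.80, 0.05),
}

def spam_score(signals):
    """Sum of log-likelihood ratios; positive means 'looks like a bot'."""
    score = 0.0
    for name, present in signals.items():
        p_bot, p_human = SIGNAL_ODDS[name]
        if present:
            score += log(p_bot / p_human)
        else:
            score += log((1 - p_bot) / (1 - p_human))
    return score

# Example: a registration that trips two of the checks
print(spam_score({
    "name_has_digits": True,
    "no_mx_record": True,
    "country_ip_mismatch": False,
    "posted_too_fast": False,
}))
```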
To begin with, I set a timestamp. I pass it forward both in the clear and hashed with a salt, part of which I keep to myself (a private key) and part of which is generated on the fly (temporary). This way the hash changes with every post.
On the next page I verify the hash against the timestamp and check that the elapsed time is within a reasonable range for human interaction. If it is, the post is allowed to pass; if it's not, it obviously gets negative points.
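A minimal sketch of that round trip, assuming hidden form fields (the field handling, the private key, and the 4-second/1-hour window are illustrative, not my exact values):

```python
import hashlib, hmac, os, time

PRIVATE_KEY = b"my-private-part"            # the part I keep to myself

def issue_form_token():
    """Called when rendering the form: timestamp + throwaway salt + hash."""
    ts = str(int(time.time()))
    salt = os.urandom(8).hex()               # generated on the fly; keep it in the
                                              # session (or pass it forward) so the
                                              # handler can recompute the hash
    digest = hmac.new(PRIVATE_KEY, (ts + salt).encode(), hashlib.sha256).hexdigest()
    return ts, salt, digest                   # ts and digest go into hidden fields

def check_form_token(ts, salt, digest, min_seconds=4, max_seconds=3600):
    """Called by the handler: hash must match and the delay must look human."""
    expected = hmac.new(PRIVATE_KEY, (ts + salt).encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, digest):
        return -10                            # tampered with: heavy negative points
    elapsed = time.time() - int(ts)
    if elapsed < min_seconds or elapsed > max_seconds:
        return -5                             # too fast (or absurdly slow) for a human
    return 0
```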
On registration I look for things like (a rough sketch of these checks follows the list):
- phone numbers that don't contain numbers
- names that contain strings of numbers
- email addresses for domains that aren't registered (or have no MX record in DNS)
- I check their IP against their supplied country (this one can be a bit tricky, so don't score it too heavily in case they travel)
- websites that aren't registered: fake domains, or generic ones like google, yahoo, or facebook given without a path
- If they set up their profile with an offsite link to somewhere I don't want, or that is off topic, they get a negative score
- Additionally I block one-time-use e-mail providers and several of the free providers that don't have a clear policy about abuse.
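A rough sketch of a few of those registration checks (the penalty weights, the disposable-domain list, and the dnspython dependency are just placeholders):

```python
import re
import dns.resolver   # third-party: dnspython, used here for the MX lookup

DISPOSABLE_DOMAINS = {"mailinator.com", "guerrillamail.com"}   # examples only

def registration_penalties(name, phone, email, website=None):
    """Return a list of (reason, penalty) pairs; the weights are illustrative."""
    penalties = []
    if phone and not re.search(r"\d", phone):
        penalties.append(("phone_without_digits", -3))
    if re.search(r"\d{3,}", name):
        penalties.append(("name_with_digit_run", -2))
    domain = email.rsplit("@", 1)[-1].lower()
    if domain in DISPOSABLE_DOMAINS:
        penalties.append(("disposable_email", -5))
    try:
        dns.resolver.resolve(domain, "MX")
    except Exception:                          # NXDOMAIN, no answer, timeout, ...
        penalties.append(("no_mx_record", -4))
    if website and re.match(r"https?://(www\.)?(google|yahoo|facebook)\.com/?$", website):
        penalties.append(("generic_homepage_as_website", -2))
    return penalties
```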
In posts and in their profiles:
- I run a dictionary check on posted content to catch variations of known spam words. You have to be careful, because an unanchored regex like !cialis! will also match "specialist" (see the sketch after this list).
- If a post is very long, I'll compare its length against the amount of time the person has been on the site (tracked with a cookie whether or not they're logged in) and the average typing speed for the industry. (Tech people type faster.)
- I don't trust posters who are new to the site
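Here's the kind of word-boundary and obfuscation handling I mean for the dictionary check (the word list and the leet-speak map are only examples):

```python
import re

SPAM_WORDS = ["cialis", "viagra"]             # the real list is much longer

def spam_word_pattern(word):
    """Anchor on word boundaries and allow common obfuscations (c1al1s, c.i.a.l.i.s),
    so "specialist" no longer matches but thin disguises still do."""
    leet = {"a": "[a@4]", "e": "[e3]", "i": "[i1!|]", "o": "[o0]", "s": "[s5$]"}
    fuzzy = r"[\W_]*".join(leet.get(c, re.escape(c)) for c in word)
    return re.compile(r"\b" + fuzzy + r"\b", re.IGNORECASE)

PATTERNS = [(w, spam_word_pattern(w)) for w in SPAM_WORDS]

def spam_word_hits(text):
    return [w for w, p in PATTERNS if p.search(text)]

print(spam_word_hits("I saw a specialist"))    # [] -- no false positive
print(spam_word_hits("cheap C1alis here"))     # ['cialis']
```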
Overall
- I maintain a whitelist and a blacklist. The whitelist still gets checked, in case a trusted account has been exploited.
- I check referrers on the form handlers. If the request didn't come through the site (carrying the preceding page's hash), the comment is flushed.
- I check whether their SESSION_ID is in the sessions table. If it's made up, they're flushed.
- I check for user agents that are empty or that identify themselves as things like cURL or Java.
- I also check the content's ratio of vowels to consonants. If it doesn't fall within a normal range, the post is flagged (see the sketch after this list).
- I check for words of enormous length without spaces. (This depends on the accepted languages)
- And obviously things like semicolons followed by an exploit (e.g. ; UNION).
- After all of that I do things like set up honeypots: forms and scripts in directories that aren't visually linked anywhere, so only someone reading the code or mining for forms would find them. Any address that posts to one of those hidden forms goes into a database (replicated to my other sites) of known bad offenders.
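The vowel-ratio and word-length checks are trivial to sketch (the thresholds here are rough guesses and should be tuned for the languages you accept):

```python
import re

def content_flags(text, max_word_len=40):
    """Cheap content sanity checks; thresholds are rough, tune them per language."""
    flags = []
    letters = [c for c in text.lower() if c.isalpha()]
    if letters:
        vowel_ratio = sum(c in "aeiou" for c in letters) / len(letters)
        # English prose tends to land somewhere around 35-45% vowels;
        # keyboard mash and base64-ish junk usually falls well outside that.
        if not 0.25 <= vowel_ratio <= 0.55:
            flags.append("unusual_vowel_ratio")
    if any(len(w) > max_word_len for w in re.split(r"\s+", text)):
        flags.append("enormous_unbroken_word")
    return flags

print(content_flags("qwrtpsdfghjklzxcvbnm " * 3))   # ['unusual_vowel_ratio']
print(content_flags("x" * 60))                      # trips both checks
```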
Obviously blocking IPs can produce false positives. For instance, about 10 years ago, while at one of the firms I worked with, I penetrated a competitor's site with an unclosed <marquee> tag that made the content on their page walk off the screen. They blocked all traffic from the firm's dynamic IP; luckily we were using ADSL for that line and were smart enough not to use our fixed ISDN line. We had been using a secondary marketing approach to review their content when I noticed the exploit on a hunch. (I told their programmer how to fix it.)
If you're trying to deal with a botnet, that's a whole other animal, but most of the practices I've outlined above will still apply.
- Check for proxies if the malicious submissions are only coming in through proxies.
- Check for random IPs from around the world hitting the same account within an unreasonably short window (as if someone has shared a login with a botnet); see the sketch after this list.
- Try to use unique identifiers in cookies and URLs to track people's movements throughout the site.
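For the shared-login case, the check is basically "too many countries for one account in too short a window". A sketch, assuming the login log already carries a country per IP (the window and country threshold are arbitrary):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def flag_shared_logins(events, window=timedelta(hours=1), max_countries=2):
    """events: iterable of (username, country, timestamp) from the login log.
    Flags accounts seen from too many countries inside a short window, which
    usually means a shared login or credentials fed to a botnet."""
    by_user = defaultdict(list)
    for user, country, ts in events:
        by_user[user].append((ts, country))
    flagged = set()
    for user, seen in by_user.items():
        seen.sort()
        for i, (start, _) in enumerate(seen):
            countries = {c for t, c in seen[i:] if t - start <= window}
            if len(countries) > max_countries:
                flagged.add(user)
                break
    return flagged

events = [
    ("alice", "US", datetime(2024, 1, 1, 12, 0)),
    ("alice", "RU", datetime(2024, 1, 1, 12, 10)),
    ("alice", "BR", datetime(2024, 1, 1, 12, 20)),
    ("bob",   "US", datetime(2024, 1, 1, 12, 0)),
]
print(flag_shared_logins(events))   # {'alice'}
```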
Normal posters will have a definite, noticeable pattern to their habits. Non-normal posters do as well. You just have to be diligent and spend a little time tracking their habits in databases.
For me, I always notice the patterns when I'm looking at a MySQL log of sessions, IPs, and traffic. I can usually catch the botnet setup session, then expect the worst. Most of the time it's script kiddies.
If you're worried about brute force, run something like iptables or apf. It will also stop some of the faster bots that try to post multiple pages at the same time, or the ones that post over and over to the same form without visiting any other pages.
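iptables/apf handle that at the network level; if you also want a crude guard inside the application itself, a sliding-window counter per IP does the same kind of thing (the limits below are arbitrary, and a real setup would keep the counters in the database or a shared cache rather than in memory):

```python
import time
from collections import defaultdict, deque

class PostRateLimiter:
    """Crude sliding-window limiter: reject an IP that posts more than
    `max_posts` times in `window` seconds."""
    def __init__(self, max_posts=5, window=60):
        self.max_posts = max_posts
        self.window = window
        self.hits = defaultdict(deque)

    def allow(self, ip):
        now = time.time()
        q = self.hits[ip]
        while q and now - q[0] > self.window:   # drop hits outside the window
            q.popleft()
        if len(q) >= self.max_posts:
            return False                         # too many posts too quickly
        q.append(now)
        return True
```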