I use a Bayesian scoring approach for a lot of my corporate sites. Some of the measures are a bit extreme, but I've found over time that most of the bot writers cut corners. That's not to say they haven't spent some time trying to figure out what's going on with my code.
I keep a score per user and a score per message. On some sites I'll also keep a score for an IP range by looking up the IP's registrant (via ARIN or a similar registry).
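Very roughly, the scoring is just a running sum of weighted evidence per signal. A minimal sketch of that idea (the signal names and probabilities below are made up for illustration; the real numbers come from watching your own logs):

```python
from math import log

# Illustrative only: P(signal | bot) vs P(signal | human) for a few signals.
SIGNAL_ODDS = {
    "name_has_digits":     (0.60, 0.02),
    "no_mx_record":        (0.70, 0.01),
    "country_ip_mismatch": (0.30, 0.10),   # weak signal, don't over-weight it
    "posted_too_fast":     (0.80, 0.05),
}

def spam_score(signals):
    """Sum of log-likelihood ratios; positive means 'looks like a bot'."""
    score = 0.0
    for name, present in signals.items():
        p_bot, p_human = SIGNAL_ODDS[name]
        if present:
            score += log(p_bot / p_human)
        else:
            score += log((1 - p_bot) / (1 - p_human))
    return score

# Example: a registration that trips two of the checks
print(spam_score({
    "name_has_digits": True,
    "no_mx_record": True,
    "country_ip_mismatch": False,
    "posted_too_fast": False,
}))
```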
To begin with, I set a timestamp. I pass it forward both in the clear and hashed with a salt, part of which I keep to myself (a private key) and part of which is generated on the fly (temporary). This way the hash changes with every post.
On the next page I verify the hash against the timestamp and check that the elapsed time is within a reasonable range for human interaction. If it is, the post is allowed to pass; if it's not, it obviously gets negative points.
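A minimal sketch of that round trip, assuming hidden form fields (the field handling, the private key, and the 4-second/1-hour window are illustrative, not my exact values):

```python
import hashlib, hmac, os, time

PRIVATE_KEY = b"my-private-part"            # the part I keep to myself

def issue_form_token():
    """Called when rendering the form: timestamp + throwaway salt + hash."""
    ts = str(int(time.time()))
    salt = os.urandom(8).hex()               # generated on the fly; keep it in the
                                              # session (or pass it forward) so the
                                              # handler can recompute the hash
    digest = hmac.new(PRIVATE_KEY, (ts + salt).encode(), hashlib.sha256).hexdigest()
    return ts, salt, digest                   # ts and digest go into hidden fields

def check_form_token(ts, salt, digest, min_seconds=4, max_seconds=3600):
    """Called by the handler: hash must match and the delay must look human."""
    expected = hmac.new(PRIVATE_KEY, (ts + salt).encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, digest):
        return -10                            # tampered with: heavy negative points
    elapsed = time.time() - int(ts)
    if elapsed < min_seconds or elapsed > max_seconds:
        return -5                             # too fast (or absurdly slow) for a human
    return 0
```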
On registration I look for things like (a rough sketch of these checks follows the list):
- phone numbers that don't contain numbers
- names that contain strings of numbers
- email addresses for domains that aren't registered (or have no MX record in DNS)
- I check their IP against their supplied country (this one can be a bit tricky, so don't score it too heavily in case they travel)
- websites that aren't registered: fake domains, or generic ones like google, yahoo, or facebook given without a path
- If they set up their profile with an offsite link to somewhere I don't want, or that is off topic, they get a negative score
- Additionally I block one-time-use e-mail providers and several of the free providers that don't have a clear policy about abuse.
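A rough sketch of a few of those registration checks (the penalty weights, the disposable-domain list, and the dnspython dependency are just placeholders):

```python
import re
import dns.resolver   # third-party: dnspython, used here for the MX lookup

DISPOSABLE_DOMAINS = {"mailinator.com", "guerrillamail.com"}   # examples only

def registration_penalties(name, phone, email, website=None):
    """Return a list of (reason, penalty) pairs; the weights are illustrative."""
    penalties = []
    if phone and not re.search(r"\d", phone):
        penalties.append(("phone_without_digits", -3))
    if re.search(r"\d{3,}", name):
        penalties.append(("name_with_digit_run", -2))
    domain = email.rsplit("@", 1)[-1].lower()
    if domain in DISPOSABLE_DOMAINS:
        penalties.append(("disposable_email", -5))
    try:
        dns.resolver.resolve(domain, "MX")
    except Exception:                          # NXDOMAIN, no answer, timeout, ...
        penalties.append(("no_mx_record", -4))
    if website and re.match(r"https?://(www\.)?(google|yahoo|facebook)\.com/?$", website):
        penalties.append(("generic_homepage_as_website", -2))
    return penalties
```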
In posts and in their profiles:
- I run a dictionary check on posted content to catch variations of known spam words. You have to be careful, because an unanchored regex like !cialis! will also match "specialist" (see the sketch after this list).
- If a post is very long, I'll compare its length against the amount of time the person has been on the site (tracked with a cookie whether or not they're logged in) and the average typing speed for the industry. (Tech people type faster.)
- I don't trust posters who are new to the site
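Here's the kind of word-boundary and obfuscation handling I mean for the dictionary check (the word list and the leet-speak map are only examples):

```python
import re

SPAM_WORDS = ["cialis", "viagra"]             # the real list is much longer

def spam_word_pattern(word):
    """Anchor on word boundaries and allow common obfuscations (c1al1s, c.i.a.l.i.s),
    so "specialist" no longer matches but thin disguises still do."""
    leet = {"a": "[a@4]", "e": "[e3]", "i": "[i1!|]", "o": "[o0]", "s": "[s5$]"}
    fuzzy = r"[\W_]*".join(leet.get(c, re.escape(c)) for c in word)
    return re.compile(r"\b" + fuzzy + r"\b", re.IGNORECASE)

PATTERNS = [(w, spam_word_pattern(w)) for w in SPAM_WORDS]

def spam_word_hits(text):
    return [w for w, p in PATTERNS if p.search(text)]

print(spam_word_hits("I saw a specialist"))    # [] -- no false positive
print(spam_word_hits("cheap C1alis here"))     # ['cialis']
```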
Overall
- I maintain a whitelist and a blacklist. The whitelist still gets checked, in case a trusted account has been exploited.
- I check referrers on the form handlers. If the request didn't come through the site (carrying the preceding page's hash), the comment is flushed.
- I check whether their SESSION_ID is in the sessions table. If it's made up, they're flushed.
- I check for user agents that are empty or that identify themselves as things like cURL or Java.
- I also check the content's ratio of vowels to consonants. If it doesn't fall within a normal range, the post is flagged (see the sketch after this list).
- I check for words of enormous length without spaces. (This depends on the accepted languages)
- And obviously things like semicolons followed by an exploit (e.g. ; UNION).
- After all of that I do things like set up honeypots: forms and scripts in directories that aren't visually linked anywhere, so only someone reading the code or mining for forms would find them. Any address that posts to one of those hidden forms goes into a database (replicated to my other sites) of known bad offenders.
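The vowel-ratio and word-length checks are trivial to sketch (the thresholds here are rough guesses and should be tuned for the languages you accept):

```python
import re

def content_flags(text, max_word_len=40):
    """Cheap content sanity checks; thresholds are rough, tune them per language."""
    flags = []
    letters = [c for c in text.lower() if c.isalpha()]
    if letters:
        vowel_ratio = sum(c in "aeiou" for c in letters) / len(letters)
        # English prose tends to land somewhere around 35-45% vowels;
        # keyboard mash and base64-ish junk usually falls well outside that.
        if not 0.25 <= vowel_ratio <= 0.55:
            flags.append("unusual_vowel_ratio")
    if any(len(w) > max_word_len for w in re.split(r"\s+", text)):
        flags.append("enormous_unbroken_word")
    return flags

print(content_flags("qwrtpsdfghjklzxcvbnm " * 3))   # ['unusual_vowel_ratio']
print(content_flags("x" * 60))                      # trips both checks
```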
Obviously blocking IPs can produce false positives. For instance, about 10 years ago, while at one of the firms I worked with, I penetrated a competitor's site with an unclosed <marquee> tag that made the content on their page walk off the screen. They blocked all traffic from the firm's dynamic IP; luckily we were using ADSL for that line and were smart enough not to use our fixed ISDN line. We had been using a secondary marketing approach to review their content when I noticed the exploit on a hunch. (I told their programmer how to fix it.)
If you're trying to deal with a botnet, that's a whole other animal, but most of the practices I've outlined above will still apply.
- Check for proxies if the malicious submissions are only coming in through proxies.
- Check for random IPs from around the world hitting the same account within an unreasonably short window (as if someone has shared a login with a botnet); see the sketch after this list.
- Try to use unique identifiers in cookies and URLs to track people's movements throughout the site.
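For the shared-login case, the check is basically "too many countries for one account in too short a window". A sketch, assuming the login log already carries a country per IP (the window and country threshold are arbitrary):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def flag_shared_logins(events, window=timedelta(hours=1), max_countries=2):
    """events: iterable of (username, country, timestamp) from the login log.
    Flags accounts seen from too many countries inside a short window, which
    usually means a shared login or credentials fed to a botnet."""
    by_user = defaultdict(list)
    for user, country, ts in events:
        by_user[user].append((ts, country))
    flagged = set()
    for user, seen in by_user.items():
        seen.sort()
        for i, (start, _) in enumerate(seen):
            countries = {c for t, c in seen[i:] if t - start <= window}
            if len(countries) > max_countries:
                flagged.add(user)
                break
    return flagged

events = [
    ("alice", "US", datetime(2024, 1, 1, 12, 0)),
    ("alice", "RU", datetime(2024, 1, 1, 12, 10)),
    ("alice", "BR", datetime(2024, 1, 1, 12, 20)),
    ("bob",   "US", datetime(2024, 1, 1, 12, 0)),
]
print(flag_shared_logins(events))   # {'alice'}
```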
Normal posters will have a definite, noticeable pattern to their habits. Non-normal posters do as well. You just have to be diligent and spend a little time tracking their habits in databases.
For me, I always notice the patterns when I'm looking at a MySQL log of sessions, IPs, and traffic. I can usually catch the botnet setup session, then expect the worst. Most of the time it's script kiddies.
If you're worried about brute force, run something like iptables or apf. It will also stop some of the faster bots that try to post multiple pages at the same time, or the ones that post over and over to the same form without visiting any other pages.
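iptables/apf handle that at the network level; if you also want a crude guard inside the application itself, a sliding-window counter per IP does the same kind of thing (the limits below are arbitrary, and a real setup would keep the counters in the database or a shared cache rather than in memory):

```python
import time
from collections import defaultdict, deque

class PostRateLimiter:
    """Crude sliding-window limiter: reject an IP that posts more than
    `max_posts` times in `window` seconds."""
    def __init__(self, max_posts=5, window=60):
        self.max_posts = max_posts
        self.window = window
        self.hits = defaultdict(deque)

    def allow(self, ip):
        now = time.time()
        q = self.hits[ip]
        while q and now - q[0] > self.window:   # drop hits outside the window
            q.popleft()
        if len(q) >= self.max_posts:
            return False                         # too many posts too quickly
        q.append(now)
        return True
```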