This approach is good enough to detect automation and block most of the script kiddies looking for an easy web scrape. A better alternative, though, is fingerprintjs and botd: both tools can detect Playwright and Selenium even with stealth plugins, and notably Cloudflare implements something similar.
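As a rough illustration, this is how a BotD check is typically wired in on the client (load the agent, run the detection, act on the result); the /report endpoint and the reporting logic are assumptions for the sketch, not part of BotD's API:

```ts
import { load } from "@fingerprintjs/botd";

async function checkForBot(): Promise<void> {
  // Initialise the agent once per page, then run the detection.
  const botd = await load();
  const result = await botd.detect();

  if (result.bot) {
    // Headless Chrome, Selenium and Playwright typically land here even with
    // common stealth plugins; report to the backend so it can block or challenge.
    await fetch("/report", { method: "POST", body: JSON.stringify(result) });
  }
}

checkForBot();
```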
How hard would it be for an attacker to spoof all of the data fields that the JavaScript client sniffs?
With Selenium or Playwright stealth plugins, even replacing the prototypes and properties to spoof a legitimate browser is largely useless, because fingerprintjs and botd use some prototype magic to detect tampered properties. However, DrissionPage is enough to bypass this detection, since it doesn't use the WebDriver API. So spoofing everything correctly would be very hard, effectively requiring a custom browser built for the purpose; that stops pre-made automation tools like Scrapy, but not custom-built code.
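To give a flavour of the "prototype magic": a property spoofed from JavaScript can often be spotted because its getter no longer serializes as native code. The check below is a hypothetical example in that spirit, not fingerprintjs' or botd's actual code:

```ts
// If a stealth plugin redefines navigator.webdriver with its own getter,
// that getter is plain JavaScript and won't stringify as "[native code]".
function webdriverLooksTampered(): boolean {
  const descriptor = Object.getOwnPropertyDescriptor(Navigator.prototype, "webdriver");

  // In a stock browser this property is a native accessor on Navigator.prototype.
  if (!descriptor || typeof descriptor.get !== "function") return true;

  const source = Function.prototype.toString.call(descriptor.get);
  return !source.includes("[native code]");
}

// navigator.webdriver === true catches plain Selenium/Playwright; the
// prototype check above is what catches the "patched" stealth variants.
console.log({ webdriver: navigator.webdriver, tampered: webdriverLooksTampered() });
```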
Still, this all depends on your threat model: if you are up against more advanced developers who are good at scraping, they will probably use DrissionPage or some other WebDriver-less framework to drive a real browser, and fingerprintjs or botd won't be able to catch them, because those frameworks are designed not to expose any API that differs from a regular browser.
http://drissionpage.cn/
https://github.com/g1879/DrissionPage
An alternative is a captcha, but that can also be bypassed easily by services like 2captcha or by AI solvers, so it still doesn't cover the advanced-developer threat model.
The preferable approach is a challenge that makes the browser mine some kind of hash and waste a few CPU cycles. This can be implemented with WebAssembly (and JS as a fallback) and makes scraping considerably harder, because the result is not "cached" in a cookie the way a captcha is (solving one captcha might grant multiple requests). A CPU-bound anti-bot check also annoys users far less than a captcha, especially over VPNs and Tor, while still providing some degree of bot prevention. It should also be noted that a lot of developers forget to "consume" the captcha through Google's siteverify API, leaving one solved captcha valid for billions of requests.
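A minimal sketch of the client side of such a challenge, assuming a SHA-256 proof of work done in plain JS via Web Crypto (a WebAssembly version would just swap out the hashing loop); the /challenge and /verify endpoints, field names and difficulty encoding are made up for the example:

```ts
async function sha256Hex(input: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(input));
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

// Brute-force a nonce so that SHA-256(challenge:nonce) starts with
// `difficulty` zero hex digits; this loop is where the CPU cycles get burned.
async function solveChallenge(challenge: string, difficulty: number): Promise<number> {
  const prefix = "0".repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    if ((await sha256Hex(`${challenge}:${nonce}`)).startsWith(prefix)) return nonce;
  }
}

// Usage: fetch a challenge, solve it, then attach the solution to the real request.
const { challenge, difficulty } = await fetch("/challenge").then((r) => r.json());
const nonce = await solveChallenge(challenge, difficulty);
await fetch("/verify", { method: "POST", body: JSON.stringify({ challenge, nonce }) });
```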
Even though it is possible to hinder scrapers, both captchas and CPU-bound challenges only delay or annoy scraping, and they are only as good as their implementation. The CPU challenge needs to be dynamically generated by the backend, correctly verified, and seeded from a good RNG (a cryptographically secure one, not a predictable pseudo-random generator). The same is true of captchas: they need to be implemented correctly, with the correct sitekey, and the response token must be sent to Google's servers for consumption so the captcha is invalidated after use.
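The server side could look roughly like this (Node.js sketch): the challenge comes from a CSPRNG, is verified against the same hash rule, and is deleted on first use so one solution can't be replayed. The in-memory Map and the one-minute expiry are placeholders for whatever store and policy you actually use:

```ts
import { createHash, randomBytes } from "node:crypto";

const DIFFICULTY = 5; // leading zero hex digits, must match the client
const issued = new Map<string, number>(); // challenge -> expiry timestamp

export function issueChallenge(): { challenge: string; difficulty: number } {
  const challenge = randomBytes(32).toString("hex"); // CSPRNG, not Math.random()
  issued.set(challenge, Date.now() + 60_000);
  return { challenge, difficulty: DIFFICULTY };
}

export function verifySolution(challenge: string, nonce: number): boolean {
  const expiry = issued.get(challenge);
  if (!expiry || expiry < Date.now()) return false; // unknown, expired or replayed
  issued.delete(challenge); // consume: one solution grants exactly one pass

  const hash = createHash("sha256").update(`${challenge}:${nonce}`).digest("hex");
  return hash.startsWith("0".repeat(DIFFICULTY));
}
```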
It's also possible to block any IP coming from a datacenter, a VPN and/or Tor, but I don't see that as an ideal solution: many people use VPNs, corporate networks may egress through a datacenter or some kind of proxy, and an attacker could still scrape through a residential proxy with IP rotation. The better approach is to rate-limit by IP, so a single address cannot flood requests. This can also be set up per endpoint, but that again depends on the threat model, as implementing it that way opens the door to a DDoS against that endpoint.
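A per-IP rate limit can be as simple as a fixed-window counter; the sketch below is framework-agnostic and keeps counters in process memory, whereas a real deployment would use Redis or the reverse proxy's built-in limiter, and the window/limit numbers are arbitrary:

```ts
const WINDOW_MS = 60_000;  // 1-minute window
const MAX_REQUESTS = 60;   // per IP, per window

const counters = new Map<string, { count: number; resetAt: number }>();

export function allowRequest(ip: string, now = Date.now()): boolean {
  const entry = counters.get(ip);

  // New IP or expired window: start a fresh count.
  if (!entry || entry.resetAt <= now) {
    counters.set(ip, { count: 1, resetAt: now + WINDOW_MS });
    return true;
  }

  // Within the window: allow up to the cap, then reject (answer with HTTP 429).
  entry.count += 1;
  return entry.count <= MAX_REQUESTS;
}
```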
So, in short, there is no bulletproof solution that blocks all scrapers, but there are methods that hinder scraping enough that less skilled scrapers will give up early.