Is Scraping Data Wrong?

In 1492, the Old and New Worlds clashed in a very dramatic way. Volumes have been written about what followed, so I won’t cover it here. A similar, if less momentous, clash is happening today. You’ve heard about California’s “water wars,” but you may not have heard of the scraper wars.

For better or worse (and I’d say it’s for the better), the World Wide Web is an open technology. This is partly true in the sense that Apache’s code can be downloaded and modified, but the more important sense is that a client makes a request to a server, and the server sends back a response. Login prompts, CAPTCHA forms, and paywalls abound, but they are far from the default. It’s difficult (or at least cumbersome) to program a web server to refuse to serve a particular client. Prejudice is not in the nature of the web.
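To make the request-and-response point concrete, here is a minimal Python sketch using only the standard library. The handler, the port choice, and both User-Agent strings are my own illustrative assumptions, not anything a real site uses. The server never checks who is asking, which is roughly the default behavior described above.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class OpenHandler(BaseHTTPRequestHandler):
    """Serves the same page to every client, whoever it claims to be."""

    def do_GET(self):
        body = b"<html><body><p>Hello, world.</p></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch(url, user_agent):
    """Request a URL while announcing an arbitrary User-Agent string."""
    # Ignore any proxy configured in the environment; the server is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with opener.open(request) as response:
        return response.status, response.read()


def demo():
    """Start a throwaway local server and fetch it as two different 'clients'."""
    server = HTTPServer(("127.0.0.1", 0), OpenHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        as_browser = fetch(url, "Mozilla/5.0 (hypothetical browser)")
        as_script = fetch(url, "my-scraper/0.1 (hypothetical)")
        return as_browser, as_script
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns the status and body each client received. The two should match, because nothing in the handler looks at the User-Agent header; turning the script away would take extra code that someone has to write and maintain.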

People with something to say get on their soapbox and say it, and what they say becomes available to the world by default. The assumption is that it will be consumed by a target market (some restricted demographic of humans using web browsers), but any computer can make HTTP calls and do as it will with the replies it receives. And therein lies the problem.

You may have found this site through Google’s web search; if not, you’ve surely found another that way. Google uses a “robot”, a wanderer, a spider that crawls the web, copying down what it finds and indexing it for search. One could object that this violates copyright law, but both the searcher and the found benefit from the arrangement, so almost nobody (apart from book authors) complains about Google. When others do the same thing, though, people assume nefarious trickery. At best they assume spam; more typically they seem to mean plagiarism, where a person steals someone else’s informational work to profit from cheap advertisements.
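For readers who have never seen one, a spider of this kind can be sketched in a few dozen lines of Python. This is only a toy under stated assumptions: the three pages live in a hypothetical in-memory dictionary (`PAGES`) so it runs offline, the names `PageParser` and `crawl` are mine, and a real crawler would also honor robots.txt, rate limits, and much else. It is certainly not how Google does it.

```python
import re
from collections import defaultdict, deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# A hypothetical three-page "web", held in memory so the sketch runs offline.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a> scraping news',
    "http://example.test/a": '<a href="/b">B</a> open data',
    "http://example.test/b": '<a href="/">home</a> open web',
}


class PageParser(HTMLParser):
    """Collects the outgoing links and the visible text of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def crawl(start, fetch=PAGES.get):
    """Breadth-first crawl from `start`; returns a word -> set-of-URLs index."""
    index = defaultdict(set)
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page missing or unreachable
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()  # flush any buffered trailing text
        # "Copy down what it finds": record which words appear on which page.
        for word in re.findall(r"[a-z]+", " ".join(parser.text).lower()):
            index[word].add(url)
        # "Wander": follow each link that has not been queued before.
        for href in parser.links:
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index
```

Something like `crawl("http://example.test/")["open"]` then gives the set of pages containing the word “open”. Swapping the `fetch` argument for a function that makes real HTTP requests is all it would take to point the same loop at the live web, which is exactly why the practice is so hard to fence in.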

Mashups, though, are the definition of “web 2.0”. Ouseful discovered how to use Google Spreadsheets to translate “foreign” HTML to RSS, and then how to use Yahoo for geocoding. Arachnode is an open source (for SQL Server and C#!) scraping platform you can run at home. Whether people like it or not, the ability to make use of this great federated data store we call the web is being brought down to the lowly masses. Democracy over data is the future, and it’s in everybody’s best interest to learn to deal with it. (Like the Census.)
