AI is killing the grand bargain at the heart of the web. 'We're in a different world.'

AI is undermining the web's grand bargain, and a decades-old handshake agreement is the only thing standing in the way.

A single plain-text file, robots.txt, was proposed in the mid-1990s as a way for websites to tell bot crawlers they don't want their data scraped and collected. It was widely accepted as one of the unofficial rules supporting the web.

At the time, the main purpose of these crawlers was to index information so results in search engines would improve. Google, Microsoft's Bing and other search engines have crawlers. They index content so it can be later served up as links to billions of potential consumers. This is the essential deal that created the flourishing web we know today: Creators share abundant information and exchange ideas online freely because they know consumers will visit and either see an ad, subscribe, or buy something.

Now, though, generative AI and large language models are changing the mission of web crawlers radically and rapidly. Instead of working to support content creators, these tools have been turned against them.

The bots feeding Big Tech

Web crawlers now collect online information to feed into giant datasets that are used for free by wealthy tech companies to develop AI models. CCBot feeds Common Crawl, one of the biggest AI datasets. GPTBot feeds data to OpenAI, the company behind ChatGPT and GPT-4, currently the most powerful AI model. Google just calls its LLM training data "Infiniset," without mentioning where the vast majority of the data comes from, although 12.5% comes from C4, a cleaned-up version of Common Crawl.

The models use all this free information to learn how to answer user questions immediately. That's a long way from indexing a website so users can be sent through to the original work.

Without a supply of potential consumers, there's little incentive for content creators to let web crawlers continue to suck up free data online. GPTBot is already being blocked by Amazon, Airbnb, Quora, and hundreds of other websites. Common Crawl's CCBot is beginning to be blocked more, too.

'A crude tool'

What hasn't changed is how to block these crawlers. Implementing robots.txt on a website, and excluding specific crawlers, is the only option. And it's not very good.

"It's a bit of a crude tool," said Joost de Valk, a former WordPress executive, tech investor and founder of digital marketing firm Yoast. "It has no basis in law, and is basically maintained by Google, although they say they do that together with other search engines."

It's also open to manipulation, especially given the voracious appetite for quality AI data. The only thing a company like OpenAI has to change is the name of its bot crawler to bypass all the disallow rules people put in place using robots.txt, de Valk explained.
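
To make de Valk's point concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt contents, the example.com URL, the can_crawl helper and the "RenamedBot" name are all assumptions made up for illustration; how any real crawler reads the file is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that names the crawlers it wants to keep out.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    # "RenamedBot" is a made-up name standing in for a crawler that changed its label.
    for agent in ("GPTBot", "CCBot", "RenamedBot"):
        print(f"{agent}: allowed={can_crawl(agent, url)}")
```

In this sketch the two named crawlers are refused, while the renamed one is allowed because no rule mentions it. The rules match on the name a crawler chooses to announce, nothing more.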

Because robots.txt is voluntary, web crawlers can also simply ignore the blocking instructions and siphon the information from a site anyway. Some crawlers, like that of Brave, a newer search engine, don't bother disclosing the name of their crawler, making it impossible to block.
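
The voluntary part can be sketched the same way. In the toy example below, the Crawler class, its respect_robots switch, the bot names and the in-memory PAGES dictionary are all hypothetical stand-ins, not any real crawler's code. The point is only that the check runs inside the crawler, so the site has no say in whether it happens.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Hypothetical site: robots.txt asks every crawler to stay out.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"
PAGES = {"/article": "Full text of the article."}  # in-memory stand-in for a web server


class Crawler:
    """Toy crawler; respect_robots is an illustrative switch, not a real API."""

    def __init__(self, user_agent: str, respect_robots: bool) -> None:
        self.user_agent = user_agent
        self.respect_robots = respect_robots
        self.rules = RobotFileParser()
        self.rules.parse(ROBOTS_TXT.splitlines())

    def fetch(self, path: str) -> Optional[str]:
        # The only enforcement point is this check, and the crawler owns it.
        if self.respect_robots and not self.rules.can_fetch(self.user_agent, path):
            return None
        return PAGES.get(path)


polite = Crawler("PoliteBot", respect_robots=True)
rude = Crawler("RudeBot", respect_robots=False)
print(polite.fetch("/article"))  # None
print(rude.fetch("/article"))    # Full text of the article.
```

Nothing on the site's side changes between the two calls. The only difference is whether the crawler decides to look at the file.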

"Everything online is being sucked up into a vacuum for the models," said Nick Vincent, a computer science professor who studies the relationship between human-generated data and AI. "There's so much going on under the hood. In the next six months, we will look back and want to evaluate these models differently."

AI bot backlash

De Valk warns that owners and creators of online content may already be too late in understanding the risks of allowing these bots to scoop up their data for free and use it indiscriminately to develop AI models.

"Right now, doing nothing means, 'I'm ok with my content being in every AI and LLM in the world,'" de Valk said. "That's just plain wrong. A better version of robots.txt could be created, but it'd be very weird if that was done by the search engines and the large AI parties themselves."

Several major companies and websites have responded recently, with some starting to deploy robots.txt for the first time.

As of August 22, 70 of the 1,000 most-popular websites have used robots.txt to block GPTBot since OpenAI revealed the crawler about three weeks ago, according to Originality.ai, a company that checks content to see if it's AI-generated or plagiarized.

The company also found that 62 of the 1,000 most popular websites are blocking Common Crawl's CCBot, with an increasing number doing so only this year as awareness of data crawling for AI has grown.

Still, it is not enforceable. Any crawler could ignore a robots.txt file and collect every last bit of data it found on a webpage, with the owner of the page more than likely having no idea it even happened. Even if robots.txt had any basis in law, its original purpose has little to do with information on the internet being used to create AI models.

"Robots.txt is unlikely to be seen as a legal prohibition on use of data," according to Jason Schultz, director of NYU's Technology Law & Policy Clinic. "It was primarily meant to signal that one did not want one's website to be indexed by search engines, not as a signal that one did not want one's content used for machine learning and AI training."

'This is a minefield'

This activity has been going on for years. OpenAI revealed its first GPT model in 2018, having trained it on BookCorpus, a dataset of thousands of indie or self-published books. Common Crawl started in 2008 and its dataset became publicly available in 2011 through cloud storage provided by AWS.

Although GPTBot is now more widely blocked, Common Crawl is a larger threat to any business that is concerned about its data being used to train another company's AI model. What Google did for internet search, Common Crawl is doing for AI.

"This is a minefield," said Catherine Stihler, CEO of Creative Commons. "We updated our strategy only a few years ago, and now we're in a different world."

Creative Commons started in 2001 as a way for creators and owners to license works for use on the internet through an alternative to a strict copyright framework, known as "copyleft." Creators and owners maintain their rights, while a Commons license lets people access the content and create derivative works. Wikipedia operates through a Creative Commons license, as do Flickr, Stack Overflow and ProPublica, along with many other well-known websites.

Under its new five-year strategy, which notes the "problematic use of open content" to train AI technologies, Creative Commons is looking to make the sharing of work online more "equitable," through a "multifrontal, coordinated, broad-based approach that transcends copyright."

The 160 billion-page gorilla

Common Crawl, via CCBot, holds what is perhaps the largest repository of data ever collected from the internet. Since 2011, it has crawled and saved information from 160 billion web pages and counting. Typically it crawls and saves around 3 billion web pages each month.

Its mission statement says the undertaking is an "open data" project aimed at allowing anyone to "indulge their curiosities, analyze the world, and pursue brilliant ideas."

The reality has become very different today. The massive amount of data it holds and continues to collect is being used by some of the world's largest corporations to create mostly proprietary models. If a big tech company isn't already making money off of its AI output (OpenAI has many paid services), there's a plan to do so in the future.

Some big tech companies have stopped disclosing where they get this data. However, Common Crawl has been and continues to be used to develop many powerful AI models. It helped Google create Bard. It helped Meta train Llama. It helped OpenAI build ChatGPT.

Common Crawl also feeds The Pile, which hosts more curated datasets pulled from the work of other bot crawlers. It has been used extensively on AI projects, including Llama and an LLM from Microsoft and Nvidia, called MT-NLG.

Not comical

One of The Pile's most recent downloads from June is a massive collection of comic books, including the entire works of Archie, Batman, X-Men, Star Wars and Superman. Created by DC Comics, now owned by Warner Brothers, and Marvel, now owned by Disney, all of the works remain under copyright. The Pile also hosts a large set of copyrighted books, as The Atlantic recently reported.

"There's a difference between the intent of crawlers and how they are used," said NYU's Schultz. "It is very hard to police or insist that data be used in a particular way."

As far as The Pile is concerned, while it admits its data is full of copyrighted material, it claimed in its founding technical paper that "there is little acknowledgment of the fact that the processing and distribution of data owned by others may also be a violation of copyright law."

Beyond that, the group, part of EleutherAI, argued its use of the material is considered "transformative" under the fair use doctrine, despite the data sets holding relatively unaltered work. It also admitted that it needs to use full-length copyrighted content "in order to produce the best results" when training LLMs.

Such arguments of fair use by crawlers and AI projects are already being put to the test. Authors, visual artists and even source code developers are suing the likes of OpenAI, Microsoft and Meta because their original work has been used without their consent to train something they get no benefit from.

"There's no universe where putting something on the internet grants free, unlimited, commercial use of someone's labor w/o consent," Steven Sinofsky, a former Microsoft executive who's a partner at VC firm Andreessen Horowitz, wrote recently on X.

No resolution in sight

For the moment, there's no clear resolution in sight.

"We are grappling with all of this now," said Stihler, the CEO of Creative Commons. "There are so many issues that keep cropping up: compensation, consent, credit. What does all of that look like with AI? I don't have an answer."

De Valk said Creative Commons, with its method of facilitating broader copyright licenses that allow owned works to be used on the internet, has been suggested as a possible model for consent when it comes to AI model development.

Stihler is not so sure. When it comes to AI, perhaps there is no single solution. Licensing and copyright, even a more flexible Commons-style agreement, likely won't work. How do you license the whole of the internet?

"Every lawyer that I speak to says a license is not going to solve the problem," Stihler said.

She is talking about this regularly to stakeholders, from authors to executives of AI companies. Stihler met with representatives of OpenAI earlier this year and said the company is discussing how to "reward creators."

Still, it's unclear "what the commons really looks like in the age of AI," she added.

'If we're not careful, we'll end up closing the commons'

Considering just how much data web crawlers have already scraped and handed over to big tech companies, and how little power is in the hands of the creators of that content, the internet as we know it could change dramatically.

If posting information online means giving data for free to an AI model that will compete with you for users, then this activity may simply stop.

There are already signs of this: Fewer human software coders are visiting Q&A website Stack Overflow to answer questions. Why? Because their previous work was used to train AI models that now answer many of these questions automatically.

Stihler said the future of all created work online could soon look like the current state of streaming, with content locked behind "Plus" subscription fiefdoms that get ever more costly.

"If we're not careful, we'll end up closing the commons," Stihler said. "There'll be more walled gardens, more things people can't access. That is not a successful model for humanity's future of knowledge and creativity."

Alistair Barr
Alistair Barr is the author of Business Insider's Tech Memo newsletter. Before that, he was BI's Global Tech Editor and the Big Tech team leader at Bloomberg, following a reporting career at The Wall Street Journal, USA Today, Reuters, and MarketWatch. Alistair won a Gerald Loeb Award in 2007 for coverage of short selling and was a finalist in 2013 for scoops on the Facebook IPO. More recently, he won a 2024 San Francisco Press Club award for commentary. Alistair oversees all things Big Tech, along with startups and venture capital. He writes analysis and columns about topics including generative AI, large language models, cloud computing, semiconductors, online search, e-commerce, EVs, robotics, and autonomous vehicles.
Kali Hays was a Tech Correspondent at Business Insider covering the major social media platforms like Meta, Twitter, and Snap. Her reporting covered major changes and the internal culture at these companies, the founders and executives who run them, and business developments and products. Hays also wrote frequently about AI and emerging trends and shifts in the tech industry overall. Her work has been widely cited, including by the FTC in an investigation into Elon Musk's takeover of Twitter, and she has appeared as an expert on NBC, CBS, the BBC and elsewhere.