News websites blocking Wayback Machine to prevent AI scraping threatens web archiving

Darren Allan

8d ago · 5 min read · en · News

Summary

The Wayback Machine, run by the non-profit Internet Archive, faces an existential threat as major news websites increasingly block its web crawlers to prevent AI companies from scraping their content for training large language models. This trend, driven by the AI boom and concerns over unauthorized use of content, undermines the Wayback Machine's ability to preserve web history for research and accountability. The article highlights the tension between protecting intellectual property from AI scraping and preserving the public's access to historical web content.

Key quotes

The Wayback Machine is under serious threat (and not for the first time), as a growing number of major news websites appear to be blocking the archiving system.

This can be vital when it comes to historical research, for example, or monitoring changes to websites.

There's a growing trend of online news outlets blocking the Wayback Machine to prevent content scraping.

Snippet from the RSS feed

This isn't the first time the Wayback Machine has faced what could be deemed an existential threat.
