Major News Sites Block Internet Archive's Wayback Machine to Deter AI Scrapers, Threatening Digital History Preservation

Anisha Sircar

8d ago · 6 min read · en · News


Summary

Major news outlets including CNN, The New York Times, and Reuters are blocking the Internet Archive's Wayback Machine from archiving their content, according to an analysis by Originality AI. The blocks are aimed primarily at keeping AI companies from scraping that content to train large language models, but they also threaten to leave gaps in the digital historical record. The Wayback Machine, which has archived over one trillion web pages since its founding nearly 30 years ago, is being blocked via robots.txt files, a convention originally designed to prevent server overload, not to restrict access to digital history. The article explores the tension between publishers' legitimate concerns about AI scraping and the unintended consequences of blocking a vital preservation tool that serves journalists, researchers, courts, and the public.
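For readers unfamiliar with how a robots.txt block works, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The sample file, the `example.com` URL, and the helper name `is_blocked` are hypothetical and exist only for illustration. The crawler token `archive.org_bot` is an assumption about how a site might address the Internet Archive's crawler; the summary does not say which tokens the publishers actually list.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shuts out one named crawler, allows everyone else.
# "archive.org_bot" is an assumed token, not taken from any publisher's file.
SAMPLE_ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text disallows user_agent from url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://example.com/news/story"  # hypothetical URL
    for agent in ("archive.org_bot", "SomeOtherCrawler"):
        print(agent, "blocked:", is_blocked(SAMPLE_ROBOTS_TXT, agent, story))
```

robots.txt is advisory: it only works when a crawler chooses to honor it. A well-behaved archiver can therefore be shut out by a rule that a non-compliant scraper would simply ignore.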

Key quotes

The Internet Archive's Wayback Machine has served as a go-to for anyone looking to access its vast treasure trove of archived internet pages.

Its mission of crawling and preserving the public web has made it an indispensable resource for journalists, historians, researchers, courts and beyond.

23 major news sites currently block the Internet Archive's Wayback Machine from crawling and archiving their content.

Snippet from the RSS feed

Major news outlets are blocking the Wayback Machine to fight AI scrapers — and taking three decades of digital history with them.
