Major News Sites Block Internet Archive's Wayback Machine to Deter AI Scrapers, Threatening Digital History Preservation
By
Anisha Sircar
Summary
Major news outlets including CNN, The New York Times, and Reuters are blocking the Internet Archive's Wayback Machine from archiving their content, according to an analysis by Originality AI. The blocks are aimed primarily at preventing AI companies from scraping articles to train large language models, but they also threaten to create gaps in the digital historical record. The Wayback Machine, which has archived over one trillion web pages since its founding nearly 30 years ago, is being blocked via robots.txt files, a protocol originally designed to prevent server overload rather than to restrict access to digital history. The article explores the tension between publishers' legitimate concerns about AI scraping and the unintended consequences of blocking a vital preservation tool that serves journalists, researchers, courts, and the public.
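To make the robots.txt mechanism concrete, here is a minimal sketch of how a per-crawler block can be expressed and checked, using Python's standard urllib.robotparser module. The robots.txt contents, the example URL, and the helper name is_blocked are hypothetical and not taken from any publisher named above. The user-agent tokens ia_archiver and archive.org_bot are assumed here to be the ones associated with the Internet Archive's crawlers; the article itself does not name them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example publisher (not copied from a real site).
# The two crawler tokens are assumptions about how the Internet Archive identifies itself.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical article URL on a placeholder domain.
    url = "https://news.example.com/2024/01/01/story.html"
    for agent in ("ia_archiver", "archive.org_bot", "SomeOtherBot"):
        status = "blocked" in () or ("blocked" if is_blocked(SAMPLE_ROBOTS_TXT, agent, url) else "allowed")
        print(f"{agent}: {status}")
```

Note that robots.txt is advisory: it only affects crawlers that choose to honor it, which is why a block like this can stop a well-behaved archiver while doing little against a scraper that ignores the file.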
Key quotes
The Internet Archive's Wayback Machine has served as a go-to for anyone looking to access its vast treasure trove of archived internet pages.
Its mission of crawling and preserving the public web has made it an indispensable resource for journalists, historians, researchers, courts and beyond.
23 major news sites currently block the Internet Archive's Wayback Machine from crawling and archiving their content.