News Publishers Restrict Internet Archive Access Over AI Data Scraping Concerns
By ninjagoo
Summary
News publishers including The Guardian and The New York Times are restricting access to their content in the Internet Archive's Wayback Machine over concerns that AI companies are using the digital archive as a backdoor to scrape copyrighted material for training large language models. The Internet Archive's mission of preserving web content and providing free access conflicts with publishers' efforts to protect their copyrighted work from AI data scraping.
Key quotes
As AI bots scavenge the web for training data to feed their models, the Internet Archive's commitment to free information access has turned its digital library into a potential liability for some news publishers.
Outlets like The Guardian and The New York Times are scrutinizing digital archives as potential backdoors for AI crawlers.
Many of these snapshots are accessible through its public-facing tool, the Wayback Machine.
The Internet Archive operates crawlers that capture webpage snapshots as part of its mission to preserve the web.