News Publishers Restrict Internet Archive Access Over AI Data Scraping Concerns

ninjagoo

3mo ago· 10 min readenNews

100/100

Golden Brown

Bagelometer↗

Hot, fresh, and worth queueing round the block for.

Score100TypenewsSentimentneutral

Summary

News publishers including The Guardian and The New York Times are restricting access to their content in the Internet Archive's Wayback Machine due to concerns that AI companies are using the digital archive as a backdoor to scrape copyrighted material for training large language models. The Internet Archive's mission of preserving web content and providing free access conflicts with publishers' copyright protection efforts against AI data scraping.

Key quotes

· 4 pulled

As AI bots scavenge the web for training data to feed their models, the Internet Archive's commitment to free information access has turned its digital library into a potential liability for some news publishers.

Outlets like The Guardian and The New York Times are scrutinizing digital archives as potential backdoors for AI crawlers.

Many of these snapshots are accessible through its public-facing tool, the Wayback Machine.

The Internet Archive operates crawlers that capture webpage snapshots as part of its mission to preserve the web.

Snippet from the RSS feed

Outlets like The Guardian and The New York Times are scrutinizing digital archives as potential backdoors for AI crawlers.

You might also wanna read

Open Markets Institute report warns news publishers face 'double bind' in AI content licensing market dominated by Big Tech

A new report from the Open Markets Institute examines the emerging AI content licensing market for news publishers. It argues that news publ

niemanlab.org·4d ago

How AI Search Platforms Are Undermining the Web's Information Ecosystem

The article examines how AI-powered search platforms like Google's AI Overviews are extracting and synthesizing content from creator website

noemamag.com·20h ago

New York Proposes Two Bills to Regulate AI in News Media and Data Centers

New York is considering two bills to regulate AI in news media and data centers. The NY FAIR News Act would require disclaimers on AI-genera

The Verge·3mo ago

Major Publishers Launch Really Simple Licensing Standard for AI Content Scraping

Major web publishers including Reddit, Yahoo, Medium, Quora, and People Inc. have announced support for Really Simple Licensing (RSL), a new

The Verge·8mo ago

Elsevier joins class action lawsuit against Meta over alleged use of copyrighted content for AI training

Scientific publishing giant Elsevier has joined a class action lawsuit against Meta Platforms, alleging that Meta used Elsevier's copyrighte

cenm.ag·1d ago

Elsevier joins class action lawsuit against Meta over alleged use of copyrighted content for AI training

Scientific publishing giant Elsevier has joined a class action lawsuit against Meta Platforms, alleging that Meta used Elsevier's copyrighte

cenm.ag·1d ago

Google's AI search changes threaten journalism industry, critic warns

Drew Magary argues that Google's shift toward AI-generated search results will devastate the journalism industry by removing the incentive f

sfgate.com·1d ago