US Publishers Demand Common Crawl Stop Scraping Content for AI Training

Matt G. Southern

24d ago · 4 min read · en · News

technology business ai training publishing & copyright

Summary

Digital Content Next (DCN), a trade body representing US digital publishers, has sent a cease and desist letter to the Common Crawl Foundation, demanding it stop scraping publisher content and remove protected material from its datasets. Common Crawl has been crawling billions of pages monthly since 2007 to build a free public archive, which has been widely used to train AI models, including OpenAI's GPT-3. DCN CEO Jason Kint announced the legal notice, escalating tensions between publishers and AI companies over the use of copyrighted content for AI training without permission or compensation.

Source

US Publishers Demand Common Crawl Stop Scraping Content for AI Training (bsky, via buff.ly)

Key quotes

Digital Content Next, a trade body representing US digital publishers, has sent a cease and desist letter to the Common Crawl Foundation.

The letter demands Common Crawl stop collecting publisher content and remove material already in its datasets.

Common Crawl has crawled several billion new pages each month since 2007 to build a free public archive.

Snippet from the RSS feed

Digital Content Next sent Common Crawl a cease and desist letter demanding it stop scraping publisher content and remove protected material from its datasets.
