US Publishers Demand Common Crawl Stop Scraping Content for AI Training
By Matt G. Southern
Summary
Digital Content Next (DCN), a trade body representing US digital publishers, has sent a cease and desist letter to the Common Crawl Foundation, demanding it stop scraping publisher content and remove protected material from its datasets. Common Crawl has been crawling billions of pages monthly since 2007 to build a free public archive, which has been widely used to train AI models, including OpenAI's GPT-3. DCN CEO Jason Kint announced the legal notice, escalating tensions between publishers and AI companies over the use of copyrighted content for AI training without permission or compensation.
Source
US Publishers Demand Common Crawl Stop Scraping Content for AI Training (buff.ly)
Key quotes
Digital Content Next, a trade body representing US digital publishers, has sent a cease and desist letter to the Common Crawl Foundation.
The letter demands Common Crawl stop collecting publisher content and remove material already in its datasets.
Common Crawl has crawled several billion new pages each month since 2007 to build a free public archive.