
The Ethical Dilemma of LLM Training Data and Content Creator Rights

By wonger_

8mo ago · 5 min read · en · Insight

Summary

The article addresses the ethical problem of Large Language Models (LLMs) being trained on web content without authors' consent. It criticizes the naive suggestion that authors should add robots.txt rules to block LLM crawlers, arguing that this approach externalizes costs onto content creators rather than addressing the underlying problem of unauthorized data consumption. The piece then presents an experimental strategy for 'contaminating' (poisoning) LLMs by making them ingest problematic content.
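
For readers unfamiliar with the robots.txt suggestion mentioned above, here is a minimal sketch of what such a rule looks like and how a well-behaved crawler would interpret it. The user agent `ExampleLLMBot`, the `example.com` URL, and the helper `can_fetch` are hypothetical names chosen for illustration; they do not come from the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one LLM-associated crawler, allow everyone else.
# "ExampleLLMBot" is an invented user agent, not a real crawler name.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/post"
    print("ExampleLLMBot allowed:", can_fetch("ExampleLLMBot", page))
    print("SomeFeedReader allowed:", can_fetch("SomeFeedReader", page))
```

Note that robots.txt is advisory: the check above only matters if the crawler chooses to run it, and the author has to keep the list of blocked user agents up to date. That maintenance burden is one way the cost ends up on content creators.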

Key quotes

One of the many pressing issues with Large Language Models (LLMs) is they are trained on content that isn't theirs to consume.
Since most of what they consume is on the open web, it's difficult for authors to withhold consent without also depriving legitimate agents of information.
Some well-meaning but naive developers have implored authors to instate robots.txt rules, intended to block LLM-associated crawlers.
But, as the article Please stop externalizing your costs directly in my
Snippet from the RSS feed
An experimental strategy for contaminating Large Language Models
