AI Training Data Set Exposes Millions of Personal Records

By pera · 10mo ago · 8 min read · en · News

Summary

Researchers discovered that a major AI training data set, DataComp CommonPool, contains millions of examples of personal data, including identity documents and job application materials. The findings highlight the ethical concerns around data scraping and privacy in AI development.

Key quotes
"anything you put online can [be] and probably has been scraped."
"The researchers found thousands of instances of validated identity documents—including images of credit cards, driver’s licenses, passports, and birth certificates."
"over 800 validated job application documents (including résumés and cover letters), which were confirmed through LinkedIn and other web searches as being associated with real people."
Snippet from the RSS feed
Personally identifiable information has been found in DataComp CommonPool, one of the largest open-source data sets used to train image generation models.