AI Training Data Set Exposes Millions of Personal Records

pera

10mo ago · 8 min read · en · News

Score: 85 · Type: news · Sentiment: negative

Summary

Researchers discovered that a major AI training data set, DataComp CommonPool, contains millions of examples of personal data, including identity documents and job application materials. The findings highlight the ethical concerns around data scraping and privacy in AI development.

Key quotes

"anything you put online can [be] and probably has been scraped."

"The researchers found thousands of instances of validated identity documents—including images of credit cards, driver’s licenses, passports, and birth certificates."

"over 800 validated job application documents (including résumés and cover letters), which were confirmed through LinkedIn and other web searches as being associated with real people."

Snippet from the RSS feed

Personally identifiable information has been found in DataComp CommonPool, one of the largest open-source data sets used to train image generation models.
