AI Training Data Set Exposes Millions of Personal Records
By pera
Summary
Researchers discovered that a major AI training data set, DataComp CommonPool, contains millions of examples of personal data, including identity documents and job application materials. The findings highlight the ethical concerns around data scraping and privacy in AI development.
Key quotes
"anything you put online can [be] and probably has been scraped."
"The researchers found thousands of instances of validated identity documents—including images of credit cards, driver’s licenses, passports, and birth certificates."
"over 800 validated job application documents (including résumés and cover letters), which were confirmed through LinkedIn and other web searches as being associated with real people."
