All Topics
All Topics
Technology
Technology
Design
Design
Programming
Programming
Science
Science
News
News
Gaming
Gaming
Entertainment
Entertainment
Business
Business
Finance
Finance
Sports
Sports
Health
Health
Food
Food
Travel
Travel
Art
Art
Music
Music
Books
Books
Education
Education
Politics
Politics
Personal
Personal
No algorithm. No AI slop. No ads. Just RSS. Pro-human. Indie writers. Real journalism. Open web. Chronological. Hand toasted.

OpenAI Appears to Scrape Certificate Transparency Logs for New Websites to Crawl

By

pavel_lishin

5mo ago· 4 min readenInsight

Summary

A technical blog post describes how the author discovered that OpenAI appears to be scraping Certificate Transparency (CT) logs to find new websites to crawl. After minting a new TLS certificate, the author observed near-instantaneous requests from OpenAI's crawler to their server's robots.txt file, suggesting automated monitoring of CT logs for new domains to index.

Key quotes

· 3 pulled
I minted a new TLS cert and it seems that OpenAI is scraping CT logs for what I assume are things to scrape from, based on the near instant response from this:
Dec 12 20:43:04 xxxx xxx[719]: l=debug m="http request" pkg=http httpaccess= handler=(nomatch) method=get url=/robots.txt host=autoconfig.benjojo.uk duration="162.176µs" statuscode=404 proto=http/2.0 remoteaddr=74.7.175.182:38242 tlsinfo=tls1.3 useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome
I minted a new TLS cert and it seems that OpenAI is scraping CT logs for what I assume are things to scrape from, based on the near instant response from ...
Snippet from the RSS feed
lol. I minted a new TLS cert and it seems that OpenAI is scraping CT logs for what I assume are things to scrape from, based on the near instant response from ...

You might also wanna read