OCRBase: Open-Source PDF to Structured Data Conversion Tool with PaddleOCR-VL
By adammajcher · 4mo ago · 3 min read
Summary
OCRBase is an open-source tool that converts PDF documents into structured data formats (Markdown or JSON) using PaddleOCR-VL models. It offers both cloud API services with a free tier for up to 1,000 pages and self-hosting options with Docker. The platform provides SDKs for easy integration, supports GPU acceleration for performance, and focuses on extracting structured data from documents at scale.
Key quotes
Turn PDFs into structured data at scale. Powered by frontier open-weight OCR models.
ocrbase.dev: parse and extract data from documents, up to 1K pages for free.
The API will be available at http://localhost:3000. See the Self-Hosting Guide for PaddleOCR setup, GPU configuration, and all environment variables.
ocrbase has two core operations. Both are asyn
📄 PDF -> .MD/.JSON API & SDK for PaddleOCR-VL with structured data extraction. Self-hostable. (ocrbase-hq/ocrbase)
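
The quotes above mention a self-hosted API at http://localhost:3000 and operations that appear to be asynchronous (the quote is cut off). A client for that kind of API typically submits a document and then polls until the job finishes. The sketch below shows only the polling step in Python. It is not the ocrbase SDK: the `/jobs/{id}` path, the `status` field, and its `completed`/`failed` values are assumptions made for illustration, so check the project's own documentation for the real routes and response shape.

```python
import time
from typing import Any, Callable, Dict

# Base URL taken from the quote above (self-hosted default).
BASE_URL = "http://localhost:3000"


def job_url(job_id: str, base_url: str = BASE_URL) -> str:
    """Build a job status URL. The "/jobs/{id}" path is hypothetical."""
    return f"{base_url.rstrip('/')}/jobs/{job_id}"


def wait_for_job(
    fetch_status: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a job until it reaches a terminal state.

    fetch_status is any callable that takes a URL and returns the decoded
    JSON body as a dict (for example, a thin wrapper around an HTTP client).
    The "status" field and its values are assumed, not taken from ocrbase.
    """
    for attempt in range(max_attempts):
        job = fetch_status(job_url(job_id))
        if job.get("status") in ("completed", "failed"):
            return job
        # Wait between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} attempts")
```

Passing the fetcher and the sleep function in as parameters keeps the polling logic independent of any particular HTTP library and lets it be exercised without a running server.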
