
Kapa.ai's approach to indexing images for RAG: describing images at indexing time with cheap vision models

By mooreds · 10d ago · 9 min read · en · Insight

Summary

Kapa.ai describes its approach to handling images in Retrieval-Augmented Generation (RAG) pipelines for technical documentation. Instead of sending images to the model at query time, which is expensive, it uses a cheap vision model to describe each image once at indexing time, stores those descriptions as text, and retrieves them alongside regular text chunks. Indexing becomes a one-time cost, and per-query overhead is 1% to 6% over text-only retrieval. The article covers the technical journey, the challenges posed by different image types (screenshots, diagrams, tables), and the resulting approach to making visual content searchable and useful for AI-powered technical Q&A.
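The index-time flow summarized above can be sketched roughly as follows. This is a minimal illustration, not Kapa.ai's implementation: `Chunk`, `build_index`, `retrieve`, and `fake_describe` are hypothetical names, the vision model is replaced by a stub function, and retrieval uses naive word overlap where a real pipeline would presumably use embeddings or a search engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Chunk:
    text: str    # paragraph text, or the stored description of an image
    kind: str    # "text" or "image"
    source: str  # page URL or image path


def build_index(pages: List[Dict], describe_image: Callable[[str], str]) -> List[Chunk]:
    """Index paragraphs as-is and describe each image exactly once."""
    index: List[Chunk] = []
    for page in pages:
        for para in page["paragraphs"]:
            index.append(Chunk(para, "text", page["url"]))
        for img in page["images"]:
            # The (assumed) vision model is called here, at indexing time only.
            index.append(Chunk(describe_image(img), "image", img))
    return index


def retrieve(index: List[Chunk], query: str, k: int = 3) -> List[Chunk]:
    """Rank chunks by word overlap with the query; no image is touched here."""
    q = set(query.lower().split())
    scored = [(len(q & set(c.text.lower().split())), c) for c in index]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scored[:k]]


# Stub standing in for a cheap vision model (assumption: returns plain text).
calls: List[str] = []


def fake_describe(path: str) -> str:
    calls.append(path)
    return f"Screenshot of the settings page showing the API key field ({path})"


pages = [{
    "url": "docs/setup",
    "paragraphs": ["Install the CLI and log in."],
    "images": ["img/settings.png"],
}]
index = build_index(pages, fake_describe)
hits = retrieve(index, "where is the api key field")
```

In this sketch the description is retrieved like any other text chunk, and the stub is called once per image during indexing and never during retrieval, which is the property that keeps per-query cost close to text-only retrieval.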

Key quotes

We don't send images to the model at query time. We describe each image once, at indexing time, with a cheap vision model, store the descriptions as text, and retrieve them alongside ordinary text chunks.
Indexing is a one-time cost; after that, per-query overhead is 1% to 6% over text-only retrieval.
The knowledge bases we process hold millions of images: screenshots, architecture diagrams, circuit schematics, annotated UI walkthroughs.
Snippet from the RSS feed
Reading the screenshots, diagrams and tables in technical documentation for LLMs
