All Topics

Technology

Art

Building a web search engine from scratch: 3 billion neural embeddings in two months

wilsonzlin

9mo ago· 40 min readenInsight

100/100

Golden Brown

Bagelometer↗

Fresh out the oven, still warm. Top of the tray.

Score100TypeanalysisSentimentpositive

Summary

A developer documents their personal challenge of building a web search engine from scratch over two months, using 3 billion neural embeddings, a large GPU cluster, distributed RocksDB, and terabytes of sharded HNSW. The project was motivated by frustrations with existing search engines prioritizing engagement bait over quality content and relying too heavily on keyword matching rather than human-level understanding.

Key quotes

· 3 pulled

A simple question I had was: why couldn't a search engine always result in top quality content?

Such content may be rare, but the Internet's tail is long, and better quality results should rank higher than the prolific inorganic content and engagement bait you see today.

Another pain point was that search engines often felt underpowered, closer to keyword matching than human-level

Snippet from the RSS feed

End-to-end deep dive of the project, spanning a large GPU cluster, distributed RocksDB, and terabytes of sharded HNSW.

You might also wanna read

DeepSeek-V4: Hybrid Sparse-Attention Architecture Enables Efficient Million-Token Context Inference

DeepSeek-V4 introduces a hybrid sparse-attention architecture combined with on-policy distillation across domain specialists, enabling 1M-to

artgor.medium.com·8h ago

Rotary GPU: Enabling Large Mixture-of-Experts Models on Consumer Laptop GPUs with Limited Memory

This paper presents Rotary GPU, an exploratory approach to running large Mixture-of-Experts (MoE) language models on consumer-grade hardware

arxiv.org·1d ago

LinkedIn cuts GPU training hours by 65% with Generative Recommender system optimizations

LinkedIn has developed a Generative Recommender (GR) system that models user activity as token sequences, offering richer long-context perso

startuphub.ai·3d ago

Rank-Aware Decomposition Technique Reduces Computation in Recommender Systems by 87.5%

This paper presents a rank-aware decomposition technique for deep ranking models in industrial recommender systems. The key insight is that

arxiv.org·3d ago

Hands-on evaluation of MiniMax M2.7 via API on ML and coding workflows

The author evaluates MiniMax M2.7 by using it through Claude Code on three real-world ML and coding workflows: scaffolding a Kaggle competit

andlukyane.com·12d ago

Orthrus: A Dual-Architecture Framework for Fast, Lossless LLM Inference via Diffusion Decoding

Orthrus is a dual-architecture framework that combines autoregressive LLMs with diffusion models to enable fast, lossless parallel token gen

github.com·16d ago