All Topics
All Topics
Technology
Technology
Design
Design
Programming
Programming
Science
Science
News
News
Gaming
Gaming
Entertainment
Entertainment
Business
Business
Finance
Finance
Sports
Sports
Health
Health
Food
Food
Travel
Travel
Art
Art
Music
Music
Books
Books
Education
Education
Politics
Politics
Personal
Personal
No algorithm. No AI slop. No ads. Just RSS. Pro-human. Indie writers. Real journalism. Open web. Chronological. Hand toasted.

Scaling Bloom Filter-Based Full-Text Search to Large Document Collections

By

birdculture

6mo ago· 8 min readenInsight

Summary

The article discusses scaling a space-efficient full-text search technique using bloom filters from small document collections to large web-scale corpuses. It builds on a 2013 blog post that described creating bloom filters for each document's words and querying by checking each document's filter. While acknowledging the O(number-of-documents) query time complexity makes it unsuitable for large collections, the author proposes a method to scale the technique to web-scale document collections, while also discussing why this might be a bad idea. The content appears to be a technical blog post exploring algorithmic trade-offs in search technology.

Key quotes

· 4 pulled
The algorithm is simple: Per document, create a bloom filter of all its words.
To query, simply check each document's bloom filter for the query terms.
With a query time complexity of O(number-of-documents), we can forget about using this on big corpuses, right?
In this blog post I propose a way of scaling the technique to large document corpuses (e.g. the web) and discuss why that is a bad idea.
Snippet from the RSS feed
Jonathan Arns's personal blog

You might also wanna read

Modified Raft Consensus Protocol Enables Progress with Minority Node Participation

This article describes a modified version of the Raft consensus protocol that allows progress to be made even when fewer than a majority of

padhye.org·5d ago

Binary GCD Algorithm: A Faster Alternative to Standard Euclidean Algorithm

The article explains the binary GCD algorithm, a faster variant of the Euclidean algorithm for computing greatest common divisors. It covers

en.algorithmica.org·1mo ago

Optimizing Prime Number Generation: Creating a High-Performance C Program for 32-Bit Primes

This article documents the author's technical journey to create an optimized C program for Linux that generates all prime numbers that fit w

hnlyman.github.io·2mo ago

Implementing Hybrid Semantic Search in SQLite with Binary Embeddings and Hamming Distance

This technical article demonstrates how to implement semantic search in SQLite using binary embeddings and Hamming distance, enabling hybrid

notnotp.com·3mo ago

Understanding Speculative Sampling: Using Draft Distributions to Match Target Sampling Results

Speculative sampling is a technique that uses a draft sampling distribution to achieve the same results as a target sampling distribution. T

saibo-creator.github.io·3mo ago

Programmable Search Engine Updates: Modernized Experience and API Transition to Vertex AI

This article is a blog post from the Programmable Search Engine team announcing updates and improvements to their search products. It covers

programmablesearchengine.googleblog.com·4mo ago