Scaling Bloom Filter-Based Full-Text Search to Large Document Collections
By
birdculture
Sesame, salt, and substance. A flagship bake.
Summary
The article discusses scaling a space-efficient full-text search technique using bloom filters from small document collections to large web-scale corpuses. It builds on a 2013 blog post that described creating bloom filters for each document's words and querying by checking each document's filter. While acknowledging the O(number-of-documents) query time complexity makes it unsuitable for large collections, the author proposes a method to scale the technique to web-scale document collections, while also discussing why this might be a bad idea. The content appears to be a technical blog post exploring algorithmic trade-offs in search technology.
Key quotes
· 4 pulledThe algorithm is simple: Per document, create a bloom filter of all its words.
To query, simply check each document's bloom filter for the query terms.
With a query time complexity of O(number-of-documents), we can forget about using this on big corpuses, right?
In this blog post I propose a way of scaling the technique to large document corpuses (e.g. the web) and discuss why that is a bad idea.
You might also wanna read
Modified Raft Consensus Protocol Enables Progress with Minority Node Participation
This article describes a modified version of the Raft consensus protocol that allows progress to be made even when fewer than a majority of
Binary GCD Algorithm: A Faster Alternative to Standard Euclidean Algorithm
The article explains the binary GCD algorithm, a faster variant of the Euclidean algorithm for computing greatest common divisors. It covers
Optimizing Prime Number Generation: Creating a High-Performance C Program for 32-Bit Primes
This article documents the author's technical journey to create an optimized C program for Linux that generates all prime numbers that fit w
Implementing Hybrid Semantic Search in SQLite with Binary Embeddings and Hamming Distance
This technical article demonstrates how to implement semantic search in SQLite using binary embeddings and Hamming distance, enabling hybrid
Understanding Speculative Sampling: Using Draft Distributions to Match Target Sampling Results
Speculative sampling is a technique that uses a draft sampling distribution to achieve the same results as a target sampling distribution. T
Programmable Search Engine Updates: Modernized Experience and API Transition to Vertex AI
This article is a blog post from the Programmable Search Engine team announcing updates and improvements to their search products. It covers
