KaLM-Reranker-V1: A Decoupled Encoder-Decoder Reranker for Efficient Document Retrieval
[Submitted on 22 Jun 2026]
Summary
KaLM-Reranker-V1 is a reranking model for retrieval systems that decouples query and passage computation using an encoder-decoder architecture. Traditional rerankers encode the query and passage jointly, so every query-passage pair requires a full forward pass. In this model, the encoder pre-encodes passages with Matryoshka embedding pooling, the decoder processes the query, and cross-attention captures relevance between the query and passage representations. The model comes in three sizes (Nano 0.27B, Small 1B, and Large 4B parameters). On BEIR it reports state-of-the-art performance, on par with strong industrial models such as the Qwen3-Reranker series, while the decoupled passage encoding keeps it efficient. On the LMEB benchmark, even the 0.27B Nano model remains competitive with 7-12B embedding models.
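To make the decoupling concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: `encode_passage`, `cross_attention_score`, and `rerank` are hypothetical names, mean pooling into a fixed number of slots is only a stand-in for the encoder and its Matryoshka pooling, and single-head dot-product attention is a stand-in for the decoder's cross-attention. The only point it illustrates is that passage representations can be computed once and cached, leaving just the query-side attention to run at request time.

```python
import math

def encode_passage(token_vecs, num_slots=2):
    """Offline step (assumed stand-in for the encoder): mean-pool a
    non-empty list of token vectors into at most `num_slots` vectors."""
    size = math.ceil(len(token_vecs) / num_slots)
    dim = len(token_vecs[0])
    slots = []
    for start in range(0, len(token_vecs), size):
        chunk = token_vecs[start:start + size]
        slots.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return slots

def cross_attention_score(query_vecs, passage_slots):
    """Query-time step: each query vector attends over the cached slots;
    the score is the mean dot product of query and attended context."""
    dim = len(passage_slots[0])
    total = 0.0
    for q in query_vecs:
        logits = [sum(a * b for a, b in zip(q, s)) / math.sqrt(dim)
                  for s in passage_slots]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(x - peak) for x in logits]
        norm = sum(weights)
        context = [sum(w / norm * s[d] for w, s in zip(weights, passage_slots))
                   for d in range(dim)]
        total += sum(a * b for a, b in zip(q, context))
    return total / len(query_vecs)

def rerank(query_vecs, cache):
    """Score every cached passage against the query and sort best-first."""
    scores = {pid: cross_attention_score(query_vecs, slots)
              for pid, slots in cache.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Passages are encoded once, ahead of time, and reused for every query.
cache = {
    "a": encode_passage([[1.0, 0.0], [1.0, 0.0]]),
    "b": encode_passage([[0.0, 1.0], [0.0, 1.0]]),
}
order = rerank([[1.0, 0.0]], cache)
```

In a jointly encoding reranker, the work inside `encode_passage` would have to be repeated for every query-passage pair. Here it is paid once per passage, which is the efficiency argument the summary describes.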
Key quotes
We present KaLM-Reranker-V1, a fast but not late-interaction (FBNL) reranker that decouples query and passage computation while retaining expressive relevance modeling.
This design makes KaLM-Reranker-V1 efficient through decoupled passage encoding, yet not late interaction, by preserving rich relevance modeling through cross-attention.
On BEIR, KaLM-Reranker-V1 achieves state-of-the-art performance, on par with strong industrial models such as the Qwen3-Reranker series.
On LMEB, reranking models demonstrate a clear advantage, with even the 0.27B Nano model remaining competitive with 7-12B embedding models.