KaLM-Reranker-V1: A Decoupled Encoder-Decoder Reranker for Efficient Document Retrieval

[Submitted on 22 Jun 2026]

Summary

KaLM-Reranker-V1 is a reranking model for retrieval systems that uses an encoder-decoder architecture to decouple query and passage computation. Most existing rerankers encode the query and passage jointly, which ties the passage computation to each incoming query. Here the encoder pre-encodes passages with Matryoshka embedding pooling, the decoder models query intent, and cross-attention captures the relevance between the query and passage representations. The model comes in three sizes: Nano (0.27B parameters), Small (1B), and Large (4B). On BEIR it achieves state-of-the-art performance, on par with strong industrial models such as the Qwen3-Reranker series, while the decoupled passage encoding makes it more efficient to deploy. On LMEB, even the 0.27B Nano model remains competitive with 7-12B embedding models.
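To make the decoupling concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: `toy_embed`, `DecoupledReranker`, the mean-pooled query vector, and the single unparameterized attention step are all hypothetical stand-ins chosen for illustration, and Matryoshka pooling is not modeled. The only point it shows is where the work happens: passages are encoded once and cached, and a query triggers only a query encoding plus a cross-attention pass over the cached states.

```python
import math
import random

DIM = 8  # toy hidden size, chosen only for illustration


def toy_embed(word):
    """Hypothetical stand-in for a learned token representation."""
    rng = random.Random(word)  # seeded by the word, so vectors are repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def encode_passage(text):
    """Offline step: one state vector per passage token, independent of any query."""
    return [toy_embed(w) for w in text.lower().split()]


def encode_query(text):
    """Online step: mean-pool the query tokens into a single vector."""
    vecs = [toy_embed(w) for w in text.lower().split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cross_attention_score(query_vec, passage_states):
    """The query attends over cached passage states; the attended context is scored."""
    weights = softmax([dot(query_vec, s) / math.sqrt(DIM) for s in passage_states])
    context = [sum(w * s[i] for w, s in zip(weights, passage_states)) for i in range(DIM)]
    return dot(query_vec, context)


class DecoupledReranker:
    def __init__(self):
        self.cache = {}  # doc_id -> precomputed passage states

    def index(self, doc_id, text):
        self.cache[doc_id] = encode_passage(text)

    def rerank(self, query, doc_ids):
        q = encode_query(query)  # the only encoding work done per query
        scored = [(cross_attention_score(q, self.cache[d]), d) for d in doc_ids]
        return [d for _, d in sorted(scored, reverse=True)]


reranker = DecoupledReranker()
reranker.index("d1", "rerankers reorder retrieved passages")
reranker.index("d2", "bread needs flour and water")
ranking = reranker.rerank("how do rerankers work", ["d1", "d2"])
```

A jointly encoding reranker would instead run the full model over every (query, passage) pair at query time, so nothing on the passage side could be cached.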

Source

Twitter / X · KaLM-Reranker-V1: A Decoupled Encoder-Decoder Reranker for Efficient Document Retrieval · arxiv.org

Key quotes

We present KaLM-Reranker-V1, a fast but not late-interaction (FBNL) reranker that decouples query and passage computation while retaining expressive relevance modeling.

This design makes KaLM-Reranker-V1 efficient through decoupled passage encoding, yet not late interaction, by preserving rich relevance modeling through cross-attention.

On BEIR, KaLM-Reranker-V1 achieves state-of-the-art performance, on par with strong industrial models such as the Qwen3-Reranker series.

On LMEB, reranking models demonstrate a clear advantage, with even the 0.27B Nano model remaining competitive with 7-12B embedding models.
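The "fast but not late-interaction" distinction in the quotes above can be illustrated with a small sketch. Both scorers below consume passage token vectors that could be precomputed offline. The late-interaction scorer uses MaxSim, the fixed operator popularized by ColBERT, while the cross-attention scorer routes the same vectors through projection matrices. The function names, the single attention head, the mean pooling over query tokens, and the weights `w_q`, `w_k`, `w_v`, `w_out` are assumptions made for illustration, not the paper's design.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def maxsim_score(query_tokens, passage_tokens):
    """Late interaction: each query token keeps only its best-matching passage token."""
    return sum(max(dot(q, p) for p in passage_tokens) for q in query_tokens)


def cross_attention_score(query_tokens, passage_tokens, w_q, w_k, w_v, w_out):
    """Cross-attention: projected queries attend over projected passage states."""
    keys = [matvec(w_k, p) for p in passage_tokens]
    values = [matvec(w_v, p) for p in passage_tokens]
    scale = math.sqrt(len(keys[0]))
    pooled = [0.0] * len(values[0])
    for q in query_tokens:
        q_proj = matvec(w_q, q)
        weights = softmax([dot(q_proj, k) / scale for k in keys])
        context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(pooled))]
        pooled = [a + b for a, b in zip(pooled, context)]
    pooled = [x / len(query_tokens) for x in pooled]
    return dot(w_out, pooled)  # hypothetical linear relevance head


identity = [[1.0, 0.0], [0.0, 1.0]]
late = maxsim_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]])
cross = cross_attention_score([[1.0, 0.0]], [[2.0, 3.0]], identity, identity, identity, [1.0, 1.0])
```

In a trained model the projection and output weights would be learned, which is what lets cross-attention express richer relevance patterns than a fixed similarity operator while still reusing cached passage states.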

Snippet from the RSS feed

As retrieval systems scale, high-quality reranking becomes increasingly important. However, most existing rerankers, whether encoder-based or decoder-based, jointly encode the query and passage, tightly coupling their computation and limiting deployment e

