LLM Rerankers Can Self-Assess Ranking Quality Through Self-Consistency and Supervised Calibration Methods
[Submitted on 2 Jun 2026]
Summary
This paper investigates whether LLM rerankers can predict their own ranking quality (reranker-internal Query Performance Prediction). The authors explore training-free methods (self-consistency across sampled rankings and verbalized confidence) and training-based approaches (Verb-Num and Verb-List). Experiments on TREC Deep Learning 2019-2022 with four LLMs show that self-consistency is competitive with state-of-the-art QPP methods and better calibrated, while direct verbalized confidence is severely overconfident. The proposed supervised methods (Verb-Num and Verb-List) enable LLM rerankers to produce calibrated ranking-quality estimates with minimal additional output tokens.
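As a rough illustration of the self-consistency idea, the sketch below scores how much several sampled rankings of the same candidate documents agree with each other. The summary does not say which agreement measure the authors use, so the choice of mean pairwise Kendall tau, the rescaling to [0, 1], and the function names `kendall_tau` and `self_consistency` are all assumptions made for this example, not the paper's method.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau between two rankings of the same items (no ties).

    Each ranking is a list of document ids, best first.
    """
    pos_b = {doc: i for i, doc in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # x precedes y in rank_a; check whether rank_b orders them the same way.
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 1.0


def self_consistency(sampled_rankings):
    """Mean pairwise Kendall tau over sampled rankings, rescaled to [0, 1].

    Assumption: higher agreement between samples is read as a higher
    estimate of ranking quality for the query.
    """
    pairs = list(combinations(sampled_rankings, 2))
    if not pairs:
        return 1.0  # a single sample trivially agrees with itself
    mean_tau = sum(kendall_tau(a, b) for a, b in pairs) / len(pairs)
    return (mean_tau + 1) / 2


# Hypothetical rankings sampled from a reranker for one query.
samples = [
    ["d1", "d2", "d3"],
    ["d1", "d2", "d3"],
    ["d1", "d3", "d2"],
]
score = self_consistency(samples)
```

A query whose sampled rankings mostly agree gets a score near 1, and one whose samples contradict each other gets a score near 0. Any other rank-agreement measure, such as top-k overlap, could be swapped in for `kendall_tau`.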
Key quotes
Retrieval effectiveness varies substantially across queries, making it important to estimate ranking quality before relevance judgments are available.
Self-consistency is competitive with the state-of-the-art (SOTA) approach and better calibrated in almost all settings, while direct verbalized confidence is severely overconfident.
We propose two supervised methods, Verb-Num and Verb-List, which enable LLM rerankers to produce calibrated ranking-quality estimates with only a few additional output tokens.
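To make "calibrated" and "overconfident" concrete, here is a minimal sketch that compares predicted ranking-quality estimates with the quality actually measured once relevance judgments are available. The page does not state which calibration metric the paper reports, so the binned calibration error, the signed bias, the function name `calibration_report`, and the example numbers are assumptions for illustration only. Both inputs are assumed to lie in [0, 1].

```python
def calibration_report(predicted, actual, n_bins=5):
    """Return (binned calibration error, signed bias).

    predicted: per-query quality estimates in [0, 1]
    actual:    per-query measured quality in [0, 1] (for example nDCG)
    A positive bias means the estimates are too high on average,
    i.e. the predictor is overconfident.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("predicted and actual must be non-empty and equal length")

    # Group queries by predicted value into equal-width bins.
    bins = [[] for _ in range(n_bins)]
    for p, a in zip(predicted, actual):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, a))

    n = len(predicted)
    error = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        mean_a = sum(a for _, a in bucket) / len(bucket)
        # Weight each bin's gap by the share of queries it holds.
        error += len(bucket) / n * abs(mean_p - mean_a)

    bias = sum(p - a for p, a in zip(predicted, actual)) / n
    return error, bias


# Hypothetical example: the predictor always says 0.9, real quality is near 0.5.
error, bias = calibration_report([0.9, 0.9, 0.9, 0.9], [0.5, 0.4, 0.6, 0.5])
```

In this made-up example both the calibration error and the bias come out around 0.4, which is the pattern the quotes describe for direct verbalized confidence: estimates that sit well above the measured quality.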