LLM Rerankers Can Self-Assess Ranking Quality Through Self-Consistency and Supervised Calibration Methods

[Submitted on 2 Jun 2026]

technology science natural language processing information retrieval

Summary

This paper investigates whether LLM rerankers can predict the quality of their own rankings, i.e., reranker-internal query performance prediction (QPP). The authors explore training-free methods (self-consistency across sampled rankings, and verbalized confidence) alongside two proposed supervised methods, Verb-Num and Verb-List. In experiments on TREC Deep Learning 2019–2022 with four LLMs, self-consistency is competitive with the state-of-the-art QPP approach and better calibrated in almost all settings, while direct verbalized confidence is severely overconfident. Verb-Num and Verb-List enable LLM rerankers to produce calibrated ranking-quality estimates with only a few additional output tokens.
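To make the self-consistency idea concrete, here is a minimal sketch in Python. It assumes the reranker is sampled several times on the same query and candidate set, and it scores agreement as the mean pairwise Kendall's tau between the sampled rankings, rescaled to [0, 1]. The agreement measure and the function names are illustrative assumptions; the summary does not say which measure the paper uses.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same document IDs (no ties)."""
    pos_b = {doc: i for i, doc in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # x precedes y in rank_a; check whether rank_b orders them the same way.
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 1.0


def self_consistency(sampled_rankings):
    """Mean pairwise Kendall's tau across sampled rankings, rescaled to [0, 1].

    Hypothetical helper: a higher score means the sampled rankings agree more,
    which is read here as higher confidence in ranking quality.
    """
    pairs = list(combinations(sampled_rankings, 2))
    if not pairs:
        raise ValueError("need at least two sampled rankings")
    mean_tau = sum(kendall_tau(a, b) for a, b in pairs) / len(pairs)
    return (mean_tau + 1) / 2


# Example: two samples that disagree on the order of d2 and d3.
score = self_consistency([["d1", "d2", "d3"], ["d1", "d3", "d2"]])
```

A score near 1 indicates that repeated samples yield nearly the same ordering, and a score near 0 indicates that they are close to reversed. Whether such a score tracks actual retrieval effectiveness is the empirical question the paper evaluates.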

Source

LLM Rerankers Can Self-Assess Ranking Quality Through Self-Consistency and Supervised Calibration Methods (arxiv.org)

Key quotes

Retrieval effectiveness varies substantially across queries, making it important to estimate ranking quality before relevance judgments are available.

Self-consistency is competitive with the state-of-the-art (SOTA) approach and better calibrated in almost all settings, while direct verbalized confidence is severely overconfident.

We propose two supervised methods, Verb-Num and Verb-List, which enable LLM rerankers to produce calibrated ranking-quality estimates with only a few additional output tokens.
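The quotes above contrast "better calibrated" estimates with "severely overconfident" ones. As a rough illustration of what that distinction can mean, the sketch below compares predicted ranking quality with measured quality per query. It assumes both are on the same [0, 1] scale (for example, a predicted versus an actual effectiveness score), and the function name is hypothetical. This is not the paper's calibration metric, which the summary does not specify.

```python
def calibration_report(predicted, actual):
    """Summarize how far predicted quality sits from measured quality.

    predicted, actual: equal-length lists of per-query scores in [0, 1]
    (an assumption of this sketch).
    A positive mean signed gap suggests overconfidence; a negative one
    suggests underconfidence.
    """
    if not predicted or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    gaps = [p - a for p, a in zip(predicted, actual)]
    return {
        "mean_signed_gap": sum(gaps) / n,
        "mean_abs_gap": sum(abs(g) for g in gaps) / n,
    }


# Example: a model that always reports 0.9 regardless of actual quality.
report = calibration_report([0.9, 0.9], [0.4, 0.6])
```

In this toy example the mean signed gap is large and positive, which is the pattern one would expect from an overconfident estimator.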

Snippet from the RSS feed

Retrieval effectiveness varies substantially across queries, making it important to estimate ranking quality before relevance judgments are available. Query performance prediction (QPP) addresses this need, but most existing methods rely on external predi
