Direct Corpus Interaction: A New Retrieval Paradigm for Agentic Search Without Embedding Models
By 44za12
Summary
This research paper introduces Direct Corpus Interaction (DCI), a novel approach to retrieval for agentic search that bypasses traditional embedding models, vector indexes, and retrieval APIs. Instead, DCI allows language agents to interact directly with raw corpora using general-purpose terminal tools like grep, file reads, shell commands, and lightweight scripts. The authors argue that conventional retrieval systems (lexical or semantic) compress access into a single top-k retrieval step before reasoning, which becomes a bottleneck for agentic tasks requiring exact lexical constraints, sparse clue conjunctions, multi-step hypothesis refinement, and intermediate entity discovery. Their experiments across IR benchmarks and end-to-end agentic search tasks show DCI substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and achieves strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever.
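To make the idea of interacting with a raw corpus more concrete, here is a minimal sketch of the kind of primitives an agent could call in place of a top-k retriever. It is not the authors' implementation: the function names (grep_corpus, files_matching_all, read_window), the restriction to .txt files, and the use of Python's re module instead of the actual grep binary are all assumptions made for illustration.

```python
import re
from pathlib import Path


def grep_corpus(root, pattern, ignore_case=True):
    """Return (path, line_number, line) for every line under root matching pattern.

    Hypothetical helper: a pure-Python stand-in for running grep over raw files.
    """
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    hits = []
    # Assumption: the corpus is a directory tree of plain-text .txt files.
    for path in sorted(Path(root).rglob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if regex.search(line):
                    hits.append((str(path), line_number, line.rstrip("\n")))
    return hits


def files_matching_all(root, patterns):
    """Return the sorted files that match every pattern (a conjunction of clues)."""
    surviving = None
    for pattern in patterns:
        matched = {path for path, _, _ in grep_corpus(root, pattern)}
        # Intersect file sets so only files satisfying all clues so far remain.
        surviving = matched if surviving is None else surviving & matched
    return sorted(surviving or [])


def read_window(path, line_number, radius=2):
    """Read a few lines around a hit, as an agent might before refining a query."""
    lines = Path(path).read_text(encoding="utf-8", errors="replace").splitlines()
    start = max(0, line_number - 1 - radius)
    return lines[start:line_number + radius]
```

An agent loop could call grep_corpus with an exact phrase, narrow the candidates with files_matching_all as new clues emerge, and then use read_window to inspect context and pick up intermediate entities for the next query. No index is built in advance, so edits to the files are visible on the next call.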
Key quotes
Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning.
Agentic tasks further exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence.
This approach requires no offline indexing and adapts naturally to evolving local corpora.
Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus.
DCI opens a broader interface-design space for agentic search.
