Research on LLM Output Drift in Financial Workflows: Quantifying Consistency Across Model Sizes
By
raffisk
Summary
This research paper examines output drift in Large Language Models (LLMs) deployed for financial workflows. The study quantifies how nondeterministic outputs undermine auditability and trust in regulated financial tasks such as reconciliations and regulatory reporting. Across five model architectures, the key finding is an inverse relationship between model size and output consistency: at temperature T=0.0, smaller models (7B-8B parameters) achieve 100% consistency, while the 120B-parameter GPT-OSS-120B shows only 12.5% consistency. The research introduces a finance-calibrated deterministic test harness, task-specific invariant checking, a three-tier model classification system, and an audit-ready attestation system with dual-provider validation. The framework maps to major financial regulatory requirements (FSB, BIS, CFTC) to enable compliance-ready AI deployments.
Key quotes
Financial institutions deploy Large Language Models (LLMs) for reconciliations, regulatory reporting, and client communications, but nondeterministic outputs (output drift) undermine auditability and trust.
We quantify drift across five model architectures (7B-120B parameters) on regulated financial tasks, revealing a stark inverse relationship: smaller models achieve 100% output consistency at T=0.0, while GPT-OSS-120B exhibits only 12.5% consistency.
This finding challenges conventional assumptions that larger models are universally superior for production deployment.
We map our framework to Financial Stability Board (FSB), Bank for International Settlements (BIS), and Commodity Futures Trading Commission (CFTC) requirements, demonstrating practical pathways for compliance-ready AI deployments.
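To make the consistency figures quoted above concrete, here is a minimal Python sketch of one way such a percentage could be computed. It is not the paper's harness: the metric definition (share of repeated runs matching the most common output after whitespace normalization), the function names `normalize` and `consistency_rate`, and the sample outputs are all assumptions made for illustration.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Collapse whitespace so trivial formatting differences do not count as drift."""
    return " ".join(text.split())


def consistency_rate(outputs: list[str]) -> float:
    """Share of runs whose normalized output equals the most common output.

    Assumed definition for illustration; the paper may define consistency differently.
    """
    if not outputs:
        raise ValueError("need at least one output")
    counts = Counter(normalize(o) for o in outputs)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(outputs)


# Hypothetical outputs from 16 repeated runs of one prompt (not data from the paper).
stable_runs = ["Net balance: 1,250.00 USD"] * 16
drifting_runs = ["Net balance: 1,250.00 USD"] * 2 + [
    f"Net balance is about 1,250 USD (variant {i})" for i in range(14)
]

print(consistency_rate(stable_runs))    # 1.0
print(consistency_rate(drifting_runs))  # 0.125
```

Under this assumed definition, a score of 12.5% means only a small fraction of repeated runs agree with each other, which is why an auditor could not expect to reproduce a given output on demand.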