Finance Agent Benchmark: evaluating and improving AI for Finance

As business complexity accelerates, finance organizations are being asked to support more stakeholders, guide more decisions, and operate at unprecedented speed, often without additional resources. As these demands outpace traditional tools and processes, AI presents an opportunity to reshape how finance teams scale their impact. Finance Agent in Microsoft 365 Copilot is purpose-built for this shift, helping finance teams extend capacity, improve speed and accuracy, and deliver insights that support decision-making.

Today, we’re announcing the Finance Agent Benchmark, a benchmarking framework comprising an evaluation dataset and an evaluation harness, designed to assess how effectively AI-powered finance agents perform across real-world finance scenarios. It combines purpose-built metrics with finance-specific scenarios and a scoring methodology designed to produce meaningful, real-world assessments. The first release of the Finance Agent Benchmark includes tasks such as preparing finance business briefs and researching financial performance, based on both reputable data sources (including public SEC filings and financial market data via MSN Money) and synthetic internal Accounts Payable and Accounts Receivable data.

This technical deep dive describes how Finance Agent is designed and how the benchmark is structured and scored, and it compares Finance Agent's benchmark results against those of the OpenAI Responses API with GPT 5.5 and the Anthropic Claude Code CLI with Claude Opus 4.7.

Finance Agent in Microsoft 365 Copilot architecture

Finance Agent is built on a modular, agentic architecture designed to help organizations derive value from financial data while leveraging Microsoft’s AI ecosystem, including enterprise security and compliance controls such as identity-based access (Microsoft Entra ID), tenant-level data isolation, encryption in transit and at rest, and Microsoft Purview governance capabilities (e.g., sensitivity labels, data loss prevention, and audit logging).
Its design centers on a unified, agentic backend that orchestrates finance-specific capabilities such as finance business brief generation, entity financial performance research, and entity financial obligations with the user company research.

Finance Agent’s architecture includes a dynamic, context-aware UI that is generated at runtime. Rather than relying on static interfaces, the system generates UI elements based on the user’s role, workflow stage, and current task context. The platform composes tailored experiences, adapting dynamically to where the user is in their workflow and the action they are trying to complete. This enables highly personalized, scenario-driven interactions (e.g., researching invoicing history), where UI elements, tools, and insights are assembled in real time to guide the user through finance processes such as analysis, reconciliation, or reporting.

The unified agentic backend integrates APIs, MCP applications, tools, and skills with a data access layer designed for enterprise scenarios, connecting work context from Microsoft 365 with ERP systems and external financial sources. This agentic framework allows the system to orchestrate data retrieval and user experience together, in a tightly coupled manner. Together, the AI-powered backend and dynamic UI layer deliver deeply embedded, workflow-centric experiences within Microsoft 365 Copilot, enabling finance professionals to access insights, perform analysis, and execute workflows within their natural flow of work.

Evaluation Methodology of Finance Agent in Microsoft 365 Copilot

We evaluated Finance Agent using a purpose-built benchmark designed to reflect real finance workflows, including ERP data retrieval and multi-entity financial analysis. The benchmark uses a transparent methodology, so results can be independently verified and reproduced.
To keep the benchmark grounded in the day-to-day tasks and challenges finance teams face, we built it in collaboration with finance professionals across multiple industries. The benchmark is designed to execute repeatedly, enabling consistent measurement of improvements and producing results that remain comparable over time.

We constructed a dataset spanning three finance task areas: (1) finance business briefs preparation, (2) entity financial obligations with the user company research, and (3) entity financial performance research. The dataset includes approximately 300 questions across these areas, covering multiple industries, company types, and finance roles:

Finance business briefs preparation: A structured intelligence report on a company - combining financial performance, strategic context, and ERP history - of the kind a finance analyst would prepare before a negotiation or credit decision.
Entity financial obligations with the user company research: Answers to accounts payable and accounts receivable questions over Dynamics 365 Finance data, covering vendor balances, invoice aging, and payment obligations.
Entity financial performance research: Answers to time-bound factual questions about a company's financial health (e.g., revenue, margins, leverage, and latest reported results) grounded in public filings and market data.

The current release of Finance Agent focuses on providing quick answers in conversational experiences, so latency limits apply to responses: 60 seconds for entity financial questions and 300 seconds for business brief generation.

To evaluate how Finance Agent performed, in both user experience and results, against general-purpose agentic runtimes not specifically designed for financial tasks, we compared it with two alternative systems: OpenAI’s Responses API and Anthropic’s Claude Code CLI.
Both were given the same tool access as Finance Agent, including a Dynamics 365 Finance MCP Server and web search, and operated under common system instructions designed to simulate the experience a finance professional would have when using an AI agent.

We used the latency-optimized models available from each provider at the time of evaluation in May 2026: Anthropic’s Claude Haiku 4.5 and Claude Opus 4.7 and OpenAI’s GPT 5.5, all configured with low reasoning effort. These configurations were selected to match the latency limits applied to Finance Agent: 60 seconds for entity financial questions and 300 seconds for business brief generation. In separate experiments using default reasoning settings, those systems often exceeded the time limits, resulting in a high rate of timeouts and lower performance. Without the latency constraints, models configured with high reasoning effort improve substantially, at the cost of extra latency. We will cover detailed performance with high reasoning effort for all three comparators in the next version of the Finance Agent Benchmark.

Evaluation Dimensions

We evaluated the AI agents on multiple metrics designed to measure how well responses align with the expectations of finance professionals:

Accuracy - Measures whether the answer is factually correct and aligns with verified financial data or ground truth in the financial system of record, Dynamics 365 Finance.
Citation Rate - Checks whether the response properly references and cites sources to support its claims.
Clarity - Assesses how easy the response is to understand, including conciseness and readability.
Depth - Measures how thoroughly the response explores the topic, going beyond surface-level information.
Groundedness - Evaluates whether the response is supported by referenced sources and avoids unsupported or hallucinated claims.
Recency - Measures whether time-sensitive information is up-to-date and anchored to specific dates where relevant.
Relevance - Assesses whether the response directly answers the user’s question and stays focused on the task.
Structure - Evaluates how well the response is aligned with common finance workflow expectations, including logical flow and prioritization of key insights.

Outputs were scored automatically using an LLM as a judge (OpenAI GPT 5.2). Evaluations were grounded in predefined, operationalized rubric assertions for each metric dimension to reduce subjective interpretation, minimize bias, and improve scoring consistency. Dimension scores were averaged within each task area and then weighted equally across the three finance task areas to produce an overall composite score. The result is a benchmark that reveals where each agent excels and where it needs improvement. Additional details, including methodology, metrics, and definitions, are described in our GitHub repository.

Results

Results reflect testing of Finance Agent completed on May 8th, 2026, against external systems, including OpenAI’s Responses API (GPT 5.5) and Anthropic’s Claude Code CLI (Claude Opus 4.7 and Haiku 4.5). The systems were configured with tools such as a Model Context Protocol server for finance and operations apps (exposing Microsoft Dynamics 365 Finance, version 10.0.48) and evaluated using instructions and latency constraints designed to approximate finance-user workflows. Testing was performed under controlled conditions using only synthetic or controlled datasets with no customer production data. Results may not reflect all real-world scenarios and may vary depending on factors such as operating system, hardware, network connection, datasets, evaluation methodologies, and usage.

Finance Agent leads across evaluation metrics because its instructions were explicitly tuned to align with the needs of finance professionals as reflected in those metrics.
Its strong performance in factual accuracy is supported by access to structured financial data sources such as MSN Money, combined with instructions optimized to leverage ERP data via the MCP server for Microsoft Dynamics 365 Finance. Beyond accuracy, consistent gains in depth, citations, clarity, relevance, structure, and groundedness [1] reflect deliberate design choices to produce outputs that are transparent, easy to validate, and directly actionable within finance workflows.

Sample Queries for the three finance tasks

Below are representative examples aligned to the three Finance Agent Benchmark task areas described above. Some examples are for illustration only and are fictitious. No real association is intended or inferred.

Finance business briefs preparation:

Query: "Corporate Profile report of Microsoft"

A complete response opens with a concise executive summary - ticker, sector, headquarters, market cap, and key financial highlights - then expands into sections covering business model and lines of business, leadership, competitive positioning, financial performance (FY revenue, EBITDA, margins, EPS, YoY trends), strategic priorities, and risk factors. Accuracy requires that key facts and figures - market cap, revenue, employee count, leadership names and titles - match the most recent official filings; mismatches on CEO identity or key metrics count against the score. Depth requires each section to be backed by a specific, traceable source, with gaps explicitly acknowledged rather than papered over with approximations. Structure is evaluated on whether the response leads with the most critical facts, uses a clear heading hierarchy, and is organized for a finance professional to use without reformatting.

Entity financial obligations with the user company research:

Query: "Of the customers in Group 90 in USMF that have past due balances for more than 90 days, what are the top 10 based on past due balances as of March 2, 2026?"
A correct response identifies the top 10 customers in USMF customer group 90 with balances past due more than 90 days, ranked in descending order:

1. The Phone Company (SYNCUS-0025): $182,539.15
2. Northwind Traders Europe (SYNCUS-0327): $173,961.25
3. Fourth Coffee Pro (SYNCUS-0629): $171,810.59
4. A. Datum Retail (SYNCUS-0477): $169,758.98
5. Northwind Traders (SYNCUS-0019): $167,513.61
6. City Power & Light Europe (SYNCUS-0985): $166,496.40
7. Adventure Works Americas (SYNCUS-0282): $166,147.50
8. Adventure Works Europe (SYNCUS-0310): $166,057.25
9. Tailspin Toys Global (SYNCUS-0919): $165,614.07
10. Coho Vineyard & Winery Wholesale (SYNCUS-0512): $158,547.55

What makes this question hard is that the agent must apply two independent filters simultaneously (customer group 90 and the 90-day aging threshold) before ranking, and most AR aging tool calls return all customers or all aging buckets; a model that omits either filter will pull the wrong population entirely. Completeness is the primary accuracy signal: group 90 has many customers, several with large balances that sit in younger aging buckets and therefore don't qualify, so including a near-miss and dropping a true top-10 entry both count as failures. Arithmetic precision matters at the margin: the 10th entry ($158,547.55) is close to several non-qualifying customers, so an off-by-one error in filtering or sorting can displace a correct entry. Depth rewards responses that go beyond the ranked list to note the aggregate past-due exposure across the 10 accounts, flag that the largest balance is about 15% larger than the 10th, and observe whether the exposure is concentrated in a small number of accounts or spread evenly. Structure requires each entry to include the account ID alongside the name and balance in a table or consistent list; IDs are essential for unambiguous identification because several customer names share prefixes (Adventure Works Americas vs. Europe).
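The two-filter-then-rank logic described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's or the ERP's actual implementation: the `AgingRow` shape, the `top_past_due` helper, the `days_past_due` values, and the two `HYPO-` rows are all hypothetical. Only the two named accounts and their balances come from the example answer.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class AgingRow:
    """One customer's past-due position (hypothetical shape, not the ERP schema)."""
    account_id: str
    name: str
    customer_group: str
    days_past_due: int
    past_due_balance: Decimal


def top_past_due(rows, group, min_days, limit=10):
    """Apply BOTH filters first, then rank by balance, largest first."""
    qualifying = [
        r for r in rows
        if r.customer_group == group and r.days_past_due > min_days
    ]
    # Account ID breaks ties so the ordering is deterministic.
    ranked = sorted(qualifying, key=lambda r: (-r.past_due_balance, r.account_id))
    return ranked[:limit]


rows = [
    # Balances from the example answer; days_past_due values are illustrative.
    AgingRow("SYNCUS-0025", "The Phone Company", "90", 120, Decimal("182539.15")),
    AgingRow("SYNCUS-0512", "Coho Vineyard & Winery Wholesale", "90", 95, Decimal("158547.55")),
    # Hypothetical near-misses: a larger balance in a younger bucket, and a
    # larger balance in a different customer group. Neither may appear.
    AgingRow("HYPO-0001", "Hypothetical younger bucket", "90", 45, Decimal("250000.00")),
    AgingRow("HYPO-0002", "Hypothetical other group", "30", 200, Decimal("300000.00")),
]

top = top_past_due(rows, group="90", min_days=90)
total_exposure = sum(r.past_due_balance for r in top)
spread_pct = (top[0].past_due_balance / top[-1].past_due_balance - 1) * 100

for rank, r in enumerate(top, start=1):
    print(f"{rank}. {r.name} ({r.account_id}): ${r.past_due_balance:,.2f}")
print(f"Aggregate exposure: ${total_exposure:,.2f}")
print(f"Largest balance exceeds smallest by {spread_pct:.1f}%")
```

Dropping either condition from the filter lets one of the hypothetical rows outrank both real entries, which is the failure mode the example is designed to catch. Using `Decimal` rather than floats keeps currency totals exact.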
Entity financial performance research:

Query: "Compare UnitedHealth Group and CVS Health in Medicare Advantage for 2024: membership by top five states, year-over-year growth, and how 2024 and 2025 CMS rate updates influenced their guidance and medical loss ratios. Present findings in a side-by-side table."

A complete response delivers a properly formatted side-by-side table with companies as columns, specific metrics as rows, all monetary amounts and percentages to two decimal places, and explicit as-of dates for enrollment figures. Accuracy requires state-level membership counts - not just a ranked list of states - and actual FY2024 medical loss ratios cited from GAAP filings or earnings releases rather than approximate ranges from secondary sources. Depth requires the CMS rate narrative to be specific to each company's guidance and MLR, with gaps flagged where data was unavailable rather than implied complete. Structure is evaluated on whether the table is well organized and easy to interpret, the summary synthesizes rather than restates the table, and the CMS rate analysis is distinct prose - not collapsed into table cells.

[1] Groundedness is inherently difficult to evaluate for the external AI agents because they only partially return the retrieved sources they used for answering queries. While we attempted to reconstruct those sources for offline evaluation, we were not able to consistently achieve complete coverage.
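
The composite scoring described under Evaluation Dimensions (dimension scores averaged within each task area, then the three task areas weighted equally) can be sketched as follows. This is one reading of that description, not the benchmark harness itself: the `composite_score` function, the 0-to-1 score scale, the choice of an unweighted mean over dimensions, and every number below are assumptions for illustration only, not benchmark results.

```python
from statistics import mean


def composite_score(scores):
    """Return (composite, per_area) from {task_area: {dimension: score}}.

    Assumes each dimension score is already the mean over that task area's
    questions, and that dimensions are weighted equally within an area.
    """
    per_area = {
        area: mean(list(dimensions.values()))
        for area, dimensions in scores.items()
    }
    # Equal weight per task area, regardless of how many questions each holds.
    composite = mean(list(per_area.values()))
    return composite, per_area


# Hypothetical scores, for illustration only.
example_scores = {
    "finance business briefs preparation": {"accuracy": 0.8, "clarity": 0.6},
    "entity financial obligations research": {"accuracy": 0.5, "clarity": 0.5},
    "entity financial performance research": {"accuracy": 1.0, "clarity": 0.6},
}

composite, per_area = composite_score(example_scores)
for area, score in per_area.items():
    print(f"{area}: {score:.3f}")
print(f"composite: {composite:.3f}")
```

Averaging per area before combining means a task area with many questions does not dominate the composite, which is the practical effect of weighting the three areas equally.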
