Multi-step symbolic reasoning is essential for robust financial analysis, yet current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought (CoT) evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python traces that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose ChainEval, a dynamic alignment metric that jointly evaluates final-answer correctness and step-level reasoning consistency. Evaluating 26 leading LLMs reveals that even frontier proprietary systems exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI.
The benchmark is built upon a comprehensive financial taxonomy of 58 distinct topics across 12 financial domains. For each topic, we developed five parameterized templates of varying complexity (two easy, two intermediate, and one advanced). This tiered structure is designed to systematically evaluate how a model's reasoning capabilities hold up as the number of required steps and the depth of financial knowledge increase. Each template generates problems with symbolic variables and includes an executable Python code trace for the entire solution. This design ensures full machine-verifiability: every intermediate calculation can be checked automatically, so flawed or incomplete reasoning is detected instantly.
To guarantee quality and professional accuracy, every template underwent a rigorous multi-stage verification process led by a team of ten financial experts from both industry and academia. The process began with a pilot study in which all annotators reviewed a common set of templates to align their evaluation standards. Using a custom annotation platform, each expert was then tasked with validating the financial logic, mathematical correctness, and clarity of the remaining templates. If a template was deemed incorrect, the expert had to categorize the flaw using specific issue tags and provide a minimal code correction. This meticulous human-in-the-loop process resulted in the identification and fixing of flaws in 10% of the initial templates.
Figure 2: Symbolic template for generating compound interest problems.
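As a concrete illustration, a minimal sketch of such a parameterized template is shown below. The function name, variable ranges, and step layout are assumptions made for exposition, not the benchmark's actual code; the key idea it mirrors is that each sampled problem carries an executable trace whose intermediate values can be checked automatically.

```python
# Hypothetical sketch of a parameterized symbolic template in the spirit of
# Figure 2. Names, ranges, and structure are illustrative assumptions.
import random

def compound_interest_template(seed: int):
    """Sample symbolic variables and emit a question, a step-by-step
    solution trace, and the final ground-truth answer."""
    rng = random.Random(seed)
    principal = rng.randint(1_000, 50_000)        # P: initial investment
    rate = round(rng.uniform(0.01, 0.12), 3)      # r: annual interest rate
    years = rng.randint(1, 10)                    # t: investment horizon

    question = (
        f"An investor deposits ${principal} at an annual interest rate of "
        f"{rate:.1%}, compounded yearly. What is the balance after {years} years?"
    )

    # Executable solution trace: each step pairs an explanation with the
    # intermediate value it produces, so every step is machine-verifiable.
    growth_factor = (1 + rate) ** years
    final_balance = principal * growth_factor
    steps = [
        ("Compute the growth factor (1 + r)^t.", growth_factor),
        ("Multiply the principal by the growth factor.", final_balance),
    ]
    return question, steps, final_balance

question, steps, answer = compound_interest_template(seed=42)
```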
To enable a rigorous and holistic evaluation, we propose CHAINEVAL, a framework that assesses model outputs on two critical axes: the correctness of the final answer and the faithfulness of the intermediate reasoning steps. At its core, CHAINEVAL uses Dynamic Time Warping (DTW), an algorithm that intelligently finds the optimal alignment between the sequence of steps in a model's solution and the ground-truth trace. This approach is highly robust, as it can naturally handle common variations where a model might combine, skip, or slightly reorder steps while still maintaining the overall logical flow.
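A minimal sketch of this alignment step is shown below, formulated as maximizing the total pairwise step score rather than minimizing a cost; the pairwise scorer is left generic (for instance, the gated score sketched after the next paragraph). This is an illustrative reconstruction under those assumptions, not the benchmark's actual implementation.

```python
# Hypothetical DTW-style alignment between a model's reasoning steps and the
# gold trace. `step_score(pred_step, gold_step)` is any pairwise scorer.
def dtw_align(pred_steps, gold_steps, step_score):
    """Return the best cumulative score over all monotonic alignments of the
    predicted steps to the gold steps."""
    n, m = len(pred_steps), len(gold_steps)
    # best[i][j] = best cumulative score aligning pred[:i] with gold[:j]
    best = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = step_score(pred_steps[i - 1], gold_steps[j - 1])
            best[i][j] = match + max(
                best[i - 1][j - 1],  # one-to-one match
                best[i - 1][j],      # several predicted steps cover one gold step
                best[i][j - 1],      # one predicted step covers several gold steps
            )
    return best[n][m]
```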
The quality of the alignment is driven by a gated score calculated for each pair of steps. To be considered a strong match, a generated step must satisfy two conditions simultaneously: its textual explanation must be semantically similar to the gold step (measured via cosine similarity), and its intermediate numerical result must match the correct value (allowing a 5% relative tolerance to account for rounding errors). This "gating" mechanism is strict: a step with a perfect explanation but a flawed calculation receives a score of zero for that alignment. The final CHAINEVAL score is a normalized value representing the quality of the best alignment path, offering a single, holistic measure of reasoning coherence from start to finish. This step-by-step verification provides a much deeper insight into a model's true reasoning capabilities than simply checking whether the final answer is correct.
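The gating can be sketched as follows. The embedding function, similarity threshold, and step representation (an explanation paired with its numeric value) are illustrative assumptions rather than the paper's exact choices.

```python
# Hypothetical gated step score: a predicted step only receives credit when its
# explanation is semantically close to the gold step AND its intermediate value
# matches within a 5% relative tolerance.
import math

def gated_step_score(pred, gold, embed, sim_threshold=0.7, rel_tol=0.05):
    """pred/gold are (explanation_text, numeric_value) pairs; `embed` maps
    text to a vector, e.g. a sentence-embedding model (assumed here)."""
    pred_text, pred_value = pred
    gold_text, gold_value = gold

    # Semantic similarity between explanations (cosine similarity).
    u, v = embed(pred_text), embed(gold_text)
    cos = sum(a * b for a, b in zip(u, v)) / (
        math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    )

    # Numeric agreement within the relative tolerance.
    numeric_ok = math.isclose(pred_value, gold_value, rel_tol=rel_tol)

    # Gate: both conditions must hold, otherwise the step scores zero.
    return cos if (cos >= sim_threshold and numeric_ok) else 0.0
```

Under these assumptions, a ChainEval-style score could then be obtained by normalizing the optimal cumulative alignment score, for example `dtw_align(pred_steps, gold_steps, lambda p, g: gated_step_score(p, g, embed)) / len(gold_steps)`, so that a fully faithful trace approaches 1 and unsupported or miscalculated steps pull the score toward 0.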
Our experiments on 26 LLMs reveal a clear performance hierarchy. Frontier proprietary models like Gemini 2.5 Pro, Claude Sonnet 4.5, and Claude Sonnet 4 consistently achieve the highest accuracy, demonstrating superior and more balanced reasoning across all financial domains and difficulty levels. However, fine-tuning proves to be a highly effective strategy for smaller, open-source models. Domain-specific models like Fin-R1 and math-enhanced models like Mathstral significantly narrow the performance gap, sometimes outperforming larger systems in their specialized areas. Despite these gains, all models struggle as task complexity increases, with the performance of fine-tuned models degrading more sharply than their proprietary counterparts, highlighting that a significant challenge remains in achieving robust, multi-step symbolic reasoning.
Table 1: Zero-shot performance across financial, mathematical, and general reasoning benchmarks.
Table 2: Performance of various LLMs over 200 random samples using ChainEval and ROUGE R₂.
Figure 3: CHAINEVAL score across difficulty levels.
The application provides an interactive interface to explore the FINCHAIN benchmark. Users can generate unique question-answer pairs from the library of predefined templates. Any generated question can then be passed to a language model to test its reasoning capabilities, allowing for a direct, side-by-side comparison of the model's generated solution against the original, verifiable solution from the benchmark. You can explore the interactive application here.
Figure 4: A preview of the interactive FINCHAIN application.
@article{xie2025finchain,
title={FINCHAIN: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning},
author={Xie, Zhuohan and Orel, Daniil and Thareja, Rushil and Sahnan, Dhruv and Madmoun, Hachem and Zhang, Fan and Banerjee, Debopriyo and Georgiev, Georgi and Peng, Xueqing and Qian, Lingfei and others},
journal={arXiv preprint},
year={2025}
}