CLEF 2026 - Call for Participation

FinMMEval Lab 2026

Multilingual and Multimodal Evaluation of Financial AI Systems

About the Lab

FinMMEval Lab integrates financial reasoning, multilingual understanding, and decision-making into a unified evaluation suite designed to promote robust, transparent, and globally competent financial AI. The 2026 edition introduces three interconnected tasks spanning multiple languages.

  • Multi-modal inputs: news, filings, macro indicators, and exam questions.
  • Multiple languages, including low-resource ones:
    English, Chinese, Arabic, Hindi, Greek, Japanese, Spanish.
  • Tasks spanning exam Q&A, multilingual Q&A, and decision making.
  • Metrics centered on Accuracy, ROUGE-1, BLEURT,
    and quantitative trading-performance metrics (e.g., CR, SR, MD).

Tasks

Choose one or more tasks. Each submission must provide calibrated confidence scores and an evidence trace.
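
For illustration, a minimal sketch of one results line, assuming a simple JSONL layout; the field names below are hypothetical, as the call does not prescribe an exact schema:

```python
import json

# Hypothetical record layout: the lab requires calibrated confidence
# scores and an evidence trace, but these field names are illustrative,
# not an official schema.
record = {
    "task": "task1",
    "id": "cfa-0042",        # instance identifier (assumed)
    "prediction": "A3",      # predicted answer label
    "confidence": 0.87,      # calibrated probability in [0, 1]
    "evidence": ["Quantitative Methods: time value of money"],
}

with open("task1_results.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```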

Task 1 - Financial Exam Q&A

Given a stand-alone multiple-choice question Q with four candidate options { A1, A2, A3, A4 }, the system must select the correct answer A. Questions cover valuation, accounting, ethics, corporate finance, and regulatory knowledge. The focus is on conceptual understanding and precise financial reasoning rather than surface pattern recognition.
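
For concreteness, a minimal sketch of one exam item under an assumed key-value layout (illustrative field names, not the official format):

```python
# Illustrative Task 1 instance; field names are assumptions, not the
# official data schema. One question, four options, one gold label.
question = {
    "id": "cfa-0042",
    "question": "Which measure is most appropriate for comparing the "
                "risk-adjusted performance of two well-diversified portfolios?",
    "options": {
        "A1": "Sharpe ratio",
        "A2": "Value at Risk",
        "A3": "Current ratio",
        "A4": "Beta",
    },
    "answer": "A1",  # systems output the label, e.g. "A1"
}
```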

  • Motivation: Professional financial qualification exams (e.g., CFA, EFPA) require the integration of theoretical and regulatory knowledge with applied reasoning. Existing LLMs often rely on factual recall without demonstrating the analytical rigor expected from human candidates. This task evaluates whether models can achieve domain-level understanding and reasoning consistency across multilingual financial contexts.
  • Data: Combination of existing multilingual financial exam datasets with newly collected materials:
    EFPA (Spanish):
    50 exam-style financial questions on investment and regulation.
    GRFinQA (Greek):
    225 multiple-choice finance questions from university-level exams.
    CFA (English):
    600 exam-style multiple-choice questions covering nine core domains - Ethical and Professional Standards, Quantitative Methods, Economics, Financial Reporting and Analysis, Corporate Finance, Equity Investments, Fixed Income, Derivatives, and Portfolio Management.
    CPA (Chinese):
    300 exam-style financial questions focusing on major modules - Accounting, Auditing, Financial Management, Taxation, Economic Law, and Strategy.
    BBF (Hindi):
    500-1000 exam-style financial multiple-choice questions covering over 30 domains of the Indian financial landscape. The questions are drawn from about 25 financial and institutional exams across India and span areas such as problem solving, mathematics for finance, and governance.
    All questions were reviewed by financial professionals to ensure correctness and conceptual balance.
  • Evaluation: Models are required to output the correct answer label. Performance is measured by accuracy, defined as the proportion of questions answered correctly on the test set.
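
A minimal scoring sketch, assuming predictions and gold labels are keyed by question ID; the official scorer may differ in I/O details:

```python
def accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Share of questions whose predicted label matches the gold label."""
    correct = sum(predictions.get(qid) == label for qid, label in gold.items())
    return correct / len(gold)

# accuracy({"cfa-0042": "A1", "grfinqa-0007": "A2"},
#          {"cfa-0042": "A1", "grfinqa-0007": "A4"})  ->  0.5
```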

Task 2 - Multilingual Financial Q&A

Given a financial report R (English 10-K/10-Q excerpts) and multilingual news articles N = { N_en, N_zh, N_ja, N_es, N_el } related to the same company's filings, the model receives a question Q and must generate an answer A supported by multilingual textual evidence.
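
For concreteness, a sketch of how one instance might be structured, mirroring R, N, Q, and A from the definition above; the layout is an assumption, not the released format:

```python
# Hypothetical Task 2 instance layout, mirroring R, N, Q, and A from the
# task definition; not the official schema.
instance = {
    "id": "polyfiqa-easy-0001",
    "report": "Excerpt from the company's 10-K/10-Q filing ...",   # R
    "news": {                                                      # N
        "en": ["English coverage ..."],
        "zh": ["Chinese coverage ..."],
        "ja": ["Japanese coverage ..."],
        "es": ["Spanish coverage ..."],
        "el": ["Greek coverage ..."],
    },
    "question": "How did quarterly revenue change year over year?",  # Q
    "answer": "Reference answer written by financial experts ...",   # A
}
```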

Two difficulty tiers are included:

PolyFiQA-Easy:
Factual or numerical trend questions (e.g., revenue growth, cash flow irregularities);
PolyFiQA-Expert:
Complex analytical questions requiring multi-document reasoning (e.g., investment strategies, capital allocation).
  • Motivation: Global finance increasingly demands multilingual reasoning across documents such as English financial filings and foreign-language news. However, most existing benchmarks are monolingual. This task introduces the first cross-lingual financial reasoning benchmark, testing a model’s ability to integrate multilingual textual signals and perform analytical reasoning grounded in authentic financial data.
  • Data: The PolyFiQA-Easy and PolyFiQA-Expert datasets, which pair U.S. SEC filings with multilingual news (English, Chinese, Japanese, Spanish, Greek). Each tier includes 172 Q&A instances (344 total). All questions and answers were manually written and validated by financial experts, with inter-annotator agreement above 89%. The dataset is released under the MIT License.
  • Evaluation: Models must produce concise, evidence-grounded textual answers (≤100 words). Performance is evaluated using ROUGE-1 as the primary metric, measuring unigram overlap with expert references. Optional metrics such as BLEURT and factual consistency may be used for secondary analysis.
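
As a rough guide to the primary metric, a simplified ROUGE-1 F1 computation with plain whitespace tokenization; the official scorer will likely use a standard package whose tokenization and stemming rules differ:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference answer and a system answer.
    Simplified whitespace tokenization; a sketch, not the official scorer."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```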

Task 3 - Financial Decision Making

Given a market context C = { P_t, N_t }, where P_t represents the historical price series and N_t denotes contemporaneous textual information (e.g., news, reports), models are required to predict one of three discrete actions: Buy, Hold, or Sell. Each prediction must be accompanied by a concise textual rationale (≤50 words) that explicitly cites the supporting evidence or reasoning process.
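
For illustration, one possible prediction record under an assumed JSON layout (hypothetical field names):

```python
# Illustrative Task 3 prediction: a discrete action plus a <=50-word
# rationale citing the evidence used. Field names are assumptions.
prediction = {
    "date": "2025-09-30",
    "asset": "TSLA",
    "action": "Hold",
    "confidence": 0.61,
    "rationale": "Mixed signals: bullish delivery news is offset by risk "
                 "factors flagged in the 2025-09-29 10-Q filings; momentum "
                 "is neutral, so no position change.",
}
```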

  • Motivation: Real-world investment decisions demand the integration of heterogeneous information sources—such as textual news, market price dynamics, and current portfolio positions—to form actionable insights. Unlike static question-answering tasks, this task centers on reasoning-to-action: evaluating how models synthesize complex, time-varying market contexts to generate trading strategies that are both evidence-grounded and risk-aware.
  • Data: Combination of two curated JSON datasets that provide daily market contexts for BTC and TSLA. Each context is indexed by ISO date (YYYY-MM-DD) and aggregates heterogeneous signals for that trading day:
    Assets and coverage:
    a) BTC: 2025-07-21 to 2025-10-09, 81 daily contexts (aggregated daily snapshots despite 24/7 trading).
    b) TSLA: 2024-07-23 to 2025-10-09, 444 daily contexts.
    Fields per date d:
    a) prices (float): a single representative USD value per day (not OHLC and not intra-day series).
    b) news (array[string]): one or more long-form textual syntheses summarizing that day’s market/newsflow.
    c) momentum (categorical): { bullish, neutral, bearish }; a daily market-momentum label manually annotated based on the accompanying news (not a computed technical indicator).
    d) future_price_diff (float | null): one-day-ahead price change P_{d+1} - P_d; null on the last available date.
    e) 10k, 10q (array): fundamental filings associated with the asset. For BTC, all entries are empty arrays. For TSLA, most dates are empty, but 2025-09-29 includes one 10-K and two 10-Q entries.
    Format:
    UTF-8 JSON; keys are calendar dates (no intra-day timestamps).
    This dataset provides the daily market context C = { P_t, N_t } for decision-making; models produce actions (Buy/Hold/Sell), while future_price_diff serves as the supervisory signal for next-day PnL evaluation in a continuous live-trading setup.
  • Evaluation: Model performance is jointly assessed for profitability, stability, and risk control using established quantitative metrics:
    Primary:
    Cumulative Return (CR).
    Secondary:
    Sharpe Ratio (SR), Maximum Drawdown (MD), Daily Volatility (DV), and Annualized Volatility (AV).
    Together, these indicators capture the model's ability to balance reward and risk, maintain behavioral consistency, and adapt to varying market regimes in a dynamic trading environment.
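
To make the evaluation concrete, here is a minimal backtest sketch under stated assumptions: Buy maps to a long position (+1), Sell to short (-1), Hold to flat (0), with one unit traded per day, a zero risk-free rate, and next-day returns derived from future_price_diff. The official harness may size positions, account for costs, and annualize differently:

```python
import json
import math

def evaluate(context_path: str, actions: dict[str, str]) -> dict[str, float]:
    """Backtest sketch: Buy -> +1 (long), Sell -> -1 (short), Hold -> 0 (flat),
    one unit per day; next-day PnL comes from future_price_diff."""
    with open(context_path, encoding="utf-8") as f:
        contexts = json.load(f)            # keys are "YYYY-MM-DD" strings

    position = {"Buy": 1, "Sell": -1, "Hold": 0}
    returns = []
    for date in sorted(contexts):
        diff = contexts[date].get("future_price_diff")
        if diff is None:                   # last date: no next-day price
            continue
        price = contexts[date]["prices"]   # single representative USD value
        returns.append(position[actions.get(date, "Hold")] * diff / price)
    if not returns:
        raise ValueError("no scorable dates in the context file")

    # Cumulative Return (CR), compounded over the evaluation window.
    cr = math.prod(1 + r for r in returns) - 1
    # Daily Volatility (DV) and Sharpe Ratio (SR, zero risk-free rate assumed).
    mean = sum(returns) / len(returns)
    dv = math.sqrt(sum((r - mean) ** 2 for r in returns) / len(returns))
    sr = mean / dv if dv else 0.0
    # Maximum Drawdown (MD): largest peak-to-trough decline in equity.
    equity, peak, md = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1 + r
        peak = max(peak, equity)
        md = max(md, (peak - equity) / peak)
    return {"CR": cr, "SR": sr, "MD": md, "DV": dv}
```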

Timeline (Tentative)

  1. Task Definitions Release
    Fall 2025
  2. Registration Opens
    January 2026
  3. Submission Deadline (Experiments)
    May 2026
  4. Submission Deadline (Reports)
    June 2026
  5. Camera-Ready Submission
    July 2026
  6. CLEF 2026 Workshop
    September 2026

How to Participate

Engage with the challenges in a way that suits you - from a quick, one-time experiment to a detailed research project. While we invite you to share your findings in the CLEF working notes, you are also free to develop promising results into a full paper for an archival journal. The workshop itself is a perfect opportunity to refine your ideas through discussion with peers.

Ready to join? Sign up now by emailing the FinMMEval organizers.

Packaging Checklist

  • Results JSONL (per task)
  • System Card (architecture, data usage, risks)
  • Reproducibility (seed, versions, hardware)
  • License compliance acknowledgements (if applicable)

Organizers

We thank our supporters from academia and industry.

MBZUAI
University of Tokyo
University of Sofia
The Fin AI
University of Arizona
INSAIT

FAQ

Who can participate?

Researchers and practitioners from academia and industry. Student teams are particularly welcome.

How is data licensed?

Research-only license; redistribution of raw sources may be restricted.

Can we submit to multiple tasks?

Yes. Submit independent result bundles per task.

Are ensembles allowed?

Yes, but disclose all components in the system card.