SAHM: A Benchmark for Arabic Financial and Shari’ah-Compliant Reasoning

Abstract

We present SAHM, the first large-scale benchmark and instruction-tuning dataset for Arabic financial NLP. Despite advances in English benchmarks such as FiQA, FinQA, and FinChain, there is no comprehensive resource for evaluating large language models (LLMs) on Arabic financial and Shari’ah-compliant reasoning. SAHM fills this gap through seven expert-verified tasks covering Islamic Finance Shari’ah Standards QA, Fatwa QA and MCQ, Business and Accounting MCQ, Financial Report Sentiment Analysis, Extractive Summarization, and Event–Cause Reasoning QA. All data are derived from authentic Arabic regulatory, juristic, and corporate sources and curated via a hybrid LLM–human pipeline ensuring linguistic fidelity and legal–financial accuracy. We benchmark 13 proprietary, open-source, and Arabic-centric LLMs, showing that proprietary models (e.g., GPT-5) outperform others in open-ended reasoning, while Arabic-tuned models achieve higher accuracy on domain-specific MCQs but struggle with causal and doctrinal reasoning. By combining financial, legal, and linguistic perspectives, SAHM provides the first holistic evaluation suite for Arabic financial NLP, establishing a foundation for reproducible, culturally grounded research and future instruction-tuned model development.

Figure 2: Examples of the diverse tasks included in SAHM, covering juristic QA, business and accounting MCQs, financial sentiment analysis, report summarization, and event–cause reasoning.

Figure 3: Pipeline for constructing the Islamic Finance Shari’ah Standards QA dataset. A hybrid LLM–human pipeline converts AAOIFI standards into QA pairs through OCR and generation stages, each followed by expert verification to ensure linguistic accuracy and legal fidelity.
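
The staged structure described in the Figure 3 caption (an OCR stage and a generation stage, each followed by a verification gate) can be sketched roughly as below. This is only an illustration of the control flow, not the authors' implementation: `Stage`, `run_pipeline`, `fake_ocr`, `fake_generate`, and `non_empty` are hypothetical names, and the stand-in functions replace the real OCR system, LLM, and human experts.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stage:
    """One pipeline step plus the check that gates its output (hypothetical)."""
    name: str
    run: Callable[[str], str]      # e.g. OCR or LLM generation
    verify: Callable[[str], bool]  # stands in for expert verification


def run_pipeline(source: str, stages: List[Stage]) -> Tuple[Optional[str], List[str]]:
    """Run stages in order; stop at the first output that fails verification."""
    log: List[str] = []
    current = source
    for stage in stages:
        current = stage.run(current)
        if not stage.verify(current):
            log.append(f"{stage.name}: rejected")
            return None, log
        log.append(f"{stage.name}: approved")
    return current, log


# Stand-ins only: a real pipeline would call an OCR engine, an LLM,
# and route each output to a human reviewer.
def fake_ocr(page: str) -> str:
    return page.strip()


def fake_generate(text: str) -> str:
    return f"Q: What does the standard state?\nA: {text}"


def non_empty(text: str) -> bool:
    return bool(text.strip())


STAGES = [
    Stage("ocr", fake_ocr, non_empty),
    Stage("generation", fake_generate, non_empty),
]
```

Calling `run_pipeline(page_text, STAGES)` returns the final QA text together with a per-stage approval log, or `None` and the log up to the rejecting stage. Placing a gate after every stage keeps an OCR error from silently propagating into the generated QA pairs.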

Tasks Covered

Islamic Finance Shari’ah Standards QA
Islamic Financial Fatwa QA
Islamic Financial Fatwa MCQ
Business MCQ
Accounting MCQ
Financial Report Sentiment Analysis (MCQ)
Report Extractive Summarization
Event–Cause Reasoning QA

Highlights

First comprehensive Arabic financial NLP benchmark unifying modern finance with Shari’ah-compliant reasoning.
Native-speaker–verified datasets with normalization scripts and standardized evaluation protocols.
Benchmarks 13 LLMs and reveals gaps in causal inference, hybrid legal–financial reasoning, and doctrinal fidelity.
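
To make the normalization and MCQ scoring mentioned above more concrete, here is a minimal sketch of common Arabic text normalization followed by exact-match accuracy. It is not the SAHM repository's script: the function names `normalize_arabic` and `mcq_accuracy` and the specific normalization rules (removing diacritics and tatweel, unifying alef and alef maqsura variants) are assumptions chosen for illustration.

```python
import re
import unicodedata
from typing import Sequence

# Tashkeel (U+064B to U+0652), superscript alef (U+0670), and tatweel (U+0640).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# Alef with hamza above, hamza below, and madda.
_ALEF_VARIANTS = re.compile(r"[\u0623\u0625\u0622]")


def normalize_arabic(text: str) -> str:
    """Apply a few widely used Arabic normalizations (assumed rule set)."""
    text = unicodedata.normalize("NFKC", text)
    text = _DIACRITICS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)  # map to bare alef
    text = text.replace("\u0649", "\u064A")    # alef maqsura -> yeh
    return re.sub(r"\s+", " ", text).strip()


def mcq_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Exact-match accuracy over normalized option labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(
        normalize_arabic(p) == normalize_arabic(g)
        for p, g in zip(predictions, gold)
    )
    return correct / len(gold)
```

Normalizing both sides before comparison means a prediction is not marked wrong merely because it uses a different alef form or carries diacritics; whether that is appropriate depends on the task, since some normalizations can merge distinct words.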

Paper & Materials

arXiv · Poster (soon) · Slides (soon)

Datasets & Models

Code & evaluation: github.com/mbzuai-nlp/SAHM
Models & Datasets: huggingface.co/SahmBenchmark

Citation

If you find this work useful, please cite:

@misc{elbadry2026sahmbenchmarkarabicfinancial,
      title={SAHM: A Benchmark for Arabic Financial and Shari’ah-Compliant Reasoning},
      author={Rania Elbadry and Sarfraz Ahmad and Ahmed Heakl and Dani Bouch and Momina Ahsan and Muhra AlMahri and Marwa Elsaid khalil and Yuxia Wang and Salem Lahlou and Sophia Ananiadou and Veselin Stoyanov and Jimin Huang and Xueqing Peng and Preslav Nakov and Zhuohan Xie},
      year={2026},
      eprint={2604.19098},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.19098},
}