SAHM: A Benchmark for Arabic Financial and Shari’ah-Compliant Reasoning

Holistic Arabic financial NLP benchmark and instruction-tuning resources

Authors:

Rania Elbadry1, Sarfraz Ahmad1, Ahmed Heakl1, Dani Bouch1, Momina Ahsan1, Muhra AlMahri1, Marwa Elsaid Khalil1, Mohamed Anwar1, Yuxia Wang2, Salem Lahlou1, Sophia Ananiadou3, Veselin Stoyanov1, Jimin Huang4, Xueqing Peng4, Preslav Nakov1, Zhuohan Xie1

Affiliations:

1 MBZUAI, 2 INSAIT, 3 The University of Manchester, 4 The Fin AI

Contact: rania.elbadry@mbzuai.ac.ae


Abstract

We present SAHM, the first large-scale benchmark and instruction-tuning dataset for Arabic financial NLP. Despite advances in English benchmarks such as FiQA, FinQA, and FinChain, there is no comprehensive resource for evaluating large language models (LLMs) on Arabic financial and Shari’ah-compliant reasoning. SAHM fills this gap through seven expert-verified tasks covering Islamic Finance Shari’ah Standards QA, Fatwa QA and MCQ, Business and Accounting MCQ, Financial Report Sentiment Analysis, Extractive Summarization, and Event–Cause Reasoning QA. All data are derived from authentic Arabic regulatory, juristic, and corporate sources and curated via a hybrid LLM–human pipeline ensuring linguistic fidelity and legal–financial accuracy. We benchmark 13 proprietary, open-source, and Arabic-centric LLMs, showing that proprietary models (e.g., GPT-5) outperform others in open-ended reasoning, while Arabic-tuned models achieve higher accuracy on domain-specific MCQs but struggle with causal and doctrinal reasoning. By combining financial, legal, and linguistic perspectives, SAHM provides the first holistic evaluation suite for Arabic financial NLP, establishing a foundation for reproducible, culturally grounded research and future instruction-tuned model development.

SAHM illustrative examples

Figure 2: Examples of the diverse tasks included in SAHM, covering juristic Q&A, business and accounting MCQs, financial sentiment analysis, report summarization, and event causal reasoning.

SAHM data pipeline

Figure 3: Pipeline for constructing the Islamic Finance Shari’ah Standards QA dataset. A hybrid LLM–human pipeline converts AAOIFI standards into QA pairs through OCR and generation stages, each followed by expert verification to ensure linguistic accuracy and legal fidelity.
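
The two-stage structure described in the Figure 3 caption (OCR, then QA generation, each gated by expert review) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `QAPair`, `build_qa_pairs`, `run_ocr`, `verify_text`, `generate_qa`, and `verify_qa` are all hypothetical, and the OCR engine, the generating LLM, and the human reviewers are passed in as plain callables, since the page does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class QAPair:
    """One question-answer pair tied to the source page it came from."""
    question: str
    answer: str
    source_id: str


def build_qa_pairs(
    pages: Iterable[Tuple[str, bytes]],
    run_ocr: Callable[[bytes], str],
    verify_text: Callable[[str], Optional[str]],
    generate_qa: Callable[[str], List[Tuple[str, str]]],
    verify_qa: Callable[[QAPair], bool],
) -> List[QAPair]:
    """Run OCR and QA generation, with a review gate after each stage.

    All callables are hypothetical stand-ins:
      run_ocr      -- OCR engine: page image bytes -> raw text
      verify_text  -- expert review: corrected text, or None to reject the page
      generate_qa  -- LLM generation: text -> list of (question, answer)
      verify_qa    -- expert review: True to keep the pair
    """
    accepted: List[QAPair] = []
    for source_id, image in pages:
        raw_text = run_ocr(image)
        clean_text = verify_text(raw_text)  # first verification gate
        if clean_text is None:
            continue  # page rejected, nothing is generated from it
        for question, answer in generate_qa(clean_text):
            pair = QAPair(question, answer, source_id)
            if verify_qa(pair):  # second verification gate
                accepted.append(pair)
    return accepted
```

Passing the reviewers in as callables keeps the sketch independent of any particular annotation tool. In practice each gate would be a human decision recorded elsewhere rather than a function call.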

Tasks Covered

Highlights

Paper & Materials

Datasets & Models

Citation

If you find this work useful, please cite:

@misc{elbadry2026sahmbenchmarkarabicfinancial,
  title={SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning},
  author={Rania Elbadry and Sarfraz Ahmad and Ahmed Heakl and Dani Bouch and Momina Ahsan and Muhra AlMahri and Marwa Elsaid khalil and Yuxia Wang and Salem Lahlou and Sophia Ananiadou and Veselin Stoyanov and Jimin Huang and Xueqing Peng and Preslav Nakov and Zhuohan Xie},
  year={2026},
  eprint={2604.19098},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.19098},
}