Holistic Arabic financial NLP benchmark and instruction-tuning resources
Authors:
Rania Elbadry1, Sarfraz Ahmad1*, Dani Bouch1*, Momina Ahsan1, Xueqing Peng1, Jimin Huang1, Muhra AlMahri1, Marwa Elsaid Khalil1, Yuxia Wang1, Salem Lahlou1, Veselin Stoyanov1, Sophia Ananiadou1, Preslav Nakov1, Zhuohan Xie1
*Equal contribution.
Affiliations:
1 Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Contact: rania.elbadry@mbzuai.ac.ae
Figure 1: Overview of the SAHM benchmark. The diagram illustrates the seven core Arabic financial NLP tasks and their thematic coverage across Islamic finance, accounting, and business domains.
We present SAHM, the first large-scale benchmark and instruction-tuning dataset for Arabic financial NLP. Despite advances in English financial benchmarks such as FiQA, FinQA, and FinChain, no comprehensive resource exists for evaluating large language models (LLMs) on Arabic financial and Shari’ah-compliant reasoning. SAHM fills this gap with seven expert-verified tasks: Islamic Finance Shari’ah Standards QA, Fatwa QA, Fatwa MCQ, Business and Accounting MCQ, Financial Report Sentiment Analysis, Extractive Summarization, and Event–Cause Reasoning QA. All data are derived from authentic Arabic regulatory, juristic, and corporate sources and curated through a hybrid LLM–human pipeline that ensures linguistic fidelity and legal–financial accuracy. We benchmark 13 proprietary, open-source, and Arabic-centric LLMs, showing that proprietary models (e.g., GPT-5) lead on open-ended reasoning, while Arabic-tuned models achieve higher accuracy on domain-specific MCQs but struggle with causal and doctrinal reasoning. By combining financial, legal, and linguistic perspectives, SAHM provides the first holistic evaluation suite for Arabic financial NLP, laying a foundation for reproducible, culturally grounded research and future instruction-tuned model development.
Figure 2: Illustrative examples — juristic QA, fatwa QA/MCQ, business & accounting MCQs, financial sentiment, extractive summarization, and event–cause QA.
Figure 3: Data pipeline: AAOIFI book → OCR → text files → LLM-drafted QA → expert verification and legal compliance.
If you find this work useful, please cite:
@inproceedings{sahm2025,
  title     = {SAHM: Arabic Financial Instruction-Tuning Dataset and Models},
  author    = {Elbadry, Rania and Ahmad, Sarfraz and Bouch, Dani and Ahsan, Momina and Peng, Xueqing and Huang, Jimin and AlMahri, Muhra and Khalil, Marwa Elsaid and Wang, Yuxia and Lahlou, Salem and Stoyanov, Veselin and Ananiadou, Sophia and Nakov, Preslav and Xie, Zhuohan},
  booktitle = {},
  year      = {2025},
  url       = {https://github.com/mbzuai-nlp/SAHM}
}