SAHM Leaderboard


Recreated from the paper for quick comparison. Scores are as reported.

Unified Leaderboard (MCQ Accuracy % and Open-Ended QA Score 0–10)

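MCQ columns (Accounting through MCQ Mean) report accuracy in %, higher is better. Open-ended QA columns (Event-Cause QA through QA Mean) report scores on a 0–10 scale, higher is better.

| Model | Accounting | Business | Fatwā | Sentiment | MCQ Mean | Event-Cause QA | Islamic-Standards-QA | Fatwa-QA | QA Mean |
|---|---|---|---|---|---|---|---|---|---|
| Open-source Models: ≥ 70B Parameters | | | | | | | | | |
| Qwen2.5-72B-Instruct | 65.87 | 74.86 | 84.65 | 75.00 | 75.10 | 8.1000 | 5.6330 | 5.3912 | 6.3747 |
| LLaMA-3.1-70B | 52.10 | 77.60 | 84.90 | 80.00 | 73.65 | 6.623 | 3.7245 | 4.7607 | 5.036 |
| Open-source Models: < 70B Parameters | | | | | | | | | |
| Qwen2.5-14B-Instruct | 49.10 | 63.39 | 76.05 | 57.50 | 61.51 | 7.4975 | 4.8806 | 4.0576 | 5.4786 |
| Qwen2.5-7B-Instruct | 48.50 | 59.56 | 70.00 | 55.00 | 58.27 | 6.1038 | 3.4039 | 2.6815 | 4.0631 |
| Gemma-2-9B-IT | 49.10 | 63.39 | 66.60 | 55.00 | 58.52 | 7.1438 | 4.2306 | 3.4266 | 4.9336 |
| Gemma-3-27B-IT | 53.89 | 73.22 | 80.65 | 80.00 | 71.94 | 8.7188 | 6.1708 | 5.1929 | 6.6942 |
| Gemma-3-4B-IT | 38.32 | 67.76 | 61.35 | 75.00 | 60.61 | 7.4075 | 2.8985 | 2.4767 | 4.2609 |
| LLaMA-3.1-8B | 41.92 | 60.66 | 64.05 | 73.75 | 60.60 | 4.9231 | 2.5168 | 1.4025 | 2.9475 |
| Mixtral-8x7B-Instruct | 32.93 | 60.66 | 62.15 | 70.00 | 56.44 | 4.5538 | 2.4980 | 1.7896 | 2.9471 |
| Proprietary Models: Reasoning-Enhanced | | | | | | | | | |
| GPT-5 | 65.27 | 72.68 | 90.75 | 78.75 | 76.86 | 9.6831 | 8.7965 | 8.0515 | 8.8437 |
| GPT-4o | 60.48 | 78.14 | 87.70 | 77.50 | 75.96 | 8.3125 | 6.6598 | 6.5219 | 7.1647 |
| Proprietary Models: General-Purpose | | | | | | | | | |
| Claude-Opus-4.5 | 77.84 | 76.50 | 91.75 | 75.00 | 80.27 | 9.6818 | 8.0438 | 8.80906 | 8.8449 |
| Claude-Sonnet-4.5 | 78.44 | 76.50 | 88.15 | 77.50 | 80.15 | 9.3388 | 8.2588 | 7.6049 | 8.4008 |
| Claude-Haiku-4.5 | 67.66 | 73.77 | 84.90 | 77.50 | 75.96 | 9.1050 | 7.0002 | 6.5341 | 7.5464 |
| Gemini-3-Flash (preview) | 76.05 | 74.86 | 89.90 | 81.25 | 80.52 | 9.8369 | 9.1649 | 9.1571 | 9.0798 |
| GPT-4o-mini | 58.08 | 77.60 | 81.75 | 75.00 | 73.61 | 7.9613 | 5.6094 | 5.3087 | 6.2931 |
| Arabic Models | | | | | | | | | |
| ALLAM-7B | 44.91 | 68.31 | 74.40 | 58.75 | 61.59 | 6.8875 | 4.9364 | 4.2185 | 5.3475 |
| Fanar-1-9B | 47.31 | 66.12 | 74.45 | 58.75 | 61.66 | 7.5850 | 4.9607 | 4.4600 | 5.6686 |
| SILMA-9B | 50.90 | 69.40 | 62.55 | 30.00 | 53.21 | 1.8969 | 3.3547 | 2.0711 | 2.4409 |
| Jais-2-8B | 35.33 | 60.30 | 66.10 | 46.25 | 52.00 | 4.6922 | 4.245 | 2.5147 | 4.88 |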

Table 1: Unified leaderboard comparing MCQ tasks (Accuracy %) and open-ended QA tasks (Score 0–10). Open-ended QA scores are averaged over Event-Cause QA, Islamic-Standards-QA, and Fatwa-QA.
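
The Mean columns can be checked against the per-dataset scores. The following is a minimal sketch, assuming each Mean is the unweighted arithmetic mean of the datasets in its group (the caption says so for open-ended QA; for MCQ this is an assumption). The helper names `mean` and `matches_reported` are illustrative and not part of the paper.

```python
def mean(scores):
    """Unweighted arithmetic mean of a list of per-dataset scores."""
    return sum(scores) / len(scores)


def matches_reported(scores, reported, tol):
    """True if the recomputed mean agrees with the reported value within tol."""
    return abs(mean(scores) - reported) <= tol


# Qwen2.5-72B-Instruct row from Table 1.
mcq = [65.87, 74.86, 84.65, 75.00]      # Accounting, Business, Fatwa, Sentiment
open_ended = [8.1000, 5.6330, 5.3912]   # Event-Cause QA, Islamic-Standards-QA, Fatwa-QA

# Tolerances are assumptions chosen to absorb rounding in the reported values.
mcq_ok = matches_reported(mcq, reported=75.10, tol=0.01)
qa_ok = matches_reported(open_ended, reported=6.3747, tol=0.001)
print(mcq_ok, qa_ok)
```

Running the same check over every row is a quick way to spot transcription slips when copying scores from the paper.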

Extractive Summarization on Arabic Financial Reports (ROUGE F1 %)

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Proprietary Models – Reasoning-Enhanced | | | |
| Claude-Opus-4.5 | 78.22 | 63.17 | 64.14 |
| GPT-5 | 41.19 | 63.70 | 64.11 |
| Claude-Sonnet-4.5 | 79.86 | 64.98 | 65.13 |
| Proprietary Models – General-Purpose | | | |
| Claude-Haiku-4.5 | 79.39 | 61.40 | 63.62 |
| GPT-4o-mini | 77.79 | 62.90 | 64.08 |
| GPT-4o | 78.91 | 63.16 | 63.71 |
| Gemini-3-Flash | 49.36 | 35.83 | 43.02 |
| Gemini-2.5-Flash | 39.46 | 27.17 | 36.81 |
| Open-source Models: ≥ 70B Parameters | | | |
| Gemma-3-27B-IT | 79.25 | 63.57 | 63.42 |
| Qwen2.5-72B-Instruct | 40.52 | 29.50 | 34.04 |
| Meta-LLaMA-3.1-70B | 39.64 | 31.40 | 32.65 |
| Open-source Models: < 70B Parameters | | | |
| Qwen2.5-14B-Instruct | 44.42 | 30.90 | 35.82 |
| Gemma-3-4B-IT | 76.52 | 62.06 | 60.93 |
| Meta-LLaMA-3.1-8B | 66.67 | 47.92 | 56.10 |
| Mixtral-8x7B-Instruct | 32.71 | 13.07 | 23.78 |
| Qwen2.5-7B-Instruct | 25.15 | 12.01 | 21.86 |
| Arabic Models | | | |
| Jais-2-8B | 73.68 | 56.54 | 61.17 |
| Fanar-1-9B-Instruct | 60.51 | 35.97 | 46.96 |
| ALLaM-7B-Instruct | 35.97 | 22.61 | 28.24 |
| SILMA-9B-Instruct | 27.92 | 16.66 | 25.99 |

Table 2: Extractive summarization performance on Arabic financial reports evaluated using ROUGE F1 (%).
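
For readers unfamiliar with the metric, ROUGE-N F1 is the harmonic mean of n-gram precision and recall between a candidate summary and a reference. The sketch below illustrates the idea only: it assumes plain whitespace tokenization and a single reference, and it is not the scoring pipeline used in the paper, which may apply Arabic-specific normalization or a standard ROUGE package. The function name `rouge_n_f1` and the example sentences are made up for illustration. ROUGE-L, which is based on the longest common subsequence, is not covered here.

```python
from collections import Counter


def rouge_n_f1(candidate, reference, n=1):
    """ROUGE-N F1 in percent, using whitespace tokens (an assumption)."""

    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 100 * 2 * precision * recall / (precision + recall)


# Hypothetical example sentences, not taken from the benchmark.
candidate = "the company reported higher revenue"
reference = "the company reported revenue growth"

r1 = rouge_n_f1(candidate, reference, n=1)  # 4 of 5 unigrams overlap
r2 = rouge_n_f1(candidate, reference, n=2)  # 2 of 4 bigrams overlap
print(round(r1, 2), round(r2, 2))
```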

Domain Adaptation Across Arabic LLMs

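MCQ columns (Accounting through MCQ Mean) report accuracy in %, higher is better. Open-ended QA columns (Event-Cause through QA Mean) report scores on a 0–10 scale, higher is better. Values in parentheses are changes relative to the corresponding base model.

| Model | Accounting | Business | Fatwā | Sentiment | MCQ Mean | Event-Cause | Fatwa-QA | Islamic-Std | QA Mean |
|---|---|---|---|---|---|---|---|---|---|
| Base Models | | | | | | | | | |
| ALLAM-7B | 44.91 | 68.31 | 74.40 | 58.75 | 61.59 | 6.89 | 4.94 | 4.22 | 5.35 |
| Jais-2-8B | 35.33 | 60.30 | 66.10 | 46.25 | 52.00 | 4.69 | 2.51 | 4.24 | 4.88 |
| SILMA-9B | 50.90 | 69.40 | 62.55 | 30.00 | 53.21 | 1.90 | 3.35 | 2.07 | 2.44 |
| Fine-tuned Models | | | | | | | | | |
| ALLAM-7B | 71.40 (+26.5) | 93.99 (+25.7) | 74.45 (+0.1) | 61.25 (+2.5) | 75.27 (+13.7) | 6.79 (-0.1) | 6.48 (+1.5) | 4.12 (-0.1) | 5.80 (+0.5) |
| Jais-2-8B | 40.72 (+5.4) | 62.30 (+2.0) | 70.14 (+4.0) | 57.97 (+11.7) | 57.78 (+5.8) | 5.25 (+0.6) | 4.69 (+2.2) | 4.97 (+0.7) | 4.97 (+0.1) |
| SILMA-9B | 43.11 (-7.8) | 75.96 (+6.6) | 60.60 (-2.0) | 53.75 (+23.8) | 58.36 (+5.2) | 2.01 (+0.1) | 3.67 (+0.3) | 3.67 (+1.6) | 3.12 (+0.7) |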

Table 3: Domain adaptation across Arabic LLMs. MCQ accuracy (%) and Open-Ended QA scores (0–10) before and after fine-tuning on SAHM.
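
The parenthesized values in Table 3 appear to be the fine-tuned score minus the base score, shown to one decimal place. A minimal sketch under that assumption follows; the helper name `format_delta` is illustrative and not from the paper.

```python
def format_delta(base, tuned):
    """Signed change from base to fine-tuned score, one decimal place."""
    return f"{tuned - base:+.1f}"


# ALLAM-7B Accounting and SILMA-9B Accounting from Table 3.
allam_accounting = format_delta(base=44.91, tuned=71.40)
silma_accounting = format_delta(base=50.90, tuned=43.11)
print(allam_accounting, silma_accounting)
```

Differences that fall very close to a rounding boundary, such as 74.45 versus 74.40, can round either way in floating point, so a recomputed delta may differ from the table in the last digit.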