Recreated from the paper for quick comparison. Scores are as reported. MCQ columns give accuracy in percent and open-ended QA columns give a 0–10 score; higher is better for both.

| Model | MCQ: Accounting | MCQ: Business | MCQ: Fatwā | MCQ: Sentiment | MCQ: Mean | QA: Event-Cause QA | QA: Islamic-Standards-QA | QA: Fatwa-QA | QA: Mean |
|---|---|---|---|---|---|---|---|---|---|
| Open-source Models: ≥ 70B Parameters | |||||||||
| Qwen2.5-72B-Instruct | 65.87 | 74.86 | 84.65 | 75.00 | 75.10 | 8.1000 | 5.6330 | 5.3912 | 6.3747 |
| LLaMA-3.1-70B | 52.10 | 77.60 | 84.90 | 80.00 | 73.65 | 6.623 | 3.7245 | 4.7607 | 5.036 |
| Open-source Models: < 70B Parameters | |||||||||
| Qwen2.5-14B-Instruct | 49.10 | 63.39 | 76.05 | 57.50 | 61.51 | 7.4975 | 4.8806 | 4.0576 | 5.4786 |
| Qwen2.5-7B-Instruct | 48.50 | 59.56 | 70.00 | 55.00 | 58.27 | 6.1038 | 3.4039 | 2.6815 | 4.0631 |
| Gemma-2-9B-IT | 49.10 | 63.39 | 66.60 | 55.00 | 58.52 | 7.1438 | 4.2306 | 3.4266 | 4.9336 |
| Gemma-3-27B-IT | 53.89 | 73.22 | 80.65 | 80.00 | 71.94 | 8.7188 | 6.1708 | 5.1929 | 6.6942 |
| Gemma-3-4B-IT | 38.32 | 67.76 | 61.35 | 75.00 | 60.61 | 7.4075 | 2.8985 | 2.4767 | 4.2609 |
| LLaMA-3.1-8B | 41.92 | 60.66 | 64.05 | 73.75 | 60.60 | 4.9231 | 2.5168 | 1.4025 | 2.9475 |
| Mixtral-8x7B-Instruct | 32.93 | 60.66 | 62.15 | 70.00 | 56.44 | 4.5538 | 2.4980 | 1.7896 | 2.9471 |
| Proprietary Models: Reasoning-Enhanced | |||||||||
| GPT-5 | 65.27 | 72.68 | 90.75 | 78.75 | 76.86 | 9.6831 | 8.7965 | 8.0515 | 8.8437 |
| GPT-4o | 60.48 | 78.14 | 87.70 | 77.50 | 75.96 | 8.3125 | 6.6598 | 6.5219 | 7.1647 |
| Proprietary Models: General-Purpose | |||||||||
| Claude-Opus-4.5 | 77.84 | 76.50 | 91.75 | 75.00 | 80.27 | 9.6818 | 8.0438 | 8.80906 | 8.8449 |
| Claude-Sonnet-4.5 | 78.44 | 76.50 | 88.15 | 77.50 | 80.15 | 9.3388 | 8.2588 | 7.6049 | 8.4008 |
| Claude-Haiku-4.5 | 67.66 | 73.77 | 84.90 | 77.50 | 75.96 | 9.1050 | 7.0002 | 6.5341 | 7.5464 |
| Gemini-3-Flash (preview) | 76.05 | 74.86 | 89.90 | 81.25 | 80.52 | 9.8369 | 9.1649 | 9.1571 | 9.0798 |
| GPT-4o-mini | 58.08 | 77.60 | 81.75 | 75.00 | 73.61 | 7.9613 | 5.6094 | 5.3087 | 6.2931 |
| Arabic Models | |||||||||
| ALLAM-7B | 44.91 | 68.31 | 74.40 | 58.75 | 61.59 | 6.8875 | 4.9364 | 4.2185 | 5.3475 |
| Fanar-1-9B | 47.31 | 66.12 | 74.45 | 58.75 | 61.66 | 7.5850 | 4.9607 | 4.4600 | 5.6686 |
| SILMA-9B | 50.90 | 69.40 | 62.55 | 30.00 | 53.21 | 1.8969 | 3.3547 | 2.0711 | 2.4409 |
| Jais-2-8B | 35.33 | 60.30 | 66.10 | 46.25 | 52.00 | 4.6922 | 4.245 | 2.5147 | 4.88 |

Table 1: Unified leaderboard comparing MCQ tasks (Accuracy %) and open-ended QA tasks (Score 0–10). Open-ended QA scores are averaged over Event-Cause QA, Islamic-Standards-QA, and Fatwa-QA.

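The Mean columns can be spot-checked against the per-dataset scores. The sketch below assumes the mean is a plain unweighted average (the caption only says "averaged") and that reported means are rounded to two decimals for MCQ and four for open-ended QA. The names `ROWS`, `agrees`, and `find_mismatches` are illustrative, and only four rows from Table 1 are transcribed.

```python
from statistics import fmean

# Four rows transcribed from Table 1:
# name -> (MCQ scores, reported MCQ mean, open-ended scores, reported open-ended mean)
ROWS = {
    "Qwen2.5-72B-Instruct": ([65.87, 74.86, 84.65, 75.00], 75.10,
                             [8.1000, 5.6330, 5.3912], 6.3747),
    "Claude-Opus-4.5": ([77.84, 76.50, 91.75, 75.00], 80.27,
                        [9.6818, 8.0438, 8.80906], 8.8449),
    "Gemini-3-Flash (preview)": ([76.05, 74.86, 89.90, 81.25], 80.52,
                                 [9.8369, 9.1649, 9.1571], 9.0798),
    "Jais-2-8B": ([35.33, 60.30, 66.10, 46.25], 52.00,
                  [4.6922, 4.245, 2.5147], 4.88),
}


def agrees(scores, reported, decimals):
    """True if the unweighted mean matches `reported` up to rounding at `decimals` places."""
    # Half a unit in the last reported digit, plus a little slack for float error.
    tolerance = 0.5 * 10 ** -decimals + 1e-9
    return abs(fmean(scores) - reported) <= tolerance


def find_mismatches(rows):
    """List (model, column group) pairs whose reported mean is not the unweighted average."""
    mismatches = []
    for name, (mcq, mcq_mean, qa, qa_mean) in rows.items():
        if not agrees(mcq, mcq_mean, decimals=2):
            mismatches.append((name, "MCQ"))
        if not agrees(qa, qa_mean, decimals=4):
            mismatches.append((name, "Open-Ended QA"))
    return mismatches


mismatches = find_mismatches(ROWS)
print(mismatches)
```

For these four rows the check flags the open-ended QA means of Gemini-3-Flash (preview) and Jais-2-8B. Whether that points to a transcription slip or to a different averaging scheme in the paper cannot be settled from the table alone.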
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Proprietary Models – Reasoning-Enhanced | |||
| Claude-Opus-4.5 | 78.22 | 63.17 | 64.14 |
| GPT-5 | 41.19 | 63.70 | 64.11 |
| Claude-Sonnet-4.5 | 79.86 | 64.98 | 65.13 |
| Proprietary Models – General-Purpose | |||
| Claude-Haiku-4.5 | 79.39 | 61.40 | 63.62 |
| GPT-4o-mini | 77.79 | 62.90 | 64.08 |
| GPT-4o | 78.91 | 63.16 | 63.71 |
| Gemini-3-Flash | 49.36 | 35.83 | 43.02 |
| Gemini-2.5-Flash | 39.46 | 27.17 | 36.81 |
| Open-source Models: ≥ 70B Parameters | |||
| Gemma-3-27B-IT | 79.25 | 63.57 | 63.42 |
| Qwen2.5-72B-Instruct | 40.52 | 29.50 | 34.04 |
| Meta-LLaMA-3.1-70B | 39.64 | 31.40 | 32.65 |
| Open-source Models: < 70B Parameters | |||
| Qwen2.5-14B-Instruct | 44.42 | 30.90 | 35.82 |
| Gemma-3-4B-IT | 76.52 | 62.06 | 60.93 |
| Meta-LLaMA-3.1-8B | 66.67 | 47.92 | 56.10 |
| Mixtral-8x7B-Instruct | 32.71 | 13.07 | 23.78 |
| Qwen2.5-7B-Instruct | 25.15 | 12.01 | 21.86 |
| Arabic Models | |||
| Jais-2-8B | 73.68 | 56.54 | 61.17 |
| Fanar-1-9B-Instruct | 60.51 | 35.97 | 46.96 |
| ALLaM-7B-Instruct | 35.97 | 22.61 | 28.24 |
| SILMA-9B-Instruct | 27.92 | 16.66 | 25.99 |

Table 2: Extractive summarization performance on Arabic financial reports evaluated using ROUGE F1 (%).

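For readers unfamiliar with the metric, here is a minimal sketch of ROUGE-N and ROUGE-L F1 as percentages. It assumes plain whitespace tokenization and no stemming or normalization; the tokenizer and ROUGE implementation behind Table 2 are not stated here, so this will not necessarily reproduce the reported numbers, especially on Arabic text. All function names are illustrative.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1_percent(overlap, candidate_total, reference_total):
    """Harmonic mean of precision and recall, scaled to 0-100."""
    if overlap == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 100 * 2 * precision * recall / (precision + recall)


def rouge_n_f1(reference, candidate, n=1):
    """ROUGE-N F1 using clipped n-gram overlap."""
    ref = ngram_counts(reference.split(), n)
    cand = ngram_counts(candidate.split(), n)
    overlap = sum((ref & cand).values())  # & keeps the minimum count per n-gram
    return f1_percent(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 based on the longest common subsequence."""
    ref, cand = reference.split(), candidate.split()
    return f1_percent(lcs_length(ref, cand), len(cand), len(ref))


reference = "the net profit rose ten percent"
candidate = "net profit rose sharply"
print(rouge_n_f1(reference, candidate, n=1))
print(rouge_n_f1(reference, candidate, n=2))
print(rouge_l_f1(reference, candidate))
```

In this toy pair, three of the candidate's four unigrams appear in the six-token reference, so precision is 0.75 and recall is 0.5.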
| Model | MCQ: Accounting | MCQ: Business | MCQ: Fatwā | MCQ: Sentiment | MCQ: Mean | QA: Event-Cause | QA: Fatwa-QA | QA: Islamic-Std | QA: Mean |
|---|---|---|---|---|---|---|---|---|---|
| Base Models | |||||||||
| ALLAM-7B | 44.91 | 68.31 | 74.40 | 58.75 | 61.59 | 6.89 | 4.94 | 4.22 | 5.35 |
| Jais-2-8B | 35.33 | 60.30 | 66.10 | 46.25 | 52.00 | 4.69 | 2.51 | 4.24 | 4.88 |
| SILMA-9B | 50.90 | 69.40 | 62.55 | 30.00 | 53.21 | 1.90 | 3.35 | 2.07 | 2.44 |
| Fine-tuned Models | |||||||||
| ALLAM-7B | 71.40 (+26.5) | 93.99 (+25.7) | 74.45 (+0.1) | 61.25 (+2.5) | 75.27 (+13.7) | 6.79 (-0.1) | 6.48 (+1.5) | 4.12 (-0.1) | 5.80 (+0.5) |
| Jais-2-8B | 40.72 (+5.4) | 62.30 (+2.0) | 70.14 (+4.0) | 57.97 (+11.7) | 57.78 (+5.8) | 5.25 (+0.6) | 4.69 (+2.2) | 4.97 (+0.7) | 4.97 (+0.1) |
| SILMA-9B | 43.11 (-7.8) | 75.96 (+6.6) | 60.60 (-2.0) | 53.75 (+23.8) | 58.36 (+5.2) | 2.01 (+0.1) | 3.67 (+0.3) | 3.67 (+1.6) | 3.12 (+0.7) |

Table 3: Domain adaptation across Arabic LLMs. MCQ accuracy (%) and Open-Ended QA scores (0–10) before and after fine-tuning on SAHM.
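The parenthesized values in the fine-tuned rows read as changes relative to the base model. A minimal sketch, assuming each delta is the fine-tuned score minus the base score, rounded half-up to one decimal (the rounding rule is an assumption, and `annotate` is an illustrative name). Decimal arithmetic avoids binary floating-point surprises on ties such as 0.05.

```python
from decimal import Decimal, ROUND_HALF_UP


def annotate(base: str, tuned: str) -> str:
    """Format a fine-tuned score with its signed change from the base score."""
    delta = (Decimal(tuned) - Decimal(base)).quantize(
        Decimal("0.1"), rounding=ROUND_HALF_UP
    )
    return f"{tuned} ({delta:+.1f})"


# ALLAM-7B MCQ cells from Table 3: Accounting, Business, Fatwā, Sentiment, Mean.
base_scores = ["44.91", "68.31", "74.40", "58.75", "61.59"]
tuned_scores = ["71.40", "93.99", "74.45", "61.25", "75.27"]

cells = [annotate(b, t) for b, t in zip(base_scores, tuned_scores)]
print(" | ".join(cells))
```

Under these assumptions the output matches the ALLAM-7B MCQ cells in the fine-tuned block of the table.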