Both tables are recreated from the paper for quick comparison; scores are as reported there.
| Model | #Examples | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|
| Open-source Models | | | | |
| Gemma-2-9B-IT | 40 | 75.84 | 61.09 | 64.67 |
| LLaMA-3.1-8B | 40 | 65.93 | 43.12 | 54.45 |
| Qwen2.5-14B-Instruct | 40 | 78.68 | 62.14 | 63.93 |
| Qwen2.5-7B-Instruct | 40 | 59.19 | 37.13 | 50.76 |
| Falcon-H1-7B-Instruct | 40 | 11.24 | 4.38 | 10.04 |
| Proprietary Models | | | | |
| GPT-5 | 40 | 81.08 | 66.90 | 65.81 |
| GPT-4o | 40 | 79.60 | 61.69 | 63.98 |
| GPT-4 | 40 | 79.66 | 61.78 | 63.00 |
| Claude-4 Sonnet | 40 | 82.35 | 65.39 | 64.78 |
| Gemini-2.5 Pro | 40 | 36.35 | 26.40 | 28.84 |
| Arabic Models | | | | |
| ALLAM-7B | 40 | 67.67 | 45.16 | 53.98 |
| Fanar-1-9B-Instruct | 40 | 55.64 | 30.73 | 44.72 |
| SILMA-9B | 40 | 18.51 | 12.16 | 17.14 |
Table 1: ROUGE-1/2/L F1 (%). 40 examples per model.
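For reference, below is a minimal sketch of how such ROUGE-1/2/L F1 numbers can be computed with the open-source `rouge-score` package. This is an assumption about tooling, not the paper's exact pipeline: Arabic-specific tokenization/normalization and whether per-example F1 is averaged or scored corpus-level are not specified in the table and are guesses here.

```python
from rouge_score import rouge_scorer

# Scorer for unigram, bigram, and longest-common-subsequence overlap.
_SCORER = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])

def rouge_f1(reference: str, prediction: str) -> dict[str, float]:
    """ROUGE-1/2/L F1 for one reference/prediction pair, as percentages."""
    scores = _SCORER.score(reference, prediction)
    return {name: 100.0 * s.fmeasure for name, s in scores.items()}

def mean_rouge_f1(references: list[str], predictions: list[str]) -> dict[str, float]:
    """Average per-example F1 over a set (e.g. the 40 examples in Table 1).

    Note: averaging per-example F1 is an assumption; the paper may aggregate differently.
    """
    per_example = [rouge_f1(r, p) for r, p in zip(references, predictions)]
    return {k: sum(ex[k] for ex in per_example) / len(per_example) for k in per_example[0]}
```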
| Model | Accounting (%) | Business (%) | Fatwā (%) | Sentiment (%) | Acc. Mean (%) | ECR-QA (0–10) | SS-QA (0–10) | Fatwa-QA (0–10) | Score Mean (0–10) |
|---|---|---|---|---|---|---|---|---|---|
| Open-source Models | | | | | | | | | |
| Qwen2.5-14B-Instruct | 20.83 | 67.11 | 64.40 | 57.50 | 52.96 | 7.553 | 4.777 | 3.748 | 5.651 |
| Qwen2.5-7B-Instruct | 29.17 | 67.11 | 60.40 | 60.00 | 54.67 | 6.606 | 3.494 | 2.772 | 4.689 |
| Gemma-2-9B-IT | 25.00 | 48.68 | 68.80 | 60.00 | 50.62 | 6.781 | 4.020 | 3.464 | 5.123 |
| LLaMA-3.1-8B | 29.17 | 61.84 | 62.80 | 65.00 | 54.70 | 4.889 | 2.441 | 1.689 | 3.289 |
| Falcon-H1-7B-Instruct | 37.50 | 52.63 | 66.40 | 57.50 | 53.51 | 7.151 | 3.968 | 3.009 | 4.709 |
| Proprietary Models | | | | | | | | | |
| GPT-5 | 87.50 | 72.37 | 80.80 | 55.00 | 73.92 | 9.658 | 9.068 | 8.227 | 8.984 |
| GPT-4o | 54.17 | 75.00 | 76.40 | 62.50 | 67.52 | 8.436 | 6.776 | 6.046 | 7.086 |
| GPT-4 | 25.00 | 73.68 | 72.00 | 52.50 | 55.79 | 6.766 | 5.807 | 5.209 | 5.927 |
| Claude-4 Sonnet | 45.83 | 59.21 | 65.20 | 57.50 | 56.44 | 8.812 | 7.934 | 6.586 | 7.777 |
| Gemini-2.5 Pro | 25.00 | 47.37 | 68.80 | 57.50 | 49.67 | 6.472 | 5.617 | 5.104 | 5.731 |
| Arabic Models | | | | | | | | | |
| ALLAM-7B | 20.83 | 68.42 | 70.00 | 65.00 | 56.06 | 6.340 | 4.710 | 4.110 | 5.053 |
| Fanar-1-9B | 33.33 | 71.05 | 68.80 | 42.50 | 53.92 | 6.950 | 3.780 | 3.740 | 4.823 |
| SILMA-9B | 14.29 | 67.11 | 57.60 | 45.00 | 46.00 | 1.820 | 3.300 | 2.580 | 2.567 |

Table 2: Unified leaderboard with unweighted means across tasks. Accounting, Business, Fatwā, and Sentiment are MCQ tasks scored by accuracy (%, higher is better); ECR-QA, SS-QA, and Fatwa-QA are open-ended QA tasks scored on a 0–10 scale (higher is better).
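To make the aggregation explicit, the sketch below computes the unweighted (equal-weight-per-task) means behind the Acc. Mean and Score Mean columns. Function and variable names are illustrative, not from the paper's code; the example values are GPT-5's reported per-task results from Table 2.

```python
def unweighted_mean(values: dict[str, float]) -> float:
    """Simple average that gives every task equal weight."""
    return sum(values.values()) / len(values)

# GPT-5's reported per-task results (copied from Table 2).
mcq_acc = {"Accounting": 87.50, "Business": 72.37, "Fatwa": 80.80, "Sentiment": 55.00}
qa_scores = {"ECR-QA": 9.658, "SS-QA": 9.068, "Fatwa-QA": 8.227}

acc_mean = unweighted_mean(mcq_acc)      # -> 73.92 after rounding, matching Table 2
score_mean = unweighted_mean(qa_scores)  # -> 8.984 after rounding, matching Table 2
print(f"Acc. Mean: {acc_mean:.2f}  Score Mean: {score_mean:.3f}")
```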