SAHM Leaderboard

Both tables below are recreated from the paper for quick comparison; scores are as reported there.

Extractive Summarization on Arabic Financial Reports (ROUGE F1 %)

| Model | #Examples | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- | --- |
| **Open-source Models** | | | | |
| Gemma-2-9B-IT | 40 | 75.84 | 61.09 | 64.67 |
| LLaMA-3.1-8B | 40 | 65.93 | 43.12 | 54.45 |
| Qwen2.5-14B-Instruct | 40 | 78.68 | 62.14 | 63.93 |
| Qwen2.5-7B-Instruct | 40 | 59.19 | 37.13 | 50.76 |
| Falcon-H1-7B-Instruct | 40 | 11.24 | 4.38 | 10.04 |
| **Proprietary Models** | | | | |
| GPT-5 | 40 | 81.08 | 66.90 | 65.81 |
| GPT-4o | 40 | 79.60 | 61.69 | 63.98 |
| GPT-4 | 40 | 79.66 | 61.78 | 63.00 |
| Claude-4 Sonnet | 40 | 82.35 | 65.39 | 64.78 |
| Gemini-2.5 Pro | 40 | 36.35 | 26.40 | 28.84 |
| **Arabic Models** | | | | |
| ALLAM-7B | 40 | 67.67 | 45.16 | 53.98 |
| Fanar-1-9B-Instruct | 40 | 55.64 | 30.73 | 44.72 |
| SILMA-9B | 40 | 18.51 | 12.16 | 17.14 |

Table 1: ROUGE-1/2/L F1 (%). 40 examples per model.
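
For readers who want to compute the same metric on their own outputs, here is a minimal sketch using Google's rouge-score package (`pip install rouge-score`). The paper's exact scoring pipeline and preprocessing are not shown on this page, so the package choice and the whitespace tokenizer are assumptions; the latter is used because the package's default tokenizer keeps only ASCII alphanumerics, which would drop Arabic text entirely. `reference` and `candidate` are hypothetical placeholders.

```python
from rouge_score import rouge_scorer


class WhitespaceTokenizer:
    """Splits on whitespace so Arabic tokens survive tokenization."""

    def tokenize(self, text):
        return text.split()


# F1 for ROUGE-1/2/L, as in Table 1; stemming off (Porter stemming is English-only).
scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL"],
    use_stemmer=False,
    tokenizer=WhitespaceTokenizer(),
)

reference = "..."  # gold summary of an Arabic financial report (placeholder)
candidate = "..."  # model-generated summary (placeholder)

scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(f"{name}: F1 = {result.fmeasure * 100:.2f}%")
```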

Unified Leaderboard (MCQ Accuracy % and Open-Ended QA Score 0–10)

MCQ columns (Accounting, Business, Fatwā, Sentiment, Acc. Mean) report accuracy (% ↑); open-ended QA columns (ECR-QA, SS-QA, Fatwa-QA, Score Mean) report scores (0–10 ↑).

| Model | Accounting | Business | Fatwā | Sentiment | Acc. Mean | ECR-QA | SS-QA | Fatwa-QA | Score Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Open-source Models** | | | | | | | | | |
| Qwen2.5-14B-Instruct | 20.83 | 67.11 | 64.40 | 57.50 | 52.96 | 7.553 | 4.777 | 3.748 | 5.651 |
| Qwen2.5-7B-Instruct | 29.17 | 67.11 | 60.40 | 60.00 | 54.67 | 6.606 | 3.494 | 2.772 | 4.689 |
| Gemma-2-9B-IT | 25.00 | 48.68 | 68.80 | 60.00 | 50.62 | 6.781 | 4.020 | 3.464 | 5.123 |
| LLaMA-3.1-8B | 29.17 | 61.84 | 62.80 | 65.00 | 54.70 | 4.889 | 2.441 | 1.689 | 3.289 |
| Falcon-H1-7B-Instruct | 37.50 | 52.63 | 66.40 | 57.50 | 53.51 | 7.151 | 3.968 | 3.009 | 4.709 |
| **Proprietary Models** | | | | | | | | | |
| GPT-5 | 87.50 | 72.37 | 80.80 | 55.00 | 73.92 | 9.658 | 9.068 | 8.227 | 8.984 |
| GPT-4o | 54.17 | 75.00 | 76.40 | 62.50 | 67.52 | 8.436 | 6.776 | 6.046 | 7.086 |
| GPT-4 | 25.00 | 73.68 | 72.00 | 52.50 | 55.79 | 6.766 | 5.807 | 5.209 | 5.927 |
| Claude-4 Sonnet | 45.83 | 59.21 | 65.20 | 57.50 | 56.44 | 8.812 | 7.934 | 6.586 | 7.777 |
| Gemini-2.5 Pro | 25.00 | 47.37 | 68.80 | 57.50 | 49.67 | 6.472 | 5.617 | 5.104 | 5.731 |
| **Arabic Models** | | | | | | | | | |
| ALLAM-7B | 20.83 | 68.42 | 70.00 | 65.00 | 56.06 | 6.340 | 4.710 | 4.110 | 5.053 |
| Fanar-1-9B | 33.33 | 71.05 | 68.80 | 42.50 | 53.92 | 6.950 | 3.780 | 3.740 | 4.823 |
| SILMA-9B | 14.29 | 67.11 | 57.60 | 45.00 | 46.00 | 1.820 | 3.300 | 2.580 | 2.567 |

Table 2: Unified leaderboard with unweighted means across tasks.
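
Because the means are unweighted, they can be sanity-checked with plain arithmetic. A minimal Python check on the GPT-5 row (values copied verbatim from Table 2) reproduces both reported means:

```python
# Recompute Table 2's unweighted means for the GPT-5 row.
mcq = [87.50, 72.37, 80.80, 55.00]  # Accounting, Business, Fatwā, Sentiment
qa = [9.658, 9.068, 8.227]          # ECR-QA, SS-QA, Fatwa-QA

acc_mean = sum(mcq) / len(mcq)    # 73.9175  -> reported as 73.92
score_mean = sum(qa) / len(qa)    # 8.9843...-> reported as 8.984

print(f"Acc. Mean:  {acc_mean:.4f}")    # Acc. Mean:  73.9175
print(f"Score Mean: {score_mean:.4f}")  # Score Mean: 8.9843
```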