SAHM Leaderboard

Tables recreated from the paper for quick comparison. All scores are reproduced as reported there.

Extractive Summarization on Arabic Financial Reports (ROUGE F1 %)

Model	#Examples	ROUGE-1	ROUGE-2	ROUGE-L
Open-source Models
Gemma-2-9B-IT	40	75.84	61.09	64.67
LLaMA-3.1-8B	40	65.93	43.12	54.45
Qwen2.5-14B-Instruct	40	78.68	62.14	63.93
Qwen2.5-7B-Instruct	40	59.19	37.13	50.76
Falcon-H1-7B-Instruct	40	11.24	4.38	10.04
Proprietary Models
GPT-5	40	81.08	66.90	65.81
GPT-4o	40	79.60	61.69	63.98
GPT-4	40	79.66	61.78	63.00
Claude-4 Sonnet	40	82.35	65.39	64.78
Gemini-2.5 Pro	40	36.35	26.40	28.84
Arabic Models
ALLAM-7B	40	67.67	45.16	53.98
Fanar-1-9B-Instruct	40	55.64	30.73	44.72
SILMA-9B	40	18.51	12.16	17.14

Table 1: ROUGE-1, ROUGE-2, and ROUGE-L F1 (%) for extractive summarization of Arabic financial reports. Each model is evaluated on 40 examples.
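
For readers unfamiliar with the metric, the following is a minimal sketch of how a ROUGE-N F1 percentage can be computed from n-gram overlap. It is not the paper's evaluation code: the function names `ngram_counts` and `rouge_n_f1` are hypothetical, and whitespace tokenization is a simplifying assumption (the tokenizer and any Arabic text normalization used for Table 1 are not stated here). ROUGE-L is based on the longest common subsequence rather than n-grams and is not shown.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """ROUGE-N F1 in percent, using whitespace tokens (an assumption)."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 100 * 2 * precision * recall / (precision + recall)


# Toy strings for illustration only; not taken from the benchmark.
print(rouge_n_f1("a b c d", "a b e f", n=1))
print(rouge_n_f1("a b c d", "a b e f", n=2))
```

In the toy example, two of four unigrams and one of three bigrams overlap, so precision and recall are equal in each case and F1 equals that shared ratio.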

Unified Leaderboard (MCQ Accuracy % and Open-Ended QA Score 0–10)

Model	MCQ (Accuracy % ↑)						Open-Ended QA (Score 0–10 ↑)
Model	Accounting	Business	Fatwā	Sentiment	Acc. Mean		ECR-QA	SS-QA	Fatwa-QA	Score Mean
Open-source Models
Qwen2.5-14B-Instruct	20.83	67.11	64.40	57.50	52.96		7.553	4.777	3.748	5.651
Qwen2.5-7B-Instruct	29.17	67.11	60.40	60.00	54.67		6.606	3.494	2.772	4.689
Gemma-2-9B-IT	25.00	48.68	68.80	60.00	50.62		6.781	4.020	3.464	5.123
LLaMA-3.1-8B	29.17	61.84	62.80	65.00	54.70		4.889	2.441	1.689	3.289
Falcon-H1-7B-Instruct	37.50	52.63	66.40	57.50	53.51		7.151	3.968	3.009	4.709
Proprietary Models
GPT-5	87.50	72.37	80.80	55.00	73.92		9.658	9.068	8.227	8.984
GPT-4o	54.17	75.00	76.40	62.50	67.52		8.436	6.776	6.046	7.086
GPT-4	25.00	73.68	72.00	52.50	55.79		6.766	5.807	5.209	5.927
Claude-4 Sonnet	45.83	59.21	65.20	57.50	56.44		8.812	7.934	6.586	7.777
Gemini-2.5 Pro	25.00	47.37	68.80	57.50	49.67		6.472	5.617	5.104	5.731
Arabic Models
ALLAM-7B	20.83	68.42	70.00	65.00	56.06		6.340	4.710	4.110	5.053
Fanar-1-9B	33.33	71.05	68.80	42.50	53.92		6.950	3.780	3.740	4.823
SILMA-9B	14.29	67.11	57.60	45.00	46.00		1.820	3.300	2.580	2.567

Table 2: Unified leaderboard. Acc. Mean and Score Mean are unweighted means across the four MCQ tasks and the three open-ended QA tasks, respectively.
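
As a quick illustration of what "unweighted mean" means here, the sketch below recomputes both mean columns for the GPT-5 row of Table 2 and compares them with the reported values. The helper names `unweighted_mean` and `agrees` and the tolerances are assumptions made for this example; they are not part of the paper.

```python
from statistics import fmean


def unweighted_mean(scores):
    """Arithmetic mean with every task weighted equally."""
    return fmean(scores)


def agrees(scores, reported, tol):
    """True if the recomputed mean is within tol of the reported mean."""
    return abs(unweighted_mean(scores) - reported) <= tol


# GPT-5 row, copied from Table 2.
gpt5_mcq = [87.50, 72.37, 80.80, 55.00]  # Accounting, Business, Fatwa, Sentiment
gpt5_qa = [9.658, 9.068, 8.227]          # ECR-QA, SS-QA, Fatwa-QA

print(agrees(gpt5_mcq, 73.92, tol=0.01))   # reported Acc. Mean
print(agrees(gpt5_qa, 8.984, tol=0.001))   # reported Score Mean
```

The same check can be applied to any other row. A row whose reported mean falls outside the tolerance returns False and may be worth comparing against the paper.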