UrduMMLU Leaderboard

Zero-Shot Evaluation — Accuracy (%) ↑

Prompt language: English

#	Model	Overall	STEM	Humanities	Social Sci.	Profession	Other (GK)	Inv % ↓
Open-source Models — > 25B Parameters
	DeepSeek-V4-Flash	82.41	97.49	68.71	90.05	89.85	86.44	0.1
	LLaMA-4-Maverick-17B-128E	75.47	91.98	63.27	80.11	81.64	80.66	<0.1
	Gemma-4-31B-IT	76.38	93.33	63.62	81.70	83.69	79.49	<0.1
	Qwen3.6-27B	69.22	90.46	55.95	73.19	74.77	69.60	0
	Qwen3.6-35B-A3B	69.39	88.77	56.81	73.52	75.49	69.82	0
	LLaMA-3.3-70B	68.56	81.32	56.30	75.45	76.51	73.77	0
	LLaMA-4-Scout-17B-16E	67.82	84.14	57.03	71.20	74.15	69.52	0
	Gemma-4-26B-A4B-IT	69.23	85.92	57.43	73.83	75.49	70.55	<0.1
Open-source Models — ≤ 25B Parameters
	Ministral-3-8B	54.77	67.81	45.27	58.99	59.59	54.43	0
	Gemma-2-9B-IT	55.28	67.08	47.21	58.40	60.00	54.58	0
	Qwen3-8B	50.97	70.70	39.21	53.99	56.62	50.18	0
	Qwen3-4B-Instruct-2507	50.84	68.61	42.33	51.48	52.62	47.84	0
	Ministral-3-3B	47.90	55.04	43.69	48.99	49.08	47.91	<0.1
	Gemma-3-4B-IT	43.93	49.97	37.87	47.68	47.79	45.57	0
	LLaMA-3.1-8B	43.30	46.96	36.38	49.66	48.51	44.54	0
	LLaMA-3.2-3B	32.98	37.12	26.78	38.06	36.00	35.60	0
	Phi-4-mini	33.85	37.07	28.85	38.10	40.41	32.67	<0.1
	Phi-3.5-mini	28.03	33.76	22.89	30.90	32.21	28.28	0
	BLOOMZ-7B	29.73	27.75	27.81	33.56	31.83	28.11	11.2
	BLOOMZ-3B	30.68	27.89	30.25	33.26	32.34	28.39	6.5
	BLOOMZ-1.7B	30.19	28.94	27.70	33.33	37.17	31.43	2.5
	BLOOMZ-1.1B	25.52	23.52	27.21	24.39	26.06	25.63	0.5
Proprietary Models
1	Gemini-3.5-Flash	90.20	97.75	84.98	92.15	92.10	91.43	0.1
2	Gemini-3.1-Flash-Lite	84.56	96.85	74.20	90.01	90.26	86.08	<0.1
3	Claude-Sonnet-4.6	82.91	96.34	72.69	87.36	87.18	86.01	0
	GPT-5.4	80.81	95.13	69.29	86.35	85.85	84.10	0
	GPT-5.4-mini	73.43	88.34	62.82	77.52	79.08	75.24	0
	Claude-Haiku-4.5	71.40	90.49	58.57	75.86	75.90	74.14	0.1
Urdu-targeted Models
	Qalb-1.0-8B	34.77	38.18	29.99	37.68	40.31	39.56	0
	Alif-1.0-8B	34.72	41.09	25.74	41.40	39.87	41.04	0.6

Social Sci. = Social Sciences; Other (GK) = Other (General Knowledge). Boxed/bold values mark the best score in each column. Inv %: percentage of unparsable or malformed outputs (lower is better). The table shows English-prompt scores; on the interactive leaderboard, the prompt-language toggle switches the overall and domain columns to the Urdu-prompt scores.
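
As a rough illustration of how accuracy and Inv % relate, the sketch below scores a list of raw model outputs. The parsing rule (a lone option letter A-D, optionally wrapped in punctuation), the names `parse_choice` and `score`, and the decision to count unparsable outputs as incorrect are all assumptions made for this example; the leaderboard does not state its exact parsing procedure.

```python
import re
from typing import Optional

# Hypothetical rule: the answer is a lone option letter A-D, optionally
# wrapped in brackets or followed by punctuation. Anything else is invalid.
_CHOICE = re.compile(r"^\W*([A-D])\W*$")

def parse_choice(output: str) -> Optional[str]:
    """Return the chosen option letter, or None if the output is unparsable."""
    match = _CHOICE.match(output.strip().upper())
    return match.group(1) if match else None

def score(outputs: list[str], gold: list[str]) -> tuple[float, float]:
    """Return (accuracy %, invalid %) over paired outputs and gold labels."""
    if len(outputs) != len(gold) or not gold:
        raise ValueError("outputs and gold must be non-empty and equal length")
    parsed = [parse_choice(o) for o in outputs]
    invalid = sum(p is None for p in parsed)          # unparsable outputs
    correct = sum(p == g for p, g in zip(parsed, gold))  # None never matches
    n = len(gold)
    return 100.0 * correct / n, 100.0 * invalid / n

accuracy, inv_pct = score(["A", "(b)", "I am not sure", "C."], ["A", "B", "C", "D"])
print(f"accuracy={accuracy:.2f}%  inv={inv_pct:.2f}%")
```

Under this assumed rule an invalid output lowers accuracy and raises Inv % at the same time, which is one way a high Inv % (as for BLOOMZ-7B) can depress a model's score.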

STEM–Humanities Gap (Urdu Prompt, sorted by STEM accuracy)

Model	STEM %	Humanities %	Gap (pts)
Gemini-3.5-Flash	97.81	85.31	12.50
DeepSeek-V4-Flash	97.57	67.32	30.25
GPT-5.4	97.40	74.82	22.58
Gemini-3.1-Flash-Lite	97.09	74.38	22.71
Qwen3.6-35B-A3B	96.32	58.12	38.20
Claude-Sonnet-4.6	96.26	72.69	23.57
Gemma-4-31B-IT	93.86	63.25	30.61
LLaMA-4-Maverick-17B-128E	92.38	63.25	29.13
Claude-Haiku-4.5	91.96	59.31	32.65
Qwen3.6-27B	91.12	55.71	35.41
GPT-5.4-mini	88.25	62.35	25.90
Gemma-4-26B-A4B-IT	87.21	57.73	29.48
LLaMA-4-Scout-17B-16E	85.59	56.55	29.04
LLaMA-3.3-70B	78.39	56.10	22.29
Qwen3-8B	74.37	30.87	43.50
Ministral-3-8B	71.37	45.74	25.63
Gemma-2-9B-IT	69.02	48.08	20.94
Qwen3-4B-Instruct-2507	68.75	43.00	25.75
Ministral-3-3B	57.25	43.07	14.18
Gemma-3-4B-IT	51.79	38.27	13.52
LLaMA-3.1-8B	46.49	37.61	8.88
LLaMA-3.2-3B	37.24	29.32	7.92
Phi-4-mini	37.08	28.70	8.38
Qalb-1.0-8B	36.26	32.72	3.54
Phi-3.5-mini	33.83	27.25	6.58
Alif-1.0-8B	33.27	29.00	4.27
BLOOMZ-7B	29.24	30.88	−1.64
BLOOMZ-1.7B	28.74	28.76	−0.02
BLOOMZ-3B	26.56	27.70	−1.14
BLOOMZ-1.1B	24.53	25.83	−1.30

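Rows are sorted by STEM accuracy under the Urdu prompt. Gap is STEM accuracy minus Humanities accuracy, in percentage points. The slightly negative gaps for the BLOOMZ models reflect near-random performance in both domains. The Urdu-targeted models (Qalb, Alif) show small gaps because both of their scores are low, not because the two domains are equally strong.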
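
The Gap column in the table above is STEM accuracy minus Humanities accuracy. As a minimal sketch, the snippet below recomputes it for four rows copied from that table and sorts them by STEM accuracy; the function name `with_gap` and the tuple layout are illustrative choices, not part of the benchmark's tooling.

```python
# (model, STEM %, Humanities %) copied from four rows of the table above.
rows = [
    ("BLOOMZ-7B", 29.24, 30.88),
    ("Gemini-3.5-Flash", 97.81, 85.31),
    ("Qwen3-8B", 74.37, 30.87),
    ("Qalb-1.0-8B", 36.26, 32.72),
]

def with_gap(rows):
    """Attach gap = STEM - Humanities (points) and sort by STEM, descending."""
    table = [(m, stem, hum, round(stem - hum, 2)) for m, stem, hum in rows]
    return sorted(table, key=lambda r: r[1], reverse=True)

result = with_gap(rows)
for model, stem, hum, gap in result:
    print(f"{model}\t{stem:.2f}\t{hum:.2f}\t{gap:+.2f}")
```

A negative gap, as for BLOOMZ-7B, simply means the Humanities score is the higher of the two.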

Few-Shot Evaluation — Accuracy (%) ↑

Four open-source models were evaluated at 0-, 1-, 3-, and 5-shot using 200 held-out demonstrations. Deltas in parentheses are relative to the same model's 0-shot score under the same prompt.
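
For readers unfamiliar with k-shot prompting, here is one way such a prompt could be assembled from a held-out pool. Everything in this sketch is hypothetical: the item fields (`question`, `options`, `answer`), the text template, the seeded random sampling, and the function names `build_k_shot_prompt` and `format_item`. The leaderboard does not describe how demonstrations were selected or formatted.

```python
import random

def format_item(item, with_answer):
    """Render one MCQ as text; this layout is a hypothetical template."""
    lines = [f"Question: {item['question']}"]
    lines += [f"{label}. {text}" for label, text in zip("ABCD", item["options"])]
    lines.append(f"Answer: {item['answer']}" if with_answer else "Answer:")
    return "\n".join(lines)

def build_k_shot_prompt(pool, question, k, seed=0):
    """Prepend k solved demonstrations from the held-out pool to the question."""
    if not 0 <= k <= len(pool):
        raise ValueError("k must be between 0 and the pool size")
    demos = random.Random(seed).sample(pool, k)  # fixed seed for repeatability
    blocks = [format_item(d, with_answer=True) for d in demos]
    blocks.append(format_item(question, with_answer=False))  # answer left blank
    return "\n\n".join(blocks)

# Toy pool of 200 placeholder items (not benchmark data).
pool = [{"question": f"demo {i}", "options": ["w", "x", "y", "z"], "answer": "A"}
        for i in range(200)]
target = {"question": "target", "options": ["w", "x", "y", "z"], "answer": "B"}
print(build_k_shot_prompt(pool, target, k=3))
```

With `k=0` the function returns only the target question, which corresponds to the zero-shot setting.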

English Prompt

Model	0-shot	1-shot	3-shot	5-shot
LLaMA-3.1-8B	43.30	44.93 (+1.63)	46.09 (+2.79)	46.59 (+3.29)
Gemma-3-4B-IT	43.93	45.91 (+1.98)	46.07 (+2.14)	46.27 (+2.34)
Qwen3-8B	50.97	50.23 (−0.74)	52.27 (+1.30)	53.21 (+2.24)
Qwen3-4B-Instruct-2507	50.84	52.56 (+1.72)	53.40 (+2.56)	53.65 (+2.81)
Mean	47.26	48.41 (+1.15)	49.46 (+2.20)	49.93 (+2.67)

Urdu Prompt

Model	0-shot	1-shot	3-shot	5-shot
LLaMA-3.1-8B	43.84	45.08 (+1.24)	46.32 (+2.48)	46.10 (+2.26)
Gemma-3-4B-IT	44.88	46.40 (+1.52)	46.36 (+1.48)	45.99 (+1.11)
Qwen3-8B	48.97	51.86 (+2.89)	53.24 (+4.27)	53.49 (+4.52)
Qwen3-4B-Instruct-2507	51.70	52.06 (+0.36)	52.89 (+1.19)	52.94 (+1.24)
Mean	47.35	48.85 (+1.50)	49.70 (+2.35)	49.63 (+2.28)

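23 of 24 few-shot configurations outperform their zero-shot baseline; the exception is Qwen3-8B at 1-shot under the English prompt (−0.74). Qwen3-8B under the Urdu prompt shows the largest gain (+4.52 at 5-shot). Despite these gains, all four models remain far below the >25B open-source tier: the best few-shot score (53.65) is still about 14 points below that tier's lowest zero-shot score (67.82, LLaMA-4-Scout-17B-16E).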
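
The deltas and the 23-of-24 count can be re-derived from the two tables above. The following sketch copies the accuracy values as printed and recomputes each gain over the 0-shot score; the dictionary layout and the helper name `deltas` are illustrative.

```python
# Accuracy (%) at 0-, 1-, 3-, and 5-shot, copied from the two tables above.
english = {
    "LLaMA-3.1-8B": [43.30, 44.93, 46.09, 46.59],
    "Gemma-3-4B-IT": [43.93, 45.91, 46.07, 46.27],
    "Qwen3-8B": [50.97, 50.23, 52.27, 53.21],
    "Qwen3-4B-Instruct-2507": [50.84, 52.56, 53.40, 53.65],
}
urdu = {
    "LLaMA-3.1-8B": [43.84, 45.08, 46.32, 46.10],
    "Gemma-3-4B-IT": [44.88, 46.40, 46.36, 45.99],
    "Qwen3-8B": [48.97, 51.86, 53.24, 53.49],
    "Qwen3-4B-Instruct-2507": [51.70, 52.06, 52.89, 52.94],
}

def deltas(scores):
    """Map each model to its few-shot gains over its own 0-shot score."""
    return {m: [round(s - v[0], 2) for s in v[1:]] for m, v in scores.items()}

# Flatten all 4 models x 3 shot settings x 2 prompts into one list of gains.
all_deltas = [d for table in (english, urdu)
              for gains in deltas(table).values() for d in gains]
improved = sum(d > 0 for d in all_deltas)
print(f"{improved} of {len(all_deltas)} few-shot configurations beat 0-shot")
print(f"largest gain: {max(all_deltas):+.2f}")
```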

MCQ Distribution across Pakistani Examination Levels

UrduMMLU covers four levels of the Pakistani secondary and higher secondary curriculum: SSC-I (Grade 9), SSC-II (Grade 10), HSSC-I (Grade 11), and HSSC-II (Grade 12). The stacked bars in Figure 8 show absolute MCQ counts (left) and within-level domain share (right). SSC-I is the largest level with 11,601 items, 54% of them Humanities, reflecting the heavy literature and language focus of early secondary curricula. At HSSC, students specialize into science, commerce, or humanities tracks: the Social Sciences share grows from 11% at SSC-I to 71% at HSSC-II, while the Humanities share contracts from 54% to 19%. The STEM share peaks at SSC-II (32%) and then falls as students who continue in arts tracks move out of science subjects.

Figure 8. Distribution of UrduMMLU items across four Pakistani examination levels, grouped by domain. Left: absolute MCQ counts per level. Right: within-level domain share (%). Humanities dominates the SSC levels; Social Sciences becomes the majority domain at HSSC-I and HSSC-II as students specialize. This curriculum structure, not a collection artifact, drives the level distribution seen in the benchmark.
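
The within-level domain share in the right-hand panel is each domain's MCQ count divided by that level's total. A minimal sketch of the calculation follows. The counts are made-up placeholders for a single hypothetical level and are NOT UrduMMLU figures, since the per-domain counts are not listed here; the function name `domain_share` is likewise illustrative.

```python
def domain_share(counts: dict[str, int]) -> dict[str, float]:
    """Convert per-domain MCQ counts for one level into percentage shares."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("level has no items")
    return {domain: 100.0 * n / total for domain, n in counts.items()}

# Placeholder counts for one hypothetical level (not benchmark data).
toy_level = {"STEM": 30, "Humanities": 50, "Social Sciences": 15, "Other": 5}
shares = domain_share(toy_level)
for domain, pct in shares.items():
    print(f"{domain}: {pct:.0f}%")
```

Because shares are normalized within each level, a domain's share can shrink from one level to the next even if its absolute count stays the same, which is why the figure shows both panels.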