UrduMMLU Leaderboard

30 models · 60 zero-shot evaluations · English & Urdu prompts


Zero-Shot Evaluation — Accuracy (%) ↑

Prompt language: English
#  Model                      Overall  STEM   Humanities  Social Sci.  Profession  Other (GK)  Inv % ↓
Open-source Models — > 25B Parameters
   DeepSeek-V4-Flash          82.41    97.49  68.71       90.05        89.85       86.44       0.1
   LLaMA-4-Maverick-17B-128E  75.47    91.98  63.27       80.11        81.64       80.66       <0.1
   Gemma-4-31B-IT             76.38    93.33  63.62       81.70        83.69       79.49       <0.1
   Qwen3.6-27B                69.22    90.46  55.95       73.19        74.77       69.60       0
   Qwen3.6-35B-A3B            69.39    88.77  56.81       73.52        75.49       69.82       0
   LLaMA-3.3-70B              68.56    81.32  56.30       75.45        76.51       73.77       0
   LLaMA-4-Scout-17B-16E      67.82    84.14  57.03       71.20        74.15       69.52       0
   Gemma-4-26B-A4B-IT         69.23    85.92  57.43       73.83        75.49       70.55       <0.1
Open-source Models — ≤ 25B Parameters
   Ministral-3-8B             54.77    67.81  45.27       58.99        59.59       54.43       0
   Gemma-2-9B-IT              55.28    67.08  47.21       58.40        60.00       54.58       0
   Qwen3-8B                   50.97    70.70  39.21       53.99        56.62       50.18       0
   Qwen3-4B-Instruct-2507     50.84    68.61  42.33       51.48        52.62       47.84       0
   Ministral-3-3B             47.90    55.04  43.69       48.99        49.08       47.91       <0.1
   Gemma-3-4B-IT              43.93    49.97  37.87       47.68        47.79       45.57       0
   LLaMA-3.1-8B               43.30    46.96  36.38       49.66        48.51       44.54       0
   LLaMA-3.2-3B               32.98    37.12  26.78       38.06        36.00       35.60       0
   Phi-4-mini                 33.85    37.07  28.85       38.10        40.41       32.67       <0.1
   Phi-3.5-mini               28.03    33.76  22.89       30.90        32.21       28.28       0
   BLOOMZ-7B                  29.73    27.75  27.81       33.56        31.83       28.11       11.2
   BLOOMZ-3B                  30.68    27.89  30.25       33.26        32.34       28.39       6.5
   BLOOMZ-1.7B                30.19    28.94  27.70       33.33        37.17       31.43       2.5
   BLOOMZ-1.1B                25.52    23.52  27.21       24.39        26.06       25.63       0.5
Proprietary Models
1  Gemini-3.5-Flash           90.20    97.75  84.98       92.15        92.10       91.43       0.1
2  Gemini-3.1-Flash-Lite      84.56    96.85  74.20       90.01        90.26       86.08       <0.1
3  Claude-Sonnet-4.6          82.91    96.34  72.69       87.36        87.18       86.01       0
   GPT-5.4                    80.81    95.13  69.29       86.35        85.85       84.10       0
   GPT-5.4-mini               73.43    88.34  62.82       77.52        79.08       75.24       0
   Claude-Haiku-4.5           71.40    90.49  58.57       75.86        75.90       74.14       0.1
Urdu-targeted Models
   Qalb-1.0-8B                34.77    38.18  29.99       37.68        40.31       39.56       0
   Alif-1.0-8B                34.72    41.09  25.74       41.40        39.87       41.04       0.6

H = Humanities; SS = Social Sciences; P = Profession; O = Other (General Knowledge). The # column gives the overall rank of the top three models. Gemini-3.5-Flash has the highest score in every accuracy column. Inv %: percentage of unparsable or malformed outputs (lower is better).
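To make the two reported quantities concrete, here is a minimal sketch of how accuracy and Inv % could be computed from raw model outputs. It is not the benchmark's scoring code: the names `parse_choice` and `score` are hypothetical, and it assumes four-option items labeled A to D, a simple regex as the extraction rule, and that unparsable outputs count as wrong answers.

```python
import re
from typing import List, Optional, Tuple

# Assumption: gold answers are single uppercase letters A-D. The actual
# answer-extraction rule used for the leaderboard is not given on this page.
_CHOICE = re.compile(r"\b([ABCD])\b")


def parse_choice(output: str) -> Optional[str]:
    """Return the first standalone uppercase option letter, or None if absent."""
    match = _CHOICE.search(output)
    return match.group(1) if match else None


def score(outputs: List[str], gold: List[str]) -> Tuple[float, float]:
    """Return (accuracy %, invalid %) over a list of model outputs.

    Assumption: an output with no parsable option letter is both
    counted as invalid and treated as an incorrect answer.
    """
    if not gold or len(outputs) != len(gold):
        raise ValueError("outputs and gold must be non-empty and equal in length")
    parsed = [parse_choice(o) for o in outputs]
    correct = sum(p == g for p, g in zip(parsed, gold))
    invalid = sum(p is None for p in parsed)
    n = len(gold)
    return 100.0 * correct / n, 100.0 * invalid / n


# Toy inputs for illustration only; these are not benchmark items.
toy_outputs = ["B", "Answer: C", "I don't know", "(A)"]
toy_gold = ["B", "D", "A", "A"]
accuracy, inv_rate = score(toy_outputs, toy_gold)
print(f"accuracy={accuracy:.2f}%  inv={inv_rate:.2f}%")
```

A parser this simple would misread a sentence that begins with the article "A", so a real pipeline would likely need a stricter pattern, and Urdu-script option labels would need their own handling.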

STEM–Humanities Gap (Urdu Prompt, sorted by STEM accuracy)

Model                   STEM %  Humanities %  Gap (pts)
Gemini-3.5-Flash        97.81   85.31         12.50
DeepSeek-V4-Flash       97.57   67.32         30.25
GPT-5.4                 97.40   74.82         22.58
Gemini-3.1-Flash-Lite   97.09   74.38         22.71
Qwen3.6-35B-A3B         96.32   58.12         38.20
Claude-Sonnet-4.6       96.26   72.69         23.57
Gemma-4-31B-IT          93.86   63.25         30.61
LLaMA-4-Maverick-17B    92.38   63.25         29.13
Claude-Haiku-4.5        91.96   59.31         32.65
Qwen3.6-27B             91.12   55.71         35.41
GPT-5.4-mini            88.25   62.35         25.90
Gemma-4-26B-A4B-IT      87.21   57.73         29.48
LLaMA-4-Scout-17B       85.59   56.55         29.04
LLaMA-3.3-70B           78.39   56.10         22.29
Qwen3-8B                74.37   30.87         43.50
Ministral-3-8B          71.37   45.74         25.63
Gemma-2-9B-IT           69.02   48.08         20.94
Qwen3-4B-Instruct       68.75   43.00         25.75
Ministral-3-3B          57.25   43.07         14.18
Gemma-3-4B-IT           51.79   38.27         13.52
LLaMA-3.1-8B            46.49   37.61         8.88
LLaMA-3.2-3B            37.24   29.32         7.92
Phi-4-mini              37.08   28.70         8.38
Qalb-1.0-8B             36.26   32.72         3.54
Phi-3.5-mini            33.83   27.25         6.58
Alif-1.0-8B             33.27   29.00         4.27
BLOOMZ-7B               29.24   30.88         −1.64
BLOOMZ-1.7B             28.74   28.76         −0.02
BLOOMZ-3B               26.56   27.70         −1.14
BLOOMZ-1.1B             24.53   25.83         −1.30

Sorted by STEM accuracy (Urdu prompt). Gap = STEM accuracy minus Humanities accuracy, in percentage points. Negative gaps for the BLOOMZ models reflect near-random performance in both domains. Urdu-targeted models (Qalb, Alif) show small gaps because their overall accuracy is low, not because they handle both domains well.
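The gap column can be reproduced directly from the two accuracy columns. The sketch below does this for four rows copied from the table above; the helper name `stem_humanities_gap` is illustrative and not part of any released tooling.

```python
# (model, STEM %, Humanities %) under the Urdu prompt, copied from the table above.
rows = [
    ("Qalb-1.0-8B", 36.26, 32.72),
    ("Gemini-3.5-Flash", 97.81, 85.31),
    ("BLOOMZ-7B", 29.24, 30.88),
    ("Qwen3-8B", 74.37, 30.87),
]


def stem_humanities_gap(stem: float, humanities: float) -> float:
    """Gap in percentage points: STEM accuracy minus Humanities accuracy."""
    return round(stem - humanities, 2)


# Sort by STEM accuracy, highest first, to match the table's ordering.
table = sorted(
    ((model, stem, hum, stem_humanities_gap(stem, hum)) for model, stem, hum in rows),
    key=lambda row: row[1],
    reverse=True,
)

for model, stem, hum, gap in table:
    print(f"{model:<18} {stem:6.2f} {hum:6.2f} {gap:+6.2f}")
```

A negative value simply means Humanities accuracy exceeds STEM accuracy, as in the BLOOMZ rows.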

Few-Shot Evaluation — Accuracy (%) ↑

Four open-source models are evaluated in 0-, 1-, 3-, and 5-shot settings using 200 held-out demonstrations. Deltas are in percentage points, relative to the same model's 0-shot score under the same prompt.

English Prompt

Model 0-shot 1-shot 3-shot 5-shot
LLaMA-3.1-8B 43.30 44.93 (+1.63) 46.09 (+2.79) 46.59 (+3.29)
Gemma-3-4B-IT 43.93 45.91 (+1.98) 46.07 (+2.14) 46.27 (+2.34)
Qwen3-8B 50.97 50.23 (−0.74) 52.27 (+1.30) 53.21 (+2.24)
Qwen3-4B-Instruct-2507 50.84 52.56 (+1.72) 53.40 (+2.56) 53.65 (+2.81)
Mean 47.26 48.41 (+1.15) 49.46 (+2.20) 49.93 (+2.67)

Urdu Prompt

Model 0-shot 1-shot 3-shot 5-shot
LLaMA-3.1-8B 43.84 45.08 (+1.24) 46.32 (+2.48) 46.10 (+2.26)
Gemma-3-4B-IT 44.88 46.40 (+1.52) 46.36 (+1.48) 45.99 (+1.11)
Qwen3-8B 48.97 51.86 (+2.89) 53.24 (+4.27) 53.49 (+4.52)
Qwen3-4B-Instruct-2507 51.70 52.06 (+0.36) 52.89 (+1.19) 52.94 (+1.24)
Mean 47.35 48.85 (+1.50) 49.70 (+2.35) 49.63 (+2.28)

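23 of 24 few-shot configurations outperform their zero-shot baseline; the only exception is Qwen3-8B at 1-shot under the English prompt (−0.74). Qwen3-8B under the Urdu prompt shows the largest gain (+4.52 at 5-shot). Despite these gains, all four models remain far below the >25B open-source tier: the best few-shot score (53.65) is well under the lowest zero-shot overall score in that tier (67.82, LLaMA-4-Scout-17B-16E).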
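As a quick check on the summary above, the deltas and the win count can be recomputed from the two few-shot tables. This is a small sketch with the table values copied in; `few_shot_deltas` is a name chosen here for illustration.

```python
SHOTS = (0, 1, 3, 5)

# Accuracy (%) at 0-, 1-, 3-, and 5-shot, copied from the two tables above.
RESULTS = {
    ("English", "LLaMA-3.1-8B"): (43.30, 44.93, 46.09, 46.59),
    ("English", "Gemma-3-4B-IT"): (43.93, 45.91, 46.07, 46.27),
    ("English", "Qwen3-8B"): (50.97, 50.23, 52.27, 53.21),
    ("English", "Qwen3-4B-Instruct-2507"): (50.84, 52.56, 53.40, 53.65),
    ("Urdu", "LLaMA-3.1-8B"): (43.84, 45.08, 46.32, 46.10),
    ("Urdu", "Gemma-3-4B-IT"): (44.88, 46.40, 46.36, 45.99),
    ("Urdu", "Qwen3-8B"): (48.97, 51.86, 53.24, 53.49),
    ("Urdu", "Qwen3-4B-Instruct-2507"): (51.70, 52.06, 52.89, 52.94),
}


def few_shot_deltas(results):
    """Return (prompt, model, k, delta) for every k-shot run with k > 0.

    Each delta is measured against the same model's 0-shot score
    under the same prompt language.
    """
    deltas = []
    for (prompt, model), scores in results.items():
        baseline = scores[0]
        for k, acc in zip(SHOTS[1:], scores[1:]):
            deltas.append((prompt, model, k, round(acc - baseline, 2)))
    return deltas


deltas = few_shot_deltas(RESULTS)
wins = sum(delta > 0 for _, _, _, delta in deltas)
best = max(deltas, key=lambda d: d[3])
losses = [d for d in deltas if d[3] <= 0]

print(f"{wins} of {len(deltas)} few-shot configurations beat their 0-shot baseline")
print("largest gain:", best)
print("non-improving runs:", losses)
```

Run on these values, the script reports 23 of 24 configurations above baseline, with the largest gain for Qwen3-8B under the Urdu prompt at 5-shot.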

MCQ Distribution across Pakistani Examination Levels

UrduMMLU covers four curriculum levels of the Pakistani secondary school system: SSC-I (Grade 9), SSC-II (Grade 10), HSSC-I (Grade 11), and HSSC-II (Grade 12). The stacked bars below show both absolute MCQ counts (left) and within-level domain share (right). SSC-I is the largest level with 11,601 items, where Humanities accounts for 54% — reflecting the heavy literature and language focus in early secondary curricula. As students progress to HSSC, they specialize into science, commerce, or humanities tracks: Social Sciences grows from 11% at SSC-I to 71% at HSSC-II, while Humanities contracts from 54% to 19%. STEM peaks at SSC-II (32%) before students who continue in arts tracks shift out of science subjects.

Figure 8. Distribution of UrduMMLU items across four Pakistani examination levels, grouped by domain. Left: absolute MCQ counts per level. Right: within-level domain share (%). Humanities dominates SSC levels; Social Sciences becomes the majority domain at HSSC-I and HSSC-II as students specialise. This curriculum structure — not a collection artefact — drives the level distribution seen in the benchmark.