30 models · 60 zero-shot evaluations · English & Urdu prompts

| # | Model | Overall | STEM | Humanities | Social Sci. | Profession | Other (GK) | Inv % ↓ |
|---|---|---|---|---|---|---|---|---|
| | Open-source Models (> 25B Parameters) | | | | | | | |
| | DeepSeek-V4-Flash | 82.41 | 97.49 | 68.71 | 90.05 | 89.85 | 86.44 | 0.1 |
| | LLaMA-4-Maverick-17B-128E | 75.47 | 91.98 | 63.27 | 80.11 | 81.64 | 80.66 | <0.1 |
| | Gemma-4-31B-IT | 76.38 | 93.33 | 63.62 | 81.70 | 83.69 | 79.49 | <0.1 |
| | Qwen3.6-27B | 69.22 | 90.46 | 55.95 | 73.19 | 74.77 | 69.60 | 0 |
| | Qwen3.6-35B-A3B | 69.39 | 88.77 | 56.81 | 73.52 | 75.49 | 69.82 | 0 |
| | LLaMA-3.3-70B | 68.56 | 81.32 | 56.30 | 75.45 | 76.51 | 73.77 | 0 |
| | LLaMA-4-Scout-17B-16E | 67.82 | 84.14 | 57.03 | 71.20 | 74.15 | 69.52 | 0 |
| | Gemma-4-26B-A4B-IT | 69.23 | 85.92 | 57.43 | 73.83 | 75.49 | 70.55 | <0.1 |
| | Open-source Models (≤ 25B Parameters) | | | | | | | |
| | Ministral-3-8B | 54.77 | 67.81 | 45.27 | 58.99 | 59.59 | 54.43 | 0 |
| | Gemma-2-9B-IT | 55.28 | 67.08 | 47.21 | 58.40 | 60.00 | 54.58 | 0 |
| | Qwen3-8B | 50.97 | 70.70 | 39.21 | 53.99 | 56.62 | 50.18 | 0 |
| | Qwen3-4B-Instruct-2507 | 50.84 | 68.61 | 42.33 | 51.48 | 52.62 | 47.84 | 0 |
| | Ministral-3-3B | 47.90 | 55.04 | 43.69 | 48.99 | 49.08 | 47.91 | <0.1 |
| | Gemma-3-4B-IT | 43.93 | 49.97 | 37.87 | 47.68 | 47.79 | 45.57 | 0 |
| | LLaMA-3.1-8B | 43.30 | 46.96 | 36.38 | 49.66 | 48.51 | 44.54 | 0 |
| | LLaMA-3.2-3B | 32.98 | 37.12 | 26.78 | 38.06 | 36.00 | 35.60 | 0 |
| | Phi-4-mini | 33.85 | 37.07 | 28.85 | 38.10 | 40.41 | 32.67 | <0.1 |
| | Phi-3.5-mini | 28.03 | 33.76 | 22.89 | 30.90 | 32.21 | 28.28 | 0 |
| | BLOOMZ-7B | 29.73 | 27.75 | 27.81 | 33.56 | 31.83 | 28.11 | 11.2 |
| | BLOOMZ-3B | 30.68 | 27.89 | 30.25 | 33.26 | 32.34 | 28.39 | 6.5 |
| | BLOOMZ-1.7B | 30.19 | 28.94 | 27.70 | 33.33 | 37.17 | 31.43 | 2.5 |
| | BLOOMZ-1.1B | 25.52 | 23.52 | 27.21 | 24.39 | 26.06 | 25.63 | 0.5 |
| | Proprietary Models | | | | | | | |
| 1 | Gemini-3.5-Flash | 90.20 | 97.75 | 84.98 | 92.15 | 92.10 | 91.43 | 0.1 |
| 2 | Gemini-3.1-Flash-Lite | 84.56 | 96.85 | 74.20 | 90.01 | 90.26 | 86.08 | <0.1 |
| 3 | Claude-Sonnet-4.6 | 82.91 | 96.34 | 72.69 | 87.36 | 87.18 | 86.01 | 0 |
| | GPT-5.4 | 80.81 | 95.13 | 69.29 | 86.35 | 85.85 | 84.10 | 0 |
| | GPT-5.4-mini | 73.43 | 88.34 | 62.82 | 77.52 | 79.08 | 75.24 | 0 |
| | Claude-Haiku-4.5 | 71.40 | 90.49 | 58.57 | 75.86 | 75.90 | 74.14 | 0.1 |
| | Urdu-targeted Models | | | | | | | |
| | Qalb-1.0-8B | 34.77 | 38.18 | 29.99 | 37.68 | 40.31 | 39.56 | 0 |
| | Alif-1.0-8B | 34.72 | 41.09 | 25.74 | 41.40 | 39.87 | 41.04 | 0.6 |

H = Humanities; SS = Social Sciences; P = Profession; O = Other (General Knowledge). Boxed/bold values mark the best score in each column. Inv %: percentage of unparsable or malformed outputs (lower is better).
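
As a rough illustration of how accuracy and Inv % relate, the sketch below scores a list of raw model outputs. The answer format (a single option letter A to D), the `parse_choice` and `score` helpers, and the choice to count unparsable outputs as incorrect are all assumptions made for this example; the benchmark's actual parser is not described here.

```python
import re
from typing import Optional, Sequence

# Assumption: a valid output is exactly one option letter A-D, optionally
# surrounded by whitespace or punctuation such as "(c)" or "B.".
_ANSWER = re.compile(r"^\W*([A-D])\W*$")


def parse_choice(output: str) -> Optional[str]:
    """Return the option letter, or None if the output is unparsable."""
    match = _ANSWER.match(output.strip().upper())
    return match.group(1) if match else None


def score(outputs: Sequence[str], gold: Sequence[str]) -> dict:
    """Compute accuracy and invalid-output rate, both in percent.

    Assumption: an unparsable output counts as an incorrect answer.
    """
    parsed = [parse_choice(o) for o in outputs]
    n = len(gold)
    correct = sum(p == g for p, g in zip(parsed, gold))
    invalid = sum(p is None for p in parsed)
    return {"accuracy": 100 * correct / n, "inv_pct": 100 * invalid / n}


# Hypothetical outputs, for illustration only.
result = score(["B", "(c)", "I am not sure", "A"], ["B", "C", "D", "B"])
print(result)
```

Under these assumptions a model with a high Inv % loses accuracy for formatting reasons as well as for wrong answers.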
| Model | STEM % | Humanities % | Gap (pts) |
|---|---|---|---|
| Gemini-3.5-Flash | 97.81 | 85.31 | 12.50 |
| DeepSeek-V4-Flash | 97.57 | 67.32 | 30.25 |
| GPT-5.4 | 97.40 | 74.82 | 22.58 |
| Gemini-3.1-Flash-Lite | 97.09 | 74.38 | 22.71 |
| Qwen3.6-35B-A3B | 96.32 | 58.12 | 38.20 |
| Claude-Sonnet-4.6 | 96.26 | 72.69 | 23.57 |
| Gemma-4-31B-IT | 93.86 | 63.25 | 30.61 |
| LLaMA-4-Maverick-17B | 92.38 | 63.25 | 29.13 |
| Claude-Haiku-4.5 | 91.96 | 59.31 | 32.65 |
| Qwen3.6-27B | 91.12 | 55.71 | 35.41 |
| GPT-5.4-mini | 88.25 | 62.35 | 25.90 |
| Gemma-4-26B-A4B-IT | 87.21 | 57.73 | 29.48 |
| LLaMA-4-Scout-17B | 85.59 | 56.55 | 29.04 |
| LLaMA-3.3-70B | 78.39 | 56.10 | 22.29 |
| Qwen3-8B | 74.37 | 30.87 | 43.50 |
| Ministral-3-8B | 71.37 | 45.74 | 25.63 |
| Gemma-2-9B-IT | 69.02 | 48.08 | 20.94 |
| Qwen3-4B-Instruct | 68.75 | 43.00 | 25.75 |
| Ministral-3-3B | 57.25 | 43.07 | 14.18 |
| Gemma-3-4B-IT | 51.79 | 38.27 | 13.52 |
| LLaMA-3.1-8B | 46.49 | 37.61 | 8.88 |
| LLaMA-3.2-3B | 37.24 | 29.32 | 7.92 |
| Phi-4-mini | 37.08 | 28.70 | 8.38 |
| Qalb-1.0-8B | 36.26 | 32.72 | 3.54 |
| Phi-3.5-mini | 33.83 | 27.25 | 6.58 |
| Alif-1.0-8B | 33.27 | 29.00 | 4.27 |
| BLOOMZ-7B | 29.24 | 30.88 | −1.64 |
| BLOOMZ-1.7B | 28.74 | 28.76 | −0.02 |
| BLOOMZ-3B | 26.56 | 27.70 | −1.14 |
| BLOOMZ-1.1B | 24.53 | 25.83 | −1.30 |
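
Sorted by STEM accuracy under the Urdu prompt. Gap is STEM accuracy minus Humanities accuracy, in percentage points. The negative gaps for the BLOOMZ models reflect near-random performance in both domains. The Urdu-targeted models (Qalb, Alif) show small gaps that stem from low overall ability rather than from balanced strength across the two domains.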
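
The Gap column can be reproduced from the other two columns as STEM accuracy minus Humanities accuracy. The sketch below does this for four rows copied from the table; the `stem_humanities_gap` helper name is introduced here for illustration.

```python
# (model, STEM %, Humanities %) under the Urdu prompt, copied from the table.
ROWS = [
    ("LLaMA-3.1-8B", 46.49, 37.61),
    ("BLOOMZ-7B", 29.24, 30.88),
    ("Gemini-3.5-Flash", 97.81, 85.31),
    ("Qwen3-8B", 74.37, 30.87),
]


def stem_humanities_gap(stem: float, humanities: float) -> float:
    """Gap in percentage points; negative when Humanities is higher."""
    return round(stem - humanities, 2)


# Sort by STEM accuracy, highest first, to match the table's ordering.
ranked = sorted(ROWS, key=lambda row: row[1], reverse=True)
gaps = {name: stem_humanities_gap(stem, hum) for name, stem, hum in ranked}

for name, gap in gaps.items():
    print(f"{name:<20} {gap:+.2f}")
```

A gap near zero is ambiguous on its own, so it should be read alongside the absolute scores: it can mean balanced strength or uniformly low accuracy.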
Four open-source models were evaluated at 0-, 1-, 3-, and 5-shot using 200 held-out demonstrations. Deltas are relative to the 0-shot baseline of the same model under the same prompt.

English prompt:

| Model | 0-shot | 1-shot | 3-shot | 5-shot |
|---|---|---|---|---|
| LLaMA-3.1-8B | 43.30 | 44.93 (+1.63) | 46.09 (+2.79) | 46.59 (+3.29) |
| Gemma-3-4B-IT | 43.93 | 45.91 (+1.98) | 46.07 (+2.14) | 46.27 (+2.34) |
| Qwen3-8B | 50.97 | 50.23 (−0.74) | 52.27 (+1.30) | 53.21 (+2.24) |
| Qwen3-4B-Instruct-2507 | 50.84 | 52.56 (+1.72) | 53.40 (+2.56) | 53.65 (+2.81) |
| Mean | 47.26 | 48.41 (+1.15) | 49.46 (+2.20) | 49.93 (+2.67) |

Urdu prompt:

| Model | 0-shot | 1-shot | 3-shot | 5-shot |
|---|---|---|---|---|
| LLaMA-3.1-8B | 43.84 | 45.08 (+1.24) | 46.32 (+2.48) | 46.10 (+2.26) |
| Gemma-3-4B-IT | 44.88 | 46.40 (+1.52) | 46.36 (+1.48) | 45.99 (+1.11) |
| Qwen3-8B | 48.97 | 51.86 (+2.89) | 53.24 (+4.27) | 53.49 (+4.52) |
| Qwen3-4B-Instruct-2507 | 51.70 | 52.06 (+0.36) | 52.89 (+1.19) | 52.94 (+1.24) |
| Mean | 47.35 | 48.85 (+1.50) | 49.70 (+2.35) | 49.63 (+2.28) |

23 of 24 few-shot configurations outperform their zero-shot baseline; the one exception is Qwen3-8B at 1-shot (−0.74). Qwen3-8B under the Urdu prompt shows the largest gain (+4.52 at 5-shot). Despite these gains, all four models remain far below the > 25B open-source tier.
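
The deltas, the Mean rows, and the 23-of-24 count can be recomputed from the table values. The sketch below assumes the first few-shot table is the English prompt and the second is the Urdu prompt; the helper names are introduced here for illustration.

```python
from statistics import mean

SHOTS = (0, 1, 3, 5)

# Accuracy (%) at 0-, 1-, 3-, and 5-shot, copied from the two tables.
ENGLISH = {
    "LLaMA-3.1-8B": [43.30, 44.93, 46.09, 46.59],
    "Gemma-3-4B-IT": [43.93, 45.91, 46.07, 46.27],
    "Qwen3-8B": [50.97, 50.23, 52.27, 53.21],
    "Qwen3-4B-Instruct-2507": [50.84, 52.56, 53.40, 53.65],
}
URDU = {
    "LLaMA-3.1-8B": [43.84, 45.08, 46.32, 46.10],
    "Gemma-3-4B-IT": [44.88, 46.40, 46.36, 45.99],
    "Qwen3-8B": [48.97, 51.86, 53.24, 53.49],
    "Qwen3-4B-Instruct-2507": [51.70, 52.06, 52.89, 52.94],
}


def deltas(scores):
    """Gain of each few-shot setting over the model's own 0-shot score."""
    return [round(s - scores[0], 2) for s in scores[1:]]


def mean_row(table):
    """Column-wise mean across models, as in the Mean row."""
    return [round(mean(col), 2) for col in zip(*table.values())]


def improved_count(*tables):
    """Return (settings that beat 0-shot, total few-shot settings)."""
    all_deltas = [d for t in tables for s in t.values() for d in deltas(s)]
    return sum(d > 0 for d in all_deltas), len(all_deltas)


def largest_gain(table):
    """Return (delta, model, shot count) for the biggest gain in a table."""
    return max(
        (d, model, shot)
        for model, scores in table.items()
        for d, shot in zip(deltas(scores), SHOTS[1:])
    )


print(mean_row(ENGLISH), mean_row(URDU))
print(improved_count(ENGLISH, URDU))
print(largest_gain(URDU))
```

Taking each model's own 0-shot score as the baseline keeps the comparison within a model and prompt, so differences in absolute ability between models do not leak into the deltas.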
UrduMMLU covers four curriculum levels of the Pakistani secondary school system: SSC-I (Grade 9), SSC-II (Grade 10), HSSC-I (Grade 11), and HSSC-II (Grade 12). The stacked bars below show both absolute MCQ counts (left) and within-level domain share (right). SSC-I is the largest level with 11,601 items, of which Humanities accounts for 54%, reflecting the heavy literature and language focus of early secondary curricula. At HSSC, students specialise into science, commerce, or humanities tracks: Social Sciences grows from 11% at SSC-I to 71% at HSSC-II, while Humanities contracts from 54% to 19%. STEM peaks at SSC-II (32%), after which students who continue in arts tracks drop science subjects.
Figure 8. Distribution of UrduMMLU items across four Pakistani examination levels, grouped by domain. Left: absolute MCQ counts per level. Right: within-level domain share (%). Humanities dominates SSC levels; Social Sciences becomes the majority domain at HSSC-I and HSSC-II as students specialise. This curriculum structure — not a collection artefact — drives the level distribution seen in the benchmark.