[Figure: Modality Gap: Does Image Input Help? Axis: Δ Accuracy (pp) = VLM-TEXT avg − VLM avg, averaged over HTML / LaTeX / Markdown. Legend: Δ > 0 (text is better); Δ ≤ 0 (image helps).]
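
The gap defined in the figure can be reproduced with a few lines of arithmetic. The sketch below is a minimal illustration, not the authors' evaluation code: the function names `modality_gap` and `interpret` are hypothetical, the accuracy values are made-up placeholders rather than reported results, and it assumes that both the VLM-TEXT and the VLM averages are taken over the same three formats (HTML, LaTeX, Markdown).

```python
from statistics import mean


def modality_gap(vlm_text_acc, vlm_acc):
    """Delta accuracy in percentage points: VLM-TEXT avg minus VLM avg.

    Both arguments are per-format accuracies in percent. Assumption: each
    side is averaged over the same formats (HTML, LaTeX, Markdown).
    """
    return mean(vlm_text_acc) - mean(vlm_acc)


def interpret(delta):
    """Apply the figure's legend: delta > 0 favors text, delta <= 0 favors image."""
    return "text is better" if delta > 0 else "image helps"


# Hypothetical per-format accuracies (percent), for illustration only.
text_scores = [70.0, 72.0, 74.0]   # VLM-TEXT on HTML, LaTeX, Markdown
image_scores = [60.0, 65.0, 70.0]  # VLM (image input) on the same tables

delta = modality_gap(text_scores, image_scores)
print(f"delta = {delta:+.1f} pp -> {interpret(delta)}")
```

Note that a gap of exactly zero falls on the "image helps" side, matching the Δ ≤ 0 convention in the legend.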