PRISM-Bench: Where Did the Reasoning Go Wrong?

Puzzle Reasoning with In-Sequence Mistakes: Where Did the Reasoning Go Wrong?

Yusu Qian*1, Cheng Wan*2,3, Chao Jia1, Yinfei Yang1, Qingyu Zhao3, Zhe Gan1
1 Apple  ·  2 Cornell  ·  3 Weill Cornell Medicine


Overview

PRISM-Bench is a puzzle-based visual reasoning benchmark that evaluates multimodal LLMs along two complementary tracks:

  • Answer evaluation: solve the visual puzzle and produce the final answer.
  • Error diagnosis: given a chain-of-thought (CoT) containing a single injected mistake, identify the first incorrect step.

Why this matters: PRISM-Bench decouples answer generation from reasoning verification, exposing where models’ explanations become inconsistent even when final answers look plausible.

Teaser


Overview of PRISM-Bench: solve puzzles and diagnose the first mistake in a corrupted CoT.

What’s in the benchmark

  • 1,044 curated visual puzzles across 6 categories.
  • Each item includes the puzzle image, question, ground-truth answer, a correct CoT, and a single-error CoT (an illustrative record layout is sketched below).
  • Human review ensures the first error is unambiguous.
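
For concreteness, one item can be pictured roughly as the record below. This is a hypothetical layout for illustration only; field names, values, and file paths are assumptions, not the released dataset's actual schema.

# Hypothetical layout of a single PRISM-Bench item (illustrative only;
# consult the released dataset for the actual schema and field names).
example_item = {
    "image": "puzzles/bwb_0137.png",         # path to the puzzle image (made-up)
    "category": "Black & White Blocks",      # one of the six categories
    "question": "Which option completes the pattern?",
    "choices": ["A", "B", "C", "D"],
    "answer": "C",                           # ground-truth final choice
    "correct_cot": ["Step 1: ...", "Step 2: ...", "Step 3: ..."],
    "corrupted_cot": ["Step 1: ...", "Step 2: ...", "Step 3: ..."],  # identical except one step
    "first_error_step": 2,                   # index of the single injected error
}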

Categories

  • Special Patterns (SP)
  • Black & White Blocks (BWB)
  • Spatial Reasoning (SR)
  • Position–Style–Attribute–Count (PSAC)
  • Shape Reasoning (SRO)
  • Text–Letter–Number (TLN)

Intended use

  • Benchmark step-wise verification, not just final answers.
  • Train/evaluate verifiers and critic models.
  • Probe error types and failure modes.

Dataset & Examples


Examples from all six puzzle categories.

Evaluation Protocol

(A) Answer Evaluation

Model sees the puzzle + question and outputs a final choice. Metric: answer accuracy (exact match).
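
A minimal scoring sketch for this track, assuming predictions and references are plain answer strings (this is not the official evaluation harness):

def answer_accuracy(predictions, references):
    """Exact-match accuracy over final answers, with light normalization."""
    assert len(predictions) == len(references)
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references)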

(B) Error Diagnosis

Model sees the puzzle + question + a multi-step CoT with one error and must return the first incorrect step. Metric: error-detection accuracy.
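
The second track can be scored analogously by comparing the step index the model names against the annotated first-error step. The output parsing below is an assumption about response format, not the paper's official protocol:

import re

def parse_first_error_step(model_output):
    """Extract the first integer in the model's reply, e.g. 'Step 3 is wrong.' -> 3."""
    match = re.search(r"\d+", model_output)
    return int(match.group()) if match else None

def error_detection_accuracy(model_outputs, gold_first_error_steps):
    """Fraction of items whose predicted first-error step matches the annotation."""
    hits = sum(parse_first_error_step(out) == gold
               for out, gold in zip(model_outputs, gold_first_error_steps))
    return hits / len(gold_first_error_steps)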

Error Taxonomy & Statistics

Distribution of error types

24 corruption types grouped into 7 families (e.g., AFM, CPE, LDE, SPE, VSM).

Histograms: total steps per reasoning chain; index of the first error step

Distributions of reasoning lengths and first-error positions.

MLLM Leaderboards

First-Error Detection leaderboard (overall + per-category accuracy).
VQA leaderboard (macro average, overall + per-category).

VQA vs. First-Error Detection

Correlation between VQA and first-error detection rankings

Only moderate correlation between VQA macro accuracy and first-error detection — the two tracks capture complementary skills.
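
One way to quantify this relationship is a rank correlation over per-model scores on the two tracks. The sketch below assumes Spearman's rho and uses placeholder numbers, not figures reported in the paper:

from scipy.stats import spearmanr

# Per-model accuracies on the two tracks, aligned by model.
# Values are placeholders for illustration only.
vqa_scores = [0.72, 0.65, 0.58, 0.51]
detection_scores = [0.41, 0.44, 0.29, 0.33]

rho, p_value = spearmanr(vqa_scores, detection_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")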

BibTeX

@article{qian2025prism,
  title={PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection},
  author={Yusu Qian and Cheng Wan and Chao Jia and Yinfei Yang and Qingyu Zhao and Zhe Gan},
  journal={arXiv preprint},
  year={2025}
}