PRISM-Bench: Where Did the Reasoning Go Wrong?

Puzzle Reasoning with In-Sequence Mistakes: Where Did the Reasoning Go Wrong?

Yusu Qian*1, Cheng Wan*2,3, Chao Jia1, Yinfei Yang1, Qingyu Zhao3, Zhe Gan1
1 Apple  ·  2 Cornell  ·  3 Weill Cornell Medicine


Overview

PRISM-Bench is a puzzle-based visual reasoning benchmark that evaluates multimodal LLMs along two complementary tracks:

  • Answer evaluation: solve the visual puzzle and produce the final answer.
  • Error diagnosis: given a chain-of-thought (CoT) containing a single injected mistake, identify the first incorrect step.

Why this matters: PRISM-Bench decouples answer generation from reasoning verification, exposing where models’ explanations become inconsistent even when final answers look plausible.

Teaser


Overview of PRISM-Bench: solve puzzles and diagnose the first mistake in a corrupted CoT.

What’s in the benchmark

  • 1,044 curated visual puzzles across 6 categories.
  • Each item includes the puzzle image, question, ground-truth answer, a correct CoT, and a single-error CoT (an illustrative record layout is sketched below).
  • Human review ensures the first error is unambiguous.
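
For concreteness, one item can be pictured roughly as the record below. This is a hypothetical layout for illustration only; field names, values, and file paths are assumptions, not the released dataset's actual schema.

# Hypothetical layout of a single PRISM-Bench item (illustrative only;
# consult the released dataset for the actual schema and field names).
example_item = {
    "image": "puzzles/bwb_0137.png",         # path to the puzzle image (made-up)
    "category": "Black & White Blocks",      # one of the six categories
    "question": "Which option completes the pattern?",
    "choices": ["A", "B", "C", "D"],
    "answer": "C",                           # ground-truth final choice
    "correct_cot": ["Step 1: ...", "Step 2: ...", "Step 3: ..."],
    "corrupted_cot": ["Step 1: ...", "Step 2: ...", "Step 3: ..."],  # identical except one step
    "first_error_step": 2,                   # index of the single injected error
}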

Categories

  • Special Patterns (SP)
  • Black & White Blocks (BWB)
  • Spatial Reasoning (SR)
  • Position–Style–Attribute–Count (PSAC)
  • Shape Reasoning (SRO)
  • Text–Letter–Number (TLN)

Intended use

  • Benchmark step-wise verification, not just final answers.
  • Train/evaluate verifiers and critic models.
  • Probe error types and failure modes.

Dataset & Examples


Examples from all six puzzle categories.

Evaluation Protocol

(A) Answer Evaluation

Model sees the puzzle + question and outputs a final choice. Metric: answer accuracy (exact match).
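
A minimal scoring sketch for this track, assuming predictions and references are plain answer strings (this is not the official evaluation harness):

def answer_accuracy(predictions, references):
    """Exact-match accuracy over final answers, with light normalization."""
    assert len(predictions) == len(references)
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references)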

(B) Error Diagnosis

Model sees the puzzle + question + a multi-step CoT with one error and must return the first incorrect step. Metric: error-detection accuracy.
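
The second track can be scored analogously by comparing the step index the model names against the annotated first-error step. The output parsing below is an assumption about response format, not the paper's official protocol:

import re

def parse_first_error_step(model_output):
    """Extract the first integer in the model's reply, e.g. 'Step 3 is wrong.' -> 3."""
    match = re.search(r"\d+", model_output)
    return int(match.group()) if match else None

def error_detection_accuracy(model_outputs, gold_first_error_steps):
    """Fraction of items whose predicted first-error step matches the annotation."""
    hits = sum(parse_first_error_step(out) == gold
               for out, gold in zip(model_outputs, gold_first_error_steps))
    return hits / len(gold_first_error_steps)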

Error Taxonomy & Statistics

Distribution of error types

24 corruption types grouped into 7 families (e.g., AFM, CPE, LDE, SPE, VSM).

Histograms: total steps per reasoning chain; index of the first error step

Distributions of reasoning lengths and first-error positions.

MLLM Leaderboards

First-Error Detection leaderboard (overall + per-category accuracy).
VQA leaderboard (macro average, overall + per-category).

VQA vs. First-Error Detection

Correlation between VQA and first-error detection rankings

Only moderate correlation between VQA macro accuracy and first-error detection — the two tracks capture complementary skills.
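
One way to quantify this relationship is a rank correlation over per-model scores on the two tracks. The sketch below assumes Spearman's rho and uses placeholder numbers, not figures reported in the paper:

from scipy.stats import spearmanr

# Per-model accuracies on the two tracks, aligned by model.
# Values are placeholders for illustration only.
vqa_scores = [0.72, 0.65, 0.58, 0.51]
detection_scores = [0.41, 0.44, 0.29, 0.33]

rho, p_value = spearmanr(vqa_scores, detection_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")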

BibTeX

@article{qian2025prism,
  title={PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection},
  author={Yusu Qian and Cheng Wan and Chao Jia and Yinfei Yang and Qingyu Zhao and Zhe Gan},
  journal={arXiv preprint},
  year={2025}
}