PRISM-Bench: Where Did the Reasoning Go Wrong?
Puzzle Reasoning with In-Sequence Mistakes
Overview
PRISM-Bench is a puzzle-based visual reasoning benchmark that evaluates multimodal LLMs along two complementary tracks:
- Answer Prediction (VQA-style) — choose the correct option for a structured visual puzzle.
- First-Error Detection — given a step-by-step chain-of-thought (CoT) with exactly one injected mistake, identify the first incorrect step.
Why this matters: PRISM-Bench decouples answer generation from reasoning verification, exposing where models’ explanations become inconsistent even when final answers look plausible.
Teaser
Overview of PRISM-Bench: solve puzzles and diagnose the first mistake in a corrupted CoT.
What’s in the benchmark
- 1,044 curated visual puzzles across 6 categories.
- Each item includes image, question, ground-truth answer, correct CoT, and a single-error CoT.
- Human review ensures the first error is unambiguous.
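The fields above might be bundled into a record like the following sketch. The field names and file paths are illustrative assumptions, not the benchmark's released schema:

```python
# Hypothetical layout of one PRISM-Bench item (field names are assumptions;
# consult the released dataset for the actual schema).
item = {
    "image": "puzzles/sp_0001.png",      # path to the puzzle image (illustrative)
    "category": "SP",                    # one of SP, BWB, SR, PSAC, SRO, TLN
    "question": "Which option completes the pattern?",
    "options": ["A", "B", "C", "D"],
    "answer": "C",                       # ground-truth option
    "cot_correct": ["Step 1: ...", "Step 2: ...", "Step 3: ..."],
    "cot_corrupted": ["Step 1: ...", "Step 2: ...", "Step 3: ..."],
    "first_error_step": 2,               # 1-indexed step of the single injected mistake
}
```

A verifier built on this layout would consume `image`, `question`, and `cot_corrupted`, and be scored against `first_error_step`.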
Categories
- Special Patterns (SP)
- Black & White Blocks (BWB)
- Spatial Reasoning (SR)
- Position–Style–Attribute–Count (PSAC)
- Shape Reasoning (SRO)
- Text–Letter–Number (TLN)
Intended use
- Benchmark step-wise verification, not just final answers.
- Train/evaluate verifiers and critic models.
- Probe error types and failure modes.
Dataset & Examples
Examples from all six puzzle categories.
Evaluation Protocol
(A) Answer Evaluation
The model sees the puzzle and question and outputs a final choice. Metric: answer accuracy (exact match against the ground-truth option).
(B) Error Diagnosis
The model sees the puzzle, the question, and a multi-step CoT containing exactly one injected error, and must return the index of the first incorrect step. Metric: error-detection accuracy (exact match against the annotated error step).
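Both metrics reduce to exact-match accuracy over per-item predictions. A minimal sketch (the function names and toy predictions below are illustrative, not part of the official evaluation code):

```python
def answer_accuracy(preds, golds):
    """Track A: exact-match accuracy over final answer choices."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def error_detection_accuracy(pred_steps, gold_steps):
    """Track B: fraction of items where the predicted first-error
    step index exactly matches the annotated one."""
    assert len(pred_steps) == len(gold_steps)
    return sum(p == g for p, g in zip(pred_steps, gold_steps)) / len(gold_steps)

# Toy usage with made-up predictions over three items:
acc = answer_accuracy(["A", "C", "B"], ["A", "B", "B"])       # 2/3
err = error_detection_accuracy([2, 4, 1], [2, 3, 1])          # 2/3
```

Note that error-detection accuracy gives no partial credit: flagging a step adjacent to the true first error counts as a miss.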
Error Taxonomy & Statistics
24 corruption types grouped into 7 families (e.g., AFM, CPE, LDE, SPE, VSM).
Distributions of reasoning lengths and first-error positions.
MLLMs Leaderboards
VQA vs. First-Error Detection
Answer accuracy and first-error detection accuracy are only moderately correlated, indicating that the two tracks measure complementary skills.
BibTeX
@inproceedings{qian2025prism,
title={PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection},
author={Yusu Qian and Cheng Wan and Chao Jia and Yinfei Yang and Qingyu Zhao and Zhe Gan},
booktitle={arXiv preprint},
year={2025}
}