Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models

1Hangzhou Institute for Advanced Study, UCAS, 2Zhejiang University, 3Zhejiang Lab

Abstract

Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world. It relies on a nuanced understanding of spatial structures and inter-object relationships, serving as the foundation for complex reasoning and decision-making. To investigate whether current vision-language models (VLMs) exhibit a similar capability, we introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate VLMs’ spatial perception, structural understanding, and reasoning capabilities, while deliberately minimizing reliance on domain-specific knowledge to better isolate and assess general spatial reasoning capability. We conduct a comprehensive evaluation across 24 state-of-the-art VLMs. The results show that even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy and performs particularly poorly on the Order Generation task, with only 30.00% accuracy, far below the 90%+ performance achieved by human participants. This persistent gap underscores the need for continued progress, positioning Jigsaw-Puzzles as a challenging and diagnostic benchmark for advancing spatial reasoning research in VLMs.

Jigsaw-Puzzles Dataset

To investigate whether current vision-language models (VLMs) exhibit human-like spatial reasoning capability, we introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate VLMs’ spatial perception, structural understanding, and reasoning capabilities, while deliberately minimizing reliance on domain-specific knowledge to better isolate and assess general spatial reasoning capability.

Figure: Task examples of Jigsaw-Puzzles. Note: the questions shown are slightly simplified for clarity and brevity, and the blue option indicates the correct answer. Complete task-specific questions can be found in the paper.

Dataset Curation

Our dataset curation pipeline consists of two main stages: data collection and QA generation.

Figure: Overview of the dataset curation pipeline.
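For readers who want a concrete picture of the two stages, the following is a minimal Python sketch of such a pipeline. All names here (QAItem, collect_images, generate_qa, and the callback arguments) are illustrative assumptions, not the authors' actual tooling.

from dataclasses import dataclass

@dataclass
class QAItem:
    image_path: str
    task: str           # one of the five Jigsaw-Puzzles tasks
    question: str
    options: list[str]  # empty for the open-ended Order Generation setting
    answer: str

def collect_images(candidate_paths, passes_quality_check):
    # Stage 1: keep only real-world images judged to have sufficient spatial complexity.
    return [p for p in candidate_paths if passes_quality_check(p)]

def generate_qa(image_paths, build_questions):
    # Stage 2: turn each curated image into task-specific QA items.
    dataset = []
    for path in image_paths:
        dataset.extend(build_questions(path))  # build_questions returns a list of QAItem
    return dataset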

Main Evaluation Results

We evaluate 24 VLMs on Jigsaw-Puzzles, covering a diverse range of model scales and training paradigms. Jigsaw-Puzzles effectively distinguishes VLMs across a spectrum of spatial reasoning capability, from basic understanding to complex multi-step reasoning. As shown in Table 2, substantial room for improvement remains, particularly in multi-step spatial reasoning.

Table 2: Full Evaluation Results of 24 VLMs on Jigsaw-Puzzles. VLMs are grouped into proprietary and open-source categories. Dark Green and Light Green indicate the top-1 and top-2 performance within each group, respectively. Results of reasoning-enhanced models are marked in bold. We also highlight the top three models based on their overall performance, using Dark Blue, Medium Blue, and Light Blue, respectively.
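For the multiple-choice tasks, accuracy reduces to exact-match scoring of the predicted option letter against the gold answer. The snippet below is a minimal sketch under that assumption; it is not the official evaluation script.

def accuracy(predictions, ground_truth):
    # Fraction (in %) of questions whose predicted option letter matches the gold answer.
    assert len(predictions) == len(ground_truth)
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, ground_truth))
    return 100.0 * correct / len(ground_truth)

# Example: two of three questions answered correctly -> 66.67
print(round(accuracy(["A", "c", "B"], ["A", "C", "D"]), 2))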

Human vs Model Performance

To evaluate human performance, we construct a subset called Jigsaw-Puzzles-Lite by sampling 220 images from the full dataset. Three human participants complete all tasks on this subset under the same conditions as the VLMs, without access to any external tools or the internet. Their performance serves as an empirical upper bound for spatial reasoning capability.
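One simple way to draw such a subset is uniform random sampling without replacement, as sketched below. The actual selection procedure for Jigsaw-Puzzles-Lite may differ (for example, it could be stratified by task or scene type), so treat this only as an illustration.

import random

def sample_lite_subset(all_image_ids, k=220, seed=0):
    # Draw k distinct images uniformly at random for the Lite split (assumed strategy).
    rng = random.Random(seed)
    return rng.sample(list(all_image_ids), k)

lite_ids = sample_lite_subset(range(1100))
print(len(lite_ids))  # 220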

Human participants consistently outperform VLMs, achieving an overall accuracy of 96.36%. By comparison, current VLMs perform considerably worse, with even the strongest model, Gemini-2.5-Pro, lagging more than 20 percentage points behind human accuracy across all tasks. The persistent gap between humans and VLMs highlights the demanding nature of Jigsaw-Puzzles and affirms its utility as a robust benchmark for spatial reasoning evaluation.

Table 3: Comparing Top-Performing VLMs with Human Performance on Jigsaw-Puzzles-Lite. Human performance is highlighted in Dark Green. Results of reasoning-enhanced models are marked in bold. The top three models by overall performance are highlighted in Dark Blue, Medium Blue, and Light Blue, respectively.

Order Generation Results

To further evaluate VLMs’ multi-step spatial reasoning beyond the constraints of predefined choices, we introduce the Order Generation task based on Jigsaw-Puzzles-Lite. In this setting, VLMs must directly generate the correct sequence of puzzle pieces without relying on answer options, thereby more authentically simulating open-ended spatial reasoning. Current VLMs consistently struggle with this task: Gemini-2.5-Pro, the best-performing model, achieves only 30.00% accuracy, in stark contrast to the 94.09% achieved by human participants. This finding reveals that, despite exhibiting strong self-correction behavior under option constraints, existing VLMs face considerable challenges in autonomously constructing coherent spatial reasoning chains.
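Since the reported numbers are accuracies, a natural scoring rule is to count a response as correct only when the full generated sequence matches the ground-truth order exactly. The sketch below assumes that rule together with a simple free-form parser; both the parsing and the scoring details are assumptions rather than the paper's implementation.

def parse_order(text):
    # Pull a sequence of piece indices out of a free-form answer such as "3 -> 1 -> 4 -> 2".
    cleaned = text.replace("->", ",").replace(" ", ",")
    return [int(tok) for tok in cleaned.split(",") if tok.strip().isdigit()]

def order_accuracy(responses, gold_orders):
    # A response counts as correct only if the whole generated sequence matches the gold order.
    correct = sum(parse_order(r) == g for r, g in zip(responses, gold_orders))
    return 100.0 * correct / len(gold_orders)

print(order_accuracy(["3 -> 1 -> 4 -> 2"], [[3, 1, 4, 2]]))  # 100.0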


BibTeX

@article{lyu2025jigsaw,
  title={Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models},
  author={Lyu, Zesen and Zhang, Dandan and Ye, Wei and Li, Fangdi and Jiang, Zhihang and Yang, Yao},
  journal={arXiv preprint arXiv:2505.20728},
  year={2025}
}