Measuring Spatial Intelligence in LLMs

Today we’re open-sourcing Spatial Benchmark, a benchmark for measuring how well language models reason about 3D space. The benchmark, dataset, and evaluation code are available on GitHub, and you can browse the latest detailed results on our results summary page. It evaluates the kinds of spatial reasoning that matter in AR, XR, simulation, games, and embodied tools: placement, transforms, visibility, hierarchy, layout, and vector-based reasoning.

Large language models are starting to look genuinely useful for spatial computing. That is exciting for us at Specs, because the kinds of applications we care about are fundamentally spatial. We built Spatial Benchmark because we saw a gap in the evaluation landscape for this space. Existing benchmarks tell us a lot about general reasoning, coding, and language fluency but much less about the 3D reasoning that matters in XR, games, simulation, scene generation, and embodied intelligence. Our goal was to create a benchmark that helps us track progress over time and choose models that are genuinely useful for spatial work.

Along the way, we learned two things that can both be true at once. The current models are more capable than we expected, some of them scoring above 90% in accuracy. At the same time, there are still clear gaps, especially around 3D engine-specific conventions and engine-native spatial code patterns. Models often fail not because they cannot do arithmetic, but because they get handedness wrong, reverse a cross product, place something at the wrong depth, over-edit a layout, or emit code that is conceptually plausible but not aligned with how real-time 3D engines expect these operations to be expressed.

That combination is exactly why this area is interesting right now. Spatial intelligence is no longer absent. But it is also not solved yet. It is emerging as a distinct frontier.

Summary of Results

On the latest run of our current 80-question public benchmark, the strongest models remained highly reliable on core spatial reasoning categories. Nine of the sixteen completed models were perfect on Relational & Topological reasoning, and thirteen were either perfect or near-perfect. Constraint Satisfaction was similarly strong: nine models were perfect, and fifteen of sixteen were either perfect or within one question of perfect. The overall leaderboard was led by GPT-5.5 at 93.8%, followed by Gemini 3 Flash Preview at 92.5% and Kimi K2.6 at 90.0%. Claude Fable 5 and Gemini 3.5 Flash both reached 88.8%, with Claude Opus 4.8 close behind at 87.5%, suggesting that strong spatial performance is now emerging across both open and closed model families.

We chose a mix of models that represent different segments of the LLM space: top and mid-tier frontier models, open models served through Fireworks, and Gemma 4 E4B as an example of a smaller deployable model. We will continue rerunning the benchmark periodically as new models become available.

Methodology

Spatial Benchmark is built as a single Inspect AI task with a custom typed scorer. The benchmark currently contains 80 questions across 7 top-level categories: Coordinate & Geometric Math, Transformations & Perspective, Constraint Satisfaction, Spatial Arrangement & Layout, Relational & Topological, Hierarchical Structuring, and Linear Algebra & Vector Methods.

The benchmark is mixed-format. Some questions ask for a scalar, a boolean, a 3D coordinate, a list of coordinates, or a JSON object. Others inject full scene context into the prompt. We use bundled scene exports such as sample-scene.json and sample-ui-scene.json, so the model has to reason over realistic scene structure rather than an abstract toy description.

We have intentionally kept the benchmark focused on spatial reasoning itself, rather than on tool-calling capabilities or features specific to particular engines (e.g., Lens Studio or Unity), so that it stays generally applicable and useful.

In the current version, 10 questions use bundled scene JSON as runtime context and 11 questions require a structured JSON DSL rather than a directly computed answer. We wanted the benchmark to cover both "What is the answer?" and "Can the model express the solution in a compact, engine-like sequence of 3D vector utility operations?"

The most important methodological choice, though, is the JSON DSL used for vector-method questions. Instead of only asking for a computed answer, these prompts require the model to express the solution as a small sequence of reusable spatial ops. The scorer then parses the structured spatial ops, verifies that the required ops are present, and validates the result against bundled test cases.

This matters because it pushes the benchmark closer to real spatial tooling and real-time engine programming style. In production 3D code, developers do not usually hand-expand every calculation inline. They rely on utility functions and math types such as vec3, quat, normalization helpers, projection helpers, and angle utilities. It is not enough for a model to vaguely say "take the cross product" or "project onto the rail." We want to know whether it can express the operation cleanly, efficiently, and in a form that mirrors good engine practice.
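The three scorer steps (parse, check required ops, validate against test cases) can be sketched roughly as follows. This is only an illustration: the DSL shape (a JSON list of steps with `op`, `args`, and `out` fields), the op names, and the `score_vector_method` helper are all hypothetical and are not taken from the benchmark's actual schema.

```python
import json
import math

# Hypothetical op table; the benchmark's real op set may differ.
def _sub(a, b): return [x - y for x, y in zip(a, b)]
def _dot(a, b): return sum(x * y for x, y in zip(a, b))
def _scale(a, s): return [x * s for x in a]
def _normalize(a):
    n = math.sqrt(_dot(a, a))
    return [x / n for x in a]

OPS = {"sub": _sub, "dot": _dot, "scale": _scale, "normalize": _normalize}

def run_program(steps, inputs):
    """Evaluate a list of {"op", "args", "out"} steps and return the last result."""
    env = dict(inputs)
    result = None
    for step in steps:
        # String arguments name inputs or earlier results; literals pass through.
        args = [env[a] if isinstance(a, str) else a for a in step["args"]]
        result = OPS[step["op"]](*args)
        env[step["out"]] = result
    return result

def score_vector_method(raw_json, required_ops, test_cases, tol=1e-6):
    """Return (passed, reason) for one model response."""
    # 1. Parse the structured ops.
    try:
        steps = json.loads(raw_json)
    except json.JSONDecodeError:
        return False, "unparseable JSON"
    try:
        # 2. Verify that the required ops are present.
        missing = set(required_ops) - {step["op"] for step in steps}
        if missing:
            return False, f"missing required ops: {sorted(missing)}"
        # 3. Validate the result against the test cases.
        for inputs, expected in test_cases:
            got = run_program(steps, inputs)
            if len(got) != len(expected) or any(
                abs(g - e) > tol for g, e in zip(got, expected)
            ):
                return False, "wrong result"
    except (KeyError, TypeError, ZeroDivisionError):
        return False, "invalid program"
    return True, "ok"

# Usage: "unit direction from a to b", where normalize is a required op.
cases = [({"a": [0, 0, 0], "b": [0, 0, 2]}, [0.0, 0.0, 1.0])]
good = json.dumps([
    {"op": "sub", "args": ["b", "a"], "out": "d"},
    {"op": "normalize", "args": ["d"], "out": "dir"},
])
print(score_vector_method(good, {"normalize"}, cases))
```

Under this sketch, a response that computes the right direction but skips `normalize` fails at step 2 before any test case runs, which is the kind of "missing required op" failure a scorer of this shape would report.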

We also score with typed validators rather than a single string comparison. Depending on the question, the scorer may evaluate:

  • exact booleans

  • floating-point answers with tolerance

  • vector equality

  • JSON object structure

  • constrained coordinate ranges

  • layout validity

  • structured vector-method correctness

This gives us a more realistic view of model capability than generic text matching would.
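As a rough illustration of what typed validation can look like, the sketch below dispatches on an answer type instead of comparing strings. The function names, the `VALIDATORS` table, and the tolerance values are assumptions made for this example, not the benchmark's actual scorer.

```python
import math

def check_bool(answer, expected):
    # Exact match; values such as 1 or "true" are rejected as non-booleans.
    return isinstance(answer, bool) and answer == expected

def check_float(answer, expected, tol=1e-3):
    # Floating-point answer within an absolute tolerance (assumed value).
    return math.isclose(answer, expected, abs_tol=tol)

def check_vector(answer, expected, tol=1e-3):
    # Component-wise equality within tolerance; lengths must match.
    return len(answer) == len(expected) and all(
        math.isclose(a, e, abs_tol=tol) for a, e in zip(answer, expected)
    )

def check_in_range(point, lo, hi):
    # Constrained coordinate range: each component inside an axis-aligned box.
    return len(point) == len(lo) == len(hi) and all(
        l <= p <= h for p, l, h in zip(point, lo, hi)
    )

VALIDATORS = {"bool": check_bool, "float": check_float, "vec3": check_vector}

def validate(kind, answer, expected, **kwargs):
    """Route an answer to the validator for its declared type."""
    return VALIDATORS[kind](answer, expected, **kwargs)

print(validate("float", 1.0004, 1.0))               # within tolerance
print(validate("vec3", [0, 1, 0], [0, 1.0005, 0]))  # within tolerance
print(check_in_range([0.5, 1.0, -2.0], [0, 0, -3], [1, 2, -1]))
```

The point of the design is that "1.0" and "0.9999" can both be accepted for a float question, while a vector question with the right numbers in the wrong order is still rejected.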

What we learned

1. Linear algebra is still a weak spot

The strongest single pattern in the results is that Linear Algebra & Vector Methods remains substantially weaker than the rest of the benchmark.

On the current 80-question public rerun:

  • GPT-5.5, Gemini 3 Flash Preview, Kimi K2.6, and Claude Fable 5 led this category at 7/12

  • Gemini 3.5 Flash, Claude Opus 4.8, and Kimi K2.7 Code scored 5/12

  • Claude Opus 4.7 scored 4/12

  • Gemini 3.1 Pro Preview, Qwen 3.6 Plus, and DeepSeek V4 Pro scored 3/12

That is a very different picture from Constraint Satisfaction or Relational & Topological, where most models are near the ceiling.

The failure modes are also revealing. The hardest questions were not generic "hard math" prompts. They were questions about engine-style vector reasoning under explicit conventions.

Examples:

  • On the signed yaw angle task, all 16 models failed.

  • On the left strafe vector task, all 16 models failed.

  • On the ricochet direction task, 12/16 models failed.

  • On the slide vector task, 11/16 models failed.

  • On the project point onto rail task, 8/16 models failed.
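For readers less familiar with these operations, here is a minimal sketch of the standard formulas behind three of the tasks above: reflection for a ricochet, plane projection for a slide, and closest-point projection onto a rail. The helper names are hypothetical, and treating the rail as a segment clamped to its endpoints is an assumption; the benchmark prompts may define these operations differently.

```python
import math

def dot(a, b): return sum(x * y for x, y in zip(a, b))
def add(a, b): return tuple(x + y for x, y in zip(a, b))
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)

def normalize(v):
    length = math.sqrt(dot(v, v))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return scale(v, 1.0 / length)

def reflect(d, n):
    """Ricochet: r = d - 2 (d . n) n, with n normalized first."""
    n = normalize(n)
    return sub(d, scale(n, 2.0 * dot(d, n)))

def slide(d, n):
    """Slide: remove the component of d along the surface normal n."""
    n = normalize(n)
    return sub(d, scale(n, dot(d, n)))

def project_onto_rail(p, a, b):
    """Closest point to p on the segment a-b (assumed rail model)."""
    ab = sub(b, a)
    t = dot(sub(p, a), ab) / dot(ab, ab)
    t = max(0.0, min(1.0, t))  # clamp to the rail's endpoints
    return add(a, scale(ab, t))

# A non-unit normal (0, 2, 0) still works because it is normalized first.
print(reflect((1, -1, 0), (0, 2, 0)))                      # (1.0, 1.0, 0.0)
print(slide((1, -1, 0), (0, 2, 0)))                        # (1.0, 0.0, 0.0)
print(project_onto_rail((1, 5, 0), (0, 0, 0), (4, 0, 0)))  # (1.0, 0.0, 0.0)
```

Skipping the `normalize` call in `reflect` or `slide` gives a wrong result whenever the supplied normal is not already unit length, which is one way a conceptually correct answer can still fail.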

The common pitfalls were remarkably consistent:

  • Most models used the cross product in the wrong order.

  • They returned the right vector instead of the left vector.

  • They flipped the sign convention for signed yaw.

  • They forgot to normalize the direction or normal vector, even when normalization was required by the evaluator.

Across the full six-model rerun, the scorer's error traces were also highly repetitive. The most common failure reasons in this category were missing required spatial utility operations, especially normalize, missing the final ANSWER: block, and sign or direction reversals. In other words, the models often had the right intuition class but still failed to express the solution in a convention-correct, engine-appropriate way.

This is a clean signal that current models are still brittle on orientation-sensitive 3D math. In a text-only setting, a sign error might look minor. In AR, robotics, or game systems, a left/right sign error is catastrophic. A model that turns an avatar right when it should turn left is not "almost correct." It is wrong in the exact way that breaks the interaction.
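To show how small the difference between "left" and "right" is in code, the sketch below computes a left strafe vector and a signed yaw under one explicitly assumed convention: a right-handed, Y-up frame, with positive yaw meaning the target is on the left. The convention and function names are assumptions for illustration; the benchmark prompts state their own conventions.

```python
import math

UP = (0.0, 1.0, 0.0)  # assumed convention: right-handed frame, Y-up

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)

def left_strafe(forward, up=UP):
    # Right-handed frame: up x forward points left, forward x up points right.
    # Swapping the operands silently returns the opposite direction.
    return normalize(cross(up, forward))

def signed_yaw_degrees(forward, to_target, up=UP):
    # Signed angle about `up` from forward to the target direction.
    # Positive means the target is on the left under the assumed convention.
    sin_term = dot(up, cross(forward, to_target))
    # Dot product of the two vectors after removing their components along up.
    cos_term = dot(forward, to_target) - dot(forward, up) * dot(to_target, up)
    return math.degrees(math.atan2(sin_term, cos_term))

forward = (0.0, 0.0, -1.0)  # looking down -Z
print(left_strafe(forward))                            # (-1.0, 0.0, 0.0)
print(signed_yaw_degrees(forward, (-1.0, 0.0, 0.0)))   # 90.0, target on the left
print(signed_yaw_degrees(forward, (1.0, 0.0, 0.0)))    # -90.0, target on the right
```

In a left-handed Y-up frame, the same `cross(up, forward)` expression yields the right vector instead, which is why an unstated or misread handedness assumption is enough to flip the answer.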

2. More reasoning did not help, and sometimes seemed to hurt

One result that surprised us is that "more thinking" did not translate into better spatial performance.

In our earlier reasoning-on versus reasoning-off runs, we did not see meaningful gains from turning reasoning up:

  • Gemini 3.1 Pro Preview stayed flat at 90.0%

  • Gemini 3 Flash Preview also stayed flat at 90.0%

  • Claude Sonnet 4.6 stayed flat

  • Claude Opus 4.6 dropped by 2 points

  • GPT-5.2 dropped by 2 points

So the honest read is not that heavy reasoning unlocked new spatial competence. It mostly did not.

It is also worth noting a more nuanced point: we did not observe a large "Flash beats Pro" gap in the Gemini comparison. What we observed was arguably more interesting. Gemini 3 Flash Preview matched Gemini 3.1 Pro Preview on accuracy, but consumed dramatically more reasoning tokens for the same outcome. That suggests that on this benchmark, the bottleneck is not simply "more deliberate chain-of-thought." The tasks are often short, exact, and convention-sensitive. Extra reasoning budget does not rescue a wrong handedness assumption.

This is consistent with the hardest questions in the suite. If the task is "return the left strafe vector" or "produce a signed yaw with positive meaning target-on-the-left," then a longer trace does not help if the model starts from the wrong coordinate convention. Spatial failures often come from committing to the wrong frame, not from failing to think for long enough.

This is an important lesson for spatial product work. If a workload is dominated by compact, deterministic spatial transformations, more internal deliberation may be less valuable than:

  • tighter scene representations

  • stronger convention grounding

  • better output discipline

3. Thinking about depth is harder than it looks

We also saw repeated weaknesses in questions that required reasoning about z placement or layered depth in a scene.

The World-Space UI Layout category exposed this well. One of the hardest full-scene tasks asked the model to adjust a UI layout so all movable elements were visible, no two overlapped, and only the necessary objects were changed. 5 of 6 models failed that prompt.

The strong common pattern was:

  • wrong z placement

  • moving objects unnecessarily

  • changing more objects than requested

  • producing a layout that looked cleaner, but violated the "minimal valid edit" constraint

This suggests that LLMs struggle with instruction-following under spatial constraints.

That matters a lot for real tools. In an AR authoring workflow, it is often not enough for the model to produce a valid arrangement. We need it to produce the smallest valid patch, preserve the rest of the scene, and respect which objects are allowed to move.
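One way to picture the "smallest valid patch" requirement is as a checker that rejects a layout edit for any of the failure reasons listed above. The sketch below is a simplified assumption, not the benchmark's scorer: the scene shape (`pos`, `size`, `movable`), the `view` box, and the `max_changes` budget are hypothetical, and overlap is tested only in the x-y plane.

```python
from itertools import combinations

def overlaps(a, b):
    # Axis-aligned rectangles given by center (x, y) and size (w, h).
    return (abs(a["pos"][0] - b["pos"][0]) * 2 < a["size"][0] + b["size"][0]
            and abs(a["pos"][1] - b["pos"][1]) * 2 < a["size"][1] + b["size"][1])

def in_view(obj, view):
    x, y, z = obj["pos"]
    half_w, half_h = obj["size"][0] / 2, obj["size"][1] / 2
    return (view["xmin"] <= x - half_w and x + half_w <= view["xmax"]
            and view["ymin"] <= y - half_h and y + half_h <= view["ymax"]
            and view["zmin"] <= z <= view["zmax"])  # depth must stay in range

def check_patch(scene, patch, view, max_changes):
    """Return (passed, reason) for a patch mapping object id -> new position."""
    changed = [k for k, pos in patch.items()
               if tuple(pos) != tuple(scene[k]["pos"])]
    if any(not scene[k]["movable"] for k in changed):
        return False, "moved an immovable object"
    if len(changed) > max_changes:
        return False, "changed more objects than necessary"
    after = {k: {**o, "pos": tuple(patch.get(k, o["pos"]))}
             for k, o in scene.items()}
    if not all(in_view(o, view) for o in after.values() if o["movable"]):
        return False, "movable object not visible"
    if any(overlaps(a, b) for a, b in combinations(after.values(), 2)):
        return False, "objects overlap"
    return True, "ok"

scene = {
    "panel":  {"pos": (0.0, 0.0, -1.0), "size": (2.0, 1.0), "movable": False},
    "button": {"pos": (0.5, 0.0, -1.0), "size": (1.0, 1.0), "movable": True},
    "label":  {"pos": (3.0, 0.0, -1.0), "size": (1.0, 1.0), "movable": True},
}
view = {"xmin": -2, "xmax": 4, "ymin": -2, "ymax": 2, "zmin": -2, "zmax": -0.5}

print(check_patch(scene, {"button": (0.5, 1.0, -1.0)}, view, max_changes=1))
print(check_patch(scene, {"button": (0.5, 1.0, 5.0)}, view, max_changes=1))
```

In this toy scene, moving only the button upward passes, while the same move with a wrong z value, or a patch that also nudges the label, is rejected even though the resulting layout might look tidy.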

4. Large scene hierarchies hurt performance

Another strong lesson is that lots of context can make models worse, not better.

Some of our context-backed prompts inject real scene JSON into the prompt so the model can reason over a concrete scene graph. This is necessary, because real tools do not operate on toy descriptions. But the benchmark also makes clear that once the scene JSON gets large, models often become less reliable.

The symptoms are familiar:

  • they miss the specific object relationships that matter

  • they confuse which parts of the scene are relevant

  • they over-fit to irrelevant detail

  • they return structurally noisy or overcomplicated outputs

We see this especially in full-scene construction and editing prompts such as crate generation, cube generation, and UI layout repair.

One of the clearest examples is the hardest world-space UI repair prompt, where the model has to inspect a scene, preserve immovable objects, adjust only a small set of movable ones, maintain visibility, and avoid overlap. Most failures were not absurd outputs. They were near-miss edits: wrong z, one unnecessary object move, or a slightly over-broad patch. That is exactly the kind of brittleness that large scene payloads can amplify.

We also adjusted the benchmark itself as we learned this. Earlier in the process, some prompt shapes carried more scene detail than was actually useful for measuring the underlying spatial skill. As we saw how quickly excess scene JSON could create overload, we intentionally kept the public benchmark tighter. The goal is not to test whether a model can survive arbitrary prompt bloat. It is to measure spatial reasoning while still using scene context realistically.

This suggests an important product direction: models need tighter scene representations and better retrieval-like inspection behavior. Instead of dumping the full scene every time, it may be better to:

  • provide a compact scene summary first

  • expose exact subtrees or object metadata on demand

  • let the model inspect specifics only when needed

In other words, good spatial tooling may depend as much on how the scene is represented to the model as on the raw underlying model capability.
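A minimal sketch of the summary-first, detail-on-demand idea follows. It assumes a hypothetical scene format (nested dicts with `name` and `children` keys); real scene exports will have a different and richer shape, and the helper names are illustrative only.

```python
def summarize(node, depth=0, max_depth=2):
    """Return a one-line-per-node outline, truncated at max_depth."""
    children = node.get("children", [])
    lines = ["  " * depth + f'{node["name"]} ({len(children)} children)']
    if depth < max_depth:
        for child in children:
            lines.extend(summarize(child, depth + 1, max_depth))
    return lines

def find_subtree(node, name):
    """Depth-first lookup so exact metadata is fetched only when requested."""
    if node["name"] == name:
        return node
    for child in node.get("children", []):
        hit = find_subtree(child, name)
        if hit is not None:
            return hit
    return None

scene = {
    "name": "Root",
    "children": [
        {"name": "UI", "children": [
            {"name": "Button", "position": [0, 1, -2]},
        ]},
        {"name": "Crate"},
    ],
}

# Step 1: give the model a compact outline instead of the full JSON.
print("\n".join(summarize(scene, max_depth=1)))
# Step 2: expose the exact subtree only when the model asks for it.
print(find_subtree(scene, "Button"))
```

The outline keeps the prompt small and stable as the scene grows, while the lookup returns full detail for just the objects a task actually touches.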

5. Open models are making real progress

One of the most encouraging outcomes of this work is how competitive the latest open models are.

Open models are not only competitive, they are also improving quickly in this field. Kimi K2.6 placed third overall on the latest run at 90.0%, and both Kimi 2.5 and DeepSeek V3.1 were firmly in the top tier.

For spatial computing, that matters. The ecosystem will be healthier if strong spatial reasoning is not limited to one or two closed model families. Open models give developers more room to experiment with latency, deployment, privacy, fine-tuning, and product integration. For AR and XR tooling, that flexibility is likely to matter a great deal.

Where this leaves us

We came away from this benchmark more optimistic than when we started. Current models are already capable enough to be genuinely useful in some spatial workflows, especially in topological reasoning, basic constraints, and world-space arrangement.

But the benchmark also showed us exactly where the cracks still are:

  • 3D engine-specific conventions

  • left/right and handedness bugs

  • exact depth placement

  • engine-style vector and rotation code patterns

  • scene hierarchy understanding in large scenes, and minimal-edit discipline

These are all concepts central to real spatial tools.

That is why we think spatial intelligence deserves to be treated as its own benchmarkable domain. If AI is going to be useful in AR, robotics, simulation, games, and scene generation, it needs to be evaluated on the kinds of failures that actually break those systems.

We built Spatial Benchmark to make those strengths and weaknesses visible. The good news is that the models are already farther along than many people think. The hard part, and the interesting part, is now very clear.

We also plan to keep updating the benchmark and publishing results as new models are released. Spatial AI is on an upwards trajectory, and part of the value of a benchmark like this is watching where progress is real, where it is uneven, and which failure modes persist across generations.

We would love feedback from the community, and we encourage anyone using the benchmark to raise issues on GitHub, suggest new question types, and share failure cases we should cover. We want the whole AI ecosystem to improve when it comes to spatial applications, especially across the open model community, and we hope this benchmark can be one useful tool in that process.