Can AI assemble real machines?

We give different AIs the same parts and the same goal, then grade what each one builds with a single automatic checker. This is the Mechanical Assembly Readiness Benchmark (MARB), and it runs on CADCLAW, our open-source verification engine.

Tool-independent · automated · mapped to TRL / MRL / IRL · the engine is open source (MIT)

Figure: the target machine the AIs were asked to build.

Each AI received only the picture above and ~100 authored STEP part files, with no build steps. We measured whether it could work out the rest.

The first results

Three AI workflows, Claude (Fusion), Claude (CadQuery), and OpenAI Codex (CadQuery), each placed all ~100 parts with zero human help. None is buildable yet: every run still landed about 50 mm off in relative position and mis-oriented roughly half the asymmetric parts. Different models led on different dimensions: Claude · CadQuery came closest on interface gaps (0 mm median gap error, about 54% of interfaces within 1 mm); OpenAI Codex got the most rotations right (69% aligned).

Even so, the underlying result is real. From one picture and ~100 authored part files, with no build sequence, each model produced a coherent two-metre machine, and Codex did it in 13 minutes on the first attempt. Tolerances are tight by design (≤5 mm = located, ≤5° = aligned) because real equipment runs on fractions of a millimetre. Placing the parts is now the easy part; getting the gaps and rotations correct is the hard part, and that is what these scores measure.
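To make the two thresholds concrete, here is a minimal Python sketch of how a single part could be graded against an answer key. It is an illustration only: the function names, the use of absolute positions, and the rotation-matrix representation are assumptions, and the source does not describe how MARB actually computes these errors.

```python
import math

LOCATED_MM = 5.0    # position tolerance quoted in the text
ALIGNED_DEG = 5.0   # rotation tolerance quoted in the text


def position_error_mm(placed, reference):
    """Euclidean distance between two (x, y, z) positions, in millimetres."""
    return math.dist(placed, reference)


def rotation_error_deg(r_placed, r_reference):
    """Angle of the relative rotation between two 3x3 rotation matrices."""
    # trace(R_ref^T @ R_placed) equals the sum of element-wise products.
    trace = sum(r_placed[i][j] * r_reference[i][j]
                for i in range(3) for j in range(3))
    # Clamp to guard against rounding just outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def grade_part(placed_pos, ref_pos, placed_rot, ref_rot):
    """Hypothetical per-part grade using the two tolerances above."""
    return {
        "located": position_error_mm(placed_pos, ref_pos) <= LOCATED_MM,
        "aligned": rotation_error_deg(placed_rot, ref_rot) <= ALIGNED_DEG,
    }


IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
QUARTER_TURN_Z = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]  # 90 degrees about Z

# A part 3 mm from its reference position but turned a quarter turn.
result = grade_part((3.0, 0.0, 0.0), (0.0, 0.0, 0.0), QUARTER_TURN_Z, IDENTITY)
print(result)
```

In this example the part counts as located but not aligned, which is the failure pattern the results describe: a part can sit in roughly the right place and still face the wrong way.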

Figure: gap correctness versus build time for the three AI runs. Claude · CadQuery hit the answer key's gaps in 49 minutes; OpenAI Codex was loosest on gaps but fastest, at 13 minutes.

Gap correctness vs. speed. Claude · CadQuery led on interface gaps (0 mm median error, about 54% within 1 mm) but took 49 minutes; OpenAI Codex one‑shot it in 13 minutes with the loosest gaps and the strongest rotation. More iteration buys precision at the cost of time.
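The two gap figures quoted here (median error and share of interfaces within 1 mm) are simple summary statistics. The sketch below shows one plausible way to compute them; the function name, the input format, and the sample numbers are invented for illustration and are not benchmark data.

```python
from statistics import median


def gap_stats(built_gaps_mm, reference_gaps_mm, within_mm=1.0):
    """Summarise per-interface gap error against an answer key."""
    errors = [abs(b - r) for b, r in zip(built_gaps_mm, reference_gaps_mm)]
    return {
        "median_error_mm": median(errors),
        "fraction_within": sum(e <= within_mm for e in errors) / len(errors),
    }


# Illustrative values only: four interfaces, all with a 0 mm reference gap.
stats = gap_stats([0.0, 0.5, 3.0, 10.0], [0.0, 0.0, 0.0, 0.0])
print(stats)
```

The two numbers answer different questions: the median describes the typical interface, while the within-1-mm share shows how many interfaces are still far off. That is why a run can report a very low median and still be unbuildable.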

Until recently there was no way to measure this. The two pieces — an AI that drives CAD, and an automatic checker that confirms whether the result is correct — only recently became usable together. With both in place, every attempt gets a score that can be compared and tracked over time. Read the full comparison →
3 AI workflows tested
~100 parts placed, 0 human help
13 min fastest build (Codex)
15 / 100 MARB full-stack score (L0–L7; a clean L1 build ≈ 15)

What MARB measures

A capability ladder, L0–L7, from "place one part correctly" up to "design, build, and certify a machine autonomously." Today's frontier sits at L1 — assemble the kit. The benchmark is graded on what the exported geometry proves, so any tool — Fusion, CadQuery, or an AI agent — is judged on equal footing, and scored against the readiness scales industry already uses.

Prior art exists for AI generating CAD; MARB is, to our knowledge, the first tool-independent benchmark for whether a whole assembled machine is correct and buildable.

The engine: CADCLAW (open source)

MARB runs on CADCLAW — an open-source check suite for STEP assemblies. It's the same engine you can pip install and run in your own CI: a chain of automated gates that pass only when every configured check passes. Like pytest for mechanical design — in spirit.

Inventory

Missing or extra parts, by bounding-box signature, against expected counts.
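As a rough illustration of the idea (not CADCLAW's actual implementation), a bounding-box signature can be built by sorting a part's extents so the result does not depend on which axis the part lies along. The names, the 0.1 mm rounding grid, and the sample dimensions below are all assumptions.

```python
from collections import Counter


def bbox_signature(dims_mm, precision_mm=0.1):
    """Orientation-independent signature: sorted extents on a rounding grid."""
    return tuple(sorted(round(d / precision_mm) for d in dims_mm))


def inventory_diff(found_dims, expected_counts):
    """Return (missing, extra) part counts keyed by bounding-box signature."""
    found = Counter(bbox_signature(d) for d in found_dims)
    expected = Counter()
    for dims, count in expected_counts.items():
        expected[bbox_signature(dims)] += count
    # Counter subtraction keeps only positive differences.
    return dict(expected - found), dict(found - expected)


# Hypothetical kit: four 20x20x500 extrusions and two 42x42x48 motors.
expected = {(20, 20, 500): 4, (42, 42, 48): 2}
found = [(500, 20, 20)] * 3 + [(42, 48, 42)] * 2 + [(10, 10, 10)]

missing, extra = inventory_diff(found, expected)
print("missing:", missing)
print("extra:", extra)
```

One caveat with this sketch: two different parts with the same envelope share a signature, so a check like this can confirm counts but cannot tell look-alike parts apart.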

Interference

Solid-solid overlaps via BRep boolean intersection — not just bbox.

Adjacency

Parts that should be near each other but aren't — the motor 600 mm from its mount.

Dimensional

Wrong thickness, swapped box() args, impossible dimensions.

Floating

Non-exempt parts isolated from the structural frame beyond a max gap.
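One way to picture this check is as a reachability search: start from the frame, walk to every part within the maximum gap, and report whatever was never reached. The sketch below uses axis-aligned boxes and a transitive search; both are simplifying assumptions, and the part names are made up.

```python
from collections import deque


def aabb_gap(a, b):
    """Distance between two axis-aligned boxes; 0 if they touch or overlap."""
    (amin, amax), (bmin, bmax) = a, b
    squared = 0.0
    for i in range(3):
        d = max(bmin[i] - amax[i], amin[i] - bmax[i], 0.0)
        squared += d * d
    return squared ** 0.5


def floating_parts(boxes, frame_ids, max_gap_mm, exempt=()):
    """Names of non-exempt parts not connected to the frame within max_gap_mm."""
    reached = set(frame_ids)
    queue = deque(frame_ids)
    while queue:
        current = queue.popleft()
        for name, box in boxes.items():
            if name not in reached and aabb_gap(boxes[current], box) <= max_gap_mm:
                reached.add(name)
                queue.append(name)
    return sorted(n for n in boxes if n not in reached and n not in exempt)


boxes = {
    "frame":   ((0, 0, 0), (100, 20, 20)),
    "bracket": ((100.5, 0, 0), (120, 20, 20)),  # 0.5 mm from the frame
    "motor":   ((120, 0, 0), (160, 20, 20)),    # touches the bracket
    "stray":   ((0, 0, 700), (10, 10, 710)),    # far above everything
    "label":   ((0, 0, 300), (5, 5, 301)),      # isolated but exempt
}

floating = floating_parts(boxes, ["frame"], max_gap_mm=1.0, exempt=("label",))
print(floating)
```

Here the motor counts as attached because it reaches the frame through the bracket. Whether the real gate treats connectivity transitively is not stated in the source.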

Structural

Beam deflection, motor torque budget, belt tension. Static load math, not motion-clearance or full-travel sweeps.
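For a sense of what "static load math" means here, the textbook case is a simply supported beam with a point load at mid-span, where the deflection is F·L³ / (48·E·I). The sketch below applies that formula to a solid rectangular section; the beam size, load, and L/250 limit are example values, not anything taken from CADCLAW or the M3-CRETE design.

```python
def rect_second_moment_mm4(width_mm, height_mm):
    """Second moment of area of a solid rectangle about its horizontal axis."""
    return width_mm * height_mm ** 3 / 12.0


def midspan_deflection_mm(force_n, span_mm, e_mpa, i_mm4):
    """Simply supported beam, point load at mid-span: F * L^3 / (48 * E * I)."""
    return force_n * span_mm ** 3 / (48.0 * e_mpa * i_mm4)


# Example values only: a 20 x 40 mm solid aluminium bar (E of about 69 GPa),
# 1 m span, 100 N at the centre. N, mm, and MPa are consistent units.
inertia = rect_second_moment_mm4(20.0, 40.0)
deflection = midspan_deflection_mm(100.0, 1000.0, 69_000.0, inertia)
limit = 1000.0 / 250.0  # hypothetical L/250 serviceability limit

print(f"deflection = {deflection:.3f} mm, limit = {limit:.1f} mm")
print("pass" if deflection <= limit else "fail")
```

A check of this kind covers a single static load case. As the description above says, it does not cover motion clearance or full-travel sweeps.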

Tolerance

Worst-case, RSS, Monte Carlo stacking with Cpk and variance decomposition.
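The three stacking methods can be compared on a small linear stack. The sketch below is a generic illustration under common textbook assumptions (each ± tolerance treated as 3 sigma of a normal distribution, dimensions simply added); it is not CADCLAW's code, and the dimensions and spec limits are invented.

```python
import math
import random
import statistics


def worst_case(tols):
    """Every dimension at its limit at once."""
    return sum(abs(t) for t in tols)


def rss(tols):
    """Root-sum-square stack for independent contributors."""
    return math.sqrt(sum(t * t for t in tols))


def variance_shares(tols):
    """Fraction of total stack variance contributed by each dimension."""
    total = sum(t * t for t in tols)
    return [t * t / total for t in tols]


def monte_carlo(nominals, tols, lsl, usl, n=20_000, seed=0):
    """Sample the stack, treating each +/- tolerance as 3 sigma."""
    rng = random.Random(seed)
    totals = [sum(rng.gauss(nom, tol / 3.0) for nom, tol in zip(nominals, tols))
              for _ in range(n)]
    mean = statistics.fmean(totals)
    sigma = statistics.stdev(totals)
    cpk = min(usl - mean, mean - lsl) / (3.0 * sigma)
    return mean, sigma, cpk


nominals = [10.0, 20.0, 30.0]
tols = [0.1, 0.2, 0.2]

print("worst case +/-", worst_case(tols))
print("RSS        +/-", rss(tols))
print("variance shares:", variance_shares(tols))
mean, sigma, cpk = monte_carlo(nominals, tols, lsl=59.5, usl=60.5)
print(f"mean={mean:.3f} sigma={sigma:.4f} Cpk={cpk:.2f}")
```

Worst case is the most conservative and RSS the most optimistic of the closed-form results. The Monte Carlo run is useful mainly when distributions are not normal or the stack is not a simple sum, neither of which this small example exercises.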

BOM audit

BOM JSON ↔ CAD: qty, mfg_type, required/forbidden terms, count drift. Private fields never echoed.

Honesty toolchain: cadclaw doctor verifies your environment · cadclaw publish-audit stops private BOMs being committed · cadclaw claim-audit flags overclaims. The same discipline keeps the benchmark honest — it won't assert a result it hasn't earned. An MCP server exposes the checks (and only the checks) to AI assistants.

Quick start

pip install cadclaw
cadclaw doctor                                   # verify the environment
cadclaw harness --rules cadclaw.yaml             # run configured checks
cadclaw bom-audit --rules cadclaw.yaml           # or a single gate

Exit codes: 0 pass · 1 fail · 2 warn-only · 3 internal error. No commercial CAD software required for CADCLAW's own checks. Full how-it-works write-up: CI for mechanical design →
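In a CI script, the four exit codes can be mapped to a merge decision. The sketch below shows one possible policy in Python; the function names and the `strict_warnings` option are hypothetical, and only the `cadclaw harness --rules cadclaw.yaml` command and the code meanings come from the text above.

```python
import subprocess

# Exit-code meanings as documented above.
EXIT_MEANING = {0: "pass", 1: "fail", 2: "warn-only", 3: "internal error"}


def run_harness(rules_path="cadclaw.yaml"):
    """Run the configured checks and return the process exit code."""
    completed = subprocess.run(["cadclaw", "harness", "--rules", rules_path])
    return completed.returncode


def blocks_merge(exit_code, strict_warnings=False):
    """Hypothetical CI policy: decide whether a result should block a merge."""
    meaning = EXIT_MEANING.get(exit_code)
    if meaning == "pass":
        return False
    if meaning == "warn-only":
        return strict_warnings
    # Failures, internal errors, and unknown codes all block.
    return True


# Typical use in a CI step (requires cadclaw to be installed):
#     code = run_harness()
#     raise SystemExit(1 if blocks_merge(code) else 0)
```

Treating an internal error as blocking is a deliberate choice in this sketch: a check that could not run has not shown that the assembly is correct.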

What CADCLAW does NOT prove

CADCLAW checks geometry, BOM JSON, and README text against rules you write. It does not prove:

Each report ships a confidence budget per gate: checked, not_checked, assumptions. Read it. (The benchmark's grades are geometric; they are not a substitute for physical testing.)

Origin

CADCLAW was built alongside the M3-CRETE open-source concrete 3D printer — the large, part-dense machine that is also MARB's first reference target. Across that project the harness:

Developed by Sunnyday Technologies.

Citation

If you use CADCLAW or MARB in published research or derivative work, please cite:

Sonnentag, N. (2026). CADCLAW: Automated validation framework for
STEP-based CAD assemblies. Sunnyday Technologies.
https://github.com/sunnyday-technologies/CADCLAW
DOI: 10.5281/zenodo.19647391

A CITATION.cff file is included for automated citation tooling.

