Can AI assemble real machines?

We give different AIs the same parts and the same goal, then grade what each one builds with one automatic checker. It's the Mechanical Assembly Readiness Benchmark — and it runs on CADCLAW, our open-source verification engine.


Tool-independent · automated · mapped to TRL / MRL / IRL · the engine is open source (MIT)

The target machine the AIs were asked to build

Each AI received only the picture above and ~100 authored STEP part files. No build steps. We measured whether it could work out the rest.

The first results

Three AI workflows — Claude (Fusion and CadQuery) and OpenAI Codex (CadQuery) — each placed all ~100 parts with zero human help. None is buildable yet: every run still landed about 50 mm off in relative position and mis-oriented roughly half the asymmetric parts. Different models led different dimensions: Claude · CadQuery came closest on interface gaps (0 mm median gap error, about 54% of interfaces within 1 mm); OpenAI Codex got the most rotations right (69% aligned).

Even so, the underlying result is real. From one picture and ~100 authored part files, with no build sequence, each model produced a coherent two‑metre machine, and Codex did it in 13 minutes on the first attempt. Tolerances are tight by design (≤5 mm = located, ≤5° = aligned) because real equipment runs on fractions of a millimetre. Placing the parts is now the easy part; getting the gaps and rotations correct is the hard part, and that is what these scores measure.
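As a rough illustration of how the ≤5 mm and ≤5° thresholds could be applied to a single part, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the result layout, and the choice to compare each part against a reference pose in a shared frame. MARB's own grader scores relative position and may work differently.

```python
import math

LOCATED_MM = 5.0   # threshold quoted in the text: <=5 mm counts as located
ALIGNED_DEG = 5.0  # threshold quoted in the text: <=5 degrees counts as aligned


def position_error_mm(placed, reference):
    """Euclidean distance between two (x, y, z) points, in mm."""
    return math.dist(placed, reference)


def rotation_error_deg(r_placed, r_ref):
    """Angle of the relative rotation between two 3x3 rotation matrices."""
    # trace(R_ref^T @ R_placed) equals the element-wise product summed over all entries.
    trace = sum(r_ref[i][j] * r_placed[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against floating-point drift just outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def rot_z(deg):
    """Rotation matrix for a turn of `deg` degrees about the Z axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def grade_part(placed_xyz, ref_xyz, placed_rot, ref_rot):
    """Hypothetical per-part grade: errors plus the two pass/fail flags."""
    pos = position_error_mm(placed_xyz, ref_xyz)
    rot = rotation_error_deg(placed_rot, ref_rot)
    return {
        "pos_mm": pos,
        "rot_deg": rot,
        "located": pos <= LOCATED_MM,
        "aligned": rot <= ALIGNED_DEG,
    }


# A part 3 mm and 2 degrees off passes both checks;
# one that is 50 mm off and turned 90 degrees fails both.
near = grade_part((0, 0, 3), (0, 0, 0), rot_z(2), rot_z(0))
far = grade_part((50, 0, 0), (0, 0, 0), rot_z(90), rot_z(0))
```

A real grader would also need to treat symmetric parts specially, since several orientations of a symmetric part are equally correct; this sketch ignores that.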

[Figure: gap correctness versus build time for the three AI runs. Claude · CadQuery matched the answer key's gaps in 49 minutes; OpenAI Codex was loosest on gaps but fastest at 13 minutes.]

Gap correctness vs. speed. Claude · CadQuery led on interface gaps (0 mm median error, about 54% within 1 mm) but took 49 minutes; OpenAI Codex one‑shot it in 13 minutes with the loosest gaps and the strongest rotation. More iteration buys precision at the cost of time.
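The two gap figures quoted here (median error and share of interfaces within 1 mm) can be computed from per-interface gaps in a few lines. The sketch below assumes a hypothetical input of two equal-length lists of gap values in mm; the function name and the sample numbers are invented for illustration and are not benchmark data.

```python
from statistics import median


def gap_summary(measured_mm, reference_mm, within_mm=1.0):
    """Summarise absolute gap errors across interfaces (hypothetical helper)."""
    errors = [abs(m - r) for m, r in zip(measured_mm, reference_mm)]
    within = sum(e <= within_mm for e in errors)
    return {
        "median_error_mm": median(errors),
        "pct_within": 100.0 * within / len(errors),
    }


# Invented example: three interfaces match exactly, two are badly off.
measured = [0.0, 0.0, 0.0, 2.0, 12.5]
reference = [0.0, 0.0, 0.0, 0.0, 0.5]
summary = gap_summary(measured, reference)
```

With these invented numbers the median error is 0 mm while only 60% of interfaces are within 1 mm, which shows why a perfect median and a middling within-1-mm share can appear together: the median ignores how far off the worst interfaces are.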

Until recently there was no way to measure this. The two pieces — an AI that drives CAD, and an automatic checker that confirms whether the result is correct — only recently became usable together. With both in place, every attempt gets a score that can be compared and tracked over time. Read the full comparison →

3 AI workflows tested

~100 parts placed, 0 human help

13 min fastest build (Codex)

15 / 100 MARB full-stack score (L0–L7; a clean L1 build ≈ 15)

What MARB measures

A capability ladder, L0–L7, from "place one part correctly" up to "design, build, and certify a machine autonomously." Today's frontier sits at L1 — assemble the kit. The benchmark is graded on what the exported geometry proves, so any tool — Fusion, CadQuery, or an AI agent — is judged on equal footing, and scored against the readiness scales industry already uses.


Prior art exists for AI generating CAD; MARB is, to our knowledge, the first tool-independent benchmark for whether a whole assembled machine is correct and buildable.

The engine: CADCLAW (open source)

MARB runs on CADCLAW — an open-source check suite for STEP assemblies. It's the same engine you can pip install and run in your own CI: a chain of automated gates that pass only when every configured check passes. Like pytest for mechanical design — in spirit.

Inventory

Missing or extra parts, by bounding-box signature, against expected counts.

Interference

Solid-solid overlaps via BRep boolean intersection — not just bbox.

Adjacency

Parts that should be near each other but aren't — the motor 600 mm from its mount.

Dimensional

Wrong thickness, swapped box() args, impossible dimensions.

Floating

Non-exempt parts isolated from the structural frame beyond a max gap.

Structural

Beam deflection, motor torque budget, belt tension. Static load math, not motion-clearance or full-travel sweeps.

Tolerance

Worst-case, RSS, Monte Carlo stacking with C_pk and variance decomposition.

BOM audit

BOM JSON ↔ CAD: qty, mfg_type, required/forbidden terms, count drift. Private fields never echoed.
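To make the three stacking methods named under Tolerance concrete, here is a small Python sketch of worst-case, RSS, and Monte Carlo stacking for a linear chain of dimensions. It is not CADCLAW's code or API. It assumes each tolerance is a symmetric ± band equal to three standard deviations of a normal distribution, and all function names are hypothetical.

```python
import math
import random


def worst_case(tols):
    """Worst-case stack: every tolerance at its limit at once."""
    return sum(tols)


def rss(tols):
    """Root-sum-square stack: statistical combination of independent tolerances."""
    return math.sqrt(sum(t * t for t in tols))


def variance_shares(tols):
    """Fraction of the total stack variance contributed by each dimension."""
    total = sum(t * t for t in tols)
    return [t * t / total for t in tols]


def monte_carlo(nominals, tols, n=20000, seed=0):
    """Sample the stack; returns (mean, sample standard deviation)."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    samples = [
        sum(rng.gauss(nom, t / 3.0) for nom, t in zip(nominals, tols))  # assumes tol = 3 sigma
        for _ in range(n)
    ]
    mean = sum(samples) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in samples) / (n - 1))
    return mean, sd


def cpk(mean, sd, lsl, usl):
    """Process capability index against lower and upper spec limits."""
    return min(usl - mean, mean - lsl) / (3.0 * sd)


# Invented three-part stack, dimensions in mm.
nominals = [10.0, 20.0, 5.0]
tols = [0.1, 0.2, 0.2]
wc = worst_case(tols)
stat = rss(tols)
shares = variance_shares(tols)
mc_mean, mc_sd = monte_carlo(nominals, tols)
capability = cpk(mc_mean, mc_sd, 34.5, 35.5)
```

For this invented stack the worst case is ±0.5 mm and the RSS figure is ±0.3 mm, and the two larger tolerances each account for four ninths of the variance. The Monte Carlo standard deviation should land near 0.1 mm, one third of the RSS band under the three-sigma assumption.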

Honesty toolchain: cadclaw doctor verifies your environment · cadclaw publish-audit stops private BOMs being committed · cadclaw claim-audit flags overclaims. The same discipline keeps the benchmark honest — it won't assert a result it hasn't earned. An MCP server exposes the checks (and only the checks) to AI assistants.

Quick start

pip install cadclaw
cadclaw doctor                                   # verify the environment
cadclaw harness --rules cadclaw.yaml             # run configured checks
cadclaw bom-audit --rules cadclaw.yaml           # or a single gate

Exit codes: 0 pass · 1 fail · 2 warn-only · 3 internal error. No commercial CAD software required for CADCLAW's own checks. Full how-it-works write-up: CI for mechanical design →
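In a CI job the exit codes above usually need to be turned into a pass-or-block decision. The sketch below shows one way a wrapper script might do that. The command line is the one from the quick start; the helper names and the `fail_on_warn` option are assumptions for illustration and are not part of CADCLAW.

```python
import subprocess

# Meanings as documented: 0 pass, 1 fail, 2 warn-only, 3 internal error.
EXIT_MEANING = {0: "pass", 1: "fail", 2: "warn-only", 3: "internal error"}


def interpret(code, fail_on_warn=False):
    """Map an exit code to (status, blocking). Unknown codes block by default."""
    status = EXIT_MEANING.get(code, "unknown")
    blocking = status in ("fail", "internal error", "unknown")
    if status == "warn-only" and fail_on_warn:
        blocking = True
    return status, blocking


def run_harness(rules="cadclaw.yaml", fail_on_warn=False):
    """Run the configured checks and classify the result (hypothetical wrapper)."""
    proc = subprocess.run(["cadclaw", "harness", "--rules", rules])
    return interpret(proc.returncode, fail_on_warn=fail_on_warn)
```

Calling `run_harness()` requires `cadclaw` on the PATH. Treating unknown codes as blocking is a deliberately conservative choice: a gate that cannot be read should not count as a pass.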

What CADCLAW does NOT prove

CADCLAW checks geometry, BOM JSON, and README text against rules you write. It does not prove:

That the native CAD model has no hidden or suppressed parts — it reads the STEP export, which can silently drop invisible parts.
That the physical build matches the CAD.
That a vendor part is in stock or costs what you assumed.
That a printed part is strong enough for production — the structural gate does bare-beam math, not fatigue or creep.
That an AI-generated change is correct — passing the gates means "passed the gates we have," no more.

Each report ships a confidence budget per gate: checked, not_checked, assumptions. Read it. (The benchmark's grades are geometric; they are not a substitute for physical testing.)

Origin

CADCLAW was built alongside the M3-CRETE open-source concrete 3D printer — the large, part-dense machine that is also MARB's first reference target. Across that project the harness:

Caught 53 solid-solid interferences in a single run
Reduced STEP file size from 70 MB to 13 MB by finding geometry bloat
Validated 150+ assembly changes across 15 sessions without visual inspection

Developed by Sunnyday Technologies.

Citation

If you use CADCLAW or MARB in published research or derivative work, please cite:

Sonnentag, N. (2026). CADCLAW: Automated validation framework for
STEP-based CAD assemblies. Sunnyday Technologies.
https://github.com/sunnyday-technologies/CADCLAW
DOI: 10.5281/zenodo.19647391

A CITATION.cff file is included for automated citation tooling.
