Benchmark  ·  Autodesk Research

neuralCAD-Edit
An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

Overview

neuralCAD-Edit is a benchmark for 3D CAD editing, designed to assess how well AI systems follow editing requests from users. The dataset consists of 192 multimodal editing requests (combining video, text, and drawings) and 384 edits collected from ten consenting expert CAD designers specifically for this benchmark. Input CAD models are sourced from the Fusion Gallery Dataset and span single-body and assembly models, with and without parametric design histories.

neuralCAD-Edit benchmark overview

Expert CAD users requested edits to CAD models in a number of different modality combinations. Each edit was carried out by the original requestor and one other CAD expert.

Multimodal editing requests

Professional CAD engineers don't describe edits by typing in a text box. They interact with models, point at specific faces and edges, produce hand-drawn markup, and talk through the changes they want to see. neuralCAD-Edit is the first CAD benchmark that captures these natural ways of communicating. We recorded consenting expert designers making requests to edit 3D CAD models in Autodesk Fusion.

We found that including drawings in requests allowed requestors to communicate larger changes and resulted in higher-quality edits.

Benchmarking frontier models

Each request was carried out by the original requestor and one additional CAD expert, providing both a ground-truth model for computing automatic metrics and a human baseline of CAD editing performance. We ran GPT 5.2, Gemini 3 Pro, and Claude Sonnet 4.5 on the full set of editing requests, allowing models to inspect and refine their outputs up to 10 times.

[Render grid: columns show the Initial model, Human Groundtruth (requestor), Human Baseline (other expert), Claude Sonnet 4.5, Gemini 3 Pro, and GPT-5.2; one row per request.]

Renders of model outputs. "Initial model" shows the starting state before editing. Human and AI results shown per row (request). Empty cells indicate no valid BREP file was generated for that model on that request.

Measuring editing performance

We evaluated model outputs with feature-based metrics, 3D volumetric metrics, VLM-based evaluations, and human evaluations. Human evaluations revealed a striking gap between even the best AI model (GPT-5.2) and the human baseline. While VLM evaluations and automatic metrics provided a rough sense of model performance, they did not correlate strongly with ratings from CAD experts, highlighting the necessity of human evaluation until better metrics are developed. We hope this benchmark gives the community a clear target to aim for as models improve.
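The volumetric metrics above can be sketched in a few lines. This is a minimal illustration under assumed conventions (brute-force nearest neighbors, boolean occupancy grids), not the benchmark's released evaluation code, which may differ in point-sampling density, voxel resolution, and normalization:

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds a (N,3) and b (M,3).

    Brute-force pairwise distances; fine for small clouds, use a KD-tree
    (e.g. scipy.spatial.cKDTree) for large ones.
    """
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def voxel_iou(vox_a: np.ndarray, vox_b: np.ndarray) -> float:
    """Intersection-over-union between two boolean occupancy grids of equal shape."""
    inter = np.logical_and(vox_a, vox_b).sum()
    union = np.logical_or(vox_a, vox_b).sum()
    return float(inter / union) if union > 0 else 1.0
```

In this convention, identical shapes score a Chamfer distance of 0 and a voxel IoU of 1; lower Chamfer and higher IoU indicate a closer match to the ground-truth edit.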

Leaderboard
                          Automatic metrics                                      Human eval
                          Chamfer-dist ↓  Voxel-IoU ↑  DINO-sim ↑  Validity ↑    Instruction ↑  Quality ↑  Acceptance ↑
GT (Human requestor)           —              —            —           —             0.74          0.66        0.82
Human baseline                 22            0.76         0.93        1.00           0.74          0.66        0.78
GPT-5.2                        50            0.57         0.66        0.99           0.48          0.39        0.25
Gemini 3 Pro                  110            0.30         0.36        0.58           0.27          0.16        0.10
Claude Sonnet 4.5              54            0.18         0.25        0.42           0.22          0.10        0.05

We provide code to compute the automatic metrics. If you would like to have your model added to the leaderboard, please send us your model outputs and we will gladly coordinate human evaluations.

We will be keeping this leaderboard up to date as models and harnesses/tooling improve.

Resources

Access the paper, code, and dataset for neuralCAD-Edit.

Cite this work
Citation coming soon.

Contact

For questions about the benchmark or dataset, please reach out: