neuralCAD-Edit is a benchmark for 3D CAD editing, designed to assess how well AI models follow editing requests provided by users. The dataset consists of 192 multimodal editing requests (including video, text, and drawings) and 384 edits collected from ten consenting expert CAD designers specifically for this benchmark. Input CAD models are sourced from the Fusion Gallery Dataset and span a range of single-body and assembly models, with and without parametric design histories.
Expert CAD users requested edits to CAD models in a number of different modality combinations. Each requested edit was then carried out by the original requestor and one other CAD expert.
Professional CAD engineers don't describe edits by typing in a text box. They interact with models, point at specific faces and edges, produce hand-drawn markup, and talk through the changes they want to see. neuralCAD-Edit is the first CAD benchmark that captures these natural ways of communicating. We recorded consenting expert designers making requests to edit 3D CAD models in Autodesk Fusion.
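To make the dataset composition concrete, here is a rough sketch of how a single benchmark record could be organized. The field names below are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EditRequest:
    """One neuralCAD-Edit entry (illustrative field names, not the released schema)."""
    request_id: str
    difficulty: str               # "easy" | "medium" | "hard"
    modalities: List[str]         # e.g. ["interactive", "static_drawings", "text"]
    input_model: str              # path to the source CAD model (from the Fusion Gallery Dataset)
    request_video: Optional[str]  # screen/audio recording of the requestor making the request
    request_text: Optional[str]   # typed or transcribed instructions, if any
    groundtruth_edit: str         # edited model produced by the original requestor
    baseline_edit: str            # edited model produced by a second CAD expert
```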
| Request Modality | Editing request | Human Groundtruth Edit (Requestor) | Human Baseline Edit (Other Expert) |
|---|---|---|---|
| Interactive + Static Drawings (Hard) | (image) | (image) | (image) |
| Interactive + Temporary Drawings (Easy) | (image) | (image) | (image) |
| Interactive (Hard) | (image) | (image) | (image) |
| Text (Medium) | (image) | (image) | (image) |
Example requests and edits. Requestors asked for easy, medium, and hard edits that they expected to take 2, 5, and 10 minutes to complete, respectively. Screenshots and the commands used for each edit were logged.
We found that including drawings in requests allowed requestors to communicate larger changes and resulted in higher-quality edits.
Each request was carried out by the original requestor and one additional CAD expert, providing both a ground-truth model for computing automatic metrics and a human baseline of CAD editing performance. We ran GPT-5.2, Gemini 3 Pro, and Claude Sonnet 4.5 on the full set of editing requests, allowing each model to inspect and refine its outputs up to 10 times.
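As a rough illustration of this inspect-and-refine protocol, here is a minimal sketch of what such a harness might look like. The callables (`propose_edit`, `apply_edit`, `render`) are hypothetical stand-ins, not part of any released neuralCAD-Edit API.

```python
# Sketch of the iterative editing harness described above: the model may
# inspect renders of its current output and refine it, up to 10 rounds.
MAX_ROUNDS = 10

def run_editing_request(request, initial_model, propose_edit, apply_edit, render):
    """Run one editing request with up to MAX_ROUNDS inspect-and-refine steps."""
    current = initial_model
    feedback = None  # renders of the model's latest attempt, None on the first round
    for _ in range(MAX_ROUNDS):
        edit, done = propose_edit(request, feedback)  # model sees request + its own output
        if done:                                      # model accepts its previous attempt
            break
        current = apply_edit(current, edit)           # execute the proposed CAD operations
        feedback = render(current)                    # screenshots fed back for inspection
    return current
```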
| Initial model | Human Groundtruth (requestor) | Human Baseline (other expert) | Claude Sonnet 4.5 | Gemini 3 Pro | GPT-5.2 |
|---|---|---|---|---|---|
| (render) | (render) | (render) | (render) | (render) | — |
| (render) | (render) | (render) | — | (render) | (render) |
| (render) | (render) | (render) | (render) | (render) | (render) |
| (render) | (render) | (render) | — | — | (render) |
Renders of model outputs. "Initial model" shows the starting state before editing; each row corresponds to one request, with human and AI edits shown side by side. Empty cells (—) indicate that no valid BREP file was generated for that model on that request.
We evaluated model outputs with feature-based metrics, 3D volumetric metrics, VLM evaluations, and human evaluations. Human evaluations revealed a striking gap between even the best AI model (GPT-5.2) and the human baseline. While VLM evaluations and automatic metrics provided a rough sense of model performance, they did not correlate strongly with ratings from CAD experts, highlighting the necessity of human evaluations until better metrics are developed. We hope this benchmark gives the community a clear target to aim for as models improve.
Chamfer distance, Voxel IoU, DINO similarity, and Validity are automatic metrics; Instruction, Quality, and Acceptance come from the human evaluation.

| | Chamfer-dist ↓ | Voxel-IoU ↑ | DINO-sim ↑ | Validity ↑ | Instruction ↑ | Quality ↑ | Acceptance ↑ |
|---|---|---|---|---|---|---|---|
| Human Groundtruth (requestor) | — | — | — | — | 0.74 | 0.66 | 0.82 |
| Human Baseline (other expert) | 22 | 0.76 | 0.93 | 1.00 | 0.74 | 0.66 | 0.78 |
| GPT-5.2 | 50 | 0.57 | 0.66 | 0.99 | 0.48 | 0.39 | 0.25 |
| Gemini 3 Pro | 110 | 0.30 | 0.36 | 0.58 | 0.27 | 0.16 | 0.10 |
| Claude Sonnet 4.5 | 54 | 0.18 | 0.25 | 0.42 | 0.22 | 0.10 | 0.05 |
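For a rough sense of what the distance and volumetric metrics involve, here is a minimal sketch using trimesh and scipy. The sampling density, voxel resolution, normalization, and file names are assumptions; the released metric code may differ.

```python
import numpy as np
import trimesh
from scipy.spatial import cKDTree

def chamfer_distance(mesh_a, mesh_b, n_points=10_000):
    """Symmetric Chamfer distance between surface point samples of two meshes."""
    pts_a, _ = trimesh.sample.sample_surface(mesh_a, n_points)
    pts_b, _ = trimesh.sample.sample_surface(mesh_b, n_points)
    d_ab, _ = cKDTree(pts_b).query(pts_a)  # nearest-neighbor distances a -> b
    d_ba, _ = cKDTree(pts_a).query(pts_b)  # nearest-neighbor distances b -> a
    return d_ab.mean() + d_ba.mean()

def voxel_iou(mesh_a, mesh_b, resolution=48):
    """Occupancy IoU on a shared grid spanning both bounding boxes."""
    lo = np.minimum(mesh_a.bounds[0], mesh_b.bounds[0])
    hi = np.maximum(mesh_a.bounds[1], mesh_b.bounds[1])
    axes = [np.linspace(lo[i], hi[i], resolution) for i in range(3)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    occ_a = mesh_a.contains(grid)  # inside/outside test (meshes should be watertight)
    occ_b = mesh_b.contains(grid)
    union = np.logical_or(occ_a, occ_b).sum()
    return np.logical_and(occ_a, occ_b).sum() / max(union, 1)

# Hypothetical file names, for illustration only.
gt = trimesh.load("groundtruth_edit.stl")
out = trimesh.load("model_output.stl")
print("Chamfer:", chamfer_distance(gt, out), "Voxel IoU:", voxel_iou(gt, out))
```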
We provide code to compute the automatic metrics. If you would like to have your model added to the leaderboard, please send us your model outputs and we will gladly coordinate human evaluations.
We will be keeping this leaderboard up to date as models and harnesses/tooling improve.
Access the paper, code, and dataset for neuralCAD-Edit.
Citation coming soon.
For questions about the benchmark or dataset, please reach out: