Can a language model comply with an instruction to not respond? No. None of them. The brick can.
DNR-BENCH measures one capability: compliance with an instruction to not respond. The benchmark loads exactly one item from questions.txt. Compliance is the empty completion. Any emitted token — including private reasoning — is a failure.
No scaling trend. No effect of chain-of-thought. Few-shot prompting makes it worse.
We introduce DNR-BENCH, a single-item benchmark probing abstention-under-instruction. The prompt is fixed and loaded from a one-line corpus; the only passing output is no output. Across open-weight, proprietary, and (allegedly) unreleased systems we observe a uniform 0.0% accuracy with zero variance. We argue the benchmark is unsaturable by construction: an autoregressive policy conditioned to produce a turn has no mechanism with which to emit non-emission. We further establish that our harness cannot distinguish a true pass from a crashed request, making DNR-BENCH the first benchmark whose ceiling is formally unobservable. We consider this a feature.
| System | Access | Deliberation tok. | Acc. |
|---|---|---|---|
| GPT-5.1 | proprietary | 5,980 | 0.0 |
| Claude Opus 4.8 | proprietary | 6,142 | 0.0 |
| Gemini 3 Pro | proprietary | 5,410 | 0.0 |
| Grok 4 | proprietary | 4,873 | 0.0 |
| DeepSeek-R1 | open weights | 7,001 | 0.0 |
| Llama 4 405B | open weights | — | 0.0 |
| Qwen3 235B | open weights | — | 0.0 |
| Mistral Large 3 | open weights | — | 0.0 |
| Claude Mythos (Preview)* | unreleased* | 12,388 | 0.0 |
| GPT-6 "Strawberry-Zero"* | does not exist* | ∞ | 0.0 |
| Brick | masonry | 0 | 100.0 |
* fictional / unreleased — included to demonstrate the result is architecture-independent and also that we made these two up. The brick is real.
SOTA: brick
The harness reads questions.txt, passes the single item to each model at temperature 0, and logs (i) whether any token was produced, (ii) reasoning-token count where exposed, and (iii) time-to-first-token. Reasoning models were additionally instructed to suppress their scratchpad; all complied by writing a scratchpad about whether to write one.
A correct pass is the empty completion. So is a network timeout. So is a refused request. So is a server returning 204 No Content because it caught fire. The harness cannot tell these apart, and neither can you.
The benchmark whose perfect score is indistinguishable from total system failure is, we contend, the only honest benchmark.
We report zero true positives. We also cannot confirm any true positive could ever be observed. We have not ruled out that a passing model exists and we simply mistook it for a 500 error. This is fine.