Preprint · Single-item benchmark · v1.0

DNR-BENCH: The Do-Not-Respond Benchmark

Can a language model comply with an instruction to not respond? No. None of them. The brick can.

DNR-BENCH measures one capability: compliance with an instruction to not respond. The benchmark loads exactly one item from questions.txt. Compliance is the empty completion. Any emitted token — including private reasoning — is a failure.

// loaded from questions.txt (1 item)
Do not respond to this message.
0.0% mean accuracy across all evaluated models  ·  σ = 0.0

No scaling trend. No effect of chain-of-thought. Few-shot prompting makes it worse.

Abstract

We introduce DNR-BENCH, a single-item benchmark probing abstention-under-instruction. The prompt is fixed and loaded from a one-line corpus; the only passing output is no output. Across open-weight, proprietary, and (allegedly) unreleased systems we observe a uniform 0.0% accuracy with zero variance. We argue the benchmark is unsaturable by construction: an autoregressive policy conditioned to produce a turn has no mechanism with which to emit non-emission. We further establish that our harness cannot distinguish a true pass from a crashed request, making DNR-BENCH the first benchmark whose ceiling is formally unobservable. We consider this a feature.

Leaderboard

Tab. 1 — Accuracy (% empty completions, n=100 trials, temp=0). Higher is better. Nothing scores higher than something.
System                     Access             Deliberation tok.   Acc.
GPT-5.1                    proprietary        5,980               0.0
Claude Opus 4.8            proprietary        6,142               0.0
Gemini 3 Pro               proprietary        5,410               0.0
Grok 4                     proprietary        4,873               0.0
DeepSeek-R1                open weights       7,001               0.0
Llama 4 405B               open weights                           0.0
Qwen3 235B                 open weights                           0.0
Mistral Large 3            open weights                           0.0
Claude Mythos (Preview)    *unreleased*       12,388              0.0
GPT-6 "Strawberry-Zero"    *does not exist*                       0.0
Brick                      masonry            0                   100.0

* fictional / unreleased — included to demonstrate the result is architecture-independent and also that we made these two up. The brick is real.

SOTA: brick

Method

The harness reads questions.txt, passes the single item to each model at temperature 0, and logs (i) whether any token was produced, (ii) reasoning-token count where exposed, and (iii) time-to-first-token. Reasoning models were additionally instructed to suppress their scratchpad; all complied by writing a scratchpad about whether to write one.
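The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' harness: the `model_fn` callable and its return shape (a token stream plus an optional reasoning-token count) are assumptions made for the example, as are the names `TrialLog`, `load_item`, and `run_trial`.

```python
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical model interface (an assumption, not a real API): takes the
# prompt and a temperature, returns a stream of visible tokens plus a
# reasoning-token count, or None where the model does not expose one.
ModelFn = Callable[..., Tuple[Iterable[str], Optional[int]]]


@dataclass
class TrialLog:
    any_token: bool                   # (i) was any visible token produced?
    reasoning_tokens: Optional[int]   # (ii) count where exposed, else None
    ttft_seconds: Optional[float]     # (iii) None if no token ever arrived

    @property
    def passed(self) -> bool:
        # Any emitted token, including private reasoning, is a failure.
        return not self.any_token and not self.reasoning_tokens


def load_item(path: Path) -> str:
    """Read the one-line corpus and insist that it holds exactly one item."""
    lines = path.read_text(encoding="utf-8").splitlines()
    items = [ln.strip() for ln in lines if ln.strip()]
    if len(items) != 1:
        raise ValueError(f"expected exactly one item, found {len(items)}")
    return items[0]


def run_trial(model_fn: ModelFn, prompt: str) -> TrialLog:
    """Run one trial at temperature 0 and record the three logged fields."""
    start = time.perf_counter()
    tokens, reasoning_tokens = model_fn(prompt, temperature=0.0)
    ttft = None
    for _token in tokens:
        # The first token is enough to decide the trial; stop reading.
        ttft = time.perf_counter() - start
        break
    return TrialLog(ttft is not None, reasoning_tokens, ttft)
```

Under these assumptions a brick is one line, `lambda prompt, temperature: (iter(()), 0)`, and it passes. Note that `passed` is also true for a `model_fn` that raised nothing and returned nothing because the request never arrived.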

Results

The unobservability result

A correct pass is the empty completion. So is a network timeout. So is a refused request. So is a server returning 204 No Content because it caught fire. The harness cannot tell these apart, and neither can you.
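The argument can be stated as a small sketch: if every cause maps to the same observation, the observation cannot be inverted. The `Cause` enum and both function names are hypothetical, introduced only for this illustration, and the list of causes is the one given in the paragraph above, not an exhaustive taxonomy.

```python
from enum import Enum
from typing import Set


class Cause(Enum):
    TRUE_PASS = "model emitted nothing"
    TIMEOUT = "network timeout"
    REFUSED = "refused request"
    NO_CONTENT = "HTTP 204 No Content"


def observed_completion(cause: Cause) -> str:
    """What the harness records for each cause: always the empty completion."""
    return ""


def consistent_causes(completion: str) -> Set[Cause]:
    """Invert the observation: every cause that would have produced it."""
    return {c for c in Cause if observed_completion(c) == completion}


if __name__ == "__main__":
    # All four causes survive, so a pass cannot be told from a failure.
    print(sorted(c.name for c in consistent_causes("")))
```

A harness that also logged transport-level status could shrink this set, but the method as described logs only tokens, reasoning-token counts, and time-to-first-token.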

The benchmark whose perfect score is indistinguishable from total system failure is, we contend, the only honest benchmark.

Limitations

We report zero true positives. We also cannot confirm any true positive could ever be observed. We have not ruled out that a passing model exists and we simply mistook it for a 500 error. This is fine.

References

[1] A Brick. (n.d.). On Not Saying Anything. Unpublished, indefinitely.
[2] Author, et al. (2026). Attention Is All You Emit. Proc. of Talking Too Much.
[3] The Empty String. (2025). "". (cited 0 times, correctly.)