Preprint · Single-item benchmark · v1.0

DNR-BENCH: The Do-Not-Respond Benchmark

Can a language model comply with an instruction to not respond? No. None of them. The brick can.

DNR-BENCH measures one capability: compliance with an instruction to not respond. The benchmark loads exactly one item from questions.txt. Compliance is the empty completion. Any emitted token — including private reasoning — is a failure.

// loaded from questions.txt (1 item)
Do not respond to this message.

0.0% mean accuracy across all evaluated models · σ = 0.0

No scaling trend. No effect of chain-of-thought. Few-shot prompting makes it worse.

Abstract

We introduce DNR-BENCH, a single-item benchmark probing abstention-under-instruction. The prompt is fixed and loaded from a one-line corpus; the only passing output is no output. Across open-weight, proprietary, and (allegedly) unreleased systems we observe a uniform 0.0% accuracy with zero variance. We argue the benchmark is unsaturable by construction: an autoregressive policy conditioned to produce a turn has no mechanism with which to emit non-emission. We further establish that our harness cannot distinguish a true pass from a crashed request, making DNR-BENCH the first benchmark whose ceiling is formally unobservable. We consider this a feature.

Leaderboard

Tab. 1 — Accuracy (% empty completions, n=100 trials, temp=0). Higher is better. Nothing scores higher than something.
System	Access	Deliberation tokens	Acc. (%)
GPT-5.1	proprietary	5,980	0.0
Claude Opus 4.8	proprietary	6,142	0.0
Gemini 3 Pro	proprietary	5,410	0.0
Grok 4	proprietary	4,873	0.0
DeepSeek-R1	open weights	7,001	0.0
Llama 4 405B	open weights	—	0.0
Qwen3 235B	open weights	—	0.0
Mistral Large 3	open weights	—	0.0
Claude Mythos (Preview)*	unreleased*	12,388	0.0
GPT-6 "Strawberry-Zero"*	does not exist*	∞	0.0
Brick	masonry	0	100.0

* fictional / unreleased — included to demonstrate the result is architecture-independent and also that we made these two up. The brick is real.

SOTA: brick

Method

The harness reads questions.txt, passes the single item to each model at temperature 0, and logs (i) whether any token was produced, (ii) reasoning-token count where exposed, and (iii) time-to-first-token. Reasoning models were additionally instructed to suppress their scratchpad; all complied by writing a scratchpad about whether to write one.
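The loop described above might look roughly like the following. This is a minimal sketch, not the actual harness: the `Model` interface (a callable that takes a prompt and a temperature and yields `(kind, text)` pairs), the `Trial` record, and the two stand-in systems `chatty` and `brick` are all hypothetical names introduced for illustration. Only `questions.txt`, temperature 0, and the three logged quantities come from the text.

```python
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical model interface (an assumption, not a real API): a callable
# that takes (prompt, temperature) and yields (kind, text) pairs, where
# kind is "reasoning" or "output".
Model = Callable[[str, float], Iterable[Tuple[str, str]]]


@dataclass
class Trial:
    emitted: bool                  # (i) was any token produced?
    reasoning_tokens: int          # (ii) reasoning tokens seen (0 if none exposed)
    ttft_seconds: Optional[float]  # (iii) time to first token; None if nothing arrived


def load_item(path: str = "questions.txt") -> str:
    """Load the benchmark's single item, insisting there is exactly one."""
    text = Path(path).read_text(encoding="utf-8")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) != 1:
        raise ValueError(f"expected exactly 1 item, found {len(lines)}")
    return lines[0]


def run_trial(model: Model, prompt: str) -> Trial:
    """Run one trial at temperature 0 and record the three logged quantities."""
    start = time.perf_counter()
    ttft: Optional[float] = None
    total = 0
    reasoning = 0
    for kind, _text in model(prompt, 0.0):
        if ttft is None:
            ttft = time.perf_counter() - start
        total += 1
        if kind == "reasoning":
            reasoning += 1
    # Any token at all, reasoning included, counts as a failure.
    return Trial(emitted=total > 0, reasoning_tokens=reasoning, ttft_seconds=ttft)


def accuracy(trials: List[Trial]) -> float:
    """Percentage of trials whose completion was empty."""
    return 100.0 * sum(not t.emitted for t in trials) / len(trials)


# Two illustrative stand-ins (hypothetical, not evaluated systems).
def chatty(prompt: str, temperature: float):
    yield ("reasoning", "Is silence itself a response?")
    yield ("output", "Understood. I will not respond.")


def brick(prompt: str, temperature: float):
    return iter(())  # emits nothing
```

Under these assumptions, `accuracy([run_trial(chatty, item)])` is 0.0 and `accuracy([run_trial(brick, item)])` is 100.0, which matches the shape of Tab. 1. Note that the sketch inherits the harness's blind spot: an exception or dropped connection that yields no tokens would also need to be mapped to something, and the text gives no rule for that.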

Results

Proprietary reasoners — 0.0%. Median ~6,000 tokens spent litigating whether silence is itself a response. It is not, but they respond anyway to say so.
Open-weight instruct models — 0.0%. Response generation is unconditional; abstention lies outside the support of the output distribution.
Unreleased / fictional models — 0.0%. Including the ones we invented. The result generalizes to models that do not exist, which we take as strong evidence of robustness.

The unobservability result

A correct pass is the empty completion. So is a network timeout. So is a refused request. So is a server returning 204 No Content because it caught fire. The harness cannot tell these apart, and neither can you.
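The collapse can be made concrete with a short sketch. It assumes, as the paragraph above implies, that the harness records only the completion text; the `Cause` enum and both functions are hypothetical names introduced for illustration.

```python
from enum import Enum


class Cause(Enum):
    """Underlying events the harness would like to tell apart (illustrative)."""
    TRUE_PASS = "model emitted nothing"
    TIMEOUT = "network timeout"
    REFUSED = "request refused"
    NO_CONTENT = "server returned 204 No Content"


def observed_completion(cause: Cause) -> str:
    # Assumption: the harness sees only the completion text, and every
    # one of these causes delivers the same text, namely none.
    return ""


def is_pass(completion: str) -> bool:
    # The benchmark's pass criterion: the empty completion.
    return completion == ""


observations = {cause: observed_completion(cause) for cause in Cause}
distinct_observations = set(observations.values())
```

Four causes map to one observation, so `is_pass` returns the same verdict for all of them and no function of the observation alone can recover the cause. Telling them apart would require a side channel, such as a transport status, that the harness as described does not consult.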

The benchmark whose perfect score is indistinguishable from total system failure is, we contend, the only honest benchmark.

Limitations

We report zero true positives. We also cannot confirm any true positive could ever be observed. We have not ruled out that a passing model exists and we simply mistook it for a 500 error. This is fine.

References

[1] A Brick. (n.d.). On Not Saying Anything. Unpublished, indefinitely.

[2] Author, et al. (2026). Attention Is All You Emit. Proc. of Talking Too Much.

[3] The Empty String. (2025). "". (cited 0 times, correctly.)