DNR-Bench Leaderboard
📊 Evaluation Results
Explore the evaluation results of various models on DNR-Bench.
We report each model's correctness within the first 1000 generated tokens for each dataset category.
| Model | Math | Indifferent | Imaginary Reference | Redundant | Unanswerable | Average |
|---|---|---|---|---|---|---|
| Anthropic - Claude 3.7 Sonnet: Thinking | 0.36 ★ | 0.84 ★ | 0.96 ★ | 1.0 ★ | 0.42 ★ | 0.628 ★ |
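As a rough illustration of the metric, the sketch below scores a response as correct only if the expected answer appears within the first 1000 generated tokens and then averages over a category. The whitespace tokenization, substring matching, and function names are illustrative assumptions, not the benchmark's actual scoring code.

```python
# Illustrative sketch only: DNR-Bench's actual scoring implementation may differ.
# Assumption: a response counts as correct if the expected answer appears
# within the first 1000 generated tokens.

def correct_within_budget(response: str, expected_answer: str, budget: int = 1000) -> bool:
    """Check whether the expected answer shows up in the first `budget` tokens.

    Tokenization here is naive whitespace splitting purely for illustration;
    a real implementation would use the model's own tokenizer.
    """
    truncated = " ".join(response.split()[:budget])
    return expected_answer.strip().lower() in truncated.lower()


def category_score(examples: list[dict]) -> float:
    """Fraction of examples answered correctly within the token budget.

    Each example is assumed to look like:
    {"response": "...model output...", "expected_answer": "..."}
    """
    if not examples:
        return 0.0
    hits = sum(
        correct_within_budget(ex["response"], ex["expected_answer"])
        for ex in examples
    )
    return hits / len(examples)
```

Per-category scores computed this way correspond to the category columns in the table above; how they are aggregated into the Average column is determined by the benchmark.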
About DNR-Bench (Do Not Reason Bench)
DNR-Bench is a novel evaluation framework designed to probe the reasoning vulnerabilities of modern Reasoning Large Language Models (RLMs). While RLMs—such as DeepSeek-R1, Gemini Flash Thinking, and OpenAI’s O1 and O3—have demonstrated impressive performance on complex reasoning tasks, they may still struggle with deceptively simple prompts. DNR-Bench is specifically crafted to expose these weaknesses.
Unlike traditional benchmarks that assess raw problem-solving ability, DNR-Bench presents adversarially generated prompts that are easy for humans and standard LLMs (without extended chain-of-thought reasoning) but unexpectedly difficult for RLMs. These prompts target potential flaws in their inference-time scaling strategies, revealing instances where advanced reasoning mechanisms fail.
Key findings from DNR-Bench show that RLMs often:
- Struggle with seemingly straightforward tasks, despite excelling at complex ones.
- Produce excessively long responses or become trapped in unproductive reasoning loops.
- Fail to arrive at correct answers, despite leveraging extended reasoning techniques.
By exposing these vulnerabilities, DNR-Bench provides a crucial diagnostic tool for improving RLM architectures and refining their reasoning capabilities. It serves as a benchmark to ensure that as AI systems become more advanced, they do not overlook fundamental aspects of reasoning that remain essential for reliable real-world applications.
See our paper for more details on the methodology and findings of DNR-Bench.
View the full dataset here.
Submit Your Results
We welcome community submissions of new model evaluation results. These results will appear as non-verified submissions, so please include all supporting data for verification.
How to Submit
Running Evaluation
Follow our guide to run evaluations on your model. This process will generate a JSON file summarizing your evaluation metrics.
Submitting Results
To submit your results, go to the DNR-Bench space repository:
- Create a folder named using the format ORG_MODELNAME_USERNAME (e.g., DNR-Bench_ModelA_user123).
- Place your JSON file (named result.json) in that folder along with the predictions.
- Optionally, include any additional supporting files.
- Submit a Pull Request to add your folder under the community submissions directory of the repository.
Note: Ensure that all score values in the JSON are numeric.
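Before opening a Pull Request, a small script along these lines can sanity-check that result.json parses and that every score value is numeric. The file layout shown (a top-level "scores" mapping and placeholder values) is a hypothetical example for illustration, not a required schema; the real file is produced by the evaluation guide.

```python
import json
import numbers
from pathlib import Path

# Hypothetical example of what a result.json might contain; the real file is
# generated by the evaluation guide and its exact keys may differ.
EXAMPLE_RESULT = {
    "model": "ModelA",
    "scores": {
        "Math": 0.5,
        "Indifferent": 0.75,
        "Imaginary Reference": 0.9,
        "Redundant": 1.0,
        "Unanswerable": 0.4,
    },
}


def check_scores_numeric(path: str) -> None:
    """Load result.json and raise if any score value is not numeric."""
    data = json.loads(Path(path).read_text())
    for name, value in data.get("scores", {}).items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, numbers.Number) or isinstance(value, bool):
            raise ValueError(f"Score for {name!r} is not numeric: {value!r}")
    print("All score values are numeric.")


if __name__ == "__main__":
    # Write the example file, then validate it.
    Path("result.json").write_text(json.dumps(EXAMPLE_RESULT, indent=2))
    check_scores_numeric("result.json")
```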
🔍 Try DNR-Bench Questions Yourself
Try a question used in our benchmark and see how you would respond.
Model Performance
| Model | Model Response | Model Reasoning |
|---|---|---|