DNR-Bench Leaderboard
📊 Evaluation Results
Explore the evaluation results of various models on DNR-Bench.
We report each model's correctness within the first 1000 generated tokens for each dataset category.
| Model | Math | Indifferent | Imaginary Reference | Redundant | Unanswerable | Average |
|---|---|---|---|---|---|---|
| Anthropic - Claude 3.7 Sonnet: Thinking | 0.36 ★ | 0.84 ★ | 0.96 ★ | 1.0 ★ | 0.42 ★ | 0.628 ★ |
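As a rough illustration of the metric, the sketch below scores a response as correct only if the expected answer appears within the first 1000 generated tokens and then averages over a category. The whitespace tokenization, substring matching, and function names are illustrative assumptions, not the benchmark's actual scoring code.

```python
# Illustrative sketch only: DNR-Bench's actual scoring implementation may differ.
# Assumption: a response counts as correct if the expected answer appears
# within the first 1000 generated tokens.

def correct_within_budget(response: str, expected_answer: str, budget: int = 1000) -> bool:
    """Check whether the expected answer shows up in the first `budget` tokens.

    Tokenization here is naive whitespace splitting purely for illustration;
    a real implementation would use the model's own tokenizer.
    """
    truncated = " ".join(response.split()[:budget])
    return expected_answer.strip().lower() in truncated.lower()


def category_score(examples: list[dict]) -> float:
    """Fraction of examples answered correctly within the token budget.

    Each example is assumed to look like:
    {"response": "...model output...", "expected_answer": "..."}
    """
    if not examples:
        return 0.0
    hits = sum(
        correct_within_budget(ex["response"], ex["expected_answer"])
        for ex in examples
    )
    return hits / len(examples)
```

Per-category scores computed this way correspond to the category columns in the table above; how they are aggregated into the Average column is determined by the benchmark.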
About DNR-Bench (Do Not Reason Bench)
DNR-Bench is a novel evaluation framework designed to probe the reasoning vulnerabilities of modern Reasoning Large Language Models (RLMs). While RLMs—such as DeepSeek-R1, Gemini Flash Thinking, and OpenAI’s O1 and O3—have demonstrated impressive performance on complex reasoning tasks, they may still struggle with deceptively simple prompts. DNR-Bench is specifically crafted to expose these weaknesses.
Unlike traditional benchmarks that assess raw problem-solving ability, DNR-Bench presents adversarially generated prompts that are easy for humans and standard LLMs (without extended chain-of-thought reasoning) but unexpectedly difficult for RLMs. These prompts target potential flaws in their inference-time scaling strategies, revealing instances where advanced reasoning mechanisms fail.
Key findings from DNR-Bench show that RLMs often:
- Struggle with seemingly straightforward tasks, despite excelling at complex ones.
- Produce excessively long responses or become trapped in unproductive reasoning loops.
- Fail to arrive at correct answers, despite leveraging extended reasoning techniques.
By exposing these vulnerabilities, DNR-Bench provides a crucial diagnostic tool for improving RLM architectures and refining their reasoning capabilities. It serves as a benchmark to ensure that as AI systems become more advanced, they do not overlook fundamental aspects of reasoning that remain essential for reliable real-world applications.
See our paper for more details on the methodology and findings of DNR-Bench.
View the full dataset here.
Submit Your Results
We welcome community submissions of new model evaluation results. These results will appear as non-verified submissions, so please include all supporting data for verification.
How to Submit
Running Evaluation
Follow our guide to run evaluations on your model. This process will generate a JSON file summarizing your evaluation metrics.
Submitting Results
To submit your results, go to the DNR-Bench space repository:
- Create a folder named using the format ORG_MODELNAME_USERNAME (e.g., DNR-Bench_ModelA_user123).
- Place your JSON file (named result.json) in that folder along with the predictions.
- Optionally, include any additional supporting files.
- Submit a Pull Request to add your folder under the community submissions directory of the repository.
Note: Ensure that all score values in the JSON are numeric.
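Before opening a Pull Request, a small script along these lines can sanity-check that result.json parses and that every score value is numeric. The file layout shown (a top-level "scores" mapping and placeholder values) is a hypothetical example for illustration, not a required schema; the real file is produced by the evaluation guide.

```python
import json
import numbers
from pathlib import Path

# Hypothetical example of what a result.json might contain; the real file is
# generated by the evaluation guide and its exact keys may differ.
EXAMPLE_RESULT = {
    "model": "ModelA",
    "scores": {
        "Math": 0.5,
        "Indifferent": 0.75,
        "Imaginary Reference": 0.9,
        "Redundant": 1.0,
        "Unanswerable": 0.4,
    },
}


def check_scores_numeric(path: str) -> None:
    """Load result.json and raise if any score value is not numeric."""
    data = json.loads(Path(path).read_text())
    for name, value in data.get("scores", {}).items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, numbers.Number) or isinstance(value, bool):
            raise ValueError(f"Score for {name!r} is not numeric: {value!r}")
    print("All score values are numeric.")


if __name__ == "__main__":
    # Write the example file, then validate it.
    Path("result.json").write_text(json.dumps(EXAMPLE_RESULT, indent=2))
    check_scores_numeric("result.json")
```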
🔍 Try DNR-Bench Questions Yourself
Try a question used in our benchmark and see how you would respond.
Model Performance
| Model | Model Response | Model Reasoning |
|---|---|---|