DNR-Bench Leaderboard

📊 Evaluation Results

Explore the evaluation results of various models on DNR-Bench.

We report each model's correctness within a 1000-token budget, broken down by dataset category.
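
As a rough illustration of how a per-category score of this kind could be computed, the sketch below counts a response as correct only if it is both right and within the token budget. The record fields (`category`, `correct`, `tokens`), the function name, and the treatment of over-budget responses as incorrect are all assumptions made for illustration; the benchmark's actual scoring procedure is not described here.

```python
from collections import defaultdict

TOKEN_BUDGET = 1000  # budget stated in the leaderboard description


def correctness_by_category(records, budget=TOKEN_BUDGET):
    """Return the fraction of in-budget correct responses per category.

    `records` is an iterable of dicts with hypothetical keys:
      - "category": dataset category name (e.g. "Math")
      - "correct":  bool, whether the response was judged correct
      - "tokens":   int, number of tokens the response used
    Assumption: a response that exceeds the budget counts as incorrect.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record["category"]] += 1
        if record["correct"] and record["tokens"] <= budget:
            hits[record["category"]] += 1
    return {category: hits[category] / totals[category] for category in totals}


if __name__ == "__main__":
    # Made-up sample records, not benchmark data.
    sample = [
        {"category": "Math", "correct": True, "tokens": 400},
        {"category": "Math", "correct": True, "tokens": 1500},
        {"category": "Redundant", "correct": True, "tokens": 900},
    ]
    print(correctness_by_category(sample))
```

Under these assumptions, a correct answer that runs past the budget lowers the category score just as a wrong answer would.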

| Model | Math | Indifferent | Imaginary Reference | Redundant | Unanswerable | Average |
|---|---|---|---|---|---|---|
| Anthropic - Claude 3.7 Sonnet: Thinking | 0.36 ★ | 0.84 ★ | 0.96 ★ | 1.0 ★ | 0.42 ★ | 0.628 ★ |