● Covered by 1 source · 1 report · Low impact

CursorBench 3.1 Released with New Task Types and Grading Improvements

Aggregated by BrevFeed dev · updated 18h ago

CursorBench 3.1 adds new tasks covering codebase understanding and bug finding, broadening the benchmark beyond code editing. The update also improves the grading criteria for certain edit tasks so that scores better reflect how well models actually perform them.

Key points

New tasks cover codebase understanding, bug finding, planning, and code review
Improved grading criteria for certain edit tasks
Average cost per task computed from each model's published token pricing

New Task Introductions

CursorBench 3.1 adds tasks that specifically target codebase understanding, bug finding, planning, and code review. These tasks widen the range of coding abilities the benchmark can measure, which previously centered on editing code.

Grading Criteria Enhancements

Alongside the new tasks, the update improves the grading criteria for certain edit tasks. The change aims to score results on those tasks more accurately.

Previous Version Overview

The prior version, CursorBench 3.0, focused primarily on edit, refactor, and bugfix tasks. Version 3.1 builds on that foundation with a more diverse set of problem types.

Cost Analysis of Tasks

CursorBench computes the average cost per task by applying each model's published pricing to the tokens it consumes. The calculation counts input tokens, cache reads and writes, and output tokens, so a model's score can be weighed against what it costs to achieve. The results also acknowledge run-to-run variance, meaning small score differences between models may not be statistically significant.
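The cost calculation described above can be illustrated with a short sketch. This is not CursorBench's actual code: the `Pricing` and `TaskUsage` names, the choice to express prices per million tokens, and the example prices and token counts are all hypothetical. It simply multiplies each token bucket (input, cache read, cache write, output) by its price and averages across tasks.

```python
from dataclasses import dataclass


# Hypothetical per-million-token prices in USD; real values would come
# from each model provider's published pricing page.
@dataclass(frozen=True)
class Pricing:
    input_per_m: float
    cache_read_per_m: float
    cache_write_per_m: float
    output_per_m: float


# Hypothetical token counts recorded for a single benchmark task.
@dataclass(frozen=True)
class TaskUsage:
    input_tokens: int
    cache_read_tokens: int
    cache_write_tokens: int
    output_tokens: int


def task_cost(usage: TaskUsage, price: Pricing) -> float:
    """Dollar cost of one task: each token bucket times its per-token price."""
    return (
        usage.input_tokens * price.input_per_m
        + usage.cache_read_tokens * price.cache_read_per_m
        + usage.cache_write_tokens * price.cache_write_per_m
        + usage.output_tokens * price.output_per_m
    ) / 1_000_000


def average_cost_per_task(usages: list[TaskUsage], price: Pricing) -> float:
    """Mean dollar cost across all tasks run by one model."""
    if not usages:
        raise ValueError("need at least one task")
    return sum(task_cost(u, price) for u in usages) / len(usages)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from CursorBench results.
    price = Pricing(input_per_m=3.0, cache_read_per_m=0.30,
                    cache_write_per_m=3.75, output_per_m=15.0)
    runs = [
        TaskUsage(200_000, 1_000_000, 50_000, 20_000),
        TaskUsage(100_000, 500_000, 0, 10_000),
    ]
    print(f"average cost per task: ${average_cost_per_task(runs, price):.4f}")
```

Keeping cache reads and writes as separate buckets matters because providers typically price them differently from fresh input tokens; agentic coding runs that reread a large codebase context can be dominated by cache-read volume.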

✨ This summary was generated by AI from the outlets' reporting listed below. It is not independently verified and may contain errors; check the original sources.

Reporting from

Hacker News Front Page: CursorBench 3.1 (22h ago)