Software engineering benchmark and leaderboard for evaluating AI coding agents on real GitHub issues.
+ Important signal for coding-agent capability
+ Uses realistic software tasks
- Leaderboard scores may not predict performance on your own codebase
- Can be gamed or overfit, like any benchmark
Free public benchmark, datasets, and leaderboard access.