Datasets

| Dataset | Tasks |
|---|---|
| MichaelY310/devopsgym | 728 |
| pgcodellm/rebench-v2-test | 20 |
| factory-ai/legacy-bench | 10 |
| maxbittker/runebench | 32 |
| scale-ai/swe-atlas-qna | 124 |
| webgen-bench/webgen-bench | 101 |
| cookbook/test | 1 |
| scale-ai/swe-atlas-tw | 90 |
| cais/swebenchpro | 731 |
| futurehouse/bixbench | 205 |
| quesma/compilebench | 15 |
| bigcode/bigcodebench-hard-complete | 145 |
| LiteCoder/LiteCoder-rl | 602 |
| grafana/o11y-bench | 63 |
| nvats/codeskills-bench | 23 |
| gabeorlanski/slopcodebench | 36 |
| yanagiorigami/frontier-cs | 172 |
| sierra-research/tau3-bench | 375 |
| rexbench/rexbench | 2 |
| minnesotanlp/aar | 1,400 |
| cmu/refav | 1,500 |
| adyen/dabstep | 450 |
| xlang/ds-1000 | 1,000 |
| apple/mmau | 1,000 |
| openthoughts/openthoughts-tblite | 100 |
| openai/mmmlu | 150 |
| meta/mlgym-bench | 12 |
| featurebench/featurebench-lite | 30 |
| featurebench/featurebench-modal | 200 |
| bigcode/humanevalfix | 164 |

193 datasets