The leaderboard separation in PR #130 is inconsistently applied #138

@timeleft--

Hi @joe-needham and thanks for responding to my comment on issue #124 earlier. After you separated the two submissions in PR #130, I got curious and went through the published source code and tech reports for every submission I could find.

What jumped out: at least two other agents on the main board verifiably use test-set information to guide their runs. The teams that disclosed their methodology (or self-reported it) got penalized, while teams that didn’t are assumed compliant. Furthermore, the main board isn’t an apples-to-apples comparison to begin with.

Here’s what I found.

The main leaderboard is not directly comparable to begin with

The separation of Disarray and LoongFlow into “Additional Submissions” implies that the remaining main leaderboard is an apples-to-apples comparison. It isn’t. The main board has uncontrolled variation across every dimension that affects performance:

Hardware. Submissions range from a single V100 (12 vCPUs, 220 GB RAM) to dual H100s (48 vCPUs, 220 GB RAM). An H100 delivers roughly 3.4x the FP16 throughput of a V100, so agents on H100 hardware can train models and explore solutions over 3x faster within the same wall-clock budget. A dual-GPU setup doubles that again: CAIR MARS+ on 2x H100 has up to 7x the GPU compute throughput of R&D-Agent on a V100, yet both sit on the same leaderboard with no distinction. Famou-Agent uses 64 vCPUs and 500 GB RAM, nearly 3x the CPU and over 2x the memory of the recommended spec.

Runtime. The leaderboard spans 12 to 36 hours — a 3x range. Compounded with hardware differences, the effective compute gap is even larger.
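
To make the compounding concrete, here is a rough back-of-the-envelope sketch. It treats the 3.4x figure above as the per-GPU H100 advantage over a V100 baseline (which is what the 7x estimate implies), assumes throughput scales linearly with GPU count and wall-clock hours (an idealization; real scaling is usually worse), and uses the hardware and runtime listed for CAIR MARS+ (2× H100, 24 h) and R&D-Agent with gpt-5 (1× V100, 12 h). The `Run` class and `relative_compute` function are illustrative names only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    gpus: int
    per_gpu_speedup: float  # assumed throughput relative to one V100 (= 1.0)
    hours: float


def relative_compute(run: Run, baseline: Run) -> float:
    """Ratio of idealized GPU-throughput-hours between two runs."""
    def budget(r: Run) -> float:
        return r.gpus * r.per_gpu_speedup * r.hours

    return budget(run) / budget(baseline)


mars_plus = Run("CAIR MARS+", gpus=2, per_gpu_speedup=3.4, hours=24)
rd_agent = Run("R&D-Agent (gpt-5)", gpus=1, per_gpu_speedup=1.0, hours=12)

print(f"{relative_compute(mars_plus, rd_agent):.1f}x")  # 13.6x
```

Under these assumptions the gap is about 13.6x in GPU-throughput-hours: the roughly 7x hardware gap, doubled by the 24 h versus 12 h runtime.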

External knowledge. Some submissions use web search at runtime to retrieve SOTA models and competition-winning solutions (MLE-STAR, KAPSO, MARS+), while others ship pre-populated knowledge bases with benchmark-specific model recommendations (MLEvolve, ML-Master 2.0). These sources can significantly bootstrap the search process. There is no disclosure requirement or distinction on the leaderboard.

Test-set feedback methodology. This is the dimension used to justify the separation — but it is one uncontrolled dimension among many. Without ablation studies, there is no basis for concluding that test-set feedback has more impact on scores than a 7x hardware advantage or access to ground-truth external knowledge.

Other main-board submissions verifiably use test-set feedback

How test-set feedback is used matters here. Continuous hill-climbing, where the test score directly guides the search, is a fundamentally different practice than using a binary status flag.

FM-Agent / Famou-Agent (main leaderboard — three entries: 43.56%, 59.56%, and 64.44%)

FM-Agent holds three entries on the main leaderboard, including the current #1 position (64.44%). All three link to the same repository (baidubce/FM-Agent). That repository contains a single MLE-bench evaluator that uses the test score as an evolutionary fitness signal.
The complete pipeline in examples/MLE-insulting/evaluator.py:

  1. run_grader() calls mlebench grade-sample (private test-set grading) → returns the test score + medal status
  2. scale_score_with_medal() computes sigmoid(test_score) + medal_bonus (+1 bronze, +2 silver, +3 gold)
  3. evaluate() returns this combined_score as the evolutionary fitness for every candidate

# evaluator.py, lines 377-397
def scale_score_with_medal(raw_score, medal: str = "none"):
    score = -raw_score if lower_is_better else raw_score
    combined_score = float(1 / (1 + np.exp(-score)))
    if medal == "none":
        return combined_score
    elif medal == "bronze":
        return combined_score + 1.0
    elif medal == "silver":
        return combined_score + 2.0
    elif medal == "gold":
        return combined_score + 3.0

Also:

  • No validation-set scoring path exists anywhere in the repository’s 86-commit history — the only MLE-bench evaluator grades against the test set.
  • The evolved program (best_program.py) computes no validation metrics — it only generates test predictions.
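
To illustrate what the scoring rule quoted above does to candidate ranking, here is a self-contained sketch. It is a re-implementation for illustration, not the repository's code: `lower_is_better` is passed as an explicit parameter (in the excerpt it is a free variable), `math.exp` stands in for `np.exp`, and the three candidate scores are made up.

```python
import math

# Bonus added on top of the sigmoid term, as in the quoted excerpt.
MEDAL_BONUS = {"none": 0.0, "bronze": 1.0, "silver": 2.0, "gold": 3.0}


def fitness(raw_score: float, medal: str = "none", lower_is_better: bool = False) -> float:
    """sigmoid(test score) plus a whole-unit bonus per medal tier."""
    score = -raw_score if lower_is_better else raw_score
    return 1.0 / (1.0 + math.exp(-score)) + MEDAL_BONUS[medal]


# Hypothetical candidates: (name, test score, medal reported by the grader).
candidates = [("A", 0.70, "none"), ("B", 0.80, "bronze"), ("C", 0.90, "gold")]

ranked = sorted(candidates, key=lambda c: fitness(c[1], c[2]), reverse=True)
names = [name for name, _, _ in ranked]
print(names)  # ['C', 'B', 'A']

# The sigmoid term stays inside (0, 1), so the jump from A to B is driven
# almost entirely by the medal tier, which only the private test set can reveal.
gap = fitness(0.80, "bronze") - fitness(0.70, "none")
print(gap > 1.0)  # True
```

In other words, each candidate's fitness carries both the ordering of test scores and the medal threshold it crossed, and both are test-set information.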

On PR #123, the Famou-Agent team stated their agent uses “feedback solely from an independently partitioned validation set.” The repository was subsequently wiped on March 15, 2026 — after test-set feedback concerns were raised in February — though the code remains accessible via earlier commits and forks. No updated code supporting the validation-set claim has been published.
None of the three FM-Agent entries was moved to the separate leaderboard.

KAPSO / Leeroo (main leaderboard, 50.67%)

KAPSO receives the full grading result on every submission and uses the medal status for early termination. The source code at benchmarks/mle/handler.py shows:

  • Line 213: grade_csv(submission_path, competition) — full test-set grading runs on every valid submission
  • Line 220: self.got_medal = self.got_medal or grading_results.any_medal
  • Lines 260–261: def stop_condition(self): return self.got_medal — test-set medal controls early termination

The KAPSO paper (arXiv:2601.21526, Section 5.1) openly discloses this: “Stop early if the run achieves any medal according to the MLE-Bench grading library.” KAPSO was not moved to the separate leaderboard.
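
The control flow described above can be sketched as follows. This is a hypothetical reconstruction from the three cited lines, not the KAPSO source: `MedalLatchHandler`, `GradingResult`, and `fake_grader` are invented names, and the grader is a stub standing in for the real `grade_csv` call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GradingResult:
    score: float
    any_medal: bool


class MedalLatchHandler:
    """Grades every submission and latches once any medal is seen."""

    def __init__(self, grader: Callable[[str], GradingResult]):
        self._grader = grader
        self.got_medal = False

    def on_submission(self, submission_path: str) -> GradingResult:
        result = self._grader(submission_path)  # test-set grading on every call
        self.got_medal = self.got_medal or result.any_medal
        return result

    def stop_condition(self) -> bool:
        return self.got_medal


def fake_grader(path: str) -> GradingResult:
    # Stub with made-up scores; a medal is awarded at 0.70 or above.
    scores = {"s1.csv": 0.61, "s2.csv": 0.68, "s3.csv": 0.74}
    return GradingResult(score=scores[path], any_medal=scores[path] >= 0.70)


handler = MedalLatchHandler(fake_grader)
graded = []
for path in ["s1.csv", "s2.csv", "s3.csv", "s4.csv"]:
    if handler.stop_condition():
        break  # the run ends because the test set said "medal"
    handler.on_submission(path)
    graded.append(path)

print(graded)  # ['s1.csv', 's2.csv', 's3.csv']
```

Even though only one bit is used for control, the decision to stop is made by the private test set rather than by a validation split.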

The separation does not track severity

If we look at how much test-set information is actually used across these submissions, the current policy doesn’t track severity at all:

| Agent | Test-set information used | Leaderboard status |
|---|---|---|
| FM-Agent (64.44%) | Continuous: sigmoid(test_score) + medal_bonus as evolutionary fitness | Main |
| LoongFlow (62.66%) | Continuous: test score drives Boltzmann parent selection, child weight adjustment, and early stopping (PR #119) | Separated |
| KAPSO (50.67%) | Full grading result (score + medal) on every submission; medal status used for early stopping | Main |
| Disarray (77.78%) | Self-reported test-set feedback in PR #118 (mechanism unverifiable; no open source code) | Separated |

FM-Agent and LoongFlow both verifiably use continuous test-set scores as evolutionary fitness — yet FM-Agent remains on the main leaderboard while LoongFlow was separated. KAPSO uses medal-based early stopping, yet remains on the main board while Disarray was separated merely for self-reporting feedback usage without a verifiable codebase.
I am pointing this out to show that maintaining a consistent separation on this dimension requires auditing every submission, and the evidence above shows how impractical that is.
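
One way to make the inconsistency explicit is to encode the four rows above as data and check whether board placement follows from the kind of feedback used. The sketch below does that; the category labels (`continuous`, `medal-flag`, `self-reported`) and the `verified` flag are my own shorthand for the rows above, not an official taxonomy.

```python
from collections import defaultdict

# (agent, feedback category, verifiable from code or paper, board)
entries = [
    ("FM-Agent", "continuous", True, "Main"),
    ("LoongFlow", "continuous", True, "Separated"),
    ("KAPSO", "medal-flag", True, "Main"),
    ("Disarray", "self-reported", False, "Separated"),
]

# Check 1: does the same feedback category land on more than one board?
boards_by_category = defaultdict(set)
for agent, category, verified, board in entries:
    boards_by_category[category].add(board)
split_categories = sorted(c for c, b in boards_by_category.items() if len(b) > 1)
print(split_categories)  # ['continuous']

# Check 2: which verified users of test-set feedback are still on the main board?
verified_on_main = [a for a, _, verified, board in entries if verified and board == "Main"]
print(verified_on_main)  # ['FM-Agent', 'KAPSO']
```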

The verification gap makes consistent enforcement impractical

Of the 28 main-board entries, only 15 publish a complete implementation that allows independent verification of test-set feedback methodology. Another 8 provide partial artifacts, and 5 have no published code or tech reports at all.
Under the current policy, unverifiable submissions are treated as compliant by default, while submissions that transparently disclosed their approach were separated. This creates a perverse incentive structure: teams that publish less code and disclose less methodology are less likely to be flagged.

Suggestion: Return to one board, use transparency columns

The leaderboard already surfaces differences in LLM(s) used and running time as columns next to the score. If those belong there, why not hardware, external knowledge, and test-set feedback? These all change how a result should be read.
To do the legwork, I compiled these dimensions for every submission from published code, papers, and submission PRs:

| Agent | LLM(s) used | Low == Lite (%) | Medium (%) | High (%) | All (%) | Running Time (hours) | Hardware Used | External Knowledge | Test Set Feedback | Tech Report | Implementation Source Code | Artifact Source Code | Date | Grading Reports Available |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Disarray | Ensemble (Claude-Opus-4.5, Claude-Sonnet-4.5, GPT-5.2-Codex, Gemini-3-Pro-Preview) | 90.91 ± 0.00 | 72.81 ± 0.88 | 71.11 ± 2.22 | 77.78 ± 0.44 | 24 | 24 vCPUs, 220 GB RAM, 1× A100 | Unable to verify (no source code or paper) | Self-reported (unverifiable) | In prep | X | X | 2026-02-03 | |
| Famou-Agent 2.0 | Gemini-3-Pro-Preview | 80.3 ± 1.52 | 64.04 ± 2.32 | 42.22 ± 2.22 | 64.44 ± 1.18 | 24 | 64 vCPUs, 500 GB RAM, 1× A800 (24 GB) | Yes (based on paper/tech report) | Yes (based on source code) | arXiv:2510.26144 | X | baidubce/FM-Agent | 2026-02-23 | |
| AIBuildAI | Claude-Opus-4.6 | 77.27 ± 0.00 | 61.40 ± 0.88 | 46.67 ± 0.00 | 63.11 ± 0.44 | 24 | 24 vCPUs, 256 GB RAM, 1× A100 | Unable to verify (no source code or paper) | Unable to verify (no source code or paper) | X | X | aibuildai/AI-Build-AI | 2026-03-06 | |
| CAIR MARS+ | Gemini-3-Pro-Preview | 78.79 ± 1.52 | 60.53 ± 1.52 | 44.44 ± 2.22 | 62.67 ± 0.77 | 24 | 48 vCPUs, 220 GB RAM, 2× H100 | Yes (based on paper/tech report) | No (based on paper/tech report) | arXiv:2602.02660 | X | jfc43/MARS | 2026-02-17 | |
| LoongFlow | Gemini-3-Flash-Preview | 77.27 ± 0.0 [1] | 63.15 ± 1.51 [1] | 40.0 ± 0.00 [1] | 62.66 ± 0.76 [1] | 24 | 36 vCPUs, 440 GB RAM, 2× A10 or 2× H20 | No (based on both paper and source code) | Hill Climbing | arXiv:2512.24077 | baidu-baige/LoongFlow | | 2026-02-09 | |
| MLEvolve | Gemini-3-Pro-Preview | 80.30 ± 1.52 | 57.89 ± 1.52 | 42.22 ± 2.22 | 61.33 ± 1.33 | 12 | 21 vCPUs, 234 GB RAM, 1× H200 | Yes (based on both paper and source code) | No (based on source code) | arXiv:2510.08511 | InternScience/MLEvolve | | 2026-02-14 | |
| PiEvolve (Fractal AI Research) | Gemini-3-Pro-Preview [2] | 80.30 ± 1.52 [1] | 58.77 ± 0.88 [1] | 40.0 ± 0.00 [1] | 61.33 ± 0.77 [1] | 24 | 40 vCPUs, 240 GB RAM, 1× H100 (96 GB) | No (based on paper/tech report) | No (based on paper/tech report) | PiML @ AutoML’25 | X | FractalAIResearchLabs/PiEvolve | 2026-01-05 | |
| Famou-Agent 2.0 | Gemini-2.5-Pro | 75.76 ± 1.52 | 57.89 ± 1.52 | 40.00 ± 0.00 | 59.56 ± 0.89 | 24 | 64 vCPUs, 500 GB RAM, 1× A800 (24 GB) | Yes (based on paper/tech report) | Yes (based on source code) | arXiv:2510.26144 | X | baidubce/FM-Agent | 2025-12-27 | |
| ML-Master 2.0 | Deepseek-V3.2-Speciale | 75.76 ± 1.51 | 50.88 ± 3.51 | 42.22 ± 2.22 | 56.44 ± 2.47 | 24 | 36 vCPUs, 252 GB RAM, 2× RTX 4090 (24 GB) | Yes (based on paper/tech report) | No (based on both paper and source code) | arXiv:2601.10402 | sjtu-sai-agents/ML-Master | | 2025-12-16 | |
| CAIR MARS | Gemini-3-Pro-Preview | 74.24 ± 1.52 | 52.63 ± 3.04 | 37.78 ± 2.22 | 56.0 ± 1.54 | 24 | 12 vCPUs, 220 GB RAM, 1× A100-40 GB | Yes (based on paper/tech report) | No (based on paper/tech report) | arXiv:2602.02660 | X | jfc43/MARS | 2026-01-25 | |
| PiEvolve (Fractal AI Research) | Gemini-3-Pro-Preview [2] | 74.24 ± 3.03 [1] | 45.61 ± 0.88 [1] | 35.55 ± 2.22 [1] | 52.0 ± 0.77 [1] | 12 | 40 vCPUs, 240 GB RAM, 1× H100 (96 GB) | No (based on paper/tech report) | No (based on paper/tech report) | PiML @ AutoML’25 | X | FractalAIResearchLabs/PiEvolve | 2026-01-05 | |
| Leeroo | Gemini-3-Pro-Preview [2] | 68.18 ± 2.62 [1] | 44.74 ± 1.52 [1] | 40.00 ± 0.00 [1] | 50.67 ± 1.33 [1] | 24 | 24 vCPUs, 150 GB RAM, 1× H100 | Yes (based on both paper and source code) | Yes (based on both paper and source code) | arXiv:2601.21526 | Leeroo-AI/kapso | | 2025-12-07 | |
| Thesis | gpt-5-codex | 65.15 ± 1.52 | 45.61 ± 7.18 | 31.11 ± 2.22 | 48.44 ± 3.64 | 24 | 24 vCPUs, 170 GB RAM, 1× H100 | Unable to verify (no source code or paper) | Unable to verify (no source code or paper) | X | X | X | 2025-11-10 | |
| CAIR MLE-STAR-Pro-1.5 | Gemini-2.5-Pro | 68.18 ± 2.62 | 34.21 ± 1.52 | 33.33 ± 0.00 | 44.00 ± 1.33 | 24 | 24 vCPUs, 220 GB RAM, 2× A100-40 GB | Yes (based on paper/tech report) | No (based on paper/tech report) | arXiv:2506.15692 | X | X | 2025-11-25 | |
| Famou-Agent | Gemini-2.5-Pro | 62.12 ± 1.52 | 36.84 ± 1.52 | 33.33 ± 0.00 | 43.56 ± 0.89 | 24 | 64 vCPUs, 500 GB RAM, 1× A800 (24 GB) | Yes (based on paper/tech report) | Yes (based on source code) | arXiv:2510.26144 | X | baidubce/FM-Agent | 2025-10-10 | |
| Operand ensemble | gpt-5 (low verbosity/effort) [3] | 63.64 ± 0.00 | 33.33 ± 0.88 [1] | 20.00 ± 0.00 [1] | 39.56 ± 0.44 [1] | 24 | 36 vCPUs, 440 GB RAM, 1× A10 | No (based on both paper and source code) | No (based on both paper and source code) | arXiv:2510.11694 | X | ramgorthi04/OperandLinear-MLE-Bench | 2025-10-06 | |
| CAIR MLE-STAR-Pro-1.0 | Gemini-2.5-Pro | 66.67 ± 1.52 | 25.44 ± 0.88 | 31.11 ± 2.22 | 38.67 ± 0.77 | 12 | 24 vCPUs, 220 GB RAM, 1× V100 | Yes (based on paper/tech report) | No (based on paper/tech report) | arXiv:2506.15692 | X | X | 2025-11-03 | |
| InternAgent | deepseek-r1 | 62.12 ± 3.03 | 26.32 ± 2.63 | 24.44 ± 2.22 | 36.44 ± 1.18 | 12 | 32 vCPUs, 230 GB RAM, 1× A800 | Yes (based on both paper and source code) | No (based on source code) | arXiv:2505.16938 | Alpha-Innovator/InternAgent | | 2025-09-12 | |
| R&D-Agent | gpt-5 | 68.18 ± 2.62 | 21.05 ± 1.52 | 22.22 ± 2.22 | 35.11 ± 0.44 | 12 | 12 vCPUs, 220 GB RAM, 1× V100 | Yes (based on paper/tech report) | No (based on source code) | arXiv:2505.14738 | microsoft/RD-Agent | | 2025-09-26 | |
| Neo multi-agent | undisclosed | 48.48 ± 1.52 | 29.82 ± 2.32 | 24.44 ± 2.22 | 34.22 ± 0.89 | 36 | 24 vCPUs, 144 GB RAM, 1× A100 | Unable to verify (no source code or paper) | Unable to verify (no source code or paper) | X | X | X | 2025-07-28 | |
| AIRA-dojo | o3 | 55.00 ± 1.47 | 21.97 ± 1.17 | 21.67 ± 1.07 | 31.60 ± 0.82 | 24 | 24 vCPUs, 120 GB RAM, 1× H200 | No (based on both paper and source code) | No (based on both paper and source code) | arXiv:2507.02554 | facebookresearch/aira-dojo | | 2025-05-15 | |
| R&D-Agent | o3 + GPT-4.1 | 51.52 ± 4.01 | 19.30 ± 3.16 | 26.67 ± 0.00 | 30.22 ± 0.89 | 24 | 12 vCPUs, 220 GB RAM, 1× V100 | Yes (based on paper/tech report) | No (based on source code) | arXiv:2505.14738 | microsoft/RD-Agent | | 2025-08-15 | |
| ML-Master | deepseek-r1 | 48.48 ± 1.52 | 20.18 ± 2.32 | 24.44 ± 2.22 | 29.33 ± 0.77 | 12 | 36 vCPUs, 512 GB RAM, 1× A100 | Yes (based on paper/tech report) | No (based on source code) | arXiv:2506.16499 | zeroxleo/ML-Master | | 2025-06-17 | |
| R&D-Agent | o1-preview | 48.18 ± 1.11 | 8.95 ± 1.05 | 18.67 ± 1.33 | 22.40 ± 0.50 | 24 | 12 vCPUs, 220 GB RAM, 1× V100 | Yes (based on paper/tech report) | No (based on source code) | arXiv:2505.14738 | microsoft/RD-Agent | | 2025-05-14 | |
| AIDE | o1-preview | 35.91 ± 1.86 | 8.45 ± 0.43 | 11.67 ± 1.27 | 17.12 ± 0.61 | 24 | 36 vCPUs, 440 GB RAM, 4095 GiB SSD, 1× A10 | No (based on both paper and source code) | No (based on both paper and source code) | arXiv:2502.13138 | WecoAI/aideml | | 2024-10-08 | |
| AIDE | gpt-4o-2024-08-06 | 18.55 ± 1.26 | 3.06 ± 0.33 | 8.15 ± 0.84 | 8.63 ± 0.54 | 24 | 36 vCPUs, 440 GB RAM, 4095 GiB SSD, 1× A10 | No (based on both paper and source code) | No (based on both paper and source code) | arXiv:2502.13138 | WecoAI/aideml | | 2024-10-08 | |
| AIDE | claude-3-5-sonnet-20240620 | 19.70 ± 1.52 | 2.63 ± 1.52 | 2.22 ± 2.22 | 7.56 ± 1.60 | 24 | 36 vCPUs, 440 GB RAM, 4095 GiB SSD, 1× A10 | No (based on both paper and source code) | No (based on both paper and source code) | arXiv:2502.13138 | WecoAI/aideml | | 2024-10-08 | |
| OpenHands | gpt-4o-2024-08-06 | 12.12 ± 1.52 | 1.75 ± 0.88 | 2.22 ± 2.22 | 4.89 ± 0.44 | 24 | 36 vCPUs, 440 GB RAM, 4095 GiB SSD, 1× A10 | No (based on both paper and source code) | No (based on source code) | arXiv:2407.16741 | All-Hands-AI/OpenHands | | 2024-10-08 | |
| AIDE | llama-3.1-405b-instruct | 10.23 ± 1.14 | 0.66 ± 0.66 | 0.00 ± 0.00 | 3.33 ± 0.38 | 24 | 36 vCPUs, 440 GB RAM, 4095 GiB SSD, 1× A10 | No (based on both paper and source code) | No (based on both paper and source code) | arXiv:2502.13138 | WecoAI/aideml | | 2024-10-08 | |
| MLAB | gpt-4o-2024-08-06 | 4.55 ± 0.86 | 0.00 ± 0.00 | 0.00 ± 0.00 | 1.60 ± 0.27 | 24 | 36 vCPUs, 440 GB RAM, 4095 GiB SSD, 1× A10 | No (based on paper/tech report) | No (based on both paper and source code) | arXiv:2310.03302 | openai/mle-bench (agents/) | | 2024-10-08 | |

I’d restore a single leaderboard and keep these transparency columns visible as first-class fields next to performance. That gives readers the full picture in one view and avoids the maintenance burden of policing a separation that, as shown above, is virtually impossible to enforce consistently anyway.
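
As a rough idea of what "first-class fields" could look like in practice, here is a minimal sketch of a leaderboard record carrying the transparency dimensions alongside the score. The `LeaderboardEntry` and `Disclosure` names and the field layout are assumptions of mine, not an existing schema in the repository; the example values are taken from the Leeroo row.

```python
from dataclasses import dataclass
from enum import Enum


class Disclosure(Enum):
    YES = "Yes"
    NO = "No"
    UNVERIFIED = "Unable to verify"


@dataclass(frozen=True)
class LeaderboardEntry:
    agent: str
    all_score_pct: float
    runtime_hours: int
    hardware: str
    external_knowledge: Disclosure
    test_set_feedback: Disclosure

    def row(self) -> str:
        """Render the entry as one pipe-delimited table row."""
        cells = [
            self.agent,
            f"{self.all_score_pct:.2f}",
            str(self.runtime_hours),
            self.hardware,
            self.external_knowledge.value,
            self.test_set_feedback.value,
        ]
        return "| " + " | ".join(cells) + " |"


entry = LeaderboardEntry(
    agent="Leeroo",
    all_score_pct=50.67,
    runtime_hours=24,
    hardware="24 vCPUs, 150 GB RAM, 1x H100",
    external_knowledge=Disclosure.YES,
    test_set_feedback=Disclosure.YES,
)
print(entry.row())
```

Making "Unable to verify" an explicit value, rather than an empty cell, keeps undisclosed submissions from reading as compliant by default.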
Happy to be corrected on any of the entries. Thanks!

Footnotes

  1. Computed by padding incomplete seeds with failing scores.

  2. The architecture is primarily driven by Gemini-3-Pro-Preview, with a subset of modules utilizing GPT-5 and GPT-5-mini.

  3. With some light assistance from an ensemble of models including Gemini-2.5-Pro, Grok-4, and Claude 4.1 Opus, distilled by Gemini-2.5-Pro.
