The leaderboard separation in PR #130 is inconsistently applied

@joe-needham

Hi @joe-needham and thanks for responding to my comment on issue #124 earlier. After you separated the two submissions in PR #130, I got curious and went through the published source code and tech reports for every submission I could find.

What jumped out: at least two other agents on the main board verifiably use test-set information to guide their runs. Teams that disclosed or self-reported their methodology were penalized, while teams that did not are assumed compliant. On top of that, the main board isn’t an apples-to-apples comparison to begin with.

Here’s what I found.

The main leaderboard is not directly comparable to begin with

The separation of Disarray and LoongFlow into “Additional Submissions” implies that the remaining main leaderboard is an apples-to-apples comparison. It isn’t. The main board has uncontrolled variation across every dimension that affects performance:

Hardware. Submissions range from a single V100 (12 vCPUs, 220 GB RAM) to dual H100s (48 vCPUs, 220 GB RAM). An H100 delivers roughly 3.4x the FP16 throughput of an A100, so agents on H100 hardware can train models and explore solutions over 3x faster than agents on an A100 within the same wall-clock budget, and a dual-GPU setup can double that again if the agent uses both cards. By that estimate, CAIR MARS+ on 2x H100 has up to 7x the GPU throughput of a single-A100 submission, and the gap to R&D-Agent on a V100 (an older generation than the A100) is larger still, yet all of them sit on the same leaderboard with no distinction. Famou-Agent uses 64 vCPUs and 500 GB RAM, nearly 3x the CPU and over 2x the memory of the recommended spec.

Runtime. The leaderboard spans 12 to 36 hours — a 3x range. Compounded with hardware differences, the effective compute gap is even larger.
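
As a rough illustration of how hardware and runtime compound, the sketch below multiplies a relative GPU throughput by the wall-clock hours. The 7x relative throughput is the estimate quoted in the Hardware paragraph and is used here as a placeholder, not a measurement. The hours come from the Running Time column, and the helper name `effective_compute` is my own. It ignores CPU, memory, and how well an agent actually uses a second GPU.

```python
def effective_compute(relative_gpu_throughput: float, hours: float) -> float:
    """Relative GPU budget: throughput (baseline = 1.0) times wall-clock hours."""
    return relative_gpu_throughput * hours


# Baseline: R&D-Agent (gpt-5) on 1x V100 for 12 hours.
baseline = effective_compute(1.0, 12)

# CAIR MARS+ on 2x H100 for 24 hours, taking 7x as the assumed throughput ratio.
mars_plus = effective_compute(7.0, 24)

ratio = mars_plus / baseline
print(f"{ratio:.0f}x")  # 14x under these assumptions
```

Even with generous error bars on the throughput figure, the two entries are not working with comparable budgets.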

External knowledge. Some submissions use web search at runtime to retrieve SOTA models and competition-winning solutions (MLE-STAR, KAPSO, MARS+), while others ship pre-populated knowledge bases with benchmark-specific model recommendations (MLEvolve, ML-Master 2.0). These sources can significantly bootstrap the search process. There is no disclosure requirement or distinction on the leaderboard.

Test-set feedback methodology. This is the dimension used to justify the separation — but it is one uncontrolled dimension among many. Without ablation studies, there is no basis for concluding that test-set feedback has more impact on scores than a 7x hardware advantage or access to ground-truth external knowledge.

Other main-board submissions verifiably use test-set feedback

How test-set feedback is used matters here. Continuous hill-climbing, where the test score directly guides the search, is a fundamentally different practice from reading a binary status flag such as medal or no medal.
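
To show why selecting on a continuous test score is the more serious case, here is a toy simulation (not any submission's code). Every candidate is random noise with no real skill. Picking the winner by test accuracy still reports an above-chance test score, while picking by validation accuracy does not get that boost. All names and sizes below are made up for illustration.

```python
import random


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


rng = random.Random(0)
n_val, n_test, n_candidates = 200, 200, 50

val_labels = [rng.randint(0, 1) for _ in range(n_val)]
test_labels = [rng.randint(0, 1) for _ in range(n_test)]

# Each candidate is pure noise: random predictions on both splits.
candidates = [
    {
        "val": [rng.randint(0, 1) for _ in range(n_val)],
        "test": [rng.randint(0, 1) for _ in range(n_test)],
    }
    for _ in range(n_candidates)
]

# Selection guided by the test score (hill-climbing on the test set).
by_test = max(candidates, key=lambda c: accuracy(c["test"], test_labels))
# Selection guided by a held-out validation split.
by_val = max(candidates, key=lambda c: accuracy(c["val"], val_labels))

test_acc_by_test = accuracy(by_test["test"], test_labels)
test_acc_by_val = accuracy(by_val["test"], test_labels)

print(f"selected on test:       {test_acc_by_test:.3f}")
print(f"selected on validation: {test_acc_by_val:.3f}")
```

The first number is the maximum over all candidates by construction, so it can never be lower than the second. The gap between them is selection bias, not model quality.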

FM-Agent / Famou-Agent (main leaderboard — three entries: 43.56%, 59.56%, and 64.44%)

FM-Agent holds three entries on the main leaderboard, including the current #1 position (64.44%). All three link to the same repository (baidubce/FM-Agent). That repository contains a single MLE-bench evaluator that uses the test score as an evolutionary fitness signal.
The complete pipeline in examples/MLE-insulting/evaluator.py:

`run_grader()` calls mlebench grade-sample (private test-set grading) and returns the test score plus the medal status.
`scale_score_with_medal()` computes sigmoid(test_score) + medal_bonus (+1 bronze, +2 silver, +3 gold).
`evaluate()` returns this combined_score as the evolutionary fitness for every candidate.

# evaluator.py, lines 377-397
def scale_score_with_medal(raw_score, medal: str = "none"):
    score = -raw_score if lower_is_better else raw_score
    combined_score = float(1 / (1 + np.exp(-score)))
    if medal == "none":
        return combined_score
    elif medal == "bronze":
        return combined_score + 1.0
    elif medal == "silver":
        return combined_score + 2.0
    elif medal == "gold":
        return combined_score + 3.0
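
A small self-contained restatement of that formula makes the ranking effect concrete. This is my own sketch, not the repository's code: it uses `math` instead of NumPy, takes `lower_is_better` as an explicit parameter (in the quoted snippet it comes from the enclosing scope), and uses made-up scores. Because the sigmoid term stays below 1 for moderate scores, any medal-winning candidate outranks any non-medal candidate regardless of raw score.

```python
import math

MEDAL_BONUS = {"none": 0.0, "bronze": 1.0, "silver": 2.0, "gold": 3.0}


def fitness(raw_score: float, medal: str = "none", lower_is_better: bool = False) -> float:
    """Sigmoid of the signed test score plus a per-medal bonus."""
    score = -raw_score if lower_is_better else raw_score
    return 1.0 / (1.0 + math.exp(-score)) + MEDAL_BONUS[medal]


# Illustrative values only: a high raw score without a medal
# versus a low raw score that still earns bronze.
strong_no_medal = fitness(10.0, "none")
weak_bronze = fitness(-10.0, "bronze")

print(strong_no_medal < weak_bronze)  # True: the medal tier dominates the ordering
```

Both inputs to this fitness, the raw score and the medal, come from grading against the private test set.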

Also:

No validation-set scoring path exists anywhere in the repository’s 86-commit history — the only MLE-bench evaluator grades against the test set.
The evolved program (best_program.py) computes no validation metrics — it only generates test predictions.
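
For contrast, a validation-only scoring path could look roughly like the sketch below. It is not taken from any repository. The function names, the 80/20 split, and the toy data are assumptions chosen only to show fitness computed from held-out training rows, with no call to the test-set grader.

```python
import random


def split_train_validation(rows, val_fraction=0.2, seed=0):
    """Shuffle labeled training rows and hold out a fraction for validation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def validation_fitness(predict, val_rows):
    """Accuracy on held-out training rows; the private test labels are never read."""
    correct = sum(predict(features) == label for features, label in val_rows)
    return correct / len(val_rows)


# Toy labeled data: the label is 1 when the single feature is at least 5.
rows = [((x,), int(x >= 5)) for x in range(10)]
train_rows, val_rows = split_train_validation(rows)


def candidate(features):
    """A hypothetical evolved program that happens to match the labeling rule."""
    return int(features[0] >= 5)


print(validation_fitness(candidate, val_rows))  # 1.0 for this toy candidate
```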

On PR #123, the Famou-Agent team stated their agent uses “feedback solely from an independently partitioned validation set.” The repository was subsequently wiped on March 15, 2026 — after test-set feedback concerns were raised in February — though the code remains accessible via earlier commits and forks. No updated code supporting the validation-set claim has been published.
None of the three FM-Agent entries was moved to the separate leaderboard.

KAPSO / Leeroo (main leaderboard, 50.67%)

KAPSO receives the full grading result on every submission and uses the medal status for early termination. The source code at benchmarks/mle/handler.py shows:

Line 213: grade_csv(submission_path, competition) — full test-set grading runs on every valid submission
Line 220: self.got_medal = self.got_medal or grading_results.any_medal
Lines 260–261: def stop_condition(self): return self.got_medal — test-set medal controls early termination

The KAPSO paper (arXiv:2601.21526, Section 5.1) openly discloses this: “Stop early if the run achieves any medal according to the MLE-Bench grading library.” KAPSO was not moved to the separate leaderboard.
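
The control flow described above can be sketched as follows. This is a simplified stand-in, not KAPSO's code. `GradingResult`, `Handler`, `run`, the file names, and the 0.80 medal threshold are all hypothetical, and the grader is a stub rather than the real grading library. The point is only that the loop's exit depends on a value derived from test-set grading.

```python
from dataclasses import dataclass


@dataclass
class GradingResult:
    """Hypothetical stand-in for the report returned by the grading library."""
    score: float
    any_medal: bool


class Handler:
    def __init__(self, grade_fn):
        self.grade_fn = grade_fn  # stub for a test-set grading call
        self.got_medal = False
        self.graded = 0

    def submit(self, submission_path: str) -> None:
        result = self.grade_fn(submission_path)  # graded on every submission
        self.graded += 1
        self.got_medal = self.got_medal or result.any_medal

    def stop_condition(self) -> bool:
        return self.got_medal  # test-set medal decides when the search ends


def run(handler, submission_paths):
    """Submit candidates in order and stop as soon as the handler says so."""
    for path in submission_paths:
        handler.submit(path)
        if handler.stop_condition():
            break
    return handler.graded


# Made-up scores; a "medal" here is any score of at least 0.80.
fake_scores = {"a.csv": 0.61, "b.csv": 0.83, "c.csv": 0.90}
handler = Handler(lambda p: GradingResult(fake_scores[p], fake_scores[p] >= 0.80))

stopped_after = run(handler, ["a.csv", "b.csv", "c.csv"])
print(stopped_after)  # 2: the run ends once the second submission medals
```

This leaks less than a continuous fitness signal, but the search still conditions on private test-set grading.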

The separation does not track severity

If we look at how much test-set information is actually used across these submissions, the current policy doesn’t track severity at all:

Agent	Test-set information used	Leaderboard status
FM-Agent (64.44%)	Continuous: `sigmoid(test_score) + medal_bonus` as evolutionary fitness	Main
LoongFlow (62.66%)	Continuous: test score drives Boltzmann parent selection, child weight adjustment, and early stopping (PR #119)	Separated
KAPSO (50.67%)	Full grading result (score + medal) on every submission; medal status used for early stopping	Main
Disarray (77.78%)	Self-reported test-set feedback in PR #118 (mechanism unverifiable; no open source code)	Separated

FM-Agent and LoongFlow both verifiably use continuous test-set scores as evolutionary fitness — yet FM-Agent remains on the main leaderboard while LoongFlow was separated. KAPSO uses medal-based early stopping, yet remains on the main board while Disarray was separated merely for self-reporting feedback usage without a verifiable codebase.
I am pointing this out to show that maintaining a consistent separation on this dimension requires auditing every submission, and the evidence above shows how impractical that is.

The verification gap makes consistent enforcement impractical

Of the 28 main-board entries, only 15 publish a complete implementation that allows independent verification of test-set feedback methodology. Another 8 provide partial artifacts, and 5 have no published code or tech reports at all.
Under the current policy, unverifiable submissions are treated as compliant by default, while submissions that transparently disclosed their approach were separated. This creates a perverse incentive structure: teams that publish less code and disclose less methodology are less likely to be flagged.
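
The incentive problem can be written down as a tiny decision rule. This is my characterization of the policy as described in this issue, not the maintainers' stated rule. The enum and function names are hypothetical.

```python
from enum import Enum


class Evidence(Enum):
    DISCLOSED_FEEDBACK = "team disclosed test-set feedback"
    VERIFIED_NO_FEEDBACK = "published code shows no test-set feedback"
    UNVERIFIABLE = "nothing published to check"


def board_under_default_compliance(evidence: Evidence) -> str:
    """Only disclosed feedback triggers separation; everything else stays on main."""
    if evidence is Evidence.DISCLOSED_FEEDBACK:
        return "separated"
    return "main"


for evidence in Evidence:
    print(f"{evidence.value}: {board_under_default_compliance(evidence)}")
```

Under this rule an unverifiable entry and a verified-clean entry receive the same treatment, so publishing code can only hurt a submission's placement.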

Suggestion: Return to one board, use transparency columns

The leaderboard already shows running time and the LLM(s) used as columns next to the score. If those belong next to the score, why not hardware, external knowledge, and test-set feedback? These all change how a result should be read.
To do the legwork, I compiled these dimensions for every submission from published code, papers, and submission PRs:

Agent	LLM(s) used	Low == Lite (%)	Medium (%)	High (%)	All (%)	Running Time (hours)	Hardware Used	External Knowledge	Test Set Feedback	Tech Report	Implementation Source Code	Artifact Source Code	Date	Grading Reports Available
Disarray	Ensemble (Claude-Opus-4.5, Claude-Sonnet-4.5, GPT-5.2-Codex, Gemini-3-Pro-Preview)	90.91 ± 0.00	72.81 ± 0.88	71.11 ± 2.22	77.78 ± 0.44	24	24 vCPUs, 220 GB RAM, 1× A100	Unable to verify (no source code or paper)	Self-reported (unverifiable)	In prep	X	X	2026-02-03	✓
Famou-Agent 2.0	Gemini-3-Pro-Preview	80.3 ± 1.52	64.04 ± 2.32	42.22 ± 2.22	64.44 ± 1.18	24	64 vCPUs, 500 GB RAM, 1× A800 (24 GB)	Yes (based on paper/tech report)	Yes (based on source code)	arXiv:2510.26144	X	baidubce/FM-Agent	2026-02-23	✓
AIBuildAI	Claude-Opus-4.6	77.27 ± 0.00	61.40 ± 0.88	46.67 ± 0.00	63.11 ± 0.44	24	24 vCPUs, 256 GB RAM, 1× A100	Unable to verify (no source code or paper)	Unable to verify (no source code or paper)	X	X	aibuildai/AI-Build-AI	2026-03-06	✓
CAIR MARS+	Gemini-3-Pro-Preview	78.79 ± 1.52	60.53 ± 1.52	44.44 ± 2.22	62.67 ± 0.77	24	48 vCPUs, 220 GB RAM, 2× H100	Yes (based on paper/tech report)	No (based on paper/tech report)	arXiv:2602.02660	X	jfc43/MARS	2026-02-17	✓
LoongFlow	Gemini-3-Flash-Preview	77.27 ± 0.0¹	63.15 ± 1.51¹	40.0 ± 0.00¹	62.66 ± 0.76¹	24	36 vCPUs, 440 GB RAM, 2× A10 or 2× H20	No (based on both paper and source code)	Hill Climbing	arXiv:2512.24077	baidu-baige/LoongFlow	—	2026-02-09	✓
MLEvolve	Gemini-3-Pro-Preview	80.30 ± 1.52	57.89 ± 1.52	42.22 ± 2.22	61.33 ± 1.33	12	21 vCPUs, 234 GB RAM, 1× H200	Yes (based on both paper and source code)	No (based on source code)	arXiv:2510.08511	InternScience/MLEvolve	—	2026-02-14	✓
PiEvolve (Fractal AI Research)	Gemini-3-Pro-Preview²	80.30 ± 1.52¹	58.77 ± 0.88¹	40.0 ± 0.00¹	61.33 ± 0.77¹	24	40 vCPUs, 240 GB RAM, 1× H100 (96 GB)	No (based on paper/tech report)	No (based on paper/tech report)	PiML @ AutoML’25	X	FractalAIResearchLabs/PiEvolve	2026-01-05	✓
Famou-Agent 2.0	Gemini-2.5-Pro	75.76 ± 1.52	57.89 ± 1.52	40.00 ± 0.00	59.56 ± 0.89	24	64 vCPUs, 500 GB RAM, 1× A800 (24 GB)	Yes (based on paper/tech report)	Yes (based on source code)	arXiv:2510.26144	X	baidubce/FM-Agent	2025-12-27	✓
ML-Master 2.0	Deepseek-V3.2-Speciale	75.76 ± 1.51	50.88 ± 3.51	42.22 ± 2.22	56.44 ± 2.47	24	36 vCPUs, 252 GB RAM, 2× RTX 4090 (24 GB)	Yes (based on paper/tech report)	No (based on both paper and source code)	arXiv:2601.10402	sjtu-sai-agents/ML-Master	—	2025-12-16	✓
CAIR MARS	Gemini-3-Pro-Preview	74.24 ± 1.52	52.63 ± 3.04	37.78 ± 2.22	56.0 ± 1.54	24	12 vCPUs, 220 GB RAM, 1× A100-40 GB	Yes (based on paper/tech report)	No (based on paper/tech report)	arXiv:2602.02660	X	jfc43/MARS	2026-01-25	✓
PiEvolve (Fractal AI Research)	Gemini-3-Pro-Preview²	74.24 ± 3.03¹	45.61 ± 0.88¹	35.55 ± 2.22¹	52.0 ± 0.77¹	12	40 vCPUs, 240 GB RAM, 1× H100 (96 GB)	No (based on paper/tech report)	No (based on paper/tech report)	PiML @ AutoML’25	X	FractalAIResearchLabs/PiEvolve	2026-01-05	✓
Leeroo	Gemini-3-Pro-Preview²	68.18 ± 2.62¹	44.74 ± 1.52¹	40.00 ± 0.00¹	50.67 ± 1.33¹	24	24 vCPUs, 150 GB RAM, 1× H100	Yes (based on both paper and source code)	Yes (based on both paper and source code)	arXiv:2601.21526	Leeroo-AI/kapso	—	2025-12-07	✓
Thesis	gpt-5-codex	65.15 ± 1.52	45.61 ± 7.18	31.11 ± 2.22	48.44 ± 3.64	24	24 vCPUs, 170 GB RAM, 1× H100	Unable to verify (no source code or paper)	Unable to verify (no source code or paper)	X	X	X	2025-11-10	✓
CAIR MLE-STAR-Pro-1.5	Gemini-2.5-Pro	68.18 ± 2.62	34.21 ± 1.52	33.33 ± 0.00	44.00 ± 1.33	24	24 vCPUs, 220 GB RAM, 2× A100-40 GB	Yes (based on paper/tech report)	No (based on paper/tech report)	arXiv:2506.15692	X	X	2025-11-25	✓
Famou-Agent	Gemini-2.5-Pro	62.12 ± 1.52	36.84 ± 1.52	33.33 ± 0.00	43.56 ± 0.89	24	64 vCPUs, 500 GB RAM, 1× A800 (24 GB)	Yes (based on paper/tech report)	Yes (based on source code)	arXiv:2510.26144	X	baidubce/FM-Agent	2025-10-10	✓
Operand ensemble	gpt-5 (low verbosity/effort)³	63.64 ± 0.00	33.33 ± 0.88¹	20.00 ± 0.00¹	39.56 ± 0.44¹	24	36 vCPUs, 440 GB RAM, 1× A10	No (based on both paper and source code)	No (based on both paper and source code)	arXiv:2510.11694	X	ramgorthi04/OperandLinear-MLE-Bench	2025-10-06	✓
CAIR MLE-STAR-Pro-1.0	Gemini-2.5-Pro	66.67 ± 1.52	25.44 ± 0.88	31.11 ± 2.22	38.67 ± 0.77	12	24 vCPUs, 220 GB RAM, 1× V100	Yes (based on paper/tech report)	No (based on paper/tech report)	arXiv:2506.15692	X	X	2025-11-03	✓
InternAgent	deepseek-r1	62.12 ± 3.03	26.32 ± 2.63	24.44 ± 2.22	36.44 ± 1.18	12	32 vCPUs, 230 GB RAM, 1× A800	Yes (based on both paper and source code)	No (based on source code)	arXiv:2505.16938	Alpha-Innovator/InternAgent	—	2025-09-12	✓
R&D-Agent	gpt-5	68.18 ± 2.62	21.05 ± 1.52	22.22 ± 2.22	35.11 ± 0.44	12	12 vCPUs, 220 GB RAM, 1× V100	Yes (based on paper/tech report)	No (based on source code)	arXiv:2505.14738	microsoft/RD-Agent	—	2025-09-26	✓
Neo multi-agent	undisclosed	48.48 ± 1.52	29.82 ± 2.32	24.44 ± 2.22	34.22 ± 0.89	36	24 vCPUs, 144 GB RAM, 1× A100	Unable to verify (no source code or paper)	Unable to verify (no source code or paper)	X	X	X	2025-07-28	✓
AIRA-dojo	o3	55.00 ± 1.47	21.97 ± 1.17	21.67 ± 1.07	31.60 ± 0.82	24	24 vCPUs, 120 GB RAM, 1× H200	No (based on both paper and source code)	No (based on both paper and source code)	arXiv:2507.02554	facebookresearch/aira-dojo	—	2025-05-15	✓
R&D-Agent	o3 + GPT-4.1	51.52 ± 4.01	19.30 ± 3.16	26.67 ± 0.00	30.22 ± 0.89	24	12 vCPUs, 220 GB RAM, 1× V100	Yes (based on paper/tech report)	No (based on source code)	arXiv:2505.14738	microsoft/RD-Agent	—	2025-08-15	✓
ML-Master	deepseek-r1	48.48 ± 1.52	20.18 ± 2.32	24.44 ± 2.22	29.33 ± 0.77	12	36 vCPUs, 512 GB RAM, 1× A100	Yes (based on paper/tech report)	No (based on source code)	arXiv:2506.16499	zeroxleo/ML-Master	—	2025-06-17	✓
R&D-Agent	o1-preview	48.18 ± 1.11	8.95 ± 1.05	18.67 ± 1.33	22.40 ± 0.50	24	12 vCPUs, 220 GB RAM, 1× V100	Yes (based on paper/tech report)	No (based on source code)	arXiv:2505.14738	microsoft/RD-Agent	—	2025-05-14	✓
AIDE	o1-preview	35.91 ± 1.86	8.45 ± 0.43	11.67 ± 1.27	17.12 ± 0.61	24	36 vCPUs, 440 GB RAM, 4095 GiB SSD, 1× A10	No (based on both paper and source code)	No (based on both paper and source code)	arXiv:2502.13138	WecoAI/aideml	—	2024-10-08	✓
AIDE	gpt-4o-2024-08-06	18.55 ± 1.26	3.06 ± 0.33	8.15 ± 0.84	8.63 ± 0.54	24	36 vCPUs, 440 GB RAM, 4095 GiB SSD, 1× A10	No (based on both paper and source code)	No (based on both paper and source code)	arXiv:2502.13138	WecoAI/aideml	—	2024-10-08	✓
AIDE	claude-3-5-sonnet-20240620	19.70 ± 1.52	2.63 ± 1.52	2.22 ± 2.22	7.56 ± 1.60	24	36 vCPUs, 440 GB RAM, 4095 GiB SSD, 1× A10	No (based on both paper and source code)	No (based on both paper and source code)	arXiv:2502.13138	WecoAI/aideml	—	2024-10-08	✓
OpenHands	gpt-4o-2024-08-06	12.12 ± 1.52	1.75 ± 0.88	2.22 ± 2.22	4.89 ± 0.44	24	36 vCPUs, 440 GB RAM, 4095 GiB SSD, 1× A10	No (based on both paper and source code)	No (based on source code)	arXiv:2407.16741	All-Hands-AI/OpenHands	—	2024-10-08	✓
AIDE	llama-3.1-405b-instruct	10.23 ± 1.14	0.66 ± 0.66	0.00 ± 0.00	3.33 ± 0.38	24	36 vCPUs, 440 GB RAM, 4095 GiB SSD, 1× A10	No (based on both paper and source code)	No (based on both paper and source code)	arXiv:2502.13138	WecoAI/aideml	—	2024-10-08	✓
MLAB	gpt-4o-2024-08-06	4.55 ± 0.86	0.00 ± 0.00	0.00 ± 0.00	1.60 ± 0.27	24	36 vCPUs, 440 GB RAM, 4095 GiB SSD, 1× A10	No (based on paper/tech report)	No (based on both paper and source code)	arXiv:2310.03302	openai/mle-bench (agents/)	—	2024-10-08	✓
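
If it helps, the proposed columns could be stored as a typed record so readers can filter on them rather than relying on a second board. The sketch below is one possible shape, not a proposal for the repository's actual data format. The class and field names are my own, and the three sample rows copy values from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LeaderboardEntry:
    agent: str
    llms: str
    all_pct: float
    runtime_hours: int
    hardware: str
    external_knowledge: str
    test_set_feedback: str
    implementation_source: Optional[str] = None  # None where the table shows X


entries = [
    LeaderboardEntry(
        "Famou-Agent 2.0", "Gemini-3-Pro-Preview", 64.44, 24,
        "64 vCPUs, 500 GB RAM, 1× A800 (24 GB)",
        "Yes (based on paper/tech report)", "Yes (based on source code)",
    ),
    LeaderboardEntry(
        "Leeroo", "Gemini-3-Pro-Preview", 50.67, 24,
        "24 vCPUs, 150 GB RAM, 1× H100",
        "Yes (based on both paper and source code)",
        "Yes (based on both paper and source code)", "Leeroo-AI/kapso",
    ),
    LeaderboardEntry(
        "AIRA-dojo", "o3", 31.60, 24,
        "24 vCPUs, 120 GB RAM, 1× H200",
        "No (based on both paper and source code)",
        "No (based on both paper and source code)", "facebookresearch/aira-dojo",
    ),
]


def without_test_feedback(rows):
    """Keep only entries whose test-set feedback column starts with 'No'."""
    return [r for r in rows if r.test_set_feedback.startswith("No")]


names = [r.agent for r in without_test_feedback(entries)]
print(names)  # ['AIRA-dojo']
```

A reader who wants a like-for-like view can then filter on any column without the maintainers having to decide which entries are separated.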

I’d restore a single leaderboard and keep these transparency columns visible as first-class fields next to performance. That gives readers the full picture in one view and avoids the maintenance burden of policing a separation that, as shown above, is virtually impossible to enforce consistently anyway.
Happy to be corrected on any of the entries. Thanks!

¹ Computed by padding incomplete seeds with failing scores.
² The architecture is primarily driven by Gemini-3-Pro-Preview, with a subset of modules using GPT-5 and GPT-5-mini.
³ With some light assistance from an ensemble of models including Gemini-2.5-Pro, Grok-4, and Claude 4.1 Opus, distilled by Gemini-2.5-Pro.
