Add results for SWE-Bench Lite for Potpie AI by dhirenmathur · Pull Request #397 · SWE-bench/experiments

dhirenmathur · 2025-12-26T13:47:51Z

Hi SWE-bench Team, thanks for taking the time to maintain the benchmark and review this PR! This PR adds Potpie AI's results for SWE-bench Lite.

Overview

Potpie is an open-source AI agent system for the full SDLC (https://github.com/potpie-ai/potpie), built for large, messy codebases. It uses a code knowledge graph plus tool-driven multi-agent orchestration and parallel execution to combine semantic retrieval with bounded search for debugging, code + test generation, root-cause analysis, and documentation. The SWE-bench submission runs as a “custom agent” on the Potpie platform.

For each repository snapshot, Potpie indexes the code into a structured knowledge graph (files, functions, classes, and their relationships). Each node is enriched with generated docstrings and embeddings, stored in Neo4j alongside the graph with a vector index. Agents access this context through tools: vector search for high-recall semantic retrieval, and Cypher for precise, symbol-anchored navigation.
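To make the two access paths concrete, here is a minimal in-memory sketch, not Potpie's actual implementation. The node names, embeddings, the `CALLS` relationship, and the Cypher string are all hypothetical, and plain Python dictionaries stand in for Neo4j and its vector index.

```python
import math

# Hypothetical toy graph; this is NOT Potpie's real schema.
NODES = {
    "utils.parse_date": {"kind": "function", "embedding": [0.9, 0.1, 0.0]},
    "utils.format_date": {"kind": "function", "embedding": [0.8, 0.2, 0.1]},
    "db.Connection": {"kind": "class", "embedding": [0.0, 0.1, 0.9]},
}
# Directed edges as (source, relationship, target).
EDGES = [
    ("utils.format_date", "CALLS", "utils.parse_date"),
    ("db.Connection", "CALLS", "utils.format_date"),
]

# Assumed Cypher equivalent of callers_of() under the hypothetical schema above.
CALLERS_QUERY = "MATCH (caller)-[:CALLS]->(f {name: $name}) RETURN caller.name"


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_embedding, k=2):
    """High-recall path: rank all nodes by embedding similarity, keep the top k."""
    ranked = sorted(
        NODES,
        key=lambda name: cosine(query_embedding, NODES[name]["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def callers_of(symbol):
    """Precise path: follow CALLS edges that point at one known symbol."""
    return [src for src, rel, dst in EDGES if rel == "CALLS" and dst == symbol]
```

In this sketch, `vector_search` casts a wide net over semantically similar nodes, while `callers_of` answers an exact structural question about a symbol that is already known.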

The SWE-bench agent produces minimal, constraint-compliant patches with:

a supervisor that holds requirements, enforces gates, and emits the final diff, and
a single isolated delegate that does the bounded, tool-heavy work (localization, RCA, generalization, and patch drafting).

Results of the local run are as follows:

==================================================
Resolved 189 instances (63.0%)
==================================================
Resolved by Repository

astropy/astropy: 5/6 (83.33%)
django/django: 84/114 (73.68%)
matplotlib/matplotlib: 14/23 (60.87%)
mwaskom/seaborn: 4/4 (100.00%)
pallets/flask: 1/3 (33.33%)
psf/requests: 6/6 (100.00%)
pydata/xarray: 2/5 (40.00%)
pylint-dev/pylint: 3/6 (50.00%)
pytest-dev/pytest: 10/17 (58.82%)
scikit-learn/scikit-learn: 14/23 (60.87%)
sphinx-doc/sphinx: 7/16 (43.75%)
sympy/sympy: 39/77 (50.65%)
==================================================
Resolved by Time

2012: 1/1 (100.00%)
2014: 3/3 (100.00%)
2015: 1/1 (100.00%)
2016: 2/4 (50.00%)
2017: 8/16 (50.00%)
2018: 8/21 (38.10%)
2019: 40/59 (67.80%)
2020: 39/66 (59.09%)
2021: 30/42 (71.43%)
2022: 37/57 (64.91%)
2023: 20/30 (66.67%)
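
The headline figure can be cross-checked against the per-repository breakdown. The snippet below is a small sanity check, not part of the evaluation harness; the `summarize` helper is a hypothetical name, and the counts are copied from the report above.

```python
# (resolved, total) per repository, copied from the report above.
BY_REPO = {
    "astropy/astropy": (5, 6),
    "django/django": (84, 114),
    "matplotlib/matplotlib": (14, 23),
    "mwaskom/seaborn": (4, 4),
    "pallets/flask": (1, 3),
    "psf/requests": (6, 6),
    "pydata/xarray": (2, 5),
    "pylint-dev/pylint": (3, 6),
    "pytest-dev/pytest": (10, 17),
    "scikit-learn/scikit-learn": (14, 23),
    "sphinx-doc/sphinx": (7, 16),
    "sympy/sympy": (39, 77),
}


def summarize(counts):
    """Return (resolved, total, percent resolved) over all repositories."""
    resolved = sum(r for r, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return resolved, total, round(100 * resolved / total, 2)


print(summarize(BY_REPO))
```

The per-repository counts sum to 189 resolved out of 300 instances, which matches the reported 63.0%.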

Important Notes:
General:

We noticed that trajs and logs are now in .gitignore, but the README still requires adding them to each submission. We have therefore uploaded our trajs and logs to Google Drive and linked them in the metadata file as well.
- logs: https://drive.google.com/drive/folders/1GI-dBE-Tm56OvHSjz34cAik3D-1fr3mn?usp=sharing
- trajs: https://drive.google.com/drive/folders/1s5ZHAgJwLqT9CoI0K3iAR9jdrD0VYTw7?usp=sharing

Test issues:

We observed many inconsistencies between our local test run and the sb-cli run, as well as a previously reported bug in the local run. We have captured these issues below.
- See sb-cli run ID: 20251226_potpie by nandan@potpie.ai
sphinx-doc__sphinx-8595: marked unresolved by the local evaluation script even though the generated diff is the same as the gold solution. It resolves in the sb-cli run but not in the local run.
The following instances pass in the local run but fail in the sb-cli run:

django__django-11049
django__django-11583
django__django-16595
matplotlib__matplotlib-24265
psf__requests-1963
pytest-dev__pytest-5227
pytest-dev__pytest-7168
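
Discrepancies like the ones listed above can be found by diffing the resolved sets of the two runs. This is a minimal sketch that assumes each run's resolved instance IDs are already available as Python sets; `diff_runs` and the `example__shared-1` ID are hypothetical, and the remaining IDs are the ones named above.

```python
def diff_runs(local, remote):
    """Return instance IDs resolved in only one of the two runs."""
    return {
        "local_only": sorted(local - remote),
        "remote_only": sorted(remote - local),
    }


# Placeholder for instances resolved in both runs (hypothetical ID).
shared = {"example__shared-1"}

local_resolved = shared | {
    "django__django-11049",
    "django__django-11583",
    "django__django-16595",
    "matplotlib__matplotlib-24265",
    "psf__requests-1963",
    "pytest-dev__pytest-5227",
    "pytest-dev__pytest-7168",
}
sbcli_resolved = shared | {"sphinx-doc__sphinx-8595"}

report = diff_runs(local_resolved, sbcli_resolved)
print(report["local_only"])
print(report["remote_only"])
```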

Checklist:

[x] Is a pass@1 submission (does not attempt the same task instance more than once)
[x] Does not use SWE-bench test knowledge (PASS_TO_PASS, FAIL_TO_PASS)
[x] Does not use the hints field in SWE-bench
[x] Does not have web-browsing OR has taken steps to prevent lookup of SWE-bench solutions via web-browsing

Thanks again for maintaining the benchmark!

dhirenmathur · 2026-01-22T10:03:41Z

Hi @john-b-yang, thank you and the team for your work maintaining SWE-bench. When you get a chance, could you please take a look at this PR? We sent an invite to the repo as well; let us know if you need anything else from our side. Thanks!

john-b-yang · 2026-01-26T02:40:52Z

We're not accepting non-academic / industry lab submissions, closing. Feel free to market via your own channels.

dhirenmathur · 2026-01-27T14:54:14Z

Hi @john-b-yang, I did read the README and didn't see any mention of Lite, only of Verified. Do Full and Lite also fall under the Verified charter?

submission: potpie

31fa8eb

dhirenmathur closed this Dec 26, 2025

nndn added 3 commits December 26, 2025 23:25

Update metadata

5116b6d

Update author emails

353f621

Update README

0b9e59f

dhirenmathur changed the title ~~submission: potpie~~ Add results for SWE-Bench Lite for Potpie AI Dec 26, 2025

add report link

1404221

dhirenmathur reopened this Dec 27, 2025

john-b-yang closed this Jan 26, 2026

dhirenmathur wants to merge 5 commits into SWE-bench:main from potpie-ai:submission/potpie
