Add results for SWE-Bench Lite for Potpie AI #397
Closed
dhirenmathur wants to merge 5 commits into
Author
Hi @john-b-yang, thanks to you and the team for maintaining SWE-bench. When you get a chance, could you please take a look at this PR? We sent an invite to the repo as well; let us know if you need anything else from our side. Thanks!
Member
We're not accepting non-academic / industry lab submissions, closing. Feel free to market via your own channels.
Author
Hi @john-b-yang, I did read the README and didn't see any mention of Lite, only of Verified. Do Full and Lite also fall under the Verified charter?
Hi SWE-bench Team, thanks for taking the time to maintain the benchmark and review this PR! This PR adds Potpie AI's results for SWE-bench Lite.
Overview
Potpie is an open-source AI agent system for the full SDLC (https://github.com/potpie-ai/potpie), built for large, messy codebases. It uses a code knowledge graph plus tool-driven multi-agent orchestration and parallel execution to combine semantic retrieval with bounded search for debugging, code + test generation, root-cause analysis, and documentation. The SWE-bench submission runs as a “custom agent” on the Potpie platform.
For each repository snapshot, Potpie indexes the code into a structured knowledge graph (files, functions, classes, and their relationships). Each node is enriched with generated docstrings and embeddings, stored in Neo4j alongside the graph with a vector index. Agents access this context through tools: vector search for high-recall semantic retrieval, and Cypher for precise, symbol-anchored navigation.
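The two access paths described above (semantic retrieval over embeddings, and precise navigation along symbol relationships) can be illustrated with a small in-memory sketch. This is not Potpie's implementation: it uses plain Python instead of Neo4j and Cypher, and the `Node` and `CodeGraph` classes, the relation names `CONTAINS` and `CALLS`, and the three-dimensional embeddings are all hypothetical stand-ins chosen for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str                 # symbol or file name (hypothetical examples below)
    kind: str                 # "file", "function", or "class"
    docstring: str            # generated description attached to the node
    embedding: list[float]    # toy vector; a real system would use a model
    edges: dict[str, list[str]] = field(default_factory=dict)  # relation -> targets


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CodeGraph:
    """Toy stand-in for a graph store with a vector index."""

    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def add(self, node: Node) -> None:
        self.nodes[node.name] = node

    def link(self, src: str, relation: str, dst: str) -> None:
        self.nodes[src].edges.setdefault(relation, []).append(dst)

    def vector_search(self, query: list[float], k: int = 2) -> list[str]:
        # High-recall path: rank every node by similarity to the query vector.
        ranked = sorted(
            self.nodes.values(),
            key=lambda n: cosine(query, n.embedding),
            reverse=True,
        )
        return [n.name for n in ranked[:k]]

    def neighbors(self, name: str, relation: str) -> list[str]:
        # Precise path: follow one named relation from a known symbol.
        return list(self.nodes[name].edges.get(relation, []))


graph = CodeGraph()
graph.add(Node("config.py", "file", "Configuration helpers.", [0.9, 0.1, 0.0]))
graph.add(Node("parse_config", "function", "Parse a config file.", [0.8, 0.2, 0.1]))
graph.add(Node("render_page", "function", "Render an HTML page.", [0.0, 0.1, 0.9]))
graph.link("config.py", "CONTAINS", "parse_config")
graph.link("render_page", "CALLS", "parse_config")

hits = graph.vector_search([1.0, 0.0, 0.0], k=2)
callees = graph.neighbors("render_page", "CALLS")
```

In this toy setup, `vector_search` plays the role of the vector index and `neighbors` plays the role of a symbol-anchored graph query.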
The SWE-bench agent produces minimal, constraint-compliant patches with:
a supervisor that holds requirements, enforces gates, and emits the final diff, and
a single isolated delegate that does the bounded, tool-heavy work (localization, RCA, generalization, and patch drafting).
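The supervisor/delegate split can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the agent's actual code: the `delegate` function is a hard-coded stand-in for the tool-heavy work, and `max_changed_lines` is a hypothetical example of a gate. Only the standard-library `difflib` module is used to emit the final diff.

```python
import difflib
from typing import Callable, Optional

# A gate inspects the original and candidate source and accepts or rejects.
Gate = Callable[[str, str], bool]


def max_changed_lines(limit: int) -> Gate:
    """Hypothetical gate: reject patches that touch more than `limit` lines."""
    def gate(original: str, candidate: str) -> bool:
        diff = difflib.unified_diff(
            original.splitlines(), candidate.splitlines(), lineterm=""
        )
        # Count added/removed lines, skipping the "---"/"+++" file headers.
        # (A removed line whose content starts with "--" would be skipped too;
        # acceptable for a sketch.)
        changed = [
            line for line in diff
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
        ]
        return len(changed) <= limit
    return gate


def delegate(original: str) -> str:
    """Stand-in for localization, RCA, and patch drafting (hard-coded here)."""
    return original.replace("return a - b", "return a + b")


def supervise(path: str, original: str, gates: list[Gate]) -> Optional[str]:
    """Run the delegate once, enforce every gate, and emit a unified diff."""
    candidate = delegate(original)
    if candidate == original:
        return None  # the delegate produced no change
    if not all(gate(original, candidate) for gate in gates):
        return None  # a gate rejected the draft
    return "\n".join(
        difflib.unified_diff(
            original.splitlines(),
            candidate.splitlines(),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
            lineterm="",
        )
    )


source = "def add(a, b):\n    return a - b\n"
patch = supervise("calc.py", source, [max_changed_lines(4)])
```

Keeping the gates as plain callables means the supervisor never needs to know how the delegate arrived at its draft; it only checks the result.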
Results of the local run are as follows:
Important Notes:
General:
Test issues:
We observed many inconsistencies between our local test run and the sb-cli run, as well as a previously reported bug in the local run. These issues are captured below.
20251226_potpiebynandan@potpie.aisphinx-doc__sphinx-8595: marked unresolved by the script, although the generated diff was observed to match the gold solution. This instance resolves in the sb-cli run but not in the local run.
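A claim that a generated diff matches the gold solution can be checked mechanically. The sketch below is one possible way to do it, not the method used for this submission: `normalize_patch` and `same_patch` are hypothetical helpers that ignore git metadata lines and the optional function context after a hunk header, then compare what remains line by line. The sample patches are invented for illustration.

```python
def normalize_patch(patch: str) -> list[str]:
    """Drop git metadata and hunk-header context so equal edits compare equal."""
    kept = []
    for line in patch.splitlines():
        if line.startswith(("diff --git", "index ")):
            continue  # metadata that differs between tools, not part of the edit
        if line.startswith("@@"):
            # "@@ -1,2 +1,2 @@ def f():" -> "@@ -1,2 +1,2 @@"
            line = "@@".join(line.split("@@")[:2]) + "@@"
        kept.append(line)
    return kept


def same_patch(a: str, b: str) -> bool:
    return normalize_patch(a) == normalize_patch(b)


# Invented example patches: same edit, different surrounding metadata.
gold = (
    "diff --git a/x.py b/x.py\n"
    "index 1111111..2222222 100644\n"
    "--- a/x.py\n"
    "+++ b/x.py\n"
    "@@ -1,2 +1,2 @@ def f():\n"
    "-    return 1\n"
    "+    return 2\n"
)
generated = (
    "--- a/x.py\n"
    "+++ b/x.py\n"
    "@@ -1,2 +1,2 @@\n"
    "-    return 1\n"
    "+    return 2\n"
)
different = generated.replace("return 2", "return 3")

matches_gold = same_patch(gold, generated)
```

Whitespace inside changed lines is deliberately left untouched, so two patches that differ only in indentation are still reported as different.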
The following instances pass in the local run but fail during the sb-cli run
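A discrepancy list of this kind can be derived by comparing the sets of resolved instance IDs from the two runs. The following is a minimal sketch; `compare_runs` is a hypothetical helper, and the instance IDs are placeholders rather than results from this submission.

```python
def compare_runs(local_resolved, remote_resolved):
    """Split instance IDs by which evaluation run marked them resolved."""
    local, remote = set(local_resolved), set(remote_resolved)
    return {
        "local_only": sorted(local - remote),    # pass locally, fail remotely
        "remote_only": sorted(remote - local),   # pass remotely, fail locally
        "both": sorted(local & remote),
    }


# Placeholder IDs for illustration only.
local_ids = ["example__repo-1", "example__repo-2", "example__repo-3"]
remote_ids = ["example__repo-2", "example__repo-3", "example__repo-4"]

report = compare_runs(local_ids, remote_ids)
```

Here `local_only` corresponds to instances that pass in the local run but fail in the sb-cli run, and `remote_only` covers the opposite case.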
Checklist:
[x] Is a pass@1 submission (does not attempt the same task instance more than once)
[x] Does not use SWE-bench test knowledge (PASS_TO_PASS, FAIL_TO_PASS)
[x] Does not use the hints field in SWE-bench
[x] Does not have web-browsing OR has taken steps to prevent lookup of SWE-bench solutions via web-browsing
Thanks again for maintaining the benchmark!