Refact-Bench is a benchmarking tool designed to evaluate AI models on software engineering tasks using the SWE-Bench framework. It provides a standardized environment for testing model performance on real-world programming challenges extracted from GitHub issues.
Before installing Refact-Bench, ensure you have the following:
- Python 3.7 or higher
- Docker installed and running
- Git
- pip package manager
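A quick way to confirm these prerequisites are in place before you start (these checks are only a convenience, not part of Refact-Bench itself):

```bash
python3 --version    # should report 3.7 or higher
docker info          # fails if the Docker daemon is not running
git --version
pip --version
```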
First, install the required Python packages:
```bash
pip install -e .
```
This will install all dependencies listed in `setup.py`, including the `refact` package.
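As a quick sanity check that the editable install registered the packages (the exact distribution names may differ, so this simply filters the output of `pip list`):

```bash
pip list | grep -i refact
```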
Clone the Refact repository and build the necessary components. To reproduce SWE evaluation results, you need to use the following branches of `refact`:
- https://github.com/smallcloudai/refact/tree/swe-boosted-prompt for SWE-lite
- https://github.com/smallcloudai/refact/tree/swe-boosted-prompt-verified for SWE-verified
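To check out one of these branches at clone time, a sketch using the SWE-lite branch (swap in `swe-boosted-prompt-verified` for SWE-verified; otherwise clone as below and `git checkout` the branch afterwards):

```bash
git clone --branch swe-boosted-prompt https://github.com/smallcloudai/refact.git
```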
```bash
git clone https://github.com/smallcloudai/refact.git
pip install -e ./refact/refact-agent/engine/python_binding_and_cmdline
fakeide compile-static-lsp release
```
This step compiles the Language Server Protocol (LSP) implementation needed for code analysis.
```bash
cd ./refact/refact-agent/engine/
cargo build --release
mkdir -p ./python_binding_and_cmdline/refact/bin/
cp ./target/release/refact-lsp ./python_binding_and_cmdline/refact/bin/refact-lsp
```
This builds the Rust-based LSP binary and places it in the correct location for the Python package to use.
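To confirm the binary ended up where the Python package expects it (same path as in the copy command above, run from the engine directory):

```bash
ls -lh ./python_binding_and_cmdline/refact/bin/refact-lsp
```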
Create the Docker integration configuration directory:
```bash
mkdir -p ~/.config/refact/integrations.d/
```
Then set up the Docker integration configuration (this will overwrite any existing Docker integration config):
```bash
cat > ~/.config/refact/integrations.d/docker.yaml << 'EOF'
label: ref
docker_daemon_address: ''
docker_cli_path: docker
remote_docker: false
ssh_host: ''
ssh_user: root
ssh_port: '22'
ssh_identity_file: ''
available:
  on_your_laptop: true
  when_isolated: true
confirmation:
  ask_user: []
  deny: []
EOF
```
This configuration allows Refact-Bench to use Docker for creating isolated environments for each benchmark task.
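Before running benchmarks, it can help to confirm that Docker itself works and that the config file was written; a hedged sanity check, not a required step:

```bash
docker run --rm hello-world                        # Docker can pull and run containers
cat ~/.config/refact/integrations.d/docker.yaml    # the integration config is in place
```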
To run a benchmark task, use the `fakeide run` command. For example, to run the `swe-verified` tasks using the `claude-3-7-sonnet` model:
```bash
fakeide run --api-key <REFACT-API-KEY> --model claude-3-7-sonnet --docker tasks/swe/verified --experiment my-experiment
```
Replace `<REFACT-API-KEY>` with your Refact API key and `my-experiment` with a name to group your benchmark runs.
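The same pattern works for the other task sets shipped in this repository; for instance, a run against SWE-Bench Lite (only the task path and experiment name change, and the experiment name here is just an example):

```bash
fakeide run --api-key <REFACT-API-KEY> --model claude-3-7-sonnet --docker tasks/swe/lite --experiment my-lite-experiment
```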
To collect the results after running tasks:
```bash
fakeide collect --experiment my-experiment
```
The results of the benchmark will be stored in `./results/`.
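The exact layout of the collected output is not documented here, so the simplest way to see what a run produced is to list the directory:

```bash
ls -R ./results/
```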
If you want to test models on your self-hosted server, specify the `--address-url` parameter with your local address:
```bash
fakeide run --address-url http://localhost:8080 --api-key <API-KEY> --model refact/claude-3-7-sonnet --docker tasks/swe/verified
```
Note: Your server should be started on `0.0.0.0`. A common use case with a remote node is to forward the port over SSH:
```bash
ssh <node-name> -L 0.0.0.0:8008:0.0.0.0:8008
```
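Putting the two together, a sketch of a full self-hosted run. This assumes the server on the remote node listens on port 8008, matching the forwarding example above; adjust the port to whatever your server actually uses:

```bash
# keep the tunnel in the background (-N: no remote command, -f: fork after auth)
ssh -N -f <node-name> -L 0.0.0.0:8008:0.0.0.0:8008
# then point fakeide at the forwarded port
fakeide run --address-url http://localhost:8008 --api-key <API-KEY> --model refact/claude-3-7-sonnet --docker tasks/swe/verified
```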
The `translation.py` script in the `tasks/swe` directory is used to prepare SWE-Bench tasks for evaluation. It converts SWE-Bench datasets into the format required by Refact-Bench:
```bash
cd tasks/swe
python translation.py
```
This script processes the SWE-Bench datasets (Lite, Lite-dev, and Verified) and generates the necessary task files in the respective directories.
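After the script finishes, the generated task files should appear under the dataset directories; a quick spot-check, run from the `tasks/swe` directory as in the commands above:

```bash
ls verified | head
ls lite | head
ls lite-dev | head
```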
The main components of Refact-Bench are:
- `refact_scenarios/`: Core Python package with the implementation of the benchmarking framework
- `tasks/`: Contains the benchmark tasks
  - `swe/`: SWE-Bench related tasks
    - `verified/`: Verified SWE-Bench tasks
    - `lite/`: SWE-Bench Lite tasks
    - `lite-dev/`: Development subset of SWE-Bench Lite
    - `translation.py`: Script to prepare SWE-Bench tasks
- `fakeide-logs/`: Contains logs from benchmark runs
If a run fails or produces unexpected results, check the logs in the `fakeide-logs/` directory.
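The log file naming is not documented here, so a generic way to find the most recent run's logs is to sort by modification time:

```bash
ls -lt fakeide-logs/ | head
```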