ZhikunCode · An Open-Source Multi-Agent AI Coding System for SWE-bench

We present ZhikunCode, an open-source, self-hosted AI coding system, and report its submission to SWE-bench Lite (300 instances, pass@1). ZhikunCode is built around three engineering principles: a strictly bounded toolset for evaluation, a compression cascade that keeps long sessions within the model context window, and a deterministic Agent-Loop that resists premature termination.

For this submission the agent is restricted to a closed set of six tools (Read, Edit, Write, Bash, Grep, Glob) with no internet access and no sub-agent delegation. We employ a single backbone model, qwen3.6-max-preview, with a 1200-second per-instance timeout, a hard ceiling of 80 reasoning turns, and two parallel workers. The harness produced a valid non-empty patch for 280 of 300 instances (93.3% patch generation rate). The official resolve rate will be reported in §5 once the official SWE-bench Lite evaluation completes.

§ 1 Introduction

Software engineering benchmarks have evolved from isolated function-completion tasks to repository-scale bug fixing. SWE-bench [1] is the most widely adopted yardstick of this generation: it asks an autonomous agent to read a real GitHub issue, locate the offending code in a multi-file Python repository, apply a patch, and have that patch pass the original project's hidden test suite. SWE-bench Lite further restricts the dataset to 300 instances tractable for systematic ablation while remaining representative of real-world Python projects.

This report describes the engineering decisions behind the ZhikunCode submission. ZhikunCode is not a benchmark-specific harness; it is a general-purpose coding agent with a Java orchestration backend, a React browser UI, and a Python analysis service, deployable via a single docker compose up. The same Agent-Loop, sandbox, and context-management machinery that power day-to-day developer use are exercised by the SWE-bench harness — the only deltas are (a) a restricted toolset and (b) a phase-disciplined system prompt.

We make three contributions:

An Agent-Loop with explicit phase transitions (ANALYZE → LOCATE → FIX → VERIFY) that prevents the model from terminating prematurely after the first hypothesis.
A five-layer context compression cascade with three-phase 413-recovery, allowing 80-turn sessions to remain inside a 128K-token budget without losing the failing-test signal.
A self-correction loop that turns compile and test failures into structured re-prompting (capped at three retries), pushing the patch-generation rate to 93.3% on SWE-bench Lite.

§ 2 System Architecture

2.1 Three-Tier Deployment

ZhikunCode is delivered as three cooperating processes packaged in a single Docker image: a Java 21 / Spring Boot 3.4 backend (:8080) that owns the Agent-Loop and tool registry; a React 18 + TypeScript frontend (:5173 in dev) for browser-based control; and a FastAPI / Python 3.11+ service (:8000) for AST and call-graph analysis. The SWE-bench harness is a thin Python script that drives the backend's REST API (/api/query) and writes the official all_preds.jsonl format.

2.2 The Agent-Loop: an 8-Step Query Cycle

The core execution engine, QueryEngine, drives every Agent decision through eight ordered phases per turn. The cycle is the same whether the agent is editing a TODO app from a browser or fixing a Django regression on SWE-bench:

1 Compression → 2 Session → 3 API Call → 4 Response → 5 Tool Result → 6 Termination → 7 Summary → 8 State Update

Step 1 invokes the compression cascade (§2.4). Step 3 wraps the LLM API call with a circuit breaker, adaptive retry with exponential backoff, and a model-tier downgrade chain; this is where 413 (payload too large) errors trigger the three-phase recovery described below. Step 6 evaluates termination on six dimensions — including a guard against the well-known premature end-turn failure mode where models declare victory after the first edit without verification.
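The retry behaviour in Step 3 can be illustrated with a minimal Python sketch of capped exponential backoff. This is not the backend's implementation (which is Java); the names `TransientError` and `call_with_backoff`, the attempt count, and the delay values are all assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 429, 5xx)."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,      # assumed value, not from the report
    base_delay: float = 1.0,    # assumed value, in seconds
    max_delay: float = 30.0,    # assumed cap, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient failures with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                # Retries exhausted: re-raise so the caller can react,
                # for example by moving to the next model tier.
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries issued by parallel workers.
            sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waiting; a circuit breaker would sit one level above this helper and stop calling it after repeated exhaustion.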

2.3 Closed Tool Set for SWE-bench

Although ZhikunCode ships with 48 built-in tools plus dynamic MCP extensions, the SWE-bench configuration restricts the Agent to exactly six. Any tool name outside this allowlist is rejected at the registry level and never reaches the LLM:

Tool	Signature	Purpose
`Read`	`Read(path, offset?, limit?)`	Read a file by line range.
`Edit`	`Edit(path, old_text, new_text)`	String-anchored in-place edit.
`Write`	`Write(path, content)`	Create or overwrite a file.
`Bash`	`Bash(command)`	Sandboxed shell execution; primary route for running failing tests.
`Grep`	`Grep(pattern, path)`	Regex-based source search.
`Glob`	`Glob(pattern)`	Filename pattern discovery.

The system prompt explicitly enumerates which tools do not exist (Agent, SubAgent, Delegate, WebSearch, Browse, …) so the model does not waste turns calling unavailable orchestration primitives — a known failure mode when transplanting general-purpose system prompts into a single-agent harness.
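Registry-level rejection can be sketched as follows. The constant name `ALLOWED_TOOLS` comes from the configuration table in §3; the `ToolRegistry` class, its methods, and the `ToolNotAllowed` exception are hypothetical and only illustrate the idea that a tool outside the allowlist is neither advertised nor dispatched.

```python
ALLOWED_TOOLS = frozenset({"Read", "Edit", "Write", "Bash", "Grep", "Glob"})


class ToolNotAllowed(Exception):
    """Raised when a tool name falls outside the closed set."""


class ToolRegistry:
    """Hypothetical registry that enforces an allowlist on both ends."""

    def __init__(self, allowed=ALLOWED_TOOLS):
        self._allowed = frozenset(allowed)
        self._handlers = {}

    def register(self, name, handler):
        # Reject at registration time so the tool is never advertised.
        if name not in self._allowed:
            raise ToolNotAllowed(name)
        self._handlers[name] = handler

    def advertised(self):
        """Tool names that would be listed in the model's tool schema."""
        return sorted(self._handlers)

    def dispatch(self, name, **kwargs):
        # Reject again at call time in case the model invents a tool name.
        if name not in self._handlers:
            raise ToolNotAllowed(name)
        return self._handlers[name](**kwargs)
```

Checking at both registration and dispatch means a hallucinated call such as `WebSearch` produces a structured error rather than an unhandled lookup failure.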

2.4 Five-Layer Context Compression Cascade

An 80-turn debugging session can easily emit several hundred kilobytes of tool output (test logs, file dumps, grep results). The ContextCascade manager keeps the prompt within the model's context window via five increasingly aggressive layers, applied in order until the budget is met:

Snip — trim long single tool outputs to head/tail summaries.
MicroCompact — drop the body of old tool results, retaining only the call signature.
AutoCompact — summarize multi-turn segments into structured notes.
CollapseDrain — incremental collapse of stable history every ten turns.
ReactiveCompact — last-resort summarization of the entire conversation.

If the upstream provider still returns HTTP 413, a three-phase recovery fires automatically: (i) aggressive compression via CollapseDrain, (ii) a forced ReactiveCompact, and (iii) media-stripping for any embedded binary payloads. This is the difference between a session that survives a 60K-token grep dump and one that simply fails.
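The "apply layers in order until the budget is met" rule can be sketched in Python. Only the two cheapest layers are shown, and everything here is an assumption made for illustration: the `Message` shape, the four-characters-per-token estimate, and the head/tail sizes are not taken from the ContextCascade source.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Message:
    role: str            # "user", "assistant", or "tool"
    content: str
    signature: str = ""  # e.g. "Bash(pytest -x)" for tool results


def estimate_tokens(messages: List[Message]) -> int:
    # Crude heuristic (about 4 characters per token); an assumption.
    return sum(len(m.content) for m in messages) // 4


def snip(messages: List[Message], head: int = 200, tail: int = 200) -> List[Message]:
    """Layer 1: keep only the head and tail of long tool outputs."""
    out = []
    for m in messages:
        if m.role == "tool" and len(m.content) > head + tail:
            omitted = len(m.content) - head - tail
            body = (m.content[:head]
                    + f"\n[... {omitted} chars snipped ...]\n"
                    + m.content[len(m.content) - tail:])
            m = replace(m, content=body)
        out.append(m)
    return out


def micro_compact(messages: List[Message], keep_recent: int = 4) -> List[Message]:
    """Layer 2: drop bodies of old tool results, keep the call signature."""
    cutoff = len(messages) - keep_recent
    return [
        replace(m, content=f"[result dropped] {m.signature}")
        if m.role == "tool" and i < cutoff else m
        for i, m in enumerate(messages)
    ]


def run_cascade(messages: List[Message], budget: int,
                layers: List[Callable[[List[Message]], List[Message]]]) -> List[Message]:
    """Apply layers in order, stopping as soon as the budget is met."""
    for layer in layers:
        if estimate_tokens(messages) <= budget:
            break
        messages = layer(messages)
    return messages
```

Because each layer is a plain function from message list to message list, the more aggressive layers (summarization, collapse) could be appended to the `layers` list without changing `run_cascade`.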

2.5 Eight-Layer Bash Sandbox

Bash is the most consequential of the six tools — it runs pytest, git apply, and the project-specific reproduction scripts. Every command passes through eight layers of inspection: command parsing, three-tier blocklist filtering (with ReDoS-safe regex), path-traversal detection, a fourteen-step permission pipeline, sandboxed execution, argument sanitization, output validation, and audit logging. SWE-bench instances run with permission mode set to bypass (no human-in-the-loop), but the sandbox layers remain active and are responsible for converting destructive command attempts into structured errors that the agent can recover from rather than into harness crashes.
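To make "structured errors instead of harness crashes" concrete, here is a deliberately small Python sketch covering only three of the eight layers (parsing, a blocklist, and path-traversal detection). The function name `inspect_command`, the blocklist entries, the `/workspace` root, and the shape of the returned dictionary are all hypothetical; the real sandbox also handles pipelines, redirections, and permissions, which this sketch ignores.

```python
import posixpath
import shlex

# Illustrative entries only; the real blocklist is three-tiered.
BLOCKED_COMMANDS = {"shutdown", "reboot", "mkfs"}


def inspect_command(command: str, workspace: str = "/workspace") -> dict:
    """Return {"ok": True, ...} or a structured error naming the failing layer."""
    # Layer: command parsing.
    try:
        tokens = shlex.split(command)
    except ValueError as exc:
        return {"ok": False, "layer": "parse", "error": str(exc)}
    if not tokens:
        return {"ok": False, "layer": "parse", "error": "empty command"}

    # Layer: blocklist filtering on the executable name.
    if tokens[0] in BLOCKED_COMMANDS:
        return {"ok": False, "layer": "blocklist",
                "error": f"blocked command: {tokens[0]}"}

    # Layer: path-traversal detection. Every non-flag argument is resolved
    # relative to the workspace and must stay inside it.
    for tok in tokens[1:]:
        if tok.startswith("-"):
            continue
        resolved = posixpath.normpath(posixpath.join(workspace, tok))
        if resolved != workspace and not resolved.startswith(workspace + "/"):
            return {"ok": False, "layer": "path",
                    "error": f"argument escapes workspace: {tok}"}

    return {"ok": True, "tokens": tokens}
```

Returning a dictionary rather than raising lets the harness forward the error to the agent as an ordinary tool result, which is the recovery path the paragraph above describes.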

§ 3 Evaluation Configuration

Configuration is declared at the harness entry point swe-bench/swe_bench.py and serialized into the per-instance API request. The exact values used for this submission are:

Parameter	Value	Source / Default
Backbone model	`qwen3.6-max-preview`	`--model` / harness default
Per-instance timeout	1200 s	`--timeout`
Maximum reasoning turns	80	`--max-turns`
Parallel workers	2	`--workers`
Allowed tools	`Read, Edit, Write, Bash, Grep, Glob`	`ALLOWED_TOOLS` (closed set)
API endpoint	`/api/query` on `127.0.0.1:8080`	local backend
Max retries (transport)	3, 5 s back-off	`MAX_RETRIES`
Dataset	SWE-bench Lite (300 instances)	JSONL
Decoding	pass@1 (single sample)	—
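A command-line entry point accepting the flags in the table could look like the sketch below. The flag names match those used in this report; the defaults are an assumption (they simply mirror the submission values), and the actual parser in swe-bench/swe_bench.py may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags listed in the table above."""
    parser = argparse.ArgumentParser(description="SWE-bench Lite harness (sketch)")
    parser.add_argument("--dataset", required=True,
                        help="path to the SWE-bench Lite instances file")
    # Defaults below are assumed to equal the submission values.
    parser.add_argument("--model", default="qwen3.6-max-preview")
    parser.add_argument("--output", default="./swe-bench/results")
    parser.add_argument("--timeout", type=int, default=1200,
                        help="per-instance timeout in seconds")
    parser.add_argument("--max-turns", type=int, default=80,
                        help="hard ceiling on reasoning turns")
    parser.add_argument("--workers", type=int, default=2,
                        help="number of parallel workers")
    return parser
```

For example, `build_parser().parse_args(["--dataset", "./swe-bench-lite.json"])` yields a namespace whose `max_turns` attribute is 80 under these assumed defaults.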

3.1 Mandatory Phase Transitions

The system prompt enforces a four-phase workflow with budgeted turn counts. Each phase has explicit entry conditions, exit conditions, and a hard turn cap that pushes the agent forward when it would otherwise loop:

ANALYZE (turns 1–3) → LOCATE (turns 4–12) → FIX (turns 13–40) → VERIFY (turns 41–max_turns)

Figure 1. Mandatory phase transitions inside the SWE-bench system prompt. The phases are not soft guidance — exceeding a phase's turn budget is treated by the prompt as a hard transition trigger.

ANALYZE reads the issue, runs the failing test once via Bash, and identifies the error type. LOCATE uses Grep and Read to pin down the exact function. FIX applies the minimal patch via Edit or Write. VERIFY re-runs the failing test, then runs adjacent tests for regression. The phase boundaries enumerated above match the runtime telemetry emitted by get_turn_progress() in swe-bench/swe_bench.py exactly: ANALYZE for turn ≤ 3, LOCATE for 4 ≤ turn ≤ 12, FIX for 13 ≤ turn ≤ 40, and VERIFY for any turn beyond 40 up to --max-turns. The phase boundaries are also where the self-correction loop (§4.2) inserts its diagnostic prompts.
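The turn-to-phase mapping stated above is simple enough to write down directly. The sketch below uses the boundaries from this section; the function name `phase_for_turn` is hypothetical and is not the harness's get_turn_progress().

```python
# (phase name, last turn of that phase), taken from the boundaries above.
PHASE_UPPER_BOUNDS = [("ANALYZE", 3), ("LOCATE", 12), ("FIX", 40)]


def phase_for_turn(turn: int, max_turns: int = 80) -> str:
    """Map a 1-based turn number to its mandatory phase."""
    if not 1 <= turn <= max_turns:
        raise ValueError(f"turn must be in 1..{max_turns}, got {turn}")
    for name, last_turn in PHASE_UPPER_BOUNDS:
        if turn <= last_turn:
            return name
    # Everything after turn 40, up to max_turns, is verification.
    return "VERIFY"
```

Keeping the boundaries in one table makes it easy to check that the prompt text and the telemetry agree on where each phase ends.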

3.2 Compliance & Transparency

The submission is engineered to comply with SWE-bench's evaluation contract for autonomous, oracle-free agents. The four declarations below are mechanically enforced by the harness in swe-bench/swe_bench.py and can be verified directly from the source:

No use of hints_text. The dataset loader (load_dataset) reads only four fields from each JSONL record — instance_id, repo, base_commit, and problem_statement. The hints_text field present in the upstream SWE-bench dataset is never accessed, parsed, or forwarded to the model under any code path.
No use of PASS_TO_PASS / FAIL_TO_PASS as a solving oracle. These fields are part of the evaluation harness contract and are not loaded by our solver. The model must independently locate the relevant tests by reading the repository (typically through Grep over tests/). The strings FAIL_TO_PASS and PASS_TO_PASS appear in the system prompt only as conceptual labels for the agent's own test discovery and regression checking, never as data injected from the dataset.
Oracle-free, no-human-in-the-loop. Every instance is solved by a single autonomous agent loop with no external oracle, no gold patch leakage, no test-set inspection, no human review, and no manual intervention. Permission mode is set to bypass; tool calls are executed deterministically by the harness without any operator confirmation step. Internet access is disabled at the tool surface (no WebFetch, WebSearch, or sub-agent delegation in the closed six-tool set).
pass@1 evaluation protocol. Each of the 300 instances is run exactly once. The single resulting unified diff is recorded as model_patch in all_preds.jsonl and submitted as the final answer. We perform no best-of-N resampling, no majority voting, no rerun-on-failure, and no patch ensembling. Resumption logic skips already-completed instances on retry but never re-evaluates a successfully-recorded prediction.

The combination above corresponds to the SWE-bench Lite leaderboard's checked category for autonomous, oracle-free, single-shot submissions.
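The field-level isolation described in the first two declarations can be sketched as a loader that copies only the four permitted keys. This is an illustration, not the harness's load_dataset: the names `SOLVER_FIELDS`, `load_instances`, and `prediction_line` are hypothetical, and the `model_name_or_path` key reflects the public SWE-bench prediction format rather than anything stated in this report.

```python
import json
from typing import Iterable, Iterator

# The only fields the solver is allowed to see.
SOLVER_FIELDS = ("instance_id", "repo", "base_commit", "problem_statement")


def load_instances(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per JSONL record, keeping only the solver fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Copy by explicit key so hints_text, FAIL_TO_PASS, and PASS_TO_PASS
        # can never be forwarded, even if present in the record.
        yield {field: record[field] for field in SOLVER_FIELDS}


def prediction_line(instance_id: str, model_name: str, patch: str) -> str:
    """Serialize one prediction as a single JSONL line."""
    return json.dumps({
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": patch,
    })
```

Building the output dictionary from an explicit allowlist, instead of deleting forbidden keys, means a new upstream field is excluded by default.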

§ 4 Key Technical Innovations

4.1 Compression Cascade with 413 Three-Phase Recovery

Long-tail SWE-bench instances are dominated by repositories with deep test stacks (Django, sympy, scikit-learn). A single failing-test invocation can return tens of thousands of tokens. The cascade described in §2.4 is the difference between an agent that aborts at turn 30 with context_length_exceeded and one that completes 80 turns. The 413 recovery is engineered to be idempotent and signal-preserving: even after Phase 3 media stripping, the failing-test stack trace and the most recent edit are pinned in place by the cascade's relevance scorer. We measured a 0% session-fatal-413 rate across the 300-instance run.
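The control flow of the three-phase recovery can be sketched as a loop that applies the next recovery phase each time the provider rejects the payload. The exception `PayloadTooLarge`, the function `send_with_413_recovery`, and the representation of messages as plain strings are assumptions for illustration; the phases themselves are passed in as callables and are not implemented here.

```python
from typing import Callable, List, Sequence


class PayloadTooLarge(Exception):
    """Hypothetical stand-in for an HTTP 413 response from the provider."""


def send_with_413_recovery(
    send: Callable[[List[str]], str],
    messages: List[str],
    recovery_phases: Sequence[Callable[[List[str]], List[str]]],
) -> str:
    """Send messages; on each 413, apply the next recovery phase and retry."""
    phases = iter(recovery_phases)
    while True:
        try:
            return send(messages)
        except PayloadTooLarge:
            phase = next(phases, None)
            if phase is None:
                # All phases used and the payload is still too large.
                raise
            messages = phase(messages)
```

In the setting described above, `recovery_phases` would hold, in order, the CollapseDrain step, the forced ReactiveCompact, and the media-stripping step.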

4.2 Self-Correction Loop

When the agent emits an Edit that fails to apply (anchor mismatch), or a Bash command that exits non-zero with a structured Python traceback, ZhikunCode auto-injects a diagnostic system message containing (a) the failed call, (b) the parsed error type, and (c) one of three suggested next actions: re-read the file, switch to Write, or escalate to a broader Grep. The loop is hard-capped at three retries per failure site; beyond that, the agent is forced into the next phase. This avoids the common failure mode where the model produces three dozen near-identical broken edits.
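The per-site retry cap can be sketched with a counter keyed by failure site. The class `SelfCorrection`, the notion of a string "site" key, the message layout, and the rule that the n-th failure receives the n-th suggestion are all assumptions; the report states only the three suggested actions and the cap of three.

```python
from collections import Counter
from typing import Optional

# The three suggested next actions named in the text, in an assumed order.
NEXT_ACTIONS = (
    "Re-read the file before retrying.",
    "Switch to Write and replace the file contents.",
    "Escalate to a broader Grep.",
)


class SelfCorrection:
    """Hypothetical tracker that caps retries per failure site."""

    def __init__(self, max_retries: int = 3):
        self.max_retries = max_retries
        self._failures = Counter()

    def on_failure(self, site: str, failed_call: str, error_type: str) -> Optional[str]:
        """Return a diagnostic message, or None once the cap is exceeded."""
        self._failures[site] += 1
        n = self._failures[site]
        if n > self.max_retries:
            # Cap reached: the caller should force the next phase.
            return None
        action = NEXT_ACTIONS[min(n, len(NEXT_ACTIONS)) - 1]
        return (f"[self-correction {n}/{self.max_retries}] "
                f"call: {failed_call} | error: {error_type} | next: {action}")
```

Returning `None` rather than raising keeps the decision to change phase with the caller, which already owns the phase state.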

4.3 Test-First Approach

The system prompt requires the agent to read the failing test before reading the source. The rationale is empirical: SWE-bench issue descriptions are sometimes ambiguous about the expected behavior, but the failing test is always a precise oracle. The ANALYZE phase prompt explicitly says: "Run the failing test to see the error (use Bash). Identify the error type and likely location." This single ordering decision measurably reduces fix-then-revert oscillation in our internal sweeps.

4.4 Closed-Set Tool Discipline

The system prompt enumerates not only the six available tools but explicitly the tools that do not exist. We observed that strong instruction-tuned models are otherwise prone to "hallucinating capability" — invoking WebSearch, Agent, or Delegate that the harness does not expose, then burning a turn on the silent failure. Negative enumeration eliminates this class of error.

§ 5 Results

Resolve rate: XX/300 · Patch generation: 280/300 · Valid-patch yield: 93.3% · Max turns: 80

PLACEHOLDER: XX/300 (XX%) — Final official resolve rate to be filled in after the SWE-bench Lite ECS evaluation completes.

5.1 Patch Generation

For 280 of 300 instances (93.3%) the agent produced a non-empty unified diff that applied cleanly to the base commit. The remaining 20 instances split between two failure modes: (i) the agent exhausted 80 turns inside the LOCATE phase without converging on a single file, and (ii) the agent emitted a patch that the harness rejected at the git apply stage. Both failure modes are tractable with longer turn budgets or stronger localization heuristics, and are the focus of our ongoing work.

5.2 Resolve Rate

The official SWE-bench Lite resolve rate is the percentage of instances whose generated patch passes both the originally failing tests (FAIL_TO_PASS) and the originally passing tests (PASS_TO_PASS) without regression; it is computed by the official evaluation harness on isolated per-instance Docker images. At the time of writing this number is PLACEHOLDER: XX/300 (XX%), pending completion of the official ECS evaluation. We will update this section in place once the official report is available. Because an instance can be resolved only if a valid patch exists for it, the patch generation rate above is an upper bound on the resolve rate.

Reproducibility

All configuration, prompts, harness scripts, and the underlying agent source code are open-source under the MIT License. The exact command used for this submission is:

python swe-bench/swe_bench.py \
    --dataset ./swe-bench-lite.json \
    --model qwen3.6-max-preview \
    --output ./swe-bench/results \
    --timeout 1200 --max-turns 80 --workers 2

§ 6 Limitations & Future Work

Single-agent ceiling. The submission disables ZhikunCode's multi-agent collaboration (Team / Swarm / SubAgent) to comply with SWE-bench's single-agent convention. The same repository, however, demonstrates strong gains from sub-agent delegation on full-stack tasks; lifting that restriction for a future SWE-bench Verified submission is a natural next step.

Single-model dependency. All 300 instances were run on a single backbone (qwen3.6-max-preview). The model-tier downgrade chain was active but rarely triggered. A heterogeneous-model ensemble (e.g., a stronger model for VERIFY) is unexplored.

Localization is the bottleneck. Of the 20 instances without a valid patch, roughly two-thirds spent their turn budget inside LOCATE. Integrating ZhikunCode's existing LSP-based call-hierarchy tool (currently disabled for the closed six-tool set) into a future submission is the most promising single intervention.

No internet access by design. The closed tool set forbids WebFetch and WebSearch, even though some SWE-bench issues reference upstream library documentation. We consider this a feature, not a bug, for benchmark reproducibility — but it does cap the achievable rate on documentation-heavy issues.

Single-developer project. ZhikunCode is, at present, the work of one author. The 660+ files and 110K+ lines of code are written, tested, and maintained by one person; the SWE-bench harness was added in the same fashion. Bus-factor is acknowledged as a real risk for downstream adopters.

§ 7 Conclusion

ZhikunCode demonstrates that a general-purpose, self-hosted, browser-controlled coding agent can be made benchmark-ready by (a) restricting its tool surface to a closed six-tool set, (b) enforcing explicit phase transitions in the system prompt, and (c) keeping the conversation alive through a five-layer compression cascade with three-phase 413 recovery. The 93.3% patch generation rate on SWE-bench Lite — achieved with a single open-weight Chinese backbone and a 1200-second per-instance budget — suggests that the engineering envelope around the model matters at least as much as the model itself.

All artifacts — agent source, harness, system prompts, evaluation scripts — are open-source. We hope they are useful both as a baseline and as a substrate for further experimentation.

References

[1] Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770, 2024.
[2] SWE-bench Lite official site & leaderboard. swebench.com/lite.html.
[3] ZhikunCode source repository. github.com/zhikunqingtao/zhikuncode.
[4] ZhikunCode full system architecture diagram. ZhikunCode-Architecture.html.
[5] SWE-bench harness in this submission: swe-bench/swe_bench.py (in the source repository).
[6] Alibaba Cloud DashScope · Qwen 3.6 Max Preview model card. help.aliyun.com/zh/dashscope.