Spec-driven development
This document explains the spec-driven build loop that lives in
tools/spec-loop/. It is how this framework can be
developed and maintained against a written description of what it does,
rather than against whatever happens to be in someone’s head on a given
afternoon.
The idea
The loop is a small instance of the general Ralph technique: run a fresh agent context against a fixed prompt, let it do one well-scoped thing, then run it again with a clean context. The power is in the funnel that feeds it — not “a loop that codes,” but a pipeline from what the product should do to one reviewable change at a time.
Two artefacts carry the state between iterations, so nothing depends on a long-lived context window:
- Specs (tools/spec-loop/specs/) are a faithful, plain-Markdown description of each functional area of the product: what it does, where it lives in the code, the contract it must honour, and its known gaps. The specs are the desired state.
- The implementation plan (tools/spec-loop/IMPLEMENTATION_PLAN.md) is the prioritised list of work items, that is, the gaps between the specs and the code. Each work item is sized to one branch and one PR.
Three phases, four beats, one loop
Phase one is a human (or a planning conversation) writing the specs. Phases two and three are the loop, swapping prompts:
| Beat | Command | What it does | Commits? |
|---|---|---|---|
| plan | loop.sh plan | Compares specs against the code; rewrites the plan as prioritised work items. | No |
| build | loop.sh | Implements the single highest-priority work item on its own branch; validates; commits there. | Yes, on a work-item branch |
| update | loop.sh update | Inverse of plan: finds functionality the code has but the specs don’t, and brings the specs back in sync. | Yes, on spec/sync-specs |
| consolidate | loop.sh consolidate | Shrinks the plan when it grows too long, without dropping planned work. | Yes, the plan only |
Every beat loads the same operational context —
tools/spec-loop/AGENTS.md (repo map,
validation commands, branch and hard-limit rules) layered on top of the
repository-wide AGENTS.md. Each build iteration reads
only the spec and source files relevant to its one work item, so a fresh
context never drowns in the whole tree.
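The beat-to-prompt mapping in the table can be sketched as a small dispatch function. This is a minimal illustration, not a copy of loop.sh: the prompt file names come from the layout of tools/spec-loop/, while the function name prompt_for_beat and the error handling are assumptions, and parsing of the iteration count (as in loop.sh plan 1) is left out.

```shell
#!/bin/sh
# Hypothetical helper: map a beat name to the prompt file that drives it.
# With no argument it falls back to the build beat, mirroring how a bare
# `loop.sh` invocation runs build.
prompt_for_beat() {
    case "${1:-build}" in
        plan)        echo "PROMPT_plan.md" ;;
        build)       echo "PROMPT_build.md" ;;
        update)      echo "PROMPT_update.md" ;;
        consolidate) echo "PROMPT_consolidate.md" ;;
        *)           echo "unknown beat: $1" >&2; return 1 ;;
    esac
}

prompt_for_beat plan   # prints PROMPT_plan.md
prompt_for_beat        # no beat given, so build: prints PROMPT_build.md
```

Keeping the mapping in one place is what lets every beat share the same runner and operational context while only the prompt changes.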
Specs are not RFCs
The framework already has docs/rfcs/ — normative principle
documents that say why and define the trust posture. Those are the
constitution. The specs are the work orders: discrete, concrete,
grounded in the actual code. The loop respects an RFC principle as a
constraint (it never pushes, it stays in the sandbox), but it never reads,
edits, or restates an RFC, and no spec ever lives under docs/rfcs/. The
two formats and lifecycles are deliberately separate; see
tools/spec-loop/specs/README.md.
Spec filenames are topics, not numbers — triage-mode.md,
pairing-mode.md, security-issue-lifecycle.md. There is no numeric
prefix, because numbering implies a priority the specs don’t carry.
Priority lives only in the implementation plan.
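The naming rule is easy to check mechanically. The sketch below is a hypothetical lint, not part of the loop: the function names has_numeric_prefix and numbered_specs are invented for illustration, and it only looks at whether a filename starts with a digit.

```shell
#!/bin/sh
# Hypothetical lint for the "topics, not numbers" rule.

# Succeed (exit 0) when the file's basename starts with a digit.
has_numeric_prefix() {
    name=$(basename "$1")
    case "$name" in
        [0-9]*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Print every path from the arguments whose filename is number-prefixed.
numbered_specs() {
    for path in "$@"; do
        if has_numeric_prefix "$path"; then
            echo "$path"
        fi
    done
}
```

Running numbered_specs tools/spec-loop/specs/*.md would be expected to print nothing when every spec follows the convention.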
A branch per fix or feature
The defining constraint of this loop: one work item, one branch, one
PR. Before plan/build iterations, the runner snapshots open PRs so the
plan beat does not add work already in flight and the build beat skips
planned items that an open PR already covers. The build beat returns to
the integration base, then carves out a
spec/<slug> branch for the single work item it is about to implement. It
never commits feature work to the base branch, and loop.sh stops if it
detects that happening. The result of a run is a fan of independent
branches, each carrying exactly one change — each independently
reviewable and independently revertible.
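The three checks described above (name the branch, skip work an open PR already covers, refuse commits on the base branch) can be sketched as small helpers. All function names here are hypothetical and the logic is an assumption about how such a guard could look, not the actual contents of loop.sh.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from loop.sh.

# Branch name for a work-item slug, following the spec/<slug> convention.
branch_for_slug() {
    echo "spec/$1"
}

# Read a newline-separated snapshot of open-PR head branches on stdin and
# succeed when the given branch is already among them.
covered_by_open_pr() {
    grep -Fxq -- "$1"
}

# Succeed only when the base branch tip is identical before and after an
# iteration, i.e. no commit landed on the base branch.
base_unchanged() {
    [ "$1" = "$2" ]
}
```

In a real run the snapshot could come from something like gh pr list --state open --json headRefName --jq '.[].headRefName', and the two tips from git rev-parse on the base branch taken before and after the iteration; both are left out of the sketch so it stays runnable without a repository.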
This is the same discipline the framework asks of every state change, applied to the framework’s own development. It is also what makes the loop safe to run unattended: the blast radius of any one iteration is a single local branch.
Why it never pushes
git push and gh pr create are in the ask list of
.claude/settings.json — they require a human
confirmation. The loop honours that: it ends every iteration at a local
commit and prints the exact commands for the human to run:
git push -u origin spec/<slug>
gh pr create --web --base main --head spec/<slug> \
    --title "<subject>" --body-file <prepared-body>
Opening the PR with --web is the framework’s convention so the reviewer
sees the title, body, and generative-AI disclosure in the browser before
submitting. The agent drafts; the human presses the button.
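Printing the hand-off rather than executing it is simple to sketch. The helper below is hypothetical (print_handoff is not a name from loop.sh); it only formats the two commands shown above for the human to copy, and it assumes the subject contains no double quotes.

```shell
#!/bin/sh
# Hypothetical helper: print, but never run, the commands the human uses
# to publish a finished branch.
# Arguments: slug, PR subject, path to the prepared body file, and an
# optional base branch (defaults to main, as in the commands above).
print_handoff() {
    slug=$1
    subject=$2
    body_file=$3
    base=${4:-main}
    printf 'git push -u origin spec/%s\n' "$slug"
    printf 'gh pr create --web --base %s --head spec/%s \\\n' "$base" "$slug"
    printf '    --title "%s" --body-file %s\n' "$subject" "$body_file"
}
```

Because the function only writes to standard output, it stays within what the loop is allowed to do: the push and the PR remain a human action.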
Security and the dangerously-skip-permissions flag
The loop runs the agent headless with --dangerously-skip-permissions.
That deserves a direct explanation, because it looks, at a glance, like
it throws away the framework’s permission gates.
Why the flag is there. Headless iterations have no human to answer a per-tool-call prompt. Without the flag, the agent would stall (or, in non-interactive mode, deny) the moment it tried to edit a file or run a validation command. The flag lets the loop do its job — edit, validate, commit — unattended.
What it bypasses, and what it does not. The framework’s sandbox is
layered (see docs/rfcs/RFC-AI-0004.md for the
normative statement and docs/setup/secure-agent-internals.md for the
mechanism). --dangerously-skip-permissions only reaches the top two:
| Layer | Mechanism | Bypassed by the flag? |
|---|---|---|
| 0. Clean environment | wrapper strips credential-shaped env vars before exec | No — it is the launching wrapper, not the agent |
| 1. Filesystem + network sandbox | bubblewrap + SNI proxy (Linux) / sandbox-exec (macOS); default-deny egress | No — enforced by the OS, not the agent |
| 2. Tool permissions | .claude/settings.json permissions.deny | Yes |
| 3. Forced confirmation | .claude/settings.json permissions.ask on git push, gh … | Yes |
So the flag removes the agent-level gate (Layers 2–3), but the OS-level boundary (Layers 0–1) is untouched — it is enforced beneath the agent and cannot be turned off from inside it. This is exactly the posture the flag’s own guidance assumes: it is “recommended only for sandboxes with no internet access.”
How the loop stays safe anyway. Three things, in order of importance:
- Run it only inside the sandbox harness. The OS layers the flag cannot bypass are the real boundary. Never run the loop on a bare machine; launch it through the project’s claude-iso/sandbox wrapper so the filesystem and network allow-lists are in force.
- Run it with no push/write credentials in the environment. The clean-env wrapper already strips them; keep it that way. github.com is on the network allow-list, but a git push or gh pr create with no token cannot authenticate, so it fails closed. As defence in depth the loop also passes --disallowedTools "Bash(git push *)" "Bash(gh *)".
- Structural containment. Every iteration works on its own spec/<slug> branch, the loop guards against commits landing on the base branch, and the prompts forbid push/PR. The human-in-the-loop gate is not removed; it is relocated from per-tool-call to the push / PR / merge boundary, where the human reviews a finished branch.
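The "no credentials in the environment" rule lends itself to a preflight check before a run starts. The sketch below is an assumption, not something loop.sh is known to do: the function name no_push_credentials is hypothetical, and the variable list covers only GH_TOKEN and GITHUB_TOKEN, which the gh CLI reads; a real wrapper would likely strip a broader set.

```shell
#!/bin/sh
# Hypothetical preflight: fail when a GitHub token is visible to the loop.
# The variable list is illustrative, not the wrapper's actual strip list.
no_push_credentials() {
    found=""
    for var in GH_TOKEN GITHUB_TOKEN; do
        # Indirect lookup of a fixed, known variable name.
        eval "value=\${$var:-}"
        if [ -n "$value" ]; then
            found="$found $var"
        fi
    done
    if [ -n "$found" ]; then
        # Report the names only, never the values.
        echo "${found# }"
        return 1
    fi
    return 0
}
```

A runner could call this first and abort when it fails, so a misconfigured environment is caught before any iteration begins rather than relying on the push failing closed.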
Net effect. During a run the per-call confirmation gate is traded for
autonomy, but credentials are absent, egress is fenced, and the blast
radius of any iteration is a single local branch the human has not yet
pushed. That is the same reason the loop is the project’s manual-loop
evidence and must never be promoted to auto-merge: the autonomy is
bounded to producing local branches, nothing more. An operator who wants
the per-call gate back can drop the flag and pre-authorise the loop’s
tools with --allowedTools instead — at the cost of the loop pausing on
anything it was not pre-authorised to do.
Keeping specs honest: the update beat
Not every contribution comes through the loop — people land new skills and
tools the normal way. When that happens the specs fall behind the code.
The update beat is the fix: it inventories .claude/skills/,
tools/, and docs/modes.md, diffs that against the specs, and back-fills
or corrects the specs (a proposed area that now has a shipped skill
becomes experimental; a drifted “Where it lives” entry is corrected; genuinely
new functionality gets a new topic-named spec). It edits only the spec
directory — it documents reality, it doesn’t change it — and lands as one
reviewable spec/sync-specs PR. Run it after a batch of normal PRs merges,
or on a schedule.
Layout
tools/spec-loop/
├── README.md                 operator quickstart
├── AGENTS.md                 loop-scoped operational context
├── loop.sh                   the runner (plan / build / update / consolidate)
├── PROMPT_plan.md            gap analysis → plan
├── PROMPT_build.md           implement one work item on its own branch
├── PROMPT_update.md          back-fill specs from contributed code
├── PROMPT_consolidate.md     shrink the plan
├── IMPLEMENTATION_PLAN.md    prioritised work items (the gaps)
└── specs/                    functional description of the product
    ├── overview.md
    ├── triage-mode.md  mentoring-mode.md  drafting-mode.md  pairing-mode.md
    ├── security-issue-lifecycle.md  privacy-llm-gate.md
    ├── agent-isolation-sandbox.md  cve-tooling.md
    ├── adoption-and-setup.md  adapters.md
    └── meta-and-quality-tooling.md
Quick start
# 1. See what's out of sync, then read the plan it writes.
./tools/spec-loop/loop.sh plan 1
$EDITOR tools/spec-loop/IMPLEMENTATION_PLAN.md
# 2. Build the top work item (one branch, one commit) and stop.
./tools/spec-loop/loop.sh 1
# 3. Review the branch it produced, then push + open the PR yourself.
git log --oneline -1
git push -u origin spec/<slug>
gh pr create --web --base main --head spec/<slug> --title "…" --body-file …
# Later: someone merged skills outside the loop — resync the specs.
./tools/spec-loop/loop.sh update 1
Stop any run with Ctrl+C or touch STOP. By default the loop forks
work items from the branch you start it on (typically main); set
SPEC_LOOP_BASE to build on top of a different branch. Set
SPEC_LOOP_AGENT when the Claude-compatible agent CLI is installed
under a command name other than claude. Set SPEC_LOOP_PR_LIMIT to
change how many open PRs are included in the duplicate-work check.
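The knobs above follow the usual "environment variable with a fallback" pattern, which can be sketched as follows. The function names are hypothetical, the STOP file is assumed to live in the directory the loop runs from, and no default is shown for SPEC_LOOP_PR_LIMIT because this document does not state one.

```shell
#!/bin/sh
# Hypothetical sketch of the knobs described above; not taken from loop.sh.

# Agent CLI to run: SPEC_LOOP_AGENT if set, otherwise claude.
agent_command() {
    echo "${SPEC_LOOP_AGENT:-claude}"
}

# Base branch: SPEC_LOOP_BASE if set, otherwise the branch passed in
# (the one the loop was started on).
base_branch() {
    echo "${SPEC_LOOP_BASE:-$1}"
}

# Succeed when a STOP file exists in the given directory
# (assumption: the current directory when none is given).
stop_requested() {
    [ -e "${1:-.}/STOP" ]
}
```

A runner would typically pass the output of git rev-parse --abbrev-ref HEAD as the starting branch and test stop_requested between iterations, so touch STOP ends the run at the next iteration boundary.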
How this composes with the framework’s principles
A loop that runs an agent unattended sounds, at first, like the opposite of human-in-the-loop. The branch-per-feature constraint is the reconciliation: the loop’s autonomy is bounded to producing local branches, and the human gate sits exactly where the framework always puts it — at push, at PR, at merge. Nothing the loop does is visible outside the maintainer’s machine until a human chooses to push it. The loop is the manual development cycle the framework can later point to as evidence; it is not, and must not become, an auto-merge.