Mode economics — what does each mode cost to run?
Indicative, not a quote. The numbers on this page describe the token-count shape of a typical invocation, not a billing prediction. Token prices vary by provider, model, date, and discount tier. Always multiply by your own provider’s current rate — or by zero if you are running local inference.
This page exists because MISSION.md § Affordability commits to documenting mode economics honestly: a maintainer evaluating adoption should be able to make an informed decision, not discover the cost after the fact. The same data informs the long-term capacity planning for an ASF-hosted inference endpoint (see Long-term: the ASF inference endpoint).
How to read this page
What “tokens” means here
One token ≈ 0.75 words in English prose, or roughly one character in structured code or JSON. Practical anchors:
| Content | Approximate token count |
|---|---|
| Typical bug-report body (400 words) | ~530 tokens |
| Small PR diff (50 lines changed) | ~800 tokens |
| Medium PR diff (300 lines changed) | ~5 000 tokens |
| Large PR diff (1 500 lines changed) | ~25 000 tokens |
| Mail thread, 10 messages | ~3 000–8 000 tokens |
| One skill file (SKILL.md), small setup/utility skill | ~1 000–3 000 tokens |
| One skill file (SKILL.md), typical workflow skill | ~3 500–9 000 tokens (median ~5 300) |
| One skill file (SKILL.md), large multi-step security skill | ~11 000–36 000 tokens |
Every invocation loads the relevant skill file as part of its context,
and that overhead varies widely by skill (measured with cl100k_base
across the current catalogue). Small setup/utility skills run
~1 000–3 000 tokens; most workflow skills ~3 500–9 000 (median ~5 300);
and the large multi-step security skills go much higher —
security-issue-triage ~11 000, security-issue-import ~22 000,
security-issue-sync ~36 000. This overhead applies before any
project-specific content is read.
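The anchors above can be turned into a quick estimate. The following is a minimal sketch, assuming the 0.75 words-per-token rule of thumb from this section; the function names are hypothetical, and a real tokenizer will give somewhat different counts.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from this page for English prose


def estimate_prose_tokens(word_count: int) -> int:
    """Approximate token count for English prose of the given word count."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_invocation_input(skill_tokens: int, content_tokens: int) -> int:
    """Skill-file overhead is paid before any project content is read."""
    return skill_tokens + content_tokens


report_tokens = estimate_prose_tokens(400)  # 400-word bug report
# 5_300 is the median workflow-skill size quoted in the table above.
total_input = estimate_invocation_input(5_300, report_tokens)
print(report_tokens, total_input)
```

For a 400-word report the estimate lands close to the ~530 tokens listed in the table, and the skill file dominates the input.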
Model classes
Skills are written against a capability contract, not a vendor. Three capability classes cover the realistic range for these workflows:
| Class | Parameter scale | Characteristics |
|---|---|---|
| Small | ~7B–13B equivalent | Fast and cheap. Good at extraction, classification, and short structured drafts. Struggles on long-chain reasoning, large contexts, and novel patterns. |
| Mid-tier | ~70B equivalent | Balanced quality and cost. Handles the full skill catalogue well. Recommended starting point for new adopters. |
| Large | Frontier reasoning | Highest capability and highest cost. Use where mid-tier recall or reasoning falls short — complex security analysis, multi-step code fix drafting, detecting novel vulnerability patterns. |
Local models (Ollama, vLLM, llama.cpp) map onto Small or Mid-tier by capability; they incur hardware cost rather than per-token billing. See Local and self-hosted inference.
Per-mode token shape
Triage
The lowest-cost mode. Most Triage skills are read-bounded: the expensive part is loading context (PR diff, report body, existing issue sample), not generating output. Every output is a short proposal — a label suggestion, a routing recommendation, a classification with rationale — so output tokens are low relative to input.
| Skill | Typical invocation | Token range | Primary cost driver |
|---|---|---|---|
| pr-management-triage | Single PR triage pass | 5 000–30 000 | PR diff size and comment count |
| pr-management-stats | Weekly queue report | 10 000–50 000 | Number of open PRs read |
| pr-management-code-review | Single PR deep review | 15 000–80 000 | Diff size; code-heavy PRs are expensive |
| issue-triage | Single issue classification | 4 000–15 000 | Issue body length + similar-issue cross-check sample |
| issue-reassess | Pool-level sweep (10 issues) | 30 000–120 000 | Pool size; batch cost scales linearly |
| security-issue-import | Single inbound report | 8 000–25 000 | Report length + known-dup cross-check |
| security-issue-import-from-pr | Single security PR import | 10 000–30 000 | PR diff + associated discussion |
| security-issue-import-from-md | Batch import (5 findings) | 15 000–60 000 | Number of findings × finding length |
| security-issue-deduplicate | Two-tracker merge | 10 000–30 000 | Tracker age and mail-thread depth |
| security-issue-invalidate | Single invalid close | 8 000–20 000 | Report length + reply draft |
| security-issue-sync | Full tracker reconciliation | 20 000–100 000 | Tracker age, mail-thread depth, linked PRs |
| security-cve-allocate | CVE allocation workflow | 5 000–12 000 | Mostly procedural; low variance |
Rule of thumb for Triage: budget 10 000–30 000 tokens per PR / issue / report on average. A project processing 50 inbound items per week uses roughly 500 000–1 500 000 tokens/week across Triage work.
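The weekly figure is a straight multiplication, and the same arithmetic works per skill. The sketch below uses ranges copied from the Triage table; the weekly volumes in the mixed example are hypothetical and should be replaced with your own queue numbers.

```python
# Per-invocation token ranges copied from the Triage table above.
TRIAGE_RANGES = {
    "pr-management-triage": (5_000, 30_000),
    "issue-triage": (4_000, 15_000),
}


def rule_of_thumb(items_per_week: int, per_item=(10_000, 30_000)) -> tuple[int, int]:
    """Flat estimate: every inbound item costs the rule-of-thumb range."""
    return items_per_week * per_item[0], items_per_week * per_item[1]


def weekly_range(weekly_counts: dict[str, int]) -> tuple[int, int]:
    """Per-skill estimate: sum each skill's range times its weekly volume."""
    low = sum(n * TRIAGE_RANGES[skill][0] for skill, n in weekly_counts.items())
    high = sum(n * TRIAGE_RANGES[skill][1] for skill, n in weekly_counts.items())
    return low, high


flat = rule_of_thumb(50)
# Hypothetical mix: 40 PR triage passes and 10 issue classifications per week.
mix = weekly_range({"pr-management-triage": 40, "issue-triage": 10})
print(flat, mix)
```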
Mentoring
Mentoring is conversational and per-reply: the agent reads thread context, project conventions, and contributor history, then produces a single targeted response. Cost per reply is moderate; total weekly cost depends on contributor volume.
| Skill | Typical invocation | Token range | Notes |
|---|---|---|---|
| pr-management-mentor | Single threaded reply | 6 000–20 000 | Estimated; skill experimental |
Rule of thumb for Mentoring: budget 10 000–20 000 tokens per contributor interaction. A project with 20 active contributors each receiving 3 agent replies per week: roughly 600 000–1 200 000 tokens/week.
Drafting
The most variable mode. Short reporter replies are inexpensive; agent-drafted code fixes are expensive because the agent reads relevant source files in addition to the issue or report.
| Skill | Typical invocation | Token range | Notes |
|---|---|---|---|
| security-issue-fix — reporter reply | Single reply draft | 10 000–35 000 | Reads report + canned responses + prior thread |
| security-issue-fix — code fix | Agent-drafted fix + PR | 30 000–150 000 | Adds source files; wide variance |
| issue-fix-workflow | Issue fix + PR | 25 000–120 000 | Bounded by what the skill reads from the codebase |
Rule of thumb for Drafting: reporter replies average 15 000–25 000 tokens; code-producing invocations average 50 000–100 000 tokens depending on codebase scope. Limiting the skill to the relevant source files is the single biggest lever on Drafting cost.
Pairing
Pairing runs in the developer’s own development cycle, not on project infrastructure — cost is per-developer-session. Multi-agent pipelines multiply the per-pass cost by the number of review agents.
| Skill | Typical invocation | Token range | Notes |
|---|---|---|---|
| pairing-self-review | Pre-flight review of a local diff | 10 000–50 000 | Estimated; skill experimental. Scales with diff size and conventions doc length. |
| Multi-agent review pipeline | Full three-pass review | 30 000–200 000 | Estimated; future skill. 3–4 × single-pass cost. Parallelism reduces latency, not billing. |
Rule of thumb for Pairing: a typical pre-flight self-review of a medium PR uses 15 000–30 000 tokens. A three-agent review pipeline on the same PR: 45 000–90 000 tokens.
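The point that parallelism reduces latency but not billing can be illustrated with a small model. This is a sketch only: the `ReviewPass` type and the per-pass durations are hypothetical, chosen to show that billed tokens are a sum in either case while wall-clock time is a sum when sequential and a maximum when parallel.

```python
from dataclasses import dataclass


@dataclass
class ReviewPass:
    tokens: int     # tokens consumed by this review pass
    seconds: float  # hypothetical wall-clock duration of this pass


def billed_tokens(passes: list[ReviewPass]) -> int:
    """Every pass is billed, whether it runs in parallel or not."""
    return sum(p.tokens for p in passes)


def wall_clock(passes: list[ReviewPass], parallel: bool) -> float:
    """Parallel runs finish with the slowest pass; sequential runs add up."""
    durations = [p.seconds for p in passes]
    return max(durations) if parallel else sum(durations)


# Three passes over a medium PR at 20 000 tokens each (within the 15 000-30 000 range).
passes = [ReviewPass(20_000, 40.0), ReviewPass(20_000, 55.0), ReviewPass(20_000, 35.0)]
print(billed_tokens(passes), wall_clock(passes, parallel=False), wall_clock(passes, parallel=True))
```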
Auto-merge
Status: off. Auto-merge is not implemented; it has no token cost.
See docs/modes.md § Auto-merge.
Model class and mode cost shape
The table below describes the quality/cost trade-off per mode, not a hard recommendation. “Viable” means acceptable recall on typical cases; “Recommended default” marks the sweet spot between quality and cost; the Large class column lists the cases whose quality requirements mid-tier models often miss.
| Mode | Small class | Mid-tier class | Large class |
|---|---|---|---|
| Triage — classification / routing | Viable for most cases | Recommended default | Rarely needed |
| Triage — security import (novel patterns) | Miss rate is higher | Recommended default | For subtle or novel reports |
| Mentoring | Acceptable on simple threads | Recommended default | Not typical |
| Drafting — reporter reply | Acceptable | Recommended default | Rarely needed |
| Drafting — code fix | Often insufficient | Recommended default | Complex bugs or large refactors |
| Pairing — self-review | Limited recall on conventions | Recommended default | Anchor pass in multi-agent pipelines |
Cost differential across classes (indicative ratio, not a price):
Small-class models are typically 10–50× cheaper per token than
Large-class models at hosted-API rates, and Mid-tier models are roughly
3–10× cheaper than Large. The total invocation cost is token_count × per_token_rate;
the rate varies by vendor and changes over time — check your provider’s
current pricing page.
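A minimal sketch of the formula and the class ratios follows. The Large-class rate is a placeholder chosen for round numbers, not a real price, and the sketch assumes a single blended rate per token rather than separate input and output rates.

```python
LARGE_RATE_PER_MTOK = 10.0  # PLACEHOLDER rate per million tokens, not a real price


def invocation_cost(token_count: int, rate_per_mtok: float) -> float:
    """token_count x per_token_rate, with the rate quoted per million tokens."""
    return token_count * rate_per_mtok / 1_000_000


def class_rate_range(large_rate: float, min_ratio: float, max_ratio: float) -> tuple[float, float]:
    """Rate range for a class that is min_ratio..max_ratio times cheaper than Large."""
    return large_rate / max_ratio, large_rate / min_ratio


small_rates = class_rate_range(LARGE_RATE_PER_MTOK, 10, 50)  # Small: 10-50x cheaper
mid_rates = class_rate_range(LARGE_RATE_PER_MTOK, 3, 10)     # Mid-tier: 3-10x cheaper

# Cost of one 20 000-token Triage invocation on each class (placeholder units).
large_cost = invocation_cost(20_000, LARGE_RATE_PER_MTOK)
small_cost_range = tuple(invocation_cost(20_000, r) for r in small_rates)
print(large_cost, small_cost_range, mid_rates)
```

Swapping in your provider's current rate for the placeholder gives a real figure; the ratios only indicate the spread between classes.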
Local and self-hosted inference
Running a model locally (Ollama, vLLM, llama.cpp) shifts cost from per-token billing to hardware:
| Inference path | Per-token cost | Typical hardware cost | Notes |
|---|---|---|---|
| Consumer GPU, Small-class quantised model | $0 | ~$0.10–0.50/hr (capex amortised over ~3 yr lifespan × moderate utilisation) | Viable for Triage and short Mentoring/Drafting |
| Cloud spot GPU, Mid-tier model | $0 | ~$1–4/hr depending on GPU class | Viable for all modes; latency is higher than hosted APIs |
| CPU-only, quantised Small model | $0 | Near-zero | Very slow; not recommended for interactive Pairing |
Local inference is also the simplest privacy answer for most skills: data never leaves the machine, and no third-party data-processing agreement is needed. The framework’s vendor neutrality means local paths use identical skill code to hosted paths.
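To compare a local path with a hosted rate, the hourly hardware cost in the table can be converted into an effective cost per million tokens. This sketch assumes the machine is busy for the whole hour; the throughput figure is hypothetical and depends heavily on model, quantisation, and hardware.

```python
def effective_cost_per_mtok(hourly_cost: float, tokens_per_second: float) -> float:
    """Hardware cost per million tokens, assuming full utilisation for the hour."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


# $0.30/hr sits inside the consumer-GPU range above; 50 tokens/s is a hypothetical throughput.
consumer_gpu = effective_cost_per_mtok(0.30, 50)
print(round(consumer_gpu, 2))
```

At lower utilisation the effective figure rises proportionally, because the amortised hardware cost accrues whether or not tokens are being produced.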
Reducing costs
- Match model class to task. Triage classification and short Mentoring replies do not need a frontier model. Reserve Large-class for novel-pattern security analysis and complex multi-file code fixes.
- Scope code reads. The biggest driver of Drafting cost is how many source files the agent loads. Small, well-named files help the skill read only what is relevant.
- Cache skill context. Most agent CLIs support prompt-level caching. The skill file (size varies by skill class; see What “tokens” means here) and stable project configuration files are ideal cache candidates: the first invocation pays; subsequent invocations are cheap on the cached portion. Note: most provider caches have a short TTL (Anthropic prompt cache: 5 min default, 1 h extended at higher write cost), so bursty same-session workloads benefit most; periodic triage runs spaced hours apart will typically miss the cache.
- Batch triage. issue-reassess and pr-management-stats amortise context load across a pool. Running them weekly rather than per-event reduces overall token volume compared with individual calls.
- Run locally for development. When authoring or testing a new skill override, use a local model. Save the hosted model for production invocations.
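The caching note above can be made concrete with a small simulation. This is a simplified model, not any provider's API: it assumes a single cached prefix and that each invocation refreshes the TTL; the function name and timestamps are hypothetical.

```python
def cache_hits(invocation_minutes: list[float], ttl_minutes: float = 5.0) -> list[bool]:
    """For each invocation time (in minutes), report whether the cached prefix is still warm."""
    hits = []
    last_use = None
    for t in invocation_minutes:
        # A hit requires a previous use no longer ago than the TTL.
        hits.append(last_use is not None and (t - last_use) <= ttl_minutes)
        last_use = t  # assumption: every use (hit or miss) restarts the TTL
    return hits


bursty = cache_hits([0, 1, 3, 4])       # same-session burst
periodic = cache_hits([0, 180, 360])    # triage runs three hours apart
extended = cache_hits([0, 45, 90], ttl_minutes=60)  # 1 h extended TTL
print(bursty, periodic, extended)
```

Only the first call in the burst pays for the full skill context; the runs spaced hours apart pay it every time.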
Long-term: the ASF inference endpoint
MISSION.md § Affordability
names an ASF-hosted inference endpoint (inference.apache.org, name
TBD) as a long-term roadmap item: a community-affordable,
foundation-governed, audit-logged inference layer any open-source
maintainer — ASF or otherwise — can use without paying a vendor or
accepting a vendor’s gift.
This page’s data — token counts per mode, per typical workload — is the quantitative input for the capacity planning and cost models that endpoint will need. As pilot adopters accumulate real usage data, this page will be updated with observed ranges rather than theoretical estimates, so the endpoint sizing argument rests on evidence.
Cross-references
- MISSION.md § Affordability: the policy commitment behind this page.
- docs/modes.md: per-mode skill catalogue and maturity status.
- docs/prerequisites.md: what you need to run the framework, including model-backend setup.