
Mode economics — what does each mode cost to run?

Indicative, not a quote. The numbers on this page describe the token-count shape of a typical invocation, not a billing prediction. Token prices vary by provider, model, date, and discount tier. Always multiply by your own provider’s current rate — or by zero if you are running local inference.

This page exists because MISSION.md § Affordability commits to documenting mode economics honestly: a maintainer evaluating adoption should be able to make an informed decision, not discover the cost after the fact. The same data informs the long-term capacity planning for an ASF-hosted inference endpoint (see Long-term: the ASF inference endpoint).


How to read this page

What “tokens” means here

One token ≈ 0.75 words in English prose. Structured code and JSON tend to tokenise less compactly than prose; the diff anchors below work out to roughly 16–17 tokens per changed line. Practical anchors:

| Content | Approximate token count |
| --- | --- |
| Typical bug-report body (400 words) | ~530 tokens |
| Small PR diff (50 lines changed) | ~800 tokens |
| Medium PR diff (300 lines changed) | ~5 000 tokens |
| Large PR diff (1 500 lines changed) | ~25 000 tokens |
| Mail thread, 10 messages | ~3 000–8 000 tokens |
| One skill file (SKILL.md), small setup/utility skill | ~1 000–3 000 tokens |
| One skill file (SKILL.md), typical workflow skill | ~3 500–9 000 tokens (median ~5 300) |
| One skill file (SKILL.md), large multi-step security skill | ~11 000–36 000 tokens |

Every invocation loads the relevant skill file as part of its context, and that overhead varies widely by skill (measured with cl100k_base across the current catalogue). Small setup/utility skills run ~1 000–3 000 tokens; most workflow skills ~3 500–9 000 (median ~5 300); and the large multi-step security skills go much higher — security-issue-triage ~11 000, security-issue-import ~22 000, security-issue-sync ~36 000. This overhead applies before any project-specific content is read.
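
As a rough illustration, the anchors above can be turned into a back-of-the-envelope estimator. This is a minimal sketch, not a tokenizer: the function names are hypothetical, the prose ratio is the 0.75 words-per-token anchor, and the per-line diff ratio is an assumption derived from the "300 lines changed ≈ 5 000 tokens" row. For real measurements, count with the tokenizer your provider actually uses.

```python
# Back-of-the-envelope token estimator built from the anchors on this page.
# Heuristic only: real counts depend on the tokenizer and the content.

WORDS_PER_TOKEN = 0.75               # English prose anchor from this page
TOKENS_PER_DIFF_LINE = 5_000 / 300   # assumption: from "300 lines changed ≈ 5 000 tokens"


def prose_tokens(word_count: int) -> int:
    """Estimate tokens for English prose from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


def diff_tokens(lines_changed: int) -> int:
    """Estimate tokens for a PR diff from the number of changed lines."""
    return round(lines_changed * TOKENS_PER_DIFF_LINE)


def invocation_input_tokens(skill_file_tokens: int, word_count: int = 0, lines_changed: int = 0) -> int:
    """Skill-file overhead plus the project-specific content that is read."""
    return skill_file_tokens + prose_tokens(word_count) + diff_tokens(lines_changed)


# Median workflow skill (~5 300 tokens) reading a 400-word report and a 300-line diff.
estimate = invocation_input_tokens(5_300, word_count=400, lines_changed=300)
print(estimate)
```

The skill-file term is kept as an explicit argument because it is paid on every invocation, whatever the project-specific content turns out to be.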

Model classes

Skills are written against a capability contract, not a vendor. Three capability classes cover the realistic range for these workflows:

| Class | Parameter scale | Characteristics |
| --- | --- | --- |
| Small | ~7B–13B equivalent | Fast and cheap. Good at extraction, classification, and short structured drafts. Struggles on long-chain reasoning, large contexts, and novel patterns. |
| Mid-tier | ~70B equivalent | Balanced quality and cost. Handles the full skill catalogue well. Recommended starting point for new adopters. |
| Large | Frontier reasoning | Highest capability and highest cost. Use where mid-tier recall or reasoning falls short: complex security analysis, multi-step code fix drafting, detecting novel vulnerability patterns. |

Local models (Ollama, vLLM, llama.cpp) map onto Small or Mid-tier by capability; they incur hardware cost rather than per-token billing. See Local and self-hosted inference.


Per-mode token shape

Triage

The lowest-cost mode. Most Triage skills are read-bounded: the expensive part is loading context (PR diff, report body, existing issue sample), not generating output. Every output is a short proposal — a label suggestion, a routing recommendation, a classification with rationale — so output tokens are low relative to input.

| Skill | Typical invocation | Token range | Primary cost driver |
| --- | --- | --- | --- |
| pr-management-triage | Single PR triage pass | 5 000–30 000 | PR diff size and comment count |
| pr-management-stats | Weekly queue report | 10 000–50 000 | Number of open PRs read |
| pr-management-code-review | Single PR deep review | 15 000–80 000 | Diff size; code-heavy PRs are expensive |
| issue-triage | Single issue classification | 4 000–15 000 | Issue body length + similar-issue cross-check sample |
| issue-reassess | Pool-level sweep (10 issues) | 30 000–120 000 | Pool size; batch cost scales linearly |
| security-issue-import | Single inbound report | 8 000–25 000 | Report length + known-dup cross-check |
| security-issue-import-from-pr | Single security PR import | 10 000–30 000 | PR diff + associated discussion |
| security-issue-import-from-md | Batch import (5 findings) | 15 000–60 000 | Number of findings × finding length |
| security-issue-deduplicate | Two-tracker merge | 10 000–30 000 | Tracker age and mail-thread depth |
| security-issue-invalidate | Single invalid close | 8 000–20 000 | Report length + reply draft |
| security-issue-sync | Full tracker reconciliation | 20 000–100 000 | Tracker age, mail-thread depth, linked PRs |
| security-cve-allocate | CVE allocation workflow | 5 000–12 000 | Mostly procedural; low variance |

Rule of thumb for Triage: budget 10 000–30 000 tokens per PR / issue / report on average. A project processing 50 inbound items per week uses roughly 500 000–1 500 000 tokens/week across Triage work.
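
The weekly figure above is plain multiplication of inbound volume by the per-item budget. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the inputs are the rule-of-thumb numbers from this page rather than measurements.

```python
def weekly_token_range(items_per_week: int, low_per_item: int, high_per_item: int) -> tuple[int, int]:
    """Return the (low, high) weekly token budget for a given inbound volume."""
    if items_per_week < 0 or low_per_item > high_per_item:
        raise ValueError("expected a non-negative volume and low <= high")
    return items_per_week * low_per_item, items_per_week * high_per_item


# 50 inbound items per week at 10 000–30 000 tokens each.
low, high = weekly_token_range(50, 10_000, 30_000)
print(f"{low:,} to {high:,} tokens/week")
```

Substituting your own weekly volume and a per-item range taken from the table gives a first-order budget for a specific project.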

Mentoring

Mentoring is conversational and per-reply: the agent reads thread context, project conventions, and contributor history, then produces a single targeted response. Cost per reply is moderate; total weekly cost depends on contributor volume.

| Skill | Typical invocation | Token range | Notes |
| --- | --- | --- | --- |
| pr-management-mentor | Single threaded reply | 6 000–20 000 | Estimated; skill experimental |

Rule of thumb for Mentoring: budget 10 000–20 000 tokens per contributor interaction. A project with 20 active contributors each receiving 3 agent replies per week: roughly 600 000–1 200 000 tokens/week.

Drafting

The most variable mode. Short reporter replies are inexpensive; agent-drafted code fixes are expensive because the agent reads relevant source files in addition to the issue or report.

| Skill | Typical invocation | Token range | Notes |
| --- | --- | --- | --- |
| security-issue-fix (reporter reply) | Single reply draft | 10 000–35 000 | Reads report + canned responses + prior thread |
| security-issue-fix (code fix) | Agent-drafted fix + PR | 30 000–150 000 | Adds source files; wide variance |
| issue-fix-workflow | Issue fix + PR | 25 000–120 000 | Bounded by what the skill reads from the codebase |

Rule of thumb for Drafting: reporter replies average 15 000–25 000 tokens; code-producing invocations average 50 000–100 000 tokens depending on codebase scope. Limiting the skill to the relevant source files is the single biggest lever on Drafting cost.
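
To make the scoping lever concrete, the sketch below compares an unscoped read with one limited to the relevant files. Everything in it is illustrative: the file paths, the per-file token counts, the base figure, and the function name are hypothetical and are not taken from any real skill.

```python
from typing import Optional


def drafting_input_tokens(
    base_tokens: int,
    file_tokens: dict[str, int],
    relevant: Optional[set[str]] = None,
) -> int:
    """Base context plus the source files read; `relevant=None` means read everything."""
    return base_tokens + sum(
        tokens
        for path, tokens in file_tokens.items()
        if relevant is None or path in relevant
    )


# Hypothetical per-file token counts for a small codebase.
repo = {
    "src/parser.py": 12_000,
    "src/auth.py": 9_000,
    "src/cli.py": 20_000,
    "tests/test_auth.py": 7_000,
}
base = 20_000  # illustrative: report, canned responses, prior thread

unscoped = drafting_input_tokens(base, repo)
scoped = drafting_input_tokens(base, repo, {"src/auth.py", "tests/test_auth.py"})
print(unscoped, scoped)
```

In this made-up case, scoping the read to two files roughly halves the input; the real saving depends entirely on how the codebase is laid out.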

Pairing

Pairing runs in the developer’s own development cycle, not on project infrastructure — cost is per-developer-session. Multi-agent pipelines multiply the per-pass cost by the number of review agents.

| Skill | Typical invocation | Token range | Notes |
| --- | --- | --- | --- |
| pairing-self-review | Pre-flight review of a local diff | 10 000–50 000 | Estimated; skill experimental. Scales with diff size and conventions doc length. |
| Multi-agent review pipeline | Full three-pass review | 30 000–200 000 | Estimated; future skill. 3–4 × single-pass cost. Parallelism reduces latency, not billing. |

Rule of thumb for Pairing: a typical pre-flight self-review of a medium PR uses 15 000–30 000 tokens. A three-agent review pipeline on the same PR: 45 000–90 000 tokens.
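
The pipeline figure follows from multiplying the single-pass range by the number of review agents. The sketch below also illustrates why parallelism changes latency but not billing; it assumes every pass costs and takes the same as a single self-review, and both function names are hypothetical.

```python
def pipeline_tokens(single_pass: tuple[int, int], agents: int) -> tuple[int, int]:
    """Token range for a pipeline: each agent pays the full single-pass cost."""
    low, high = single_pass
    return low * agents, high * agents


def pipeline_latency(pass_seconds: float, agents: int, parallel: bool) -> float:
    """Wall-clock time, assuming equal-length passes that either overlap fully or run in sequence."""
    return pass_seconds if parallel else pass_seconds * agents


# Three agents reviewing a medium PR (15 000–30 000 tokens per pass).
print(pipeline_tokens((15_000, 30_000), agents=3))

# Billing is identical either way; only the wait differs (60 s per pass is illustrative).
print(pipeline_latency(60.0, 3, parallel=True), pipeline_latency(60.0, 3, parallel=False))
```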

Auto-merge

Status: off. Auto-merge is not implemented; it has no token cost. See docs/modes.md § Auto-merge.


Model class and mode cost shape

The table below describes the quality/cost trade-off per mode, not a hard recommendation. “Viable” means acceptable recall on typical cases; “Recommended” means the sweet spot between quality and cost; entries in the Large class column name the cases whose quality requirements mid-tier models often miss.

| Mode | Small class | Mid-tier class | Large class |
| --- | --- | --- | --- |
| Triage: classification / routing | Viable for most cases | Recommended default | Rarely needed |
| Triage: security import (novel patterns) | Miss rate is higher | Recommended default | For subtle or novel reports |
| Mentoring | Acceptable on simple threads | Recommended default | Not typical |
| Drafting: reporter reply | Acceptable | Recommended default | Rarely needed |
| Drafting: code fix | Often insufficient | Recommended default | Complex bugs or large refactors |
| Pairing: self-review | Limited recall on conventions | Recommended default | Anchor pass in multi-agent pipelines |

Cost differential across classes (indicative ratio, not a price): Small-class models are typically 10–50× cheaper per token than Large-class models at hosted-API rates. Mid-tier sits at roughly 3–10× cheaper than Large. The total invocation cost is token_count × per_token_rate; the rate varies by vendor and changes over time — check your provider’s current pricing page.
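
The formula token_count × per_token_rate can be sketched as below. The rates are placeholders chosen only to sit inside the indicative ratios above (Large is 25× Small and 5× Mid-tier here); they are not real prices. A single blended rate is also a simplification, since many providers price input and output tokens separately.

```python
def invocation_cost(token_count: int, rate_per_million: float) -> float:
    """Cost of one invocation: token_count × per_token_rate, with the rate quoted per million tokens."""
    return token_count / 1_000_000 * rate_per_million


# Hypothetical rates in currency units per million tokens. NOT real prices.
HYPOTHETICAL_RATES = {"small": 0.5, "mid": 2.5, "large": 12.5}

# A 20 000-token Triage invocation priced under each class.
costs = {name: invocation_cost(20_000, rate) for name, rate in HYPOTHETICAL_RATES.items()}
for name, cost in costs.items():
    print(f"{name}: {cost:.3f}")
```

Replacing the placeholder dictionary with figures from your provider's current pricing page turns the token ranges on this page into a cost estimate for your own setup.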


Local and self-hosted inference

Running a model locally (Ollama, vLLM, llama.cpp) shifts cost from per-token billing to hardware:

| Inference path | Per-token cost | Typical hardware cost | Notes |
| --- | --- | --- | --- |
| Consumer GPU, Small-class quantised model | $0 | ~$0.10–0.50/hr (capex amortised over ~3 yr lifespan × moderate utilisation) | Viable for Triage and short Mentoring/Drafting |
| Cloud spot GPU, Mid-tier model | $0 | ~$1–4/hr depending on GPU class | Viable for all modes; latency is higher than hosted APIs |
| CPU-only, quantised Small model | $0 | Near-zero | Very slow; not recommended for interactive Pairing |

Local inference is also the simplest privacy answer for most skills: data never leaves the machine, and no third-party data-processing agreement is needed. The framework’s vendor neutrality means local paths use identical skill code to hosted paths.
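
One way to compare the two cost shapes is a break-even throughput: the number of tokens per hour above which an hourly hardware cost undercuts a hosted per-token rate. The sketch below is a simplification under stated assumptions. The hosted rate is a placeholder rather than a real price, the function name is hypothetical, and the calculation ignores whether the hardware can actually sustain that throughput.

```python
def break_even_tokens_per_hour(hardware_cost_per_hour: float, hosted_rate_per_million: float) -> float:
    """Tokens per hour at which hourly hardware cost equals hosted per-token spend."""
    if hosted_rate_per_million <= 0:
        raise ValueError("hosted rate must be positive")
    return hardware_cost_per_hour / hosted_rate_per_million * 1_000_000


# Upper end of the consumer-GPU range (0.50/hr) against a hypothetical hosted rate of 2.5 per million tokens.
threshold = break_even_tokens_per_hour(0.50, 2.5)
print(f"{threshold:,.0f} tokens/hour")
```

Below the threshold the hosted API is cheaper on cost alone; above it the local path is. Privacy and latency are separate considerations.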


Reducing costs

  1. Match model class to task. Triage classification and short Mentoring replies do not need a frontier model. Reserve Large-class for novel-pattern security analysis and complex multi-file code fixes.

  2. Scope code reads. The biggest driver of Drafting cost is how many source files the agent loads. Small, well-named files help the skill read only what is relevant.

  3. Cache skill context. Most agent CLIs support prompt-level caching. The skill file (size varies by skill class; see What “tokens” means here) and stable project configuration files are ideal cache candidates — the first invocation pays; subsequent invocations are cheap on the cached portion. Note: most provider caches have a short TTL (Anthropic prompt cache: 5 min default, 1 h extended at higher write cost), so bursty same-session workloads benefit most; periodic triage runs spaced hours apart will typically miss the cache.

  4. Batch triage. issue-reassess and pr-management-stats amortise context load across a pool. Running them weekly rather than per event loads the shared context once per pool instead of once per item, which lowers total token volume.

  5. Run locally for development. When authoring or testing a new skill override, use a local model. Save the hosted model for production invocations.
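
Two of the levers in the list above, caching and batching, can be illustrated with a small model. This is a sketch under stated assumptions, not a description of any provider's cache. It assumes call times are sorted, that every call refreshes the cache entry, and that a batched run loads the skill file once while per-item content stays the same. Both function names are hypothetical.

```python
def cache_hits(call_times_s: list[float], ttl_s: float = 300.0) -> int:
    """Count calls that arrive within the TTL of the previous call.

    Assumes sorted timestamps and that each call (hit or miss) refreshes the cache entry.
    """
    hits = 0
    for previous, current in zip(call_times_s, call_times_s[1:]):
        if current - previous <= ttl_s:
            hits += 1
    return hits


def batch_savings(skill_tokens: int, items: int) -> int:
    """Tokens saved by loading the skill file once per batch instead of once per item."""
    return skill_tokens * (items - 1) if items > 0 else 0


# A bursty session: three calls a minute apart, then two more after a two-hour gap.
print(cache_hits([0, 60, 120, 7_200, 7_260]))

# Periodic runs three hours apart never land inside a 5-minute TTL.
print(cache_hits([0, 10_800, 21_600]))

# Ten issues swept in one batch with a median workflow skill (~5 300 tokens).
print(batch_savings(5_300, 10))
```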


Long-term: the ASF inference endpoint

MISSION.md § Affordability names an ASF-hosted inference endpoint (inference.apache.org, name TBD) as a long-term roadmap item: a community-affordable, foundation-governed, audit-logged inference layer any open-source maintainer — ASF or otherwise — can use without paying a vendor or accepting a vendor’s gift.

This page’s data — token counts per mode, per typical workload — is the quantitative input for the capacity planning and cost models that endpoint will need. As pilot adopters accumulate real usage data, this page will be updated with observed ranges rather than theoretical estimates, so the endpoint sizing argument rests on evidence.


Cross-references