Why most people price AI agents wrong
Most builders quote a client a flat monthly number pulled out of thin air, then quietly hope the token bill stays under it. That works until the agent gets used. The fix is boring and it works: figure out what ONE run of the agent actually costs in tokens, multiply by how often it runs, add your margin, and quote from that. This guide shows the exact arithmetic — with real 2026 prices — and the one trick (routing tasks to the right-sized model) that cuts the bill the most. Cloudflare runs this at scale internally: across its first 30 days, its AI code-review system did 131,246 review runs over 48,095 merge requests, and the median run cost $0.98 (average $1.19). That's the whole game in one number — a known, predictable cost per task.
- A flat quote with no cost model is a bet, not a price.
- The unit that matters is cost PER TASK (per call, per review, per conversation), not per month.
- Once you know cost-per-task, the monthly number falls out of usage — and so does your margin.
The only formula you need
Token billing is simple once you see it. Every model has two prices: one for the tokens you send IN (the prompt, the context, the data), one for the tokens it sends OUT (the answer). You pay per million tokens. So the cost of one agent task is:
- Cost per task = (input tokens ÷ 1,000,000 × input price) + (output tokens ÷ 1,000,000 × output price)
- Input tokens = your system prompt + the user's request + any context/data you stuff in (this is usually the BIG number).
- Output tokens = what the model writes back (usually much smaller, but priced ~5× higher per token).
- Rough rule: 1 token ≈ 4 characters ≈ 0.75 of a word. A page of text is ~500–700 tokens.
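The formula and the rough token rule can be sketched in a few lines of Python. This is a minimal illustration, not a billing implementation: the function names `estimate_tokens` and `cost_per_task` are made up for this example, and the numbers in the sample call are placeholders rather than measurements.

```python
def estimate_tokens(text: str) -> int:
    """Rough token count using the ~4 characters per token rule of thumb."""
    return max(1, round(len(text) / 4))


def cost_per_task(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Dollar cost of one task. Prices are in dollars per 1M tokens."""
    return (input_tokens / 1_000_000 * input_price
            + output_tokens / 1_000_000 * output_price)


# Placeholder numbers: 10,000 tokens in, 1,000 tokens out,
# at $3 per 1M input tokens and $15 per 1M output tokens.
example = cost_per_task(10_000, 1_000, input_price=3.0, output_price=15.0)
print(f"${example:.3f} per task")
```

The character-based estimate is only good for back-of-envelope sizing; for a quote, use the token counts the API reports for real runs.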
Real 2026 prices (verified)
Here's what three Claude tiers actually cost as of June 2026, straight from Anthropic's pricing page. Cheap, mid, and frontier. The gap between them is the entire reason routing works — the cheapest model is 5× cheaper than the frontier one for the same tokens.
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Use it for |
|---|---|---|---|
| Claude Haiku 4.5 | $1 | $5 | Easy, high-volume calls — classify, extract, route, short replies |
| Claude Sonnet 4.6 | $3 | $15 | The everyday workhorse — most real tasks |
| Claude Opus 4.8 | $5 | $25 | The hard 20% — tricky reasoning, the calls you can't get wrong |
Worked example: pricing one agent task
Say your agent reviews a chunk of work for a client — a support ticket, a contract clause, a code diff, whatever. A realistic task sends ~30,000 input tokens (instructions + the thing to review + some context) and writes ~3,000 output tokens back. Same task, three models:
| Model | Input cost | Output cost | Cost per task |
|---|---|---|---|
| Haiku 4.5 | 30k × $1/1M = $0.030 | 3k × $5/1M = $0.015 | $0.045 |
| Sonnet 4.6 | 30k × $3/1M = $0.090 | 3k × $15/1M = $0.045 | $0.135 |
| Opus 4.8 | 30k × $5/1M = $0.150 | 3k × $25/1M = $0.075 | $0.225 |
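The same arithmetic as a short script, in case you want to swap in your own token counts. The prices are copied from the pricing table above; the dictionary layout and the `task_cost` name are just choices for this sketch.

```python
# $ per 1M tokens as (input, output), copied from the pricing table above.
PRICES = {
    "Haiku 4.5": (1.0, 5.0),
    "Sonnet 4.6": (3.0, 15.0),
    "Opus 4.8": (5.0, 25.0),
}

INPUT_TOKENS = 30_000   # instructions + the thing to review + context
OUTPUT_TOKENS = 3_000   # what the model writes back


def task_cost(model: str) -> float:
    """Cost in dollars of one task on the given model."""
    input_price, output_price = PRICES[model]
    return (INPUT_TOKENS / 1_000_000 * input_price
            + OUTPUT_TOKENS / 1_000_000 * output_price)


costs = {model: task_cost(model) for model in PRICES}
for model, cost in costs.items():
    print(f"{model:<11} ${cost:.3f} per task")
```

Change `INPUT_TOKENS` and `OUTPUT_TOKENS` to your measured averages and the spread between tiers updates with them.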
The move that cuts the bill: model-routing
Here's the lever. Most tasks are easy. A few are hard. If you send everything to the frontier model 'to be safe', you pay frontier prices for work a cheap model would nail. Instead, route: a cheap model (or a small classifier) decides if a task is easy or hard, then easy tasks go to Haiku and only the hard ones go to Opus. Cloudflare uses a version of this, splitting by role rather than by difficulty: it reserves its top-tier models (Claude Opus 4.7 / GPT-5.4) for the coordinator that orchestrates the review, and runs the bulk 'sub-reviewer' work on standard-tier models (Claude Sonnet 4.6 / GPT-5.3 Codex), with a lightweight model for text-heavy odds and ends. Same pattern, any scale. Watch what it does to 1,000 tasks a month:
| Strategy | Math | Monthly cost |
|---|---|---|
| Everything on Opus 4.8 | 1,000 × $0.225 | $225 |
| Routed: 80% Haiku + 20% Opus | (800 × $0.045) + (200 × $0.225) | $81 |
| You just saved | $225 − $81 | $144 (64% less) |
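To try other routing splits, the blended monthly cost can be computed like this. It is a sketch under the same assumptions as the table (every task costs the 30k-in / 3k-out figure on whichever model handles it); `monthly_cost` is a name invented for this example.

```python
# $ per task for the 30k-in / 3k-out example, from the worked example above.
COST = {"haiku": 0.045, "opus": 0.225}


def monthly_cost(tasks: int, mix: dict[str, float]) -> float:
    """Blended monthly cost. `mix` maps model name to its share of tasks."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("routing shares must sum to 1")
    return sum(tasks * share * COST[model] for model, share in mix.items())


all_opus = monthly_cost(1_000, {"opus": 1.0})
routed = monthly_cost(1_000, {"haiku": 0.8, "opus": 0.2})
saving = all_opus - routed
print(f"all Opus ${all_opus:.0f}, routed ${routed:.0f}, "
      f"saved ${saving:.0f} ({saving / all_opus:.0%})")
```

Note that this leaves out the cost of the routing step itself. If a cheap model classifies each task first, that call is small but not free, so add it to the routed total before relying on the saving.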
The hidden costs that blow up your estimate
Your clean 30k-in / 3k-out number is the floor, not the bill. In production three things quietly add tokens, and if you priced off the floor you eat them. Budget for all three before you quote.
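- Retries. When a call fails, times out, or returns malformed JSON, you re-send the whole prompt, and you pay for the failed attempt AND the retry. An agent that averages two attempts per task pays roughly double per task. Re-runs add up the same way: Cloudflare's 131,246 review runs over 48,095 merge requests work out to about 2.7 runs per merge request, so the cost per merge request is roughly 2.7 times the cost of a single run. The average run ($1.19) sitting above the median ($0.98) also points to a tail of unusually expensive runs, which is another reason not to price off the typical case.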
- Context bloat. Multi-turn agents resend the growing conversation on every step. Turn 1 sends 5k tokens; turn 10 might resend 50k because the whole history rides along each time. The input side compounds — measure a full conversation, not one turn.
- Tool calls. Every tool you give the model adds tokens before the task even starts. On Claude, the tool-use system prompt alone is ~290–500 input tokens depending on model, plus the JSON schema of each tool, plus every tool_result you feed back in. A 6-tool agent that calls 3 tools per task can add several thousand input tokens you never see in the 'happy path' math.
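All three effects can be roughed out before you quote. The sketch below is deliberately simple and rests on labeled assumptions: each conversation turn adds a fixed number of new tokens and resends the full history, and the per-tool schema size (150 tokens) and per-result size (600 tokens) are illustrative guesses, not measured values. The function names are made up for this example.

```python
def cost_with_retries(cost_per_attempt: float, attempts_per_task: float) -> float:
    """Every attempt is billed, including the ones that failed."""
    return cost_per_attempt * attempts_per_task


def conversation_input_tokens(turns: int, new_tokens_per_turn: int) -> int:
    """Total input tokens billed when every turn resends the full history.

    Turn k sends k * new_tokens_per_turn, so the total over n turns
    is new_tokens_per_turn * n * (n + 1) / 2.
    """
    return new_tokens_per_turn * turns * (turns + 1) // 2


def tool_overhead_tokens(tool_system_tokens: int, num_tools: int,
                         schema_tokens_per_tool: int, calls_per_task: int,
                         result_tokens_per_call: int) -> int:
    """Extra input tokens from the tool prompt, tool schemas, and tool results."""
    return (tool_system_tokens
            + num_tools * schema_tokens_per_tool
            + calls_per_task * result_tokens_per_call)


doubled = cost_with_retries(0.135, attempts_per_task=2)
history_total = conversation_input_tokens(turns=10, new_tokens_per_turn=5_000)
# Assumed sizes: 400-token tool prompt, 150 tokens per schema, 600 per result.
tool_extra = tool_overhead_tokens(400, num_tools=6, schema_tokens_per_tool=150,
                                  calls_per_task=3, result_tokens_per_call=600)

print(f"two attempts per task: ${doubled:.3f}")
print(f"10-turn conversation: {history_total:,} input tokens in total")
print(f"tool overhead per task: {tool_extra:,} input tokens")
```

The conversation figure is the one that surprises people: with 5k new tokens per turn, the tenth turn alone sends 50k, and the ten turns together bill several times that.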
Turn cost-per-task into a client price
Now you have a real cost floor, build the quote on top of it instead of guessing. Four steps:
- 1. Measure: run your agent on ~10 real tasks, average the input + output tokens, get your true cost-per-task (use the formula above).
- 2. Estimate volume: ask the client how many runs/month they expect, then assume MORE — usage always grows once it works. Price for the higher number.
- 3. Add a safety multiple: multiply your raw token cost by 3–5× to cover retries, longer-than-expected tasks, support, and your margin. Token cost should be a small slice of what you charge, not the whole price.
- 4. Quote a plan with a ceiling: e.g. 'up to 1,000 runs/month included, then $X per 100 after'. Now overage is the client's choice, not your loss.
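The four steps can be strung together in one small function. Treat this as a sketch: the three sample runs are invented stand-ins for your measured tasks, the prices are the Sonnet 4.6 figures from the pricing table, and the 4× multiple is one point inside the 3–5× range. The `build_quote` name and the returned keys are hypothetical.

```python
from statistics import mean


def build_quote(runs: list[tuple[int, int]], input_price: float,
                output_price: float, included_runs: int,
                multiple: float = 4.0) -> dict[str, float]:
    """Turn measured (input_tokens, output_tokens) pairs into a plan price.

    Prices are in dollars per 1M tokens. `included_runs` should already be
    the higher volume you decided to price for.
    """
    avg_in = mean(r[0] for r in runs)
    avg_out = mean(r[1] for r in runs)
    cost = avg_in / 1_000_000 * input_price + avg_out / 1_000_000 * output_price
    price_per_run = cost * multiple
    return {
        "cost_per_task": cost,
        "monthly_price": price_per_run * included_runs,
        "overage_per_100": price_per_run * 100,
    }


# Invented measurements; in practice, log these from ~10 real tasks.
sample_runs = [(28_000, 2_500), (32_000, 3_500), (30_000, 3_000)]
quote = build_quote(sample_runs, input_price=3.0, output_price=15.0,
                    included_runs=1_000)

print(f"cost per task:   ${quote['cost_per_task']:.3f}")
print(f"plan price:      ${quote['monthly_price']:.0f} for up to 1,000 runs")
print(f"overage per 100: ${quote['overage_per_100']:.0f}")
```

Rerun it with the 3× and 5× multiples to see the range you are choosing within before you settle on a number.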
A tier-based pricing table you can copy
If you sell more than one plan, stop quoting per task and sell tiers. Each tier caps which models the agent is allowed to use, which sets a hard ceiling on your cost-per-task — so you can publish a price with a known margin. This is the same 30k-in / 3k-out task from above, priced per tier at a 4× markup. Adjust the volume and markup to your own numbers.
| Plan | Models allowed | Your cost / task | Client price / task (4×) | Margin |
|---|---|---|---|---|
| Starter | Haiku 4.5 only | $0.045 | $0.18 | $0.135 |
| Pro | Haiku + Sonnet (routed 70/30) | $0.072 | $0.29 | $0.218 |
| Premium | Haiku + Sonnet + Opus (routed 60/25/15) | $0.0945 | $0.38 | $0.2855 |
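If you change the routing splits or the markup, a short script keeps the tier numbers honest. This sketch recomputes each tier's blended cost from its split, using the per-task costs from the worked example; the price is rounded to the cent before the margin is taken. It also prints the worst case per tier, meaning every task landing on the most expensive model that tier allows. Names such as `price_tier` are invented for the example.

```python
# $ per 30k-in / 3k-out task, from the worked example.
TASK_COST = {"haiku": 0.045, "sonnet": 0.135, "opus": 0.225}

# Share of tasks routed to each allowed model, per tier.
TIERS = {
    "Starter": {"haiku": 1.00},
    "Pro": {"haiku": 0.70, "sonnet": 0.30},
    "Premium": {"haiku": 0.60, "sonnet": 0.25, "opus": 0.15},
}
MARKUP = 4.0


def price_tier(mix: dict[str, float], markup: float = MARKUP):
    """Return (blended cost, client price, margin, worst-case cost) per task."""
    cost = sum(share * TASK_COST[model] for model, share in mix.items())
    price = round(cost * markup, 2)                      # quote to the cent
    ceiling = max(TASK_COST[model] for model in mix)     # all tasks on the top model
    return cost, price, price - cost, ceiling


table = {name: price_tier(mix) for name, mix in TIERS.items()}
for name, (cost, price, margin, ceiling) in table.items():
    print(f"{name:<8} cost ${cost:.4f}  price ${price:.2f}  "
          f"margin ${margin:.4f}  worst case ${ceiling:.3f}")
```

Comparing the client price with the worst-case column shows how much room each tier has if the routing split drifts toward the expensive model.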
Metering and billing per client — without building it yourself
The pricing model above has one moving part that's a pain to build: actually metering each client's usage against the model they're allowed to use, applying your markup, and drawing it down from a credit balance — per client, under your own brand. If you'd rather not write that billing layer, a gateway can do it. Knotie's AI Gateway is OpenAI-compatible: you keep the standard OpenAI SDK and point base_url at https://api.knotie.ai. You mint a virtual key per client and restrict that key to specific models — exactly the per-tier model ceiling from the table above (budget / mid / premium families: Claude, GPT, Gemini). Each key is pre-funded from Knotie Credits (100 credits = $1; $10 / 1,000 credits reserved per key, overflow drawing your main balance), and per key you set a profit multiplier and a price per credit, so the markup math you just did is enforced automatically and billed to the client. Clients can self-serve their own keys from your white-label portal, and you can whitelist approved domains per key.
The 6-point tokenomics checklist
Run this before you send any AI-agent quote:
- Do I know my agent's average input AND output tokens from REAL runs (not a guess)?
- Have I priced one task on the cheap, mid, AND frontier model so I know the spread?
- Am I routing — easy tasks to a cheap model, only the hard ones to the frontier?
- Am I caching the static part of my prompt (90% off on repeat reads)?
- Did I check the vendor's live pricing page TODAY, not from memory?
- Is my client price a clean unit (credit/minute/run) with a ceiling, at a 3–5× multiple over raw cost?
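For the caching item, a quick way to see what is at stake on the input side. This is a simplified sketch: it applies the 90% discount from the checklist to the static part of the prompt on every call, and it ignores any surcharge for writing the cache and the first uncached call. The 20k static / 10k dynamic split is an assumed example, and `input_cost_with_cache` is a name invented here.

```python
CACHE_READ_DISCOUNT = 0.90  # "90% off on repeat reads", per the checklist


def input_cost_with_cache(static_tokens: int, dynamic_tokens: int,
                          input_price: float,
                          discount: float = CACHE_READ_DISCOUNT) -> float:
    """Input cost per task when the static prompt is read from cache.

    `input_price` is in dollars per 1M tokens.
    """
    cached = static_tokens / 1_000_000 * input_price * (1 - discount)
    fresh = dynamic_tokens / 1_000_000 * input_price
    return cached + fresh


# Assumed split of the 30k input: 20k static instructions, 10k per-task data.
uncached = 30_000 / 1_000_000 * 3.0
cached = input_cost_with_cache(20_000, 10_000, input_price=3.0)
print(f"input cost per task: ${uncached:.3f} uncached, ${cached:.3f} cached")
```

The bigger the static share of your prompt, the more caching matters, so measure that split on real runs before counting on the saving.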