TokenMaxxing — sqirvy.xyz

Most teams can cut their LLM costs by 2–5x just by changing how they structure prompts and sessions—without switching providers or dumbing down quality. The core levers are: move as many tokens as possible into prompt cache hits, stop re‑sending useless history, and route low‑value work to cheaper models. This post walks through concrete patterns you can apply in an afternoon to turn “mystery AI bills” into predictable, efficient spend.

Tokenmaxxing is the practice (and mild obsession) of driving your LLM token costs as low as possible—LooksMaxxing, but for AI bills.

When you use a paid AI model, you’re charged based on tokens. Tokens are the subword chunks the model actually processes; they’re the unit you pay for.

Input tokens: everything you send plus any context/history the server resends.
Output tokens: what the model generates.
Cache tokens: input tokens that hit the prompt cache and are billed at a discounted rate.

For example, Anthropic’s current Claude Sonnet class models are in the ballpark of:

\$3 / Mtok for input tokens
\$15 / Mtok for output tokens
5–6x cheaper per Mtok for cache hits (up to ~90% discount vs base input)
1.25–2x more expensive per Mtok for cache writes, depending on TTL (5 min vs 1 hour)

You control everything except how many output tokens the user asks for and what the model decides to emit. That means most of your leverage is in reducing normal input tokens and pushing as many as possible into cache.

1. Exploit prompt caching hard

The single biggest lever for tokenmaxxing is turning as many tokens as possible into prompt cache hits instead of normal input. Prompt caching is the model “remembering” the heavy, unchanging front part of the prompt so it doesn’t have to reread it every time. The server turns that stable prefix (system prompt, tools, repo snapshot, long instructions) into an internal representation once, then reuses it on later calls with the same prefix. New requests only process the changed tail of the prompt (latest user message, recent edits), so responses are faster and cheaper with identical behavior.

Caching is implemented on the server, but you (or your agent) are responsible for structuring prompts so the cache can hit.

Design prompts for stable prefixes
- Modern LLM runtimes and agents (Claude API, Claude Code, Bedrock, etc.) support explicit prompt caching.
- The server looks for identical initial text blocks in your request; if it has seen that exact prefix before with matching cache_control, it reuses the cached representation.
- That first instance was already expanded into the model’s internal K/V cache, so subsequent calls only add the changing suffix.
- Treat “prompt layout stability” as a first‑class invariant in your code.
- Test your setup with tooling (Claude Console, Claude Code /cost//context, provider dashboards) that surfaces cache usage.
Avoid prefix jitter
- Cache only hits on exact prefix matches; any jitter at the top kills savings.
- Don’t inject timestamps, random IDs, or volatile user metadata near the top of your system prompt.
- Push user‑specific, fast‑changing bits as late as possible (final user message, last few turns) so the long prefix remains identical.
- In multi‑step flows or agents, reuse a fixed frame prompt and pass small deltas as parameters instead of regenerating a slightly different mega‑prompt each step.

Anthropic’s Claude API exposes this via cache_control markers on prompt blocks. Claude Code uses its own heuristics to decide which parts of your workspace go into the cachable prefix automatically.

2. Clamp output and conversation growth

You can save a surprising amount just by not letting outputs and history bloat.

Compact or clear context aggressively
- Avoid carrying historical messages that are no longer relevant to the current question.
- Summarize older turns into short state blobs instead of resending full transcripts.
Set realistic max_tokens
- For API calls, start with modest max_tokens and only increase until quality breaks.
- For code, you often don’t need 4k tokens when 512–1k is plenty for a patch or explanation.
Keep internal reasoning lean
- For non‑user‑visible content (scratchpads, tool calls, “internal thoughts”), keep them terse and structured, not essay‑length.

Because output tokens are several times more expensive than input for the same model, giant responses are a direct multiplier on your bill.

3. Route smartly across models

Tokenmaxxing isn’t just fewer tokens; it’s also cheaper tokens where quality is sufficient.

Prefer cheaper models where possible
- Use smaller / cheaper variants (e.g., Claude Haiku vs Sonnet) or other providers for classification, light transforms, format conversions, and short reasoning.
- Cost deltas can be 10–50x between model tiers, so this routing matters a lot.
Save big models for high‑leverage work
- Reserve the strongest models for synthesis, tricky debugging, complex refactors, or planning that genuinely needs more context and reasoning depth.
Hybrid agent patterns
- In agent flows, do plan → subtask decomposition on a stronger model once, then execute subtasks on cheaper models using the plan as a compressed spec.

Think in terms of “token value density”: how much actual business value per 1k tokens does this segment produce, and can you shift low‑value work to cheaper models or delete it entirely?

4. Instrument and iterate

To tokenmaxx effectively, you need feedback loops around real usage, not vibes.

Track the right metrics
- Log input_tokens, cache_read_input_tokens, cache_creation_input_tokens, and output_tokens per endpoint and per prompt class.
- Monitor cache hit rates vs. normal input; a sudden drop usually means your prefix changed.[^10]
Alert on regressions
- Put alerts on sharp changes: for example, a deploy that injects timestamps into the system prompt will show as a drop in cache reads and a spike in normal input tokens.
- A simple pattern is to log a per‑request “prompt signature” hash of the static prefix; drift in that hash is often your first smell that something broke.

Over a few iterations, this gives you a tight loop: change prompt layout → observe cache hits and cost → keep what works, roll back what doesn’t.

Appendix: Claude Code

With Claude Code, tokenmaxxing is mostly about how you use the CLI / VS Code client, since it handles prompt caching under the hood. Your levers are session hygiene, /cost and /context, and how you scope tasks. See the Claude Code docs for prompt caching details.

Instrumentation

Lean on Claude Code’s built‑in caching (most modern coding agents do something similar).
Use /cost and /context as lightweight instrumentation.
/cost shows tokens and implied usage cost for the current session, even on subscription plans.[^11]
/context shows how much of the context window is in use, which approximates how much gets re‑examined on each turn.
Run them after a few heavy rounds; if context is huge and marginal cost per message is high, it’s time to /clear or start a new session.

This is analogous to profiling your program before optimizing: you want to find and fix the outlier sessions.

Practice “session hygiene”

Use /clear between unrelated tasks so old context isn’t dragged into new work.
Be specific to reduce turn count: “Add validation for X in file Y” is more likely to complete in one round than “fix auth.”
Shorter, focused sessions mean fewer re‑reads of large CLAUDE.md or project context, even when cached.

Control code size and attached context

You can indirectly influence how many tokens Claude Code has to chew through.

Keep CLAUDE.md and other instruction files tight; a 15k‑token CLAUDE.md dominates your token profile and gets pulled into the cache repeatedly.
When asking about large files, scope the request to specific functions, regions, or modules instead of “read the whole repo and refactor X.”
Avoid dumping huge logs or full minified bundles; trim to relevant excerpts or use filters.
The right mental model is: “What’s the minimal context Claude needs to do this step?”

Clamp output verbosity

Claude Code can emit huge code blocks if you let it, and those output tokens are the expensive ones.

Ask for focused patches or diffs instead of full‑file rewrites when possible.
When you only need a helper or a small refactor, say so (“only output the new function and a brief explanation”).
If it restates large chunks of context, explicitly tell it not to repeat earlier code or instructions.

Output discipline matters because those tokens are where your costs spike fastest.