# Cost Optimization
Reduce LLM spend without sacrificing quality.
LLM costs scale with traffic, and they scale fast. This guide collects the tactics that consistently move the needle for PromptHelm users — each is measurable against per-prompt analytics so you can prove the delta before committing.
## Match the model to the task
The simplest win. Reserve flagship models for tasks that demonstrably benefit from them; route routine work to cheaper tiers.
| Task | Recommended models | Why |
|---|---|---|
| Routine classification | GPT-4.1 Mini, Claude Haiku 4.5, Gemini Flash-Lite | 5-20× cheaper than flagships at near-identical accuracy on short-context tasks. |
| Multi-step reasoning | GPT-5, Claude Sonnet 4.5 | Worth the cost when chain-of-thought matters. |
| Bulk extraction | DeepSeek V3, Gemini Flash | Cheap, fast, parallelizable. |
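A minimal routing sketch under these tiers. The `TaskType` union, `pickModel` helper, and exact model ID strings are illustrative assumptions, not a PromptHelm API:

```typescript
// Hypothetical router: map each task class to the cheapest model
// tier that handles it well (tiers from the table above).
type TaskType = "classification" | "reasoning" | "extraction";

const MODEL_FOR_TASK: Record<TaskType, string> = {
  classification: "gpt-4.1-mini",  // cheap tier, near-flagship accuracy on short contexts
  reasoning: "claude-sonnet-4-5",  // flagship tier, worth it when chain-of-thought matters
  extraction: "gemini-flash",      // cheap, fast, parallelizable bulk work
};

function pickModel(task: TaskType): string {
  return MODEL_FOR_TASK[task];
}
```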
## Enable prompt caching
Both OpenAI and Anthropic discount repeated input tokens when the system prompt and prefix are stable. Cached input is billed at a steep discount, roughly 25% of the standard rate, though the exact figure varies by provider and model.
- Keep the system prompt stable across calls (don't interpolate timestamps or per-user data into it).
- Push variable content (user input, document snippets) to the end of the prompt so the prefix stays cacheable.
- Watch the analytics view's "cached tokens" tile to confirm the cache is hitting.
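As a concrete sketch of the prefix rule, here is the call shape with Anthropic's explicit `cache_control` marker (OpenAI caches stable prefixes automatically, with no flag needed). The model ID, system text, and `classify` helper are placeholders:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Stable, cacheable prefix: no timestamps or per-user data here.
// Note: providers only cache prefixes above a minimum length
// (~1024 tokens on Anthropic); a prompt this short is illustrative.
const SYSTEM_PROMPT =
  "You are a support classifier. Categories: billing, bug, feature.";

async function classify(ticketText: string) {
  return client.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 64,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        // Mark the stable prefix as cacheable; later calls sharing
        // this exact prefix read it at the discounted cached rate.
        cache_control: { type: "ephemeral" },
      },
    ],
    // Variable content goes last so the prefix stays byte-identical.
    messages: [{ role: "user", content: ticketText }],
  });
}
```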
## Tighten `maxTokens`
Most prompts are over-budgeted on output. Profile a representative sample of real responses, then set `maxTokens` to roughly 1.5× the observed p95 output length. Lower ceilings remove the worst-case spend from the math entirely.
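A sketch of the budgeting arithmetic, assuming you have exported a sample of output token counts from analytics (the `sampleOutputTokens` values here are placeholders):

```typescript
// Set maxTokens from observed output lengths: roughly 1.5× the p95.
function p95(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length * 0.95)];
}

const sampleOutputTokens = [120, 140, 95, 210, 180 /* ...real samples */];
const maxTokens = Math.ceil(p95(sampleOutputTokens) * 1.5);
// e.g. p95 = 210 → maxTokens = 315, instead of a default 4096 ceiling.
```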
## Use stop sequences to truncate early
When the model is reliably appending boilerplate after the useful output (closing remarks, JSON schema repetition), set a `stopSequence` that cuts generation as soon as the useful part is done. Stops save both cost and latency.
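For example, with the OpenAI SDK the `stop` parameter ends generation the moment the sequence appears. The delimiter here is a hypothetical boilerplate marker; use whatever reliably precedes the junk in your own outputs:

```typescript
import OpenAI from "openai";

const client = new OpenAI();

const completion = await client.chat.completions.create({
  model: "gpt-4.1-mini",
  messages: [{ role: "user", content: "Extract the order ID as JSON." }],
  // Generation halts as soon as the model emits "\n\n---",
  // cutting off any closing remarks that would follow it.
  stop: ["\n\n---"],
  max_tokens: 128,
});
```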
## A/B test for quality parity
Use the Playground to fork a candidate version that swaps the model or trims `maxTokens`, then promote it to a small slice of traffic. After a few hundred calls, the analytics view makes the cost-vs-quality tradeoff explicit; promote fully only when the cheaper version holds.
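PromptHelm handles the version split for you, but if you route the slice yourself, a deterministic hash keeps each caller pinned to one variant so the comparison stays stable across calls. This is a generic sketch, not PromptHelm's API:

```typescript
// Deterministic 5% slice: hashing the caller ID means a given user
// always lands on the same variant.
function inCandidateSlice(userId: string, slicePct = 5): boolean {
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // unsigned 32-bit
  }
  return hash % 100 < slicePct;
}

const version = inCandidateSlice("user_42") ? "candidate" : "production";
```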
## Monitor and alert
Add a saved filter on the Cost dashboard that groups by model and `promptSlug`. If a single prompt's cost trend doubles week-over-week without a traffic spike, it's almost always a recent version that ballooned the system prompt or unlocked a more expensive model.
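If you want the same check outside the dashboard, the week-over-week test is a pair of ratios. A sketch, assuming you can pull weekly cost and call totals per prompt (the `WeeklyStat` shape and both thresholds are illustrative):

```typescript
// Flag a prompt whose cost doubled week-over-week without a
// matching traffic increase.
interface WeeklyStat {
  cost: number;  // total spend for the week, in dollars
  calls: number; // total calls for the week
}

function costAlert(lastWeek: WeeklyStat, thisWeek: WeeklyStat): boolean {
  const costRatio = thisWeek.cost / lastWeek.cost;
  const trafficRatio = thisWeek.calls / lastWeek.calls;
  // Cost doubled but traffic grew < 20%: suspect a version change.
  return costRatio >= 2 && trafficRatio < 1.2;
}
```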
## Cost is per-version
Analytics let you slice cost by prompt version, which is the fastest way to find "the change that doubled spend": open the by-version breakdown on a high-cost prompt and look for the inflection point.
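To spot the inflection point programmatically, compare mean cost per call across consecutive versions. A sketch over an assumed export of per-version totals; the `VersionStat` shape and the 1.5× jump threshold are illustrative:

```typescript
// Find the first version whose cost-per-call jumped sharply
// relative to its predecessor.
interface VersionStat {
  version: string;
  cost: number;  // total spend attributed to this version
  calls: number; // total calls served by this version
}

function findInflection(stats: VersionStat[]): string | undefined {
  for (let i = 1; i < stats.length; i++) {
    const prev = stats[i - 1].cost / stats[i - 1].calls;
    const curr = stats[i].cost / stats[i].calls;
    if (curr > prev * 1.5) return stats[i].version;
  }
  return undefined; // no sharp jump found
}
```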