
Cost Optimization

Reduce LLM spend without sacrificing quality.

LLM costs scale with traffic, and they scale fast. This guide collects the tactics that consistently move the needle for PromptHelm users — each is measurable against per-prompt analytics so you can prove the delta before committing.

Match the model to the task

The simplest win. Reserve flagship models for tasks that demonstrably benefit from them; route routine work to cheaper tiers.

  • Routine classification: GPT-4.1 Mini, Claude Haiku 4.5, Gemini Flash-Lite. 5-20× cheaper than flagships at near-identical accuracy on short-context tasks.
  • Multi-step reasoning: GPT-5, Claude Sonnet 4.5. Worth the cost when chain-of-thought matters.
  • Bulk extraction: DeepSeek V3 or Gemini Flash. Cheap, fast, parallelizable.
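
A minimal routing sketch in TypeScript (the tier names and model IDs below are illustrative stand-ins mirroring the list above, not PromptHelm built-ins):

```typescript
// Illustrative only: map each task tier from the list above to a model ID.
type TaskTier = "routine-classification" | "multi-step-reasoning" | "bulk-extraction";

const MODEL_BY_TIER: Record<TaskTier, string> = {
  "routine-classification": "gpt-4.1-mini",    // cheap, near-flagship on short context
  "multi-step-reasoning": "claude-sonnet-4-5", // worth it when chain-of-thought matters
  "bulk-extraction": "gemini-flash",           // cheap, fast, parallelizable
};

function modelFor(tier: TaskTier): string {
  return MODEL_BY_TIER[tier];
}
```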

Enable prompt caching

Both OpenAI and Anthropic discount repeated input tokens when the system prompt and prefix are stable. Cached input is billed at a steep discount relative to the standard input rate; the exact percentage varies by provider and model.

  • Keep the system prompt stable across calls (don't interpolate timestamps or per-user data into it).
  • Push variable content (user input, document snippets) to the end of the prompt so the prefix stays cacheable.
  • Watch the analytics view's "cached tokens" tile to confirm the cache is hitting.
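
A minimal sketch of the prefix-stable layout, shown with the OpenAI Node SDK (the model ID and prompt text are placeholders; the same structure applies to Anthropic):

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Byte-identical on every call, so the provider can reuse the cached prefix.
const SYSTEM_PROMPT =
  "You are a support assistant. Answer strictly from the provided context.";

async function answer(context: string, question: string) {
  return client.chat.completions.create({
    model: "gpt-4.1-mini", // placeholder
    messages: [
      { role: "system", content: SYSTEM_PROMPT }, // stable, cacheable prefix
      // Variable content goes last, so it never invalidates the shared prefix.
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });
}
```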

Tighten maxTokens

Most prompts are over-budgeted on output. Profile a representative sample of real responses, then set maxTokens to roughly 1.5× the observed p95 output length. A lower ceiling caps what a runaway, worst-case generation can ever cost you.
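
A sketch of that arithmetic, assuming you have exported a sample of output token counts from analytics (the helper name is made up):

```typescript
// Hypothetical helper: derive a maxTokens ceiling from observed output sizes.
function suggestMaxTokens(outputTokenCounts: number[]): number {
  const sorted = [...outputTokenCounts].sort((a, b) => a - b);
  const p95 = sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95))];
  return Math.ceil(p95 * 1.5); // ~1.5x p95 leaves headroom without runaway risk
}

suggestMaxTokens([95, 120, 150, 180, 210]); // => 315
```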

Use stop sequences to truncate early

When the model is reliably appending boilerplate after the useful output (closing remarks, JSON schema repetition), set a stopSequence that cuts generation as soon as the useful part is done. Stops save both cost and latency.
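
A sketch using the OpenAI SDK's stop parameter (the delimiter and prompt are examples; profile your own outputs to pick the cut point):

```typescript
import OpenAI from "openai";

const client = new OpenAI();

async function extractOrderId(email: string) {
  return client.chat.completions.create({
    model: "gpt-4.1-mini", // placeholder
    messages: [
      { role: "user", content: `Return the order ID as JSON:\n${email}` },
    ],
    // Generation (and billing) ends the moment this delimiter is produced.
    stop: ["\n\nExplanation:"],
  });
}
```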

A/B test for quality parity

Use the Playground to fork a candidate version that swaps the model or trims maxTokens, then promote it to a small slice of traffic. After a few hundred calls, the analytics view makes the cost-vs-quality tradeoff explicit — promote fully only when the cheaper version holds.
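
Conceptually, a traffic slice is a weighted coin flip per request; a minimal sketch of the idea (not the PromptHelm API):

```typescript
// Illustrative only: send a small share of calls to the candidate version.
function pickVersion(candidateShare = 0.05): "candidate" | "incumbent" {
  return Math.random() < candidateShare ? "candidate" : "incumbent";
}
```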

Monitor and alert

Add a saved filter on the Cost dashboard that groups by model and promptSlug. If a single prompt's cost doubles week-over-week without a traffic spike, it's almost always a recent version that ballooned the system prompt or switched to a more expensive model.
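
A sketch of that alert condition (the data shape is assumed, not a PromptHelm export format):

```typescript
interface WeeklyStats {
  cost: number;  // total spend for the week
  calls: number; // request volume for the week
}

// Flag prompts whose cost doubled while traffic stayed roughly flat.
function shouldAlert(prev: WeeklyStats, curr: WeeklyStats): boolean {
  const costRatio = curr.cost / prev.cost;
  const trafficRatio = curr.calls / prev.calls;
  return costRatio >= 2 && trafficRatio < 1.5;
}
```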

Cost is per-version

Analytics let you slice cost by prompt version, which is the fastest way to find "the change that doubled spend" — open the by-version breakdown on a high-cost prompt and look for the inflection point.
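
A sketch of that inflection-point scan (the data shape is assumed; in practice the by-version breakdown shows this at a glance):

```typescript
interface VersionCost {
  version: string;
  costPerCall: number; // average cost per request for that version
}

// Find the version whose per-call cost jumped the most over its predecessor.
function findInflection(history: VersionCost[]): VersionCost | undefined {
  let worst: VersionCost | undefined;
  let worstRatio = 1;
  for (let i = 1; i < history.length; i++) {
    const ratio = history[i].costPerCall / history[i - 1].costPerCall;
    if (ratio > worstRatio) {
      worstRatio = ratio;
      worst = history[i];
    }
  }
  return worst;
}
```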
