
Cost Optimization

Reduce LLM spend without sacrificing quality.

LLM costs scale with traffic, and they scale fast. This guide collects the tactics that consistently move the needle for PromptHelm users — each is measurable against per-prompt analytics so you can prove the delta before committing.

Match the model to the task

The simplest win. Reserve flagship models for tasks that demonstrably benefit from them; route routine work to cheaper tiers.

  • Routine classification: GPT-4.1 Mini, Claude Haiku 4.5, Gemini Flash-Lite. 5-20× cheaper than flagships at near-identical accuracy on short-context tasks.
  • Multi-step reasoning: GPT-5, Claude Sonnet 4.5. Worth the cost when chain-of-thought matters.
  • Bulk extraction: DeepSeek V3 or Gemini Flash. Cheap, fast, parallelizable.
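
A minimal routing sketch in TypeScript (the tier names and model IDs below are illustrative stand-ins mirroring the list above, not PromptHelm built-ins):

```typescript
// Illustrative only: map each task tier from the list above to a model ID.
type TaskTier = "routine-classification" | "multi-step-reasoning" | "bulk-extraction";

const MODEL_BY_TIER: Record<TaskTier, string> = {
  "routine-classification": "gpt-4.1-mini",    // cheap, near-flagship on short context
  "multi-step-reasoning": "claude-sonnet-4-5", // worth it when chain-of-thought matters
  "bulk-extraction": "gemini-flash",           // cheap, fast, parallelizable
};

function modelFor(tier: TaskTier): string {
  return MODEL_BY_TIER[tier];
}
```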

Enable prompt caching

Both OpenAI and Anthropic discount repeated input tokens when the system prompt and prefix are stable. Cached input is billed at a steep discount relative to the standard input rate; the exact percentage varies by provider and model.

  • Keep the system prompt stable across calls (don't interpolate timestamps or per-user data into it).
  • Push variable content (user input, document snippets) to the end of the prompt so the prefix stays cacheable.
  • Watch the analytics view's "cached tokens" tile to confirm the cache is hitting.
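
A minimal sketch of the prefix-stable layout, shown with the OpenAI Node SDK (the model ID and prompt text are placeholders; the same structure applies to Anthropic):

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Byte-identical on every call, so the provider can reuse the cached prefix.
const SYSTEM_PROMPT =
  "You are a support assistant. Answer strictly from the provided context.";

async function answer(context: string, question: string) {
  return client.chat.completions.create({
    model: "gpt-4.1-mini", // placeholder
    messages: [
      { role: "system", content: SYSTEM_PROMPT }, // stable, cacheable prefix
      // Variable content goes last, so it never invalidates the shared prefix.
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });
}
```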

Tighten maxTokens

Most prompts are over-budgeted on output. Profile a representative sample of real responses, then set maxTokens to roughly 1.5× the observed p95 output length. A lower ceiling caps what a runaway, worst-case generation can ever cost you.
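
A sketch of that arithmetic, assuming you have exported a sample of output token counts from analytics (the helper name is made up):

```typescript
// Hypothetical helper: derive a maxTokens ceiling from observed output sizes.
function suggestMaxTokens(outputTokenCounts: number[]): number {
  const sorted = [...outputTokenCounts].sort((a, b) => a - b);
  const p95 = sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95))];
  return Math.ceil(p95 * 1.5); // ~1.5x p95 leaves headroom without runaway risk
}

suggestMaxTokens([95, 120, 150, 180, 210]); // => 315
```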

Use stop sequences to truncate early

When the model is reliably appending boilerplate after the useful output (closing remarks, JSON schema repetition), set a stopSequence that cuts generation as soon as the useful part is done. Stops save both cost and latency.
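
A sketch using the OpenAI SDK's stop parameter (the delimiter and prompt are examples; profile your own outputs to pick the cut point):

```typescript
import OpenAI from "openai";

const client = new OpenAI();

async function extractOrderId(email: string) {
  return client.chat.completions.create({
    model: "gpt-4.1-mini", // placeholder
    messages: [
      { role: "user", content: `Return the order ID as JSON:\n${email}` },
    ],
    // Generation (and billing) ends the moment this delimiter is produced.
    stop: ["\n\nExplanation:"],
  });
}
```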

A/B test for quality parity

Use the Playground to fork a candidate version that swaps the model or trims maxTokens, then promote it to a small slice of traffic. After a few hundred calls, the analytics view makes the cost-vs-quality tradeoff explicit — promote fully only when the cheaper version holds.
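
Conceptually, a traffic slice is a weighted coin flip per request; a minimal sketch of the idea (not the PromptHelm API):

```typescript
// Illustrative only: send a small share of calls to the candidate version.
function pickVersion(candidateShare = 0.05): "candidate" | "incumbent" {
  return Math.random() < candidateShare ? "candidate" : "incumbent";
}
```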

Monitor and alert

Add a saved filter on the Cost dashboard that groups by model and promptSlug. If a single prompt's cost doubles week-over-week without a traffic spike, it's almost always a recent version that ballooned the system prompt or switched to a more expensive model.
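
A sketch of that alert condition (the data shape is assumed, not a PromptHelm export format):

```typescript
interface WeeklyStats {
  cost: number;  // total spend for the week
  calls: number; // request volume for the week
}

// Flag prompts whose cost doubled while traffic stayed roughly flat.
function shouldAlert(prev: WeeklyStats, curr: WeeklyStats): boolean {
  const costRatio = curr.cost / prev.cost;
  const trafficRatio = curr.calls / prev.calls;
  return costRatio >= 2 && trafficRatio < 1.5;
}
```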

Cost is per-version

Analytics let you slice cost by prompt version, which is the fastest way to find "the change that doubled spend" — open the by-version breakdown on a high-cost prompt and look for the inflection point.
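
A sketch of that inflection-point scan (the data shape is assumed; in practice the by-version breakdown shows this at a glance):

```typescript
interface VersionCost {
  version: string;
  costPerCall: number; // average cost per request for that version
}

// Find the version whose per-call cost jumped the most over its predecessor.
function findInflection(history: VersionCost[]): VersionCost | undefined {
  let worst: VersionCost | undefined;
  let worstRatio = 1;
  for (let i = 1; i < history.length; i++) {
    const ratio = history[i].costPerCall / history[i - 1].costPerCall;
    if (ratio > worstRatio) {
      worstRatio = ratio;
      worst = history[i];
    }
  }
  return worst;
}
```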
