Intermediate · 14 min

Model Configuration

The six parameters that separate a production LLM call from a fragile prototype: temperature, max_tokens, timeout, max_retries, rate limiting, and usage tracking. Each has a failure mode that won't surface until you're in production.

Quick Reference

→ temperature=0 for extraction/classification, 0.7–1.0 for creative, 0–0.3 for code
→ max_tokens caps billed output length; for structured output, set it to schema size + 20% buffer
→ timeout should be 2× your observed p99 latency, which is especially critical for reasoning models
→ max_retries=6 by default; raise to 10–15 for long-running agent tasks on unreliable networks
→ InMemoryRateLimiter throttles requests before the provider returns a 429, but it only works within a single process
→ get_usage_metadata_callback() is the cleanest way to track tokens; avoid it in long-running processes (memory leak in #32300)
→ model.bind(stop=['\n']) overrides params per-call without creating a new model instance
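
As an illustration of the timeout rule in the list above, here is a minimal sketch that derives a timeout from logged latencies. The function names `p99` and `suggested_timeout`, the nearest-rank percentile method, and the sample data are assumptions made for this example; they are not part of any library.

```python
def p99(latencies_s):
    """Nearest-rank 99th percentile of latency samples, in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # 1-based nearest rank: ceil(0.99 * n), computed with integers
    # to avoid floating-point rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def suggested_timeout(latencies_s, multiplier=2.0):
    """Apply the rule of thumb: timeout = multiplier x observed p99."""
    return multiplier * p99(latencies_s)


# Hypothetical sample: 98 fast calls and two slow outliers.
samples = [1.0] * 98 + [4.0, 9.0]
print(suggested_timeout(samples))  # 2 x p99, where p99 is 4.0 here -> 8.0
```

The value this returns is only as good as the sample behind it, so it is worth recomputing whenever the model, prompt size, or provider changes.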

When Configuration Actually Matters

Default configuration is fine for a demo. In production, it's a liability. Every parameter has a failure mode, and most of them are silent. Defaults are designed to work for most inputs, not your inputs. Across all parameters you are tuning three axes: quality (does the output match what I need?), cost (how many tokens am I burning?), and reliability (does this call succeed consistently?).

Figure: every config parameter sits on one of three axes (quality, cost, or reliability), and each axis can be tuned independently.

One parameter, two axes

max_tokens sits at the intersection of quality and cost. Too low: truncated output, possibly mid-JSON. Too high: nothing bounds a runaway generation, so a rambling or looping response is billed in full (you pay for the tokens actually generated, not for the cap itself). The right value is schema size + 20% buffer for structured output, or an empirical p99 from your production logs for open-ended generation.

Temperature: What to Set and Why

Temperature rescales the probability distribution over next tokens. At 0, the model always picks the highest-probability token. At 1.0, it samples from the model's unmodified distribution. This isn't a creativity dial; it's a consistency vs. diversity trade-off. For tasks where there's one right answer, low temperature reduces variance. For tasks where many answers are valid, higher temperature explores the space.
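
To make the consistency vs. diversity trade-off concrete, here is a minimal sketch of temperature-scaled softmax over a toy set of logits. The function name `token_probabilities` and the logit values are invented for illustration, and treating temperature 0 as pure argmax is the usual convention rather than a statement about how any particular provider implements it.

```python
import math


def token_probabilities(logits, temperature):
    """Turn raw logits into next-token probabilities at a given temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy decoding: all probability mass on the top-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]
for t in (0, 0.3, 1.0):
    print(t, [round(p, 3) for p in token_probabilities(logits, t)])
```

Lower temperatures concentrate probability on the top token, which is why they reduce run-to-run variance on tasks with a single right answer.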

Token Limits and Cost Math

max_tokens caps how many output tokens the model generates. It has no effect on input cost: you pay for input tokens regardless of this setting. For structured output, the calculation is: estimate your maximum JSON response size in tokens (roughly characters ÷ 3.5), add a 20% buffer for reasoning preamble, and set max_tokens to the result. For open-ended generation, use the empirical p99 from your production logs.
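
The structured-output calculation above can be sketched in a few lines. The helper name `estimate_max_tokens` and the sample payload are hypothetical, and the 3.5 characters-per-token ratio is the rough heuristic from the text, not a tokenizer measurement; where a real tokenizer is available, counting with it would be more accurate.

```python
import json
import math

CHARS_PER_TOKEN = 3.5  # rough heuristic from the text, not a tokenizer count


def estimate_max_tokens(worst_case_json: str, buffer_pct: int = 20) -> int:
    """Estimate max_tokens as (characters / 3.5) plus a percentage buffer."""
    estimated_tokens = math.ceil(len(worst_case_json) / CHARS_PER_TOKEN)
    return math.ceil(estimated_tokens * (100 + buffer_pct) / 100)


# Hypothetical worst-case response for an extraction schema.
worst_case = json.dumps({
    "name": "x" * 80,
    "tags": ["tag-%02d" % i for i in range(10)],
    "summary": "y" * 400,
})
print(len(worst_case), estimate_max_tokens(worst_case))
```

Build the worst-case sample from the longest values your schema allows, not from a typical response; otherwise the buffer is consumed before the edge cases arrive.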
