Intermediate · 14 min

Model Configuration

The six parameters that separate a production LLM call from a fragile prototype: temperature, max_tokens, timeout, max_retries, rate limiting, and usage tracking. Each has a failure mode that won't surface until you're in production.

Quick Reference

  • temperature=0 for extraction/classification, 0.7–1.0 for creative, 0–0.3 for code
  • max_tokens caps billed output length — set it to schema size + 20% buffer for structured output
  • timeout should be 2× your observed p99 latency — especially critical for reasoning models
  • max_retries=6 by default; raise to 10–15 for long-running agent tasks on unreliable networks
  • InMemoryRateLimiter throttles before the 429 — but only works within a single process
  • get_usage_metadata_callback() is the cleanest way to track tokens; avoid it in long-running processes (memory leak in #32300)
  • model.bind(stop=['\n']) overrides params per-call without creating a new model instance
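
A minimal sketch of how the quick-reference rules might be wired together. The helper names (build_model_kwargs, make_model, invoke_with_usage), the task labels, the specific temperatures picked inside each range, and the 2 requests/second quota are assumptions for illustration, not part of LangChain. It assumes recent langchain and langchain-core releases (ones that ship get_usage_metadata_callback). LangChain imports are kept inside the functions so the pure-Python helper runs on its own.

```python
# Temperature presets from the quick reference; the task names are illustrative.
TEMPERATURE_BY_TASK = {
    "extraction": 0.0,
    "classification": 0.0,
    "code": 0.2,      # anywhere in 0-0.3
    "creative": 0.8,  # anywhere in 0.7-1.0
}


def build_model_kwargs(task, p99_latency_s, max_tokens, long_running_agent=False):
    """Translate the quick-reference rules into keyword arguments for a chat model."""
    return {
        "temperature": TEMPERATURE_BY_TASK[task],
        "timeout": 2 * p99_latency_s,  # 2x observed p99 latency
        # 6 is the default cited above; 10 is the low end of the 10-15 range.
        "max_retries": 10 if long_running_agent else 6,
        "max_tokens": max_tokens,
    }


def make_model(model_name, **kwargs):
    """Build a rate-limited chat model from the kwargs above."""
    from langchain.chat_models import init_chat_model
    from langchain_core.rate_limiters import InMemoryRateLimiter

    # Assumed quota. The limiter lives in this process only, so several
    # workers sharing one API key would each get their own budget.
    limiter = InMemoryRateLimiter(requests_per_second=2)
    return init_chat_model(model_name, rate_limiter=limiter, **kwargs)


def invoke_with_usage(model, prompt):
    """Invoke once with a per-call stop sequence and return (reply, token usage)."""
    from langchain_core.callbacks import get_usage_metadata_callback

    with get_usage_metadata_callback() as cb:
        # bind() applies the stop sequence to this call without rebuilding the model.
        reply = model.bind(stop=["\n"]).invoke(prompt)
    return reply, cb.usage_metadata


kwargs = build_model_kwargs("extraction", p99_latency_s=4.5, max_tokens=480)
print(kwargs)
```

With a provider package installed and an API key configured, something like make_model("openai:gpt-4o-mini", **kwargs) would build the model; the model name here is only an example. Scoping the usage callback to a single call, as invoke_with_usage does, is one way to keep it out of long-lived state.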

When Configuration Actually Matters

Default configuration is fine for a demo. In production, it's a liability. Every parameter has a failure mode — and most of them are silent. Defaults are designed to work for most inputs, not your inputs. The three axes you're tuning across all parameters are quality (does the output match what I need?), cost (how many tokens am I burning?), and reliability (does this call succeed consistently?).

[Figure: init_chat_model( … ) parameters grouped by axis]
  • Quality: temperature (0 = consistent), max_tokens (cap output length)
  • Cost: max_tokens (limits billed tokens), model choice ($/MTok varies 100×)
  • Reliability: max_retries (default 6), timeout (fail fast or wait), rate_limiter (throttle before 429)
  • max_tokens affects both quality and cost, which makes it the lever you'll tune most

Every config parameter sits on one of three axes — tune quality, cost, or reliability independently

One parameter, two axes

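max_tokens sits at the intersection of quality and cost. Too low: truncated output, possibly mid-JSON. Too high: nothing caps a runaway generation, and you pay for every token it emits. You are billed for the tokens actually generated, not for the cap itself, so the goal is a ceiling just above the longest legitimate output. The right value is schema size + 20% buffer for structured output, or an empirical p99 from your production logs for open-ended generation.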
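
The two sizing rules above can be sketched in plain Python. Everything here is illustrative: the function names are hypothetical, the 400-token schema size is a made-up figure (in practice you would measure it with your provider's tokenizer), and the logged token counts are synthetic stand-ins for production data. The last helper shows why a cap that is too low hurts structured output: a reply cut off mid-object no longer parses.

```python
import json
import math


def max_tokens_for_schema(schema_tokens, buffer_pct=20):
    """Structured output: schema size plus a percentage buffer, rounded up."""
    return math.ceil(schema_tokens * (100 + buffer_pct) / 100)


def p99_nearest_rank(values):
    """99th percentile by the nearest-rank method (no interpolation)."""
    ordered = sorted(values)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def is_complete_json(text):
    """True if the text parses as JSON; a reply cut off at max_tokens usually does not."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Synthetic output-token counts standing in for production logs.
logged_output_tokens = list(range(101, 301))  # 200 calls, 101..300 tokens each

structured_cap = max_tokens_for_schema(400)              # made-up schema size
open_ended_cap = p99_nearest_rank(logged_output_tokens)

# A reply truncated mid-object, as a too-low max_tokens might produce.
truncated_reply = '{"name": "Ada", "tags": ["engineer"'

print(structured_cap, open_ended_cap, is_complete_json(truncated_reply))
```

Nearest-rank is used for the percentile because it always returns a value that was actually observed. An interpolating percentile would work too and gives slightly different caps on small samples.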