Local Models
Running models locally with Ollama solves three real problems: data that can't leave your machine, usage patterns that make cloud APIs expensive at scale, and offline operation. This article walks through the decision, the current model landscape, hardware requirements, and the LangChain integration — including reasoning models and structured output.
Quick Reference
- Use local when: hard privacy requirement, >50K req/day cost pressure, or air-gapped deployment
- Avoid local when: frontier quality matters, context >32K tokens, or multimodal accuracy is critical
- Install: brew install ollama (Mac); curl -fsSL https://ollama.com/install.sh | sh (Linux)
- Pull models: ollama pull qwen3:8b; ollama pull deepseek-r1:8b; ollama pull llama4:8b
- Minimum RAM rule: model size on disk × 1.3 for inference headroom (e.g. 5GB model → 6.5GB, so plan for 7GB RAM)
- LangChain: from langchain_ollama import ChatOllama; set reasoning=True for DeepSeek R1
- Use an environment variable for the model name so the same code runs locally in dev and against the cloud in prod
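
The RAM rule and the environment-variable tip from the list above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed setup: the variable name LLM_MODEL, the default tag, and the helper names are assumptions made for the example, and the check for DeepSeek R1 tags is a simple prefix match. The ChatOllama import is deferred so the arithmetic helpers run even without langchain-ollama installed.

```python
import math
import os
from typing import Mapping

# Default tag taken from the pull list above; swap in whatever you pulled.
DEFAULT_LOCAL_MODEL = "qwen3:8b"


def min_ram_gb(model_size_gb: float, headroom: float = 1.3) -> int:
    """Apply the rule of thumb: size on disk x 1.3, rounded up to a whole GB."""
    # round() guards against float noise (e.g. 13.000000000000002) before ceil.
    return math.ceil(round(model_size_gb * headroom, 6))


def resolve_model_name(env: Mapping[str, str] = os.environ) -> str:
    """Read the model name from the environment, falling back to a local tag.

    LLM_MODEL is a hypothetical variable name chosen for this sketch.
    """
    return env.get("LLM_MODEL", DEFAULT_LOCAL_MODEL)


def build_chat_model(model_name: str):
    """Build a ChatOllama client; requires the langchain-ollama package."""
    # Imported lazily so the helpers above have no third-party dependency.
    from langchain_ollama import ChatOllama

    # Assumption: only DeepSeek R1 tags get reasoning=True, per the list above.
    if model_name.startswith("deepseek-r1"):
        return ChatOllama(model=model_name, reasoning=True)
    return ChatOllama(model=model_name)


if __name__ == "__main__":
    print(f"5GB model needs about {min_ram_gb(5.0)}GB RAM")
    print(f"Model in use: {resolve_model_name()}")
```

A cloud branch is left out on purpose: in a real project, build_chat_model would be the single place where the provider choice lives, so the calling code stays identical across environments.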
Should You Run Local?
Local inference solves three specific problems. If none of these apply to you, cloud APIs will serve you better — they have frontier quality, no infrastructure overhead, and elastic scaling.
Privacy, scale, or offline → local. Otherwise, cloud frontier quality wins.
| Situation | Use Local? | Why |
|---|---|---|
| PII / confidential data, can't leave your infra | Yes | Zero data egress — model runs entirely on your hardware |
| >50K requests/day at $0.002/req = $100/day | Yes | Break-even on a dev machine is ~3–4 months of heavy use |
| Air-gapped / offline deployment | Yes | Ollama serves on localhost, no network required after pull |
| Complex multi-step reasoning (e.g. legal analysis) | No | Local 8B models make more reasoning errors than Opus 4.7 |
| Context windows >32K tokens | No | Most local models have 4–8K effective context despite claimed sizes |
| Latency-sensitive production API (<500ms p99) | No | Consumer hardware: 20–50 tok/s; cloud: 100–200 tok/s |
| Multimodal accuracy matters (not just working) | No | Vision quality gap between local and frontier is large |
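
The cost and latency rows in the table come down to simple arithmetic, which the sketch below makes explicit. The request volume, per-request price, and token rates are the figures from the table; the 10,000 total local cost and the 100-token reply length are hypothetical values picked only to illustrate the calculation, and the latency estimate ignores prompt processing and network time.

```python
def daily_cloud_cost(requests_per_day: int, price_per_request: float) -> float:
    """Cloud spend per day at a flat per-request price."""
    return requests_per_day * price_per_request


def break_even_days(local_total_cost: float, daily_cost: float) -> float:
    """Days until a one-off local spend equals the avoided cloud spend."""
    if daily_cost <= 0:
        raise ValueError("daily_cost must be positive")
    return local_total_cost / daily_cost


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a reply, ignoring prompt processing and network time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    cost = daily_cloud_cost(50_000, 0.002)
    print(f"Cloud cost per day: ${cost:.2f}")

    # Hypothetical total local cost (hardware, setup, power) for illustration.
    print(f"Break-even: {break_even_days(10_000, cost):.0f} days")

    # Hypothetical 100-token reply at the table's low and high token rates.
    for label, rate in [("local low", 20), ("local high", 50),
                        ("cloud low", 100), ("cloud high", 200)]:
        print(f"{label}: {generation_seconds(100, rate):.1f}s")
```

Plugging in your own hardware cost and typical reply length is the point of the exercise: the break-even date moves linearly with both the local spend and the daily request volume.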
8B local models score roughly 70–77% on MMLU. Frontier cloud models (Opus 4.7, GPT-5) score 87–92%. That gap of roughly 15 points shows up as wrong answers, missed edge cases, and unreliable tool calls in production. If your task requires high accuracy, measure the gap on your own data before committing to local.
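
One way to measure that gap on your own data is a small side-by-side accuracy check. The sketch below is a hedged starting point rather than a full evaluation harness: it assumes each model can be wrapped as a plain function from prompt to answer string, uses exact match after trimming and lowercasing (often too strict for free-form answers), and the stub models and examples are invented for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)
Model = Callable[[str], str]


def accuracy(model: Model, examples: Iterable[Example]) -> float:
    """Fraction of examples answered correctly under exact match."""
    items: List[Example] = list(examples)
    if not items:
        raise ValueError("need at least one example")
    correct = sum(
        1
        for prompt, expected in items
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(items)


def accuracy_gap(local: Model, cloud: Model, examples: Iterable[Example]) -> float:
    """Percentage points by which the cloud model beats the local one."""
    items = list(examples)
    return 100 * (accuracy(cloud, items) - accuracy(local, items))


if __name__ == "__main__":
    # Invented examples and stub models; replace with your labeled data and
    # real clients, e.g. lambda p: ChatOllama(model="qwen3:8b").invoke(p).content
    data = [("2+2", "4"), ("capital of France", "Paris"),
            ("3*3", "9"), ("opposite of hot", "cold")]
    cloud_stub = dict(data).get
    local_stub = {**dict(data), "3*3": "6"}.get
    print(f"Gap: {accuracy_gap(local_stub, cloud_stub, data):.1f} points")
```

A few hundred examples drawn from real traffic usually say more about fitness for your task than a public benchmark score does.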