Intermediate | 12 min

Local Models

Running models locally with Ollama solves three real problems: data that can't leave your machine, usage patterns that make cloud APIs expensive at scale, and offline operation. This article walks through the decision, the current model landscape, hardware requirements, and the LangChain integration — including reasoning models and structured output.

Quick Reference

  • Use local when: hard privacy requirement, >50K req/day cost pressure, or air-gapped deployment
  • Avoid local when: frontier quality matters, context >32K tokens, or multimodal accuracy is critical
  • Install: brew install ollama (Mac) | curl -fsSL https://ollama.com/install.sh | sh (Linux)
  • Pull models: ollama pull qwen3:8b | ollama pull deepseek-r1:8b | ollama pull llama4:8b
  • Minimum RAM rule: model size on disk × 1.3 for inference headroom (e.g. 5GB model → 6.5GB, so plan for 7GB RAM)
  • LangChain: from langchain_ollama import ChatOllama | set reasoning=True for DeepSeek R1
  • Read the model name from an environment variable: the same code then runs in local dev and cloud prod
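
The last three items can be sketched together in a few lines of Python. This is a minimal illustration, not a reference setup: the environment variable name `LLM_MODEL`, the default `qwen3:8b`, and all helper names are assumptions made for this example, and the `reasoning` argument assumes a recent `langchain-ollama` release that supports it.

```python
import os


def min_ram_gb(model_size_gb: float, headroom: float = 1.3) -> float:
    """Quick-reference rule: model size on disk times 1.3 for inference headroom."""
    return model_size_gb * headroom


def resolve_model_name(default: str = "qwen3:8b") -> str:
    """Read the model name from LLM_MODEL (a hypothetical variable name)."""
    return os.environ.get("LLM_MODEL", default)


def build_local_chat_model():
    """Create a ChatOllama client for whichever model is configured."""
    # Imported lazily so the helpers above work without langchain-ollama installed.
    from langchain_ollama import ChatOllama

    name = resolve_model_name()
    kwargs = {"model": name, "temperature": 0}
    if name.startswith("deepseek-r1"):
        # Assumption: the installed langchain-ollama version accepts `reasoning`.
        kwargs["reasoning"] = True
    return ChatOllama(**kwargs)
```

With an Ollama server running and the model pulled, `build_local_chat_model().invoke("Hello")` would return a chat message. Pointing the same code at a cloud provider would need a second branch that builds that provider's chat class, which is left out here.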

Should You Run Local?

Local inference solves three specific problems. If none of these apply to you, cloud APIs will serve you better — they have frontier quality, no infrastructure overhead, and elastic scaling.

[Figure: decision flow "Do you need local inference?" Three checks: hard privacy requirement (data must not leave), cost at scale (>50K req/day), offline / air-gap (no network access). Any YES → LOCAL (Ollama / self-host, no egress). All NO → CLOUD (frontier quality, elastic scale).]

Privacy, scale, or offline → local. Otherwise, cloud frontier quality wins.
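
The decision flow can also be written as a small function, which is handy if the choice has to be made per workload rather than once. The sketch below is one possible encoding; the `Workload` type and its field names are hypothetical, and the 50K threshold is taken directly from the flow.

```python
from dataclasses import dataclass

# Requests per day above which cloud API cost becomes the deciding factor.
COST_PRESSURE_THRESHOLD = 50_000


@dataclass
class Workload:
    data_must_stay_local: bool  # hard privacy requirement
    requests_per_day: int       # expected steady-state volume
    needs_offline: bool         # air-gapped or no network access


def choose_deployment(w: Workload) -> str:
    """Return "local" if any of the three checks fires, otherwise "cloud"."""
    if w.data_must_stay_local:
        return "local"
    if w.requests_per_day > COST_PRESSURE_THRESHOLD:
        return "local"
    if w.needs_offline:
        return "local"
    return "cloud"
```

For example, `choose_deployment(Workload(False, 2_000, False))` falls through all three checks and returns `"cloud"`.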

Situation | Use Local? | Why
PII / confidential data, can't leave your infra | Yes | Zero data egress: model runs entirely on your hardware
>50K requests/day at $0.002/req = $100/day | Yes | Break-even on a dev machine is ~3–4 months of heavy use
Air-gapped / offline deployment | Yes | Ollama serves on localhost, no network required after pull
Complex multi-step reasoning (e.g. legal analysis) | No | Local 8B models make more reasoning errors than Opus 4.7
Context windows >32K tokens | No | Most local models have 4–8K effective context despite claimed sizes
Latency-sensitive production API (<500ms p99) | No | Consumer hardware: 20–50 tok/s; cloud: 100–200 tok/s
Multimodal accuracy matters (not just working) | No | Vision quality gap between local and frontier is large

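
The cost row is simple arithmetic, and it is worth rerunning with your own numbers. The sketch below reproduces the $100/day figure and estimates a break-even point; the function names are made up for this example, and the $10,000 local cost in the usage note is a hypothetical input, not a figure from this article. Electricity and maintenance are ignored.

```python
def daily_api_cost(requests_per_day: int, price_per_request: float) -> float:
    """Cloud spend per day at a flat per-request price."""
    return requests_per_day * price_per_request


def break_even_days(
    local_cost: float, requests_per_day: int, price_per_request: float
) -> float:
    """Days of avoided cloud spend needed to pay back an up-front local cost."""
    daily = daily_api_cost(requests_per_day, price_per_request)
    if daily <= 0:
        raise ValueError("daily cloud cost must be positive")
    return local_cost / daily
```

`daily_api_cost(50_000, 0.002)` gives the $100/day from the table. With an assumed all-in local cost of $10,000, `break_even_days(10_000, 50_000, 0.002)` comes out at about 100 days, which sits inside the 3–4 month range quoted above.
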
The quality gap is real

8B local models score roughly 70–77% on the MMLU benchmark. Frontier cloud models (Opus 4.7, GPT-5) score 87–92%. That gap of roughly 15 points shows up as wrong answers, missed edge cases, and unreliable tool calls in production. If your task requires high accuracy, measure the gap on your specific data before committing to local.
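
Measuring the gap on your own data does not need a framework. A minimal sketch, assuming you have a set of prompts with known answers and can wrap each model as a plain function from prompt to answer string (the names `accuracy` and `quality_gap` are illustrative, and exact-match scoring is a simplification that only suits short, unambiguous answers):

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)


def accuracy(predict: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Fraction of examples answered correctly (case-insensitive exact match)."""
    if not examples:
        raise ValueError("need at least one labeled example")
    correct = sum(
        predict(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in examples
    )
    return correct / len(examples)


def quality_gap(
    local: Callable[[str], str],
    cloud: Callable[[str], str],
    examples: Sequence[Example],
) -> float:
    """Cloud accuracy minus local accuracy, in percentage points."""
    return 100 * (accuracy(cloud, examples) - accuracy(local, examples))
```

In practice each callable would wrap a real model, for instance `lambda p: chat_model.invoke(p).content` for a LangChain chat model. A gap of a few points on a few hundred representative examples is a very different signal from the same gap on ten, so it helps to keep the evaluation set as large as you can label.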