Intermediate · 12 min

Local Models

Running models locally with Ollama solves three real problems: data that can't leave your machine, usage patterns that make cloud APIs expensive at scale, and offline operation. This article walks through the decision, the current model landscape, hardware requirements, and the LangChain integration — including reasoning models and structured output.

Quick Reference

→ Use local when: hard privacy requirement, >50K req/day cost pressure, or air-gapped deployment
→ Avoid local when: frontier quality matters, context >32K tokens, or multimodal accuracy is critical
→ Install on Mac: brew install ollama
→ Install on Linux: curl -fsSL https://ollama.com/install.sh | sh
→ Pull models: ollama pull qwen3:8b; ollama pull deepseek-r1:8b; ollama pull llama4:8b
→ Minimum RAM rule: model size on disk × 1.3 for inference headroom (e.g. 5GB model × 1.3 = 6.5GB, so plan for about 7GB RAM)
→ LangChain: from langchain_ollama import ChatOllama; pass reasoning=True for DeepSeek R1
→ Use an environment variable for the model name: the same code then runs against a local model in dev and a cloud model in prod
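The last two quick-reference items can be sketched together. This is a minimal illustration, not the article's own integration code: the variable name LLM_MODEL, the default model, and the helper names are assumptions made for the example, and it covers only the Ollama side (pointing the same variable at a cloud provider would need that provider's chat model class).

```python
import os

DEFAULT_MODEL = "qwen3:8b"  # one of the models pulled in the quick reference


def resolve_model_config(env=None):
    """Return keyword arguments for ChatOllama, derived from the environment."""
    env = os.environ if env is None else env
    # LLM_MODEL is a hypothetical variable name chosen for this sketch.
    model = env.get("LLM_MODEL", DEFAULT_MODEL)
    config = {"model": model}
    # The quick reference pairs reasoning=True with DeepSeek R1.
    if model.startswith("deepseek-r1"):
        config["reasoning"] = True
    return config


def build_chat_model(env=None):
    """Construct the chat model. Needs langchain-ollama installed and Ollama running."""
    # Imported lazily so the configuration logic can be used without the package.
    from langchain_ollama import ChatOllama

    return ChatOllama(**resolve_model_config(env))
```

With Ollama running, `build_chat_model().invoke("Hello")` would return a message whose `.content` holds the reply. Setting `LLM_MODEL=deepseek-r1:8b` before launch switches the model without touching the code.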

Should You Run Local?

Local inference solves three specific problems. If none of these apply to you, cloud APIs will serve you better — they have frontier quality, no infrastructure overhead, and elastic scaling.

Privacy, scale, or offline → local. Otherwise, cloud frontier quality wins.

Situation	Use Local?	Why
PII / confidential data, can't leave your infra	Yes	Zero data egress — model runs entirely on your hardware
>50K requests/day at $0.002/req = $100/day	Yes	Break-even on a dev machine is ~3–4 months of heavy use
Air-gapped / offline deployment	Yes	Ollama serves on localhost, no network required after pull
Complex multi-step reasoning (e.g. legal analysis)	No	Local 8B models make more reasoning errors than Opus 4.7
Context windows >32K tokens	No	Most local models have 4–8K effective context despite claimed sizes
Latency-sensitive production API (<500ms p99)	No	Consumer hardware: 20–50 tok/s; cloud: 100–200 tok/s
Multimodal accuracy matters (not just working)	No	Vision quality gap between local and frontier is large
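Two rows in the table rest on simple arithmetic, which can be checked with a few lines of code. The sketch below is illustrative only: the $10,000 hardware price and the 200-token response length are hypothetical figures picked for the example, and the break-even calculation ignores power, maintenance, and engineering time.

```python
def daily_api_cost(requests_per_day, cost_per_request):
    """Cloud spend per day in dollars."""
    return requests_per_day * cost_per_request


def break_even_days(hardware_cost, requests_per_day, cost_per_request):
    """Days of avoided cloud spend needed to pay for the hardware."""
    return hardware_cost / daily_api_cost(requests_per_day, cost_per_request)


def generation_seconds(output_tokens, tokens_per_second):
    """Time to generate a response, ignoring prompt processing and network."""
    return output_tokens / tokens_per_second


# Cost row: 50K requests/day at $0.002/request.
cost = daily_api_cost(50_000, 0.002)

# Hypothetical $10,000 machine (an assumption, not a figure from the article).
days = break_even_days(10_000, 50_000, 0.002)

# Latency row: a hypothetical 200-token response at the table's throughput range.
slow_local = generation_seconds(200, 20)
fast_local = generation_seconds(200, 50)

print(f"cloud cost/day: ${cost:.0f}, break-even: {days:.0f} days")
print(f"200 tokens locally: {fast_local:.0f}-{slow_local:.0f} s")
```

Under these assumptions the machine pays for itself in about 100 days, in line with the table's 3–4 month estimate, and a 200-token response takes 4 to 10 seconds locally, well outside a 500ms p99 budget.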

The quality gap is real

8B local models score roughly 70–77% on the MMLU benchmark. Frontier cloud models (Opus 4.7, GPT-5) score 87–92%. That gap of roughly 15 points shows up as wrong answers, missed edge cases, and unreliable tool calls in production. If your task requires high accuracy, measure the gap on your specific data before committing to local.
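Measuring the gap on your own data can start very small. The following is a rough sketch, assuming exact-match scoring and two callables that take a prompt and return a string. The function names are hypothetical, and in practice each callable would wrap a real local or cloud model.

```python
def accuracy(predict, examples):
    """Fraction of (prompt, expected) pairs where predict(prompt) matches exactly.

    Matching ignores case and surrounding whitespace. Exact match is a
    simplifying assumption; many tasks need a more forgiving scorer.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(
        1
        for prompt, expected in examples
        if predict(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(examples)


def accuracy_gap(local_predict, cloud_predict, examples):
    """Return (local accuracy, cloud accuracy, cloud minus local) on the same examples."""
    local_acc = accuracy(local_predict, examples)
    cloud_acc = accuracy(cloud_predict, examples)
    return local_acc, cloud_acc, cloud_acc - local_acc
```

Run it on a few hundred examples drawn from real traffic rather than a public benchmark. The number that matters is the gap on your task, not the gap on MMLU.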

Which Model for Which Task

The local model landscape changed significantly in 2025–2026. Llama 3.2, Mistral NeMo, and Qwen 2.5 are superseded. These are the current choices and what they're actually good at.

Hardware Requirements

This is the section the old article skipped. Pulling a model that doesn't fit in your VRAM or RAM is the most common beginner mistake with local models: Ollama will swap to disk and run at under 1 tok/s, which is unusable.
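A quick pre-flight check before pulling a model can apply the quick-reference rule of model size on disk × 1.3. This is a minimal sketch with hypothetical function names. It takes available memory as an argument rather than detecting it, and the 1.3 factor is the article's rule of thumb, not a guarantee.

```python
HEADROOM_FACTOR = 1.3  # rule of thumb from the quick reference


def required_ram_gb(model_size_gb, headroom=HEADROOM_FACTOR):
    """Estimated memory needed to run a model of the given on-disk size."""
    return model_size_gb * headroom


def fits_in_memory(model_size_gb, available_gb, headroom=HEADROOM_FACTOR):
    """True if the estimated requirement fits in the memory you can spare."""
    return required_ram_gb(model_size_gb, headroom) <= available_gb


# A 5GB model needs about 6.5GB, so it fits in 8GB free but not in 6GB.
print(fits_in_memory(5, 8))
print(fits_in_memory(5, 6))
```

Pass the memory that is actually free, not the machine's total: the OS, your editor, and the browser all compete for the same RAM.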
