LLM Foundations

Everything a software engineer needs to understand about large language models: how transformers work, the model landscape, prompt engineering as a discipline, and when and how to fine-tune.

0/19

Tokenization Deep Dive

How LLMs split text into tokens, why BPE is the dominant algorithm, and the engineering decisions that follow: cross-provider cost math, pre-flight token counting (including Anthropic's Token Counting API), content-type multipliers, and the production traps that blow budgets.

beginner14 min

The Inference Pipeline

Two fundamentally different phases explain every latency, cost, and quality trade-off you face when calling an LLM API. Understanding them tells you what to optimize — and stops you from optimizing the wrong thing.

intermediate13 min

Context Windows & Context Management

Context windows in 2026 are large -- the problem is no longer size, it's quality degradation. Learn what every major model actually offers, what the 'lost in the middle' problem means for your production prompts, how to do real cost math, and how prompt caching cuts your context spend by up to 90%.

intermediate12 min

Hallucinations: The Engineering Response

LLMs hallucinate because they generate statistically plausible tokens, not verified facts. This article gives you real failure rates by domain, the full defense-in-depth stack with cost math, working code for Anthropic's Citations API, and an eval harness to measure hallucination in your own system.

intermediate14 min

★Model Families Compared

The hard part of model selection isn't capability — all frontier models are good. It's knowing which model to pick for your task, budget, and compliance constraints. This guide maps every major family to the decisions that actually matter in production.

beginner10 min

Open vs Closed Models

The false binary of 'pick one side' has been replaced by three deployment tiers in 2026: closed API, hosted open model API, and self-hosted. The quality gap has nearly closed on most tasks. The decision is now mostly about ops budget, volume, and licensing risk — not capability.

intermediate14 min

Model Selection Framework

How to choose the right model for your workload — and when not to add routing at all. Covers the 2026 model landscape, scorecard evaluation, real cost math, fallback chains, and a first-30-days runbook for moving from a single model to a production router.

intermediate16 min

Reading Benchmarks Critically

Benchmark scores are marketing until proven otherwise. This article teaches you the specific checks — sample size, prompt sensitivity, contamination risk, saturation — that separate signal from noise, plus how to build and maintain a private benchmark that actually predicts production quality.

intermediate14 min

Multimodal Models

How to decide when to use multimodal models, which one to pick, what it costs, and how to catch the hallucinations before they reach production. Covers vision, audio, model tiering, and validation strategy for engineers building with GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, and Llama 4.

intermediate14 min

★Prompt Anatomy

The structural components of an LLM prompt and why structure matters more than length. Covers the four message roles (including OpenAI's developer role), when to skip elaborate structure, how system prompts interact with prompt caching costs, and the production failure modes that vague prompts cause.

beginner14 min

Techniques That Work

Six prompt engineering techniques — few-shot, chain-of-thought, self-consistency, role prompting, decomposition, and cost math — with the decision framework for picking the right one and the cost arithmetic every production engineer needs before committing.

intermediate16 min

Structured Output Techniques

Both OpenAI and Anthropic now offer native constrained decoding — the model physically cannot produce tokens that violate your schema. This guide covers when to use it, how it works, and the three layers of validation you still need on top of it.

intermediate14 min

Systematic Prompt Iteration

A methodology for improving prompts without guessing: build a golden test suite, evaluate systematically, compare with statistical rigor, and monitor for drift. This is the engineering discipline that separates one-off prompt tweaks from reliable production improvement.

advanced14 min

When Prompting Isn't Enough

A decision framework for escaping prompt engineering ceilings: how to diagnose what's actually broken, and which of the four escalation paths — structured output, model switch, RAG, or fine-tuning — fixes it.

advanced12 min

★When to Fine-Tune

The fine-tuning landscape shifted in 2026: GPT-5.4 supports distillation but not traditional SFT, o4-mini is retired, and open-source models like Gemma 4 now match closed-source quality at a fraction of the cost. This article walks through the decision — whether to fine-tune at all, which approach to use, and how to compute ROI before committing to training.

intermediate14 min

Training Data Engineering

A decision-first guide to building training datasets that actually improve fine-tuned models. Covers when to invest, correct data formats for SFT/DPO/RFT, annotation quality, cleaning pipelines, synthetic generation with cost math, distribution engineering, and the failure modes that kill fine-tuning projects before training even starts.

advanced20 min

LoRA & QLoRA

LoRA fine-tunes large models by training tiny adapter matrices instead of all weights. This article covers when to use it (and when not to), the math behind it, QLoRA's memory derivation, DoRA and Unsloth as 2026 best practices, and the failure modes — catastrophic forgetting, safety degradation, overfitting — that most tutorials skip.

advanced16 min

End-to-End Fine-Tuning Pipeline

Pick your path first (API, self-hosted, or cloud-managed), then follow it end-to-end. Covers cost math for all three paths, updated OpenAI pipeline with GPT-4.1, updated Hugging Face pipeline with TRL v1.0 SFTConfig, training monitoring, post-training LoRA merge and deployment, and the five failure modes that most fine-tuning projects hit.

advanced16 min

Evaluating Fine-Tuned Models

Five dimensions you must evaluate before shipping a fine-tuned model — including catastrophic forgetting, which kills most fine-tunes that look good on paper. Covers loss curve interpretation, contamination detection, LLM-as-judge eval, and a go/no-go decision framework.

advanced14 min