LLM Foundations
Everything a software engineer needs to understand about large language models: how transformers work, the model landscape, prompt engineering as a discipline, and when and how to fine-tune.
How LLMs break text into tokens, why BPE is the dominant algorithm, and the practical implications for cost, context limits, and multilingual performance. Includes hands-on token counting with tiktoken and cross-model comparisons.
What actually happens when you call an LLM API -- from prompt tokenization through logit computation to output sampling. Understand KV caching, sampling strategies (temperature, top-p, top-k), batching, and how these choices affect output quality and latency.
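The temperature and top-p mechanics mentioned above can be shown with a toy next-token distribution. A stdlib-only sketch, not any provider's actual implementation; the logit values are made up for illustration.

```python
import math
import random

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs: list[float], p: float = 0.9) -> list[int]:
    # Keep the smallest set of tokens whose cumulative probability reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return kept

logits = [4.0, 2.0, 1.0, 0.5]          # toy next-token logits
probs = softmax(logits, temperature=0.7)
candidates = top_p_filter(probs, p=0.9)
token = random.choices(candidates, weights=[probs[i] for i in candidates])[0]
```

Lowering the temperature concentrates mass on the top token, which is why low-temperature sampling is more deterministic but less diverse.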
What context windows really mean, why the 'lost in the middle' problem plagues long-context models, how attention patterns change at different positions, and practical strategies for working within context limits.
LLMs hallucinate because they are statistical pattern matchers, not knowledge databases. Understand the types of hallucination, when they are most likely, practical mitigation strategies, and why designing around hallucination is more realistic than eliminating it.
A comprehensive comparison of the major LLM families: GPT (OpenAI), Claude (Anthropic), Gemini (Google), and leading open models (Llama, Mistral, Qwen). Pricing, capabilities, context windows, and when to use each.
The trade-offs between closed-source API models (GPT-4, Claude) and open-weight models (Llama, Mistral). When self-hosting makes economic sense, licensing traps to avoid, and a decision framework for choosing between them.
A systematic framework for choosing the right LLM for your use case across four dimensions: capability, cost, latency, and privacy. Includes model scorecards, multi-model strategies, fallback chains, and a working model router implementation.
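A model router like the one described can be sketched in a few lines. The model names, capability scores, and prices below are placeholders invented for illustration, not real scorecard data.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    capability: int       # 1 (weak) .. 10 (strong), from your own scorecard
    cost_per_mtok: float  # illustrative input price, $/million tokens
    p50_latency_ms: int

# Hypothetical catalog for illustration only.
CATALOG = [
    Model("small-fast", capability=5, cost_per_mtok=0.15, p50_latency_ms=300),
    Model("mid-tier",   capability=7, cost_per_mtok=1.00, p50_latency_ms=700),
    Model("frontier",   capability=9, cost_per_mtok=5.00, p50_latency_ms=1500),
]

def route(min_capability: int, max_latency_ms: int = 10_000) -> list[Model]:
    """Return a fallback chain: every qualifying model, cheapest first."""
    eligible = [m for m in CATALOG
                if m.capability >= min_capability
                and m.p50_latency_ms <= max_latency_ms]
    return sorted(eligible, key=lambda m: m.cost_per_mtok)

chain = route(min_capability=7)  # try "mid-tier" first, fall back to "frontier"
```

Returning the whole sorted chain rather than a single model is what makes fallback on errors or timeouts cheap to implement.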
How to interpret LLM benchmarks without being misled. Covers major benchmarks (MMLU, HumanEval, MATH, Arena Elo), what they actually test, benchmark contamination, and how to build a task-specific benchmark that reflects your real workload.
How modern LLMs process images, audio, and video alongside text. Covers vision capabilities (image understanding, OCR, diagram analysis), audio features, current limitations, practical use cases, and working code examples for extracting structured data from images.
The structural components of an LLM prompt: system messages, user messages, and assistant messages. How each part influences model behavior, why system prompts are privileged, and practical demonstrations of how prompt structure transforms output quality.
Evidence-based prompt engineering techniques: chain-of-thought reasoning, self-consistency, role prompting, and step-by-step decomposition. When each technique helps, when it hurts, and how to measure the improvement.
Getting reliable JSON, structured data, and type-safe outputs from LLMs. Covers JSON mode, function calling, constrained decoding, Pydantic validation, and handling partial/malformed output in streaming scenarios.
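Handling partially malformed output can be sketched with the standard library alone. This is a hand-rolled stand-in for Pydantic validation, shown stdlib-only to stay self-contained; the invoice schema and the sample reply are invented for illustration.

```python
import json
import re

def extract_json(text: str) -> dict:
    """Pull the first JSON object out of a model reply, tolerating ```json fences."""
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    raw = fenced.group(1) if fenced else text
    return json.loads(raw)

def validate_invoice(data: dict) -> dict:
    """Minimal schema check (in practice, a Pydantic model would do this)."""
    if not isinstance(data.get("total"), (int, float)):
        raise ValueError("total must be a number")
    if not isinstance(data.get("currency"), str):
        raise ValueError("currency must be a string")
    return data

reply = 'Here you go:\n```json\n{"total": 42.5, "currency": "EUR"}\n```'
invoice = validate_invoice(extract_json(reply))
```

The fence-stripping step matters in practice: models frequently wrap JSON in markdown even when asked not to, so parsing the raw reply directly is a common failure mode.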
How to version-control prompts, A/B test with statistical significance, build prompt test suites with golden examples, and run regression tests to ensure new prompts don't break old cases. A disciplined engineering approach to prompt development.
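A golden-example regression suite like the one described can be sketched as a pass/fail loop over canonical cases. The cases and the stub model below are toys invented for illustration; a real suite would call the actual API and use richer assertions than substring matching.

```python
GOLDEN_CASES = [
    # (input, substring the output must contain) -- toy golden examples
    {"input": "2+2", "must_contain": "4"},
    {"input": "capital of France", "must_contain": "Paris"},
]

def run_regression(prompt_fn, cases=GOLDEN_CASES) -> dict:
    """Run every golden case through a prompt version; report pass/fail counts."""
    failures = [c for c in cases if c["must_contain"] not in prompt_fn(c["input"])]
    return {"passed": len(cases) - len(failures),
            "failed": len(failures),
            "failures": failures}

# Stub standing in for a real model call, for illustration only.
def fake_model(question: str) -> str:
    canned = {"2+2": "The answer is 4.", "capital of France": "Paris."}
    return canned.get(question, "")

report = run_regression(fake_model)
```

Running this report for every candidate prompt before deployment is what catches the "new prompt breaks old cases" regressions the chapter warns about.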
How to recognize when prompt engineering has hit its ceiling and what escalation path to take. Decision framework for: improve prompt -> add context (RAG) -> fine-tune -> change model, with cost-benefit analysis and real examples.
A decision framework for choosing between prompt engineering, RAG, and fine-tuning. When fine-tuning is the right investment, when it is a waste of time, cost analysis comparing approaches, and the use cases where fine-tuning delivers the most value.
How to prepare high-quality training data for LLM fine-tuning. Covers data formats, quality-over-quantity principles, data cleaning and deduplication, synthetic data generation, and a complete data preparation pipeline.
Parameter-efficient fine-tuning with LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA). How they work intuitively, why they need only 1-10% of the memory of full fine-tuning, how to choose hyperparameters (rank, alpha, target modules), and a complete configuration example with Hugging Face PEFT.
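The memory claim above follows from simple arithmetic: a LoRA adapter replaces the update to a frozen d_in x d_out weight matrix with two low-rank factors, A (d_in x r) and B (r x d_out). A back-of-envelope sketch with toy dimensions (the 4096x4096 projection is illustrative, not taken from any specific model):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: A is d_in x r, B is r x d_out."""
    return d_in * rank + rank * d_out

# Toy example: one 4096x4096 attention projection, rank 8.
full = 4096 * 4096                           # ~16.8M weights in full fine-tuning
adapter = lora_params(4096, 4096, rank=8)    # 65,536 adapter weights
ratio = adapter / full                       # well under 1% of the matrix
```

Summed over every adapted projection in the network, this is where the 1-10% trainable-parameter figure comes from; QLoRA shrinks the footprint further by quantizing the frozen base weights.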
Complete fine-tuning pipelines for three approaches: OpenAI (simplest), Hugging Face + PEFT (most control), and cloud-managed (Vertex AI, Bedrock). Includes training monitoring, loss curve interpretation, overfitting detection, and a full working Hugging Face training script.
How to rigorously evaluate fine-tuned LLMs: train/validation/test splitting for LLMs, detecting overfitting and benchmark contamination, A/B testing fine-tuned vs base models with real users, and a complete evaluation harness implementation.