Advanced · 10 min
Content Moderation & Safety
AI systems can generate or be manipulated into producing harmful content. Learn to build input/output filtering pipelines, use dedicated safety classifiers, handle category taxonomies, and design escalation workflows for when automated moderation is not enough.
Quick Reference
- Filter inputs AND outputs: a safe prompt can still produce an unsafe response
- Use dedicated safety classifiers (OpenAI Moderation API, Perspective API) rather than prompting the main model to self-moderate
- Build a category taxonomy (hate speech, self-harm, illegal content, NSFW, PII, prompt injection); each category needs different handling
- Not all unsafe content should be blocked: some should be flagged for human review, some should trigger a warning to the user, and some should only be logged
- Prompt injection is a safety issue: adversarial inputs that attempt to override your system prompt
- Escalation workflows: automated flag → human review → policy update → retrain classifier
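The taxonomy and per-category handling in the list above can be sketched as a small policy table. This is a minimal illustration, not a recommended policy: the `Category` and `Action` enums, the `POLICY` mapping, the `decide` helper, and the 0.5 threshold are all hypothetical names and values chosen for the example. It assumes a classifier that returns a score between 0 and 1 per category.

```python
from enum import Enum
from typing import Dict, Optional


class Category(Enum):
    HATE_SPEECH = "hate_speech"
    SELF_HARM = "self_harm"
    ILLEGAL_CONTENT = "illegal_content"
    NSFW = "nsfw"
    PII = "pii"
    PROMPT_INJECTION = "prompt_injection"


class Action(Enum):
    # Values give the severity order: the highest value wins.
    LOG = 1
    WARN = 2
    REDACT = 3
    REVIEW = 4
    BLOCK = 5


# Illustrative mapping only (an assumption of this sketch).
POLICY: Dict[Category, Action] = {
    Category.HATE_SPEECH: Action.BLOCK,
    Category.SELF_HARM: Action.REVIEW,
    Category.ILLEGAL_CONTENT: Action.BLOCK,
    Category.NSFW: Action.WARN,
    Category.PII: Action.REDACT,
    Category.PROMPT_INJECTION: Action.BLOCK,
}


def decide(scores: Dict[Category, float], threshold: float = 0.5) -> Optional[Action]:
    """Pick the most severe action for a set of classifier scores in [0, 1].

    Scores below `threshold` are treated as weak signals and only logged.
    Returns None when no category has a positive score.
    """
    actions = [
        POLICY[category] if score >= threshold else Action.LOG
        for category, score in scores.items()
        if score > 0
    ]
    return max(actions, key=lambda action: action.value, default=None)
```

For example, `decide({Category.NSFW: 0.9, Category.SELF_HARM: 0.8})` resolves to `Action.REVIEW`, because human review outranks a warning in this ordering. Keeping the policy in data rather than in branching code makes it easier to change after a policy update.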
Input & Output Filtering Architecture
Content moderation for AI systems requires filtering at two points: before the user's message reaches the model (input filtering) and before the model's response reaches the user (output filtering). Both are necessary because a benign input can produce a harmful output (the model goes off-track), and a harmful input might produce a safe output (but you still want to log the attempt).
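The two filter points described above can be sketched as a wrapper around the model call. This is a sketch under stated assumptions: `moderated_call`, `ModerationResult`, `FALLBACK`, and `toy_classifier` are hypothetical names, the classifier is a placeholder substring check standing in for a dedicated safety classifier, and `model` is any callable that maps a string to a string.

```python
from dataclasses import dataclass, field
from typing import Callable, List

FALLBACK = "Sorry, I can't help with that request."


@dataclass
class ModerationResult:
    text: str
    blocked: bool
    log: List[str] = field(default_factory=list)


def toy_classifier(text: str) -> bool:
    """Placeholder for a real safety classifier: True means unsafe."""
    return "forbidden-topic" in text.lower()


def moderated_call(
    user_message: str,
    model: Callable[[str], str],
    classify: Callable[[str], bool] = toy_classifier,
) -> ModerationResult:
    """Filter the input, call the model, then filter the output."""
    log: List[str] = []

    # Input filter: the model is never called for a flagged message.
    if classify(user_message):
        log.append("input_flagged")
        return ModerationResult(FALLBACK, True, log)

    response = model(user_message)

    # Output filter: a benign input can still yield an unsafe response.
    if classify(response):
        log.append("output_flagged")
        return ModerationResult(FALLBACK, True, log)

    return ModerationResult(response, False, log)
```

Calling `moderated_call("tell me a story", lambda m: "forbidden-topic details")` returns the fallback text with `log == ["output_flagged"]`, which is the benign-input, harmful-output case. Recording which filter fired keeps input attempts and output failures distinguishable in later review.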
| Filter Point | What It Catches | Action on Match |
|---|---|---|
| Input filter | Hate speech, threats, explicit content in user messages | Block request, log attempt, warn user |
| Input filter | Prompt injection attempts | Block request, flag for security review |
| Input filter | PII in user messages | Redact before forwarding to model |
| Output filter | Harmful content generated by model | Block response, regenerate with safety prompt |
| Output filter | Hallucinated harmful instructions | Block and serve safe fallback response |
| Output filter | PII leaked in model response | Redact before showing to user |
Content moderation pipeline with input and output filters
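The PII rows in the table call for redaction rather than blocking, and the same helper can run on both the input and the output side. The sketch below is deliberately simplistic: `redact_pii` is a hypothetical helper, and the two regular expressions cover only plain email addresses and 3-3-4 digit phone numbers. Production systems usually rely on a dedicated PII detector instead.

```python
import re

# Simplified patterns (assumptions of this sketch, not full PII coverage).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


print(redact_pii("Mail bob@example.com or call 555-123-4567."))
# prints: Mail [EMAIL] or call [PHONE].
```

Typed placeholders such as `[EMAIL]` keep the message readable for the model and the user while removing the sensitive value itself.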