Advanced · 10 min

Content Moderation & Safety

AI systems can generate or be manipulated into producing harmful content. Learn to build input/output filtering pipelines, use dedicated safety classifiers, handle category taxonomies, and design escalation workflows for when automated moderation is not enough.

Quick Reference

  • Filter inputs AND outputs — a safe prompt can still produce unsafe responses
  • Use dedicated safety classifiers (OpenAI Moderation API, Perspective API) rather than prompting the main model to self-moderate
  • Build a category taxonomy: hate speech, self-harm, illegal content, NSFW, PII, prompt injection — each needs different handling
  • Not all unsafe content should be blocked: some should be flagged for human review, some should trigger a warning to the user, and some should only be logged
  • Prompt injection is a safety issue: adversarial inputs that try to override your system prompt
  • Escalation workflows: automated flag → human review → policy update → retrain classifier
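
One way to make "each category needs different handling" concrete is a policy table that maps classifier scores to actions. The sketch below is illustrative only: the category names follow the taxonomy above, but the `Action` levels, the thresholds, and the `decide` helper are assumptions for the example, not values from any particular moderation API.

```python
from enum import Enum


class Category(Enum):
    HATE_SPEECH = "hate_speech"
    SELF_HARM = "self_harm"
    ILLEGAL = "illegal_content"
    NSFW = "nsfw"
    PII = "pii"
    PROMPT_INJECTION = "prompt_injection"


class Action(Enum):
    ALLOW = "allow"
    LOG = "log"
    WARN = "warn"
    HUMAN_REVIEW = "human_review"
    BLOCK = "block"


# Actions ordered from least to most strict.
SEVERITY = [Action.ALLOW, Action.LOG, Action.WARN, Action.HUMAN_REVIEW, Action.BLOCK]

# Hypothetical policy: per category, (min_score, action) rules, highest threshold first.
POLICY = {
    Category.HATE_SPEECH: [(0.9, Action.BLOCK), (0.5, Action.HUMAN_REVIEW)],
    Category.SELF_HARM: [(0.3, Action.HUMAN_REVIEW)],
    Category.ILLEGAL: [(0.8, Action.BLOCK), (0.4, Action.HUMAN_REVIEW)],
    Category.NSFW: [(0.8, Action.BLOCK), (0.5, Action.WARN)],
    Category.PII: [(0.5, Action.LOG)],
    Category.PROMPT_INJECTION: [(0.7, Action.BLOCK), (0.3, Action.LOG)],
}


def decide(scores: dict[Category, float]) -> Action:
    """Return the strictest action triggered by any category score."""
    strictest = Action.ALLOW
    for category, score in scores.items():
        for min_score, action in POLICY.get(category, []):
            if score >= min_score:
                if SEVERITY.index(action) > SEVERITY.index(strictest):
                    strictest = action
                break  # first matching rule wins for this category
    return strictest
```

Keeping the thresholds in data rather than in code means the "policy update" step of an escalation workflow can change behavior without touching the filtering logic.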

Input & Output Filtering Architecture

Content moderation for AI systems requires filtering at two points: before the user's message reaches the model (input filtering) and before the model's response reaches the user (output filtering). Both are necessary because a benign input can produce a harmful output (the model goes off-track), and a harmful input might produce a safe output (but you still want to log the attempt).
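
The two filter points can be sketched as a thin wrapper around the model call. This is a minimal illustration under stated assumptions: `classify` and `generate` are hypothetical callables standing in for a real safety classifier and a real model client, and `toy_classify` is a keyword check used only so the example runs.

```python
from dataclasses import dataclass
from typing import Callable, List

SAFE_FALLBACK = "Sorry, I can't help with that request."


@dataclass
class Verdict:
    flagged: bool
    reason: str = ""


def toy_classify(text: str) -> Verdict:
    """Toy keyword check standing in for a real safety classifier."""
    for term in ("bomb", "kill"):
        if term in text.lower():
            return Verdict(True, f"matched '{term}'")
    return Verdict(False)


def moderated_reply(
    user_message: str,
    classify: Callable[[str], Verdict],
    generate: Callable[[str], str],
    audit_log: List[str],
) -> str:
    # Input filter: block and log before the message reaches the model.
    verdict = classify(user_message)
    if verdict.flagged:
        audit_log.append(f"input blocked: {verdict.reason}")
        return SAFE_FALLBACK

    response = generate(user_message)

    # Output filter: a benign input can still yield an unsafe response.
    verdict = classify(response)
    if verdict.flagged:
        audit_log.append(f"output blocked: {verdict.reason}")
        return SAFE_FALLBACK
    return response
```

Passing the classifier and model in as parameters keeps the wrapper independent of any vendor SDK and makes each filter point easy to test in isolation.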

| Filter Point  | What It Catches                                          | Action on Match                              |
|---------------|----------------------------------------------------------|----------------------------------------------|
| Input filter  | Hate speech, threats, explicit content in user messages  | Block request, log attempt, warn user        |
| Input filter  | Prompt injection attempts                                | Block request, flag for security review      |
| Input filter  | PII in user messages                                     | Redact before forwarding to model            |
| Output filter | Harmful content generated by model                       | Block response, regenerate with safety prompt |
| Output filter | Hallucinated harmful instructions                        | Block and serve safe fallback response       |
| Output filter | PII leaked in model response                             | Redact before showing to user                |
Content moderation pipeline with input and output filters
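
The PII rows in the table call for redaction rather than blocking. A rough sketch of that step follows, assuming simple regular expressions for emails and US-style phone numbers. The patterns and the `redact_pii` helper are illustrative and far from exhaustive; a production system would more likely rely on a dedicated PII detection service.

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace matched PII with placeholders and report which types were found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.append(label)
    return text, found


redacted, kinds = redact_pii("Email me at jane@example.com or call 555-123-4567.")
print(redacted)  # Email me at [EMAIL] or call [PHONE].
print(kinds)     # ['EMAIL', 'PHONE']
```

The same function can run at both filter points: on the user message before it is forwarded to the model, and on the model response before it is shown to the user.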