AI Engineering Judgment / Compliance & Responsible AI
Advanced · 11 min

PII & Data Handling

Every prompt you send to an LLM API can contain personal data — user names, emails, addresses, and more. Learn PII detection, redaction before API calls, data residency requirements (GDPR, CCPA), provider retention policies, and anonymization strategies that keep you compliant.

Quick Reference

  • Every LLM API call is a potential data export — treat prompts with the same sensitivity as database exports
  • Detect and redact PII BEFORE sending to the LLM API, not after
  • Know your provider's data retention policy: OpenAI and Anthropic both retain standard API data for roughly 30 days by default, and enterprise or zero-data-retention terms vary — verify current terms before relying on them
  • GDPR requires data processing agreements (DPAs) with every LLM provider that touches EU user data
  • Anonymize with placeholder tokens ([USER_NAME], [EMAIL]) that you can restore in the response
  • Never include full database records in prompts — extract only the fields the model needs
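
The placeholder-token approach above can be sketched as a redact/restore round trip. This is a minimal illustration using only an email pattern; the function names and token format are assumptions, not a standard API:

```python
import re

# Hypothetical sketch: replace emails with numbered placeholder tokens and
# keep a mapping so the real values can be restored in the model's response.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(text: str) -> tuple[str, dict[str, str]]:
    mapping: dict[str, str] = {}

    def _sub(match: re.Match) -> str:
        token = f"[EMAIL_{len(mapping) + 1}]"
        mapping[token] = match.group(0)
        return token

    return EMAIL_RE.sub(_sub, text), mapping

def restore(text: str, mapping: dict[str, str]) -> str:
    # Swap the placeholders back so the user sees real values, while the
    # provider only ever saw tokens.
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text

prompt, mapping = redact("Contact alice@example.com about the invoice.")
# prompt == "Contact [EMAIL_1] about the invoice."
```

Because the mapping never leaves your infrastructure, the provider sees only opaque tokens, yet the final response can still address the user by their real details.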

How PII Ends Up in Prompts

PII enters your LLM prompts through several channels, some obvious and some subtle. User messages naturally contain personal information. Conversation history accumulates PII over multiple turns. RAG retrieval can pull documents containing PII from other users. System prompts might include user profile data for personalization. Every one of these channels needs to be audited and controlled.
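
A pre-send scrubber applied to every one of these channels might look like the sketch below. The patterns are illustrative, not exhaustive — production systems typically combine regexes with a trained PII detector:

```python
import re

# Assumed, simplified patterns; a real deployment needs broader coverage.
PATTERNS = {
    "[SSN]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "[PHONE]": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub(text: str) -> str:
    for token, pattern in PATTERNS.items():
        text = pattern.sub(token, text)
    return text

def scrub_messages(messages: list[dict]) -> list[dict]:
    # Redact history before re-sending: every turn, not just the latest one.
    return [{**m, "content": scrub(m["content"])} for m in messages]
```

The same `scrub` call should run on retrieved RAG chunks and tool results before they are injected into the prompt, so no channel bypasses redaction.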

| PII Source | Example | Risk Level | Mitigation |
| --- | --- | --- | --- |
| User message | "My SSN is 123-45-6789, can you help?" | High | Detect and redact before sending to model |
| Conversation history | Previous turns containing names, emails, addresses | High | Redact history before re-sending |
| RAG retrieval | Documents containing other users' data | Critical | Tenant isolation in vector store, metadata filtering |
| System prompt | User profile injected for personalization | Medium | Minimize profile data; use IDs, not names |
| Tool results | Database query returns with PII fields | High | Select only needed fields, redact before injection |
| Error messages | Stack traces containing user data | Medium | Sanitize error messages before including in prompt |
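
For the tool-results row, the "select only needed fields" mitigation amounts to projecting each database row down to an allowlist before it reaches the prompt. A minimal sketch, with assumed field names:

```python
# Allowlist of fields the model actually needs; everything else is dropped
# before the tool result is injected into the prompt.
NEEDED_FIELDS = {"order_id", "status", "total"}

def minimize(row: dict) -> dict:
    return {k: v for k, v in row.items() if k in NEEDED_FIELDS}

row = {
    "order_id": 42,
    "status": "shipped",
    "total": 19.99,
    "customer_email": "alice@example.com",  # PII: never reaches the model
    "ship_address": "1 Main St",            # PII: never reaches the model
}
```

An allowlist is safer than a blocklist here: new PII columns added to the schema later are excluded by default instead of silently leaking.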
RAG Cross-Contamination Is the Biggest Risk

If your RAG system does not enforce tenant isolation, User A's query can retrieve User B's documents. That is a data breach. Always filter retrieval results by user/tenant ID using metadata filters that the vector store applies at query time, not post-retrieval filtering in application code: with post-retrieval filtering, another tenant's documents have already been fetched into your application, and a single missed or buggy filter leaks them into the prompt.
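
The difference can be sketched with a toy in-memory store standing in for a real vector database. The class and method names are illustrative; most vector stores accept an equivalent metadata filter argument on the search call itself:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    tenant_id: str

class Store:
    """Toy stand-in for a vector store; `search` is keyword match, not ANN."""

    def __init__(self, docs: list[Doc]):
        self.docs = docs

    def search(self, query: str, tenant_id: str) -> list[Doc]:
        # The tenant filter is part of the query, so other tenants' documents
        # are never candidates — not fetched and then discarded afterwards.
        return [d for d in self.docs
                if d.tenant_id == tenant_id and query in d.text]

store = Store([
    Doc("refund policy", tenant_id="tenant_a"),
    Doc("refund history for Bob", tenant_id="tenant_b"),
])
results = store.search("refund", tenant_id="tenant_a")
```

With the filter inside `search`, tenant B's document is excluded from the candidate set entirely, which is the property a query-time metadata filter gives you in a real vector store.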