AI Engineering Judgment / Compliance & Responsible AI
Advanced · 11 min

PII & Data Handling

Every prompt you send to an LLM API can contain personal data — user names, emails, addresses, and more. Learn PII detection, redaction before API calls, data residency requirements (GDPR, CCPA), provider retention policies, and anonymization strategies that keep you compliant.

Quick Reference

  • Every LLM API call is a potential data export — treat prompts with the same sensitivity as database exports
  • Detect and redact PII BEFORE sending to the LLM API, not after
  • Know your provider's data retention policy: OpenAI and Anthropic both retain standard API data for roughly 30 days by default, and enterprise or zero-data-retention terms vary — verify current terms before relying on them
  • GDPR requires data processing agreements (DPAs) with every LLM provider that touches EU user data
  • Anonymize with placeholder tokens ([USER_NAME], [EMAIL]) that you can restore in the response
  • Never include full database records in prompts — extract only the fields the model needs
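
The placeholder-token approach above can be sketched as a redact/restore round trip. This is a minimal illustration using only an email pattern; the function names and token format are assumptions, not a standard API:

```python
import re

# Hypothetical sketch: replace emails with numbered placeholder tokens and
# keep a mapping so the real values can be restored in the model's response.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(text: str) -> tuple[str, dict[str, str]]:
    mapping: dict[str, str] = {}

    def _sub(match: re.Match) -> str:
        token = f"[EMAIL_{len(mapping) + 1}]"
        mapping[token] = match.group(0)
        return token

    return EMAIL_RE.sub(_sub, text), mapping

def restore(text: str, mapping: dict[str, str]) -> str:
    # Swap the placeholders back so the user sees real values, while the
    # provider only ever saw tokens.
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text

prompt, mapping = redact("Contact alice@example.com about the invoice.")
# prompt == "Contact [EMAIL_1] about the invoice."
```

Because the mapping never leaves your infrastructure, the provider sees only opaque tokens, yet the final response can still address the user by their real details.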

How PII Ends Up in Prompts

PII enters your LLM prompts through several channels, some obvious and some subtle. User messages naturally contain personal information. Conversation history accumulates PII over multiple turns. RAG retrieval can pull documents containing PII from other users. System prompts might include user profile data for personalization. Every one of these channels needs to be audited and controlled.
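
A pre-send scrubber applied to every one of these channels might look like the sketch below. The patterns are illustrative, not exhaustive — production systems typically combine regexes with a trained PII detector:

```python
import re

# Assumed, simplified patterns; a real deployment needs broader coverage.
PATTERNS = {
    "[SSN]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "[PHONE]": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub(text: str) -> str:
    for token, pattern in PATTERNS.items():
        text = pattern.sub(token, text)
    return text

def scrub_messages(messages: list[dict]) -> list[dict]:
    # Redact history before re-sending: every turn, not just the latest one.
    return [{**m, "content": scrub(m["content"])} for m in messages]
```

The same `scrub` call should run on retrieved RAG chunks and tool results before they are injected into the prompt, so no channel bypasses redaction.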

| PII Source | Example | Risk Level | Mitigation |
| --- | --- | --- | --- |
| User message | "My SSN is 123-45-6789, can you help?" | High | Detect and redact before sending to model |
| Conversation history | Previous turns containing names, emails, addresses | High | Redact history before re-sending |
| RAG retrieval | Documents containing other users' data | Critical | Tenant isolation in vector store, metadata filtering |
| System prompt | User profile injected for personalization | Medium | Minimize profile data; use IDs, not names |
| Tool results | Database query returns with PII fields | High | Select only needed fields, redact before injection |
| Error messages | Stack traces containing user data | Medium | Sanitize error messages before including in prompt |
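
For the tool-results row, the "select only needed fields" mitigation amounts to projecting each database row down to an allowlist before it reaches the prompt. A minimal sketch, with assumed field names:

```python
# Allowlist of fields the model actually needs; everything else is dropped
# before the tool result is injected into the prompt.
NEEDED_FIELDS = {"order_id", "status", "total"}

def minimize(row: dict) -> dict:
    return {k: v for k, v in row.items() if k in NEEDED_FIELDS}

row = {
    "order_id": 42,
    "status": "shipped",
    "total": 19.99,
    "customer_email": "alice@example.com",  # PII: never reaches the model
    "ship_address": "1 Main St",            # PII: never reaches the model
}
```

An allowlist is safer than a blocklist here: new PII columns added to the schema later are excluded by default instead of silently leaking.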
RAG Cross-Contamination Is the Biggest Risk

If your RAG system does not enforce tenant isolation, User A's query can retrieve User B's documents. That is a data breach. Always filter retrieval results by user/tenant ID using metadata filters that the vector store applies at query time, not post-retrieval filtering in application code: with post-retrieval filtering, another tenant's documents have already been fetched into your application, and a single missed or buggy filter leaks them into the prompt.
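
The difference can be sketched with a toy in-memory store standing in for a real vector database. The class and method names are illustrative; most vector stores accept an equivalent metadata filter argument on the search call itself:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    tenant_id: str

class Store:
    """Toy stand-in for a vector store; `search` is keyword match, not ANN."""

    def __init__(self, docs: list[Doc]):
        self.docs = docs

    def search(self, query: str, tenant_id: str) -> list[Doc]:
        # The tenant filter is part of the query, so other tenants' documents
        # are never candidates — not fetched and then discarded afterwards.
        return [d for d in self.docs
                if d.tenant_id == tenant_id and query in d.text]

store = Store([
    Doc("refund policy", tenant_id="tenant_a"),
    Doc("refund history for Bob", tenant_id="tenant_b"),
])
results = store.search("refund", tenant_id="tenant_a")
```

With the filter inside `search`, tenant B's document is excluded from the candidate set entirely, which is the property a query-time metadata filter gives you in a real vector store.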