phi-shield

Detect, mask, redact, or de-identify Protected Health Information (PHI) and Personally Identifiable Information (PII) from any file or text, in compliance with HIPAA Safe Harbor (45 CFR §164.514). Use this skill whenever the user wants to: redact PHI or PII from documents, de-identify patient data, anonymize health records, mask sensitive fields before sharing data, check whether a file contains PHI, scrub clinical notes or EHR exports, prepare a dataset for research or analytics, comply with HIPAA de-identification requirements, or sanitize CSV/Excel/text/PDF/DOCX files of patient identifiers. Triggers on: PHI, PII, HIPAA, de-identify, anonymize, redact, mask, scrub, sanitize, patient data, health records, clinical notes, EHR, medical records, safe harbor, 18 identifiers, protected health information, personally identifiable.

Status: active
IDE: codex
Version: 1.0.0
Owner: jnishan5
Tags: hipaa, phi, pii, healthcare, de-identification, redaction, compliance, safe-harbor, privacy

PHI Shield — HIPAA-Compliant De-identification Skill

Detect, mask, and redact PHI/PII in structured data (CSV, Excel), unstructured text (clinical notes, emails, reports), and documents (DOCX, PDF) using a two-layer approach: regex-based pattern matching for structured identifiers, plus NLP-based named entity recognition (NER) for names and contextual entities.

Legal disclaimer: This skill implements the HIPAA Safe Harbor de-identification method (45 CFR §164.514(b)(2)) as a technical control. It is NOT a substitute for legal counsel or for a formal Expert Determination assessment (45 CFR §164.514(b)(1)). Always have qualified personnel review outputs before sharing or publishing de-identified data.

Mode selection

Ask the user which mode they need if not already specified:

| Mode | What it does | Use when |
|---|---|---|
| `detect` | Scan and report what PHI/PII is found (no changes) | Auditing a file |
| `mask` | Replace PHI with type labels: `[PATIENT_NAME]`, `[SSN]` | Readable output needed |
| `redact` | Replace PHI with `█████` or `[REDACTED]` | Strongest privacy |
| `pseudonymize` | Replace with consistent fake values | Downstream analytics need structure |
| `safe-harbor` | Full HIPAA Safe Harbor: remove all 18 identifiers | Research/sharing compliance |

If the user expresses no preference, default to mask.
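To make the difference between the modes concrete, the sketch below applies four of them to a single detected value. It is only an illustration: `apply_mode`, the `key` parameter, and the pseudonym format are hypothetical names and choices, not the interface of the skill's scripts. The `safe-harbor` mode is left out on purpose because it applies category-specific rules (dates, ZIP codes) rather than one uniform replacement.

```python
import hashlib
import hmac


def apply_mode(value, category, mode="mask", key=b""):
    """Return the replacement for one detected PHI value (illustrative only)."""
    if mode == "detect":
        # detect only reports findings; the text itself is left unchanged
        return value
    if mode == "mask":
        return f"[{category}]"
    if mode == "redact":
        return "[REDACTED]"
    if mode == "pseudonymize":
        # Assumed scheme: a keyed hash gives the same surrogate for the same
        # input, so joins and group-bys still work downstream.
        if not key:
            raise ValueError("pseudonymize needs a secret key")
        digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
        return f"{category}_{digest[:8]}"
    raise ValueError(f"unsupported mode: {mode}")
```

For example, `apply_mode("123-45-6789", "SSN")` yields `[SSN]`, while the same call with `mode="redact"` yields `[REDACTED]`.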

Step-by-step workflow

Step 1: Identify input type

file /mnt/user-data/uploads/<filename>
stat -c '%s bytes' /mnt/user-data/uploads/<filename>

Route by extension:

.csv / .tsv → structured pipeline (redact_structured.py)
.xlsx / .xls → structured pipeline (redact_structured.py)
.txt / .md / .log → unstructured pipeline (redact_text.py)
.docx → unstructured pipeline (redact_text.py with docx support)
.pdf → extract text first, then unstructured pipeline
Raw text pasted in chat → unstructured pipeline (write the text to /tmp/input.txt, then run redact_text.py on it; see Step 3)
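The routing rules above can be sketched as a small lookup on the file extension. The function name `route_pipeline` and the two extension sets are illustrative assumptions; only the extensions and script names come from the list above.

```python
from pathlib import Path

# Extensions taken from the routing list; grouping into sets is an assumption.
STRUCTURED = {".csv", ".tsv", ".xlsx", ".xls"}
UNSTRUCTURED = {".txt", ".md", ".log", ".docx", ".pdf"}


def route_pipeline(filename):
    """Return the script name that should handle the given file."""
    ext = Path(filename).suffix.lower()
    if ext in STRUCTURED:
        return "redact_structured.py"
    if ext in UNSTRUCTURED:
        # PDFs still need their text extracted before this pipeline runs.
        return "redact_text.py"
    raise ValueError(f"unsupported file type: {ext or '(no extension)'}")
```

Raising on unknown extensions, rather than guessing a pipeline, avoids silently passing a file through without redaction.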

Step 2: Install dependencies

pip install pandas openpyxl python-docx pdfminer.six \
    presidio-analyzer presidio-anonymizer spacy \
    --break-system-packages -q

python -m spacy download en_core_web_lg --quiet 2>/dev/null || \
python -m spacy download en_core_web_sm --quiet
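Before running either script it can help to confirm that the installation succeeded. The following is a minimal sketch using only the standard library; `missing_packages` and the `REQUIRED` mapping are hypothetical helpers, and the mapping simply pairs each pip distribution named above with its importable module name.

```python
from importlib.util import find_spec

# pip distribution name -> top-level module it installs
REQUIRED = {
    "pandas": "pandas",
    "openpyxl": "openpyxl",
    "python-docx": "docx",
    "pdfminer.six": "pdfminer",
    "presidio-analyzer": "presidio_analyzer",
    "presidio-anonymizer": "presidio_anonymizer",
    "spacy": "spacy",
}


def missing_packages(required=REQUIRED):
    """List the distributions whose module cannot be found on this interpreter."""
    return [dist for dist, module in required.items() if find_spec(module) is None]
```

An empty list from `missing_packages()` means every dependency is importable; it does not check whether a spaCy model has been downloaded.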

Step 3: Run the appropriate script

Structured data (CSV/Excel):

python /path/to/phi-shield/scripts/redact_structured.py \
    "<input_path>" \
    "<output_path>" \
    --mode mask \
    --audit /tmp/phi_audit.json

Unstructured text/DOCX/PDF:

python /path/to/phi-shield/scripts/redact_text.py \
    "<input_path>" \
    "<output_path>" \
    --mode mask \
    --audit /tmp/phi_audit.json

Inline text (pasted in chat): Write the text to /tmp/input.txt first, then run redact_text.py on it.
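If the scripts are driven from Python rather than the shell, the invocation above might look like the sketch below. The `--mode` and `--audit` flags and the positional arguments are taken from the commands in this step; the helper names are hypothetical. For pasted text, the sketch uses a uniquely named temporary file as an alternative to the fixed /tmp/input.txt path, which is an assumption about what the workflow tolerates, not something the skill specifies.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def build_command(script, input_path, output_path,
                  mode="mask", audit="/tmp/phi_audit.json"):
    """Assemble the argument list for one redaction run."""
    return [sys.executable, str(script), str(input_path), str(output_path),
            "--mode", mode, "--audit", str(audit)]


def write_inline_text(text):
    """Save pasted text to a private temporary file and return its path."""
    # NamedTemporaryFile creates the file readable by the current user only.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                     encoding="utf-8") as handle:
        handle.write(text)
        return Path(handle.name)


def run_redaction(script, input_path, output_path, mode="mask"):
    """Run the script and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(script, input_path, output_path, mode),
                   check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. The temporary file holds unredacted text, so it should be deleted once the run finishes.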

Step 4: Read and present the audit report

Read /tmp/phi_audit.json after the script completes. Always show the user:

Total PHI instances found (by category)
Which columns/sections were affected
Confidence breakdown (high / medium / low)
Any items flagged for manual review
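A summary covering the four items above could be derived from the audit data roughly as follows. The field names (`detections`, `category`, `confidence`, `location`) are assumptions made for this sketch; the authoritative layout is in references/audit_schema.md. The 0.6 review threshold comes from the output quality standards, while the 0.85 cut-off between medium and high is invented here for illustration.

```python
import json
from collections import Counter

REVIEW_THRESHOLD = 0.6   # below this, a detection needs manual review
HIGH_THRESHOLD = 0.85    # assumed boundary between "medium" and "high"


def bucket(confidence):
    """Map a numeric confidence to high / medium / low."""
    if confidence >= HIGH_THRESHOLD:
        return "high"
    if confidence >= REVIEW_THRESHOLD:
        return "medium"
    return "low"


def summarize_audit(audit):
    """Condense an audit dict into the figures shown to the user."""
    detections = audit.get("detections", [])
    return {
        "total": len(detections),
        "by_category": dict(Counter(d["category"] for d in detections)),
        "by_confidence": dict(Counter(bucket(d["confidence"]) for d in detections)),
        "locations": sorted({d["location"] for d in detections}),
        "needs_manual_review": [d for d in detections
                                if d["confidence"] < REVIEW_THRESHOLD],
    }


def load_audit(path="/tmp/phi_audit.json"):
    """Read the audit file written by the redaction script."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Typical use would be `summarize_audit(load_audit())` once the script has finished.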

Step 5: Save and present output

cp <output_path> /mnt/user-data/outputs/<original_name>_deidentified.<ext>

Call present_files on the output and the audit JSON.
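The naming convention in the copy command, `<original_name>_deidentified.<ext>`, can be produced with `pathlib`. This is a small sketch with hypothetical helper names; the default destination is the outputs directory used above.

```python
import shutil
from pathlib import Path


def deidentified_name(original_name):
    """Insert "_deidentified" between the file stem and its extension."""
    path = Path(original_name)
    return f"{path.stem}_deidentified{path.suffix}"


def save_output(output_path, original_name, dest_dir="/mnt/user-data/outputs"):
    """Copy the redacted file into dest_dir under the de-identified name."""
    destination = Path(dest_dir) / deidentified_name(original_name)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(output_path, destination)
    return destination
```

For instance, `deidentified_name("patients.csv")` gives `patients_deidentified.csv`.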

PHI categories detected

See references/phi_categories.md for the full pattern library and NER labels. The 18 HIPAA Safe Harbor identifiers covered:

Names (patient, relative, employer) — NER + patterns
Geographic subdivisions < state (address, city, county, ZIP) — patterns + NER
Dates (except year): birth, admission, discharge, death — patterns
Phone numbers — patterns
Fax numbers — patterns
Email addresses — patterns
Social Security numbers — patterns
Medical record numbers — patterns
Health plan beneficiary numbers — patterns
Account numbers — patterns
Certificate / license numbers — patterns
Vehicle identifiers and license plates — patterns
Device identifiers and serial numbers — patterns
URLs — patterns
IP addresses — patterns
Biometric identifiers (finger/voice prints) — keyword detection
Full-face photos and comparable images — flagged (cannot auto-redact image content)
Any other unique identifying code — heuristic + patterns
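For the identifiers marked "patterns" above, detection amounts to running a set of regular expressions over the text. The sketch below shows the idea for four categories. These expressions are simplified examples written for this illustration; they are not the pattern library in references/phi_categories.md and will miss many real-world formats.

```python
import re

# Deliberately simple example patterns (assumptions, not the skill's library).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "PHONE": re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_phi(text):
    """Return (category, start, end) spans, ordered by position in the text."""
    hits = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Only offsets are returned, so the matched value is never echoed.
            hits.append((category, match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[1])
```

Returning offsets instead of matched strings keeps the original values out of logs, in line with the rule against echoing PHI.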

Additional PII (non-HIPAA but commonly needed):

Passport numbers
Credit card numbers (PAN)
Bank account / routing numbers
National ID numbers (non-US)
Gender / race / ethnicity (quasi-identifier, flagged with low confidence)
Employer names (quasi-identifier)
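Digit-based patterns for card numbers tend to match arbitrary long numbers. One common way to cut false positives, shown here as an assumption about how such a check could be added rather than a description of the skill's scripts, is to validate candidates with the Luhn checksum that payment card numbers use.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False  # outside the usual length range for card numbers
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            # Double every second digit from the right; subtract 9 if > 9.
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

The widely used test number `4111 1111 1111 1111` passes this check, while changing its last digit to 2 makes it fail.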

Output quality standards

Masks must be consistent within a document: the same name always maps to the same token ([PATIENT_NAME_1], [PATIENT_NAME_2], etc.)
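Dates must be handled per the Safe Harbor rule: remove month and day and keep the year, unless the date would reveal an age over 89, in which case drop the year as well. Report any age over 89 as the single category "90+".
ZIP codes: keep only the first 3 digits, and only if the area formed by all ZIP codes sharing those 3 digits has more than 20,000 people according to current Census Bureau data; otherwise replace them with 000 (see references/phi_categories.md for the rule)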
Audit report must list every detection with: category, confidence, line/column reference, action taken, and a non-reversible token (salted hash) referencing the value — original PHI must never be persisted in audit logs (see references/audit_schema.md)
Never log or echo original PHI values to stdout in production mode
If confidence < 0.6 on any detection, flag it in audit as "needs manual review"
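Several of the standards above are mechanical enough to sketch in code. The helpers below are hypothetical and simplified: the set of low-population ZIP prefixes must be supplied by the caller from current Census data, the age rule is shown for birth dates only, and the audit token uses HMAC-SHA-256 as one possible reading of "salted hash".

```python
import hashlib
import hmac
from datetime import date


def generalize_zip(zip_code, low_population_zip3):
    """Keep the 3-digit ZIP prefix unless it is in the low-population set."""
    zip3 = zip_code.strip()[:3]
    return "000" if zip3 in low_population_zip3 else zip3


def generalize_birth_date(birth, today):
    """Reduce a birth date to its year, or to "90+" when the age exceeds 89."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    age = today.year - birth.year - (0 if had_birthday else 1)
    return "90+" if age > 89 else str(birth.year)


def audit_token(value, salt):
    """Keyed hash that references a value without storing it in the audit log."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


class TokenMap:
    """Hand out one stable numbered token per distinct value and category."""

    def __init__(self):
        self._tokens = {}

    def token_for(self, category, value):
        key = (category, value.casefold())
        if key not in self._tokens:
            # Number tokens per category in order of first appearance.
            count = sum(1 for cat, _ in self._tokens if cat == category)
            self._tokens[key] = f"[{category}_{count + 1}]"
        return self._tokens[key]
```

With a single `TokenMap` per document, "Jane Doe" and "jane doe" both map to `[PATIENT_NAME_1]` and the next distinct name receives `[PATIENT_NAME_2]`. The salt passed to `audit_token` has to be kept secret and out of the audit file, otherwise short values such as SSNs could be recovered by brute force.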

Reference files

references/phi_categories.md — Full regex pattern library + NER label mappings
references/safe_harbor_rules.md — Exact HIPAA 45 CFR §164.514 rules with implementation guidance for edge cases (ZIP, dates, ages, re-ID codes)
references/gdpr_extensions.md — Additional rules for GDPR Art. 4 / UK GDPR when European patient data is involved
references/audit_schema.md — JSON schema for the audit report output
