
phi-shield

Detect, mask, redact, or de-identify Protected Health Information (PHI) and Personally Identifiable Information (PII) from any file or text, in compliance with HIPAA Safe Harbor (45 CFR §164.514). Use this skill whenever the user wants to: redact PHI or PII from documents, de-identify patient data, anonymize health records, mask sensitive fields before sharing data, check whether a file contains PHI, scrub clinical notes or EHR exports, prepare a dataset for research or analytics, comply with HIPAA de-identification requirements, or sanitize CSV/Excel/text/PDF/DOCX files of patient identifiers. Triggers on: PHI, PII, HIPAA, de-identify, anonymize, redact, mask, scrub, sanitize, patient data, health records, clinical notes, EHR, medical records, safe harbor, 18 identifiers, protected health information, personally identifiable.

Status: active
IDE: codex
Version: 1.0.0
Owner: jnishan5
Tags: hipaa, phi, pii, healthcare, de-identification, redaction, compliance, safe-harbor, privacy

PHI Shield — HIPAA-Compliant De-identification Skill

Detect, mask, and redact PHI/PII from structured data (CSV, Excel), unstructured text (clinical notes, emails, reports), and documents (DOCX, PDF), using a two-layer approach: regex-based pattern matching for structured identifiers + NLP-based NER for names and contextual entities.

Legal disclaimer: This skill implements HIPAA Safe Harbor de-identification (45 CFR §164.514(b)(2)) as a technical control. It is NOT a substitute for legal counsel or for a formal Expert Determination assessment (45 CFR §164.514(b)(1)). Always have qualified personnel review outputs before sharing or publishing de-identified data.


Mode selection

Ask the user which mode they need if not already specified:

| Mode | What it does | Use when |
|---|---|---|
| detect | Scan and report what PHI/PII is found; no changes | Auditing a file |
| mask | Replace PHI with type labels: [PATIENT_NAME], [SSN] | Readable output needed |
| redact | Replace PHI with █████ or [REDACTED] | Strongest privacy |
| pseudonymize | Replace with consistent fake values | Downstream analytics need structure |
| safe-harbor | Full HIPAA Safe Harbor: remove all 18 identifiers | Research/sharing compliance |

Default to mask if unspecified.
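
To make the difference between the simpler modes concrete, here is a minimal sketch of how detect, mask, and redact might treat the same match. The function name `apply_mode`, the single SSN pattern, and the replacement strings are assumptions for illustration; the skill's scripts may behave differently.

```python
import re

# Illustrative pattern only: matches SSN-like strings such as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def apply_mode(text: str, mode: str = "mask") -> str:
    """Apply one of the simple modes to SSN-like matches (hypothetical helper)."""
    if mode == "detect":
        return text                                  # report only, text unchanged
    if mode == "mask":
        return SSN_PATTERN.sub("[SSN]", text)        # type label stays readable
    if mode == "redact":
        return SSN_PATTERN.sub("[REDACTED]", text)   # no type information kept
    raise ValueError(f"mode not covered by this sketch: {mode}")

sample = "SSN on file: 123-45-6789."   # made-up value
print(apply_mode(sample))              # SSN on file: [SSN].
print(apply_mode(sample, "redact"))    # SSN on file: [REDACTED].
```

The default argument mirrors the rule above: when no mode is given, the text is masked.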


Step-by-step workflow

Step 1: Identify input type

file /mnt/user-data/uploads/<filename>
stat -c '%s bytes' /mnt/user-data/uploads/<filename>

Route by extension:

  • .csv / .tsv → structured pipeline (redact_structured.py)
  • .xlsx / .xls → structured pipeline (redact_structured.py)
  • .txt / .md / .log → unstructured pipeline (redact_text.py)
  • .docx → unstructured pipeline (redact_text.py with docx support)
  • .pdf → extract text first, then unstructured pipeline
  • Raw text pasted in chat → run inline detection (redact_text.py with stdin)
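
The routing list above can be sketched as a small lookup. The function name `route` and its return shape (script name plus a flag saying whether text must be extracted first) are assumptions made for this example, not part of the skill.

```python
from pathlib import Path

# Extension sets copied from the routing list above.
STRUCTURED = {".csv", ".tsv", ".xlsx", ".xls"}
UNSTRUCTURED = {".txt", ".md", ".log", ".docx", ".pdf"}

def route(filename: str) -> tuple[str, bool]:
    """Return (script name, needs_text_extraction) for a file (hypothetical helper)."""
    ext = Path(filename).suffix.lower()
    if ext in STRUCTURED:
        return "redact_structured.py", False
    if ext in UNSTRUCTURED:
        # PDFs need their text extracted before the unstructured pipeline runs.
        return "redact_text.py", ext == ".pdf"
    raise ValueError(f"no pipeline for extension {ext!r}")

print(route("labs.csv"))    # ('redact_structured.py', False)
print(route("notes.PDF"))   # ('redact_text.py', True)
```

Raising on unknown extensions, rather than guessing a pipeline, keeps an unsupported file from being reported as clean.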

Step 2: Install dependencies

pip install pandas openpyxl python-docx pdfminer.six \
    presidio-analyzer presidio-anonymizer spacy \
    --break-system-packages -q

python -m spacy download en_core_web_lg --quiet 2>/dev/null || \
python -m spacy download en_core_web_sm --quiet

Step 3: Run the appropriate script

Structured data (CSV/Excel):

python /path/to/phi-shield/scripts/redact_structured.py \
    "<input_path>" \
    "<output_path>" \
    --mode mask \
    --audit /tmp/phi_audit.json

Unstructured text/DOCX/PDF:

python /path/to/phi-shield/scripts/redact_text.py \
    "<input_path>" \
    "<output_path>" \
    --mode mask \
    --audit /tmp/phi_audit.json

Inline text (pasted in chat): Write the text to /tmp/input.txt first, then run redact_text.py on it.

Step 4: Read and present the audit report

Read /tmp/phi_audit.json after the script completes. Always show the user:

  • Total PHI instances found (by category)
  • Which columns/sections were affected
  • Confidence breakdown (high / medium / low)
  • Any items flagged for manual review
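
A summary like the one listed above could be computed along these lines. The audit layout used here (a top-level `detections` list whose entries carry `category` and `confidence`) is an assumption; the real field names are defined in references/audit_schema.md. The 0.6 review threshold comes from the quality standards in this document, while the 0.85 cut-off for "high" is purely illustrative.

```python
from collections import Counter

REVIEW_THRESHOLD = 0.6   # below this, a detection needs manual review
HIGH_THRESHOLD = 0.85    # assumption: the source does not define the bands

def band(confidence: float) -> str:
    """Bucket a confidence score into high / medium / low."""
    if confidence >= HIGH_THRESHOLD:
        return "high"
    if confidence >= REVIEW_THRESHOLD:
        return "medium"
    return "low"

def summarize(audit: dict) -> dict:
    """Aggregate counts for the user-facing report (hypothetical schema)."""
    detections = audit.get("detections", [])
    return {
        "total": len(detections),
        "by_category": dict(Counter(d["category"] for d in detections)),
        "by_confidence": dict(Counter(band(d["confidence"]) for d in detections)),
        "needs_review": sum(1 for d in detections if d["confidence"] < REVIEW_THRESHOLD),
    }

# Made-up audit content, for illustration only.
example = {"detections": [
    {"category": "SSN", "confidence": 0.97},
    {"category": "NAME", "confidence": 0.72},
    {"category": "NAME", "confidence": 0.41},
]}
print(summarize(example))
```

In practice the dictionary would come from `json.load` on /tmp/phi_audit.json rather than from an inline literal.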

Step 5: Save and present output

cp <output_path> /mnt/user-data/outputs/<original_name>_deidentified.<ext>

Call present_files on the output and the audit JSON.


PHI categories detected

See references/phi_categories.md for the full pattern library and NER labels. The 18 HIPAA Safe Harbor identifiers covered:

  1. Names (patient, relative, employer) — NER + patterns
  2. Geographic subdivisions < state (address, city, county, ZIP) — patterns + NER
  3. Dates (except year): birth, admission, discharge, death — patterns
  4. Phone numbers — patterns
  5. Fax numbers — patterns
  6. Email addresses — patterns
  7. Social Security numbers — patterns
  8. Medical record numbers — patterns
  9. Health plan beneficiary numbers — patterns
  10. Account numbers — patterns
  11. Certificate / license numbers — patterns
  12. Vehicle identifiers and license plates — patterns
  13. Device identifiers and serial numbers — patterns
  14. URLs — patterns
  15. IP addresses — patterns
  16. Biometric identifiers (finger/voice prints) — keyword detection
  17. Full-face photos and comparable images — flagged (cannot auto-redact image content)
  18. Any other unique identifying code — heuristic + patterns
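
For the identifiers marked "patterns" above, detection can be as simple as a table of regular expressions. The sketch below covers four of them with deliberately simplified patterns; these are assumptions for illustration and are not taken from references/phi_categories.md. It returns only category and character offsets, so the matched values are never echoed.

```python
import re

# Simplified, illustrative patterns (US-centric; real libraries are stricter).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "PHONE": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-.]\d{4}\b"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def detect(text: str) -> list[tuple[str, int, int]]:
    """Return (category, start, end) for every match, ordered by position."""
    hits = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((category, match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[1])

# Made-up sample text.
sample = "Call 555-123-4567 or mail jane@example.org from 10.0.0.7; SSN 123-45-6789."
print([category for category, _, _ in detect(sample)])
# ['PHONE', 'EMAIL', 'IPV4', 'SSN']
```

Names and other contextual entities are not reachable this way, which is why the skill pairs patterns with NER.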

Additional PII (non-HIPAA but commonly needed):

  • Passport numbers
  • Credit card numbers (PAN)
  • Bank account / routing numbers
  • National ID numbers (non-US)
  • Gender / race / ethnicity (quasi-identifier, flagged with low confidence)
  • Employer names (quasi-identifier)

Output quality standards

  • Masks must be consistent within a document: the same name always maps to the same token ([PATIENT_NAME_1], [PATIENT_NAME_2], etc.)
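  • Dates must follow the Safe Harbor rule: remove month and day and keep only the year. For individuals older than 89, also remove the year and any other element that reveals the age, and report the age as "90+"
  • ZIP codes: keep only the first 3 digits, and only if the combined area of all ZIP codes sharing those 3 digits has more than 20,000 people; otherwise replace with 000 (see references/phi_categories.md for the rule)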
  • Audit report must list every detection with: category, confidence, line/column reference, action taken, and a non-reversible token (salted hash) referencing the value — original PHI must never be persisted in audit logs (see references/audit_schema.md)
  • Never log or echo original PHI values to stdout in production mode
  • If confidence < 0.6 on any detection, flag it in audit as "needs manual review"
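
Several of the standards above can be expressed in a few lines each. The following is a minimal sketch under stated assumptions: the names `Tokenizer`, `audit_ref`, `generalize_zip`, `generalize_age`, and `needs_review` are hypothetical, HMAC-SHA256 is one possible reading of "salted hash", and the set of populous ZIP prefixes must be supplied by the caller from census data (the value used below is illustrative only).

```python
import hashlib
import hmac

class Tokenizer:
    """Give each distinct value a stable numbered token within one document."""
    def __init__(self) -> None:
        self._seen: dict[tuple[str, str], str] = {}

    def token(self, category: str, value: str) -> str:
        key = (category, value.strip().lower())
        if key not in self._seen:
            n = sum(1 for c, _ in self._seen if c == category) + 1
            self._seen[key] = f"[{category}_{n}]"
        return self._seen[key]

def audit_ref(value: str, salt: bytes) -> str:
    """Non-reversible reference for the audit log; the value itself is not stored."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def generalize_zip(zip_code: str, populous_prefixes: set[str]) -> str:
    """Keep the 3-digit prefix only when its area has more than 20,000 people."""
    prefix = zip_code[:3]
    return prefix if prefix in populous_prefixes else "000"

def generalize_age(age: int) -> str:
    """Aggregate ages over 89 into a single 90+ category."""
    return "90+" if age > 89 else str(age)

def needs_review(confidence: float) -> bool:
    """Flag detections below the 0.6 threshold for manual review."""
    return confidence < 0.6

tokens = Tokenizer()
print(tokens.token("PATIENT_NAME", "Jane Doe"))    # [PATIENT_NAME_1]
print(tokens.token("PATIENT_NAME", "John Roe"))    # [PATIENT_NAME_2]
print(tokens.token("PATIENT_NAME", "jane doe "))   # [PATIENT_NAME_1]
print(generalize_zip("02139", {"021"}))            # 021 (prefix set is illustrative)
print(generalize_age(92))                          # 90+
```

Normalizing case and whitespace before lookup is a design choice made here so that trivially different spellings of one name share a token; a stricter or looser normalization may suit other data.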

Reference files

  • references/phi_categories.md — Full regex pattern library + NER label mappings
  • references/safe_harbor_rules.md — Exact HIPAA 45 CFR §164.514 rules with implementation guidance for edge cases (ZIP, dates, ages, re-ID codes)
  • references/gdpr_extensions.md — Additional rules for GDPR Art. 4 / UK GDPR when European patient data is involved
  • references/audit_schema.md — JSON schema for the audit report output
