PII Redaction Text Classifier
This project builds a machine learning model that detects and redacts personally identifiable information (PII) in internal text data, protecting customer privacy and producing compliant test datasets.
- PII Detection and Redaction: Automatically detects and redacts sensitive data types, including names, phone numbers, and financial information.
- Compliance-Friendly Dataset Generation: Masks PII fields so that internal datasets can meet privacy and regulatory requirements.
- Scalable and Extensible: Modular code allows for scaling and adapting to additional PII types or sources.
- Preprocessing and Vectorization: Utilizes scikit-learn for efficient data processing and transformation.
- Deep Learning with DeBERTa: Employs DeBERTa for high-accuracy PII entity recognition and context-aware redaction.
- Regex-Based Masking: Complements DeBERTa with regular expressions for faster, pattern-based PII detection, improving accuracy on data with well-defined formats.
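To illustrate the regex-based masking idea, here is a minimal sketch using only Python's standard `re` module. The patterns, the `PII_PATTERNS` table, and the `mask_pii` function are hypothetical names chosen for this example, not the project's actual rules; real-world formats (international phone numbers, other card layouts) would need broader patterns.

```python
import re

# Hypothetical patterns for illustration only; not the project's actual rules.
# Each label maps to a pattern for a PII type with a well-defined format.
PII_PATTERNS = {
    "CARD": re.compile(r"\b\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def mask_pii(text: str) -> str:
    """Replace every pattern match with a bracketed label such as [PHONE]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(mask_pii("Call 555-123-4567 or email jane.doe@example.com"))
# prints: Call [PHONE] or email [EMAIL]
```

Patterns like these only cover PII with a predictable shape. Free-form entities such as names depend on context, which is where a model like DeBERTa would take over.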
PII Masking Dataset (200k): A dataset with labeled PII used to train and validate the redaction model.
- Dataset: Add more attributes to improve PII identification, and find a more comprehensive and representative dataset.
- Models: Reduce redundant labels in the output, and build an ensemble model that generalizes better across attributes.
- Ayan Gaur
- Joy Chang
- Shrieyaa Sekar Jayanthi
- Shubhangi Waldiya
- Stella Huang
- Break Through Tech AI at UCLA & American Express