Implemented a pretrained model and trained DeBERTa-base and DistilBERT on a corrected and cleaned dataset to identify and redact personally identifiable information (PII) and contextual data from text.

Amex2B/AmexSensitiveData

Sensitive Data Discovery with American Express

PII Redaction Text Classifier

This project builds a machine learning model to detect and redact sensitive personal information (PII) from internal text data, enhancing customer privacy and generating compliant test datasets.

📌 Key Features

  • PII Detection and Redaction: Automatically detects and redacts sensitive data types, including names, phone numbers, and financial information.
  • Compliance-Friendly Dataset Generation: Ensures that internal datasets meet privacy and regulatory requirements by masking PII fields.
  • Scalable and Extensible: Modular code allows for scaling and adapting to additional PII types or sources.
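As a rough illustration of the redaction step described above, the sketch below masks character spans in a string once some detector has produced them. The `Span` tuple layout and the `redact` helper are hypothetical names chosen for this example, not taken from the project's code, and the sketch assumes the spans do not overlap.

```python
from typing import List, Tuple

# Hypothetical span format for this sketch: (start, end, label), end exclusive.
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span]) -> str:
    """Replace each detected span with a bracketed label such as [NAME].

    Assumes the spans do not overlap.
    """
    # Work from right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


example = "Jane Doe paid with card 4111111111111111"
print(redact(example, [(0, 8, "NAME"), (24, 40, "CARD")]))
```

Keeping detection (producing spans) separate from masking (applying them) is one way a new PII type could be added without touching the redaction logic.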

🔧 Techniques Used

  • Preprocessing and Vectorization: Utilizes scikit-learn for efficient data processing and transformation.
  • Deep Learning with DeBERTa: Employs DeBERTa for high-accuracy PII entity recognition and context-aware redaction.
  • Regex-Based Masking: Complements DeBERTa with regular expressions for fast, pattern-based detection of PII that follows well-defined formats, improving overall detection accuracy on those patterns.
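To make the regex-based masking idea concrete, here is a minimal sketch. The patterns, labels, and the `mask_with_regex` function are assumptions made for illustration; the project's actual rules are not shown in this README, and real-world formats (international phone numbers, for instance) would need broader patterns.

```python
import re

# Hypothetical patterns for illustration only; not the project's actual rules.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),  # US-style 3-3-4 digits
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),              # US-style 3-2-4 digits
}


def mask_with_regex(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(mask_with_regex("Call 555-123-4567 or mail jane.doe@example.com"))
```

Patterns like these only catch PII with a rigid shape. Free-form entities such as names still depend on the model.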

📚 Dataset

PII Masking Dataset (200k): A dataset with labeled PII used to train and validate the redaction model.

👣 Potential Next Steps

  • Dataset: Add more attributes to improve data identification, and find a more comprehensive and representative dataset.
  • Models: Reduce redundant labels in the output, and build an ensemble model that generalizes better across attributes.
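One possible reading of "redundant labels" is that adjacent pieces of a single entity each receive their own tag, so a full name is masked as two separate labels. Under that assumption, a post-processing pass could merge neighbouring spans that share a label. The `merge_adjacent` helper and its `max_gap` parameter below are hypothetical and only sketch the idea.

```python
from typing import List, Tuple

# Hypothetical span format for this sketch: (start, end, label), end exclusive.
Span = Tuple[int, int, str]


def merge_adjacent(spans: List[Span], max_gap: int = 1) -> List[Span]:
    """Merge spans with the same label that overlap or sit within max_gap characters."""
    merged: List[Span] = []
    for start, end, label in sorted(spans):
        if merged and merged[-1][2] == label and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            # Extend the previous span instead of emitting a second label.
            merged[-1] = (prev_start, max(prev_end, end), label)
        else:
            merged.append((start, end, label))
    return merged


print(merge_adjacent([(0, 4, "NAME"), (5, 8, "NAME"), (24, 40, "CARD")]))
```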

Contributors:

  • Ayan Gaur
  • Joy Chang
  • Shrieyaa Sekar Jayanthi
  • Shubhangi Waldiya
  • Stella Huang
  • Break Through Tech AI at UCLA & American Express
