Sensitive Data Discovery with American Express

PII Redaction Text Classifier

This project builds a machine learning model that detects and redacts personally identifiable information (PII) in internal text data, protecting customer privacy and producing privacy-compliant test datasets.

📌 Key Features

PII Detection and Redaction: Automatically detects and redacts sensitive data types, including names, phone numbers, and financial information.
Compliance-Friendly Dataset Generation: Helps internal datasets meet privacy and regulatory requirements by masking PII fields.
Scalable and Extensible: Modular code allows for scaling and adapting to additional PII types or sources.
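
To illustrate the modular design described above, here is a minimal sketch in which each PII type is handled by its own detector function and a shared `redact` step applies the results. The names `Span`, `Detector`, `find_ssn_like`, and `redact` are hypothetical and not taken from this repository; the SSN-like pattern is only an example of an additional PII type.

```python
import re
from typing import Callable, List, Tuple

# (start, end, label) with an exclusive end index. Hypothetical type, for illustration only.
Span = Tuple[int, int, str]
Detector = Callable[[str], List[Span]]


def find_ssn_like(text: str) -> List[Span]:
    """Hypothetical detector: flags NNN-NN-NNNN patterns as SSN."""
    return [
        (m.start(), m.end(), "SSN")
        for m in re.finditer(r"\b\d{3}-\d{2}-\d{4}\b", text)
    ]


def redact(text: str, detectors: List[Detector]) -> str:
    """Run every detector and replace each detected span with a [LABEL] placeholder."""
    spans = [span for detect in detectors for span in detect(text)]
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip a span that overlaps one already redacted
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Under this layout, supporting a new PII type would only require writing another detector and adding it to the list passed to `redact`.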

🔧 Techniques Used

Preprocessing and Vectorization: Utilizes scikit-learn for efficient data processing and transformation.
Deep Learning with DeBERTa: Employs DeBERTa for high-accuracy PII entity recognition and context-aware redaction.
Regex-Based Masking: Complements DeBERTa with regular expressions for fast, pattern-based detection, improving accuracy on PII that follows well-defined formats.
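
The regex-based masking step can be sketched roughly as follows. The patterns below are simplified assumptions for illustration (US-style phone numbers, 15- or 16-digit card numbers, basic email addresses) and are not the patterns used in this repository's `regex` folder; `PATTERNS` and `mask_patterns` are hypothetical names.

```python
import re

# Simplified, illustrative patterns. Real-world PII formats vary far more than this.
# CARD is listed first so long digit runs are masked before the PHONE pattern runs.
PATTERNS = {
    "CARD": re.compile(r"\b(?:\d[ -]?){14,15}\d\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def mask_patterns(text: str) -> str:
    """Replace every match of each pattern with a [LABEL] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    sample = "Email jane.doe@example.com or call 555-123-4567."
    print(mask_patterns(sample))
```

Patterns like these are cheap to run and easy to audit, but they cannot catch context-dependent PII such as names, which is where a learned model is the better fit.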

📚 Dataset

PII Masking Dataset (200k): A dataset with labeled PII used to train and validate the redaction model.
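
As a rough illustration of how labeled PII can be turned into entities for training or evaluation, the sketch below groups token-level tags into labeled text spans. It assumes a BIO tagging scheme (`B-` begins an entity, `I-` continues it, `O` marks non-PII), which is a common convention for entity recognition; the dataset's actual schema should be checked before relying on this. The function name `bio_to_entities` is hypothetical.

```python
from typing import List, Optional, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) pairs. Assumes BIO tags."""
    entities: List[Tuple[str, str]] = []
    label: Optional[str] = None
    words: List[str] = []

    def flush() -> None:
        # Emit the entity collected so far, if any.
        if label is not None:
            entities.append((label, " ".join(words)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            # A new entity starts (an I- tag with a different label is treated as a start).
            flush()
            label, words = tag[2:], [token]
        elif tag.startswith("I-"):
            words.append(token)
        else:
            flush()
            label, words = None, []
    flush()
    return entities
```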

👣 Potential Next Steps

Dataset: Add more attributes to improve PII identification, and find a more comprehensive and representative dataset.
Models: Reduce redundant labels in the output, and build an ensemble model that generalizes better across attributes.
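
One possible direction for the model-side steps is sketched below: detections from several sources (for example, the regex rules and the DeBERTa model) are pooled, and where spans overlap only the highest-scoring one is kept. This is an assumption about how an ensemble could be built, not the project's implementation; `Detection`, `merge_detections`, and the convention of giving regex hits a score of 1.0 are all hypothetical.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    start: int    # inclusive character offset
    end: int      # exclusive character offset
    label: str
    score: float  # confidence; regex hits are assumed to use 1.0


def merge_detections(*sources: List[Detection]) -> List[Detection]:
    """Pool detections from all sources, keeping the best-scoring span among overlaps."""
    candidates = sorted(
        (d for source in sources for d in source),
        key=lambda d: -d.score,
    )
    kept: List[Detection] = []
    for cand in candidates:
        # Keep the candidate only if it does not overlap anything already kept.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda d: d.start)
```

Greedy selection by score is simple and removes duplicate or nested labels for the same text, at the cost of discarding a lower-scoring span even when it carries a more specific label.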

Contributors:

Ayan Gaur
Joy Chang
Shrieyaa Sekar Jayanthi
Shubhangi Waldiya
Stella Huang
Break Through Tech AI at UCLA & American Express

Amex2B/AmexSensitiveData
