97 changes: 70 additions & 27 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,49 +1,92 @@
![Build Status](https://github.com/spacexnu/ShadowData/actions/workflows/main.yml/badge.svg)

# ShadowData
A Python library for anonymizing, masking, and encrypting sensitive data with a small, focused API.

# ShadowData - A sensitive data handler python library
ShadowData is a Python library designed to simplify the processing and handling of sensitive data securely and efficiently.
## What it does today
- Text and pattern anonymization (free-form text replacement, IPv4, email, phone)
- Localized identifiers (US SSN, Brazil CPF/CNPJ)
- Symmetric encryption and decryption (Fernet)
- PII detection via spaCy (optional extra)

## Features (The project is under development)
Planned: richer masking helpers and reversible transforms.

- Data anonymization. (Work in progress)
- PII - Personally Identifiable Information detection using Natural Language Processing (Work in progress)
- Encryption and decryption of sensitive data. (Work in progress)
- Data masking for privacy-preserving data handling. (Work in progress)
- Compliance with GDPR, LGPD and other data protection regulations. (Work in progress)

## Installation Instructions
## Installation

```bash
pip install shadow_data
```
* Installs only the core library, without the spaCy dependency.

```bash
pip install shadowdata[spacy]
Optional spaCy support:

```bash
pip install shadow_data[spacy]
```
* Installs spaCy automatically, based on your platform.

By default, ShadowData automatically downloads the necessary language model if it’s not already installed. If you’d prefer to install it manually, use the following command as an example:
spaCy models are downloaded automatically at runtime when needed. To install manually:

```bash
python -m spacy download en_core_web_trf
```
Make sure to run this command within your project’s virtual environment.

[Check spaCy's documentation to know more about the Language Models.](https://spacy.io/models)
## Quickstart

## Usage
Usage examples are available in the [examples](examples) directory.
```python
from shadow_data.anonymization import (
    EmailAnonymization,
    Ipv4Anonymization,
    PhoneNumberAnonymization,
    TextProcessor,
)
from shadow_data.cryptohash.symmetric_cipher import Symmetric
from shadow_data.l10n.usa import IdentifierAnonymizer

## Contributing
text = "Contact me at user@example.com or 415-555-0199. Server: 10.0.0.1"
anonymized_text = Ipv4Anonymization.anonymize_ipv4(text)
anonymized_text = TextProcessor.replace_text("Contact", "Reach", anonymized_text)
email = EmailAnonymization.anonymize_email("user@example.com")
phone = PhoneNumberAnonymization.anonymize_phone_number("415-555-0199")
print(anonymized_text, email, phone)

ssn = "Billy's SSN is 479-92-5042."
ssn_anonymizer = IdentifierAnonymizer(ssn)
ssn_anonymizer.anonymize()
print(ssn_anonymizer.cleaned_content)

symmetric = Symmetric()
key = symmetric.create_key()
ciphertext = symmetric.encrypt("hello")
plaintext = symmetric.decrypt(ciphertext)
print(key, ciphertext, plaintext)
```

Contributions are welcome! Please follow the guidelines below to contribute to the project.
## Docs
- `docs/README.md`
- `docs/usage.md`
- `docs/cryptography.md`
- `docs/pii.md`

## Examples
- `examples/quickstart.py`
- `examples/anonymization.md`
- `examples/i10n_us.md`
- `examples/i10n_brazil.md`
- `examples/pii_nlp.md`
- `examples/symmetric_cipher.md`

## Testing

```bash
poetry run pytest -vvv
```

## Contributing

1. Fork the repository.
2. Create a new branch for your feature (git checkout -b my-new-feature).
3. Commit your changes (git commit -am 'Add new feature').
4. Push the branch (git push origin my-new-feature).
5. Open a pull request.
1. Fork the repository.
2. Create a new branch for your feature (`git checkout -b my-new-feature`).
3. Commit your changes (`git commit -am 'Add new feature'`).
4. Push the branch (`git push origin my-new-feature`).
5. Open a pull request.

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
This project is licensed under the MIT License - see `LICENSE` for details.
13 changes: 13 additions & 0 deletions docs/README.md
@@ -0,0 +1,13 @@
# ShadowData Docs

This folder provides focused guides for the current feature set.

## Contents
- `docs/usage.md`: core anonymization helpers and localized identifiers
- `docs/cryptography.md`: symmetric encryption and key handling
- `docs/pii.md`: spaCy-based PII detection

## Quick pointers
- PII detection is optional and requires the `shadow_data[spacy]` extra.
- spaCy models download at runtime when first used.
- Masking and reversible transforms are planned but not yet implemented.
39 changes: 39 additions & 0 deletions docs/cryptography.md
@@ -0,0 +1,39 @@
# Cryptography

ShadowData provides symmetric encryption using Fernet from the `cryptography` package.

## Generate and use a key

```python
from shadow_data.cryptohash.symmetric_cipher import Symmetric

symmetric = Symmetric()
key = symmetric.create_key()

ciphertext = symmetric.encrypt("Hello World!")
plaintext = symmetric.decrypt(ciphertext)

print(key)
print(ciphertext)
print(plaintext)
```

## Use an existing key

```python
from shadow_data.cryptohash.symmetric_cipher import Symmetric

key = b"bpSGcODTJ1iOwxloIQJrAiYDRaqyypdCsQfg1EwVOTc="

symmetric = Symmetric(cipher_key=key)

ciphertext = symmetric.encrypt("Hello World")
plaintext = symmetric.decrypt(ciphertext)

print(ciphertext)
print(plaintext)
```

## Error handling
- `CipherKeyNotFoundError`: raised when encrypting or decrypting without a key.
- `InvalidCipherKeyError`: raised when setting an invalid Fernet key.
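
A Fernet key is 32 random bytes encoded as URL-safe base64, so a malformed key can be spotted before it is passed as `cipher_key`. The following is a minimal sketch using only the standard library. `looks_like_fernet_key` is a hypothetical helper, not part of ShadowData, and it only checks the key's shape, not whether the key decrypts any particular ciphertext.

```python
import base64
import binascii
import os


def looks_like_fernet_key(key: bytes) -> bool:
    """Return True if `key` is URL-safe base64 that decodes to exactly 32 bytes."""
    try:
        raw = base64.urlsafe_b64decode(key)
    except (binascii.Error, ValueError):
        # Not decodable as base64 at all (bad padding or length).
        return False
    return len(raw) == 32


# Same construction Fernet uses for a fresh key: 32 random bytes, base64-encoded.
good_key = base64.urlsafe_b64encode(os.urandom(32))
bad_key = b"too-short"

print(looks_like_fernet_key(good_key))
print(looks_like_fernet_key(bad_key))
```

A check like this can give an earlier, more specific message than waiting for `InvalidCipherKeyError` at construction time.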
31 changes: 31 additions & 0 deletions docs/pii.md
@@ -0,0 +1,31 @@
# PII Detection (spaCy)

PII detection is powered by spaCy and is an optional dependency.

## Install

```bash
pip install shadow_data[spacy]
```

## Use a model

```python
from shadow_data.pii.enums import ModelLang, ModelCore, ModelSize
from shadow_data.pii.spacy import SensitiveData

content = "Alice Johnson works at Example Corp in Seattle."
instance = SensitiveData()
entities = instance.identify_sensitive_data(
    ModelLang.ENGLISH,
    ModelCore.WEB,
    ModelSize.SMALL,
    content,
)
print(entities)
```

## Notes
- The model name is assembled as `{lang}_{core}_{size}` (for example, `en_core_web_sm`).
- Models are downloaded automatically on first use if missing.
- Returned entities are filtered to these labels: `PER`, `LOC`, `ORG`, `MISC`.
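
The `{lang}_{core}_{size}` naming rule from the notes above can be sketched in isolation. The enums below are hypothetical stand-ins for those in `shadow_data.pii.enums`: their values (and the `TRANSFORMER` member) are assumptions chosen so the result matches the documented `en_core_web_sm` example, and `model_name` is an illustrative helper, not the library's function.

```python
from enum import Enum


# Hypothetical stand-ins; the real enum values in shadow_data may differ.
class ModelLang(Enum):
    ENGLISH = "en"


class ModelCore(Enum):
    WEB = "core_web"


class ModelSize(Enum):
    SMALL = "sm"
    TRANSFORMER = "trf"  # assumed member, for illustration only


def model_name(lang: ModelLang, core: ModelCore, size: ModelSize) -> str:
    """Join the three enum values into a spaCy model name."""
    return f"{lang.value}_{core.value}_{size.value}"


name = model_name(ModelLang.ENGLISH, ModelCore.WEB, ModelSize.SMALL)
print(name)
```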
70 changes: 70 additions & 0 deletions docs/usage.md
@@ -0,0 +1,70 @@
# Usage

This guide covers the anonymization helpers and localized identifiers.

## Text replacement
`TextProcessor.replace_text` uses regular expressions for matching and replacement.

```python
from shadow_data.anonymization import TextProcessor

content = "The user's name is Alice Jones."
updated = TextProcessor.replace_text("Alice Jones", "ANONYMOUS", content)
print(updated)
```
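
Because the search string is treated as a regular expression, metacharacters such as `.` match more than their literal selves. The sketch below uses only the standard `re` module rather than ShadowData itself. It assumes `replace_text` forwards its first argument to the regex engine unchanged, which the source does not confirm.

```python
import re

content = "Version 1.0 and version 1x0 are both listed."

# Unescaped: "." matches any character, so "1x0" is replaced as well.
loose = re.sub("1.0", "N", content)

# Escaped: re.escape makes "." literal, so only "1.0" is replaced.
strict = re.sub(re.escape("1.0"), "N", content)

print(loose)
print(strict)
```

If that assumption holds, wrapping literal search strings in `re.escape` before passing them in would avoid accidental over-matching.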

## IPv4 anonymization
`Ipv4Anonymization.anonymize_ipv4` masks the final three octets with `X`. It accepts free-form text, not only a bare address.

```python
from shadow_data.anonymization import Ipv4Anonymization

text = "Primary IP: 192.168.1.100"
print(Ipv4Anonymization.anonymize_ipv4(text))
```
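
The masking rule described above can be approximated with a single regular expression. This is a standalone sketch, not the library's implementation: `mask_ipv4` is a hypothetical helper, and replacing each masked octet with exactly one `X` is an assumption about the output format.

```python
import re

# Four dot-separated groups of 1-3 digits; only the first group is captured.
# This simple pattern does not verify that each octet is <= 255.
IPV4_PATTERN = re.compile(r"\b(\d{1,3})\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")


def mask_ipv4(text: str) -> str:
    """Keep the first octet of every match and replace the other three with 'X'."""
    return IPV4_PATTERN.sub(r"\1.X.X.X", text)


masked = mask_ipv4("Primary IP: 192.168.1.100, backup: 10.0.0.1")
print(masked)
```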

## Email anonymization
`EmailAnonymization.anonymize_email` validates email format and replaces the user part with `*`, while keeping the last 3 characters of the first domain label.

```python
from shadow_data.anonymization import EmailAnonymization

print(EmailAnonymization.anonymize_email("user@example.com"))
```
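
To make the rule above concrete, here is a standalone sketch of one way to implement it. `mask_email` is a hypothetical helper, the format check is deliberately simple, and masking one `*` per hidden character is an assumption. The library's exact output may differ.

```python
import re

# Simple shape check: user@label.rest, with no whitespace and a single "@".
EMAIL_PATTERN = re.compile(r"^([^@\s]+)@([^@\s.]+)\.([^@\s]+)$")


def mask_email(email: str, keep: int = 3) -> str:
    """Mask the user part fully and the first domain label except its last `keep` chars."""
    match = EMAIL_PATTERN.match(email)
    if match is None:
        raise ValueError(f"not a valid email address: {email!r}")
    user, label, rest = match.groups()
    masked_user = "*" * len(user)
    # Number of leading characters of the label to hide (never negative).
    hidden = max(len(label) - keep, 0)
    masked_label = "*" * hidden + label[hidden:]
    return f"{masked_user}@{masked_label}.{rest}"


masked_email = mask_email("user@example.com")
print(masked_email)
```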

## Phone number anonymization
`PhoneNumberAnonymization.anonymize_phone_number` preserves the last 4 digits and keeps the original formatting.

```python
from shadow_data.anonymization import PhoneNumberAnonymization

print(PhoneNumberAnonymization.anonymize_phone_number("+1 (415) 555-0199"))
```
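
Preserving the formatting while hiding digits can be done by walking the string and masking only digit characters. The sketch below is standalone: `mask_phone` is a hypothetical helper, and the `*` mask character is an assumption, since the source does not state which character the library uses.

```python
def mask_phone(number: str, keep: int = 4, mask: str = "*") -> str:
    """Replace every digit except the last `keep` with `mask`; leave other characters alone."""
    total_digits = sum(ch.isdigit() for ch in number)
    to_mask = max(total_digits - keep, 0)
    out = []
    for ch in number:
        if ch.isdigit() and to_mask > 0:
            out.append(mask)
            to_mask -= 1
        else:
            # Separators, spaces, "+", and the trailing digits pass through unchanged.
            out.append(ch)
    return "".join(out)


masked_phone = mask_phone("+1 (415) 555-0199")
print(masked_phone)
```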

## Localized identifiers
### US SSN

```python
from shadow_data.l10n.usa import IdentifierAnonymizer

text = "SSN: 479-92-5042"
anonymizer = IdentifierAnonymizer(text)
anonymizer.anonymize()
print(anonymizer.cleaned_content)
```

### Brazil CPF/CNPJ

```python
from shadow_data.l10n.brazil import IdentifierAnonymizer

cpf = "806.846.761-09"
cpf_anonymizer = IdentifierAnonymizer(cpf)
cpf_anonymizer.anonymize()
print(cpf_anonymizer.cleaned_content)

cnpj = "26.283.050/0001-17"
cnpj_anonymizer = IdentifierAnonymizer(cnpj)
cnpj_anonymizer.anonymize()
print(cnpj_anonymizer.cleaned_content)
```
40 changes: 40 additions & 0 deletions examples/quickstart.py
@@ -0,0 +1,40 @@
from shadow_data.anonymization import (
    EmailAnonymization,
    Ipv4Anonymization,
    PhoneNumberAnonymization,
    TextProcessor,
)
from shadow_data.cryptohash.symmetric_cipher import Symmetric
from shadow_data.l10n.brazil import IdentifierAnonymizer as BrazilIdentifierAnonymizer
from shadow_data.l10n.usa import IdentifierAnonymizer as UsaIdentifierAnonymizer

text = "Contact me at user@example.com or 415-555-0199. Server: 10.0.0.1"

anonymized_text = Ipv4Anonymization.anonymize_ipv4(text)
anonymized_text = TextProcessor.replace_text("Contact", "Reach", anonymized_text)
email = EmailAnonymization.anonymize_email("user@example.com")
phone = PhoneNumberAnonymization.anonymize_phone_number("415-555-0199")

print(anonymized_text)
print(email)
print(phone)

ssn_text = "Billy's SSN is 479-92-5042."
ssn_anonymizer = UsaIdentifierAnonymizer(ssn_text)
ssn_anonymizer.anonymize()
print(ssn_anonymizer.cleaned_content)

cpf = "806.846.761-09"
cpf_anonymizer = BrazilIdentifierAnonymizer(cpf)
cpf_anonymizer.anonymize()
print(cpf_anonymizer.cleaned_content)

symmetric = Symmetric()
key = symmetric.create_key()

ciphertext = symmetric.encrypt("hello")
plaintext = symmetric.decrypt(ciphertext)

print(key)
print(ciphertext)
print(plaintext)