todor02/Anonymizer

Document Anonymizer Tool

A document processing tool that anonymizes personal data in TXT, DOCX, and PDF files, running entirely on your local machine.


Features

  • Multiple File Format Support: Process TXT, DOCX, and PDF files
  • Local Processing: All processing happens on your local machine
  • Detailed Logging: Generates JSON logs of all anonymization actions
  • Customizable Rules: Enable/disable specific detection rules via configuration
  • Comprehensive Data Detection:
    • Credit card numbers
    • Phone numbers (US format)
    • Email addresses
    • Social Security Numbers
    • Salary amounts
    • Street addresses
    • ZIP codes
    • IP addresses
    • Person names (context-aware heuristic detection)
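
As a rough illustration of how pattern-based detection of this kind can work, here is a minimal sketch covering four of the listed data types. The regular expressions, the `PATTERNS` table, the `anonymize_text` function, and the `[RULE]` placeholder format are all assumptions made for this example; the tool's actual rules and replacement text may differ.

```python
import re

# Illustrative patterns only; these are NOT the tool's real detection rules.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(
        r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
    ),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def anonymize_text(text, enabled=None):
    """Replace every match with a [RULE] placeholder.

    `enabled` is an optional collection of rule names; None means all rules.
    Returns the anonymized text and a list of log entries.
    """
    log = []
    for name, pattern in PATTERNS.items():
        if enabled is not None and name not in enabled:
            continue

        def _replace(match, name=name):
            # Record what was replaced so the action can be logged later.
            log.append({"rule": name, "original": match.group(0)})
            return f"[{name.upper()}]"

        text = pattern.sub(_replace, text)
    return text, log


if __name__ == "__main__":
    sample = "Mail jane@example.com, SSN 123-45-6789, call (555) 123-4567."
    cleaned, entries = anonymize_text(sample)
    print(cleaned)
    print(entries)
```

Passing `enabled={"email"}` would restrict the run to a single rule, which mirrors the idea of switching rules on and off through configuration.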

Installation

Automated Installation (Windows)

  1. Navigate to the project directory
  2. Double-click on install_anonymizer.bat to run the installer

The script will:

  • Check for Python 3.11.0 and install it if needed
  • Upgrade pip, setuptools, and wheel to required versions
  • Install the anonymizer package and its dependencies

Automated Installation (Linux)

  1. Open a terminal and navigate to the project directory
  2. Run the installation script:
chmod +x install_anonymizer.sh
./install_anonymizer.sh

The script will:

  • Check for Python 3.11.0. If Python is not installed, you must install it manually, because the installation method differs between distributions
  • Upgrade pip, setuptools, and wheel to required versions
  • Install the anonymizer package and its dependencies

Usage

  • Run the tool on a file. This creates an anonymized version at path/to/your/file_anonymized.docx:
    • anonymizer path/to/your/file.docx

Advanced Options

  • Specify custom output file:

    • anonymizer input.pdf -o output.pdf
  • Use a custom configuration file:

    • anonymizer document.txt -c custom_config.json
  • List all available anonymization rules:

    • anonymizer --list-rules
  • Create a default configuration file to customize:

    • anonymizer --create-config name.json
  • Enable verbose logging (future-proofing):

    • anonymizer file.docx -v
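
The options above can also be driven from a script. The following is a minimal sketch of batch-processing a folder by calling the CLI once per file with `-o`. The helper names `build_commands` and `run_all` are hypothetical, and the sketch assumes the `anonymizer` command is on your PATH and accepts an output path in a different directory.

```python
import subprocess
from pathlib import Path

# File types listed as supported by the tool.
SUPPORTED = {".txt", ".docx", ".pdf"}


def build_commands(input_dir, output_dir):
    """Return one `anonymizer INPUT -o OUTPUT` argument list per supported file."""
    out = Path(output_dir)
    commands = []
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            commands.append(["anonymizer", str(path), "-o", str(out / path.name)])
    return commands


def run_all(input_dir, output_dir):
    """Run the CLI for every supported file; stops at the first failure."""
    # Create the output folder up front in case the CLI does not do so itself.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(input_dir, output_dir):
        subprocess.run(cmd, check=True)
```

Calling `run_all("incoming", "anonymized")` would process every TXT, DOCX, and PDF file in `incoming` and leave the originals untouched.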

Configuration

  • To enable or disable specific anonymization rules, create a default configuration file, edit it, and pass it to the tool with -c:

    • anonymizer --create-config my_config.json
    • anonymizer document.txt -c my_config.json
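
The layout of the generated configuration file is not documented here, so before editing it can help to list which on/off switches it contains. The sketch below walks any parsed JSON document and reports every boolean setting with its path. The `boolean_flags` helper and the sample layout (`rules`, `email`, `zip_code`) are hypothetical and only demonstrate the idea; inspect your own file from `--create-config` for the real keys.

```python
import json


def boolean_flags(node, prefix=""):
    """Yield (dotted_path, value) for every true/false setting in parsed JSON."""
    if isinstance(node, bool):
        yield prefix, node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from boolean_flags(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from boolean_flags(value, f"{prefix}[{index}]")


# Hypothetical layout, used only to demonstrate the helper.
sample = json.loads('{"rules": {"email": true, "zip_code": false}, "version": 1}')

if __name__ == "__main__":
    for path, value in boolean_flags(sample):
        print(path, value)
```

To use it on a real file, load it with `json.load(open("my_config.json", encoding="utf-8"))` and pass the result to `boolean_flags`.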

Output

For each processed file, the tool generates:

  • An anonymized version of the original file (with _anonymized suffix by default)
  • A detailed JSON log file showing what was anonymized and where
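
The default naming of the anonymized file can be reproduced in a few lines, which is handy when a script needs to locate the result afterwards. This is a sketch based only on the `_anonymized` suffix described above; `default_output_path` is a hypothetical helper, not part of the package, and the name of the JSON log file is not covered because it is not specified here.

```python
from pathlib import Path


def default_output_path(input_path, suffix="_anonymized"):
    """Insert the suffix between the file stem and its extension."""
    path = Path(input_path)
    return path.with_name(f"{path.stem}{suffix}{path.suffix}")


if __name__ == "__main__":
    print(default_output_path("path/to/your/file.docx"))
```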

Supported File Types

  • TXT: Plain text files
  • DOCX: Microsoft Word documents
  • PDF: Portable Document Format files

Limitations

  • This is an MVP (Minimum Viable Product) and may have limitations with complex document layouts
  • PDF processing preserves text content but may not maintain exact visual formatting
  • Name detection uses heuristics and may have false positives/negatives
  • Currently optimized for US format phone numbers and addresses

Requirements

  • Python 3.11.0
  • 4GB+ RAM recommended for processing large PDF files
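
If you want to check the interpreter version before installing, a short script is enough. This sketch assumes that any 3.11.x patch release satisfies the requirement, which is not confirmed above; `meets_requirement` and `REQUIRED` are names chosen for this example.

```python
import sys

# ASSUMPTION: any Python 3.11.x release is acceptable, not only 3.11.0.
REQUIRED = (3, 11)


def meets_requirement(version_info=sys.version_info, required=REQUIRED):
    """Return True when the major and minor version match the required pair."""
    return tuple(version_info[:2]) == required


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_requirement() else "unsupported"
    print(f"Python {current}: {status}")
```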

Troubleshooting

If you encounter issues:

  • Ensure you have a stable internet connection for the initial installation
  • Check that your system meets the Python version requirement (3.11.0)
  • For PDF processing issues, ensure you have adequate system memory
  • Use the -v flag for verbose logging to identify problems

Development

The project uses setuptools for packaging. The main entry point is in anonymizer/main.py.

To modify and reinstall:

  1. Make your changes
  2. Run one of the following from the project directory:
    • install_anonymizer.bat (Windows) or install_anonymizer.sh (Linux)
    • pip install -e .
