todor02/DataCleaner

Document DataCleaner Tool

A document processing tool that cleans, normalizes, classifies, and combines CSV/XLSX files entirely on your local machine.



Features

  • Multiple File Format Support: Process CSV and XLSX files
  • Local Processing: All processing happens on your local machine
  • Configurable Rules: Adjust column mappings, keep specific columns, and set classification rules via configuration
  • Data Cleaning and Transformation:
    • Normalize inconsistent headers (id, record Number, ref_num -> record_id)
    • Standardize date formats (YYYY-MM-DD)
    • Remove duplicates and empty rows
    • Combine multiple input files into one clean dataset
    • Classify records into Residential or Commercial categories based on description
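The cleaning steps listed above can be sketched in plain Python. This is a minimal illustration, not the tool's actual implementation: the `HEADER_MAP` entries, the accepted `DATE_FORMATS`, the `date` column name, and the helper names are all assumptions made for the example, and the tool itself relies on pandas.

```python
from datetime import datetime

# Hypothetical mapping; the tool's real defaults can be printed with
# `datacleaner --list-default-config`.
HEADER_MAP = {"id": "record_id", "record number": "record_id", "ref_num": "record_id"}

# Assumed input date formats, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")


def normalize_header(name):
    """Lower-case and trim a header, then map known variants to one name."""
    key = name.strip().lower()
    return HEADER_MAP.get(key, key)


def standardize_date(value):
    """Return the date as YYYY-MM-DD, or the original value if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value


def clean_rows(rows, date_column="date"):
    """Normalize headers and dates, then drop empty and duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        row = {normalize_header(k): (v or "").strip() for k, v in row.items()}
        if not any(row.values()):
            continue  # empty row
        if date_column in row:
            row[date_column] = standardize_date(row[date_column])
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # duplicate of an earlier row
        seen.add(key)
        cleaned.append(row)
    return cleaned
```

Rows that differ only in header spelling or date format collapse into one record once both have been normalized, which is why deduplication runs last.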

Installation

Automated Installation (Windows)

  1. Navigate to the project directory
  2. Double-click on install_datacleaner.bat to run the installer

The script will:

  • Check for Python 3.11.0 and install it if needed
  • Upgrade pip, setuptools, and wheel
  • Install the datacleaner package and its dependencies

Automated Installation (Linux)

  1. Open a terminal and navigate to the project directory
  2. Run the installation script:
chmod +x install_datacleaner.sh
./install_datacleaner.sh
The script will:

  • Check for Python 3.11.0. If Python is not installed, you must install it manually, because installation differs between Linux distributions
  • Upgrade pip, setuptools, and wheel
  • Install the datacleaner package and its dependencies


Usage

  • Process a single file:

    • datacleaner path/to/your/file.csv

This creates a processed version at path/to/your/file_processed.xlsx.

Advanced Options

  • Specify custom output file:

    • datacleaner input.xlsx -o output.xlsx
  • Process multiple files and combine them into a single dataset:

    • datacleaner -i file1.csv file2.xlsx -o combined.xlsx
  • Use a custom configuration file:

    • datacleaner file.csv -c custom_config.yaml
  • Process multiple files and combine them into a single dataset with a custom config:

    • datacleaner -i file1.csv file2.xlsx -o combined.xlsx -c custom_config.yaml
  • List the default configuration (header mappings, columns, classification rules):

    • datacleaner --list-default-config
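For readers who want to see how the options above fit together, here is a rough argparse sketch of the same command-line surface. The flags (`-i`, `-o`, `-c`, `--list-default-config`) come from the commands above; the `dest` names, help strings, and `build_parser` function are assumptions, and the real parser in the package may be structured differently.

```python
import argparse


def build_parser():
    """Build a parser that accepts the invocations shown in this README."""
    parser = argparse.ArgumentParser(prog="datacleaner")
    # Single input file given positionally, e.g. `datacleaner file.csv`.
    parser.add_argument("input", nargs="?", help="single CSV/XLSX file to process")
    # Several input files to combine, e.g. `-i file1.csv file2.xlsx`.
    parser.add_argument("-i", dest="inputs", nargs="+", default=[],
                        help="multiple files to combine")
    parser.add_argument("-o", dest="output", help="custom output file")
    parser.add_argument("-c", dest="config", help="custom YAML configuration file")
    parser.add_argument("--list-default-config", action="store_true",
                        help="print the default configuration")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-i", "file1.csv", "file2.xlsx", "-o", "combined.xlsx"]
    )
    print(args.inputs, args.output)
```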

Configuration

You can use a custom configuration file (.yaml) to modify how data is cleaned:

  • Header Mappings: Maps inconsistent headers to standardized ones ("record number" -> record_id)

  • Keep Columns: Keeps only the specified list of important columns

  • Classification: Defines keywords for residential and commercial categories

Example command to use a custom config:

  • datacleaner file.csv -c my_config.yaml
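To illustrate how the three configuration sections could act on a record, the sketch below uses a Python dict standing in for a loaded YAML file. The key names (`header_mappings`, `keep_columns`, `classification`), the keywords, the `category` output column, and the `Unclassified` fallback are all hypothetical; run `datacleaner --list-default-config` to see the actual structure.

```python
# Hypothetical configuration, as it might look after loading the YAML file.
config = {
    "header_mappings": {"record number": "record_id"},
    "keep_columns": ["record_id", "description"],
    "classification": {
        "residential": ["house", "apartment"],
        "commercial": ["office", "warehouse"],
    },
}


def classify(description, rules):
    """Return the first category whose keyword appears in the description."""
    text = description.lower()
    for category, keywords in rules.items():
        if any(keyword in text for keyword in keywords):
            return category.capitalize()
    return "Unclassified"  # assumed fallback label


def apply_config(row, config):
    """Rename headers, keep only the configured columns, and add a category."""
    mappings = config["header_mappings"]
    mapped = {}
    for header, value in row.items():
        key = header.strip().lower()
        mapped[mappings.get(key, key)] = value
    kept = {column: mapped.get(column, "") for column in config["keep_columns"]}
    kept["category"] = classify(kept.get("description", ""), config["classification"])
    return kept
```

In this sketch, columns not listed under `keep_columns` are dropped before classification, so the description column has to be kept for the keyword rules to have anything to match.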

Output

For each processed file, the tool generates:

  • A cleaned .xlsx version of the input file (with _processed suffix by default)

  • If multiple files are provided, a combined and deduplicated .xlsx file
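The default naming rule can be expressed in a few lines. This is a sketch of the behaviour described above (input name plus a `_processed` suffix, always written as `.xlsx`); the function name `default_output_path` is an assumption, not part of the package.

```python
from pathlib import PurePosixPath


def default_output_path(input_path):
    """Derive the default output name: <stem>_processed.xlsx next to the input."""
    path = PurePosixPath(input_path)
    return str(path.with_name(path.stem + "_processed.xlsx"))


if __name__ == "__main__":
    print(default_output_path("path/to/your/file.csv"))
```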


Supported File Types

  • CSV

  • XLSX


Requirements

  • Python 3.11.0

  • pandas

  • PyYAML


Troubleshooting

If you encounter issues:

  • Ensure you are using Python 3.11.0

  • Verify that your input file is either .csv or .xlsx

  • Use the --list-default-config option to inspect the default configuration

  • If combining files, make sure their structures are compatible with each other and with the anyconfig.yaml file


Development

The project uses setuptools for packaging. The main entry point is in datacleaner/main.py.

To modify and reinstall:

  1. Make your changes

  2. Reinstall with either:

    • install_datacleaner.(bat|sh)
    • pip install -e .
