Document DataCleaner Tool

A document processing tool that cleans, normalizes, classifies, and combines CSV and XLSX files entirely on your local machine.

Features

Multiple File Format Support: Process CSV and XLSX files
Local Processing: All processing happens on your local machine
Configurable Rules: Adjust column mappings, keep specific columns, and set classification rules via configuration
Data Cleaning and Transformation:
- Normalize inconsistent headers (id, record Number, ref_num → record_id)
- Standardize date formats (YYYY-MM-DD)
- Remove duplicates and empty rows
- Combine multiple input files into one clean dataset
- Classify records into Residential or Commercial categories based on description
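
The cleaning steps listed above can be sketched in a few lines of Python. This is an illustration only, not the tool's implementation: it uses the standard library instead of pandas so that it runs without dependencies, and `HEADER_MAP`, `DATE_FORMATS`, the `date` column name, and all function names are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical mapping; the tool's real defaults can be viewed with --list-default-config.
HEADER_MAP = {"id": "record_id", "record number": "record_id", "ref_num": "record_id"}
# Hypothetical list of input date formats to try, in order.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def normalize_header(name):
    """Lower-case and trim a header, then map known variants to the standard name."""
    key = name.strip().lower()
    return HEADER_MAP.get(key, key)


def standardize_date(value):
    """Return the date as YYYY-MM-DD, or the original text if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value


def clean_rows(rows, date_column="date"):
    """Normalize headers and dates, then drop empty and duplicate rows."""
    cleaned, seen = [], set()
    for row in rows:
        row = {normalize_header(k): (v or "").strip() for k, v in row.items()}
        if not any(row.values()):
            continue  # skip rows with no data at all
        if row.get(date_column):
            row[date_column] = standardize_date(row[date_column])
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue  # skip exact duplicates
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


rows = [
    {"Record Number": "1", "Date": "31/01/2024"},
    {"Record Number": "1", "Date": "2024-01-31"},
    {"Record Number": "", "Date": ""},
]
print(clean_rows(rows))  # [{'record_id': '1', 'date': '2024-01-31'}]
```

The second row becomes a duplicate only after its date is standardized, which is why the sketch normalizes before it deduplicates.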

Installation

Automated Installation (Windows)

Navigate to the project directory
Double-click on install_datacleaner.bat to run the installer

The script will:

Check for Python 3.11.0 and install it if needed
Upgrade pip, setuptools, and wheel
Install the datacleaner package and its dependencies

Automated Installation (Linux)

Open a terminal and navigate to the project directory
Run the installation script:

chmod +x install_datacleaner.sh
./install_datacleaner.sh

The script will:
- Check for Python 3.11.0. If Python is not installed, you must install it manually, because the installation method differs between Linux distributions.
- Upgrade pip, setuptools, and wheel
- Install the datacleaner package and its dependencies

Usage

Process a single file:
- datacleaner path/to/your/file.csv
This creates a processed version at path/to/your/file_processed.xlsx.
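
The default output name follows a simple pattern: the input file name gains a _processed suffix and an .xlsx extension. A minimal sketch of that naming rule is shown here; `default_output_path` is a hypothetical helper written for illustration, not a function taken from the package.

```python
from pathlib import Path


def default_output_path(input_path):
    """Return the input path with a _processed suffix and an .xlsx extension."""
    path = Path(input_path)
    return path.with_name(f"{path.stem}_processed.xlsx")


print(default_output_path("path/to/your/file.csv").as_posix())
# path/to/your/file_processed.xlsx
```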

Advanced Options

Specify custom output file:
- datacleaner input.xlsx -o output.xlsx
Process multiple files and combine them into a single dataset:
- datacleaner -i file1.csv file2.xlsx -o combined.xlsx
Use a custom configuration file:
- datacleaner file.csv -c custom_config.yaml
Process multiple files and combine them into a single dataset with a custom config:
- datacleaner -i file1.csv file2.xlsx -o combined.xlsx -c custom_config.yaml
List the default configuration (header mappings, columns, classification rules):
- datacleaner --list-default-config
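
For readers who want to see how these options fit together, the following `argparse` sketch accepts the same command lines as the examples above. It is an assumption about how such a parser could look, not the parser in datacleaner/main.py; the `dest` names and help texts are invented for the example.

```python
import argparse


def build_parser():
    """Build a parser that accepts the options shown in this README."""
    parser = argparse.ArgumentParser(prog="datacleaner")
    parser.add_argument("input", nargs="?", help="single CSV or XLSX file")
    parser.add_argument("-i", dest="inputs", nargs="+", default=[],
                        help="several files to combine")
    parser.add_argument("-o", dest="output", help="custom output file")
    parser.add_argument("-c", dest="config", help="custom YAML configuration file")
    parser.add_argument("--list-default-config", action="store_true",
                        help="print the default configuration")
    return parser


args = build_parser().parse_args(
    ["-i", "file1.csv", "file2.xlsx", "-o", "combined.xlsx"]
)
print(args.inputs, args.output)  # ['file1.csv', 'file2.xlsx'] combined.xlsx
```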

Configuration

You can use a custom configuration file (.yaml) to modify how data is cleaned:

Header Mappings: Maps inconsistent headers to standardized ones ("record number" -> record_id)
Keep Columns: Keeps only the specified list of important columns
Classification: Defines keywords for residential and commercial categories

Example command to use a custom config:

datacleaner file.csv -c my_config.yaml
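
To make the three configuration parts concrete, here is a rough sketch of how they could act on a single record. The configuration is written as a Python dict purely for illustration: the key names, the keywords, the `category` column, and the `Unclassified` fallback are all assumptions, so check the output of --list-default-config for the real structure.

```python
# Hypothetical configuration; the key names are assumptions, not the tool's schema.
CONFIG = {
    "header_mappings": {"record number": "record_id"},
    "keep_columns": ["record_id", "description"],
    "classification": {
        "Residential": ["house", "apartment"],
        "Commercial": ["office", "warehouse"],
    },
}


def classify(description, rules):
    """Return the first category whose keyword appears in the description."""
    text = description.lower()
    for category, keywords in rules.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "Unclassified"  # assumed fallback when no keyword matches


def apply_config(row, config):
    """Rename headers, keep only the listed columns, and add a category."""
    mappings = config["header_mappings"]
    renamed = {}
    for header, value in row.items():
        key = header.strip().lower()
        renamed[mappings.get(key, key)] = value
    kept = {k: renamed[k] for k in config["keep_columns"] if k in renamed}
    kept["category"] = classify(kept.get("description", ""), config["classification"])
    return kept


row = {"Record Number": "7", "Description": "Two-bedroom apartment", "Notes": "n/a"}
print(apply_config(row, CONFIG))
# {'record_id': '7', 'description': 'Two-bedroom apartment', 'category': 'Residential'}
```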

Output

For each processed file, the tool generates:

A cleaned .xlsx version of the input file (with _processed suffix by default)
If multiple files are provided, a combined and deduplicated .xlsx file
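
Combining and deduplicating can be pictured as follows. This sketch assumes each file has already been read into a list of dicts with normalized headers; filling missing columns with an empty string and keeping the first occurrence of a duplicate are choices made for the example and may differ from what the tool does.

```python
def combine(*datasets):
    """Align rows on the union of columns, then drop exact duplicates."""
    # Collect the union of column names in first-seen order.
    columns = []
    for rows in datasets:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)

    combined, seen = [], set()
    for rows in datasets:
        for row in rows:
            # Fill columns that a file lacks with an empty string.
            aligned = {name: row.get(name, "") for name in columns}
            fingerprint = tuple(aligned.values())
            if fingerprint not in seen:
                seen.add(fingerprint)
                combined.append(aligned)
    return combined


file1 = [{"record_id": "1", "date": "2024-01-31"}]
file2 = [
    {"record_id": "1", "date": "2024-01-31"},
    {"record_id": "2", "date": "2024-02-01", "description": "office"},
]
for row in combine(file1, file2):
    print(row)
```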

Supported File Types

CSV
XLSX

Requirements

Python 3.11.0
pandas
PyYAML

Troubleshooting

If you encounter issues:

Ensure you are using Python 3.11.0
Verify that your input file is either .csv or .xlsx
Use the --list-default-config option to inspect the default configuration
If combining files, make sure their structures are compatible with the anyconfig.yaml file
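
The first two checks can be done quickly from a Python prompt. The snippet below is a small convenience sketch, not part of the package; treating extensions as case-insensitive is an assumption.

```python
import sys
from pathlib import Path

SUPPORTED_SUFFIXES = {".csv", ".xlsx"}


def is_supported(path):
    """Return True if the file extension is .csv or .xlsx (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


def python_version():
    """Return the running interpreter version as a dotted string."""
    return ".".join(str(part) for part in sys.version_info[:3])


print("Python", python_version())
for name in ("file.csv", "report.XLSX", "notes.txt"):
    print(name, "supported" if is_supported(name) else "not supported")
```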

Development

The project uses setuptools for packaging. The main entry point is in datacleaner/main.py.

To modify and reinstall:

Make your changes
Run one of the following:
- install_datacleaner.bat (Windows) or ./install_datacleaner.sh (Linux)
- pip install -e .
