A document processing tool that cleans, normalizes, classifies, and combines CSV/XLSX files entirely on your local machine.
- Features
- Installation
- Usage
- Configuration
- Output
- Supported File Types
- Limitations
- Requirements
- Troubleshooting
- Development
- Multiple File Format Support: Process CSV and XLSX files
- Local Processing: All processing happens on your local machine
- Configurable Rules: Adjust column mappings, keep specific columns, and set classification rules via configuration
- Data Cleaning and Transformation:
- Normalize inconsistent headers (id, record Number, ref_num → record_id)
- Standardize date formats (YYYY-MM-DD)
- Remove duplicates and empty rows
- Combine multiple input files into one clean dataset
- Classify records into Residential or Commercial categories based on description
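The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the tool's actual pandas-based implementation. The names `HEADER_MAP`, `DATE_FORMATS`, and `clean_rows` are hypothetical, and the accepted input date formats are assumptions.

```python
from datetime import datetime

# Hypothetical mapping, modeled on the header example in the feature list.
HEADER_MAP = {"id": "record_id", "record number": "record_id", "ref_num": "record_id"}
# Assumed input date formats; the formats the tool really accepts are not documented here.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")


def normalize_header(name):
    """Trim and lower-case a header, then map known variants to the standard name."""
    key = name.strip().lower()
    return HEADER_MAP.get(key, key)


def standardize_date(value):
    """Return the date as YYYY-MM-DD, or the value unchanged if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value


def clean_rows(rows, date_columns=("date",)):
    """Normalize headers and dates, then drop empty and duplicate rows."""
    cleaned, seen = [], set()
    for row in rows:
        new_row = {normalize_header(k): (v or "").strip() for k, v in row.items()}
        if not any(new_row.values()):
            continue  # skip rows where every cell is empty
        for col in date_columns:
            if col in new_row:
                new_row[col] = standardize_date(new_row[col])
        key = tuple(sorted(new_row.items()))
        if key not in seen:  # keep only the first occurrence of each row
            seen.add(key)
            cleaned.append(new_row)
    return cleaned
```

Rows are modeled as dictionaries of strings, as `csv.DictReader` would produce them. Rows from different files become comparable only after their headers are normalized, which is why deduplication runs last.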
- Navigate to the project directory
- Double-click install_datacleaner.bat to run the installer
The script will:
- Check for Python 3.11.0 and install it if needed
- Upgrade pip, setuptools, and wheel
- Install the datacleaner package and its dependencies
- Open a terminal and navigate to the project directory
- Run the installation script:
chmod +x install_datacleaner.sh
./install_datacleaner.sh
The script will:
- Check for Python 3.11.0. If Python is not installed, you must install it manually, because installation differs between system distributions.
- Upgrade pip, setuptools, and wheel
- Install the datacleaner package and its dependencies
- Process a single file. This creates a processed version at path/to/your/file_processed.xlsx:
datacleaner path/to/your/file.csv
- Specify a custom output file:
datacleaner input.xlsx -o output.xlsx
- Process multiple files and combine them into a single dataset:
datacleaner -i file1.csv file2.xlsx -o combined.xlsx
- Use a custom configuration file:
datacleaner file.csv -c custom_config.yaml
- Process multiple files and combine them into a single dataset with a custom config:
datacleaner -i file1.csv file2.xlsx -o combined.xlsx -c custom_config.yaml
- List the default configuration (header mappings, columns, classification rules):
datacleaner --list-default-config
You can use a custom configuration file (.yaml) to modify how data is cleaned:
- Header Mappings: Maps inconsistent headers to standardized ones ("record number" -> record_id)
- Keep Columns: Keeps only the specified list of important columns
- Classification: Defines keywords for residential and commercial categories
Example command to use a custom config:
datacleaner file.csv -c my_config.yaml
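To make the three configuration sections more concrete, here is a rough sketch of how such settings might be applied to a single record. The configuration is written as a Python dictionary rather than YAML, and the key names (`header_mappings`, `keep_columns`, `classification`), the keywords, and the `category` column are all assumptions for illustration. Run datacleaner --list-default-config to see the real defaults.

```python
# Hypothetical configuration; key names and keywords are assumptions, not the tool's schema.
CONFIG = {
    "header_mappings": {"record number": "record_id"},
    "keep_columns": ["record_id", "description"],
    "classification": {
        "Residential": ["house", "apartment"],
        "Commercial": ["office", "warehouse"],
    },
}


def classify(description, rules):
    """Return the first category with a keyword found in the description, else None."""
    text = description.lower()
    for category, keywords in rules.items():
        if any(word in text for word in keywords):
            return category
    return None


def apply_config(row, config):
    """Rename mapped headers, keep only the configured columns, and add a category."""
    mappings = config["header_mappings"]
    renamed = {mappings.get(k.lower(), k.lower()): v for k, v in row.items()}
    kept = {col: renamed.get(col, "") for col in config["keep_columns"]}
    kept["category"] = classify(kept.get("description", ""), config["classification"])
    return kept
```

In this sketch, a record with the columns Record Number, Description, and Notes keeps only record_id and description, and a description containing "office" is labeled Commercial. Matching is a plain case-insensitive substring test, so the first category with a matching keyword wins.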
For each processed file, the tool generates:
- A cleaned .xlsx version of the input file (with _processed suffix by default)
- If multiple files are provided, a combined and deduplicated .xlsx file
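The default naming rule can be expressed in a few lines. This is a sketch of the behavior described above, assuming the output is written next to the input file. `default_output_path` is a hypothetical helper name, not part of the package.

```python
from pathlib import Path


def default_output_path(input_path):
    """Return the input path with a _processed suffix and an .xlsx extension."""
    path = Path(input_path)
    return path.with_name(f"{path.stem}_processed.xlsx")
```

For example, `default_output_path("path/to/your/file.csv")` yields `path/to/your/file_processed.xlsx`, matching the single-file usage example.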
- CSV
- XLSX
- Python 3.11.0
- pandas
- PyYAML
If you encounter issues:
- Ensure you are using Python 3.11.0
- Verify that your input file is either .csv or .xlsx
- Use the --list-default-config option to inspect the default configuration
- If combining files, make sure their structures are compatible with the anyconfig.yaml file
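The file-type check above is easy to run before invoking the tool. The following is a small sketch, not part of datacleaner. `unsupported_inputs` and `SUPPORTED_EXTENSIONS` are hypothetical names, and treating extensions as case-insensitive is an assumption.

```python
from pathlib import Path

# File types listed as supported in this README.
SUPPORTED_EXTENSIONS = {".csv", ".xlsx"}


def unsupported_inputs(paths):
    """Return the paths whose extension is neither .csv nor .xlsx."""
    return [str(p) for p in paths if Path(p).suffix.lower() not in SUPPORTED_EXTENSIONS]
```

An empty result means every input has a supported extension. It does not confirm that the file contents are valid.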
The project uses setuptools for packaging. The main entry point is in datacleaner/main.py.
To modify and reinstall:
- Make your changes
- Run install_datacleaner.(bat|sh) or pip install -e .