Skip to content

Python tool for automated CSV data cleaning and standardization

License

Notifications You must be signed in to change notification settings

Morvyr/CSV-data-cleaner

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CSV Data Cleaner - Automated Data Processing Tool

A Python script that automatically cleans and standardizes messy CSV data, handling common data quality issues like duplicates, inconsistent formatting, and missing values.

Features

  • Removes duplicate rows - Identifies and eliminates exact duplicates
  • Standardizes text formatting - Converts column values to appropriate casing
  • Standardizes date formats - Handles multiple date formats (YYYY-MM-DD, MM/DD/YYYY, YYYY/MM/DD) and converts all to YYYY-MM-DD
  • Handles missing values - Intelligently fills missing data with appropriate defaults
  • Removes currency symbols - Strips $ and € from price fields and converts to numeric
  • Generates summary report - Provides detailed statistics about the cleaning process
  • Flexible input options - Accepts command-line arguments or interactive prompts

Installation

Requires Python 3.7+ and pandas:

pip install pandas --break-system-packages

Usage

Command-Line Mode (Recommended)

python data_cleaner.py input_file.csv output_file.csv

Example:

python data_cleaner.py messy_sales_data.csv cleaned_sales_data.csv

Interactive Mode

python data_cleaner.py

The script will prompt you for the input filename.

What It Cleans

Text Standardization

  • Customer names: john smithJohn Smith
  • Products: LAPTOPLaptop
  • Emails: USER@EMAIL.COMuser@email.com
  • Status: completedCompleted

Date Standardization

  • 01/16/20242024-01-16
  • 2024/01/182024-01-18
  • Handles multiple formats automatically

Missing Values

  • Missing products → Unknown Product
  • Missing prices → 0.00
  • Missing emails → no-email@provided.com
  • Missing customer names → Unknown Customer
  • Missing status → Pending Review
  • Missing quantities → 1
  • Rows with missing Order IDs are removed (invalid data)

Price Cleaning

  • Removes currency symbols: $25.5025.50
  • Converts to numeric format
  • Handles European formats with

Example

Before (messy_sales_data.csv):

Order_ID,Customer_Name,Product,Quantity,Price,Order_Date,Status,Email
1001,john smith,Laptop,1,899.99,2024-01-15,Completed,john@email.com
1002,SARAH JONES,Mouse,2,25.50,01/16/2024,completed,SARAH@EMAIL.COM
1007,David Lee,,1,149.99,2024-01-21,Completed,david@email.com

After (cleaned_sales_data.csv):

Order_ID,Customer_Name,Product,Quantity,Price,Order_Date,Status,Email
1001,John Smith,Laptop,1,899.99,2024-01-15,Completed,john@email.com
1002,Sarah Jones,Mouse,2,25.5,2024-01-16,Completed,sarah@email.com
1007,David Lee,Unknown Product,1,149.99,2024-01-21,Completed,david@email.com

Output

The script generates:

  1. Cleaned CSV file - Your data with all issues resolved
  2. Console summary - Statistics about what was cleaned:
    • Number of rows processed
    • Duplicates removed
    • Missing values handled
    • Sample of cleaned data

Technical Details

Built with:

  • Python 3.x
  • pandas for data manipulation
  • datetime for date parsing
  • Built-in error handling for robustness

Key capabilities:

  • Handles multiple date format variations
  • Graceful error handling for malformed data
  • Preserves data types (integers, floats, strings)
  • Zero missing values in output
  • Fast processing (handles thousands of rows efficiently)

Use Cases

Perfect for:

  • Sales data cleanup
  • E-commerce order processing
  • Customer database standardization
  • Data migration preparation
  • Report generation preprocessing
  • Any CSV data that needs consistent formatting

License

MIT License - Free to use, modify, and distribute.

Author

Built by Morvyr W. as part of a Python automation portfolio.

Connect: X: @MorvyrWinds GitHub: Morvyr


Portfolio project demonstrating Python data processing, pandas manipulation, and automated data quality improvement.

About

Python tool for automated CSV data cleaning and standardization

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages