Skip to content

roALAB1/data-normalization-platform

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Data Normalization Platform

GitHub Release License Build Status

A production-ready web application for normalizing and cleaning messy data at scale. Built to handle tens to hundreds of thousands of rows with intelligent parsing, validation, and formatting.

๐ŸŽฏ Overview

A unified, enterprise-scale data normalization platform that automatically detects and normalizes multiple data types in a single workflow with batch processing API, real-time monitoring, and Redis caching (v3.50.0):

  • Smart Column Mapping (v3.50.0): Automatically detects fragmented columns (House + Street + Apt, First + Last Name, Area Code + Phone) and suggests intelligent combinations
  • Intelligent Auto-Detection: Automatically identifies column types (name, email, phone, address, city, state, zip, country, company, url) with 95%+ accuracy
  • Multi-Column Processing: Normalize all columns simultaneously with real-time progress tracking
  • Enterprise Streaming: Process 100k+ rows with memory-efficient streaming architecture
  • Parallel Processing: Web Worker pool (4-8 workers) for maximum performance
  • Real-Time Memory Monitoring: Live dashboard tracking worker pool performance, memory usage, recycling events, retry statistics
  • Company Name Detection: Intelligent identification of company columns, no splitting, title case normalization with abbreviation preservation
  • Results Preservation: Seamless navigation between results and monitoring dashboard without data loss
  • Specialized Normalizers:
    • Names: 750+ credentials, Asian name detection, context-aware parsing
    • Phones: 250+ countries, type detection, multiple formats
    • Emails: RFC 5322 validation, provider-specific rules, MX/SMTP verification
    • Addresses: Title Case, 27+ abbreviations, standardization
    • URLs: Protocol/www removal, domain extraction, international TLD support
    • Company Names: Title case with abbreviation preservation (IBM, LLC, Inc.)

Key Features

โœ… Smart Column Mapping ๐Ÿค–: Auto-detect fragmented columns and suggest intelligent combinations (v3.50.0)
โœ… Intelligent Multi-Type Platform ๐Ÿค–: Auto-detect data types (name, email, phone, address, url) with 95%+ accuracy and normalize multiple columns
โœ… Address Normalization ๐Ÿ : Title Case conversion, 27+ street suffix abbreviations, 10,000+ addresses/second
โœ… 750+ Professional Credentials: Comprehensive coverage across healthcare, finance, IT, engineering, supply chain, legal, education, and more
โœ… Enterprise Email Verification ๐Ÿ“ง: MX record validation, SMTP testing, disposable detection, reputation scoring
โœ… Email CSV Batch Processing ๐Ÿ“Š: Upload CSV files with up to 10,000 emails for bulk normalization
โœ… Phone Normalization Enhanced ๐Ÿ“ฑ: Google libphonenumber integration with 250+ countries, type detection, multiple formats
โœ… WebSocket Progress Tracking โšก: Real-time batch job updates with <100ms latency
โœ… Asian Name Detection ๐ŸŒ: Intelligent detection of family-name-first patterns for 400+ Chinese/Korean/Japanese/Vietnamese surnames
โœ… Context-Aware Parsing ๐Ÿง : Uses email/phone/company context to boost parsing accuracy
โœ… Cross-Page Navigation ๐Ÿงญ: Seamless navigation between all demo pages
โœ… Intelligent CSV Parsing: Auto-detects column structure (single full name, first/last split, multi-column)
โœ… Batch Processing: Server-side job queue handles unlimited dataset sizes
โœ… Real-time Processing: Interactive demo for testing individual records
โœ… Hybrid Monorepo: Publishable @normalization/core package for reuse across projects
โœ… Optimized Performance: O(1) credential lookups using Sets and Maps
โœ… Accent Handling: Configurable accent preservation or ASCII conversion
โœ… Comprehensive Cleaning: Removes credentials, job titles, pronouns, fixes encoding issues
โœ… Multiple Export Formats: CSV, JSON with detailed repair logs
โœ… Statistics Dashboard: Track valid/invalid ratios, processing time, data quality metrics
โœ… Authentication: Secure user accounts with job history
โœ… S3 Storage: Scalable file storage for uploads and results
โœ… Real-Time Memory Monitoring ๐Ÿ“Š: Live dashboard tracking worker pool performance, memory usage, recycling events, retry statistics
โœ… Company Name Detection ๐Ÿข: Intelligent identification of company columns, no splitting, title case normalization with abbreviation preservation
โœ… Results Preservation ๐Ÿ’พ: Seamless navigation between results and monitoring dashboard without data loss

What's New in v3.50.0 ๐Ÿš€

Smart Column Mapping ๐Ÿค–
Intelligent pre-normalization feature that automatically detects fragmented columns and suggests combining them into complete data fields. Eliminates 5-10 minutes of manual Excel work with one-click acceptance of smart suggestions.

Key Features:

  • ๐Ÿ  Address Components: House + Street + Apt โ†’ Address (e.g., "65" + "MILL ST" + "306" โ†’ "65 MILL ST Apt 306")
  • ๐Ÿ‘ค Name Components: First + Middle + Last + Prefix + Suffix โ†’ Full Name (supports 15+ column name variations)
  • ๐Ÿ“ž Phone Components: Area Code + Number + Extension โ†’ Phone (e.g., "555" + "123-4567" โ†’ "(555) 123-4567")
  • ๐ŸŽฏ Pattern Matching: Case-insensitive detection with space/underscore support
  • ๐Ÿ“Š Confidence Scoring: High (โ‰ฅ80%), Medium (60-79%), Low (<60%) confidence indicators
  • ๐Ÿ‘๏ธ Preview Generation: Shows 3 sample combinations before acceptance
  • โšก Fast Detection: <50ms for typical CSV (10-20 columns)
  • ๐ŸŽจ SmartSuggestions UI: User-friendly interface with Accept/Customize/Ignore actions

UI Enhancements:

User Experience:

  • Before: 5-10 minutes of manual column combination in Excel
  • After: One-click "Accept" on smart suggestion
  • Eliminates manual Excel formula work and reduces errors

Test Coverage:

  • 22/22 comprehensive unit tests (100% pass rate)
  • Detection time: <50ms for typical CSV
  • Minimal memory overhead (only 5 sample rows per column)

What's New in v3.49.0 ๐Ÿš€

Large File Processing Fix ๐Ÿš€
Fixed critical memory issues preventing the app from processing 400k+ row files. Implemented server-side streaming architecture with automatic routing: files under 50k rows use fast client-side processing, while files 50k+ rows use memory-safe server-side streaming. The StreamingCSVWriter processes data in 10k row chunks with incremental S3 uploads, reducing memory usage from 1GB+ to just 265MB. Tested with 400,000 rows achieving 582 rows/sec processing speed with stable memory usage. The app can now handle 1M+ row files without crashes.

Key Features:

  • ๐ŸŽฏ Automatic Routing: < 50k rows โ†’ client-side, โ‰ฅ 50k rows โ†’ server-side streaming
  • ๐Ÿ’พ Memory-Safe Processing: 265MB heap usage (vs 1GB+ before)
  • โšก High Performance: 582 rows/sec processing speed
  • ๐Ÿ“ฆ Chunked Processing: 10k row buffers with incremental S3 uploads
  • ๐Ÿ”„ Background Jobs: Re-enabled BatchWorker with streaming support
  • โœ… Production Ready: Tested with 400k rows, supports 1M+ rows

Performance Improvements:

  • Memory: 1GB+ โ†’ 265MB (73% reduction)
  • Processing: 400k rows in 11.5 minutes
  • Stability: No crashes, no freezes at 50% or 100%
  • Download: Automatic S3 upload with presigned URLs

What's New in v3.48.0 ๐Ÿš€

URL Normalization Feature ๐ŸŒ
Comprehensive URL normalization that extracts clean domain names from URLs by removing protocols, www prefixes, paths, query parameters, and fragments. Auto-detects URL columns in CSV files with 95%+ accuracy and supports international domains (.co.uk, .com.au, etc.). Includes confidence scoring for URL validity and handles 18+ multi-part TLDs. All 40 tests passing with full integration into the intelligent normalization engine.

Key Features:

  • ๐ŸŒ Protocol Removal: Strips http://, https://, ftp://, and other protocols
  • ๐Ÿ”— WWW Prefix Removal: Removes www. from domain names (case-insensitive)
  • ๐ŸŽฏ Root Domain Extraction: Extracts only domain + extension (google.com)
  • ๐Ÿ—‘๏ธ Path/Query/Fragment Removal: Removes /paths, ?query=params, and #fragments
  • ๐ŸŒ International Domain Support: Handles .co.uk, .com.au, and 18+ multi-part TLDs
  • ๐Ÿค– Auto-Detection: Automatically identifies URL columns (Website, URL, Link, Homepage)
  • ๐Ÿ“Š Confidence Scoring: 0-1 confidence scores based on domain validity
  • โœ… 40 Tests Passing: Comprehensive coverage including real-world examples

Examples:

  • http://www.google.com โ†’ google.com
  • https://www.example.com/page?query=1 โ†’ example.com
  • www.facebook.com/profile#section โ†’ facebook.com
  • subdomain.site.co.uk/path โ†’ site.co.uk

What's New in v3.45.0 ๐Ÿš€

PO Box Normalization, ZIP Validation & Confidence Scoring ๐Ÿ“ฎ
Comprehensive address quality improvements with intelligent PO Box detection and normalization, ZIP code validation against state data, and confidence scoring for all address components. Introduced data quality flags to identify missing fields, ZIP/state mismatches, and ambiguous cities. All 37 v3.45.0 tests passing with full backward compatibility verified.

Key Features:

  • ๐Ÿ“ฎ PO Box Normalization: Detects and normalizes P.O. Box, POBox, PO Box, etc. to standard "PO Box" format
  • โœ… ZIP Code Validation: Validates ZIP codes against state data using @mardillu/us-cities-utils package
  • ๐ŸŽฏ Confidence Scoring: 0-1 confidence scores for each address component (street, city, state, zip)
  • ๐Ÿšฉ Data Quality Flags: Identifies missing fields, ZIP/state mismatches, ambiguous cities, and other issues
  • ๐Ÿ”„ Backward Compatible: All existing address normalization features preserved with enhanced validation
  • ๐Ÿ“Š 37 Tests Passing: Comprehensive test coverage for all new features and edge cases

What's New in v3.40.1 ๐Ÿš€

CRM Merge Jobs Database Fix ๐Ÿ”ง
Fixed critical bug preventing CRM merge jobs from being submitted and processed. The crmMergeJobs database table was missing, causing job submissions to fail silently. Added complete database schema with 19 columns for tracking merge jobs, created CRM-specific update functions (updateCRMMergeJobStatus, updateCRMMergeJobProgress), and fixed submitMergeJob endpoint to use the correct table. Background worker now properly initializes and processes CRM merge jobs with multiple enrichment files.

Key Fixes:

  • ๐Ÿ—„๏ธ Database Table Created: Added crmMergeJobs table with proper schema (migration 0006_organic_hardball.sql)
  • ๐Ÿ“ Correct Table Usage: Fixed submitMergeJob to insert into crmMergeJobs instead of jobs table
  • ๐Ÿ”„ CRM-Specific Functions: Added updateCRMMergeJobStatus, updateCRMMergeJobProgress, updateCRMMergeJobProgressSimple
  • ๐ŸŽฏ Worker Integration: Updated CRMMergeWorker to use CRM-specific database functions
  • ๐Ÿ“Š Proper Data Storage: Stores enrichedFileKeys and enrichedFileUrls as JSON arrays

What's New in v3.40.0 ๐Ÿš€

Batch Jobs Authentication Fix ๐Ÿ”’
Fixed critical authentication issue preventing access to the Batch Jobs page. Implemented server-side authentication fallback (matching CRM Sync pattern) that automatically uses owner credentials when no user is logged in. Removed client-side authentication check that was blocking page render. The Batch Jobs page now loads correctly with full access to job history, submission, and download features.

Key Improvements:

  • โœ… Server-Side Auth Fallback: Automatically uses owner ID from OWNER_OPEN_ID environment variable
  • ๐Ÿšช No Login Required: Page accessible without manual authentication during development
  • ๐Ÿ“Š Full Functionality: Job list, submission, cancellation, and downloads all working
  • ๐Ÿ”„ Consistent Pattern: Matches CRM Sync authentication approach for unified experience

๐Ÿ“š Documentation

๐Ÿš€ Quick Start

# Install dependencies
pnpm install

# Run development server
pnpm dev

# Run tests
pnpm test

# Build for production
pnpm build

๐Ÿ“ฆ Installation

git clone https://github.com/roALAB1/data-normalization-platform.git
cd data-normalization-platform
pnpm install

๐Ÿ”ง Configuration

Set up environment variables in .env:

DATABASE_URL=mysql://...
REDIS_HOST=localhost
REDIS_PORT=6379
JWT_SECRET=your-secret-key
OWNER_OPEN_ID=your-owner-id

๐Ÿงช Testing

# Run all tests
pnpm test

# Run specific test file
pnpm test tests/url-normalization.test.ts

# Run tests in watch mode
pnpm test --watch

๐Ÿ“Š Performance

  • Processing Speed: 1,000-5,000 rows/second (depends on CPU cores and data complexity)
  • Memory Usage: Constant footprint regardless of file size (streaming architecture)
  • Scalability: No row limit, linear scaling with CPU cores
  • Large Files: 400k+ rows tested, supports 1M+ rows

๐Ÿค Contributing

Contributions are welcome! Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.

๐Ÿ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Built with React 19, Tailwind CSS 4, Express 4, tRPC 11
  • Uses libphonenumber-js for phone normalization
  • Uses @mardillu/us-cities-utils for ZIP validation
  • Inspired by Namefully library for name parsing

๐Ÿ“ž Support

For issues, questions, or feature requests, please open an issue on GitHub.

๐Ÿ”— Links

About

Production-ready data normalization platform for cleaning messy business data at scale

Resources

License

Contributing

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •