A lightweight, full-stack web application designed to generate multilingual corpus data for economic and general terms. Built with FastAPI and Vue 3.
Supports 20+ languages including English, Traditional Chinese, Simplified Chinese, Japanese, Korean, Spanish, French, German, Russian, and more.
TL;DR:
- Install dependencies (Python + Node.js)
- Start backend and frontend servers
- Configure User-Agent in the Manage page (required by Wikipedia API)
- Start crawling!
Features:
- Instant Multilingual Search: Input a term (e.g., "Inflation") and retrieve its summary in 20+ languages simultaneously.
- Wikipedia Integration: Automatically fetches data from Wikipedia using the `wikipedia-api` library, leveraging language links for accurate cross-lingual mapping (see the sketch after this feature list).
- Multi-Language Interface: Clean, modern UI displaying multiple language definitions with flags and labels.
- Auto-Save to Markdown: Every search result is automatically saved as a Markdown file in the backend's `output/` directory.
- JSON Export: One-click export of current search results to a JSON file from the frontend.
- 📋 Batch Import & Automation: Crawl hundreds of terms automatically via text input or file upload (CSV/TXT).
- 🇨🇳 Smart Chinese Conversion: Automatically converts Traditional Chinese (Wikipedia default) to Simplified Chinese using `zhconv`.
- 📊 Real-time Monitoring: Dashboard to track crawling progress, success/failure rates, and current status.
- 💾 Database Persistence: Uses SQLite to store crawl history, allowing you to resume tasks or export data anytime.
- 📥 Robust Export: Download results as valid JSON files or UTF-8 encoded CSVs (Excel compatible).
- 🕸️ Knowledge Graph Visualization: Interactive D3.js force-directed graph showing term relationships.
- 🎯 Depth-Controlled Crawling: Configure crawl depth (1-3 levels) to automatically discover related terms from "See Also" and internal links.
- 🔗 Association Tracking: Stores term relationships (links, categories) in the database for graph generation.
- 🖼️ Multi-Format Export: Export knowledge graphs as PNG (high-res), SVG (editable), or JSON (data).
- 🎯 Smart Label Display: Only shows labels for root and first-layer nodes to reduce visual clutter.
- 🌍 20+ Language Support: Crawl Wikipedia content in 20+ languages including Traditional Chinese (繁體中文), Japanese (日本語), Korean (한국어), Spanish, French, German, Russian, and more.
- 🇹🇼 Traditional Chinese: Added Traditional Chinese support with automatic variant conversion using `zhconv`.
- 🌐 Dynamic Language Selection: Choose target languages before each crawl from an intuitive multi-select interface.
- 🔄 Auto-Translation Discovery: Uses Wikipedia's language links to find corresponding articles across all selected languages.
- 📑 Multi-Language Display: View all translations side-by-side in the results table with language-specific flags and labels.
- 💾 Database Backup & Restore:
  - Download complete database backups (`.db` files)
  - Upload and restore from previous backups with safety checks
  - Automatic backup before restore operations
- 📤 Enhanced Export Formats:
  - JSON: Complete metadata including ID, status, timestamps, depth_level, and all translations
  - JSONL: Machine-learning-ready format with dynamic language columns
  - CSV/TSV: Excel-compatible with ID column and all selected languages
  - TMX: Professional translation memory format for CAT tools
  - TXT: Human-readable multilingual format
- 🧹 Data Quality Tools:
  - Quality analysis dashboard showing completion rates and issues
  - Clean-data wizard to remove failed/incomplete entries
  - Filter and view problematic terms
- 🌐 English UI: Complete interface localization (UI in English, content in selected languages)
- ⚙️ System Configuration:
  - Editable User-Agent settings (required by the Wikipedia API)
  - Settings persist across sessions
  - No server restart required
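A minimal sketch of the langlink-based lookup described above, using the `wikipedia-api` package (imported as `wikipediaapi`); the User-Agent string and the target-language set are placeholders, not the project's actual configuration:

```python
import wikipediaapi

UA = "YourProject/1.0 (your-email@example.com)"  # placeholder; set your own
wiki_en = wikipediaapi.Wikipedia(user_agent=UA, language="en")

page = wiki_en.page("Inflation")
if page.exists():
    results = {"en": {"summary": page.summary, "url": page.fullurl}}
    targets = {"zh", "ja", "ko"}  # illustrative selection
    # page.langlinks maps language codes to page objects on that language's
    # wiki; accessing .summary fetches the article from that wiki
    for lang, link in page.langlinks.items():
        if lang in targets:
            results[lang] = {"summary": link.summary, "url": link.fullurl}
```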
Tech Stack (a minimal wiring sketch follows this list):
- FastAPI: High-performance web framework.
- SQLite + aiosqlite: Async database for managing batch tasks.
- Wikipedia-API: Python wrapper for the official MediaWiki API.
- zhconv: Advanced Traditional-to-Simplified Chinese conversion.
- Pydantic: Data validation.
- Vue 3 + Vite: Lightning-fast frontend tooling.
- TailwindCSS: Utility-first styling.
- D3.js: Knowledge graph visualization.
- Axios: HTTP client.
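As an illustration of how the backend pieces fit together, here is a minimal endpoint in the FastAPI + aiosqlite style; the route, database path, and column list are hypothetical, though the table matches the schema shown later:

```python
import sqlite3

import aiosqlite
from fastapi import FastAPI

app = FastAPI()
DB_PATH = "corpus.db"  # hypothetical path

@app.get("/api/tasks")
async def list_tasks():
    # async database access keeps the event loop free during queries
    async with aiosqlite.connect(DB_PATH) as db:
        db.row_factory = sqlite3.Row
        async with db.execute(
            "SELECT id, status, total_terms, completed_terms FROM batch_tasks"
        ) as cur:
            return [dict(row) for row in await cur.fetchall()]
```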
This tool is designed to strictly adhere to Wikipedia's User-Agent Policy and API Usage Guidelines:
- Official API: Uses the standard MediaWiki API endpoints, not screen scraping.
- User-Agent: ⚠️ You MUST configure your own User-Agent before using this tool.
  - Access the Manage page → System Configuration to set your User-Agent
  - Must include your project name and contact information (email or GitHub URL)
  - Example: `YourProject/1.0 (your-email@example.com)` or `YourProject/1.0 (https://github.com/YourUsername/YourRepo)`
  - See SETUP.md for detailed instructions
- Rate Limiting: Enforces a configurable delay (default 3s) between requests in batch mode to prevent server overload (see the loop sketch below).
- Sequential Processing: Batch tasks are processed serially to maintain a low concurrency footprint.
- Privacy: Database files are gitignored by default. No personal data is collected or transmitted.
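The two rules above amount to a simple control loop. A hedged sketch, where `crawl_term` is a stand-in for the real per-term fetch logic:

```python
import asyncio

REQUEST_DELAY = 3.0  # configurable delay between requests (default 3s)

async def crawl_term(term: str) -> None:
    ...  # fetch summaries and langlinks for `term` here

async def run_batch(terms: list[str]) -> None:
    for term in terms:              # strictly sequential: one request at a time
        await crawl_term(term)
        await asyncio.sleep(REQUEST_DELAY)  # pause before the next request

# asyncio.run(run_batch(["Inflation", "GDP"]))
```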
Feature Checklist:
- Batch Input Methods:
  - Paste multiple terms (one per line) ✅
  - Upload CSV/TXT files ✅
- Automation Controls:
  - Automated crawling with rate limiting ✅
  - Real-time progress monitoring ✅
  - Automatic retry mechanism ✅
- Results Management:
  - Batch export to JSON/CSV (Simplified Chinese support) ✅
  - Database persistence ✅
- Link Discovery Strategies:
  - "See also" sections from Wikipedia pages ✅
  - High-frequency internal links (via associations) ✅
  - Category tag exploration ✅
  - Cross-language related articles (via langlinks) ✅
- Crawl Depth Control (see the sketch after this list):
  - Configurable depth levels (1-3 layers) ✅
  - Maximum terms per layer ✅
  - Blacklist filtering for irrelevant terms (basic filtering implemented) ✅
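A sketch of depth-limited discovery under stated assumptions: it walks `page.links` only (the real crawler also mines "See also" sections and categories), and the per-layer cap and blacklist are illustrative:

```python
import wikipediaapi

UA = "YourProject/1.0 (your-email@example.com)"  # placeholder
wiki = wikipediaapi.Wikipedia(user_agent=UA, language="en")

MAX_PER_LAYER = 10                   # illustrative cap on new terms per layer
BLACKLIST = ("List of", "Index of")  # illustrative title filter

def discover(root: str, max_depth: int = 2) -> set[str]:
    seen, frontier = {root}, [root]
    for _ in range(max_depth):       # one pass per crawl layer
        next_frontier = []
        for title in frontier:
            page = wiki.page(title)
            if not page.exists():
                continue
            fresh = [t for t in page.links
                     if t not in seen and not t.startswith(BLACKLIST)]
            for t in fresh[:MAX_PER_LAYER]:
                seen.add(t)
                next_frontier.append(t)
        frontier = next_frontier
    return seen
```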
- Knowledge Graph Visualization:
  - Force-directed graph of term relationships ✅
  - Topic clustering display (via force layout) ✅
  - Multi-format export: PNG, SVG, JSON ✅
  - Smart full-graph capture (ignores zoom state) ✅
- Term Deduplication:
  - Detect duplicate terms before batch crawling ✅
  - UI warning for existing terms with skip/force options ✅
  - Global duplicate check across all tasks ✅
- Data Quality Control:
  - Automatic quality analysis (missing translations, short summaries) ✅
  - Quality report dashboard ✅
  - Data cleaning tools to remove failed/low-quality entries (see the sketch below) ✅
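A sketch of what the cleaning step might do against the schema shown later; the status value and length threshold are assumptions, not the app's actual criteria:

```python
import sqlite3

def clean_corpus(db_path: str = "corpus.db") -> int:
    """Remove failed or clearly incomplete entries; returns rows deleted."""
    con = sqlite3.connect(db_path)
    with con:  # commit on success, roll back on error
        cur = con.execute(
            "DELETE FROM terms WHERE status = 'failed' "
            "OR en_summary IS NULL OR LENGTH(en_summary) < 20"
        )
    deleted = cur.rowcount
    con.close()
    return deleted
```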
- Batch Task Management:
  - View all historical batch tasks ✅
  - Delete/archive old tasks ✅
  - Merge multiple tasks into a unified corpus (partial: export-based merging possible)
- Multi-Format Export:
  - JSONL (one JSON object per line) - ML training ready ✅ (see the sketch after this list)
  - TMX (Translation Memory eXchange) - CAT tool compatible ✅
  - TSV (tab-separated values) - Excel/Pandas friendly ✅
  - TXT (plain text bilingual pairs) - simple readable format ✅
  - Parquet (optional) - big data processing (not implemented; not needed at the current scale)
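A sketch of the JSONL export ("one JSON object per line"); the column names follow the terms table shown below, and the 'completed' status filter is an assumption:

```python
import json
import sqlite3

def export_jsonl(db_path: str, out_path: str) -> None:
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row
    with open(out_path, "w", encoding="utf-8") as f:
        for row in con.execute("SELECT * FROM terms WHERE status = 'completed'"):
            record = dict(row)
            # the translations column stores a JSON string; inline it
            record["translations"] = json.loads(record["translations"] or "{}")
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    con.close()
```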
- Data Persistence:
  - Database backup/restore functionality ✅ (see the sketch below)
  - Complete data reset with confirmation ✅
  - Export entire corpus as a portable file ✅
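One way to implement the backup is sqlite3's built-in online-backup API; a sketch only, since the app may equally just serve the `.db` file for download:

```python
import sqlite3

def backup_db(src_path: str = "corpus.db", dst_path: str = "backup.db") -> None:
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    with dst:
        src.backup(dst)  # consistent snapshot, safe while the app is running
    dst.close()
    src.close()
```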
- Multi-Language Support:
  - Support for 20+ Wikipedia languages ✅
  - Traditional Chinese (zh-tw) and Simplified Chinese (zh) ✅
  - Dynamic language selection per task ✅
  - Automatic variant conversion via zhconv ✅ (example below)
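The zhconv conversion mentioned above is a one-liner; for example:

```python
from zhconv import convert

print(convert("經濟學", "zh-hans"))  # Traditional -> Simplified: 经济学
print(convert("经济学", "zh-hant"))  # Simplified -> Traditional: 經濟學
```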
- Language Detection & Linking:
  - Use Wikipedia langlinks for translation discovery ✅
  - Store translations in structured JSON format ✅
  - Multi-language display in results table ✅
- User-Agent Configuration:
  - Editable User-Agent in the UI (Manage page) ✅
  - Persistent settings storage in the database ✅
  - Wikipedia API compliance ✅
- Statistics Dashboard (a plain-SQL sketch follows these lists):
  - Total terms / bilingual pair counts ✅ (basic stats implemented)
  - Character count (EN/ZH separately)
  - Average summary length
  - Database size metrics ✅
  - Knowledge graph node/edge counts ✅
- Coverage Analysis:
  - Success rate visualization
  - Missing translation tracking
  - Domain distribution (if tagged)
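The implemented counters reduce to plain SQL over the schema shown below; a minimal sketch, where the 'completed' status value and the success-rate definition are assumptions:

```python
import sqlite3

con = sqlite3.connect("corpus.db")  # hypothetical path
total = con.execute("SELECT COUNT(*) FROM terms").fetchone()[0]
done = con.execute(
    "SELECT COUNT(*) FROM terms WHERE status = 'completed'").fetchone()[0]
edges = con.execute("SELECT COUNT(*) FROM term_associations").fetchone()[0]
print(f"terms={total}  success_rate={done / max(total, 1):.1%}  edges={edges}")
con.close()
```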
Backend Enhancements:

```
backend/
├── main.py        # Existing FastAPI main file
├── worker.py      # New: Background task worker (Celery/RQ)
├── models.py      # New: Database models
├── scheduler.py   # New: Batch crawl scheduler
├── database.py    # New: Database connection
└── utils/
    ├── rate_limiter.py  # Rate limiting control
    └── retry.py         # Retry logic
```
Frontend Enhancements:

```
frontend/src/
├── App.vue                  # Existing main component
└── components/
    ├── BatchImport.vue      # Batch import interface
    ├── ProgressMonitor.vue  # Progress tracking
    └── ResultsTable.vue     # Results data table
```
Database Schema (Current Implementation):
```sql
-- Batch tasks tracking
CREATE TABLE batch_tasks (
id INTEGER PRIMARY KEY AUTOINCREMENT,
status TEXT NOT NULL,
total_terms INTEGER DEFAULT 0,
completed_terms INTEGER DEFAULT 0,
failed_terms INTEGER DEFAULT 0,
max_depth INTEGER DEFAULT 1,
target_languages TEXT DEFAULT 'en,zh', -- Comma-separated language codes
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
-- Individual terms
CREATE TABLE terms (
id INTEGER PRIMARY KEY AUTOINCREMENT,
task_id INTEGER,
term TEXT NOT NULL,
status TEXT NOT NULL,
en_summary TEXT,
en_url TEXT,
zh_summary TEXT,
zh_url TEXT,
translations TEXT, -- JSON string: {"lang": {"summary": "...", "url": "..."}}
error_message TEXT,
depth_level INTEGER DEFAULT 0,
source_term_id INTEGER,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (task_id) REFERENCES batch_tasks(id)
);
-- Term associations (for knowledge graph)
CREATE TABLE term_associations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
source_term_id INTEGER,
target_term TEXT,
association_type TEXT,
weight REAL DEFAULT 1.0,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (source_term_id) REFERENCES terms(id)
);
-- System settings (User-Agent, etc.)
CREATE TABLE system_settings (
key TEXT PRIMARY KEY,
value TEXT,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
```
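To connect this schema to the D3 visualization described earlier, a sketch that flattens terms and term_associations into the node/link JSON a force-directed layout consumes (the database path is hypothetical):

```python
import json
import sqlite3

con = sqlite3.connect("corpus.db")  # hypothetical path
nodes = [{"id": term, "depth": depth} for term, depth in
         con.execute("SELECT term, depth_level FROM terms")]
links = [{"source": s, "target": t, "weight": w} for s, t, w in con.execute(
    "SELECT src.term, a.target_term, a.weight "
    "FROM term_associations a JOIN terms src ON src.id = a.source_term_id")]
con.close()

# add placeholder nodes for targets discovered but not yet crawled
for t in {l["target"] for l in links} - {n["id"] for n in nodes}:
    nodes.append({"id": t, "depth": None})

print(json.dumps({"nodes": nodes, "links": links}, ensure_ascii=False))
```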
✅ Completed Phases:
- Phase 1 (v2.0): Batch Import & Automated Crawling - Core automation infrastructure
- Phase 2 (v2.1): Intelligent Association Crawling & Knowledge Graph - Self-growing knowledge base
- Phase 3 (v2.2): Corpus Quality & Data Management - Quality control and task management
- Phase 4 (v2.3): Advanced Export & Persistence - Professional export formats and backup/restore
- Phase 5 (v2.2): Multilingual Wikipedia Expansion - 20+ language support
- Phase 6 (v2.3): System Configuration & Compliance - User-Agent settings and API compliance
🎯 Future Enhancements (Phase 7+):
- Advanced statistics and analytics dashboard
- Character-level corpus analysis
- Domain tagging and classification
- Distributed crawling architecture (for 10,000+ terms scale)
Recently Completed:
- ✅ Complete database backup and restore functionality
- ✅ Enhanced export with full metadata (JSON, JSONL, CSV, TSV, TMX, TXT)
- ✅ Data quality analysis and cleaning tools
- ✅ System configuration panel for User-Agent settings
- ✅ Complete English UI localization
- ✅ Privacy protection: removed personal info from default configs
- ✅ Added .gitignore for database and sensitive files
- ✅ Created SETUP.md with User-Agent configuration guide
- ✅ Support for 20+ Wikipedia languages
- ✅ Traditional Chinese (繁體中文) with automatic conversion
- ✅ Dynamic language selection per crawl task
- ✅ Multi-language results display with flags
- ✅ Translations stored in structured JSON format
- ✅ Interactive D3.js force-directed graph visualization
- ✅ Depth-controlled intelligent association crawling
- ✅ Multi-format graph export (PNG, SVG, JSON)
This project has evolved from a simple bilingual search tool to a comprehensive multilingual knowledge corpus management system supporting 20+ languages.