A Multi-Granularity Bilingual Corpus System for Economic Analysis
An intelligent corpus platform integrating terminology, policy documents, and market sentiment
EconMind Matrix is an innovative multilingual corpus platform for economics, integrating three dimensions of data:
| Layer | Name | Content | Data Source |
|---|---|---|---|
| Layer 1 | Terminology Knowledge Base | 20+ language economic term definitions + Knowledge Graph | Wikipedia |
| Layer 2 | Policy Parallel Corpus | Central bank report alignment (PBOC vs Fed) | Official Reports |
| Layer 3 | Sentiment & Trend Corpus | Financial news + Sentiment labels + Time series | News Media |
```
┌───────────────────────────────────────────────────────────────┐
│  Layer 3: Sentiment & Trend Corpus                            │
│  📰 Financial News + Sentiment Labels + Term Trend Charts     │
└───────────────────────┬───────────────────────────────────────┘
                        │ Time Series Correlation
┌───────────────────────┴───────────────────────────────────────┐
│  Layer 2: Policy & Comparable Corpus                          │
│  📋 Central Bank Report Alignment (PBOC vs Fed)               │
└───────────────────────┬───────────────────────────────────────┘
                        │ Term Linking
┌───────────────────────┴───────────────────────────────────────┐
│  Layer 1: Terminology Knowledge Base                          │
│  📚 20+ Language Definitions + Knowledge Graph                │
└───────────────────────────────────────────────────────────────┘
```
Search for "Inflation" and get:
- Terminology Layer: professional definitions in 20+ languages plus a related-concept knowledge graph
- Policy Layer: side-by-side comparison of related PBOC and Federal Reserve paragraphs
- Sentiment Layer: news headlines from the last 30 days plus a sentiment trend chart
- Multilingual Support: Covers 20+ languages including English, Chinese, Japanese, Korean, French, German, Russian
- Chinese Conversion: Automatic Traditional to Simplified Chinese conversion
- Knowledge Graph: D3.js visualization of term relationship networks
- LLM Pre-annotation: Using Gemini/GPT for sentiment analysis and entity extraction
- Human-in-the-Loop: Doccano platform for expert verification
- Quality Control: Hybrid annotation accuracy > 90%
- JSONL: Machine learning training format
- TMX: Translation Memory (CAT tool compatible)
- CSV/TSV: Excel/Pandas friendly
- TXT: Human readable format
```
EconMind-Matrix/
│
├── 📁 backend/                    # Layer 1: Terminology Backend (Complete)
│   ├── main.py                    # FastAPI server
│   ├── database.py                # Database operations
│   ├── models.py                  # Data models
│   ├── .env.example               # Environment configuration template
│   └── output/                    # Crawl results (Markdown)
│
├── 📁 frontend/                   # Layer 1: Vue.js Frontend (Complete)
│   ├── src/
│   │   ├── App.vue                # Main component
│   │   ├── components/            # UI components
│   │   └── services/api.js        # Centralized API service
│   ├── .env.development           # Dev environment config
│   ├── .env.production            # Prod environment config
│   └── package.json
│
├── 📁 shared/                     # Shared Utilities (NEW)
│   ├── __init__.py                # Package exports
│   ├── utils.py                   # Text utilities (clean_text)
│   ├── schema.py                  # Centralized DB schemas (11 tables)
│   ├── errors.py                  # Standardized error classes
│   ├── config.py                  # Configuration constants
│   └── README.md                  # Module documentation
│
├── 📁 layer2_policy/              # Layer 2: Policy Module (Complete)
│   ├── backend/
│   │   ├── api.py                 # Policy API endpoints
│   │   ├── pdf_parser.py          # Marker PDF parsing
│   │   ├── alignment.py           # Sentence-BERT paragraph alignment
│   │   └── models.py              # Policy data models
│   └── data/
│       ├── pboc/                  # PBOC reports
│       └── fed/                   # Federal Reserve reports
│
├── 📁 layer3_sentiment/           # Layer 3: Sentiment Module (Complete)
│   ├── backend/
│   │   ├── api.py                 # FastAPI sentiment endpoints
│   │   ├── database.py            # Sentiment database operations
│   │   └── models.py              # News & sentiment data models
│   ├── crawler/                   # News crawler
│   │   └── news_crawler.py        # RSS feed crawler (Bloomberg, Reuters, etc.)
│   ├── annotation/                # LLM annotation + Doccano integration
│   │   ├── llm_annotator.py       # Gemini API sentiment analysis
│   │   └── doccano_export.py      # Doccano import/export scripts
│   └── analysis/                  # Trend analysis
│       └── trend_analysis.py      # Time series analysis module
│
├── 📁 dataset/                    # Dataset export directory
│   ├── terminology.jsonl          # Layer 1 data
│   ├── policy_alignment.jsonl     # Layer 2 data
│   └── news_sentiment.jsonl       # Layer 3 data
│
├── 📁 scripts/                    # Automation scripts
│   ├── export_dataset.py          # Dataset export
│   └── crawl_all.py               # Batch crawling
│
├── 📁 docs/                       # Project documentation
│   ├── proposal.md                # Project proposal
│   ├── architecture.md            # Technical architecture
│   └── api.md                     # API documentation
│
├── pyproject.toml                 # Python package configuration
├── README.md                      # This file
├── SETUP.md                       # Installation guide
└── LICENSE                        # MIT License
```
- Python 3.9+
- Node.js 16+
- Git
```bash
# 1. Clone repository
git clone https://github.com/[your-username]/EconMind-Matrix.git
cd EconMind-Matrix

# 2. Install backend dependencies
cd backend
pip install -r requirements.txt

# 3. Install frontend dependencies
cd ../frontend
npm install

# 4. Start backend server
cd ../backend
python main.py     # Runs on http://localhost:8000

# 5. Start frontend dev server
cd ../frontend
npm run dev        # Runs on http://localhost:5173
```

Visit the Manage page in the web interface to configure your User-Agent (required by the Wikipedia API).
See SETUP.md for details.
Based on the TermCorpusGenerator project:
- Wikipedia multilingual term crawling
- 20+ language support (including Traditional/Simplified Chinese conversion)
- Batch import and automated crawling
- Intelligent association crawling (See Also, link analysis)
- D3.js knowledge graph visualization
- Multi-format export (JSON, JSONL, CSV, TSV, TMX, TXT)
- Data quality analysis and cleaning tools
- Database backup/restore functionality
Target: Mid December 2025

✅ Code Implementation Completed (2024-12-14):

- Data Models (`layer2_policy/backend/models.py`)
  - PolicyReport, PolicyParagraph, PolicyAlignment dataclasses
  - Database schema for Layer 2 tables
  - 8 policy topics with bilingual keywords (inflation, employment, etc.)
  - Topic detection via keyword matching
- PDF Parsing Module (`layer2_policy/backend/pdf_parser.py`)
  - Marker integration for AI-powered PDF→Markdown conversion
  - PyPDF2 fallback for basic text extraction (see the sketch after this list)
  - Automatic title and date extraction
  - Paragraph splitting with topic detection
  - Section-aware parsing for PBOC and Fed reports
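As a rough illustration of the PyPDF2 fallback path, the extraction step might look like the sketch below (the sample file path is hypothetical, and the real module additionally extracts titles, dates, and topics):

```python
# Minimal sketch of the PyPDF2 fallback: plain text extraction for when
# Marker's AI-powered PDF-to-Markdown conversion is unavailable.
from PyPDF2 import PdfReader

def extract_paragraphs(pdf_path: str) -> list[str]:
    """Extract raw text page by page and split it into rough paragraphs."""
    reader = PdfReader(pdf_path)
    text = "\n\n".join(page.extract_text() or "" for page in reader.pages)
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# Hypothetical report path following the repository's data layout
paragraphs = extract_paragraphs("layer2_policy/data/fed/beige_book_2024.pdf")
print(f"{len(paragraphs)} paragraphs extracted")
```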
- Paragraph Alignment Module (`layer2_policy/backend/alignment.py`)
  - Sentence-BERT semantic similarity (multilingual)
  - Topic-based alignment fallback
  - Keyword overlap fallback
  - Embedding caching for performance
  - Alignment history tracking
  - Custom topic pool (user-defined topics)
- Database Operations (`layer2_policy/backend/database.py`)
  - Async CRUD for reports, paragraphs, alignments
  - Statistics endpoint
  - Term search across policy paragraphs
  - Quality score calculation with language breakdown
- API Endpoints (`layer2_policy/backend/api.py`)
  - POST `/upload` - Upload and parse PDF
  - POST `/upload-text` - Upload text (testing)
  - GET `/reports` - List reports
  - POST `/align` - Run alignment
  - GET `/alignments` - Query alignments
  - GET `/topics` - List and manage topics
  - GET `/stats` - Layer 2 statistics
  - GET `/export/*` - Export alignments (JSONL), reports (JSONL), parallel corpus (TSV)
✅ Completed Testing & Environment:

- Install dependencies: `torch`, `sentence-transformers` (successfully installed)
- Test PDF parsing with Marker
- Test alignment with Sentence-BERT (high-quality semantic matching enabled)
- Integrate the Layer 2 router into main.py
- Frontend component: PolicyCompare.vue with Topics, History, and Exports
Completed: December 2025

✅ Full Implementation Completed (2025-12-16):

- Data Models (`layer3_sentiment/backend/models.py`)
  - NewsArticle, SentimentAnnotation, MarketContext dataclasses
  - Database schema for Layer 3 tables
  - Economic term variants (EN/ZH) for news filtering
  - Sentiment labels: Bullish, Bearish, Neutral
- News Crawler (`layer3_sentiment/crawler/news_crawler.py`)
  - RSS feed integration (Bloomberg, Reuters, WSJ, FT, Xinhua; 21 sources in total, see the sketch after this list)
  - Async crawling with feedparser
  - Term-based news filtering
  - Automatic term detection from article content
  - User-Agent rotation pool (8 browser UAs)
  - Proxy pool support (http/https/socks5)
  - Concurrency control (1-10 concurrent requests)
  - Custom delay (0.5-10 seconds between requests)
  - Manual start/stop control with verification
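For orientation, term-filtered RSS ingestion with feedparser reduces to something like this (a synchronous sketch: the feed URL and term list are placeholders, and the real crawler adds async scheduling, UA rotation, proxies, and delays):

```python
# Simplified sketch of RSS crawling with term-based filtering.
import feedparser

FEED_URLS = ["https://example.com/markets/rss"]   # placeholder feed URL
TERMS = {"inflation", "cpi", "interest rate"}     # economic term variants

for url in FEED_URLS:
    feed = feedparser.parse(url)
    for entry in feed.entries:
        text = f"{entry.get('title', '')} {entry.get('summary', '')}".lower()
        matched = sorted(t for t in TERMS if t in text)
        if matched:
            print(entry.get("title"), "->", matched)
```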
- LLM Sentiment Annotator (`layer3_sentiment/annotation/llm_annotator.py`)
  - Gemini API integration for sentiment analysis
  - Bilingual prompt templates (EN/ZH)
  - Rule-based fallback annotator (no API required)
  - Hybrid annotator (optimizes API usage)
  - Batch annotation with rate limiting
- Doccano Integration (`layer3_sentiment/annotation/doccano_export.py`)
  - JSONL export for the Doccano platform (sample record below)
  - CSV export for spreadsheet annotation
  - Import verified annotations back to the database
  - Annotation quality checking
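For reference, one line of the Doccano-ready JSONL could look like the record below (field names follow Doccano's standard text-classification import format; the `meta` contents are illustrative):

```json
{"text": "Fed signals slower pace of rate cuts amid sticky inflation", "label": ["bearish"], "meta": {"article_id": 1523, "source": "Bloomberg"}}
```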
- Trend Analysis (`layer3_sentiment/analysis/trend_analysis.py`)
  - Daily term frequency calculation
  - Sentiment distribution over time
  - Trend direction detection (increasing/decreasing/stable)
  - Market correlation analysis (optional)
  - ECharts-compatible data generation
- API Endpoints (`layer3_sentiment/backend/api.py`) — composition sketched below
  - POST `/crawl` - Crawl news from sources
  - GET `/articles` - List articles
  - POST `/annotate` - Run sentiment annotation
  - GET `/trend/{term}` - Get term trend analysis
  - GET `/trends/hot` - Get hot terms
  - GET `/export/doccano` - Export for Doccano
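To illustrate how these endpoints compose, a client session might look like the following sketch (the base URL, query parameters, and request bodies are assumptions; see docs/api.md for the actual schema):

```python
# Hypothetical walkthrough of the Layer 3 API: crawl, then inspect results.
import httpx

BASE_URL = "http://localhost:8000"  # assumes the Layer 3 router is mounted here

with httpx.Client(base_url=BASE_URL, timeout=60.0) as client:
    # Trigger a crawl run (payload shape is an assumption)
    client.post("/crawl", json={"terms": ["Inflation"]})

    # List stored articles and fetch the trend series for one term
    articles = client.get("/articles", params={"term": "Inflation"}).json()
    trend = client.get("/trend/Inflation").json()
    print(len(articles), "articles;", "trend keys:", list(trend))
```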
- Frontend Component (`frontend/src/components/SentimentAnalysis.vue`)
  - Dashboard with sentiment statistics
  - News crawling interface with advanced options
  - Articles list with sentiment labels (search, filter, group by source)
  - Trend analysis visualization
  - Export options (JSON, JSONL, CSV, Doccano)
  - Running-crawler detection on page load
  - Force stop with verification polling
  - Proxy pool configuration UI
Critical Distinction: Layer 4 is NOT a user interface - it is an offline batch processing engine that consumes completed data from Layers 1-3 and produces publication-ready aligned datasets.
Input → Process → Output Model:

```
Layer 1 Data (corpus.db) ──┐
Layer 2 Data (corpus.db) ──┼──▶  Alignment Engine  ──▶  Unified Dataset File
Layer 3 Data (corpus.db) ──┘     (Batch Pipeline)        (aligned_corpus.jsonl)
```
What Layer 4 Does:
- Enumerates all successfully crawled terms from Layer 1
- Searches Layer 2/3 for content related to each term (across ALL supported languages)
- Aligns using multiple strategies (LLM, vectors, rules) to determine semantic relevance
- Aggregates aligned evidence into structured "Knowledge Cells"
- Exports publication-ready datasets in standardized formats (JSONL, CSV, etc.)
- Reports data quality metrics (coverage, alignment scores, language distribution)
What Layer 4 Does NOT Do:
- ❌ Provide real-time user search interfaces (that's the frontend's job)
- ❌ Store data in its own database (it reads from the Layer 1-3 databases)
- ❌ Crawl or collect raw data (Layers 1-3 handle this)
```
layer4_alignment/
├── backend/
│   ├── alignment_engine.py        # Core orchestration logic
│   ├── data_loader.py             # Load data from Layer 1-3 databases
│   ├── knowledge_cell.py          # Knowledge Cell data model (Pydantic)
│   ├── aligners/                  # Pluggable alignment strategies
│   │   ├── llm_aligner.py         # Gemini/GPT-4 semantic judgment
│   │   ├── vector_aligner.py      # Sentence-BERT cosine similarity
│   │   ├── rule_aligner.py        # Keyword + TF-IDF matching
│   │   └── hybrid_aligner.py      # Weighted ensemble of the above methods
│   ├── exporters/
│   │   ├── jsonl_exporter.py      # JSONL dataset export
│   │   ├── csv_exporter.py        # Spreadsheet-friendly export
│   │   └── quality_reporter.py    # Statistics and quality metrics
│   └── utils/
│       ├── wikidata_client.py     # Fetch Wikidata QIDs for terms
│       └── text_processor.py      # Multilingual text normalization
├── config/
│   ├── alignment_config.yaml      # Alignment strategy settings
│   └── language_support.yaml      # Language priority and mappings
├── scripts/
│   ├── run_full_alignment.py      # Batch process all terms
│   ├── incremental_update.py      # Process newly added terms only
│   └── validate_output.py         # Verify dataset integrity
└── README.md
```
Layer 4 employs four complementary alignment methods to maximize accuracy.

Strategy 1: LLM Semantic Judgment

- Model: Gemini 1.5 Pro / GPT-4 Turbo
- Method: Present the term definition plus candidate texts to the LLM and ask for relevance scoring on a 0-1 scale
- Prompt Example:

```
Term: "Inflation" (Definition: In economics, inflation is a general rise in prices...)
Rate each policy paragraph's relevance to this concept (0-1 scale):
[0] "Current inflation remains moderate, CPI rose 0.4% YoY..." → Score: ?
[1] "Export growth accelerated in Q3..." → Score: ?
```

- Advantages: Understands context, handles paraphrasing, detects conceptual matches
- Limitations: API costs, rate limits, requires careful prompt engineering
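A minimal sketch of this call using the `google-generativeai` package listed in the dependencies (the prompt wording and the JSON-array response contract are illustrative assumptions; production code would need retries and response validation):

```python
import json
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

def score_candidates(term: str, definition: str, candidates: list[str]) -> list[float]:
    """Ask the LLM to rate each candidate's relevance to the term (0-1)."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(candidates))
    prompt = (
        f'Term: "{term}" (Definition: {definition})\n'
        "Rate each paragraph's relevance to this concept on a 0-1 scale.\n"
        f"{numbered}\n"
        "Reply with only a JSON array of floats, one per paragraph."
    )
    response = model.generate_content(prompt)
    return json.loads(response.text)  # assumes the model honors the format

scores = score_candidates(
    "Inflation",
    "In economics, inflation is a general rise in prices...",
    ["Current inflation remains moderate, CPI rose 0.4% YoY...",
     "Export growth accelerated in Q3..."],
)
```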
Strategy 2: Vector Similarity

- Model: `sentence-transformers/paraphrase-multilingual-mpnet-base-v2`
- Method:
  1. Encode the term definition into a 768-dim vector
  2. Encode each candidate paragraph/article into vectors
  3. Calculate cosine similarity
  4. Accept matches above a threshold (e.g., >0.65)
- Advantages: Fast, free, works offline, multilingual support
- Limitations: May miss conceptual matches if wording differs significantly
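Concretely, the four steps above reduce to a few lines with the `sentence-transformers` package (a self-contained sketch using the documented model and the 0.65 threshold; the example texts come from the prompt example above):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

definition = "In economics, inflation is a general rise in the price level..."
candidates = [
    "Current inflation remains moderate, CPI rose 0.4% YoY...",
    "Export growth accelerated in Q3...",
]

# Steps 1-2: encode the definition and candidates (768-dim embeddings)
def_vec = model.encode(definition, convert_to_tensor=True)
cand_vecs = model.encode(candidates, convert_to_tensor=True)

# Step 3: cosine similarity between the definition and every candidate
scores = util.cos_sim(def_vec, cand_vecs)[0]

# Step 4: keep candidates above the threshold
matches = [(t, float(s)) for t, s in zip(candidates, scores) if float(s) > 0.65]
print(matches)
```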
Strategy 3: Rule-Based Keyword Matching

- Method:
  1. Extract keywords from the term (plus synonyms from Layer 1's `related_terms`)
  2. Calculate TF-IDF scores in candidate texts
  3. Fuzzy matching for inflected forms (e.g., "inflate" → "inflation")
- Advantages: Explainable, deterministic, no API dependencies
- Limitations: Purely lexical, misses semantic equivalents
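A rough sketch of this strategy with scikit-learn's TF-IDF plus a stand-in fuzzy matcher (difflib here; the keyword list and the fuzzy-hit bonus are illustrative choices, not the project's exact scoring):

```python
from difflib import SequenceMatcher

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

keywords = ["inflation", "price level", "cpi"]   # term + Layer 1 related_terms
candidates = [
    "Current inflation remains moderate, CPI rose 0.4% YoY...",
    "Export growth accelerated in Q3...",
]

# TF-IDF similarity between the keyword "query" and each candidate text
vectorizer = TfidfVectorizer().fit(candidates + [" ".join(keywords)])
query_vec = vectorizer.transform([" ".join(keywords)])
scores = cosine_similarity(query_vec, vectorizer.transform(candidates))[0]

def fuzzy_hit(keyword: str, text: str, threshold: float = 0.8) -> bool:
    """Crude check for inflected forms, e.g. 'inflate' vs 'inflation'."""
    return any(SequenceMatcher(None, keyword, tok).ratio() >= threshold
               for tok in text.lower().split())

for text, score in zip(candidates, scores):
    bonus = 0.1 if any(fuzzy_hit(k, text) for k in keywords) else 0.0
    print(round(score + bonus, 2), text[:45])
```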
Strategy 4: Hybrid Ensemble

- Method: Weighted vote of the above three methods
- Formula: `Final_Score = 0.50×LLM + 0.30×Vector + 0.15×Rule + 0.05×Ensemble_Bonus`
- Ensemble Bonus: +0.05 if all three methods agree (a high-confidence indicator)

Filtering Threshold: Only matches with `Final_Score ≥ 0.65` are included in the Knowledge Cell.
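Putting the formula into code, a minimal sketch of the ensemble scorer (the agreement check reuses the 0.65 threshold as an assumption; the real `hybrid_aligner.py` may define agreement differently):

```python
MIN_FINAL_SCORE = 0.65  # global filtering threshold from the config

def hybrid_score(llm: float, vector: float, rule: float) -> float:
    """Weighted ensemble of the three aligner scores, each in [0, 1]."""
    score = 0.50 * llm + 0.30 * vector + 0.15 * rule
    # Ensemble bonus: +0.05 when all three methods independently agree
    if min(llm, vector, rule) >= MIN_FINAL_SCORE:
        score += 0.05
    return round(score, 2)

final = hybrid_score(llm=0.92, vector=0.78, rule=0.85)   # -> 0.87
print("keep" if final >= MIN_FINAL_SCORE else "discard", final)
```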
Each term produces one Knowledge Cell, which is the atomic unit of the aligned dataset:
```json
{
  "concept_id": "Q17127698",          // Wikidata QID (or TERM_<id> if unavailable)
  "primary_term": "Inflation",        // English canonical term
  "definitions": {                    // Layer 1: Multilingual definitions
    "en": {
      "term": "Inflation",
      "summary": "In economics, inflation is a general rise in the price level...",
      "url": "https://en.wikipedia.org/wiki/Inflation",
      "source": "Wikipedia"
    },
    "zh": {
      "term": "通货膨胀",
      "summary": "通货膨胀是指一般物价水平在一定时期内持续上涨...",
      "url": "https://zh.wikipedia.org/wiki/通货膨胀",
      "source": "Wikipedia"
    },
    "ja": {...},
    "ko": {...}
    // All languages supported by Layer 1
  },
  "policy_evidence": [                // Layer 2: Aligned policy paragraphs
    {
      "source": "pboc",
      "paragraph_id": 42,
      "text": "当前通胀保持温和，CPI同比上涨0.4%，核心CPI上涨0.3%...",
      "topic": "price_stability",
      "alignment_scores": {
        "llm": 0.92,
        "vector": 0.78,
        "rule": 0.85,
        "final": 0.88
      },
      "alignment_method": "hybrid_ensemble",
      "report_metadata": {
        "title": "2024年第三季度中国货币政策执行报告",
        "date": "2024-11-08",
        "section": "Part II: Monetary Policy Operations"
      }
    },
    {
      "source": "fed",
      "paragraph_id": 156,
      "text": "Prices continued to rise modestly across most districts. Retail prices increased...",
      "topic": "inflation",
      "alignment_scores": {...},
      "report_metadata": {...}
    }
  ],
  "sentiment_evidence": [             // Layer 3: Aligned news articles
    {
      "article_id": 1523,
      "title": "Fed signals slower pace of rate cuts amid sticky inflation",
      "source": "Bloomberg",
      "url": "https://www.bloomberg.com/...",
      "published_date": "2024-12-13",
      "sentiment": {
        "label": "bearish",
        "confidence": 0.82,
        "annotator": "gemini-1.5-flash"
      },
      "alignment_scores": {
        "llm": 0.95,
        "vector": 0.89,
        "rule": 0.72,
        "final": 0.91
      }
    },
    {...}
  ],
  "metadata": {
    "created_at": "2025-01-15T10:23:45Z",
    "alignment_engine_version": "4.0.0",
    "quality_metrics": {
      "overall_score": 0.87,            // Weighted avg of all alignment scores
      "language_coverage": 8,           // Number of languages with definitions
      "policy_evidence_count": 12,      // PBOC + Fed paragraphs aligned
      "sentiment_evidence_count": 25,   // News articles aligned (last 90 days)
      "avg_policy_score": 0.84,
      "avg_sentiment_score": 0.89
    }
  }
}
```

File: `layer4_alignment/config/alignment_config.yaml`

```yaml
# Alignment Strategy Settings
alignment_strategies:
  llm_semantic:
    enabled: true
    provider: "gemini"            # or "openai", "deepseek"
    model: "gemini-1.5-pro"
    api_key_env: "GEMINI_API_KEY"
    temperature: 0.1
    max_tokens: 500
    batch_size: 10                # Process 10 candidates per LLM call
    threshold: 0.70
    weight: 0.50

  vector_similarity:
    enabled: true
    model: "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
    device: "cuda"                # or "cpu"
    threshold: 0.65
    weight: 0.30

  keyword_matching:
    enabled: true
    use_fuzzy: true
    fuzzy_threshold: 0.85
    tfidf_top_k: 20
    threshold: 0.60
    weight: 0.15

# Global Settings
global:
  min_final_score: 0.65           # Discard alignments below this
  max_policy_evidence: 15         # Top N policy paragraphs per term
  max_sentiment_evidence: 30      # Top N news articles per term
  sentiment_time_window_days: 90  # Only recent news

# Language Support (inherits from Layer 1)
languages:
  priority: ["en", "zh", "ja", "ko", "fr", "de", "es", "ru"]
  fallback_language: "en"

# Output Settings
output:
  format: "jsonl"                 # or "json", "csv"
  output_dir: "dataset"
  filename_template: "aligned_corpus_v{version}_{date}.jsonl"
  include_metadata: true
  compress: false                 # Set true to generate .jsonl.gz

# Quality Reporting
quality_report:
  enabled: true
  output_path: "dataset/quality_report.md"
  visualizations: true            # Generate charts if matplotlib is available
```

Full Alignment Run (one-time or periodic):

```bash
cd layer4_alignment
python scripts/run_full_alignment.py --config config/alignment_config.yaml
```

Console Output Example:
```
========================================
Layer 4 Alignment Engine v4.0.0
========================================
[INFO] Loading configuration from alignment_config.yaml
[INFO] Initializing aligners: LLM (Gemini) + Vector (SBERT) + Rule
[INFO] Loading Layer 1 terms from corpus.db... Found 287 terms
[INFO] Loading Layer 2 policy corpus... 1,234 paragraphs (PBOC: 623, Fed: 611)
[INFO] Loading Layer 3 news articles... 4,567 articles (last 90 days)

[1/287] Aligning term: "Inflation" (8 languages)
  ├─ Layer 2: Found 45 candidate paragraphs
  │    ├─ LLM filtering: 12 relevant (scores 0.70-0.95)
  │    ├─ Vector filtering: 18 relevant (scores 0.65-0.88)
  │    └─ Ensemble: 14 final matches (avg score 0.84)
  ├─ Layer 3: Found 128 candidate articles
  │    └─ Ensemble: 27 final matches (avg score 0.87)
  └─ Knowledge Cell quality: 0.86 ✓

[2/287] Aligning term: "GDP" (7 languages)
...

[287/287] Aligning term: "Quantitative Easing" (5 languages)
  └─ Knowledge Cell quality: 0.79 ✓

========================================
Alignment Complete!
========================================
Output: dataset/aligned_corpus_v1_2025-01-15.jsonl
Total Knowledge Cells: 287
Avg Quality Score: 0.82
Time Elapsed: 2h 34m

Generating quality report... ✓
Report saved to: dataset/quality_report.md
```
Incremental Update (for newly added terms):
```bash
python scripts/incremental_update.py --since "2025-01-15"
```

JSONL output:

- File: `aligned_corpus_v1_2025-01-15.jsonl`
- Structure: One Knowledge Cell per line (newline-delimited JSON)
- Size: ~500 KB per 100 terms (uncompressed)
- Use Case: LLM fine-tuning, batch processing, streaming ingestion

CSV output:

- File: `aligned_corpus_v1_2025-01-15.csv`
- Columns: `concept_id, term_en, term_zh, term_ja, ..., policy_count, sentiment_count, quality_score, top_policy_source, top_sentiment_label`
- Use Case: Excel analysis, Pandas dataframes, visualization

Quality report:

- File: `quality_report.md`
- Contents:
  - Overall statistics (total cells, avg scores, language distribution)
  - Top 10 highest-quality cells
  - Bottom 10 cells requiring manual review
  - Alignment method performance comparison
  - Visualizations (if enabled): bar charts, heatmaps
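Once exported, the JSONL file can be inspected in a few lines (a sketch assuming the example filename above and the Knowledge Cell fields shown earlier):

```python
import json

import pandas as pd

cells = []
with open("dataset/aligned_corpus_v1_2025-01-15.jsonl", encoding="utf-8") as f:
    for line in f:
        cells.append(json.loads(line))

# Flatten a few quality fields for a quick ranking of Knowledge Cells
df = pd.DataFrame(
    {
        "concept_id": c["concept_id"],
        "term": c["primary_term"],
        "languages": c["metadata"]["quality_metrics"]["language_coverage"],
        "quality": c["metadata"]["quality_metrics"]["overall_score"],
    }
    for c in cells
)
print(df.sort_values("quality", ascending=False).head(10))
```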
| Metric | Target | Description |
|---|---|---|
| Coverage Rate | ≥ 80% | % of Layer 1 terms with aligned Layer 2+3 data |
| Avg Alignment Score | ≥ 0.75 | Mean of all final_score values |
| Language Completeness | ≥ 5 langs/term | Average number of languages with definitions per cell |
| Policy Evidence Density | ≥ 3 paragraphs/term | Avg aligned policy paragraphs per cell |
| Sentiment Evidence Density | ≥ 10 articles/term | Avg aligned news articles per cell |
| Processing Speed | ≤ 30 s/term | Time to align one term (all layers) |
Inputs from Previous Phases:

- Layer 1 → Provides canonical terms + multilingual definitions + Wikidata QIDs
- Layer 2 → Provides policy paragraphs tagged with topics
- Layer 3 → Provides sentiment-annotated news + trend data

Outputs for Next Phase:

- Phase 5 → Publication-ready datasets for competition submission
- Frontend → (Optional) Pre-computed aligned data for fast UI loading
- External Users → High-quality training data for domain-specific LLMs
Dependencies:
```text
# Core
pydantic>=2.5.0
pyyaml>=6.0
aiosqlite>=0.19.0

# Alignment Methods
google-generativeai>=0.3.0    # Gemini API
openai>=1.6.0                 # GPT-4 API (optional)
sentence-transformers>=2.2.0  # Vector embeddings
scikit-learn>=1.3.0           # TF-IDF, cosine similarity

# Utilities
requests>=2.31.0              # Wikidata API
tqdm>=4.66.0                  # Progress bars
pandas>=2.0.0                 # Data export
```

Hardware Recommendations:
- CPU: 4+ cores (for parallel processing)
- RAM: 8GB+ (for embedding model caching)
- GPU: Optional but recommended for vector embeddings (CUDA-compatible)
- Storage: 2GB for models + 500MB for output datasets
- Core Engine
  - `AlignmentEngine` class with multi-strategy support
  - `KnowledgeCell` Pydantic model with full schema
  - Database loaders for Layers 1/2/3
  - Wikidata QID fetcher and cacher
- Alignment Strategies
  - LLM aligner (Gemini + fallback to GPT-4)
  - Vector aligner (Sentence-BERT)
  - Rule-based aligner (keyword + TF-IDF)
  - Hybrid ensemble aggregator
- Export System
  - JSONL exporter with compression support
  - CSV exporter with multilingual handling
  - Quality report generator (Markdown + charts)
- Scripts & Tools
  - Full alignment runner (`run_full_alignment.py`)
  - Incremental updater (`incremental_update.py`)
  - Output validator (`validate_output.py`)
  - Configuration validator
- Documentation
  - README.md with usage examples
  - Configuration guide (YAML options explained)
  - Alignment strategy comparison table
  - Troubleshooting guide
- Testing & Validation
  - Unit tests for each aligner
  - Integration test with sample data
  - Performance benchmarks
  - Output schema validation
- Documentation
  - Technical architecture (`docs/architecture.md`)
  - API documentation (`docs/api.md`)
  - Full technical solution document (30-50 pages)
  - Dataset description document
- Demo Preparation
  - Online demo deployment (Vercel + Railway)
  - Demo video production (5-10 min)
  - PPT presentation materials
- Data Scale Targets
  - 500+ economic terms × 20 languages
  - 10+ policy report alignments
  - 5,000+ news sentiment annotations
Last Updated: 2024-12-16 23:00
| Component | Status | Files |
|---|---|---|
| Layer 1 Backend | ✅ Complete | backend/main.py, database.py, etc. |
| Layer 1 Frontend | ✅ Complete | frontend/src/ (6 components) |
| Layer 2 Models | ✅ Complete | layer2_policy/backend/models.py |
| Layer 2 PDF Parser | ✅ Complete | layer2_policy/backend/pdf_parser.py |
| Layer 2 Alignment | ✅ Complete | layer2_policy/backend/alignment.py |
| Layer 2 Database | ✅ Complete | layer2_policy/backend/database.py |
| Layer 2 API | ✅ Complete | layer2_policy/backend/api.py |
| Layer 2 Frontend | ✅ Complete | frontend/src/components/PolicyCompare.vue |
| Layer 3 Models | ✅ Complete | layer3_sentiment/backend/models.py |
| Layer 3 Database | ✅ Complete | layer3_sentiment/backend/database.py |
| Layer 3 Crawler | ✅ Complete | layer3_sentiment/crawler/news_crawler.py |
| Layer 3 Annotator | ✅ Complete | layer3_sentiment/annotation/llm_annotator.py |
| Layer 3 Doccano | ✅ Complete | layer3_sentiment/annotation/doccano_export.py |
| Layer 3 Trends | ✅ Complete | layer3_sentiment/analysis/trend_analysis.py |
| Layer 3 API | ✅ Complete | layer3_sentiment/backend/api.py |
| Layer 3 Frontend | ✅ Complete | frontend/src/components/SentimentAnalysis.vue |
| Export Scripts | 🚧 Framework | scripts/export_dataset.py |
| Documentation | ✅ Complete | docs/architecture.md, docs/api.md |
🚀 Layer 4: Cross-Lingual Augmentation & LLM Training Export

✅ Fully Localized LLM Exports (8 languages: EN, ZH, JA, KO, DE, FR, ES, RU)

- All LLM training formats (Alpaca, ShareGPT, OpenAI, Dolly, Text) now use localized templates
- Questions, instructions, and system prompts are dynamically translated per language
- Language-source filtering: ZH exports → PBOC data only, EN exports → Fed data only

✅ Cross-Lingual Augmentation Panel (pinned at the top of the Layer 4 dashboard)

- 3 translation modes:

| Mode | Description | Requirements |
|---|---|---|
| 🚫 No Translation | Export native data only | None |
| 🖥️ Local (Argos) | Offline neural MT | `pip install argostranslate` |
| 🌐 API | LLM translation (high quality) | OpenAI/Gemini API key |

- Configure the API provider (OpenAI/Gemini), model, and augmentation ratio
- View Fed/PBOC record counts and the latest output files

✅ Per-Cell Translation Export

- Each Knowledge Cell can be exported with translation mode selection
- Supports real-time LLM translation via the OpenAI/Gemini API (`httpx` async calls)
- Local translation using argostranslate (free, offline; see the sketch below)
✅ New Backend Endpoints:

```
POST /api/v1/alignment/cell/{id}/export/local-translate   # Argos offline translation
POST /api/v1/alignment/cell/{id}/export/cross-lingual     # LLM API translation
POST /api/v1/alignment/augmentation/run                   # Batch augmentation
GET  /api/v1/alignment/augmentation/status                # Check status
```

✅ Batch Cross-Lingual Augmentation Script (`layer4_alignment/scripts/cross_lingual_augmentor.py`)

- Async OpenAI/Gemini API calls with retry logic
- 70/30 mixing ratio (native + augmented data)
- ShareGPT output format with term metadata

📦 New Dependencies:

```bash
pip install argostranslate   # Local offline translation
pip install httpx            # Async HTTP for LLM APIs
```

🔧 Technical Debt Remediation Complete:
- ✅ Created a `shared/` module with centralized utilities
- ✅ Centralized database schemas (11 tables in `shared/schema.py`)
- ✅ Standardized error handling (`shared/errors.py`)
- ✅ Replaced all hardcoded API URLs with environment-aware configuration
- ✅ Added type hints to core functions
- ✅ Environment-aware CORS configuration
- ✅ Centralized configuration constants (`shared/config.py`)
| Technology | Purpose |
|---|---|
| FastAPI | Web framework |
| SQLite + aiosqlite | Async database |
| Wikipedia-API | Term crawling |
| zhconv | Chinese conversion |
| Marker | PDF parsing (Layer 2) |
| Sentence-BERT | Semantic alignment (Layer 2) |
| Gemini API | Sentiment annotation (Layer 3) |
| Technology | Purpose |
|---|---|
| Vue 3 + Vite | Frontend framework |
| TailwindCSS | UI styling |
| D3.js | Knowledge graph visualization |
| ECharts | Trend charts (Layer 3) |
| Axios | HTTP client |
| Format | Purpose |
|---|---|
| JSONL | Primary data format (ML friendly) |
| TMX | Translation Memory (CAT tools) |
| CSV/TSV | General tables (Excel) |
Layer 1 (terminology) sample record:

```json
{
  "id": 1,
  "term": "Inflation",
  "definitions": {
    "en": {"summary": "In economics, inflation is...", "url": "https://..."},
    "zh": {"summary": "通货膨胀是指...", "url": "https://..."},
    "ja": {"summary": "インフレーションとは...", "url": "https://..."}
  },
  "related_terms": ["Deflation", "CPI", "Monetary_Policy"],
  "categories": ["Macroeconomics"]
}
```

Layer 2 (policy alignment) sample record:

```json
{
  "term": "Inflation",
  "pboc": {
    "source": "2024Q3 Monetary Policy Report",
    "text": "Current inflation remains moderate, CPI rose 0.4% YoY..."
  },
  "fed": {
    "source": "2024 December Beige Book",
    "text": "Prices continued to rise modestly across most districts..."
  },
  "similarity": 0.85
}
```

Layer 3 (news sentiment) sample record:

```json
{
  "id": 1,
  "title": "Fed signals slower pace of rate cuts amid sticky inflation",
  "source": "Bloomberg",
  "date": "2024-12-13",
  "related_terms": ["Inflation", "Interest_Rate"],
  "sentiment": {"label": "bearish", "score": 0.82},
  "market_context": {"sp500_change": -0.54}
}
```

- Breaking the single-dimension limitation of traditional corpora
- Full-chain tracking from "term definition → policy application → market reaction"
- Marker solves PDF table/formula parsing challenges
- LLM pre-annotation + human verification, 10x efficiency improvement
- Term frequency overlaid with market index analysis
- Corpus with economic forecasting potential
- Researchers: Policy comparison + trend analysis
- Translators: TMX translation memory
- Analysts: Sentiment monitoring dashboard
- ✅ Only collects public data (government reports, Wikipedia)
- ✅ News storage is limited to summaries/headlines plus original links
- ✅ Compliant with the Wikipedia API User-Agent policy
- ✅ Non-commercial academic research project
We welcome contributors with the following backgrounds:
- Economics/Trade: Term selection, policy interpretation
- Languages/Translation: Doccano annotation verification
- Computer Science: Algorithm optimization, visualization
- Fork this repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see LICENSE for details.
- PDF Parsing: Marker
- Annotation Platform: Doccano
- Semantic Model: Sentence-BERT
- Base Project: TermCorpusGenerator
⭐ If this project helps you, please give us a Star!