EconMind Matrix

A Multi-Granularity Bilingual Corpus System for Economic Analysis
An intelligent corpus platform integrating terminology, policy documents, and market sentiment



🎯 Project Overview

EconMind Matrix is a multilingual corpus platform for economics that integrates three dimensions of data:

| Layer | Name | Content | Data Source |
|-------|------|---------|-------------|
| Layer 1 | Terminology Knowledge Base | Economic term definitions in 20+ languages + knowledge graph | Wikipedia |
| Layer 2 | Policy Parallel Corpus | Central bank report alignment (PBOC vs Fed) | Official reports |
| Layer 3 | Sentiment & Trend Corpus | Financial news + sentiment labels + time series | News media |

┌─────────────────────────────────────────────────────────────┐
│  Layer 3: Sentiment & Trend Corpus                          │
│  📰 Financial News + Sentiment Labels + Term Trend Charts   │
└────────────────────────┬────────────────────────────────────┘
                         │ Time Series Correlation
┌────────────────────────▼────────────────────────────────────┐
│  Layer 2: Policy & Comparable Corpus                        │
│  📊 Central Bank Report Alignment (PBOC vs Fed)             │
└────────────────────────┬────────────────────────────────────┘
                         │ Term Linking
┌────────────────────────▼────────────────────────────────────┐
│  Layer 1: Terminology Knowledge Base                        │
│  📚 20+ Language Definitions + Knowledge Graph              │
└─────────────────────────────────────────────────────────────┘

✨ Core Features

๐Ÿ” Three-Layer Integrated Search

Search for "Inflation" and get:

  • Terminology Layer: 20+ language professional definitions + related concept knowledge graph
  • Policy Layer: PBOC vs Federal Reserve related paragraph comparison
  • Sentiment Layer: Last 30 days news headlines + sentiment trend chart
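
The merged response described above can be sketched as a thin aggregation over the three per-layer lookups. The lookup callables here are illustrative stand-ins for the real Layer 1-3 services, not the project's actual API:

```python
from typing import Callable, Dict, List


def integrated_search(
    term: str,
    lookup_definitions: Callable[[str], Dict[str, dict]],   # Layer 1
    lookup_policy: Callable[[str], List[dict]],             # Layer 2
    lookup_sentiment: Callable[[str], List[dict]],          # Layer 3
) -> dict:
    """Fan a single query out to all three layers and merge the results."""
    return {
        "term": term,
        "definitions": lookup_definitions(term),
        "policy_matches": lookup_policy(term),
        "recent_news": lookup_sentiment(term),
    }


# Stub lookups standing in for the real database-backed services
result = integrated_search(
    "Inflation",
    lambda t: {"en": {"summary": "A general rise in prices."}},
    lambda t: [{"source": "pboc", "text": "CPI rose 0.4% YoY"}],
    lambda t: [{"title": "Sticky inflation", "sentiment": "bearish"}],
)
```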

📊 Intelligent Data Processing

  • Multilingual Support: Covers 20+ languages including English, Chinese, Japanese, Korean, French, German, Russian
  • Chinese Conversion: Automatic Traditional to Simplified Chinese conversion
  • Knowledge Graph: D3.js visualization of term relationship networks

🤖 AI-Driven Annotation

  • LLM Pre-annotation: Using Gemini/GPT for sentiment analysis and entity extraction
  • Human-in-the-Loop: Doccano platform for expert verification
  • Quality Control: Hybrid annotation accuracy > 90%

💾 Professional Export Formats

  • JSONL: Machine learning training format
  • TMX: Translation Memory (CAT tool compatible)
  • CSV/TSV: Excel/Pandas friendly
  • TXT: Human readable format
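
As an illustration of the TMX target format, a minimal exporter fits in the standard library. Element names follow TMX 1.4; the header attribute values are placeholder metadata, not the project's actual exporter:

```python
import xml.etree.ElementTree as ET


def to_tmx(pairs, src_lang="en", tgt_lang="zh"):
    """Serialize (source, target) segment pairs as a minimal TMX 1.4 document."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "srclang": src_lang, "datatype": "plaintext", "segtype": "sentence",
        "adminlang": "en", "o-tmf": "EconMind",
        "creationtool": "EconMind", "creationtoolversion": "1.0",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


xml_str = to_tmx([("Inflation is a general rise in prices.", "通货膨胀是物价普遍上涨。")])
```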

๐Ÿ“ Project Structure

EconMind-Matrix/
│
├── 📂 backend/                   # Layer 1: Terminology Backend (Complete)
│   ├── main.py                   # FastAPI server
│   ├── database.py               # Database operations
│   ├── models.py                 # Data models
│   ├── .env.example              # Environment configuration template
│   └── output/                   # Crawl results (Markdown)
│
├── 📂 frontend/                  # Layer 1: Vue.js Frontend (Complete)
│   ├── src/
│   │   ├── App.vue               # Main component
│   │   ├── components/           # UI components
│   │   └── services/api.js       # Centralized API service
│   ├── .env.development          # Dev environment config
│   ├── .env.production           # Prod environment config
│   └── package.json
│
├── 📂 shared/                    # Shared Utilities (NEW)
│   ├── __init__.py               # Package exports
│   ├── utils.py                  # Text utilities (clean_text)
│   ├── schema.py                 # Centralized DB schemas (11 tables)
│   ├── errors.py                 # Standardized error classes
│   ├── config.py                 # Configuration constants
│   └── README.md                 # Module documentation
│
├── 📂 layer2_policy/             # Layer 2: Policy Module (Complete)
│   ├── backend/
│   │   ├── api.py                # Policy API endpoints
│   │   ├── pdf_parser.py         # Marker PDF parsing
│   │   ├── alignment.py          # Sentence-BERT paragraph alignment
│   │   └── models.py             # Policy data models
│   └── data/
│       ├── pboc/                 # PBOC reports
│       └── fed/                  # Federal Reserve reports
│
├── 📂 layer3_sentiment/          # Layer 3: Sentiment Module (Complete)
│   ├── backend/
│   │   ├── api.py                # FastAPI sentiment endpoints
│   │   ├── database.py           # Sentiment database operations
│   │   └── models.py             # News & sentiment data models
│   ├── crawler/                  # News crawler
│   │   └── news_crawler.py       # RSS feed crawler (Bloomberg, Reuters, etc.)
│   ├── annotation/               # LLM annotation + Doccano integration
│   │   ├── llm_annotator.py      # Gemini API sentiment analysis
│   │   └── doccano_export.py     # Doccano import/export scripts
│   └── analysis/                 # Trend analysis
│       └── trend_analysis.py     # Time series analysis module
│
├── 📂 dataset/                   # Dataset export directory
│   ├── terminology.jsonl         # Layer 1 data
│   ├── policy_alignment.jsonl    # Layer 2 data
│   └── news_sentiment.jsonl      # Layer 3 data
│
├── 📂 scripts/                   # Automation scripts
│   ├── export_dataset.py         # Dataset export
│   └── crawl_all.py              # Batch crawling
│
├── 📂 docs/                      # Project documentation
│   ├── proposal.md               # Project proposal
│   ├── architecture.md           # Technical architecture
│   └── api.md                    # API documentation
│
├── pyproject.toml                # Python package configuration
├── README.md                     # This file
├── SETUP.md                      # Installation guide
└── LICENSE                       # MIT License

🚀 Quick Start

Requirements

  • Python 3.9+
  • Node.js 16+
  • Git

Installation

# 1. Clone repository
git clone https://github.com/[your-username]/EconMind-Matrix.git
cd EconMind-Matrix

# 2. Install backend dependencies
cd backend
pip install -r requirements.txt

# 3. Install frontend dependencies
cd ../frontend
npm install

# 4. Start backend server
cd ../backend
python main.py  # Runs on http://localhost:8000

# 5. Start frontend dev server
cd ../frontend
npm run dev  # Runs on http://localhost:5173

โš ๏ธ Important Configuration

Visit the Manage page in the web interface to configure your User-Agent (required by Wikipedia API).
See SETUP.md for details.


📈 Development Roadmap

✅ Phase 1: Terminology Knowledge Base (Complete)

Based on TermCorpusGenerator project

  • Wikipedia multilingual term crawling
  • 20+ language support (including Traditional/Simplified Chinese conversion)
  • Batch import and automated crawling
  • Intelligent association crawling (See Also, link analysis)
  • D3.js knowledge graph visualization
  • Multi-format export (JSON, JSONL, CSV, TSV, TMX, TXT)
  • Data quality analysis and cleaning tools
  • Database backup/restore functionality

🔄 Phase 2: Policy Parallel Corpus (Feature Complete & Tested)

Target: Mid December 2025

✅ Code Implementation Completed (2024-12-14):

  • Data Models (layer2_policy/backend/models.py)

    • PolicyReport, PolicyParagraph, PolicyAlignment dataclasses
    • Database schema for Layer 2 tables
    • 8 policy topics with bilingual keywords (inflation, employment, etc.)
    • Topic detection via keyword matching
  • PDF Parsing Module (layer2_policy/backend/pdf_parser.py)

    • Marker integration for AI-powered PDF→Markdown conversion
    • PyPDF2 fallback for basic text extraction
    • Automatic title and date extraction
    • Paragraph splitting with topic detection
    • Section-aware parsing for PBOC and Fed reports
  • Paragraph Alignment Module (layer2_policy/backend/alignment.py)

    • Sentence-BERT semantic similarity (multilingual)
    • Topic-based alignment fallback
    • Keyword overlap fallback
    • Embedding caching for performance
    • Alignment History tracking
    • Custom Topic Pool (User defined topics)
  • Database Operations (layer2_policy/backend/database.py)

    • Async CRUD for reports, paragraphs, alignments
    • Statistics endpoint
    • Term search across policy paragraphs
    • Quality score calculation with language breakdown
  • API Endpoints (layer2_policy/backend/api.py)

    • POST /upload - Upload and parse PDF
    • POST /upload-text - Upload text (testing)
    • GET /reports - List reports
    • POST /align - Run alignment
    • GET /alignments - Query alignments
    • GET /topics - List and manage topics
    • GET /stats - Layer 2 statistics
    • GET /export/* - Export Alignments (JSONL), Reports (JSONL), Parallel Corpus (TSV)

✅ Completed Testing & Environment:

  • Install dependencies: torch, sentence-transformers (Successfully installed)
  • Test PDF parsing with Marker
  • Test alignment with Sentence-BERT (High quality semantic matching enabled)
  • Integrate Layer 2 router into main.py
  • Frontend Component: PolicyCompare.vue with Topics, History, and Exports
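
Of the strategies listed above, the keyword-overlap fallback is simple enough to sketch with the standard library; a Jaccard score over word tokens stands in for the real Sentence-BERT similarity. Function names are illustrative, and a production version would operate on translated or shared-vocabulary text rather than raw bilingual paragraphs:

```python
import re


def keyword_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word tokens; a crude fallback for
    when Sentence-BERT is unavailable."""
    ta = set(re.findall(r"[a-z]+", a.lower()))
    tb = set(re.findall(r"[a-z]+", b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def align_paragraphs(pboc, fed, threshold=0.2):
    """Greedy best-match pairing of two paragraph lists by overlap score."""
    pairs = []
    for i, p in enumerate(pboc):
        best_j, best_s = None, threshold
        for j, f in enumerate(fed):
            s = keyword_overlap(p, f)
            if s > best_s:
                best_j, best_s = j, s
        if best_j is not None:
            pairs.append((i, best_j, round(best_s, 3)))
    return pairs


pboc = ["inflation remains moderate and prices stable", "export growth slowed"]
fed = ["prices and inflation stayed moderate", "labor market tightened"]
pairs = align_paragraphs(pboc, fed)
```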

✅ Phase 3: Sentiment & Trend Corpus (Complete)

Completed: December 2025

✅ Full Implementation Completed (2025-12-16):

  • Data Models (layer3_sentiment/backend/models.py)

    • NewsArticle, SentimentAnnotation, MarketContext dataclasses
    • Database schema for Layer 3 tables
    • Economic term variants (EN/ZH) for news filtering
    • Sentiment labels: Bullish, Bearish, Neutral
  • News Crawler (layer3_sentiment/crawler/news_crawler.py)

    • RSS feed integration (Bloomberg, Reuters, WSJ, FT, Xinhua, 21 sources)
    • Async crawling with feedparser
    • Term-based news filtering
    • Automatic term detection from article content
    • User-Agent rotation pool (8 browser UAs)
    • Proxy pool support (http/https/socks5)
    • Concurrency control (1-10 concurrent requests)
    • Custom delay (0.5-10 seconds between requests)
    • Manual start/stop control with verification
  • LLM Sentiment Annotator (layer3_sentiment/annotation/llm_annotator.py)

    • Gemini API integration for sentiment analysis
    • Bilingual prompt templates (EN/ZH)
    • Rule-based fallback annotator (no API required)
    • Hybrid annotator (optimizes API usage)
    • Batch annotation with rate limiting
  • Doccano Integration (layer3_sentiment/annotation/doccano_export.py)

    • JSONL export for Doccano platform
    • CSV export for spreadsheet annotation
    • Import verified annotations back to database
    • Annotation quality checking
  • Trend Analysis (layer3_sentiment/analysis/trend_analysis.py)

    • Daily term frequency calculation
    • Sentiment distribution over time
    • Trend direction detection (increasing/decreasing/stable)
    • Market correlation analysis (optional)
    • ECharts-compatible data generation
  • API Endpoints (layer3_sentiment/backend/api.py)

    • POST /crawl - Crawl news from sources
    • GET /articles - List articles
    • POST /annotate - Run sentiment annotation
    • GET /trend/{term} - Get term trend analysis
    • GET /trends/hot - Get hot terms
    • GET /export/doccano - Export for Doccano
  • Frontend Component (frontend/src/components/SentimentAnalysis.vue)

    • Dashboard with sentiment statistics
    • News crawling interface with advanced options
    • Articles list with sentiment labels (search, filter, group by source)
    • Trend analysis visualization
    • Export options (JSON, JSONL, CSV, Doccano)
    • Running crawler detection on page load
    • Force stop with verification polling
    • Proxy pool configuration UI
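
The rule-based fallback annotator mentioned above (for when no LLM API key is configured) can be approximated in a few lines; the keyword lists below are illustrative, not the module's actual lexicon:

```python
BULLISH = {"rally", "gain", "growth", "surge", "record", "beat", "optimism"}
BEARISH = {"fall", "drop", "recession", "slump", "fear", "sticky", "decline"}


def rule_sentiment(headline: str) -> dict:
    """Keyword-count sentiment fallback requiring no API access."""
    words = set(headline.lower().split())
    bull = len(words & BULLISH)
    bear = len(words & BEARISH)
    if bull > bear:
        label = "bullish"
    elif bear > bull:
        label = "bearish"
    else:
        label = "neutral"
    total = bull + bear
    confidence = (max(bull, bear) / total) if total else 0.5
    return {"label": label, "confidence": round(confidence, 2),
            "annotator": "rule-based"}
```

A hybrid annotator can then reserve LLM calls for headlines where the rule score is ambiguous, which is one way to optimize API usage.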

🎯 Phase 4: Offline Multi-Dimensional Semantic Alignment Pipeline (January-February 2026)

Critical Distinction: Layer 4 is NOT a user interface; it is an offline batch-processing engine that consumes completed data from Layers 1-3 and produces publication-ready aligned datasets.


๐Ÿญ Architectural Role: The "Alignment Factory"

Input → Process → Output Model:

Layer 1 Data (corpus.db)  ──┐
Layer 2 Data (corpus.db)  ──┼──→ Alignment Engine ──→ Unified Dataset File
Layer 3 Data (corpus.db)  ──┘   (Batch Pipeline)       (aligned_corpus.jsonl)

What Layer 4 Does:

  1. Enumerates all successfully crawled terms from Layer 1
  2. Searches Layer 2/3 for content related to each term (across ALL supported languages)
  3. Aligns using multiple strategies (LLM, vectors, rules) to determine semantic relevance
  4. Aggregates aligned evidence into structured "Knowledge Cells"
  5. Exports publication-ready datasets in standardized formats (JSONL, CSV, etc.)
  6. Reports data quality metrics (coverage, alignment scores, language distribution)

What Layer 4 Does NOT Do:

  • โŒ Provide real-time user search interfaces (that's the frontend's job)
  • โŒ Store data in its own database (reads from Layer 1-3 databases)
  • โŒ Crawl or collect raw data (Layers 1-3 handle this)

๐Ÿ—‚๏ธ Module Structure

layer4_alignment/
├── backend/
│   ├── alignment_engine.py        # Core orchestration logic
│   ├── data_loader.py             # Load data from Layer 1-3 databases
│   ├── knowledge_cell.py          # Knowledge Cell data model (Pydantic)
│   ├── aligners/                  # Pluggable alignment strategies
│   │   ├── llm_aligner.py         # Gemini/GPT-4 semantic judgment
│   │   ├── vector_aligner.py      # Sentence-BERT cosine similarity
│   │   ├── rule_aligner.py        # Keyword + TF-IDF matching
│   │   └── hybrid_aligner.py      # Weighted ensemble of above methods
│   ├── exporters/
│   │   ├── jsonl_exporter.py      # JSONL dataset export
│   │   ├── csv_exporter.py        # Spreadsheet-friendly export
│   │   └── quality_reporter.py    # Statistics and quality metrics
│   └── utils/
│       ├── wikidata_client.py     # Fetch Wikidata QIDs for terms
│       └── text_processor.py      # Multilingual text normalization
├── config/
│   ├── alignment_config.yaml      # Alignment strategy settings
│   └── language_support.yaml      # Language priority and mappings
├── scripts/
│   ├── run_full_alignment.py      # Batch process all terms
│   ├── incremental_update.py      # Process newly added terms only
│   └── validate_output.py         # Verify dataset integrity
└── README.md

โš™๏ธ Alignment Strategies (Multi-Method Ensemble)

Layer 4 employs 4 complementary alignment methods to maximize accuracy:

1. LLM Semantic Alignment (Primary, Weight: 50%)
  • Model: Gemini 1.5 Pro / GPT-4 Turbo
  • Method: Present term definition + candidate texts to LLM, ask for relevance scoring (0-1)
  • Prompt Example:
    Term: "Inflation" (Definition: In economics, inflation is a general rise in prices...)
    
    Rate each policy paragraph's relevance to this concept (0-1 scale):
    [0] "Current inflation remains moderate, CPI rose 0.4% YoY..."  → Score: ?
    [1] "Export growth accelerated in Q3..."                        → Score: ?
    
  • Advantages: Understands context, handles paraphrasing, detects conceptual matches
  • Limitations: API costs, rate limits, requires careful prompt engineering
2. Vector Similarity Alignment (Secondary, Weight: 30%)
  • Model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  • Method:
    1. Encode term definition into 768-dim vector
    2. Encode each candidate paragraph/article into vectors
    3. Calculate cosine similarity
    4. Accept matches above threshold (e.g., >0.65)
  • Advantages: Fast, free, works offline, multilingual support
  • Limitations: May miss conceptual matches if wording differs significantly
3. Rule-Based Keyword Matching (Fallback, Weight: 15%)
  • Method:
    1. Extract keywords from term (+ synonyms from Layer 1's related_terms)
    2. Calculate TF-IDF scores in candidate texts
    3. Fuzzy matching for inflected forms (e.g., "inflate" → "inflation")
  • Advantages: Explainable, deterministic, no API dependencies
  • Limitations: Purely lexical, misses semantic equivalents
4. Hybrid Ensemble (Weight: 5% as tie-breaker)
  • Method: Weighted vote of above 3 methods
  • Formula: Final_Score = 0.50×LLM + 0.30×Vector + 0.15×Rule + 0.05×Ensemble_Bonus
  • Ensemble Bonus: +0.05 if all 3 methods agree (high confidence indicator)

Filtering Threshold: Only matches with Final_Score ≥ 0.65 are included in the Knowledge Cell.
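
Under the weights and thresholds above, the ensemble scoring reduces to a few lines. This sketch treats the 0.05 ensemble term as the all-methods-agree bonus described in the bullet; the dict shapes are illustrative:

```python
WEIGHTS = {"llm": 0.50, "vector": 0.30, "rule": 0.15}
THRESHOLDS = {"llm": 0.70, "vector": 0.65, "rule": 0.60}
ENSEMBLE_BONUS = 0.05
MIN_FINAL_SCORE = 0.65


def final_score(scores: dict) -> float:
    """Weighted ensemble of the three aligner scores, plus a small bonus
    when every method independently clears its own threshold."""
    total = sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)
    if all(scores[m] >= THRESHOLDS[m] for m in THRESHOLDS):
        total += ENSEMBLE_BONUS
    return total


def filter_matches(candidates):
    """Keep only candidates whose ensemble score clears the global cutoff."""
    kept = []
    for cand in candidates:
        s = final_score(cand["scores"])
        if s >= MIN_FINAL_SCORE:
            kept.append({**cand, "final": round(s, 4)})
    return kept
```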


๐Ÿ“ Knowledge Cell Data Model

Each term produces one Knowledge Cell, which is the atomic unit of the aligned dataset:

{
  "concept_id": "Q17127698",                    // Wikidata QID (or TERM_<id> if unavailable)
  "primary_term": "Inflation",                  // English canonical term
  
  "definitions": {                              // Layer 1: Multilingual definitions
    "en": {
      "term": "Inflation",
      "summary": "In economics, inflation is a general rise in the price level...",
      "url": "https://en.wikipedia.org/wiki/Inflation",
      "source": "Wikipedia"
    },
    "zh": {
      "term": "通货膨胀",
      "summary": "通货膨胀是指一般物价水平在一定时期内持续上涨...",
      "url": "https://zh.wikipedia.org/wiki/通货膨胀",
      "source": "Wikipedia"
    },
    "ja": {...},
    "ko": {...}
    // All languages supported by Layer 1
  },
  
  "policy_evidence": [                          // Layer 2: Aligned policy paragraphs
    {
      "source": "pboc",
      "paragraph_id": 42,
      "text": "当前通胀保持温和，CPI同比上涨0.4%，核心CPI上涨0.3%...",
      "topic": "price_stability",
      "alignment_scores": {
        "llm": 0.92,
        "vector": 0.78,
        "rule": 0.85,
        "final": 0.88
      },
      "alignment_method": "hybrid_ensemble",
      "report_metadata": {
        "title": "2024年第三季度中国货币政策执行报告",
        "date": "2024-11-08",
        "section": "Part II: Monetary Policy Operations"
      }
    },
    {
      "source": "fed",
      "paragraph_id": 156,
      "text": "Prices continued to rise modestly across most districts. Retail prices increased...",
      "topic": "inflation",
      "alignment_scores": {...},
      "report_metadata": {...}
    }
  ],
  
  "sentiment_evidence": [                       // Layer 3: Aligned news articles
    {
      "article_id": 1523,
      "title": "Fed signals slower pace of rate cuts amid sticky inflation",
      "source": "Bloomberg",
      "url": "https://www.bloomberg.com/...",
      "published_date": "2024-12-13",
      "sentiment": {
        "label": "bearish",
        "confidence": 0.82,
        "annotator": "gemini-1.5-flash"
      },
      "alignment_scores": {
        "llm": 0.95,
        "vector": 0.89,
        "rule": 0.72,
        "final": 0.91
      }
    },
    {...}
  ],
  
  "metadata": {
    "created_at": "2025-01-15T10:23:45Z",
    "alignment_engine_version": "4.0.0",
    "quality_metrics": {
      "overall_score": 0.87,              // Weighted avg of all alignment scores
      "language_coverage": 8,              // Number of languages with definitions
      "policy_evidence_count": 12,         // PBOC + Fed paragraphs aligned
      "sentiment_evidence_count": 25,      // News articles aligned (last 90 days)
      "avg_policy_score": 0.84,
      "avg_sentiment_score": 0.89
    }
  }
}
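
A skeleton of this model with the metric aggregation it implies; a dataclass is used here to keep the sketch dependency-free, whereas the project itself specifies Pydantic:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KnowledgeCell:
    """Atomic unit of the aligned dataset (sketch; the real model is Pydantic)."""
    concept_id: str
    primary_term: str
    definitions: Dict[str, dict] = field(default_factory=dict)
    policy_evidence: List[dict] = field(default_factory=list)
    sentiment_evidence: List[dict] = field(default_factory=list)

    def quality_metrics(self) -> dict:
        """Aggregate per-evidence final scores into cell-level metrics."""
        scores = [e["alignment_scores"]["final"]
                  for e in self.policy_evidence + self.sentiment_evidence]
        return {
            "overall_score": round(sum(scores) / len(scores), 2) if scores else 0.0,
            "language_coverage": len(self.definitions),
            "policy_evidence_count": len(self.policy_evidence),
            "sentiment_evidence_count": len(self.sentiment_evidence),
        }


cell = KnowledgeCell(
    "Q17127698", "Inflation",
    definitions={"en": {}, "zh": {}},
    policy_evidence=[{"alignment_scores": {"final": 0.88}}],
    sentiment_evidence=[{"alignment_scores": {"final": 0.91}},
                        {"alignment_scores": {"final": 0.81}}],
)
```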

🔧 Configuration System

File: layer4_alignment/config/alignment_config.yaml

# Alignment Strategy Settings
alignment_strategies:
  llm_semantic:
    enabled: true
    provider: "gemini"              # or "openai", "deepseek"
    model: "gemini-1.5-pro"
    api_key_env: "GEMINI_API_KEY"
    temperature: 0.1
    max_tokens: 500
    batch_size: 10                  # Process 10 candidates per LLM call
    threshold: 0.70
    weight: 0.50
    
  vector_similarity:
    enabled: true
    model: "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
    device: "cuda"                  # or "cpu"
    threshold: 0.65
    weight: 0.30
    
  keyword_matching:
    enabled: true
    use_fuzzy: true
    fuzzy_threshold: 0.85
    tfidf_top_k: 20
    threshold: 0.60
    weight: 0.15

# Global Settings
global:
  min_final_score: 0.65             # Discard alignments below this
  max_policy_evidence: 15           # Top N policy paragraphs per term
  max_sentiment_evidence: 30        # Top N news articles per term
  sentiment_time_window_days: 90    # Only recent news
  
# Language Support (inherits from Layer 1)
languages:
  priority: ["en", "zh", "ja", "ko", "fr", "de", "es", "ru"]
  fallback_language: "en"

# Output Settings
output:
  format: "jsonl"                   # or "json", "csv"
  output_dir: "dataset"
  filename_template: "aligned_corpus_v{version}_{date}.jsonl"
  include_metadata: true
  compress: false                   # Set true to generate .jsonl.gz

# Quality Reporting
quality_report:
  enabled: true
  output_path: "dataset/quality_report.md"
  visualizations: true              # Generate charts if matplotlib available
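
In practice the file above would be loaded with yaml.safe_load; a plain dict mirroring it is enough to sketch the kind of sanity check the "configuration validator" deliverable implies (a hypothetical check, not the project's actual code):

```python
def validate_strategies(strategies: dict) -> dict:
    """Check that enabled strategy weights and thresholds are sane."""
    weights = {name: s["weight"] for name, s in strategies.items() if s.get("enabled")}
    total = sum(weights.values())
    # The three per-method weights plus the 0.05 ensemble bonus sum to 1.0,
    # so the per-method total itself should land in (0, 1].
    if not 0.0 < total <= 1.0:
        raise ValueError(f"strategy weights sum to {total}, expected (0, 1]")
    for name, s in strategies.items():
        if not 0.0 <= s["threshold"] <= 1.0:
            raise ValueError(f"{name}: threshold {s['threshold']} outside [0, 1]")
    return weights


# Dict mirror of alignment_config.yaml's alignment_strategies section
strategies = {
    "llm_semantic":      {"enabled": True, "threshold": 0.70, "weight": 0.50},
    "vector_similarity": {"enabled": True, "threshold": 0.65, "weight": 0.30},
    "keyword_matching":  {"enabled": True, "threshold": 0.60, "weight": 0.15},
}
```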

🚀 Execution Workflow

Full Alignment Run (one-time or periodic):

cd layer4_alignment
python scripts/run_full_alignment.py --config config/alignment_config.yaml

Console Output Example:

========================================
Layer 4 Alignment Engine v4.0.0
========================================
[INFO] Loading configuration from alignment_config.yaml
[INFO] Initializing aligners: LLM (Gemini) + Vector (SBERT) + Rule
[INFO] Loading Layer 1 terms from corpus.db... Found 287 terms
[INFO] Loading Layer 2 policy corpus... 1,234 paragraphs (PBOC: 623, Fed: 611)
[INFO] Loading Layer 3 news articles... 4,567 articles (last 90 days)

[1/287] Aligning term: "Inflation" (8 languages)
  ├─ Layer 2: Found 45 candidate paragraphs
  │   ├─ LLM filtering: 12 relevant (scores 0.70-0.95)
  │   ├─ Vector filtering: 18 relevant (scores 0.65-0.88)
  │   └─ Ensemble: 14 final matches (avg score 0.84)
  ├─ Layer 3: Found 128 candidate articles
  │   └─ Ensemble: 27 final matches (avg score 0.87)
  └─ Knowledge Cell quality: 0.86 ✓

[2/287] Aligning term: "GDP" (7 languages)
  ...

[287/287] Aligning term: "Quantitative Easing" (5 languages)
  └─ Knowledge Cell quality: 0.79 ✓

========================================
Alignment Complete!
========================================
Output: dataset/aligned_corpus_v1_2025-01-15.jsonl
Total Knowledge Cells: 287
Avg Quality Score: 0.82
Time Elapsed: 2h 34m

Generating quality report... ✓
Report saved to: dataset/quality_report.md

Incremental Update (for newly added terms):

python scripts/incremental_update.py --since "2025-01-15"
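
Conceptually, the incremental updater just filters terms by creation timestamp before re-running alignment; a sketch with hypothetical record shapes:

```python
from datetime import datetime


def terms_since(terms, cutoff_iso):
    """Keep only terms created after the --since cutoff, so an incremental
    run re-aligns new terms instead of the whole corpus."""
    cutoff = datetime.fromisoformat(cutoff_iso)  # "2025-01-15" -> that midnight
    return [t for t in terms
            if datetime.fromisoformat(t["created_at"]) > cutoff]


terms = [
    {"term": "Inflation", "created_at": "2025-01-10T09:00:00"},
    {"term": "Stagflation", "created_at": "2025-01-16T12:30:00"},
]
fresh = terms_since(terms, "2025-01-15")
```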

📊 Output Dataset Formats

Format 1: JSONL (Primary, ML-Ready)
  • File: aligned_corpus_v1_2025-01-15.jsonl
  • Structure: One Knowledge Cell per line (newline-delimited JSON)
  • Size: ~500 KB per 100 terms (uncompressed)
  • Use Case: LLM fine-tuning, batch processing, streaming ingestion
Format 2: CSV (Analysis-Friendly)
  • File: aligned_corpus_v1_2025-01-15.csv
  • Columns:
    concept_id, term_en, term_zh, term_ja, ..., 
    policy_count, sentiment_count, quality_score, 
    top_policy_source, top_sentiment_label
    
  • Use Case: Excel analysis, Pandas dataframes, visualization
Format 3: Quality Report (Markdown)
  • File: quality_report.md
  • Contents:
    • Overall statistics (total cells, avg scores, language distribution)
    • Top 10 highest quality cells
    • Bottom 10 cells requiring manual review
    • Alignment method performance comparison
    • Visualizations (if enabled): bar charts, heatmaps
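
A minimal JSONL writer/reader matching the primary format, including the optional .jsonl.gz compression, needs only the standard library:

```python
import gzip
import json
import tempfile
from pathlib import Path


def export_jsonl(cells, path, compress=False):
    """Write one Knowledge Cell per line; gzip when compress=True (.jsonl.gz)."""
    path = Path(path)
    opener = gzip.open if compress else open
    with opener(path, "wt", encoding="utf-8") as fh:
        for cell in cells:
            fh.write(json.dumps(cell, ensure_ascii=False) + "\n")
    return path


def read_jsonl(path):
    """Read cells back one line at a time, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


cells = [{"concept_id": "Q17127698", "primary_term": "Inflation"},
         {"concept_id": "TERM_42", "primary_term": "GDP"}]
out = export_jsonl(cells, Path(tempfile.gettempdir()) / "aligned_demo.jsonl")
restored = read_jsonl(out)
```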

📈 Success Metrics

| Metric | Target | Description |
|--------|--------|-------------|
| Coverage Rate | ≥ 80% | % of Layer 1 terms with aligned Layer 2+3 data |
| Avg Alignment Score | ≥ 0.75 | Mean of all final_score values |
| Language Completeness | ≥ 5 langs/term | Average languages with definitions per cell |
| Policy Evidence Density | ≥ 3 paragraphs/term | Avg aligned policy paragraphs per cell |
| Sentiment Evidence Density | ≥ 10 articles/term | Avg aligned news articles per cell |
| Processing Speed | ≤ 30 s/term | Time to align one term (all layers) |
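
Given a finished run, the first two metrics can be checked directly from the per-cell metadata. Field names follow the Knowledge Cell metadata block; the function itself is a sketch:

```python
def corpus_metrics(cells, targets=None):
    """Compute coverage and average alignment score for a finished run."""
    targets = targets or {"coverage_rate": 0.80, "avg_alignment_score": 0.75}
    # A term counts as covered when it attracted any Layer 2 or 3 evidence
    covered = [c for c in cells
               if c["policy_evidence_count"] or c["sentiment_evidence_count"]]
    scores = [c["overall_score"] for c in covered]
    metrics = {
        "coverage_rate": len(covered) / len(cells) if cells else 0.0,
        "avg_alignment_score": sum(scores) / len(scores) if scores else 0.0,
    }
    metrics["targets_met"] = {k: metrics[k] >= v for k, v in targets.items()}
    return metrics


cells = [
    {"policy_evidence_count": 12, "sentiment_evidence_count": 25, "overall_score": 0.87},
    {"policy_evidence_count": 3, "sentiment_evidence_count": 10, "overall_score": 0.79},
    {"policy_evidence_count": 0, "sentiment_evidence_count": 0, "overall_score": 0.0},
]
m = corpus_metrics(cells)
```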

🔄 Integration with Other Phases

Inputs from Previous Phases:

  • Layer 1 → Provides canonical terms + multilingual definitions + Wikidata QIDs
  • Layer 2 → Provides policy paragraphs tagged with topics
  • Layer 3 → Provides sentiment-annotated news + trend data

Outputs for Next Phase:

  • Phase 5 → Publication-ready datasets for competition submission
  • Frontend → (Optional) Pre-computed aligned data for fast UI loading
  • External Users → High-quality training data for domain-specific LLMs

๐Ÿ› ๏ธ Technical Requirements

Dependencies:

# Core
pydantic>=2.5.0
pyyaml>=6.0
aiosqlite>=0.19.0

# Alignment Methods
google-generativeai>=0.3.0      # Gemini API
openai>=1.6.0                   # GPT-4 API (optional)
sentence-transformers>=2.2.0    # Vector embeddings
scikit-learn>=1.3.0             # TF-IDF, cosine similarity

# Utilities
requests>=2.31.0                # Wikidata API
tqdm>=4.66.0                    # Progress bars
pandas>=2.0.0                   # Data export

Hardware Recommendations:

  • CPU: 4+ cores (for parallel processing)
  • RAM: 8GB+ (for embedding model caching)
  • GPU: Optional but recommended for vector embeddings (CUDA-compatible)
  • Storage: 2GB for models + 500MB for output datasets

🎯 Deliverables (Phase 4 Completion Checklist)

  • Core Engine

    • AlignmentEngine class with multi-strategy support
    • KnowledgeCell Pydantic model with full schema
    • Database loaders for Layer 1/2/3
    • Wikidata QID fetcher and cacher
  • Alignment Strategies

    • LLM aligner (Gemini + fallback to GPT-4)
    • Vector aligner (Sentence-BERT)
    • Rule-based aligner (keyword + TF-IDF)
    • Hybrid ensemble aggregator
  • Export System

    • JSONL exporter with compression support
    • CSV exporter with multilingual handling
    • Quality report generator (Markdown + charts)
  • Scripts & Tools

    • Full alignment runner (run_full_alignment.py)
    • Incremental updater (incremental_update.py)
    • Output validator (validate_output.py)
    • Configuration validator
  • Documentation

    • README.md with usage examples
    • Configuration guide (YAML options explained)
    • Alignment strategy comparison table
    • Troubleshooting guide
  • Testing & Validation

    • Unit tests for each aligner
    • Integration test with sample data
    • Performance benchmarks
    • Output schema validation

๐Ÿ† Phase 5: Competition Submission (March 2026)

  • Documentation

    • Technical architecture (docs/architecture.md)
    • API documentation (docs/api.md)
    • Full technical solution document (30-50 pages)
    • Dataset description document
  • Demo Preparation

    • Online demo deployment (Vercel + Railway)
    • Demo video production (5-10 min)
    • PPT presentation materials
  • Data Scale Targets

    • 500+ economic terms × 20 languages
    • 10+ policy report alignments
    • 5000+ news sentiment annotations

📋 Current Development Status

Last Updated: 2024-12-16 23:00

What's Completed

| Component | Status | Files |
|-----------|--------|-------|
| Layer 1 Backend | ✅ Complete | backend/main.py, database.py, etc. |
| Layer 1 Frontend | ✅ Complete | frontend/src/ (6 components) |
| Layer 2 Models | ✅ Complete | layer2_policy/backend/models.py |
| Layer 2 PDF Parser | ✅ Complete | layer2_policy/backend/pdf_parser.py |
| Layer 2 Alignment | ✅ Complete | layer2_policy/backend/alignment.py |
| Layer 2 Database | ✅ Complete | layer2_policy/backend/database.py |
| Layer 2 API | ✅ Complete | layer2_policy/backend/api.py |
| Layer 2 Frontend | ✅ Complete | frontend/src/components/PolicyCompare.vue |
| Layer 3 Models | ✅ Complete | layer3_sentiment/backend/models.py |
| Layer 3 Database | ✅ Complete | layer3_sentiment/backend/database.py |
| Layer 3 Crawler | ✅ Complete | layer3_sentiment/crawler/news_crawler.py |
| Layer 3 Annotator | ✅ Complete | layer3_sentiment/annotation/llm_annotator.py |
| Layer 3 Doccano | ✅ Complete | layer3_sentiment/annotation/doccano_export.py |
| Layer 3 Trends | ✅ Complete | layer3_sentiment/analysis/trend_analysis.py |
| Layer 3 API | ✅ Complete | layer3_sentiment/backend/api.py |
| Layer 3 Frontend | ✅ Complete | frontend/src/components/SentimentAnalysis.vue |
| Export Scripts | 🔧 Framework | scripts/export_dataset.py |
| Documentation | ✅ Complete | docs/architecture.md, docs/api.md |

Latest Updates (2026-01-01)

๐ŸŒ Layer 4: Cross-Lingual Augmentation & LLM Training Export

✅ Fully Localized LLM Exports (8 languages: EN, ZH, JA, KO, DE, FR, ES, RU)

  • All LLM training formats (Alpaca, ShareGPT, OpenAI, Dolly, Text) now use localized templates
  • Questions, instructions, and system prompts are dynamically translated per language
  • Language-source filtering: ZH exports → PBOC data only, EN exports → FED data only

✅ Cross-Lingual Augmentation Panel (pinned at the top of the Layer 4 dashboard)

  • 3 Translation Modes:

    | Mode | Description | Requirements |
    |------|-------------|--------------|
    | 📄 No Translation | Export native data only | None |
    | 🖥️ Local (Argos) | Offline neural MT | pip install argostranslate |
    | 🌐 API | LLM translation (high quality) | OpenAI/Gemini API key |
  • Configure API provider (OpenAI/Gemini), model, and augmentation ratio
  • View FED/PBOC record counts and latest output files

✅ Per-Cell Translation Export

  • Each Knowledge Cell can be exported with translation mode selection
  • Supports real-time LLM translation via OpenAI/Gemini API (httpx async calls)
  • Local translation using argostranslate (free, offline)

✅ New Backend Endpoints:

POST /api/v1/alignment/cell/{id}/export/local-translate  # Argos offline translation
POST /api/v1/alignment/cell/{id}/export/cross-lingual    # LLM API translation
POST /api/v1/alignment/augmentation/run                  # Batch augmentation
GET  /api/v1/alignment/augmentation/status               # Check status

✅ Batch Cross-Lingual Augmentation Script (layer4_alignment/scripts/cross_lingual_augmentor.py)

  • Async OpenAI/Gemini API calls with retry logic
  • 70/30 mixing ratio (native + augmented data)
  • ShareGPT output format with term metadata
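
The 70/30 mixing step can be sketched as follows; the function name and record shape are illustrative, not the script's actual interface:

```python
import random


def mix_corpus(native, augmented, native_ratio=0.7, seed=42):
    """Blend native and machine-translated samples at a fixed ratio.

    The augmented pool is capped at whatever count keeps the native share
    at native_ratio, so natives are never discarded to hit the ratio.
    """
    rng = random.Random(seed)  # seeded for reproducible exports
    n_aug = min(len(augmented),
                round(len(native) * (1 - native_ratio) / native_ratio))
    mixed = list(native) + rng.sample(augmented, n_aug)
    rng.shuffle(mixed)
    return mixed


native = [{"id": i, "origin": "native"} for i in range(70)]
augmented = [{"id": i, "origin": "aug"} for i in range(30)]
mixed = mix_corpus(native, augmented)
```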

📦 New Dependencies:

pip install argostranslate  # Local offline translation
pip install httpx           # Async HTTP for LLM APIs

Previous Updates (2024-12-29)

🔧 Technical Debt Remediation Complete:

  • ✅ Created shared/ module with centralized utilities
  • ✅ Centralized database schemas (11 tables in shared/schema.py)
  • ✅ Standardized error handling (shared/errors.py)
  • ✅ Replaced all hardcoded API URLs with environment-aware configuration
  • ✅ Added type hints to core functions
  • ✅ Environment-aware CORS configuration
  • ✅ Centralized configuration constants (shared/config.py)

๐Ÿ› ๏ธ Tech Stack

Backend

| Technology | Purpose |
|------------|---------|
| FastAPI | Web framework |
| SQLite + aiosqlite | Async database |
| Wikipedia-API | Term crawling |
| zhconv | Chinese conversion |
| Marker | PDF parsing (Layer 2) |
| Sentence-BERT | Semantic alignment (Layer 2) |
| Gemini API | Sentiment annotation (Layer 3) |

Frontend

| Technology | Purpose |
|------------|---------|
| Vue 3 + Vite | Frontend framework |
| TailwindCSS | UI styling |
| D3.js | Knowledge graph visualization |
| ECharts | Trend charts (Layer 3) |
| Axios | HTTP client |

Data Formats

| Format | Purpose |
|--------|---------|
| JSONL | Primary data format (ML friendly) |
| TMX | Translation Memory (CAT tools) |
| CSV/TSV | General tables (Excel) |

📊 Dataset Preview

Layer 1: Terminology Data Sample

{
  "id": 1,
  "term": "Inflation",
  "definitions": {
    "en": {"summary": "In economics, inflation is...", "url": "https://..."},
    "zh": {"summary": "通货膨胀是指...", "url": "https://..."},
    "ja": {"summary": "インフレーションとは...", "url": "https://..."}
  },
  "related_terms": ["Deflation", "CPI", "Monetary_Policy"],
  "categories": ["Macroeconomics"]
}

Layer 2: Policy Alignment Data Sample

{
  "term": "Inflation",
  "pboc": {
    "source": "2024Q3 Monetary Policy Report",
    "text": "Current inflation remains moderate, CPI rose 0.4% YoY..."
  },
  "fed": {
    "source": "2024 December Beige Book",
    "text": "Prices continued to rise modestly across most districts..."
  },
  "similarity": 0.85
}

Layer 3: News Sentiment Data Sample

{
  "id": 1,
  "title": "Fed signals slower pace of rate cuts amid sticky inflation",
  "source": "Bloomberg",
  "date": "2024-12-13",
  "related_terms": ["Inflation", "Interest_Rate"],
  "sentiment": {"label": "bearish", "score": 0.82},
  "market_context": {"sp500_change": -0.54}
}

🌟 Innovation Highlights

1. Three-Layer Vertical Architecture

  • Breaking the single-dimension limitation of traditional corpora
  • Full chain tracking from "term definition → policy application → market reaction"

2. AI-Driven Efficiency Boost

  • Marker solves PDF table/formula parsing challenges
  • LLM pre-annotation + human verification, 10x efficiency improvement

3. Time Series Economic Insights

  • Term frequency overlaid with market index analysis
  • Corpus with economic forecasting potential

4. Multi-Scenario Applications

  • Researchers: Policy comparison + trend analysis
  • Translators: TMX translation memory
  • Analysts: Sentiment monitoring dashboard

โš–๏ธ Data Compliance

  • ✅ Only public data is collected (government reports, Wikipedia)
  • ✅ News storage is limited to summaries/headlines plus original links
  • ✅ Compliant with the Wikipedia API User-Agent policy
  • ✅ Non-commercial academic research project

๐Ÿค Contributing

We welcome contributors with the following backgrounds:

  • Economics/Trade: Term selection, policy interpretation
  • Languages/Translation: Doccano annotation verification
  • Computer Science: Algorithm optimization, visualization

Contribution Process

  1. Fork this repository
  2. Create feature branch (git checkout -b feature/AmazingFeature)
  3. Commit changes (git commit -m 'Add some AmazingFeature')
  4. Push branch (git push origin feature/AmazingFeature)
  5. Create Pull Request

📚 Documentation


📄 License

This project is licensed under the MIT License - see LICENSE for details.


๐Ÿ™ Acknowledgments


โญ If this project helps you, please give us a Star!
