A Multi-Granularity Bilingual Corpus System for Economic Analysis
An intelligent corpus platform integrating terminology, policy documents, and market sentiment
EconMind Matrix is an innovative multilingual corpus platform for economics, integrating three dimensions of data:
| Layer | Name | Content | Data Source |
|---|---|---|---|
| Layer 1 | Terminology Knowledge Base | 20+ language economic term definitions + Knowledge Graph | Wikipedia |
| Layer 2 | Policy Parallel Corpus | Central bank report alignment (PBOC vs Fed) | Official Reports |
| Layer 3 | Sentiment & Trend Corpus | Financial news + Sentiment labels + Time series | News Media |
```
┌───────────────────────────────────────────────────────────────┐
│  Layer 3: Sentiment & Trend Corpus                            │
│  📰 Financial News + Sentiment Labels + Term Trend Charts     │
└───────────────────────┬───────────────────────────────────────┘
                        │ Time Series Correlation
┌───────────────────────┴───────────────────────────────────────┐
│  Layer 2: Policy & Comparable Corpus                          │
│  📋 Central Bank Report Alignment (PBOC vs Fed)               │
└───────────────────────┬───────────────────────────────────────┘
                        │ Term Linking
┌───────────────────────┴───────────────────────────────────────┐
│  Layer 1: Terminology Knowledge Base                          │
│  📚 20+ Language Definitions + Knowledge Graph                │
└───────────────────────────────────────────────────────────────┘
```
Search for "Inflation" and get:
- Terminology Layer: professional definitions in 20+ languages plus a related-concept knowledge graph
- Policy Layer: side-by-side comparison of related PBOC and Federal Reserve paragraphs
- Sentiment Layer: news headlines from the last 30 days plus a sentiment trend chart
- Multilingual Support: Covers 20+ languages including English, Chinese, Japanese, Korean, French, German, Russian
- Chinese Conversion: Automatic Traditional to Simplified Chinese conversion
- Knowledge Graph: D3.js visualization of term relationship networks
- LLM Pre-annotation: Using Gemini/GPT for sentiment analysis and entity extraction
- Human-in-the-Loop: Doccano platform for expert verification
- Quality Control: Hybrid annotation accuracy > 90%
- JSONL: Machine learning training format
- TMX: Translation Memory (CAT tool compatible)
- CSV/TSV: Excel/Pandas friendly
- TXT: Human readable format
```
EconMind-Matrix/
│
├── 📁 backend/                    # Layer 1: Terminology Backend (Complete)
│   ├── main.py                    # FastAPI server
│   ├── database.py                # Database operations
│   ├── models.py                  # Data models
│   ├── .env.example               # Environment configuration template
│   └── output/                    # Crawl results (Markdown)
│
├── 📁 frontend/                   # Layer 1: Vue.js Frontend (Complete)
│   ├── src/
│   │   ├── App.vue                # Main component
│   │   ├── components/            # UI components
│   │   └── services/api.js        # Centralized API service
│   ├── .env.development           # Dev environment config
│   ├── .env.production            # Prod environment config
│   └── package.json
│
├── 📁 shared/                     # Shared Utilities (NEW)
│   ├── __init__.py                # Package exports
│   ├── utils.py                   # Text utilities (clean_text)
│   ├── schema.py                  # Centralized DB schemas (11 tables)
│   ├── errors.py                  # Standardized error classes
│   ├── config.py                  # Configuration constants
│   └── README.md                  # Module documentation
│
├── 📁 layer2_policy/              # Layer 2: Policy Module (Complete)
│   ├── backend/
│   │   ├── api.py                 # Policy API endpoints
│   │   ├── pdf_parser.py          # Marker PDF parsing
│   │   ├── alignment.py           # Sentence-BERT paragraph alignment
│   │   └── models.py              # Policy data models
│   └── data/
│       ├── pboc/                  # PBOC reports
│       └── fed/                   # Federal Reserve reports
│
├── 📁 layer3_sentiment/           # Layer 3: Sentiment Module (Complete)
│   ├── backend/
│   │   ├── api.py                 # FastAPI sentiment endpoints
│   │   ├── database.py            # Sentiment database operations
│   │   └── models.py              # News & sentiment data models
│   ├── crawler/                   # News crawler
│   │   └── news_crawler.py        # RSS feed crawler (Bloomberg, Reuters, etc.)
│   ├── annotation/                # LLM annotation + Doccano integration
│   │   ├── llm_annotator.py       # Gemini API sentiment analysis
│   │   └── doccano_export.py      # Doccano import/export scripts
│   └── analysis/                  # Trend analysis
│       └── trend_analysis.py      # Time series analysis module
│
├── 📁 dataset/                    # Dataset export directory
│   ├── terminology.jsonl          # Layer 1 data
│   ├── policy_alignment.jsonl     # Layer 2 data
│   └── news_sentiment.jsonl       # Layer 3 data
│
├── 📁 scripts/                    # Automation scripts
│   ├── export_dataset.py          # Dataset export
│   └── crawl_all.py               # Batch crawling
│
├── 📁 docs/                       # Project documentation
│   ├── proposal.md                # Project proposal
│   ├── architecture.md            # Technical architecture
│   └── api.md                     # API documentation
│
├── pyproject.toml                 # Python package configuration
├── README.md                      # This file
├── SETUP.md                       # Installation guide
└── LICENSE                        # MIT License
```
- Python 3.9+
- Node.js 16+
- Git
```bash
# 1. Clone repository
git clone https://github.com/[your-username]/EconMind-Matrix.git
cd EconMind-Matrix

# 2. Install backend dependencies
cd backend
pip install -r requirements.txt

# 3. Install frontend dependencies
cd ../frontend
npm install

# 4. Start backend server
cd ../backend
python main.py     # Runs on http://localhost:8000

# 5. Start frontend dev server
cd ../frontend
npm run dev        # Runs on http://localhost:5173
```

Visit the Manage page in the web interface to configure your User-Agent (required by the Wikipedia API).
See SETUP.md for details.
Based on the TermCorpusGenerator project:
- Wikipedia multilingual term crawling
- 20+ language support (including Traditional/Simplified Chinese conversion)
- Batch import and automated crawling
- Intelligent association crawling (See Also, link analysis)
- D3.js knowledge graph visualization
- Multi-format export (JSON, JSONL, CSV, TSV, TMX, TXT)
- Data quality analysis and cleaning tools
- Database backup/restore functionality
Target: Mid December 2025

✅ Code Implementation Completed (2024-12-14):

- Data Models (`layer2_policy/backend/models.py`)
  - PolicyReport, PolicyParagraph, PolicyAlignment dataclasses
  - Database schema for Layer 2 tables
  - 8 policy topics with bilingual keywords (inflation, employment, etc.)
  - Topic detection via keyword matching
- PDF Parsing Module (`layer2_policy/backend/pdf_parser.py`)
  - Marker integration for AI-powered PDF→Markdown conversion
  - PyPDF2 fallback for basic text extraction (see the sketch after this list)
  - Automatic title and date extraction
  - Paragraph splitting with topic detection
  - Section-aware parsing for PBOC and Fed reports
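As a rough illustration of the PyPDF2 fallback path, the extraction step might look like the sketch below (the sample file path is hypothetical, and the real module additionally extracts titles, dates, and topics):

```python
# Minimal sketch of the PyPDF2 fallback: plain text extraction for when
# Marker's AI-powered PDF-to-Markdown conversion is unavailable.
from PyPDF2 import PdfReader

def extract_paragraphs(pdf_path: str) -> list[str]:
    """Extract raw text page by page and split it into rough paragraphs."""
    reader = PdfReader(pdf_path)
    text = "\n\n".join(page.extract_text() or "" for page in reader.pages)
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# Hypothetical report path following the repository's data layout
paragraphs = extract_paragraphs("layer2_policy/data/fed/beige_book_2024.pdf")
print(f"{len(paragraphs)} paragraphs extracted")
```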
- Paragraph Alignment Module (`layer2_policy/backend/alignment.py`)
  - Sentence-BERT semantic similarity (multilingual)
  - Topic-based alignment fallback
  - Keyword overlap fallback
  - Embedding caching for performance
  - Alignment history tracking
  - Custom topic pool (user-defined topics)
- Database Operations (`layer2_policy/backend/database.py`)
  - Async CRUD for reports, paragraphs, alignments
  - Statistics endpoint
  - Term search across policy paragraphs
  - Quality score calculation with language breakdown
- API Endpoints (`layer2_policy/backend/api.py`)
  - POST `/upload` - Upload and parse PDF
  - POST `/upload-text` - Upload text (testing)
  - GET `/reports` - List reports
  - POST `/align` - Run alignment
  - GET `/alignments` - Query alignments
  - GET `/topics` - List and manage topics
  - GET `/stats` - Layer 2 statistics
  - GET `/export/*` - Export alignments (JSONL), reports (JSONL), parallel corpus (TSV)
✅ Completed Testing & Environment:

- Install dependencies: `torch`, `sentence-transformers` (successfully installed)
- Test PDF parsing with Marker
- Test alignment with Sentence-BERT (high-quality semantic matching enabled)
- Integrate the Layer 2 router into main.py
- Frontend component: PolicyCompare.vue with Topics, History, and Exports
Completed: December 2025

✅ Full Implementation Completed (2025-12-16):

- Data Models (`layer3_sentiment/backend/models.py`)
  - NewsArticle, SentimentAnnotation, MarketContext dataclasses
  - Database schema for Layer 3 tables
  - Economic term variants (EN/ZH) for news filtering
  - Sentiment labels: Bullish, Bearish, Neutral
- News Crawler (`layer3_sentiment/crawler/news_crawler.py`)
  - RSS feed integration (Bloomberg, Reuters, WSJ, FT, Xinhua; 21 sources in total, see the sketch after this list)
  - Async crawling with feedparser
  - Term-based news filtering
  - Automatic term detection from article content
  - User-Agent rotation pool (8 browser UAs)
  - Proxy pool support (http/https/socks5)
  - Concurrency control (1-10 concurrent requests)
  - Custom delay (0.5-10 seconds between requests)
  - Manual start/stop control with verification
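For orientation, term-filtered RSS ingestion with feedparser reduces to something like this (a synchronous sketch: the feed URL and term list are placeholders, and the real crawler adds async scheduling, UA rotation, proxies, and delays):

```python
# Simplified sketch of RSS crawling with term-based filtering.
import feedparser

FEED_URLS = ["https://example.com/markets/rss"]   # placeholder feed URL
TERMS = {"inflation", "cpi", "interest rate"}     # economic term variants

for url in FEED_URLS:
    feed = feedparser.parse(url)
    for entry in feed.entries:
        text = f"{entry.get('title', '')} {entry.get('summary', '')}".lower()
        matched = sorted(t for t in TERMS if t in text)
        if matched:
            print(entry.get("title"), "->", matched)
```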
- LLM Sentiment Annotator (`layer3_sentiment/annotation/llm_annotator.py`)
  - Gemini API integration for sentiment analysis
  - Bilingual prompt templates (EN/ZH)
  - Rule-based fallback annotator (no API required)
  - Hybrid annotator (optimizes API usage)
  - Batch annotation with rate limiting
- Doccano Integration (`layer3_sentiment/annotation/doccano_export.py`)
  - JSONL export for the Doccano platform (sample record below)
  - CSV export for spreadsheet annotation
  - Import verified annotations back to the database
  - Annotation quality checking
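For reference, one line of the Doccano-ready JSONL could look like the record below (field names follow Doccano's standard text-classification import format; the `meta` contents are illustrative):

```json
{"text": "Fed signals slower pace of rate cuts amid sticky inflation", "label": ["bearish"], "meta": {"article_id": 1523, "source": "Bloomberg"}}
```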
- Trend Analysis (`layer3_sentiment/analysis/trend_analysis.py`)
  - Daily term frequency calculation
  - Sentiment distribution over time
  - Trend direction detection (increasing/decreasing/stable)
  - Market correlation analysis (optional)
  - ECharts-compatible data generation
- API Endpoints (`layer3_sentiment/backend/api.py`) — composition sketched below
  - POST `/crawl` - Crawl news from sources
  - GET `/articles` - List articles
  - POST `/annotate` - Run sentiment annotation
  - GET `/trend/{term}` - Get term trend analysis
  - GET `/trends/hot` - Get hot terms
  - GET `/export/doccano` - Export for Doccano
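To illustrate how these endpoints compose, a client session might look like the following sketch (the base URL, query parameters, and request bodies are assumptions; see docs/api.md for the actual schema):

```python
# Hypothetical walkthrough of the Layer 3 API: crawl, then inspect results.
import httpx

BASE_URL = "http://localhost:8000"  # assumes the Layer 3 router is mounted here

with httpx.Client(base_url=BASE_URL, timeout=60.0) as client:
    # Trigger a crawl run (payload shape is an assumption)
    client.post("/crawl", json={"terms": ["Inflation"]})

    # List stored articles and fetch the trend series for one term
    articles = client.get("/articles", params={"term": "Inflation"}).json()
    trend = client.get("/trend/Inflation").json()
    print(len(articles), "articles;", "trend keys:", list(trend))
```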
- Frontend Component (`frontend/src/components/SentimentAnalysis.vue`)
  - Dashboard with sentiment statistics
  - News crawling interface with advanced options
  - Articles list with sentiment labels (search, filter, group by source)
  - Trend analysis visualization
  - Export options (JSON, JSONL, CSV, Doccano)
  - Running-crawler detection on page load
  - Force stop with verification polling
  - Proxy pool configuration UI
Critical Distinction: Layer 4 is NOT a user interface - it is an offline batch processing engine that consumes completed data from Layers 1-3 and produces publication-ready aligned datasets.
Input → Process → Output Model:

```
Layer 1 Data (corpus.db) ──┐
Layer 2 Data (corpus.db) ──┼──▶  Alignment Engine  ──▶  Unified Dataset File
Layer 3 Data (corpus.db) ──┘     (Batch Pipeline)        (aligned_corpus.jsonl)
```
What Layer 4 Does:
- Enumerates all successfully crawled terms from Layer 1
- Searches Layer 2/3 for content related to each term (across ALL supported languages)
- Aligns using multiple strategies (LLM, vectors, rules) to determine semantic relevance
- Aggregates aligned evidence into structured "Knowledge Cells"
- Exports publication-ready datasets in standardized formats (JSONL, CSV, etc.)
- Reports data quality metrics (coverage, alignment scores, language distribution)
What Layer 4 Does NOT Do:
- ❌ Provide real-time user search interfaces (that's the frontend's job)
- ❌ Store data in its own database (it reads from the Layer 1-3 databases)
- ❌ Crawl or collect raw data (Layers 1-3 handle this)
```
layer4_alignment/
├── backend/
│   ├── alignment_engine.py        # Core orchestration logic
│   ├── data_loader.py             # Load data from Layer 1-3 databases
│   ├── knowledge_cell.py          # Knowledge Cell data model (Pydantic)
│   ├── aligners/                  # Pluggable alignment strategies
│   │   ├── llm_aligner.py         # Gemini/GPT-4 semantic judgment
│   │   ├── vector_aligner.py      # Sentence-BERT cosine similarity
│   │   ├── rule_aligner.py        # Keyword + TF-IDF matching
│   │   └── hybrid_aligner.py      # Weighted ensemble of the above methods
│   ├── exporters/
│   │   ├── jsonl_exporter.py      # JSONL dataset export
│   │   ├── csv_exporter.py        # Spreadsheet-friendly export
│   │   └── quality_reporter.py    # Statistics and quality metrics
│   └── utils/
│       ├── wikidata_client.py     # Fetch Wikidata QIDs for terms
│       └── text_processor.py      # Multilingual text normalization
├── config/
│   ├── alignment_config.yaml      # Alignment strategy settings
│   └── language_support.yaml      # Language priority and mappings
├── scripts/
│   ├── run_full_alignment.py      # Batch process all terms
│   ├── incremental_update.py      # Process newly added terms only
│   └── validate_output.py         # Verify dataset integrity
└── README.md
```
Layer 4 employs four complementary alignment methods to maximize accuracy.

Strategy 1: LLM Semantic Judgment

- Model: Gemini 1.5 Pro / GPT-4 Turbo
- Method: Present the term definition plus candidate texts to the LLM and ask for relevance scoring on a 0-1 scale
- Prompt Example:

```
Term: "Inflation" (Definition: In economics, inflation is a general rise in prices...)
Rate each policy paragraph's relevance to this concept (0-1 scale):
[0] "Current inflation remains moderate, CPI rose 0.4% YoY..." → Score: ?
[1] "Export growth accelerated in Q3..." → Score: ?
```

- Advantages: Understands context, handles paraphrasing, detects conceptual matches
- Limitations: API costs, rate limits, requires careful prompt engineering
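A minimal sketch of this call using the `google-generativeai` package listed in the dependencies (the prompt wording and the JSON-array response contract are illustrative assumptions; production code would need retries and response validation):

```python
import json
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

def score_candidates(term: str, definition: str, candidates: list[str]) -> list[float]:
    """Ask the LLM to rate each candidate's relevance to the term (0-1)."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(candidates))
    prompt = (
        f'Term: "{term}" (Definition: {definition})\n'
        "Rate each paragraph's relevance to this concept on a 0-1 scale.\n"
        f"{numbered}\n"
        "Reply with only a JSON array of floats, one per paragraph."
    )
    response = model.generate_content(prompt)
    return json.loads(response.text)  # assumes the model honors the format

scores = score_candidates(
    "Inflation",
    "In economics, inflation is a general rise in prices...",
    ["Current inflation remains moderate, CPI rose 0.4% YoY...",
     "Export growth accelerated in Q3..."],
)
```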
Strategy 2: Vector Similarity

- Model: `sentence-transformers/paraphrase-multilingual-mpnet-base-v2`
- Method:
  1. Encode the term definition into a 768-dim vector
  2. Encode each candidate paragraph/article into vectors
  3. Calculate cosine similarity
  4. Accept matches above a threshold (e.g., >0.65)
- Advantages: Fast, free, works offline, multilingual support
- Limitations: May miss conceptual matches if wording differs significantly
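Concretely, the four steps above reduce to a few lines with the `sentence-transformers` package (a self-contained sketch using the documented model and the 0.65 threshold; the example texts come from the prompt example above):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

definition = "In economics, inflation is a general rise in the price level..."
candidates = [
    "Current inflation remains moderate, CPI rose 0.4% YoY...",
    "Export growth accelerated in Q3...",
]

# Steps 1-2: encode the definition and candidates (768-dim embeddings)
def_vec = model.encode(definition, convert_to_tensor=True)
cand_vecs = model.encode(candidates, convert_to_tensor=True)

# Step 3: cosine similarity between the definition and every candidate
scores = util.cos_sim(def_vec, cand_vecs)[0]

# Step 4: keep candidates above the threshold
matches = [(t, float(s)) for t, s in zip(candidates, scores) if float(s) > 0.65]
print(matches)
```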
Strategy 3: Rule-Based Keyword Matching

- Method:
  1. Extract keywords from the term (plus synonyms from Layer 1's `related_terms`)
  2. Calculate TF-IDF scores in candidate texts
  3. Fuzzy matching for inflected forms (e.g., "inflate" → "inflation")
- Advantages: Explainable, deterministic, no API dependencies
- Limitations: Purely lexical, misses semantic equivalents
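A rough sketch of this strategy with scikit-learn's TF-IDF plus a stand-in fuzzy matcher (difflib here; the keyword list and the fuzzy-hit bonus are illustrative choices, not the project's exact scoring):

```python
from difflib import SequenceMatcher

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

keywords = ["inflation", "price level", "cpi"]   # term + Layer 1 related_terms
candidates = [
    "Current inflation remains moderate, CPI rose 0.4% YoY...",
    "Export growth accelerated in Q3...",
]

# TF-IDF similarity between the keyword "query" and each candidate text
vectorizer = TfidfVectorizer().fit(candidates + [" ".join(keywords)])
query_vec = vectorizer.transform([" ".join(keywords)])
scores = cosine_similarity(query_vec, vectorizer.transform(candidates))[0]

def fuzzy_hit(keyword: str, text: str, threshold: float = 0.8) -> bool:
    """Crude check for inflected forms, e.g. 'inflate' vs 'inflation'."""
    return any(SequenceMatcher(None, keyword, tok).ratio() >= threshold
               for tok in text.lower().split())

for text, score in zip(candidates, scores):
    bonus = 0.1 if any(fuzzy_hit(k, text) for k in keywords) else 0.0
    print(round(score + bonus, 2), text[:45])
```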
Strategy 4: Hybrid Ensemble

- Method: Weighted vote of the above three methods
- Formula: `Final_Score = 0.50×LLM + 0.30×Vector + 0.15×Rule + 0.05×Ensemble_Bonus`
- Ensemble Bonus: +0.05 if all three methods agree (a high-confidence indicator)

Filtering Threshold: Only matches with `Final_Score ≥ 0.65` are included in the Knowledge Cell.
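Putting the formula into code, a minimal sketch of the ensemble scorer (the agreement check reuses the 0.65 threshold as an assumption; the real `hybrid_aligner.py` may define agreement differently):

```python
MIN_FINAL_SCORE = 0.65  # global filtering threshold from the config

def hybrid_score(llm: float, vector: float, rule: float) -> float:
    """Weighted ensemble of the three aligner scores, each in [0, 1]."""
    score = 0.50 * llm + 0.30 * vector + 0.15 * rule
    # Ensemble bonus: +0.05 when all three methods independently agree
    if min(llm, vector, rule) >= MIN_FINAL_SCORE:
        score += 0.05
    return round(score, 2)

final = hybrid_score(llm=0.92, vector=0.78, rule=0.85)   # -> 0.87
print("keep" if final >= MIN_FINAL_SCORE else "discard", final)
```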
Each term produces one Knowledge Cell, which is the atomic unit of the aligned dataset:
```json
{
  "concept_id": "Q17127698",          // Wikidata QID (or TERM_<id> if unavailable)
  "primary_term": "Inflation",        // English canonical term
  "definitions": {                    // Layer 1: Multilingual definitions
    "en": {
      "term": "Inflation",
      "summary": "In economics, inflation is a general rise in the price level...",
      "url": "https://en.wikipedia.org/wiki/Inflation",
      "source": "Wikipedia"
    },
    "zh": {
      "term": "通货膨胀",
      "summary": "通货膨胀是指一般物价水平在一定时期内持续上涨...",
      "url": "https://zh.wikipedia.org/wiki/通货膨胀",
      "source": "Wikipedia"
    },
    "ja": {...},
    "ko": {...}
    // All languages supported by Layer 1
  },
  "policy_evidence": [                // Layer 2: Aligned policy paragraphs
    {
      "source": "pboc",
      "paragraph_id": 42,
      "text": "当前通胀保持温和，CPI同比上涨0.4%，核心CPI上涨0.3%...",
      "topic": "price_stability",
      "alignment_scores": {
        "llm": 0.92,
        "vector": 0.78,
        "rule": 0.85,
        "final": 0.88
      },
      "alignment_method": "hybrid_ensemble",
      "report_metadata": {
        "title": "2024年第三季度中国货币政策执行报告",
        "date": "2024-11-08",
        "section": "Part II: Monetary Policy Operations"
      }
    },
    {
      "source": "fed",
      "paragraph_id": 156,
      "text": "Prices continued to rise modestly across most districts. Retail prices increased...",
      "topic": "inflation",
      "alignment_scores": {...},
      "report_metadata": {...}
    }
  ],
  "sentiment_evidence": [             // Layer 3: Aligned news articles
    {
      "article_id": 1523,
      "title": "Fed signals slower pace of rate cuts amid sticky inflation",
      "source": "Bloomberg",
      "url": "https://www.bloomberg.com/...",
      "published_date": "2024-12-13",
      "sentiment": {
        "label": "bearish",
        "confidence": 0.82,
        "annotator": "gemini-1.5-flash"
      },
      "alignment_scores": {
        "llm": 0.95,
        "vector": 0.89,
        "rule": 0.72,
        "final": 0.91
      }
    },
    {...}
  ],
  "metadata": {
    "created_at": "2025-01-15T10:23:45Z",
    "alignment_engine_version": "4.0.0",
    "quality_metrics": {
      "overall_score": 0.87,            // Weighted avg of all alignment scores
      "language_coverage": 8,           // Number of languages with definitions
      "policy_evidence_count": 12,      // PBOC + Fed paragraphs aligned
      "sentiment_evidence_count": 25,   // News articles aligned (last 90 days)
      "avg_policy_score": 0.84,
      "avg_sentiment_score": 0.89
    }
  }
}
```

File: `layer4_alignment/config/alignment_config.yaml`

```yaml
# Alignment Strategy Settings
alignment_strategies:
  llm_semantic:
    enabled: true
    provider: "gemini"            # or "openai", "deepseek"
    model: "gemini-1.5-pro"
    api_key_env: "GEMINI_API_KEY"
    temperature: 0.1
    max_tokens: 500
    batch_size: 10                # Process 10 candidates per LLM call
    threshold: 0.70
    weight: 0.50

  vector_similarity:
    enabled: true
    model: "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
    device: "cuda"                # or "cpu"
    threshold: 0.65
    weight: 0.30

  keyword_matching:
    enabled: true
    use_fuzzy: true
    fuzzy_threshold: 0.85
    tfidf_top_k: 20
    threshold: 0.60
    weight: 0.15

# Global Settings
global:
  min_final_score: 0.65           # Discard alignments below this
  max_policy_evidence: 15         # Top N policy paragraphs per term
  max_sentiment_evidence: 30      # Top N news articles per term
  sentiment_time_window_days: 90  # Only recent news

# Language Support (inherits from Layer 1)
languages:
  priority: ["en", "zh", "ja", "ko", "fr", "de", "es", "ru"]
  fallback_language: "en"

# Output Settings
output:
  format: "jsonl"                 # or "json", "csv"
  output_dir: "dataset"
  filename_template: "aligned_corpus_v{version}_{date}.jsonl"
  include_metadata: true
  compress: false                 # Set true to generate .jsonl.gz

# Quality Reporting
quality_report:
  enabled: true
  output_path: "dataset/quality_report.md"
  visualizations: true            # Generate charts if matplotlib is available
```

Full Alignment Run (one-time or periodic):

```bash
cd layer4_alignment
python scripts/run_full_alignment.py --config config/alignment_config.yaml
```

Console Output Example:
```
========================================
Layer 4 Alignment Engine v4.0.0
========================================
[INFO] Loading configuration from alignment_config.yaml
[INFO] Initializing aligners: LLM (Gemini) + Vector (SBERT) + Rule
[INFO] Loading Layer 1 terms from corpus.db... Found 287 terms
[INFO] Loading Layer 2 policy corpus... 1,234 paragraphs (PBOC: 623, Fed: 611)
[INFO] Loading Layer 3 news articles... 4,567 articles (last 90 days)

[1/287] Aligning term: "Inflation" (8 languages)
  ├─ Layer 2: Found 45 candidate paragraphs
  │    ├─ LLM filtering: 12 relevant (scores 0.70-0.95)
  │    ├─ Vector filtering: 18 relevant (scores 0.65-0.88)
  │    └─ Ensemble: 14 final matches (avg score 0.84)
  ├─ Layer 3: Found 128 candidate articles
  │    └─ Ensemble: 27 final matches (avg score 0.87)
  └─ Knowledge Cell quality: 0.86 ✓

[2/287] Aligning term: "GDP" (7 languages)
...

[287/287] Aligning term: "Quantitative Easing" (5 languages)
  └─ Knowledge Cell quality: 0.79 ✓

========================================
Alignment Complete!
========================================
Output: dataset/aligned_corpus_v1_2025-01-15.jsonl
Total Knowledge Cells: 287
Avg Quality Score: 0.82
Time Elapsed: 2h 34m

Generating quality report... ✓
Report saved to: dataset/quality_report.md
```
Incremental Update (for newly added terms):
```bash
python scripts/incremental_update.py --since "2025-01-15"
```

JSONL output:

- File: `aligned_corpus_v1_2025-01-15.jsonl`
- Structure: One Knowledge Cell per line (newline-delimited JSON)
- Size: ~500 KB per 100 terms (uncompressed)
- Use Case: LLM fine-tuning, batch processing, streaming ingestion

CSV output:

- File: `aligned_corpus_v1_2025-01-15.csv`
- Columns: `concept_id, term_en, term_zh, term_ja, ..., policy_count, sentiment_count, quality_score, top_policy_source, top_sentiment_label`
- Use Case: Excel analysis, Pandas dataframes, visualization

Quality report:

- File: `quality_report.md`
- Contents:
  - Overall statistics (total cells, avg scores, language distribution)
  - Top 10 highest-quality cells
  - Bottom 10 cells requiring manual review
  - Alignment method performance comparison
  - Visualizations (if enabled): bar charts, heatmaps
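Once exported, the JSONL file can be inspected in a few lines (a sketch assuming the example filename above and the Knowledge Cell fields shown earlier):

```python
import json

import pandas as pd

cells = []
with open("dataset/aligned_corpus_v1_2025-01-15.jsonl", encoding="utf-8") as f:
    for line in f:
        cells.append(json.loads(line))

# Flatten a few quality fields for a quick ranking of Knowledge Cells
df = pd.DataFrame(
    {
        "concept_id": c["concept_id"],
        "term": c["primary_term"],
        "languages": c["metadata"]["quality_metrics"]["language_coverage"],
        "quality": c["metadata"]["quality_metrics"]["overall_score"],
    }
    for c in cells
)
print(df.sort_values("quality", ascending=False).head(10))
```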
| Metric | Target | Description |
|---|---|---|
| Coverage Rate | ≥ 80% | % of Layer 1 terms with aligned Layer 2+3 data |
| Avg Alignment Score | ≥ 0.75 | Mean of all final_score values |
| Language Completeness | ≥ 5 langs/term | Average number of languages with definitions per cell |
| Policy Evidence Density | ≥ 3 paragraphs/term | Avg aligned policy paragraphs per cell |
| Sentiment Evidence Density | ≥ 10 articles/term | Avg aligned news articles per cell |
| Processing Speed | ≤ 30 s/term | Time to align one term (all layers) |
Inputs from Previous Phases:

- Layer 1 → Provides canonical terms + multilingual definitions + Wikidata QIDs
- Layer 2 → Provides policy paragraphs tagged with topics
- Layer 3 → Provides sentiment-annotated news + trend data

Outputs for Next Phase:

- Phase 5 → Publication-ready datasets for competition submission
- Frontend → (Optional) Pre-computed aligned data for fast UI loading
- External Users → High-quality training data for domain-specific LLMs
Dependencies:
```text
# Core
pydantic>=2.5.0
pyyaml>=6.0
aiosqlite>=0.19.0

# Alignment Methods
google-generativeai>=0.3.0    # Gemini API
openai>=1.6.0                 # GPT-4 API (optional)
sentence-transformers>=2.2.0  # Vector embeddings
scikit-learn>=1.3.0           # TF-IDF, cosine similarity

# Utilities
requests>=2.31.0              # Wikidata API
tqdm>=4.66.0                  # Progress bars
pandas>=2.0.0                 # Data export
```

Hardware Recommendations:
- CPU: 4+ cores (for parallel processing)
- RAM: 8GB+ (for embedding model caching)
- GPU: Optional but recommended for vector embeddings (CUDA-compatible)
- Storage: 2GB for models + 500MB for output datasets
- Core Engine
  - `AlignmentEngine` class with multi-strategy support
  - `KnowledgeCell` Pydantic model with full schema
  - Database loaders for Layers 1/2/3
  - Wikidata QID fetcher and cacher
- Alignment Strategies
  - LLM aligner (Gemini + fallback to GPT-4)
  - Vector aligner (Sentence-BERT)
  - Rule-based aligner (keyword + TF-IDF)
  - Hybrid ensemble aggregator
- Export System
  - JSONL exporter with compression support
  - CSV exporter with multilingual handling
  - Quality report generator (Markdown + charts)
- Scripts & Tools
  - Full alignment runner (`run_full_alignment.py`)
  - Incremental updater (`incremental_update.py`)
  - Output validator (`validate_output.py`)
  - Configuration validator
- Documentation
  - README.md with usage examples
  - Configuration guide (YAML options explained)
  - Alignment strategy comparison table
  - Troubleshooting guide
- Testing & Validation
  - Unit tests for each aligner
  - Integration test with sample data
  - Performance benchmarks
  - Output schema validation
- Documentation
  - Technical architecture (`docs/architecture.md`)
  - API documentation (`docs/api.md`)
  - Full technical solution document (30-50 pages)
  - Dataset description document
- Demo Preparation
  - Online demo deployment (Vercel + Railway)
  - Demo video production (5-10 min)
  - PPT presentation materials
- Data Scale Targets
  - 500+ economic terms × 20 languages
  - 10+ policy report alignments
  - 5,000+ news sentiment annotations
Last Updated: 2024-12-16 23:00
| Component | Status | Files |
|---|---|---|
| Layer 1 Backend | ✅ Complete | backend/main.py, database.py, etc. |
| Layer 1 Frontend | ✅ Complete | frontend/src/ (6 components) |
| Layer 2 Models | ✅ Complete | layer2_policy/backend/models.py |
| Layer 2 PDF Parser | ✅ Complete | layer2_policy/backend/pdf_parser.py |
| Layer 2 Alignment | ✅ Complete | layer2_policy/backend/alignment.py |
| Layer 2 Database | ✅ Complete | layer2_policy/backend/database.py |
| Layer 2 API | ✅ Complete | layer2_policy/backend/api.py |
| Layer 2 Frontend | ✅ Complete | frontend/src/components/PolicyCompare.vue |
| Layer 3 Models | ✅ Complete | layer3_sentiment/backend/models.py |
| Layer 3 Database | ✅ Complete | layer3_sentiment/backend/database.py |
| Layer 3 Crawler | ✅ Complete | layer3_sentiment/crawler/news_crawler.py |
| Layer 3 Annotator | ✅ Complete | layer3_sentiment/annotation/llm_annotator.py |
| Layer 3 Doccano | ✅ Complete | layer3_sentiment/annotation/doccano_export.py |
| Layer 3 Trends | ✅ Complete | layer3_sentiment/analysis/trend_analysis.py |
| Layer 3 API | ✅ Complete | layer3_sentiment/backend/api.py |
| Layer 3 Frontend | ✅ Complete | frontend/src/components/SentimentAnalysis.vue |
| Export Scripts | 🚧 Framework | scripts/export_dataset.py |
| Documentation | ✅ Complete | docs/architecture.md, docs/api.md |
🚀 Layer 4: Cross-Lingual Augmentation & LLM Training Export

✅ Fully Localized LLM Exports (8 languages: EN, ZH, JA, KO, DE, FR, ES, RU)

- All LLM training formats (Alpaca, ShareGPT, OpenAI, Dolly, Text) now use localized templates
- Questions, instructions, and system prompts are dynamically translated per language
- Language-source filtering: ZH exports → PBOC data only, EN exports → Fed data only

✅ Cross-Lingual Augmentation Panel (pinned at the top of the Layer 4 dashboard)

- 3 translation modes:

| Mode | Description | Requirements |
|---|---|---|
| 🚫 No Translation | Export native data only | None |
| 🖥️ Local (Argos) | Offline neural MT | `pip install argostranslate` |
| 🌐 API | LLM translation (high quality) | OpenAI/Gemini API key |

- Configure the API provider (OpenAI/Gemini), model, and augmentation ratio
- View Fed/PBOC record counts and the latest output files

✅ Per-Cell Translation Export

- Each Knowledge Cell can be exported with translation mode selection
- Supports real-time LLM translation via the OpenAI/Gemini API (`httpx` async calls)
- Local translation using argostranslate (free, offline; see the sketch below)
✅ New Backend Endpoints:

```
POST /api/v1/alignment/cell/{id}/export/local-translate   # Argos offline translation
POST /api/v1/alignment/cell/{id}/export/cross-lingual     # LLM API translation
POST /api/v1/alignment/augmentation/run                   # Batch augmentation
GET  /api/v1/alignment/augmentation/status                # Check status
```

✅ Batch Cross-Lingual Augmentation Script (`layer4_alignment/scripts/cross_lingual_augmentor.py`)

- Async OpenAI/Gemini API calls with retry logic
- 70/30 mixing ratio (native + augmented data)
- ShareGPT output format with term metadata

📦 New Dependencies:

```bash
pip install argostranslate   # Local offline translation
pip install httpx            # Async HTTP for LLM APIs
```

🔧 Technical Debt Remediation Complete:
- ✅ Created a `shared/` module with centralized utilities
- ✅ Centralized database schemas (11 tables in `shared/schema.py`)
- ✅ Standardized error handling (`shared/errors.py`)
- ✅ Replaced all hardcoded API URLs with environment-aware configuration
- ✅ Added type hints to core functions
- ✅ Environment-aware CORS configuration
- ✅ Centralized configuration constants (`shared/config.py`)
| Technology | Purpose |
|---|---|
| FastAPI | Web framework |
| SQLite + aiosqlite | Async database |
| Wikipedia-API | Term crawling |
| zhconv | Chinese conversion |
| Marker | PDF parsing (Layer 2) |
| Sentence-BERT | Semantic alignment (Layer 2) |
| Gemini API | Sentiment annotation (Layer 3) |
| Technology | Purpose |
|---|---|
| Vue 3 + Vite | Frontend framework |
| TailwindCSS | UI styling |
| D3.js | Knowledge graph visualization |
| ECharts | Trend charts (Layer 3) |
| Axios | HTTP client |
| Format | Purpose |
|---|---|
| JSONL | Primary data format (ML friendly) |
| TMX | Translation Memory (CAT tools) |
| CSV/TSV | General tables (Excel) |
Layer 1 (terminology) sample record:

```json
{
  "id": 1,
  "term": "Inflation",
  "definitions": {
    "en": {"summary": "In economics, inflation is...", "url": "https://..."},
    "zh": {"summary": "通货膨胀是指...", "url": "https://..."},
    "ja": {"summary": "インフレーションとは...", "url": "https://..."}
  },
  "related_terms": ["Deflation", "CPI", "Monetary_Policy"],
  "categories": ["Macroeconomics"]
}
```

Layer 2 (policy alignment) sample record:

```json
{
  "term": "Inflation",
  "pboc": {
    "source": "2024Q3 Monetary Policy Report",
    "text": "Current inflation remains moderate, CPI rose 0.4% YoY..."
  },
  "fed": {
    "source": "2024 December Beige Book",
    "text": "Prices continued to rise modestly across most districts..."
  },
  "similarity": 0.85
}
```

Layer 3 (news sentiment) sample record:

```json
{
  "id": 1,
  "title": "Fed signals slower pace of rate cuts amid sticky inflation",
  "source": "Bloomberg",
  "date": "2024-12-13",
  "related_terms": ["Inflation", "Interest_Rate"],
  "sentiment": {"label": "bearish", "score": 0.82},
  "market_context": {"sp500_change": -0.54}
}
```

- Breaking the single-dimension limitation of traditional corpora
- Full-chain tracking from "term definition → policy application → market reaction"
- Marker solves PDF table/formula parsing challenges
- LLM pre-annotation + human verification, 10x efficiency improvement
- Term frequency overlaid with market index analysis
- Corpus with economic forecasting potential
- Researchers: Policy comparison + trend analysis
- Translators: TMX translation memory
- Analysts: Sentiment monitoring dashboard
- ✅ Only collects public data (government reports, Wikipedia)
- ✅ News storage is limited to summaries/headlines plus original links
- ✅ Compliant with the Wikipedia API User-Agent policy
- ✅ Non-commercial academic research project
We welcome contributors with the following backgrounds:
- Economics/Trade: Term selection, policy interpretation
- Languages/Translation: Doccano annotation verification
- Computer Science: Algorithm optimization, visualization
- Fork this repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see LICENSE for details.
- PDF Parsing: Marker
- Annotation Platform: Doccano
- Semantic Model: Sentence-BERT
- Base Project: TermCorpusGenerator
⭐ If this project helps you, please give us a Star!