Comprehensive framework for mathematical reasoning research with dual research capabilities


hoadm-net/MathCoRL


MathCoRL - Mathematical Intelligence with Reinforcement Learning

Python 3.8+ · MIT License

Research framework for mathematical reasoning with multiple LLM backends (OpenAI API, Claude API, Open-Source HuggingFace models) and reinforcement learning-based example selection.

🎯 Multi-Backend Research Framework

MathCoRL supports three LLM backends for comprehensive mathematical reasoning research:

πŸ”Œ LLM Provider Support

1. OpenAI API

  • Models: GPT-4o, GPT-4, GPT-3.5-turbo (all variants)
  • Features: Complete API integration with accurate token counting
  • Status: ✅ Fully supported and tested

2. Claude API

  • Models: Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku
  • Features: Native Anthropic API integration via LangChain
  • Status: ✅ Fully supported and tested

3. Open-Source Models (HuggingFace)

  • Models:
    • DeepSeek-R1 (1.5B, 7B, 14B)
    • Qwen2.5-Math (7B, 72B)
  • Features: Local GPU inference, zero API cost
  • Requirements: CUDA GPU recommended (tested on RTX 3090 24GB)
  • Status: ✅ Fully supported with unified interface

πŸ“š Prompting Methods

Compare different prompting techniques:

  • Zero-Shot: Direct problem solving without examples
  • Few-Shot: Random example selection from candidate pool
  • FPP (Function Prototype Prompting): Structured prompting with policy network example selection
  • CoT, PAL, PoT: Additional baseline methods (API models only)

🧠 In-Context Learning (ICL) Research

Compare example selection strategies:

  • Policy Network: Reinforcement learning-based selection
  • KATE: K-nearest neighbors with embeddings
  • CDS: Clustering-based diverse selection
  • Random: Baseline random sampling

πŸ“Š Supported Research Datasets

| Dataset | Domain | Size | Description | ICL k | Both Providers |
|---------|--------|------|-------------|-------|----------------|
| GSM8K | Elementary Math | 8.5K | Grade school math word problems | 2 | ✅ |
| SVAMP | Arithmetic | 1K | Simple arithmetic word problems with variations | 2 | ✅ |
| TabMWP | Tabular Math | 38K | Math problems involving tables and charts | 2 | ✅ |
| TAT-QA | Financial QA | 16K | Table-and-text QA for financial documents | 3 | ✅ |
| FinQA | Financial Analysis | 8K | Complex financial reasoning and calculations | 2 | ✅ |
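
The per-dataset example counts (ICL k) above could be encoded as a simple lookup. A minimal sketch; the names `ICL_K` and `examples_for` are illustrative, not the framework's actual config:

```python
# Hypothetical per-dataset ICL configuration mirroring the table above.
ICL_K = {
    "GSM8K": 2,
    "SVAMP": 2,
    "TabMWP": 2,
    "TAT-QA": 3,
    "FinQA": 2,
}

def examples_for(dataset: str) -> int:
    """Return the number of in-context examples used for a dataset."""
    return ICL_K.get(dataset, 2)  # fall back to k=2 for unknown datasets
```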

Each dataset includes:

  • Training set: For candidate generation and policy training
  • Test set: For evaluation and comparison
  • Cross-provider evaluation: Test with both OpenAI and Claude
  • API cost tracking: Monitor usage across providers

πŸš€ Quick Start

Requirements

  • Python: 3.8+ (tested on 3.10, 3.11, 3.13)
  • Memory: 4GB minimum, 8GB recommended for Policy Network training
  • Storage: 2GB for datasets and embeddings
  • API Keys: OpenAI or Anthropic account with API access

Installation

# Clone repository
git clone https://github.com/your-username/MathCoRL.git
cd MathCoRL

# Install dependencies
pip install -r requirements.txt

# Configure API keys (optional for open-source models)
cp env.example .env
# Edit .env with your API keys:
# OPENAI_API_KEY=your_openai_key        # For API models
# ANTHROPIC_API_KEY=your_anthropic_key  # For Claude
# LLM_PROVIDER=openai                   # Default provider (openai/claude)

Quick Start Examples

Option 1: Open-Source Models (Zero Cost)

# Test with DeepSeek-R1 7B on GSM8K
python mathcorl_os.py test --method zero_shot --model deepseek_r1_7b --dataset GSM8K --samples 10

# Compare all 3 methods (zero-shot, few-shot, fpp+policy)
python mathcorl_os.py compare --model deepseek_r1_7b --dataset GSM8K --samples 50

# Test with Qwen2.5-Math 7B
python mathcorl_os.py compare --model qwen_math_7b --dataset TAT-QA --samples 50

# Available models: deepseek_r1_7b, deepseek_r1_1.5b, qwen_math_7b, qwen_math_72b

Option 2: API Models (OpenAI/Claude)

# Single problem solving
python -m mint.cli solve --method fpp --question "What is 15 + 27?" --provider openai
python -m mint.cli solve --method cot --question "John has 20 apples..." --provider claude

# Dataset evaluation
python -m mint.cli test --method fpp --dataset SVAMP --limit 100 --provider openai
python -m mint.cli test --method cot --dataset GSM8K --limit 50 --provider claude

# Interactive mode
python -m mint.cli interactive --provider openai

Policy Network Training & ICL Research

# Step 1: Generate candidate examples with embeddings
python generate_candidates.py --dataset TAT-QA --n-candidates 30 --seed 42

# Step 2: Train Policy Network for example selection  
python train_policy.py --dataset TAT-QA --epochs 20 --seed 42

# Step 3: Compare ICL methods (works with both API and open-source)
python run_comparison.py --dataset TAT-QA --samples 101 --seed 42

# Test with open-source models + policy network
python mathcorl_os.py test --method fpp_policy --model deepseek_r1_7b --dataset GSM8K --samples 50
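
Step 1 produces a pool of solved examples paired with their embeddings, which Steps 2 and 3 then select from. A hypothetical record shape (field names are illustrative, not the framework's actual schema):

```python
# Hypothetical shape of one candidate record produced by generate_candidates.py.
# Field names and values are illustrative only.
candidate = {
    "question": "Natalia sold 48 clips in April and half as many in May. How many in total?",
    "solution": "48 + 48 / 2 = 72",
    "answer": 72.0,
    "embedding": [0.013, -0.042, 0.118],  # truncated; real vectors are 1536-D
}

def is_valid_candidate(record: dict) -> bool:
    """Minimal sanity check before a record enters the candidate pool."""
    required = {"question", "solution", "answer", "embedding"}
    return required.issubset(record) and len(record["embedding"]) > 0
```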

πŸ”§ Advanced Features

API Tracking & Cost Monitoring (API Models)

# Real-time usage statistics
python -m mint.cli stats                    # All providers, last 24h
python -m mint.cli stats --hours 12         # Last 12 hours
python -m mint.cli stats --provider claude  # Claude only

# Export detailed usage data
python -m mint.cli export --format csv      # CSV export
python -m mint.cli export --format json     # JSON export
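
Conceptually, the tracker aggregates per-call token counts into a dollar total per provider. A minimal sketch, with hypothetical per-1K-token prices (placeholders, not real provider rates):

```python
from dataclasses import dataclass

@dataclass
class UsageRecord:
    """One tracked API call; field names are illustrative."""
    provider: str
    prompt_tokens: int
    completion_tokens: int

# (input, output) dollars per 1K tokens -- placeholder values, not real pricing.
PRICE_PER_1K = {"openai": (0.005, 0.015), "claude": (0.003, 0.015)}

def total_cost(records: list) -> float:
    """Aggregate cost in dollars across all tracked calls."""
    cost = 0.0
    for r in records:
        p_in, p_out = PRICE_PER_1K[r.provider]
        cost += r.prompt_tokens / 1000 * p_in + r.completion_tokens / 1000 * p_out
    return cost
```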

Ablation Studies

# Pool size ablation (ICL research)
python run_pool_size_ablation.py --dataset GSM8K --samples 101

# Method comparison ablation
python run_ablation_study.py --dataset SVAMP --methods fpp,cot,pal

πŸ“ˆ Research Methodology

Prompting Methods

  • Zero-Shot: Direct problem solving without examples
  • Few-Shot: Random k examples from candidate pool
  • FPP (Function Prototype Prompting): Structured reasoning with math functions + policy network selection
  • CoT (Chain-of-Thought): Step-by-step natural language reasoning (API only)
  • PAL/PoT: Program-based reasoning (API only)

ICL Example Selection Strategies

  • Policy Network: Reinforcement learning-based adaptive selection (1536D→768D transformer)
  • KATE: k-Nearest neighbors with embedding similarity
  • CDS: Clustering-based diverse selection
  • Random: Baseline random sampling
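
For instance, KATE-style selection reduces to ranking the candidate pool by embedding similarity to the query and keeping the top k. A minimal sketch in plain Python with toy 2-D embeddings (real embeddings are 1536-D):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def kate_select(query_emb, candidates, k=2):
    """Return the k candidates most similar to the query.

    candidates: list of (example, embedding) pairs.
    """
    ranked = sorted(candidates, key=lambda c: cosine(query_emb, c[1]), reverse=True)
    return [example for example, _ in ranked[:k]]
```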

Multi-Backend Architecture

  • API Models: OpenAI/Claude via REST APIs with token tracking
  • Open-Source: HuggingFace models with local GPU inference
  • Unified Interface: Same prompting methods across all backends
  • Cost Comparison: $0 for open-source vs API pricing
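
The unified interface can be pictured as a small protocol that every backend satisfies, so the same prompting code runs against API and local models. The names below (`LLMClient`, `generate`, `EchoClient`) are illustrative, not MathCoRL's actual API:

```python
from typing import Protocol

class LLMClient(Protocol):
    """Hypothetical unified backend interface."""
    def generate(self, prompt: str) -> str: ...

class EchoClient:
    """Stand-in backend used here instead of a real OpenAI/Claude/HF client."""
    def generate(self, prompt: str) -> str:
        return f"answer to: {prompt}"

def solve(client: LLMClient, question: str) -> str:
    # The same prompting logic runs unchanged regardless of backend.
    return client.generate(question)
```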

πŸ› οΈ Technical Architecture

Core Components

mint/                              # Core package
├── cli.py                         # Unified command-line interface
├── config.py                      # Multi-provider configuration
├── tracking.py                    # Universal API tracking
├── reproducibility.py             # Seed fixing for reproducibility
├── core.py                        # FPP implementation
├── cot.py, pal.py, pot.py         # Alternative prompting methods
├── zero_shot.py                   # Zero-shot baseline
├── icrl/                          # In-Context RL components
│   ├── candidate_generator.py     # Training example extraction
│   ├── policy_network.py          # Neural selection model
│   ├── trainer.py                 # PPO training implementation
│   └── evaluator.py               # Multi-method evaluation
├── utils.py                       # Evaluation utilities
└── testing.py                     # Testing framework

Multi-Provider Workflow

CLI Interface → Provider Selection → Method Execution → Universal Tracking → Results
     ↓                 ↓                    ↓                   ↓
   User Input    [OpenAI|Claude]    [FPP|CoT|PAL|PoT]    Cost/Token Tracking

πŸ† Key Features

Comprehensive Functionality

  • ✅ Dual LLM Provider Support: Full OpenAI and Claude integration
  • ✅ Universal API Tracking: Accurate cost monitoring across providers
  • ✅ Reproducibility: Comprehensive seed fixing for consistent results
  • ✅ Complete Method Suite: 5 prompting methods + 4 ICL strategies
  • ✅ Interactive CLI: Real-time problem solving and testing
  • ✅ Advanced Visualization: Charts, exports, and analysis tools
  • ✅ Reinforcement Learning: Policy network training for example selection
  • ✅ Production Ready: Comprehensive logging, error handling, and documentation

Research Capabilities

  • πŸ”¬ Method Comparison: Systematic evaluation of reasoning approaches
  • πŸ“Š Cross-Provider Analysis: Performance comparison between OpenAI and Claude
  • πŸ’° Cost Optimization: Detailed tracking for budget-conscious research
  • 🎯 ICL Research: Advanced in-context learning with neural selection
  • πŸ“ˆ Scalability: Support for large-scale dataset evaluation
  • πŸ”„ Reproducibility: Comprehensive configuration and result tracking

πŸ“š Documentation

Comprehensive guides are available in the docs/ directory.

πŸŽ“ Research Applications

Prompting Research

  • Compare structured vs. free-form reasoning approaches
  • Evaluate mathematical reasoning capabilities across different LLMs
  • Study cost-effectiveness of different prompting strategies
  • Analyze reasoning quality and interpretability

In-Context Learning Research

  • Investigate optimal example selection strategies
  • Study reinforcement learning for demonstration selection
  • Compare neural vs. similarity-based selection methods
  • Explore curriculum learning effects in mathematical reasoning

Cross-Provider Analysis

  • Evaluate reasoning capabilities: OpenAI vs Claude
  • Compare cost efficiency across providers and methods
  • Study model-specific optimal prompting strategies
  • Analyze scaling laws for mathematical reasoning

Cost Optimization Research

  • Track accuracy per dollar across methods and providers
  • Optimize API usage for budget-constrained environments
  • Study token efficiency patterns in mathematical reasoning

πŸ› οΈ Configuration Options

Environment Variables

# Provider configuration
LLM_PROVIDER=openai                    # Default: openai | claude
OPENAI_API_KEY=your_openai_key         # Required for OpenAI
ANTHROPIC_API_KEY=your_anthropic_key   # Required for Claude

# Model selection
OPENAI_MODEL=gpt-4o-mini              # OpenAI model choice
ANTHROPIC_MODEL=claude-3-5-sonnet-20241022  # Claude model choice

# Generation parameters  
TEMPERATURE=0.1                        # Response randomness
MAX_TOKENS=4000                       # Maximum response length
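
A minimal sketch of how these variables might be resolved into a runtime configuration; the helper name and fallback behavior are illustrative, not mint.config's actual API:

```python
import os

def resolve_provider(env=os.environ) -> dict:
    """Resolve provider, model, and temperature from environment variables."""
    provider = env.get("LLM_PROVIDER", "openai")
    model_var = "OPENAI_MODEL" if provider == "openai" else "ANTHROPIC_MODEL"
    defaults = {
        "OPENAI_MODEL": "gpt-4o-mini",
        "ANTHROPIC_MODEL": "claude-3-5-sonnet-20241022",
    }
    return {
        "provider": provider,
        "model": env.get(model_var, defaults[model_var]),
        "temperature": float(env.get("TEMPERATURE", "0.1")),
    }
```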

Advanced Configuration

# Programmatic configuration
from mint.config import create_llm_client, get_config

# Create provider-specific clients
openai_client = create_llm_client(provider="openai")
claude_client = create_llm_client(provider="claude")

# Access configuration
config = get_config()
print(f"Current provider: {config.provider}")
print(f"Current model: {config.get_current_model_name()}")

🀝 Contributing

See CONTRIBUTING.md for guidelines on:

  • Code style and testing requirements
  • Pull request process
  • Research contribution areas

πŸ› Troubleshooting

Common Issues

Import Error: ModuleNotFoundError: No module named 'mint'

pip install -e .  # Install package in development mode

API Key Error: openai.error.AuthenticationError

# Verify .env file exists and contains valid keys
cat .env | grep API_KEY
export OPENAI_API_KEY=your_key_here  # Set directly if needed

CUDA/MPS Device Error: RuntimeError: MPS backend out of memory

# Use CPU instead of GPU
export PYTORCH_ENABLE_MPS_FALLBACK=1
# Or reduce batch size in configs/hyperparameters.yaml

Embedding Generation Slow: Taking too long on large datasets

# Use smaller candidate pools
python generate_candidates.py --n-candidates 50  # Default is 100

Policy Network Training Unstable: Loss not decreasing

# Adjust learning rate and epochs in configs/hyperparameters.yaml
# Try a lower learning rate (e.g. lr: 0.0001) or train for more epochs

For additional support, see documentation or open an issue on GitHub.

🀝 Contribution Areas

MathCoRL welcomes contributions in:

  • New Prompting Methods: Additional structured reasoning approaches
  • LLM Provider Integration: Support for new language models
  • ICL Strategies: Novel example selection algorithms
  • Datasets: Additional mathematical reasoning domains
  • Evaluation Metrics: Advanced correctness and efficiency measures
  • Cost Optimization: More efficient API usage patterns

πŸ“œ License

This project is licensed under the MIT License - see the LICENSE file for details.
