Epic: Restructure Classifier Training Pipeline for Deterministic Generalization Control
This pipeline draws inspiration from margin-theory-based classifier selection and capacity control, applied here to a discrete collision-driven system. After 4 months of empirical research across 140 languages, we've identified the core principle: capacity is trivial, pressure is everything. Our current forward training pipeline discards sum information and trains neurons in a suboptimal order, leading to memorization rather than generalization. This epic implements a scientifically backed backward pipeline with capacity control based on margin theory.
Core Research Findings:
- Training must start with the final neuron to establish dataset bias (phase 0)
- Backward training order (final → layer N → ... → layer 1) is optimal for information flow
- Premodulo = capacity² controls generalization via forced collisions
- Sum cache enables margin-based neuron selection and capacity tuning
- Generalization phase begins once train accuracy plateaus (~75% train, ~65% eval). Note: the pipeline does not enforce a fixed train-accuracy cap; accuracy saturates naturally due to capacity constraints.
- The "always yes" final neuron is not a bug; it is the bias term learning the dataset prior
Detailed 15-Day Implementation Plan
Day 1-2: Foundation & Architecture
Milestone: Core Structures and Interfaces
- Create new branch: feat/backward-pipeline-v2
- Design the Pipeline interface that can work in both backward and forward modes
- Define data structures for SumCache (neuron sums, margins, timestamps)
- Create configuration system with feature flags for gradual rollout
- Establish logging framework specifically for margin tracking and capacity decisions
Day 3-4: Sum Cache Implementation
Milestone: Persistent Margin History Storage
- Implement SumCache as an in-memory store (RAM only, not SQLite/BoltDB) tracking:
- Neuron ID and layer position
- Historical sums (S values) from [-N, +N] range
- Dataset size (N) at time of recording
- Calculated margin (m = |S|/N)
- Premodulo value used during that training run
- Training epoch and timestamp
- Design efficient query patterns:
- Get latest sum for a neuron
- Get margin trends over last N runs
- Find neurons in the "gold zone" (ε < m < τ)
- Calculate layer-wise margin distributions
- Implement cache pruning to prevent unbounded growth
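As a sketch of how the sum cache and its "latest sum" query might look, here is a minimal in-memory version in Go (the project's language). All type and method names here are illustrative assumptions, not an existing API:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// SumRecord is one cached observation for a neuron.
// Field names are illustrative, not the final schema.
type SumRecord struct {
	Sum       int // S, in [-N, +N]
	N         int // dataset size at recording time
	Premodulo uint64
	Epoch     int
	At        time.Time
}

// Margin derives m = |S|/N from the stored values.
func (r SumRecord) Margin() float64 {
	return math.Abs(float64(r.Sum)) / float64(r.N)
}

// SumCache keeps per-neuron history in RAM only, keyed by "layer.position".
type SumCache struct {
	history map[string][]SumRecord
	maxLen  int // per-neuron pruning bound, prevents unbounded growth
}

func NewSumCache(maxLen int) *SumCache {
	return &SumCache{history: make(map[string][]SumRecord), maxLen: maxLen}
}

// Record appends an observation, pruning the oldest entries past maxLen.
func (c *SumCache) Record(id string, r SumRecord) {
	h := append(c.history[id], r)
	if len(h) > c.maxLen {
		h = h[len(h)-c.maxLen:]
	}
	c.history[id] = h
}

// Latest answers the "get latest sum for a neuron" query.
func (c *SumCache) Latest(id string) (SumRecord, bool) {
	h := c.history[id]
	if len(h) == 0 {
		return SumRecord{}, false
	}
	return h[len(h)-1], true
}

func main() {
	c := NewSumCache(100)
	c.Record("1.0", SumRecord{Sum: -30, N: 100, Epoch: 1, At: time.Now()})
	if r, ok := c.Latest("1.0"); ok {
		fmt.Printf("latest sum=%d margin=%.2f\n", r.Sum, r.Margin())
	}
}
```

The margin-trend and gold-zone queries would iterate the same per-neuron slices; they are omitted here for brevity.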
Day 5-6: Capacity Control Engine
Milestone: Premodulo Decision System
- Implement the capacity control law derived from research:
- Target capacity = N / (|S| + 1)
- Premodulo = capacity²
- Clamp changes to prevent oscillation (max 10x change per step); later add further damping options:
- Use a moving average of the last k sums
- Use exponential smoothing instead of the instantaneous sum
- Limit premodulo change to a fixed percent per step
- Create three operational regimes:
- Collapsed neurons (|S| ≈ N): Increase collisions (lower capacity)
- Noisy neurons (|S| ≈ 0): Increase capacity (reduce collisions)
- Gold zone neurons (ε < |S| < τ): Fine-tune capacity
- Add safety bounds: capacity between 1 and N
- Implement change dampening using weighted moving averages
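The capacity law with its safety bounds and the 10x oscillation clamp can be sketched in Go as follows. `NextPremodulo` is a hypothetical name, and the moving-average and exponential-smoothing options are left out:

```go
package main

import (
	"fmt"
	"math"
)

// NextPremodulo applies the capacity law from the plan:
//   target capacity = N / (|S| + 1),  premodulo = capacity²,
// with capacity bounded to [1, N] and the premodulo limited to at most
// a 10x change per step to prevent oscillation.
func NextPremodulo(sum, n int, current float64) float64 {
	capacity := float64(n) / (math.Abs(float64(sum)) + 1)
	capacity = math.Min(math.Max(capacity, 1), float64(n)) // safety bounds
	target := capacity * capacity
	if current > 0 { // clamp step size relative to the current premodulo
		target = math.Min(target, current*10)
		target = math.Max(target, current/10)
	}
	return target
}

func main() {
	// Collapsed neuron (|S| ≈ N): capacity drops toward 1, collisions increase.
	fmt.Println(NextPremodulo(99, 100, 0))
	// Noisy neuron (|S| ≈ 0): capacity rises toward N, subject to the 10x clamp.
	fmt.Println(NextPremodulo(0, 100, 100))
}
```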
Day 7-8: Neuron Selection Algorithm
Milestone: Intelligent Training Order Decisions
- Implement "gold zone" detection algorithm:
- Noise floor (ε) = √N
- Collapse threshold (τ) = 0.8N
- Gold zone = neurons where ε < |S| < τ
- Create neuron prioritization logic:
- First priority: Neurons in gold zone (highest information gain)
- Second priority: Neurons with moderate margins
- Last resort: Random selection for exploration
- For layer selection: Always train backward (final → ... → input)
- Stretch goal: Pairwise selection for anti-correlated neuron pairs
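A minimal Go sketch of the gold-zone test, using the plan's default thresholds ε = √N and τ = 0.8N (heuristics, as noted later in this issue); `InGoldZone` is an illustrative name:

```go
package main

import (
	"fmt"
	"math"
)

// InGoldZone reports whether a neuron's sum S falls in the informative band
// ε < |S| < τ, with noise floor ε = √N and collapse threshold τ = 0.8N.
func InGoldZone(sum, n int) bool {
	abs := math.Abs(float64(sum))
	eps := math.Sqrt(float64(n))
	tau := 0.8 * float64(n)
	return abs > eps && abs < tau
}

func main() {
	fmt.Println(InGoldZone(3, 100))  // |S| = 3 ≤ √100 = 10: below the noise floor
	fmt.Println(InGoldZone(40, 100)) // 10 < 40 < 80: in the gold zone
	fmt.Println(InGoldZone(90, 100)) // 90 ≥ 80: collapsed
}
```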
Day 9-10: Backward Pipeline Orchestration
Milestone: Complete Pipeline Restructuring
- Implement Phase 0: Bootstrap final neuron
- Set initial premodulo = N (dataset size)
- Train with heavy collisions to learn dataset prior
- This establishes the bias term
- Implement Phase 1: Backward progression
- After final neuron, move to previous layer
- For each layer, select optimal neuron using above algorithm
- Calculate appropriate premodulo based on last known sum
- Train, record results to sum cache, move backward
- Add pipeline state persistence for crash recovery
- Implement forward skip mechanism if backward gets stuck
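The two phases above can be sketched as a Go loop. `Trainer`, `RunBackward`, and the stub are hypothetical scaffolding to show the ordering, not the real pipeline; sum-cache updates and the forward-skip mechanism are omitted:

```go
package main

import "fmt"

// Trainer abstracts one neuron-training step; the real pipeline would update
// weights and report the observed sum S. Stubbed here for illustration.
type Trainer interface {
	TrainNeuron(layer, neuron int, premodulo float64) (sum int)
}

// RunBackward sketches the two phases: Phase 0 bootstraps the final neuron
// with premodulo = N (heavy collisions, learns the dataset prior), then
// Phase 1 walks layers backward, sizing each premodulo from the last known
// sum via the capacity law premodulo = (N/(|S|+1))².
func RunBackward(t Trainer, layers, n int, pick func(layer int) int) []int {
	order := []int{layers}
	lastSum := t.TrainNeuron(layers, 0, float64(n)) // Phase 0
	for layer := layers - 1; layer >= 1; layer-- {  // Phase 1: backward
		capacity := float64(n) / (absInt(lastSum) + 1)
		lastSum = t.TrainNeuron(layer, pick(layer), capacity*capacity)
		order = append(order, layer)
	}
	return order
}

func absInt(x int) float64 {
	if x < 0 {
		return float64(-x)
	}
	return float64(x)
}

// stubTrainer returns a fixed sum, enough to demonstrate the ordering.
type stubTrainer struct{}

func (stubTrainer) TrainNeuron(layer, neuron int, premodulo float64) int { return 0 }

func main() {
	order := RunBackward(stubTrainer{}, 3, 100, func(int) int { return 0 })
	fmt.Println(order) // final layer first, then strictly backward
}
```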
Day 11-12: Integration & Validation Testing
Milestone: End-to-End Working System
- Replace forward training loops with backward orchestration
- Create comprehensive test suite:
- Test final neuron trains first in all scenarios
- Test premodulo adaptation follows capacity law
- Test backward ordering is maintained
- Test sum cache consistency across restarts
- Test generalization gap preservation
- Performance benchmarking:
- Compare training time vs current pipeline
- Measure memory usage of sum cache
- Validate no deadlocks or infinite loops
- A/B testing framework for gradual rollout
Day 13: Monitoring & Observability
Milestone: Real-time Training Insights
- Implement dashboard showing:
- Current training layer and neuron
- Margin distribution across layers
- Premodulo values and capacity calculations
- Gold zone neuron count per layer
- Train vs eval accuracy divergence
- Add alerting for critical conditions:
- Train accuracy > 95% (memorization risk)
- Margin stagnation (no change for >3 epochs)
- Premodulo oscillation detected
- Pipeline stuck in one layer
- Create visualization of backward flow through network
Day 14: Polish & Production Readiness
Milestone: Deployment Preparation
- Complete API documentation for new pipeline
- Write migration guide for existing 140 language models
- Create configuration templates with sensible defaults:
- Pipeline direction (backward/forward)
- Gold zone thresholds
- Premodulo adaptation aggressiveness
- Cache retention policies
- Performance optimizations:
- Batch updates to sum cache
- Async logging for performance-critical paths
- Memory-efficient margin calculations
- Add feature flags for controlled rollout
Day 15: Deployment & Live Validation
Milestone: Successful Migration
- Deploy to staging environment with 3 representative languages
- Run A/B tests: 50% backward pipeline, 50% current pipeline
- Validate success metrics:
- Generalization gap maintained
- Training order correct
- Capacity adaptation working
- Performance within acceptable bounds
- Gradual rollout to all 140 languages
- Monitor for 24 hours with enhanced logging
- Final verification: All languages successfully migrated
Technical Architecture
Data Flow
- Initialization: Load model, initialize sum cache, set final neuron premodulo = N
- Phase 0: Train final neuron with high collisions, record sum
- Phase 1: For each layer (backward):
- Query sum cache for neuron margins
- Select neuron using gold zone algorithm
- Calculate new premodulo using capacity law
- Train neuron with calculated premodulo
- Record results to sum cache
- Move to previous layer
- Monitoring: Continuously update dashboards, check alerts
Key Algorithms
- Gold Zone Detection:
- Input: Neuron sum S, dataset size N (proxy for dataset complexity)
- Calculate: margin = |S|/N
- Gold zone: 1/√N < margin < 0.8 (plausible defaults, but still heuristics)
- Output: Boolean (is in gold zone)
- Capacity Calculation:
- Input: S (last sum), N (dataset size), current premodulo
- Calculate: target_capacity = N/(|S|+1)
- Calculate: new_premodulo = target_capacity²
- Apply: clamping and dampening
- Output: New premodulo value
- Neuron Selection:
- Input: List of neurons in current layer
- Filter: Find neurons in gold zone
- If found: Select neuron with margin closest to middle of gold zone
- If not: Select neuron with moderate margin
- Output: Selected neuron ID
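A Go sketch of this selection rule, with the random-exploration last resort omitted; `Candidate` and `SelectNeuron` are illustrative names, and thresholds are given on the margin scale (ε/N, τ/N):

```go
package main

import (
	"fmt"
	"math"
)

// Candidate pairs a neuron ID with its last recorded margin m = |S|/N.
type Candidate struct {
	ID     string
	Margin float64
}

// SelectNeuron prefers gold-zone neurons (eps < m < tau), picking the one
// closest to the zone's midpoint; if none qualify, it falls back to the
// most moderate margin overall.
func SelectNeuron(cands []Candidate, eps, tau float64) string {
	mid := (eps + tau) / 2
	if id, ok := closest(cands, mid, func(m float64) bool { return m > eps && m < tau }); ok {
		return id
	}
	id, _ := closest(cands, mid, func(float64) bool { return true })
	return id
}

// closest returns the candidate whose margin is nearest to mid,
// among those accepted by the keep filter.
func closest(cands []Candidate, mid float64, keep func(float64) bool) (string, bool) {
	best, bestDist, found := "", math.Inf(1), false
	for _, c := range cands {
		if !keep(c.Margin) {
			continue
		}
		if d := math.Abs(c.Margin - mid); d < bestDist {
			best, bestDist, found = c.ID, d, true
		}
	}
	return best, found
}

func main() {
	cands := []Candidate{{"2.1", 0.05}, {"2.3", 0.50}, {"2.7", 0.95}}
	fmt.Println(SelectNeuron(cands, 0.1, 0.8)) // only "2.3" is in the gold zone
}
```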
Storage Schema
The SumCache stores:
- Neuron identifier (layer.neuron_position)
- Training run identifier
- Timestamp
- S value (sum of boolean outputs)
- N value (dataset size at time)
- Calculated margin (derived column)
- Premodulo used
- Training duration
- Result metrics (accuracy, etc.)
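One plausible Go shape for a cache row, treating margin as a derived value rather than a second source of truth; all field names are illustrative assumptions:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// CacheRow mirrors the storage schema above: one row per training run of one
// neuron, with the margin derived from the stored S and N columns.
type CacheRow struct {
	NeuronID  string // "layer.neuron_position", e.g. "3.17"
	RunID     string
	At        time.Time
	Sum       int // S, sum of boolean outputs in [-N, +N]
	N         int // dataset size at recording time
	Premodulo uint64
	Duration  time.Duration
	Metrics   map[string]float64 // e.g. train/eval accuracy
}

// Margin derives m = |S|/N (the "derived column" in the schema).
func (r CacheRow) Margin() float64 {
	return math.Abs(float64(r.Sum)) / float64(r.N)
}

func main() {
	row := CacheRow{NeuronID: "3.17", Sum: -30, N: 100, At: time.Now()}
	fmt.Printf("%s margin=%.2f\n", row.NeuronID, row.Margin())
}
```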
Success Metrics
Primary Metrics (Must Achieve):
- Training Order Compliance: 100% of runs start with final neuron
- Capacity Adaptation: Premodulo values correlate with margin (R² > 0.7)
- Generalization Preservation: Train accuracy saturates at 60-80% while eval accuracy keeps rising
- Backward Flow: Layers trained in correct backward order
- All Languages Work: All 140 languages train successfully
Secondary Metrics (Should Achieve):
- Performance: <10% training time increase vs current pipeline
- Cache Efficiency: <100MB memory for sum cache at scale
- Stability: No pipeline deadlocks or crashes
- Observability: All metrics available in dashboard
Risk Assessment & Mitigation
| Risk | Probability | Impact | Mitigation Strategy |
|---|---|---|---|
| Stale sum data | Medium | High | Use weighted average of last 3 runs, mark stale data |
| Pipeline deadlock | Low | Critical | Timeout + forward skip after 3 attempts |
| Cache corruption | Low | High | Regular backups, checksum validation |
| Premodulo oscillation | Medium | Medium | Change dampening, bounds checking |
| Performance degradation | Medium | Medium | Feature flag rollback, optimization passes |
| Migration failure | Low | Critical | A/B testing, gradual rollout, rollback plan |
Dependencies
- Existing neuron table format and storage
- Current dataset loaders and preprocessing
- Evaluation metrics system
- Model persistence layer
Rollback Plan
Three-tier rollback strategy:
- Soft Rollback: Feature flag disabled → revert to forward pipeline
- Medium Rollback: Remove sum cache influence but keep structure
- Hard Rollback: Complete code revert to previous commit
Each successive tier is more invasive and takes longer to implement.
Resources Required
- Development: 1 senior Go engineer (15 days)
- Testing: 1 QA engineer (5 days overlap)
- Infrastructure: Minor additional memory for the in-memory sum cache
- Monitoring: Enhanced dashboard development
Related Work
- Previous research: 4 months of empirical testing across 140 languages
- Research findings documented in internal wiki
Acceptance Criteria
Must Have:
- Final neuron trains first in 100% of training runs
- Sum cache persists across training sessions and restarts
- Premodulo adapts according to capacity law (N/(|S|+1))²
- Neurons selected from gold zone when available
- Backward ordering maintained: final → layer N → ... → layer 1
- Generalization gap preserved (not chasing 100% train accuracy)
- All 140 existing languages train successfully
- Performance within 10% of current pipeline
Should Have:
- Real-time monitoring dashboard
- Alerting for memorization risk (train > 95%)
- Configuration system for tuning parameters
- Migration path for existing models
- Comprehensive test coverage (>80%)
Could Have:
- Pairwise neuron training for anti-correlated pairs
- Predictive generalization phase detection
- Automated capacity tuning recommendations
- Historical analysis of margin trends
Assignees
- Pipeline Architecture: @neurlang
- Sum Cache Implementation: @neurlang
- Testing & Validation: @neurlang
- Monitoring & Dashboard: @neurlang
Timeline
Total: 15 days (aggressive but achievable)
Start Date: ASAP
Expected Completion: ASAP + 15 days
Priority: P0 (Critical for research progress)
Confidence: High (backed by extensive empirical evidence)
Impact: Transformational (moves from heuristic to theory-driven training)
This implementation represents the culmination of 4 months of research. The backward pipeline with sum cache and capacity control is the most important feature needed right now.