
Implement Backward Training Pipeline with Sum Cache and Capacity Control #6

@neurlang

Description

Epic: Restructure Classifier Training Pipeline for Deterministic Generalization Control

This pipeline draws inspiration from margin-theory-based classifier selection and capacity control, applied here to a discrete, collision-driven system. After 4 months of empirical research across 140 languages, we have identified the core principle: capacity is trivial, pressure is everything. The current forward training pipeline discards sum information and trains neurons in a suboptimal order, leading to memorization rather than generalization. This epic implements an empirically backed backward pipeline with capacity control based on margin theory.

Core Research Findings:

  1. Training must start with the final neuron to establish dataset bias (phase 0)
  2. Backward training order (final → layer N → ... → layer 1) is optimal for information flow
  3. Premodulo = capacity² controls generalization via forced collisions
  4. Sum cache enables margin-based neuron selection and capacity tuning
  5. The generalization phase begins when train accuracy plateaus (~75% train, ~65% eval). Note: the pipeline does not enforce a fixed train accuracy; it saturates naturally due to capacity constraints.
  6. The "always yes" final neuron is not a bug - it's the bias term learning dataset prior

Detailed 15-Day Implementation Plan

Day 1-2: Foundation & Architecture

Milestone: Core Structures and Interfaces

  • Create new branch: feat/backward-pipeline-v2
  • Design the Pipeline interface that can work in both backward and forward modes
  • Define data structures for SumCache (neuron sums, margins, timestamps)
  • Create configuration system with feature flags for gradual rollout
  • Establish logging framework specifically for margin tracking and capacity decisions

Day 3-4: Sum Cache Implementation

Milestone: Persistent Margin History Storage

  • Implement SumCache as an in-memory store (RAM only, not SQLite/BoltDB) tracking:
    • Neuron ID and layer position
    • Historical sums (S values) in the [-N, +N] range
    • Dataset size (N) at time of recording
    • Calculated margin (m = |S|/N)
    • Premodulo value used during that training run
    • Training epoch and timestamp
  • Design efficient query patterns:
    • Get latest sum for a neuron
    • Get margin trends over last N runs
    • Find neurons in the "gold zone" (ε < |S| < τ)
    • Calculate layer-wise margin distributions
  • Implement cache pruning to prevent unbounded growth
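The structure and query patterns above can be sketched in Go roughly as follows. Type and method names (`SumCache`, `SumRecord`, `Record`, `Latest`) are illustrative, not the final API; only the latest-sum query and pruning are shown.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// SumRecord is one historical observation for a neuron.
type SumRecord struct {
	Sum       int       // S value in [-N, +N]
	N         int       // dataset size at time of recording
	Premodulo uint64    // premodulo used during that training run
	Epoch     int       // training epoch
	Recorded  time.Time // timestamp
}

// Margin is the derived column m = |S|/N.
func (r SumRecord) Margin() float64 {
	return math.Abs(float64(r.Sum)) / float64(r.N)
}

// SumCache keeps per-neuron histories in RAM only.
type SumCache struct {
	maxHistory int                    // pruning bound per neuron
	records    map[string][]SumRecord // key: "layer.position"
}

func NewSumCache(maxHistory int) *SumCache {
	return &SumCache{maxHistory: maxHistory, records: make(map[string][]SumRecord)}
}

// Record appends an observation and prunes the oldest entries
// beyond maxHistory, preventing unbounded growth.
func (c *SumCache) Record(neuronID string, r SumRecord) {
	h := append(c.records[neuronID], r)
	if len(h) > c.maxHistory {
		h = h[len(h)-c.maxHistory:]
	}
	c.records[neuronID] = h
}

// Latest returns the most recent record for a neuron, if any.
func (c *SumCache) Latest(neuronID string) (SumRecord, bool) {
	h := c.records[neuronID]
	if len(h) == 0 {
		return SumRecord{}, false
	}
	return h[len(h)-1], true
}

func main() {
	c := NewSumCache(3)
	for epoch := 0; epoch < 5; epoch++ {
		c.Record("2.7", SumRecord{Sum: 40 - epoch, N: 100, Epoch: epoch, Recorded: time.Now()})
	}
	r, _ := c.Latest("2.7")
	fmt.Printf("latest epoch=%d margin=%.2f\n", r.Epoch, r.Margin())
}
```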

Day 5-6: Capacity Control Engine

Milestone: Premodulo Decision System

  • Implement the capacity control law derived from research:
    1. Target capacity = N / (|S| + 1)
    2. Premodulo = capacity²
    3. Clamp changes to prevent oscillation (max 10× change per step); later add further damping options:
    • moving average of the last k sums
    • exponential smoothing instead of the instantaneous sum
    • limiting the premodulo change to a fixed percent per step
  • Create three operational regimes:
    1. Collapsed neurons (|S| ≈ N): Increase collisions (lower capacity)
    2. Noisy neurons (|S| ≈ 0): Increase capacity (reduce collisions)
    3. Gold zone neurons (ε < |S| < τ): Fine-tune capacity
  • Add safety bounds: capacity between 1 and N
  • Implement change dampening using weighted moving averages
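A minimal sketch of the capacity law above, with the safety bounds and the 10× clamp. The helper name `NextPremodulo` and its signature are hypothetical; smoothing and moving averages are left out.

```go
package main

import (
	"fmt"
	"math"
)

// NextPremodulo applies the capacity control law:
// target capacity = N/(|S|+1), premodulo = capacity²,
// with capacity bounded to [1, N] and the premodulo change
// clamped to at most a 10x step to prevent oscillation.
func NextPremodulo(sum int, n int, current float64) float64 {
	capacity := float64(n) / (math.Abs(float64(sum)) + 1)
	// Safety bounds: capacity between 1 and N.
	capacity = math.Max(1, math.Min(capacity, float64(n)))
	target := capacity * capacity
	// Clamp to at most a 10x change per step (skipped when
	// no previous premodulo is known).
	if current > 0 {
		target = math.Min(target, current*10)
		target = math.Max(target, current/10)
	}
	return target
}

func main() {
	// Collapsed neuron (|S| ≈ N): capacity ≈ 1, premodulo driven down.
	fmt.Println(NextPremodulo(99, 100, 10000)) // 1000 (target 1, limited to a 10x decrease)
	// Noisy neuron (|S| ≈ 0): capacity ≈ N, premodulo driven up.
	fmt.Println(NextPremodulo(0, 100, 100)) // 1000 (target 10000, limited to a 10x increase)
}
```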

Day 7-8: Neuron Selection Algorithm

Milestone: Intelligent Training Order Decisions

  • Implement "gold zone" detection algorithm:
    • Noise floor (ε) = √N
    • Collapse threshold (τ) = 0.8N
    • Gold zone = neurons where ε < |S| < τ
  • Create neuron prioritization logic:
    1. First priority: Neurons in gold zone (highest information gain)
    2. Second priority: Neurons with moderate margins
    3. Last resort: Random selection for exploration
  • For layer selection: Always train backward (final → ... → input)
  • Stretch goal: Pairwise selection for anti-correlated neuron pairs
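The detection thresholds above reduce to a small predicate. `InGoldZone` is an illustrative name; ε = √N and τ = 0.8N are the plan's defaults and remain heuristics.

```go
package main

import (
	"fmt"
	"math"
)

// InGoldZone reports whether a neuron's last sum S falls in the
// gold zone ε < |S| < τ, with noise floor ε = √N and collapse
// threshold τ = 0.8·N.
func InGoldZone(sum, n int) bool {
	abs := math.Abs(float64(sum))
	eps := math.Sqrt(float64(n))
	tau := 0.8 * float64(n)
	return abs > eps && abs < tau
}

func main() {
	// N = 100: ε = 10, τ = 80.
	fmt.Println(InGoldZone(5, 100))   // false: below the noise floor
	fmt.Println(InGoldZone(40, 100))  // true: inside the gold zone
	fmt.Println(InGoldZone(-90, 100)) // false: collapsed
}
```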

Day 9-10: Backward Pipeline Orchestration

Milestone: Complete Pipeline Restructuring

  • Implement Phase 0: Bootstrap final neuron
    • Set initial premodulo = N (dataset size)
    • Train with heavy collisions to learn dataset prior
    • This establishes the bias term
  • Implement Phase 1: Backward progression
    • After final neuron, move to previous layer
    • For each layer, select optimal neuron using above algorithm
    • Calculate appropriate premodulo based on last known sum
    • Train, record results to sum cache, move backward
  • Add pipeline state persistence for crash recovery
  • Implement forward skip mechanism if backward gets stuck
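The two phases above reduce to a simple visit order over layers. This sketch captures only the ordering (final layer first, then backward toward the input); neuron selection, premodulo calculation, and state persistence are omitted.

```go
package main

import "fmt"

// BackwardOrder returns the layer visit order for the pipeline:
// phase 0 starts at the final layer (to bootstrap the final
// neuron with premodulo = N), then phase 1 walks backward
// layer by layer toward layer 1.
func BackwardOrder(numLayers int) []int {
	order := make([]int, 0, numLayers)
	for layer := numLayers; layer >= 1; layer-- {
		order = append(order, layer)
	}
	return order
}

func main() {
	fmt.Println(BackwardOrder(4)) // [4 3 2 1]
}
```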

Day 11-12: Integration & Validation Testing

Milestone: End-to-End Working System

  • Replace forward training loops with backward orchestration
  • Create comprehensive test suite:
    • Test final neuron trains first in all scenarios
    • Test premodulo adaptation follows capacity law
    • Test backward ordering is maintained
    • Test sum cache consistency across restarts
    • Test generalization gap preservation
  • Performance benchmarking:
    • Compare training time vs current pipeline
    • Measure memory usage of sum cache
    • Validate no deadlocks or infinite loops
  • A/B testing framework for gradual rollout

Day 13: Monitoring & Observability

Milestone: Real-time Training Insights

  • Implement dashboard showing:
    • Current training layer and neuron
    • Margin distribution across layers
    • Premodulo values and capacity calculations
    • Gold zone neuron count per layer
    • Train vs eval accuracy divergence
  • Add alerting for critical conditions:
    • Train accuracy > 95% (memorization risk)
    • Margin stagnation (>3 epochs no change)
    • Premodulo oscillation detected
    • Pipeline stuck in one layer
  • Create visualization of backward flow through network
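The alert conditions above can be checked with a small helper. This is a sketch only; `TrainingSnapshot` and its fields are illustrative, and the thresholds mirror the defaults listed in the plan.

```go
package main

import "fmt"

// TrainingSnapshot carries the signals the alert checks need.
type TrainingSnapshot struct {
	TrainAccuracy  float64 // fraction, e.g. 0.97 for 97%
	EpochsStagnant int     // consecutive epochs with no margin change
}

// CheckAlerts flags the critical conditions from the plan:
// memorization risk (train > 95%) and margin stagnation
// (> 3 epochs without change).
func CheckAlerts(s TrainingSnapshot) []string {
	var alerts []string
	if s.TrainAccuracy > 0.95 {
		alerts = append(alerts, "memorization risk: train accuracy > 95%")
	}
	if s.EpochsStagnant > 3 {
		alerts = append(alerts, "margin stagnation: no change for > 3 epochs")
	}
	return alerts
}

func main() {
	fmt.Println(CheckAlerts(TrainingSnapshot{TrainAccuracy: 0.97, EpochsStagnant: 5}))
}
```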

Day 14: Polish & Production Readiness

Milestone: Deployment Preparation

  • Complete API documentation for new pipeline
  • Write migration guide for existing 140 language models
  • Create configuration templates with sensible defaults:
    • Pipeline direction (backward/forward)
    • Gold zone thresholds
    • Premodulo adaptation aggressiveness
    • Cache retention policies
  • Performance optimizations:
    • Batch updates to sum cache
    • Async logging for performance-critical paths
    • Memory-efficient margin calculations
  • Add feature flags for controlled rollout

Day 15: Deployment & Live Validation

Milestone: Successful Migration

  • Deploy to staging environment with 3 representative languages
  • Run A/B tests: 50% backward pipeline, 50% current pipeline
  • Validate success metrics:
    1. Generalization gap maintained
    2. Training order correct
    3. Capacity adaptation working
    4. Performance within acceptable bounds
  • Gradual rollout to all 140 languages
  • Monitor for 24 hours with enhanced logging
  • Final verification: All languages successfully migrated

Technical Architecture

Data Flow

  1. Initialization: Load model, initialize sum cache, set final neuron premodulo = N
  2. Phase 0: Train final neuron with high collisions, record sum
  3. Phase 1: For each layer (backward):
    • Query sum cache for neuron margins
    • Select neuron using gold zone algorithm
    • Calculate new premodulo using capacity law
    • Train neuron with calculated premodulo
    • Record results to sum cache
    • Move to previous layer
  4. Monitoring: Continuously update dashboards, check alerts

Key Algorithms

  1. Gold Zone Detection:

    • Input: Neuron sum S, dataset size N (proxy for dataset complexity)
    • Calculate: margin = |S|/N
    • Gold zone: 1/√N < margin < 0.8 (plausible defaults, though still heuristics)
    • Output: Boolean (is in gold zone)
  2. Capacity Calculation:

    • Input: S (last sum), N (dataset size), current premodulo
    • Calculate: target_capacity = N/(|S|+1)
    • Calculate: new_premodulo = target_capacity²
    • Apply: clamping and dampening
    • Output: New premodulo value
  3. Neuron Selection:

    • Input: List of neurons in current layer
    • Filter: Find neurons in gold zone
    • If found: Select neuron with margin closest to middle of gold zone
    • If not: Select neuron with moderate margin
    • Output: Selected neuron ID
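The selection algorithm above, sketched under the assumption that each neuron carries its last recorded sum. `Neuron` and `SelectNeuron` are illustrative names; the function assumes a non-empty candidate list.

```go
package main

import (
	"fmt"
	"math"
)

// Neuron pairs an ID with its last recorded sum S.
type Neuron struct {
	ID  string
	Sum int
}

// SelectNeuron picks, among gold-zone neurons (√N < |S| < 0.8N),
// the one whose |S| is closest to the middle of the zone; if none
// qualify, it falls back to the neuron with the most moderate |S|.
func SelectNeuron(neurons []Neuron, n int) Neuron {
	eps := math.Sqrt(float64(n))
	tau := 0.8 * float64(n)
	mid := (eps + tau) / 2
	best, bestDist := neurons[0], math.Inf(1)
	inZone := false
	for _, nu := range neurons {
		abs := math.Abs(float64(nu.Sum))
		zone := abs > eps && abs < tau
		if zone && !inZone {
			// First gold-zone hit restricts the search to zone members.
			inZone, best, bestDist = true, nu, math.Abs(abs-mid)
			continue
		}
		if zone == inZone {
			if d := math.Abs(abs - mid); d < bestDist {
				best, bestDist = nu, d
			}
		}
	}
	return best
}

func main() {
	// N = 100: gold zone is 10 < |S| < 80, middle 45.
	neurons := []Neuron{{"1.0", 2}, {"1.1", 45}, {"1.2", 95}}
	fmt.Println(SelectNeuron(neurons, 100).ID) // 1.1
}
```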

Storage Schema

The SumCache stores:

  • Neuron identifier (layer.neuron_position)
  • Training run identifier
  • Timestamp
  • S value (sum of boolean outputs)
  • N value (dataset size at time)
  • Calculated margin (derived column)
  • Premodulo used
  • Training duration
  • Result metrics (accuracy, etc.)

Success Metrics

Primary Metrics (Must Achieve):

  1. Training Order Compliance: 100% of runs start with final neuron
  2. Capacity Adaptation: Premodulo values correlate with margin (R² > 0.7)
  3. Generalization Preservation: Train accuracy saturates at 60-80% while eval accuracy keeps rising
  4. Backward Flow: Layers trained in correct backward order
  5. All Languages Work: All 140 languages train successfully

Secondary Metrics (Should Achieve):

  1. Performance: <10% training time increase vs current pipeline
  2. Cache Efficiency: <100MB memory for sum cache at scale
  3. Stability: No pipeline deadlocks or crashes
  4. Observability: All metrics available in dashboard

Risk Assessment & Mitigation

| Risk | Probability | Impact | Mitigation Strategy |
| --- | --- | --- | --- |
| Stale sum data | Medium | High | Use weighted average of last 3 runs; mark stale data |
| Pipeline deadlock | Low | Critical | Timeout + forward skip after 3 attempts |
| Cache corruption | Low | High | Regular backups, checksum validation |
| Premodulo oscillation | Medium | Medium | Change dampening, bounds checking |
| Performance degradation | Medium | Medium | Feature flag rollback, optimization passes |
| Migration failure | Low | Critical | A/B testing, gradual rollout, rollback plan |

Dependencies

  • Existing neuron table format and storage
  • Current dataset loaders and preprocessing
  • Evaluation metrics system
  • Model persistence layer

Rollback Plan

Three-tier rollback strategy:

  1. Soft Rollback: Feature flag disabled → revert to forward pipeline
  2. Medium Rollback: Remove sum cache influence but keep structure
  3. Hard Rollback: Complete code revert to previous commit

Each tier has decreasing impact and increasing implementation time.

Resources Required

  • Development: 1 senior Go engineer (15 days)
  • Testing: 1 QA engineer (5 days overlap)
  • Infrastructure: Minor memory overhead for the in-memory sum cache
  • Monitoring: Enhanced dashboard development

Related Work

  • Previous research: 4 months of empirical testing across 140 languages
  • Research findings documented in internal wiki

Acceptance Criteria

Must Have:

  • Final neuron trains first in 100% of training runs
  • Sum cache persists across training sessions and restarts
  • Premodulo adapts according to capacity law (N/(|S|+1))²
  • Neurons selected from gold zone when available
  • Backward ordering maintained: final → layer N → ... → layer 1
  • Generalization gap preserved (not chasing 100% train accuracy)
  • All 140 existing languages train successfully
  • Performance within 10% of current pipeline

Should Have:

  • Real-time monitoring dashboard
  • Alerting for memorization risk (train > 95%)
  • Configuration system for tuning parameters
  • Migration path for existing models
  • Comprehensive test coverage (>80%)

Could Have:

  • Pairwise neuron training for anti-correlated pairs
  • Predictive generalization phase detection
  • Automated capacity tuning recommendations
  • Historical analysis of margin trends

Timeline

Total: 15 days (aggressive but achievable)
Start Date: ASAP
Expected Completion: ASAP + 15 days


Priority: P0 (Critical for research progress)
Confidence: High (backed by extensive empirical evidence)
Impact: Transformational (moves from heuristic to theory-driven training)

This implementation represents the culmination of 4 months of research. The backward pipeline with sum cache and capacity control is the most important feature needed right now.

Labels: enhancement (New feature or request)