⚠️ Note: The `main` branch is read-only. No formulas or metric implementations are released here.
# Collapse Index (CI): A Diagnostic Framework for Bounded, Lightweight, and Reproducible Evaluation of System Instability
https://doi.org/10.5281/zenodo.17718180
A diagnostic framework for instability in complex systems.
- What is CI?
- Collapse Index (CI) Workflow
- Positioning CI
- What is SRI?
- What is the Collapse Log?
- Collapse Log in Context
- Why CI + Collapse Log Matter
- Public Validation
- FAQ
- Roadmap 2025
- Official Status
- License & Attribution
- Author
- Citation
- Sponsors
## What is CI?

CI is a diagnostic framework for detecting when complex systems suddenly fail under small, ordinary stresses.
- Bounded scores (0–1): Clear, comparable measure of instability.
- Lightweight stressors: Simple, benign perturbations (no heavy adversarial pipelines).
- Reproducibility: Each run produces sealed artifact bundles (logs, hashes, plots) for independent verification.
CI complements existing metrics like calibration, robustness, and OOD detection by acting as a tripwire for hidden brittleness.
It is designed for audit, governance, and deployment settings, not leaderboard gaming.
## Collapse Index (CI) Workflow

The Collapse Index (CI) is more than a metric: it's a pipeline.
Each run produces both a bounded CI score and a collapse log (row-level ledger of outcomes),
then seals everything into an audit-grade bundle.
This flowchart shows how CI integrates into evaluation, from setup to governance.
```mermaid
flowchart LR
A["<b>Setup</b><br>Prepare environment + model"] --> B["<b>Generation</b><br>Baseline + stress variants"]
B --> C["<b>Metrics</b><br>Compute collapse signals"]
C --> D["<b>Logging</b><br>Write collapse_log.csv (row-level)"]
D --> Y["<b>Collapse Log</b><br>per-prompt ledger (CSV)"] & X["<b>CI Score</b><br>[0,1] aggregate from log"]
X --> E["<b>Analysis</b><br>Stability vs. collapse"]
Y --> E
E --> F["<b>Reporting</b><br>Summaries · plots · tables"]
F --> G["<b>Archival</b><br>Sealed bundle · checksum"]
G --> H["<b>Governance</b><br>Licenses · disclosure"]
E L_E_C_0@-. iteration .-> C
H L_H_A_0@-. policy/reqs .-> A
A:::eval
B:::eval
C:::eval
D:::eval
Y:::outputs
X:::outputs
E:::eval
F:::audit
G:::audit
H:::audit
L_E_C_0@{ animation: fast }
L_H_A_0@{ animation: fast }
```
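The Archival step in the diagram (sealed bundle + checksum) can be sketched in a few lines. This is an illustrative stand-in, not the official tooling; the file names (`collapse_log.csv`, `MANIFEST.json`) are assumptions:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def seal_bundle(bundle_dir: Path) -> dict:
    """Compute a SHA-256 checksum for every artifact in the bundle and
    write a manifest so results can be verified independently."""
    manifest = {}
    for path in sorted(bundle_dir.glob("*")):
        if path.name == "MANIFEST.json":
            continue  # don't hash the manifest itself
        manifest[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    (bundle_dir / "MANIFEST.json").write_text(json.dumps(manifest, indent=2))
    return manifest

# Demo with a throwaway bundle directory.
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    (bundle / "collapse_log.csv").write_text("prompt_id,label\n12,1\n")
    (bundle / "summary.txt").write_text("CI score: 0.275\n")
    manifest = seal_bundle(bundle)
    print(sorted(manifest))  # ['collapse_log.csv', 'summary.txt']
```

A verifier re-hashes each file and compares against `MANIFEST.json`; any tampering changes the digest.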
The CI framework integrates into the evaluation pipeline at two points:

- Metrics (CI score): collapse quantified into a bounded [0,1] score.
- Collapse Log: detailed, row-level record of every prediction and outcome.

These plug into the broader evaluation cycle (analysis → reporting → archival → governance), producing sealed, audit-grade evidence of system stability.
## Positioning CI

| Method / Paper | Bounded | Stress-based | Lightweight | Audit-aligned | Modality-agnostic |
|---|---|---|---|---|---|
| Collapse Index (CI) | ✅ | ✅ | ✅ | ✅ | ✅ |
| HELM | ❌ | ❌ | ❌ | ❌ | ❌ |
| Calibration / Confidence | ✅ | ❌ | ✅ | ❌ | ✅ |
| OOD Detection | ❌ | Partial | ✅ | ❌ | ❌ |
| Adversarial Robustness | ❌ | ✅ | ❌ | ❌ | ❌ |
| Audit / Repro Standards | ❌ | ❌ | ❌ | ✅ | ✅ |
| Industry Auditors | ❌ | Partial | ❌ | ✅ | ❌ |
- Collapse Index (CI): Defines collapse as structured instability; integrates reproducibility into the diagnostic itself
- HELM: Large-scale, multi-metric evaluation; not bounded, not collapse-specific
- Calibration / Confidence: Improves probability alignment but misrepresents brittleness under stress
- OOD Detection: Captures distributional shift; lacks bounded collapse diagnostics
- Adversarial Robustness: Reveals fragility but computationally heavy; not suited to lightweight diagnostics
- Audit / Repro Standards: Define research process; do not provide diagnostic metrics
- Industry Auditors: Proprietary scores; not bounded or reproducible
## What is SRI?

Structural Retention Index (SRI) is CI's complementary metric for measuring internal reasoning stability.
- CI measures: How much your model cracks under meaning-preserving perturbations
- SRI measures: How well your model holds its decision structure across variants
- Perfect complementarity: CI + SRI = 1.0 (exact)
Models can output consistent predictions while internal reasoning collapses.
CI catches when your model cracks. SRI catches structural decay.
Together, they reveal failures invisible to traditional metrics.
Key insight: A model can have stable predictions but collapsing internal reasoning.
These are the cases that pass QA but fail in production under real-world stress.
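Given the exact complementarity stated above (CI + SRI = 1.0), either score determines the other. A toy illustration only; the underlying formulas are not published:

```python
def sri_from_ci(ci: float) -> float:
    """Under the stated exact complementarity CI + SRI = 1.0,
    SRI is simply the complement of CI. Illustrative only."""
    if not 0.0 <= ci <= 1.0:
        raise ValueError("CI is a bounded score in [0, 1]")
    return 1.0 - ci

# A CI of 0.275 (as in the SST-2 run below) would imply SRI ~ 0.725.
print(round(sri_from_ci(0.275), 3))
```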
SRI validation dataset & generation code: github.com/collapseindex/ci-sri
Published paper: DOI 10.5281/zenodo.18016507
## What is the Collapse Log?

Every run produces a Collapse Log: an audit-grade CSV file that records per-prompt diagnostics, predictions, and human-friendly notes.
Think of it as a flight recorder for brittleness:
- Row-level evidence: Each base input is logged with its confidence, entropy, and error status.
- Interpretive notes: The log adds a plain-English tag (e.g. "error, brittle case", "correct, high conf") so the file can be skimmed by both humans and machines.
- Receipts-grade: The file is bundled alongside hashes and snapshots, ensuring that results are verifiable and audit-ready.
- Portable: CSV format, lightweight, and works across pipelines.
| prompt_id | label | confidence | entropy | is_error | notes |
|---|---|---|---|---|---|
| 12 | 1 | 0.9453 | 0.1124 | 0 | correct, high conf |
| 27 | 0 | 0.4187 | 0.6932 | 1 | error, brittle case |
| 35 | 1 | 0.7321 | 0.3558 | 0 | correct, stable |
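Because the Collapse Log is plain CSV with the columns shown above, it can be skimmed programmatically as well as by eye. A minimal standard-library sketch (the inline sample mirrors the table; in practice you would open `collapse_log.csv` from a run bundle):

```python
import csv
import io

# Sample rows matching the schema shown above (illustrative only).
SAMPLE = """prompt_id,label,confidence,entropy,is_error,notes
12,1,0.9453,0.1124,0,"correct, high conf"
27,0,0.4187,0.6932,1,"error, brittle case"
35,1,0.7321,0.3558,0,"correct, stable"
"""

def brittle_cases(fh):
    """Yield rows flagged as errors, i.e. the brittle cases."""
    for row in csv.DictReader(fh):
        if row["is_error"] == "1":
            yield row

rows = list(brittle_cases(io.StringIO(SAMPLE)))
print([r["prompt_id"] for r in rows])  # ['27']
```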
## Collapse Log in Context

The Collapse Log strengthens any metric by making results transparent and auditable.
Here's how it compares across common baselines vs. CI:
| Metric Family | Without Collapse Log | With Collapse Log | Added Value |
|---|---|---|---|
| Confidence / Entropy | Detects low certainty, but hides row-level behavior | Every prediction, confidence, and entropy recorded | Turns a black-box score into an auditable ledger |
| Calibration / OOD | Reports AUROC or coverage curves only | Logs corner cases, OOD spikes, and per-sample traces | Adds traceability β reviewers can see where failures happened |
| Adversarial Robustness | Heavy compute, aggregate-only | Row-level evidence of stress-test outcomes | Makes robustness runs inspectable without reruns |
| Collapse Index (CI) | Aggregated CI signals | Full collapse-sensitive forensic record (spikes, flips, margins logged) | Collapse Log + CI = audit-grade diagnostics |
Takeaway: the Collapse Log alone adds accountability, but Collapse Log + CI unlocks a unique diagnostic ledger regulators and reviewers can trust.
## Why CI + Collapse Log Matter

AI models don't fail quietly: they collapse.
Traditional metrics often miss brittleness until it causes real-world harm.
- Benchmarks ≠ Reality: models that ace leaderboards can still collapse.
- Liability Risk: a single collapse may trigger recalls, lawsuits, or penalties.
- Audit Gap: standard metrics don't leave receipts; Collapse Log™ does.
- Efficiency: lightweight stressors mean continuous monitoring without massive compute.
- Trust: regulators and enterprises need a score they can verify and a log they can audit.
CI + Collapse Log make collapse measurable, reproducible, and audit-ready before it becomes a public liability.
## Public Validation

We ran Collapse Index on DistilBERT-SST2 (90%+ benchmark accuracy) with 500 sentiment examples from the SST-2 validation set.
Results:
- 42.8% flip rate: nearly half of predictions change under typos/paraphrases
- CI score: 0.275 (minor drift detected)
- 13 silent failures: high confidence (>90%) yet CI detects collapse (CI ≤ 0.45); these bypass traditional monitoring (13 of 35 total high-conf errors)
- AUC(CI): 0.698 vs. AUC(Confidence): 0.515; CI predicts brittleness 18 AUC points better than confidence scores
The gap: Benchmarks say "ship it," but real-world input variations expose massive instability.
Full reproducible dataset & analysis: github.com/collapseindex/ci-sst2
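The silent-failure criterion above (confidence > 90% yet CI ≤ 0.45) amounts to a simple row-level filter. An illustrative sketch with made-up records (field names are assumptions, not the official schema):

```python
def silent_failures(rows, conf_threshold=0.90, ci_threshold=0.45):
    """Flag predictions the model is confident about (> conf_threshold)
    while CI still signals collapse (<= ci_threshold). These are the
    cases that bypass confidence-based monitoring."""
    return [r for r in rows
            if r["confidence"] > conf_threshold and r["ci"] <= ci_threshold]

# Illustrative records, not real run output.
records = [
    {"prompt_id": 12, "confidence": 0.95, "ci": 0.80},  # confident and stable
    {"prompt_id": 27, "confidence": 0.93, "ci": 0.30},  # silent failure
    {"prompt_id": 35, "confidence": 0.55, "ci": 0.20},  # low conf, caught anyway
]
print([r["prompt_id"] for r in silent_failures(records)])  # [27]
```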
## FAQ

❓ Is CI just another benchmark?
➡️ No. CI is not a leaderboard metric; it's a diagnostic. It reveals brittleness under benign stress.

❓ Does CI replace calibration, OOD, or adversarial robustness?
➡️ No. CI complements these methods. It adds a collapse-sensitivity axis and receipts (Collapse Log™).

❓ Is CI adversarial?
➡️ No. CI relies on lightweight, domain-appropriate perturbations (e.g., paraphrases, pixel shifts). Collapse is measured without adversarial tuning.

❓ How reproducible are CI runs?
➡️ Every run emits a full artifact bundle: logs, plots, cryptographic hashes, and a Collapse Log.

❓ Does CI scale?
➡️ Yes. CI stabilizes at a small perturbation budget, so continuous monitoring is feasible without massive compute overhead.
## Roadmap 2025

- Finalize framework draft and publish ✅
- Run additional experiments: scaling to frontier models (e.g., Qwen 7B, Grok, ChatGPT) ✅
- Collaborate with labs and organizations: external validation and pilots
- Build diagnostic software/app: packaging CI + Collapse Log as a tool ✅
## Official Status

Collapse Index (CI) and Collapse Log are not released as open-source software.
There is no official repository providing formulas or internals.
Any third-party code claiming to implement CI or Collapse Log is:
🚫 unofficial, unverified, and not endorsed.
## License & Attribution

- The terms Collapse Index™ (CI) and Collapse Log™ are reserved by the author.
- Unauthorized use or misrepresentation is prohibited.
- This repo does not contain source code or formulas.

See LICENSE.md and CITATION.md.
## Author

Collapse Index Labs (Alex Kwon)

- Website: collapseindex.org
- ORCID: 0009-0002-2566-5538

For evals, datasets, collaborations, or pilots, contact: ask@collapseindex.org
## Citation

If you reference Collapse Index (CI) in your research or evaluations, please cite:
Kwon, A. (2025). Collapse Index (CI): A Diagnostic Framework for Bounded, Lightweight, and Reproducible Evaluation of System Instability (v1.0). Collapse Index Labs. https://doi.org/10.5281/zenodo.17718180
## Sponsors

Collapse Index research is made possible through community support.
Be the first founding Transmission sponsor.
Be the first founding Feedback sponsor.

Sponsor CI on GitHub
