
Collapse Index Banner

⚠️ Note: The main branch is read-only. No formulas or metric implementations are released here.

DOI

📄 Collapse Index (CI): A Diagnostic Framework for Bounded, Lightweight, and Reproducible Evaluation of System Instability
https://doi.org/10.5281/zenodo.17718180


🏷️ Project Identity

Website ORCID Contact

📦 Repository Status

Status License Terms CoC Security

💬 Community & Support

Sponsor Discussions Contributing


Collapse Index (CI)

A diagnostic framework for instability in complex systems.


📚 Table of Contents

  • What is CI?
  • Collapse Index (CI) Workflow
  • Positioning CI
  • What is SRI?
  • What is the Collapse Log?
  • Why CI + Collapse Log Matter
  • Public Validation
  • FAQ
  • Roadmap 2025
  • Official Status
  • License & Attribution
  • Author
  • Citation
  • Sponsors


🤔 What is CI?

CI is a diagnostic framework for detecting when complex systems suddenly fail under small, ordinary stresses.

  • Bounded scores (0–1): Clear, comparable measure of instability.
  • Lightweight stressors: Simple, benign perturbations (no heavy adversarial pipelines).
  • Reproducibility: Each run produces sealed artifact bundles (logs, hashes, plots) for independent verification.

CI complements existing metrics like calibration, robustness, and OOD detection by acting as a tripwire for hidden brittleness.
It is designed for audit, governance, and deployment settings, not leaderboard gaming.
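The sealed artifact bundles mentioned above rest on a simple idea: hash every artifact so a reviewer can later verify that nothing changed. A minimal sketch, assuming a directory of run artifacts; `seal_bundle` is a hypothetical helper, not the official sealing format (which is not released):

```python
import hashlib
from pathlib import Path

def seal_bundle(bundle_dir: str) -> dict:
    """Hash every artifact in a run bundle with SHA-256.

    Hypothetical helper: the official sealing format is not released;
    this only illustrates hash-based verification of an artifact bundle.
    """
    manifest = {}
    root = Path(bundle_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Relative paths keep the manifest portable across machines.
            manifest[str(path.relative_to(root))] = hashlib.sha256(
                path.read_bytes()
            ).hexdigest()
    return manifest
```

Re-running the same hashing over the bundle and comparing manifests is all an independent verifier needs.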


Collapse Index (CI) Workflow

The Collapse Index (CI) is more than a metric: it’s a pipeline.
Each run produces both a bounded CI score and a collapse log (row-level ledger of outcomes),
then seals everything into an audit-grade bundle.

This flowchart shows how CI integrates into evaluation, from setup to governance.

```mermaid
flowchart LR
    A["<b>Setup</b><br>Prepare environment + model"] --> B["<b>Generation</b><br>Baseline + stress variants"]
    B --> C["<b>Metrics</b><br>Compute collapse signals"]
    C --> D["<b>Logging</b><br>Write collapse_log.csv (row-level)"]
    D --> Y["<b>Collapse Log</b><br>per-prompt ledger (CSV)"] & X["<b>CI Score</b><br>[0,1] aggregate from log"]
    X --> E["<b>Analysis</b><br>Stability vs. collapse"]
    Y --> E
    E --> F["<b>Reporting</b><br>Summaries · plots · tables"]
    F --> G["<b>Archival</b><br>Sealed bundle · checksum"]
    G --> H["<b>Governance</b><br>Licenses · disclosure"]
    E L_E_C_0@-. iteration .-> C
    H L_H_A_0@-. policy/reqs .-> A
    A:::eval
    B:::eval
    C:::eval
    D:::eval
    Y:::outputs
    X:::outputs
    E:::eval
    F:::audit
    G:::audit
    H:::audit
    L_E_C_0@{ animation: fast }
    L_H_A_0@{ animation: fast }
```





The CI framework integrates into the evaluation pipeline at two points:

  • Metrics (CI score): collapse quantified into a bounded [0,1] score.
  • Collapse Log: detailed, row-level record of every prediction and outcome.

These plug into the broader evaluation cycle (analysis → reporting → archival → governance), producing sealed, audit-grade evidence of system stability.
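To make the "bounded [0,1] score from a row-level log" idea concrete, here is a deliberately toy aggregate. The actual CI formula is not public; this sketch only shows the shape of the reduction (log in, bounded scalar out) using the plain error rate as a stand-in:

```python
import csv

def toy_instability_score(log_path: str) -> float:
    """Toy aggregate ONLY -- the real CI formula is not public.

    Shows how a row-level collapse log can be reduced to a bounded
    [0, 1] number; here we simply take the error rate over all rows.
    """
    with open(log_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return 0.0
    errors = sum(int(r["is_error"]) for r in rows)
    return errors / len(rows)  # a fraction, so always within [0, 1]
```

Because the score is a fraction of rows, it is bounded by construction, which is what makes scores comparable across runs.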


πŸ“ Positioning CI

Diagnostic Features

| Method / Paper | Bounded | Stress-based | Lightweight | Audit-aligned | Modality-agnostic |
|---|---|---|---|---|---|
| Collapse Index (CI) | ✓ | ✓ | ✓ | ✓ | ✓ |
| HELM | ✗ | ✗ | ✗ | ✗ | ✓ |
| Calibration / Confidence | ✗ | ✗ | ✓ | ✗ | ✓ |
| OOD Detection | ✗ | Partial | ✓ | ✗ | ✓ |
| Adversarial Robustness | ✗ | ✓ | ✗ | ✗ | ✓ |
| Audit / Repro Standards | ✗ | ✗ | ✗ | ✓ | ✗ |
| Industry Auditors | ✗ | Partial | ✗ | ✗ | ✗ |

Notes

  • Collapse Index (CI) → Defines collapse as structured instability; integrates reproducibility into the diagnostic itself
  • HELM → Large-scale, multi-metric evaluation; not bounded, not collapse-specific
  • Calibration / Confidence → Improves probability alignment but misrepresents brittleness under stress
  • OOD Detection → Captures distributional shift; lacks bounded collapse diagnostics
  • Adversarial Robustness → Reveals fragility but is computationally heavy; not suited to lightweight diagnostics
  • Audit / Repro Standards → Define research process; do not provide diagnostic metrics
  • Industry Auditors → Proprietary scores; not bounded or reproducible

What is SRI?

Structural Retention Index (SRI) is CI's complementary metric for measuring internal reasoning stability.

  • CI measures: How much your model cracks under meaning-preserving perturbations
  • SRI measures: How well your model holds its decision structure across variants
  • Perfect complementarity: CI + SRI = 1.0 (exact)
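The stated complementarity is a hard identity, so either score determines the other. A minimal sketch that only encodes the published relationship (the underlying per-run computations of CI and SRI are not released; `sri_from_ci` is an illustrative name):

```python
def sri_from_ci(ci: float) -> float:
    """Encode the stated identity CI + SRI = 1.0.

    Only the published relationship is used here; the underlying
    per-run computations of CI and SRI are not released.
    """
    if not 0.0 <= ci <= 1.0:
        raise ValueError("CI must lie in [0, 1]")
    return 1.0 - ci
```

For example, under this identity a run with CI = 0.275 would imply SRI = 0.725.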

Why SRI + CI?

Models can output consistent predictions while internal reasoning collapses.
CI catches when your model cracks. SRI catches structural decay.
Together, they reveal failures invisible to traditional metrics.

Key insight: A model can have stable predictions but collapsing internal reasoning.
These are the cases that pass QA but fail in production under real-world stress.

👉 SRI validation dataset & generation code: github.com/collapseindex/ci-sri
📄 Published paper: DOI: 10.5281/zenodo.18016507


📑 What is the Collapse Log?

Every run produces a Collapse Log — an audit-grade CSV file that
records per-prompt diagnostics, predictions, and human-friendly notes.

Think of it as a flight recorder for brittleness:

  • Row-level evidence — Each base input is logged with its
    confidence, entropy, and error status.
  • Interpretive notes — The log adds a plain-English tag (e.g. “error, brittle case”,
    “correct, high conf”) so the file can be skimmed by both humans and machines.
  • Receipts-grade — The file is bundled alongside hashes and snapshots, ensuring that
    results are verifiable and audit-ready.
  • Portable — CSV format, lightweight, and works across pipelines.

📋 Collapse Log Example (snippet)

| prompt_id | label | confidence | entropy | is_error | notes |
|---|---|---|---|---|---|
| 12 | 1 | 0.9453 | 0.1124 | 0 | correct, high conf |
| 27 | 0 | 0.4187 | 0.6932 | 1 | error, brittle case |
| 35 | 1 | 0.7321 | 0.3558 | 0 | correct, stable |
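Because the log is plain CSV, skimming it programmatically takes a few lines. A sketch using the column names from the snippet above (treat them as illustrative; released logs may differ) to pull out the "error, brittle case" pattern:

```python
import csv

def brittle_rows(log_path: str, max_conf: float = 0.5):
    """Return error rows made at low confidence -- the 'error,
    brittle case' pattern. Column names (is_error, confidence)
    follow the example snippet and are illustrative."""
    with open(log_path, newline="") as f:
        return [
            r for r in csv.DictReader(f)
            if int(r["is_error"]) == 1 and float(r["confidence"]) <= max_conf
        ]
```

Run over the three-row snippet, this would surface only prompt 27.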

📊 Collapse Log in Context

Collapse Log strengthens any metric by making results transparent and auditable.
Here’s how it compares across common baselines vs. CI:

| Metric Family | Without Collapse Log | With Collapse Log | Added Value |
|---|---|---|---|
| Confidence / Entropy | Detects low certainty, but hides row-level behavior | Every prediction, confidence, and entropy recorded | Turns a black-box score into an auditable ledger |
| Calibration / OOD | Reports AUROC or coverage curves only | Logs corner cases, OOD spikes, and per-sample traces | Adds traceability — reviewers can see where failures happened |
| Adversarial Robustness | Heavy compute, aggregate-only | Row-level evidence of stress-test outcomes | Makes robustness runs inspectable without reruns |
| Collapse Index (CI) 🚀 | Aggregated CI signals | Full collapse-sensitive forensic record (spikes, flips, margins logged) | Collapse Log + CI = audit-grade diagnostics |

👉 Takeaway: Collapse Log alone adds accountability,
but Collapse Log + CI unlocks a unique diagnostic ledger regulators and reviewers can trust.


⭐ Why CI + Collapse Log Matter

AI models don’t fail quietly — they collapse.
Traditional metrics often miss brittleness until it causes real-world harm.

  • Benchmarks ≠ Reality → models that ace leaderboards can still collapse.
  • Liability Risk → a single collapse may trigger recalls, lawsuits, or penalties.
  • Audit Gap → standard metrics don’t leave receipts; Collapse Log™ does.
  • Efficiency → lightweight stressors mean continuous monitoring without massive compute.
  • Trust → regulators and enterprises need a score they can verify and a log they can audit.

👉 CI + Collapse Log make collapse measurable, reproducible, and audit-ready before it becomes a public liability.


🎯 Public Validation

We ran Collapse Index on DistilBERT-SST2 (90%+ benchmark accuracy) with 500 sentiment examples from the SST-2 validation set.

Results:

  • 42.8% flip rate - Nearly half of predictions change under typos/paraphrases
  • CI Score: 0.275 - Minor drift detected
  • 13 silent failures - High confidence (>90%) but CI detects collapse (CI ≤ 0.45); these bypass traditional monitoring (13 of 35 total high-confidence errors)
  • AUC(CI): 0.698 vs. AUC(Confidence): 0.515 - CI predicts brittleness substantially better than confidence scores (+0.18 AUC)

The gap: Benchmarks say "ship it," but real-world input variations expose massive instability.
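The silent-failure criterion above (high confidence yet low CI) can be expressed as a simple filter over the collapse log. A sketch under assumptions: the per-row collapse signal is taken to live in a hypothetical `ci` column, and column names mirror the earlier snippet; released logs may differ:

```python
import csv

def silent_failures(log_path: str, conf_thresh: float = 0.90,
                    ci_thresh: float = 0.45):
    """Flag 'silent failures': errors made at high confidence while
    the per-row collapse signal sits at or below the threshold.

    The 'ci' column is an assumption for illustration; released logs
    may carry the collapse signal under a different name.
    """
    with open(log_path, newline="") as f:
        return [
            r for r in csv.DictReader(f)
            if int(r["is_error"]) == 1
            and float(r["confidence"]) > conf_thresh
            and float(r["ci"]) <= ci_thresh
        ]
```

Rows matching this filter are exactly the cases confidence-based monitoring would wave through.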

👉 Full reproducible dataset & analysis: github.com/collapseindex/ci-sst2


❓ FAQ

❔ Is CI just another benchmark?
➡️ No. CI is not a leaderboard metric — it’s a diagnostic. It reveals brittleness under benign stress.

❔ Does CI replace calibration, OOD, or adversarial robustness?
➡️ No. CI complements these methods. It adds a collapse-sensitivity axis and receipts (Collapse Log™).

❔ Is CI adversarial?
➡️ No. CI relies on lightweight, domain-appropriate perturbations (e.g., paraphrases, pixel shifts). Collapse is measured without adversarial tuning.

❔ How reproducible are CI runs?
➡️ Every run emits a full artifact bundle: logs, plots, cryptographic hashes, and a Collapse Log.

❔ Does CI scale?
➡️ Yes. CI stabilizes at a small perturbation budget, so continuous monitoring is feasible without massive compute overhead.


πŸ—ΊοΈ Roadmap 2025

  • Finalize framework draft and publish βœ…
  • Run additional experiments β€” scaling to frontier models (e.g., Qwen 7B, Grok, ChatGPT, etc.) βœ…
  • Collaborate with labs and organizations β€” external validation and pilots
  • Build diagnostic software/app β€” packaging CI + Collapse Log as a tool βœ…

⚠️ Official Status

Collapse Index (CI) and Collapse Log are not released as open-source software.
There is no official repository providing formulas or internals.

Any third-party code claiming to implement CI or Collapse Log is:
🚫 Unofficial, unverified, and not endorsed.


📄 License & Attribution

  • The terms Collapse Index™ (CI) and Collapse Log™ are reserved by the author.
  • Unauthorized use or misrepresentation is prohibited.
  • This repo does not contain source code or formulas.

📄 See LICENSE.md and CITATION.md.


πŸ§‘πŸ»β€πŸ”¬ Author

Collapse Index Labs (Alex Kwon)

For evals, datasets, collaborations, or pilots, contact:
📩 ask@collapseindex.org


📚 Citation

If you reference Collapse Index (CI) in your research or evaluations, please cite:

Kwon, A. (2025). Collapse Index (CI): A Diagnostic Framework for Bounded, Lightweight, and Reproducible Evaluation of System Instability (v1.0). Collapse Index Labs. https://doi.org/10.5281/zenodo.17718180


💖 Sponsors

Collapse Index research is made possible through community support.

📑 Transmission Tier (Major Sponsors)

Be the first founding Transmission sponsor.

📻 Feedback Tier (Contributors)

Be the first founding Feedback sponsor.

👉 Sponsor CI on GitHub

