
Proposal: Retrieval-Conditioned Confidence Metric (RCCS) for RAG evaluation #736

@devarakondasrikanth

Description


Motivation

Current evaluation metrics in the ecosystem emphasize answer-level quality (accuracy, F1, BLEU, ROUGE) but lack a lightweight, model-agnostic diagnostic for confidence calibration in Retrieval-Augmented Generation (RAG) pipelines. In practice, answer reliability depends on both retrieval relevance and model confidence; therefore a metric that evaluates their joint alignment is valuable.

Proposed metric: Retrieval-Conditioned Confidence Score (RCCS)

RCCS measures alignment between:

  • Retrieval relevance score R (per example)
  • Model confidence C (per example; calibrated probability recommended)
  • Ground-truth correctness A (0/1)

Two suggested outputs:

  • rccs_correlation: Pearson correlation between (R * C) and A.
  • confidence_calibration_error: mean absolute error between (R * C) and A, i.e. mean(|R*C - A|).
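The two outputs above can be sketched in a few lines of NumPy (the example values of R, C, and A are made up for illustration):

```python
# Illustrative computation of the two proposed RCCS outputs.
import numpy as np

R = np.array([0.9, 0.8, 0.2, 0.7])    # retrieval relevance per example
C = np.array([0.95, 0.9, 0.85, 0.4])  # model confidence per example
A = np.array([1, 1, 0, 0])            # ground-truth correctness (0/1)

RC = R * C
# Pearson correlation between R*C and A (off-diagonal of the 2x2 matrix)
rccs_correlation = np.corrcoef(RC, A)[0, 1]
# Mean absolute error between R*C and A
confidence_calibration_error = np.abs(RC - A).mean()
```

In this toy sample the correct answers have high R*C and the incorrect ones low R*C, so rccs_correlation is close to 1 while confidence_calibration_error stays small.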

Rationale:

  • High R with high C should correlate with A=1 (correct).
  • High C but low R indicates hallucination risk.
  • The metric is model-agnostic, simple to compute, and useful for benchmarking and diagnostics.

Proposed scope for initial PR

  • Add directory metrics/rccs/ implementing the metric as evaluate.Metric.
  • Minimal API: accept columns retrieval_score, confidence_score, correctness.
  • Return rccs_correlation, confidence_calibration_error, mean_rc, n.
  • Include minimal unit tests and a metric card README with an example notebook.
  • Keep implementation lightweight; follow repository conventions and tests.
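A minimal sketch of the compute logic the metrics/rccs/ module could wrap. The function name is hypothetical and the evaluate.Metric boilerplate is omitted so the sketch runs standalone; the input column names and the four output keys are the ones proposed above:

```python
import numpy as np

def compute_rccs(retrieval_score, confidence_score, correctness):
    """Hypothetical core of the proposed metrics/rccs/ module.

    Accepts the three proposed input columns and returns the four
    proposed outputs: rccs_correlation, confidence_calibration_error,
    mean_rc, and n.
    """
    r = np.asarray(retrieval_score, dtype=float)
    c = np.asarray(confidence_score, dtype=float)
    a = np.asarray(correctness, dtype=float)
    rc = r * c
    return {
        "rccs_correlation": float(np.corrcoef(rc, a)[0, 1]),
        "confidence_calibration_error": float(np.abs(rc - a).mean()),
        "mean_rc": float(rc.mean()),
        "n": int(rc.size),
    }
```

An evaluate.Metric wrapper would call this from its _compute method and declare the three columns in its features specification.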

Example formulas

  • rccs_correlation = PearsonCorr(R * C, A)
  • confidence_calibration_error = mean(|R * C - A|)

Notes

  • I will provide an example Colab notebook showing RCCS on a simple RAG pipeline.
  • I can open an initial PR implementing a minimal version if maintainers agree with scope and naming.

If this direction makes sense, I’m happy to implement whatever scope and naming best align with the project’s design goals and contribution guidelines. Please let me know what you prefer.

Thanks for your time, and happy to iterate on the approach.
