Motivation
Current evaluation metrics in the ecosystem emphasize answer-level quality (accuracy, F1, BLEU, ROUGE) but lack a lightweight, model-agnostic diagnostic for confidence calibration in Retrieval-Augmented Generation (RAG) pipelines. In practice, answer reliability depends on both retrieval relevance and model confidence; therefore a metric that evaluates their joint alignment is valuable.
Proposed metric: Retrieval-Conditioned Confidence Score (RCCS)
RCCS measures the alignment between three per-example quantities:
- Retrieval relevance score `R` (per example)
- Model confidence `C` (per example; a calibrated probability is recommended)
- Ground-truth correctness `A` (0/1)
Two suggested outputs:
- `rccs_correlation`: Pearson correlation between `R * C` and `A`.
- `confidence_calibration_error`: mean absolute error, `mean(|R * C - A|)`.
Rationale:
- High `R` with high `C` should correlate with `A = 1` (correct).
- High `C` but low `R` indicates hallucination risk.
- The metric is model-agnostic, simple to compute, and useful for benchmarking and diagnostics.
Proposed scope for initial PR
- Add directory `metrics/rccs/` implementing the metric as an `evaluate.Metric`.
- Minimal API: accept columns `retrieval_score`, `confidence_score`, `correctness`.
- Return `rccs_correlation`, `confidence_calibration_error`, `mean_rc`, `n`.
- Include minimal unit tests and a metric card README with an example notebook.
- Keep the implementation lightweight; follow repository conventions and tests.
Example formulas
```
rccs_correlation = PearsonCorr(R * C, A)
confidence_calibration_error = mean(|R * C - A|)
```
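For concreteness, the formulas above can be computed directly from per-example lists; here is a minimal pure-Python sketch (the `rccs_outputs` function name and the guard for zero variance are my own additions):

```python
from math import sqrt


def rccs_outputs(R, C, A):
    """Compute the proposed RCCS outputs from per-example lists.

    R: retrieval relevance scores, C: model confidences
    (ideally calibrated), A: 0/1 ground-truth correctness.
    """
    rc = [r * c for r, c in zip(R, C)]
    n = len(rc)
    mean_rc = sum(rc) / n
    mean_a = sum(A) / n
    # Pearson correlation between R*C and A
    cov = sum((x - mean_rc) * (a - mean_a) for x, a in zip(rc, A))
    var_rc = sum((x - mean_rc) ** 2 for x in rc)
    var_a = sum((a - mean_a) ** 2 for a in A)
    corr = cov / sqrt(var_rc * var_a) if var_rc and var_a else float("nan")
    # Mean absolute error between R*C and A
    cal_err = sum(abs(x - a) for x, a in zip(rc, A)) / n
    return {
        "rccs_correlation": corr,
        "confidence_calibration_error": cal_err,
        "mean_rc": mean_rc,
        "n": n,
    }
```

For example, `rccs_outputs([0.9, 0.1], [0.8, 0.9], [1, 0])` gives a correlation of 1.0 (the correct answer has the higher joint score) and a calibration error of 0.185.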
Notes
- I will provide an example Colab notebook showing RCCS on a simple RAG pipeline.
- I can open an initial PR implementing a minimal version if maintainers agree with scope and naming.
If this direction makes sense, I’m happy to implement whichever option aligns best with the project’s design goals and contribution guidelines. Please let me know the preferred option.
Thanks for your time, and happy to iterate on the approach.