Calibration Metrics

Calibration metrics measure whether confidence values are trustworthy for decision-making.

Why This Matters

If a model predicts 0.90 confidence, teams often act as if it is correct 90 percent of the time. Calibration metrics test whether that assumption is true.

When to Use

  • when confidence values drive downstream thresholds

  • when ranking or triage depends on probability quality

  • when validating reliability before deployment

Inputs and Assumptions

  • y_true: ground-truth labels

  • y_prob: predicted probabilities

  • multiclass support: TrustLens v0.4.0 supports multiclass calibration via top-label ECE and a multiclass Brier score; the functions in the API Reference below are documented for binary labels
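As a rough illustration of what top-label ECE and a multiclass Brier score compute, here is a minimal NumPy sketch. The helper names top_label_ece and multiclass_brier are hypothetical and are not TrustLens functions. The Brier variant shown sums squared errors over classes (range [0, 2]), which is one common convention and may differ from the one TrustLens uses.

```python
import numpy as np

def top_label_ece(y_true, probs, n_bins=10):
    """Hypothetical helper: ECE over the confidence of the predicted class."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)            # probability of the top label
    predicted = probs.argmax(axis=1)          # index of the top label
    correct = (predicted == y_true).astype(float)
    # Equal-width bins on [0, 1]; digitizing against the interior edges
    # places a confidence of exactly 1.0 in the last bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidence, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # mask.mean() is the fraction of samples that fall in this bin
            ece += mask.mean() * abs(correct[mask].mean() - confidence[mask].mean())
    return float(ece)

def multiclass_brier(y_true, probs):
    """Hypothetical helper: mean squared distance to the one-hot label."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs, dtype=float)
    one_hot = np.eye(probs.shape[1])[y_true]
    return float(np.mean(np.sum((probs - one_hot) ** 2, axis=1)))

# Illustrative data: 4 samples, 3 classes, each row sums to 1
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
y_true = np.array([0, 1, 2, 1])

ece = top_label_ece(y_true, probs)
brier = multiclass_brier(y_true, probs)
```

Top-label ECE only looks at the winning class, so it can miss miscalibration in the remaining class probabilities; the multiclass Brier score uses the full probability vector.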

Output and Interpretation

Key outputs include:

  • Brier score: lower values indicate better probabilistic accuracy

  • ECE: lower values indicate confidence is better aligned with observed accuracy

  • Reliability curve data: supports visual overconfidence or underconfidence inspection
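To make the interpretation of reliability curve data concrete, the sketch below compares mean predicted probability with the observed fraction of positives in each bin. The per-bin values are made up for illustration and do not come from TrustLens.

```python
import numpy as np

# Hypothetical per-bin values, in the order (fraction of positives,
# mean predicted probability, bin counts)
frac_pos = np.array([0.05, 0.30, 0.55, 0.70])
mean_pred = np.array([0.10, 0.35, 0.60, 0.90])
counts = np.array([40, 30, 20, 10])

# Positive gap: the model predicts the positive class more often than it
# occurs (overconfident in that bin). Negative gap: underconfident.
gap = mean_pred - frac_pos
labels = np.where(gap > 0, "over", np.where(gap < 0, "under", "calibrated"))

# Weighting the absolute gaps by bin size gives an ECE-style summary
weighted_gap = float(np.sum(counts / counts.sum() * np.abs(gap)))
```

In this made-up case every bin is overconfident, and the largest gap sits in the highest-probability bin, which is where threshold-based decisions are usually made.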

Limitations and Caveats

  • calibration quality estimates are less stable on very small datasets, where individual bins may hold only a few samples

  • poor probability estimation upstream can dominate all calibration outputs

API Reference

trustlens.metrics.calibration

Calibration metrics for probabilistic classifiers.

Calibration measures how well a model’s predicted probabilities reflect the true likelihood of outcomes. A perfectly calibrated model that predicts 80% confidence for a set of samples should be correct ~80% of the time.

Metrics implemented

  • brier_score — proper scoring rule for probabilistic forecasts

  • expected_calibration_error — binned confidence vs accuracy gap

  • reliability_curve — data for reliability (calibration) diagrams

References

  • Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3.

  • Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. ICML.

  • Guo, C., et al. (2017). On calibration of modern neural networks. ICML.

trustlens.metrics.calibration.brier_score(y_true: ndarray, y_prob: ndarray) -> float

Compute the Brier Score for a binary probabilistic classifier.

The Brier Score is the mean squared difference between predicted probabilities and actual outcomes. Lower is better: a perfect forecaster scores 0.0, and a forecaster that always predicts 0.5 scores exactly 0.25.

\[\text{BS} = \frac{1}{N} \sum_{i=1}^{N} \bigl(\hat{p}_i - y_i\bigr)^2\]
Parameters:
  • y_true (np.ndarray) – Binary ground-truth labels (0 or 1), shape (n_samples,).

  • y_prob (np.ndarray) – Predicted probabilities for the positive class, shape (n_samples,).

Returns:

Brier Score in [0, 1].

Return type:

float

Raises:

ValueError – If y_true and y_prob have different lengths, or if y_true contains values outside {0, 1}.

Examples

>>> import numpy as np
>>> from trustlens.metrics.calibration import brier_score
>>> y_true = np.array([1, 0, 1, 1, 0])
>>> y_prob = np.array([0.9, 0.1, 0.8, 0.7, 0.3])
>>> print(round(brier_score(y_true, y_prob), 3))
0.048
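The formula can be checked by hand with a few lines of NumPy. This is a minimal sketch, not the TrustLens implementation; brier_reference is a hypothetical name, and the input checks simply mirror the documented ValueError conditions.

```python
import numpy as np

def brier_reference(y_true, y_prob):
    """Hypothetical reference: mean squared error between probabilities and labels."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    if y_true.shape != y_prob.shape:
        raise ValueError("y_true and y_prob must have the same shape")
    if not np.isin(y_true, (0.0, 1.0)).all():
        raise ValueError("y_true must contain only 0 and 1")
    return float(np.mean((y_prob - y_true) ** 2))

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.1, 0.8, 0.7, 0.3])

# Squared errors are 0.01, 0.01, 0.04, 0.09, 0.09; their mean is 0.048
score = brier_reference(y_true, y_prob)

# A constant prediction of 0.5 gives (0.5)^2 = 0.25 for every sample
baseline = brier_reference(y_true, np.full(5, 0.5))
```

Comparing a score against the 0.25 baseline is a quick sanity check: a model that scores above it is doing worse than an uninformative constant forecast.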
trustlens.metrics.calibration.expected_calibration_error(y_true: ndarray, y_prob: ndarray, n_bins: int = 10, strategy: str = 'uniform') -> float

Compute the Expected Calibration Error (ECE).

ECE measures the weighted average absolute difference between predicted confidence and actual accuracy across probability bins.

\[\text{ECE} = \sum_{b=1}^{B} \frac{|\mathcal{B}_b|}{N} \left|\text{acc}(\mathcal{B}_b) - \text{conf}(\mathcal{B}_b)\right|\]
Parameters:
  • y_true (np.ndarray) – Binary ground-truth labels (0 or 1), shape (n_samples,).

  • y_prob (np.ndarray) – Predicted probabilities for the positive class, shape (n_samples,).

  • n_bins (int) – Number of confidence bins. Default 10.

  • strategy (str) – Binning strategy — "uniform" (equal-width) or "quantile" (equal-frequency). Default "uniform".

Returns:

ECE value in [0, 1]. Lower is better.

Return type:

float

Examples

>>> from trustlens.metrics.calibration import expected_calibration_error
>>> # y_true and y_prob as defined in the brier_score example
>>> ece = expected_calibration_error(y_true, y_prob, n_bins=10)
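The binned sum in the ECE formula, and the difference between the two binning strategies, can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the TrustLens implementation: ece_reference and bin_edges are hypothetical names, acc(B) is taken to be the fraction of positives in a bin, conf(B) the mean predicted probability, and empty bins contribute nothing.

```python
import numpy as np

def bin_edges(y_prob, n_bins=10, strategy="uniform"):
    """Hypothetical helper returning n_bins + 1 bin edges."""
    if strategy == "uniform":
        # Equal-width bins on [0, 1]
        return np.linspace(0.0, 1.0, n_bins + 1)
    if strategy == "quantile":
        # Edges at empirical quantiles, so bins hold roughly equal counts
        return np.quantile(y_prob, np.linspace(0.0, 1.0, n_bins + 1))
    raise ValueError("strategy must be 'uniform' or 'quantile'")

def ece_reference(y_true, y_prob, n_bins=10, strategy="uniform"):
    """Hypothetical reference for the binned ECE sum."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = bin_edges(y_prob, n_bins, strategy)
    # Digitize against interior edges so the maximum value lands in the last bin
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            weight = mask.mean()                       # |B_b| / N
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += weight * gap
    return float(ece)

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.1, 0.8, 0.7, 0.3])

ece_uniform = ece_reference(y_true, y_prob, n_bins=10, strategy="uniform")
ece_quantile = ece_reference(y_true, y_prob, n_bins=5, strategy="quantile")
```

With only five samples each point lands in its own bin, so the result is just the mean absolute gap between label and probability. On realistic data, "quantile" avoids near-empty bins when predictions cluster near 0 or 1, at the cost of bins with unequal widths.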
trustlens.metrics.calibration.reliability_curve(y_true: ndarray, y_prob: ndarray, n_bins: int = 10, strategy: str = 'uniform') -> tuple[ndarray, ndarray, ndarray]

Compute the reliability (calibration) curve data.

Returns the fraction of positives, mean predicted probability, and bin counts for each confidence bin, in that order. Use this data with trustlens.visualization.plot_reliability_diagram to render a calibration plot.

Parameters:
  • y_true (np.ndarray) – Binary ground-truth labels (0 or 1).

  • y_prob (np.ndarray) – Predicted probabilities for the positive class.

  • n_bins (int) – Number of confidence bins. Default 10.

  • strategy (str) – "uniform" or "quantile". Default "uniform".

Returns:

  • fraction_of_positives (np.ndarray) – Actual fraction of positive samples in each bin.

  • mean_predicted_value (np.ndarray) – Mean predicted probability in each bin.

  • bin_counts (np.ndarray) – Number of samples in each bin.

Examples

>>> from trustlens.metrics.calibration import reliability_curve
>>> frac_pos, mean_pred, counts = reliability_curve(y_true, y_prob)
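For readers who want to see how such per-bin arrays can be produced, here is a minimal NumPy sketch using equal-width bins. It is not the TrustLens implementation: reliability_reference is a hypothetical name, and dropping empty bins is an assumption (scikit-learn's calibration_curve behaves this way; TrustLens may not).

```python
import numpy as np

def reliability_reference(y_true, y_prob, n_bins=10):
    """Hypothetical reference returning (frac_pos, mean_pred, counts) per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])
    # Per-bin sample counts and per-bin sums of labels and probabilities
    counts = np.bincount(bin_ids, minlength=n_bins)
    sum_true = np.bincount(bin_ids, weights=y_true, minlength=n_bins)
    sum_prob = np.bincount(bin_ids, weights=y_prob, minlength=n_bins)
    keep = counts > 0  # assumption: empty bins are dropped from the output
    return sum_true[keep] / counts[keep], sum_prob[keep] / counts[keep], counts[keep]

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.1, 0.8, 0.7, 0.3])

frac_pos, mean_pred, counts = reliability_reference(y_true, y_prob)

# The same arrays reproduce the binned ECE sum
ece_from_curve = float(np.sum(counts / counts.sum() * np.abs(frac_pos - mean_pred)))
```

Points above the diagonal (frac_pos greater than mean_pred) indicate underconfidence in that bin, and points below it indicate overconfidence. Keeping counts alongside the curve makes it easy to discount bins that rest on very few samples.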