Calibration Metrics

Calibration metrics measure whether confidence values are trustworthy for decision-making.

Why This Matters

If a model predicts 0.90 confidence, teams often act as if it is correct 90 percent of the time. Calibration metrics test whether that assumption is true.

When to Use

when confidence values drive downstream thresholds
when ranking or triage depends on probability quality
when validating reliability before deployment

Inputs and Assumptions

y_true: ground-truth labels
y_prob: predicted probabilities
Multiclass Support: TrustLens v0.4.0 supports multiclass calibration using top-label ECE and Multiclass Brier Score.
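
As a rough illustration of the multiclass quantities named above, the sketch below reduces each prediction to its top-label confidence and a correct/incorrect flag (the pairs a top-label ECE is binned on) and computes a multiclass Brier score. The function names are hypothetical, the inputs are plain Python lists rather than arrays, and the Brier normalization (no division by the number of classes) is an assumption; TrustLens may follow different conventions.

```python
def top_label_view(y_true, y_prob):
    """Reduce multiclass predictions to (confidence, correct) pairs.

    y_true: list of integer class indices.
    y_prob: list of per-class probability rows, one row per sample.
    """
    confidences, correct = [], []
    for label, row in zip(y_true, y_prob):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        confidences.append(row[predicted])
        correct.append(1 if predicted == label else 0)
    return confidences, correct


def multiclass_brier(y_true, y_prob):
    """Mean over samples of the squared error summed over classes.

    Assumption: the sum is not divided by the number of classes.
    """
    total = 0.0
    for label, row in zip(y_true, y_prob):
        total += sum(
            (p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(row)
        )
    return total / len(y_true)


y_true = [0, 2, 1]
y_prob = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],  # predicted class 0, true class 1
]
confidences, correct = top_label_view(y_true, y_prob)
print(confidences, correct)  # [0.7, 0.6, 0.5] [1, 1, 0]
print(round(multiclass_brier(y_true, y_prob), 4))  # 0.34
```

The per-sample squared errors here are 0.14, 0.26, and 0.62, so the mean is 0.34. The confidence and correctness lists can then be binned the same way as binary probabilities and labels.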

Output and Interpretation

Key outputs include:

Brier score: lower values indicate better probabilistic accuracy
ECE: lower values indicate confidence is better aligned with observed accuracy
Reliability curve data: supports visual overconfidence or underconfidence inspection
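
One way to read reliability curve data without plotting it is to compare, bin by bin, the mean predicted probability with the observed fraction of positives. The sketch below is only an illustration: `describe_bins` and its `tolerance` threshold are hypothetical and not part of TrustLens, and the numbers are made up for the example.

```python
def describe_bins(fraction_of_positives, mean_predicted_value, tolerance=0.05):
    """Label each bin by the sign of (mean prediction - observed frequency)."""
    labels = []
    for observed, predicted in zip(fraction_of_positives, mean_predicted_value):
        gap = predicted - observed
        if gap > tolerance:
            labels.append("overconfident")
        elif gap < -tolerance:
            labels.append("underconfident")
        else:
            labels.append("roughly calibrated")
    return labels


# Illustrative values only, not output from a real model.
frac_pos = [0.10, 0.45, 0.70]
mean_pred = [0.12, 0.30, 0.90]
print(describe_bins(frac_pos, mean_pred))
# ['roughly calibrated', 'underconfident', 'overconfident']
```

A bin counts as overconfident here when the predicted probability exceeds the observed frequency by more than the tolerance, and as underconfident in the opposite case.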

Limitations and Caveats

calibration quality estimates are less stable on very small datasets
poor probability estimation upstream can dominate all calibration outputs
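
The small-sample caveat can be made concrete with a simulation. The sketch below (standard library only, with hypothetical names) repeatedly draws a single, perfectly calibrated confidence bin and measures how much the observed accuracy scatters around the true rate for a small and a large sample size.

```python
import random
import statistics


def accuracy_spread(n_samples, true_rate=0.8, n_repeats=200, seed=0):
    """Std. dev. of observed accuracy across repeated draws of one bin.

    Every simulated prediction has confidence `true_rate` and is correct
    with exactly that probability, so the bin is perfectly calibrated.
    """
    rng = random.Random(seed)
    observed = []
    for _ in range(n_repeats):
        hits = sum(rng.random() < true_rate for _ in range(n_samples))
        observed.append(hits / n_samples)
    return statistics.pstdev(observed)


spread_small = accuracy_spread(n_samples=20)
spread_large = accuracy_spread(n_samples=2000)
print(f"n=20:   spread of observed accuracy = {spread_small:.3f}")
print(f"n=2000: spread of observed accuracy = {spread_large:.3f}")
```

Binomial theory predicts a standard deviation of sqrt(p(1 - p) / n), roughly 0.089 for n=20 and 0.009 for n=2000 at p=0.8. A bin holding only a few samples can therefore look miscalibrated by chance alone.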

API Reference

trustlens.metrics.calibration

Calibration metrics for probabilistic classifiers.

Calibration measures how well a model’s predicted probabilities reflect the true likelihood of outcomes. A perfectly calibrated model that predicts 80% confidence for a set of samples should be correct ~80% of the time.

Metrics implemented

brier_score — proper scoring rule for probabilistic forecasts
expected_calibration_error — binned confidence vs accuracy gap
reliability_curve — data for reliability (calibration) diagrams

References

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3.
Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. ICML.
Guo, C., et al. (2017). On calibration of modern neural networks. ICML.

trustlens.metrics.calibration.brier_score(y_true: ndarray, y_prob: ndarray) → float

Compute the Brier Score for a binary probabilistic classifier.

The Brier Score is the mean squared difference between predicted probabilities and actual outcomes. Lower is better: a perfect forecaster scores 0.0, and a forecaster that always predicts 0.5 scores exactly 0.25.

\[\text{BS} = \frac{1}{N} \sum_{i=1}^{N} \bigl(\hat{p}_i - y_i\bigr)^2\]

Parameters:

y_true (np.ndarray) – Binary ground-truth labels (0 or 1), shape (n_samples,).
y_prob (np.ndarray) – Predicted probabilities for the positive class, shape (n_samples,).

Returns:

Brier Score in [0, 1].

Return type:

float

Raises:

ValueError – If y_true and y_prob have different lengths, or if y_true contains values outside {0, 1}.

Examples

>>> import numpy as np
>>> from trustlens.metrics.calibration import brier_score
>>> y_true = np.array([1, 0, 1, 1, 0])
>>> y_prob = np.array([0.9, 0.1, 0.8, 0.7, 0.3])
>>> print(f"{brier_score(y_true, y_prob):.3f}")
0.048
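
The formula can be checked by hand on the same five samples. The following is a minimal pure-Python sketch, not the TrustLens implementation; `brier_score_sketch` is a hypothetical name, and its input checks only mirror the `ValueError` conditions documented above.

```python
def brier_score_sketch(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    if any(y not in (0, 1) for y in y_true):
        raise ValueError("y_true must contain only 0 and 1")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.1, 0.8, 0.7, 0.3]
# Squared errors: 0.01, 0.01, 0.04, 0.09, 0.09 -> mean 0.048
print(round(brier_score_sketch(y_true, y_prob), 3))  # 0.048
print(brier_score_sketch([0, 1], [0.5, 0.5]))        # 0.25
```

The second call shows the reference point for an uninformative forecast: always predicting 0.5 gives a squared error of 0.25 on every sample.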

trustlens.metrics.calibration.expected_calibration_error(y_true: ndarray, y_prob: ndarray, n_bins: int = 10, strategy: str = 'uniform') → float

Compute the Expected Calibration Error (ECE).

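ECE is the average absolute gap between predicted confidence and observed accuracy across probability bins, with each bin weighted by the share of samples it contains. In the formula below, \(B\) is the number of bins, \(|\mathcal{B}_b|\) is the number of samples in bin \(b\), and \(N\) is the total number of samples.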

\[\text{ECE} = \sum_{b=1}^{B} \frac{|\mathcal{B}_b|}{N} \left|\text{acc}(\mathcal{B}_b) - \text{conf}(\mathcal{B}_b)\right|\]

Parameters:

y_true (np.ndarray) – Binary ground-truth labels (0 or 1), shape (n_samples,).
y_prob (np.ndarray) – Predicted probabilities for the positive class, shape (n_samples,).
n_bins (int) – Number of confidence bins. Default 10.
strategy (str) – Binning strategy — "uniform" (equal-width) or "quantile" (equal-frequency). Default "uniform".

Returns:

ECE value in [0, 1]. Lower is better.

Return type:

float

Examples

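>>> import numpy as np
>>> from trustlens.metrics.calibration import expected_calibration_error
>>> y_true, y_prob = np.array([1, 0, 1, 1, 0]), np.array([0.9, 0.1, 0.8, 0.7, 0.3])
>>> ece = expected_calibration_error(y_true, y_prob, n_bins=10)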
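
To make the binned computation concrete, here is a minimal pure-Python sketch of ECE with equal-width bins for the binary case. It is not the TrustLens implementation: the name `ece_uniform_sketch` is hypothetical, confidence is taken as the mean predicted positive-class probability and accuracy as the fraction of positives in each bin, and the bin-edge convention (left-inclusive, with 1.0 placed in the last bin) is an assumption.

```python
def ece_uniform_sketch(y_true, y_prob, n_bins=10):
    """Expected Calibration Error with equal-width bins (binary case)."""
    counts = [0] * n_bins
    prob_sums = [0.0] * n_bins
    label_sums = [0] * n_bins
    for y, p in zip(y_true, y_prob):
        # Assumed convention: bin b covers [b/n_bins, (b+1)/n_bins),
        # and p == 1.0 falls into the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        counts[b] += 1
        prob_sums[b] += p
        label_sums[b] += y

    n = len(y_true)
    ece = 0.0
    for count, prob_sum, label_sum in zip(counts, prob_sums, label_sums):
        if count == 0:
            continue  # empty bins carry zero weight
        conf = prob_sum / count   # mean predicted probability in the bin
        acc = label_sum / count   # observed fraction of positives
        ece += (count / n) * abs(acc - conf)
    return ece


y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.1, 0.8, 0.7, 0.3]
print(round(ece_uniform_sketch(y_true, y_prob, n_bins=10), 3))  # 0.2
print(round(ece_uniform_sketch(y_true, y_prob, n_bins=2), 3))   # 0.2
```

With ten bins each sample sits alone in its bin, so ECE is the mean of |y - p| = (0.1 + 0.1 + 0.2 + 0.3 + 0.3) / 5. With two bins the gaps are |0 - 0.2| weighted by 2/5 and |1 - 0.8| weighted by 3/5, which also gives 0.2 for this data.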

trustlens.metrics.calibration.reliability_curve(y_true: ndarray, y_prob: ndarray, n_bins: int = 10, strategy: str = 'uniform') → tuple[ndarray, ndarray, ndarray]

Compute the reliability (calibration) curve data.

Returns, in this order, the fraction of positives, the mean predicted probability, and the sample count for each confidence bin. Use this data with trustlens.visualization.plot_reliability_diagram to render a calibration plot.

Parameters:

y_true (np.ndarray) – Binary ground-truth labels (0 or 1).
y_prob (np.ndarray) – Predicted probabilities for the positive class.
n_bins (int) – Number of confidence bins. Default 10.
strategy (str) – "uniform" or "quantile". Default "uniform".

Returns:

fraction_of_positives (np.ndarray) – Actual fraction of positive samples in each bin.
mean_predicted_value (np.ndarray) – Mean predicted probability in each bin.
bin_counts (np.ndarray) – Number of samples in each bin.

Examples

>>> from trustlens.metrics.calibration import reliability_curve
>>> frac_pos, mean_pred, counts = reliability_curve(y_true, y_prob)
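
The difference between the two binning strategies is easiest to see on skewed probabilities. The sketch below is a pure-Python approximation of the documented return values, not the library's code: `reliability_curve_sketch` is a hypothetical name, "quantile" is implemented by splitting the sorted samples into groups of near-equal size, and dropping empty bins is an assumption.

```python
def reliability_curve_sketch(y_true, y_prob, n_bins=10, strategy="uniform"):
    """Return (fraction_of_positives, mean_predicted_value, bin_counts)."""
    pairs = sorted(zip(y_prob, y_true))
    if strategy == "uniform":
        # Equal-width bins over [0, 1]; p == 1.0 goes in the last bin.
        groups = [[] for _ in range(n_bins)]
        for p, y in pairs:
            groups[min(int(p * n_bins), n_bins - 1)].append((p, y))
    elif strategy == "quantile":
        # Equal-frequency bins: split the sorted samples into n_bins groups.
        n = len(pairs)
        groups = [
            pairs[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)
        ]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    fraction_of_positives, mean_predicted_value, bin_counts = [], [], []
    for group in groups:
        if not group:
            continue  # assumption: empty bins are dropped from the output
        bin_counts.append(len(group))
        mean_predicted_value.append(sum(p for p, _ in group) / len(group))
        fraction_of_positives.append(sum(y for _, y in group) / len(group))
    return fraction_of_positives, mean_predicted_value, bin_counts


y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.05, 0.1, 0.15, 0.2, 0.6, 0.9]  # most mass near zero

_, _, uniform_counts = reliability_curve_sketch(y_true, y_prob, n_bins=3)
frac_q, _, quantile_counts = reliability_curve_sketch(
    y_true, y_prob, n_bins=3, strategy="quantile"
)
print(uniform_counts)   # [4, 1, 1]
print(quantile_counts)  # [2, 2, 2]
print(frac_q)           # [0.0, 0.5, 1.0]
```

Equal-width bins leave the upper bins with one sample each, while equal-frequency bins keep the counts balanced at the cost of uneven bin widths.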
