Calibration Metrics
Calibration metrics measure whether confidence values are trustworthy for decision-making.
Why This Matters
If a model predicts 0.90 confidence, teams often act as if it is correct 90 percent of the time. Calibration metrics test whether that assumption is true.
When to Use
when confidence values drive downstream thresholds
when ranking or triage depends on probability quality
when validating reliability before deployment
Inputs and Assumptions
y_true: ground-truth labels
y_prob: predicted probabilities
Multiclass Support: TrustLens v0.4.0 supports multiclass calibration using top-label ECE and Multiclass Brier Score.
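To make the two multiclass quantities concrete, here is a minimal NumPy sketch. It is not the TrustLens implementation: the function names `multiclass_brier` and `top_label_ece` are hypothetical, the Brier variant shown sums squared errors over classes (one common convention, giving a range of [0, 2]), and equal-width bins are assumed.

```python
import numpy as np

def multiclass_brier(y_true, y_prob):
    """Mean squared distance between each probability vector and the
    one-hot encoding of its label (summed over classes)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    one_hot = np.eye(y_prob.shape[1])[y_true]
    return float(np.mean(np.sum((y_prob - one_hot) ** 2, axis=1)))

def top_label_ece(y_true, y_prob, n_bins=10):
    """ECE computed on the confidence of the predicted (top) class only."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    confidence = y_prob.max(axis=1)
    correct = (y_prob.argmax(axis=1) == y_true).astype(float)
    # Interior edges of equal-width bins map each confidence to 0..n_bins-1.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidence, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin's share of samples
    return float(ece)

# Illustrative data: 4 samples, 3 classes (labels are class indices).
labels = [0, 1, 2, 1]
probs = [
    [0.75, 0.15, 0.10],
    [0.20, 0.65, 0.15],
    [0.30, 0.25, 0.45],
    [0.55, 0.35, 0.10],  # top label is class 0, true label is class 1
]
brier = multiclass_brier(labels, probs)  # ≈ 0.3675
ece = top_label_ece(labels, probs)       # ≈ 0.425
```

Top-label ECE only inspects the winning class, so a model can score well on it while still assigning poor probabilities to the remaining classes; the multiclass Brier score is sensitive to the full probability vector.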
Output and Interpretation
Key outputs include:
Brier score: lower values indicate better probabilistic accuracy
ECE: lower values indicate confidence is better aligned with observed accuracy
Reliability curve data: supports visual overconfidence or underconfidence inspection
Limitations and Caveats
calibration quality estimates are less stable on very small datasets
poor probability estimation upstream can dominate all calibration outputs
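One way to see the small-dataset caveat is to bootstrap a calibration metric and compare interval widths at different sample sizes. The sketch below is an illustration under stated assumptions, not a TrustLens feature: `bootstrap_brier_interval` and `simulate` are hypothetical helpers, the data is synthetic, and the Brier score is computed directly from its definition.

```python
import numpy as np

def bootstrap_brier_interval(y_true, y_prob, n_boot=1000, seed=0):
    """Percentile bootstrap 95% interval for the binary Brier score."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    n = len(y_true)
    # Each row of idx is one resample (with replacement) of the dataset.
    idx = rng.integers(0, n, size=(n_boot, n))
    scores = ((y_prob[idx] - y_true[idx]) ** 2).mean(axis=1)
    low, high = np.percentile(scores, [2.5, 97.5])
    return float(low), float(high)

data_rng = np.random.default_rng(42)

def simulate(n):
    """Synthetic data that is calibrated by construction:
    each label is drawn with its own predicted probability."""
    p = data_rng.uniform(0.05, 0.95, size=n)
    y = (data_rng.uniform(size=n) < p).astype(int)
    return y, p

small_low, small_high = bootstrap_brier_interval(*simulate(30))
large_low, large_high = bootstrap_brier_interval(*simulate(3000))

print(f"n=30:   [{small_low:.3f}, {small_high:.3f}]")
print(f"n=3000: [{large_low:.3f}, {large_high:.3f}]")
```

The interval for the 30-sample set is far wider than for the 3000-sample set, even though both come from the same well-calibrated process. Reporting such an interval alongside the point estimate helps avoid over-reading a single number from a small evaluation set.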
API Reference
trustlens.metrics.calibration
Calibration metrics for probabilistic classifiers.
Calibration measures how well a model’s predicted probabilities reflect the true likelihood of outcomes. A perfectly calibrated model that predicts 80% confidence for a set of samples should be correct ~80% of the time.
Metrics implemented
brier_score: proper scoring rule for probabilistic forecasts
expected_calibration_error: binned confidence vs accuracy gap
reliability_curve: data for reliability (calibration) diagrams
References
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3.
Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. ICML.
Guo, C., et al. (2017). On calibration of modern neural networks. ICML.
- trustlens.metrics.calibration.brier_score(y_true: ndarray, y_prob: ndarray) -> float
Compute the Brier Score for a binary probabilistic classifier.
The Brier Score is the mean squared difference between predicted probabilities and actual outcomes. Lower is better: a perfect forecaster scores 0.0, and an uninformative forecaster that always predicts 0.5 scores 0.25.
\[\text{BS} = \frac{1}{N} \sum_{i=1}^{N} \bigl(\hat{p}_i - y_i\bigr)^2\]
- Parameters:
y_true (np.ndarray) – Binary ground-truth labels (0 or 1), shape (n_samples,).
y_prob (np.ndarray) – Predicted probabilities for the positive class, shape (n_samples,).
- Returns:
Brier Score in [0, 1].
- Return type:
float
- Raises:
ValueError – If y_true and y_prob have different lengths, or if y_true contains values outside {0, 1}.
Examples
>>> import numpy as np
>>> from trustlens.metrics.calibration import brier_score
>>> y_true = np.array([1, 0, 1, 1, 0])
>>> y_prob = np.array([0.9, 0.1, 0.8, 0.7, 0.3])
>>> round(float(brier_score(y_true, y_prob)), 3)
0.048
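The formula and the documented error conditions can be reproduced in a few lines of NumPy. This is a sketch written from the description above, not the library source; `brier_score_sketch` is a hypothetical name and the error messages are illustrative.

```python
import numpy as np

def brier_score_sketch(y_true, y_prob):
    """Binary Brier score: mean of (p_i - y_i)^2, with the input checks
    described in the Raises section."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    if y_true.shape != y_prob.shape:
        raise ValueError("y_true and y_prob must have the same length")
    if not np.isin(y_true, (0.0, 1.0)).all():
        raise ValueError("y_true must contain only 0 or 1")
    return float(np.mean((y_prob - y_true) ** 2))

labels = np.array([1, 0, 1, 0])

# Predicting the labels themselves gives the best possible score.
perfect = brier_score_sketch(labels, labels)  # 0.0

# Always predicting 0.5 gives (0.5)^2 for every sample.
constant_half = brier_score_sketch(labels, np.full(4, 0.5))  # 0.25

# Squared errors are about 0.04, 0.09, 0.16, 0.01, so the mean is ≈ 0.075.
example = brier_score_sketch(labels, [0.8, 0.3, 0.6, 0.1])
```

The 0.25 baseline is a useful sanity check: a binary model whose Brier score is at or above it is doing no better than a constant 0.5 prediction.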
- trustlens.metrics.calibration.expected_calibration_error(y_true: ndarray, y_prob: ndarray, n_bins: int = 10, strategy: str = 'uniform') -> float
Compute the Expected Calibration Error (ECE).
ECE is the weighted average of the absolute gap between mean predicted confidence, conf(B_b), and observed accuracy, acc(B_b), over B probability bins. Each bin B_b is weighted by its share of the N samples.
\[\text{ECE} = \sum_{b=1}^{B} \frac{|\mathcal{B}_b|}{N} \left|\text{acc}(\mathcal{B}_b) - \text{conf}(\mathcal{B}_b)\right|\]
- Parameters:
y_true (np.ndarray) – Binary ground-truth labels (0 or 1), shape (n_samples,).
y_prob (np.ndarray) – Predicted probabilities for the positive class, shape (n_samples,).
n_bins (int) – Number of confidence bins. Default 10.
strategy (str) – Binning strategy: "uniform" (equal-width) or "quantile" (equal-frequency). Default "uniform".
- Returns:
ECE value in [0, 1]. Lower is better.
- Return type:
float
Examples
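>>> import numpy as np
>>> from trustlens.metrics.calibration import expected_calibration_error
>>> y_true, y_prob = np.array([1, 0, 1, 1, 0]), np.array([0.9, 0.1, 0.8, 0.7, 0.3])
>>> ece = expected_calibration_error(y_true, y_prob, n_bins=10)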
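The binning step and the two strategies can be sketched as follows. This is an illustrative reading of the formula, not the TrustLens source: `ece_sketch` is a hypothetical name, and it assumes the binary convention in which samples are binned by positive-class probability and each bin's fraction of positives is compared with its mean predicted probability.

```python
import numpy as np

def ece_sketch(y_true, y_prob, n_bins=10, strategy="uniform"):
    """Weighted mean absolute gap between observed frequency and mean
    predicted probability, taken over probability bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    if strategy == "uniform":
        # Equal-width bins on [0, 1].
        edges = np.linspace(0.0, 1.0, n_bins + 1)
    elif strategy == "quantile":
        # Equal-frequency bins: edges follow the data's own quantiles.
        edges = np.quantile(y_prob, np.linspace(0.0, 1.0, n_bins + 1))
    else:
        raise ValueError("strategy must be 'uniform' or 'quantile'")
    # Interior edges assign every sample a bin index in 0..n_bins-1.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins carry zero weight
        weight = mask.mean()
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += weight * gap
    return float(ece)

labels = [0, 0, 1, 1, 1, 0, 1, 1]
probs = [0.05, 0.15, 0.25, 0.35, 0.65, 0.75, 0.85, 0.95]

# With two bins, the lower bin has mean probability 0.2 but 50% positives
# (gap 0.3); the upper bin has mean 0.8 and 75% positives (gap 0.05).
ece_uniform = ece_sketch(labels, probs, n_bins=2)                        # ≈ 0.175
ece_quantile = ece_sketch(labels, probs, n_bins=2, strategy="quantile")  # ≈ 0.175
```

The two strategies agree here only because the data is spread evenly across [0, 1]. When predictions cluster near 0 or 1, uniform bins leave most bins nearly empty, and quantile bins usually give a more stable estimate.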
- trustlens.metrics.calibration.reliability_curve(y_true: ndarray, y_prob: ndarray, n_bins: int = 10, strategy: str = 'uniform') -> tuple[ndarray, ndarray, ndarray]
Compute the reliability (calibration) curve data.
Returns the fraction of positives, the mean predicted probability, and the sample count for each confidence bin, in that order. Use this data with trustlens.visualization.plot_reliability_diagram to render a calibration plot.
- Parameters:
y_true (np.ndarray) – Binary ground-truth labels (0 or 1).
y_prob (np.ndarray) – Predicted probabilities for the positive class.
n_bins (int) – Number of confidence bins. Default 10.
strategy (str) – Binning strategy: "uniform" or "quantile". Default "uniform".
- Returns:
fraction_of_positives (np.ndarray) – Actual fraction of positive samples in each bin.
mean_predicted_value (np.ndarray) – Mean predicted probability in each bin.
bin_counts (np.ndarray) – Number of samples in each bin.
Examples
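>>> from trustlens.metrics.calibration import reliability_curve
>>> frac_pos, mean_pred, counts = reliability_curve(y_true, y_prob)  # arrays as in the brier_score example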
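The three returned arrays can also be inspected numerically, without plotting. The sketch below assumes arrays shaped like the documented return values; `summarize_reliability` is a hypothetical helper and the input numbers are made up for illustration. It reports the signed gap per bin and the count-weighted absolute gap, which matches the ECE formula when both use the same bins.

```python
import numpy as np

def summarize_reliability(fraction_of_positives, mean_predicted_value, bin_counts):
    """Return per-bin signed gaps and their count-weighted absolute mean.

    A positive gap means the mean predicted probability exceeds the
    observed frequency in that bin (over-prediction); a negative gap
    means under-prediction.
    """
    frac = np.asarray(fraction_of_positives, dtype=float)
    pred = np.asarray(mean_predicted_value, dtype=float)
    counts = np.asarray(bin_counts, dtype=float)
    occupied = counts > 0  # ignore empty bins, if any are reported
    gaps = pred[occupied] - frac[occupied]
    weights = counts[occupied] / counts.sum()
    weighted_gap = float(np.sum(weights * np.abs(gaps)))
    return gaps, weighted_gap

# Hypothetical two-bin output in the documented order.
frac_pos = np.array([0.50, 0.75])
mean_pred = np.array([0.20, 0.80])
counts = np.array([4, 4])

gaps, weighted_gap = summarize_reliability(frac_pos, mean_pred, counts)
# gaps ≈ [-0.30, 0.05]: the low bin under-predicts, the high bin slightly over-predicts
# weighted_gap ≈ 0.175
```

Keeping the sign of each gap is what distinguishes a reliability curve from ECE: two models with the same ECE can err in opposite directions, and only the per-bin view shows which.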