Open-source Python library · v0.3.0 · pip install trustlens

Your model has 92% accuracy.
That may still be unsafe.

TrustLens evaluates model reliability beyond accuracy — and produces a deployment decision backed by evidence, not instinct.

~10K Lines of Code
25 Source Files
16 Test Files
7 Contributors
4 Diagnostic Modules
MIT License
Failure Deep Dive: show_failures()

Confidently Wrong: The dangerous mistakes.

Accuracy hides the danger of failures. TrustLens isolates "Confidently Wrong" samples where the model is 95%+ certain but incorrect. These are the samples most likely to bypass human review and cause production disasters.

96.4% Mean confidence on top failures
20.1% Total error rate
Model A | Worst Offenders
#  Sample  True  Pred  Confidence  Danger
1  234     1     0     96.7%      CRITICAL
2  659     1     0     96.7%      CRITICAL
3  740     1     0     95.7%      CRITICAL

[Insight]: High-confidence mistakes detected. The model is certain it is right, but it is wrong. Overconfidence detected — consider calibration.
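The filter behind this view is easy to sketch. Below is a minimal, self-contained illustration of isolating confidently-wrong samples, not the actual `show_failures()` implementation; the function name and sorting are illustrative:

```python
# Sketch: isolate "confidently wrong" samples — high confidence, wrong label.
# Illustrative stand-in for TrustLens internals, not the real implementation.

def confidently_wrong(y_true, y_pred, confidence, threshold=0.95):
    """Return (index, confidence) pairs where the model is >= threshold
    confident but incorrect, most dangerous first."""
    hits = [
        (i, c)
        for i, (t, p, c) in enumerate(zip(y_true, y_pred, confidence))
        if p != t and c >= threshold
    ]
    return sorted(hits, key=lambda ic: -ic[1])   # highest confidence first

y_true = [1, 0, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1]
conf   = [0.97, 0.60, 0.96, 0.99, 0.55]
print(confidently_wrong(y_true, y_pred, conf))  # [(0, 0.97), (2, 0.96)]
```

The low-confidence mistake at index 4 is excluded: it is the high-confidence errors that bypass human review.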
Multi-Model Benchmarking

Compare candidates.
Pick the safest.

Accuracy is a shallow metric. TrustLens allows you to benchmark multiple models across calibration, failure risk, and bias — ensuring you ship the most reliable candidate, not just the one with the highest accuracy.

Model A · BLOCKED
Accuracy: 79.88%
Trust Score: 28/100
Grade: [D]
Primary Risks: Fairness violation · High failure risk
Model B · BLOCKED
Accuracy: 73.38%
Trust Score: 27/100
Grade: [D]
Primary Risks: Failure risk · Calibration error
Model C · BLOCKED
Accuracy: 78.62%
Trust Score: 42/100
Grade: [D]
Primary Risks: Failure risk · Fairness violation
BENCHMARK VERDICT: DO NOT DEPLOY

All candidates triggered critical diagnostic blocks. While Model C has the highest trust score, its fairness violations exceed safety thresholds. Recommendation: Retrain with class-weighted loss and bias mitigation.

$ pip install trustlens
v0.3.0 Latest · Python ≥ 3.9 · MIT License · CI Passing · Development Status: 4 – Beta
The Problem

Three ways your
metrics lie to you.

You trained a model. It hits 92% accuracy. You ship it. Three months later, a minority-class user gets consistently wrong predictions, the model is 90% confident on its worst mistakes, and when a regulator asks "why did it make that decision?", you have no answer.

01
⚠ Miscalibration
Confident on the Wrong Answers
A model saying "I'm 99% sure" when it's right only 60% of the time. ECE > 0.05 is a deployment red flag. You cannot set thresholds if the probabilities are wrong.
02
⚠ Silent Bias
High Overall, Broken for Some
High overall accuracy that masks significant performance drops for minority classes. The aggregate hides systematic subgroup failure — until it's too late and someone notices in production.
03
⚠ Fragile Representations
Classes Too Close to Each Other
Latent spaces where classes are so closely packed that slight input noise causes classification flips. Silhouette score reveals this. Accuracy never will.
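The ECE flagged in item 01 can be computed with a simple binning scheme. A minimal sketch, assuming confidences in [0, 1] and equal-width bins; TrustLens's own binning choices may differ:

```python
# Sketch: Expected Calibration Error (ECE) via equal-width confidence bins.
# Illustrative only — bin count and weighting scheme are assumptions.

def ece(confidence, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidence, correct):
        idx = min(int(c * n_bins), n_bins - 1)   # confidence assumed in [0, 1]
        bins[idx].append((c, ok))
    n = len(confidence)
    total = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        total += (len(b) / n) * abs(acc - avg_conf)
    return total

# A model that says "0.9" but is right only half the time is badly
# miscalibrated — ECE of about 0.4, far above the 0.05 red flag.
print(ece([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))
```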

Accuracy tells you how often your model is right.
It tells you nothing about when it fails, why it fails, or for whom it fails. TrustLens makes those failures visible before they reach production — and turns raw diagnostics into a deployment decision.

Quickstart

From pip install to
verdict in 30 seconds.

TrustLens is zero-friction. If your model has .predict() and .predict_proba(), you're ready to go. One function call. One report.

trustlens_demo.py
# Install once, run anywhere
$ pip install trustlens

from trustlens import analyze

# 1. Run the audit — that's it.
report = analyze(
    model, X_test, y_test,
    y_prob=model.predict_proba(X_test),
    sensitive_features={
        "gender": gender_test,
        "age_group": age_test,
    },
)

# 2. Inspect the verdict.
report.show()

# 3. Compare candidates.
from trustlens import compare
compare([report_a, report_b, report_c])

# 4. Export artifacts.
report.save("trust_report")   # JSON + plots
report.save("report.txt")     # human-readable
TRUST SCORE: 88 / 100
Grade: [B]
Assessment: Good Trust — minor issues to address
Base Score: 92
Penalties: −4.0 [Calibration]
Final Score: 88
APPROVED FOR DEPLOYMENT — review calibration before release
What this one call does
1
Resolves probabilities
Calls predict_proba if not supplied. Falls back gracefully.
2
Dispatches modules
Calibration, Failure, Bias, Representation — in parallel.
3
Scores & penalizes
Weighted composite score with penalties and blockers applied.
4
Returns TrustReport
Score, grade, verdict, insights, plots, exportable artifacts.
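Step 1's probability resolution can be sketched as follows. This is an illustrative stand-in, not TrustLens's actual code; the exact fallback behavior shown is an assumption:

```python
# Sketch: resolve probabilities as step 1 describes — use supplied y_prob,
# else call predict_proba, else fall back gracefully. Illustrative only.

def resolve_probabilities(model, X, y_prob=None):
    if y_prob is not None:
        return y_prob, "user-supplied"
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X), "predict_proba"
    # Graceful fallback: no probabilities available — probability-based
    # modules (e.g. calibration) would be skipped downstream.
    return None, "unavailable"

class HardClassifier:             # toy model with no predict_proba
    def predict(self, X):
        return [0 for _ in X]

class SoftClassifier:             # toy model with predict_proba
    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]

probs, source = resolve_probabilities(HardClassifier(), [[1], [2]])
print(source)  # unavailable

probs2, source2 = resolve_probabilities(SoftClassifier(), [[1]])
print(source2)  # predict_proba
```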
One-line demo
from trustlens import quick_analyze
quick_analyze(dataset="breast_cancer")
Trust Score Engine

One score. Traceable rules.
No black boxes.

The Trust Score combines diagnostic modules with explicit, auditable logic. Weights, penalties, and blockers are all visible — you can trace every deduction back to a specific metric failure.

// Trust Score Output
88 / 100
[B] · Good Trust
Calibration Sub-score: 84
Failure Sub-score: 91
Bias Sub-score: 95
Weighted Base Score: 92
Penalties Applied: −4.0
Final Trust Score: 88
// Dimension Weights
Calibration: 35% (0.35)
Failure: 30% (0.30)
Bias: 25% (0.25)
Representation: 10% (0.10)

When Representation is unavailable, remaining weights are renormalized to sum to 1.0.
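The renormalization rule is simple enough to show directly. A minimal sketch, assuming the weights listed above:

```python
# Sketch: renormalize dimension weights when a module (e.g. Representation)
# is unavailable, so the remaining weights again sum to 1.0. Illustrative.

WEIGHTS = {"calibration": 0.35, "failure": 0.30, "bias": 0.25, "representation": 0.10}

def renormalize(weights, available):
    active = {k: w for k, w in weights.items() if k in available}
    total = sum(active.values())
    return {k: w / total for k, w in active.items()}

# No embeddings supplied -> Representation drops out, others scale up.
w = renormalize(WEIGHTS, {"calibration", "failure", "bias"})
print(w)  # calibration ~0.389, failure ~0.333, bias ~0.278
```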

// Penalty Triggers
Failure sub-score below safety threshold · up to −20 · trigger: failure_score < 60
ECE exceeds calibration tolerance · up to −15 · trigger: ECE > 0.05
Severe fairness violations · up to −15 · trigger: subgroup disparity > limit
Total penalty cap · −35 max · preserves score interpretability
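The trigger table above can be sketched as a penalty function. The thresholds mirror the table; the per-trigger penalty curves and the disparity limit of 0.10 are illustrative assumptions, not TrustLens's actual formulas:

```python
# Sketch: apply the documented penalty triggers with the −35 total cap.
# Trigger thresholds match the table; penalty slopes are assumptions.

def apply_penalties(base_score, failure_score, ece, max_disparity,
                    disparity_limit=0.10):        # limit value is an assumption
    penalties = 0.0
    if failure_score < 60:
        penalties += min(20, (60 - failure_score) * 0.5)          # up to −20
    if ece > 0.05:
        penalties += min(15, (ece - 0.05) * 100)                  # up to −15
    if max_disparity > disparity_limit:
        penalties += min(15, (max_disparity - disparity_limit) * 50)  # up to −15
    penalties = min(penalties, 35)                # total cap of −35
    return round(base_score - penalties, 1)

# Mirrors the quickstart report: base 92, −4.0 calibration penalty.
print(apply_penalties(92, failure_score=80, ece=0.09, max_disparity=0.05))  # 88.0
```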
// Grade Bands
A · 90–100 · Production ready
B · 75–89 · Good, address issues
C · 55–74 · Investigate first
D · 0–54 · Block deployment
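The bands map to grades with plain threshold checks. A minimal sketch (blockers, covered below, can force Grade D regardless):

```python
# Sketch: map a 0–100 trust score to the documented grade bands.

def grade(score):
    if score >= 90: return "A"   # Production ready
    if score >= 75: return "B"   # Good, address issues
    if score >= 55: return "C"   # Investigate first
    return "D"                   # Block deployment

print([grade(s) for s in (95, 88, 60, 40)])  # ['A', 'B', 'C', 'D']
```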

Deployment Blockers — Override the Score

🚫
Confidently Wrong
Strong confidently-wrong behavior detected. Model is systematically overconfident on misclassified samples. Forces Grade D regardless of overall score.
🚫
Failure Below Critical
Failure sub-score < 40. Error concentration in high-confidence regions is too severe to ship. Verdict forced to do-not-deploy.
🚫
Severe Fairness Violation
A protected subgroup sees catastrophic performance gaps that exceed safety thresholds. Deployment would cause discriminatory outcomes.
🚫
Very Poor Calibration
ECE > 0.10. Predicted probabilities are so miscalibrated that downstream threshold-based decisions are unreliable. Blocks deployment.
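Blocker evaluation is independent of the weighted score: any single trigger forces a do-not-deploy verdict. A sketch with illustrative trigger names and function signature:

```python
# Sketch: blockers override the numeric score. The four checks mirror the
# list above; names and inputs are illustrative, not TrustLens's API.

def check_blockers(failure_score, ece, confidently_wrong, severe_bias):
    blockers = []
    if confidently_wrong:
        blockers.append("confidently_wrong")
    if failure_score < 40:
        blockers.append("failure_below_critical")
    if severe_bias:
        blockers.append("severe_fairness_violation")
    if ece > 0.10:
        blockers.append("very_poor_calibration")
    return blockers               # non-empty -> Grade D, do not deploy

b = check_blockers(failure_score=35, ece=0.12,
                   confidently_wrong=False, severe_bias=False)
print(b)  # ['failure_below_critical', 'very_poor_calibration']
```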
Diagnostic Modules

Four lenses on
model reliability.

Each module diagnoses a different failure mode. Together they produce one deployment-oriented verdict while preserving full diagnostic detail for root-cause analysis.

Calibration w = 0.35
Are probabilities trustworthy?
Checks whether predicted probabilities match real-world frequencies. A model saying "95% confident" should be right 95% of the time. ECE > 0.05 triggers a penalty. Essential for any downstream decision that depends on threshold quality.
Brier Score · ECE · Reliability Curve
Failure Analysis w = 0.30
Where are mistakes concentrated?
Focuses on risk concentration, not total error rate. Identifies "Confidently Wrong" behavior — where misclassifications occur in high-confidence regions. This is the most dangerous failure mode for production models and the second-highest weighted dimension.
Confidence Gap · Misclassification Summary · High-Conf Error Rate
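The confidence gap at the heart of this module compares mean confidence on correct versus incorrect predictions. A minimal illustration, not the module's actual code:

```python
# Sketch: the "confidence gap" — mean confidence when right minus mean
# confidence when wrong. A gap near zero means errors hide among the
# model's most confident predictions. Illustrative computation.

def confidence_gap(confidence, correct):
    right = [c for c, ok in zip(confidence, correct) if ok]
    wrong = [c for c, ok in zip(confidence, correct) if not ok]
    if not right or not wrong:
        return None               # gap undefined without both groups
    return sum(right) / len(right) - sum(wrong) / len(wrong)

# Healthy model: confident when right, hesitant when wrong -> large gap.
g1 = confidence_gap([0.95, 0.90, 0.60, 0.55], [1, 1, 0, 0])
# Dangerous model: equally confident either way -> gap near zero.
g2 = confidence_gap([0.95, 0.90, 0.96, 0.94], [1, 1, 0, 0])
print(g1, g2)
```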
Bias & Fairness w = 0.25
Does it fail some groups more?
Detects performance disparity across subgroups. Pass multiple sensitive features — gender, age, income — and TrustLens generates per-feature plots automatically. Equalized-odds checks expose group-wise TPR/FPR disparity. Supports EU AI Act transparency requirements.
Class Imbalance · Subgroup Performance · Equalized Odds · Multi-Feature Plots
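Equalized-odds checking reduces to comparing per-group TPR and FPR. A self-contained sketch of the idea for a binary task; TrustLens's exact aggregation may differ:

```python
# Sketch: group-wise TPR/FPR and the equalized-odds disparity (max gap
# across groups). Illustrative, binary labels only.

def rates(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

def equalized_odds_gap(y_true, y_pred, groups):
    per_group = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        per_group[g] = rates([y_true[i] for i in idx],
                             [y_pred[i] for i in idx])
    tprs = [r[0] for r in per_group.values()]
    fprs = [r[1] for r in per_group.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = equalized_odds_gap(y_true, y_pred, groups)
print(gap)  # 0.5 — group "b" sees worse TPR and FPR than group "a"
```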
Representation w = 0.10
Are embeddings geometrically sound?
Evaluates the geometry of latent spaces when embeddings are provided. Silhouette-based separability estimates how well-separated classes are. CKA utility enables representation similarity studies across architectures. Optional — only runs when embeddings are supplied.
Silhouette Score · Within/Between Distance · CKA Utility
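The within/between-distance intuition behind silhouette separability can be shown in pure Python. A toy illustration, not the module's implementation (which uses silhouette score, with subsampling for large n):

```python
# Sketch: mean within-class vs. between-class pairwise distances in an
# embedding space. Well-separated classes have within << between.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def within_between(embeddings, labels):
    within, between = [], []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(embeddings[i], embeddings[j])
            (within if labels[i] == labels[j] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)

# Two tight clusters far apart -> geometrically sound representation.
emb = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
lab = [0, 0, 1, 1]
w, b = within_between(emb, lab)
print(w < b)  # True: classes are well separated
```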
Fairness Visualization — Multi-Feature

Pass multiple sensitive features and TrustLens generates per-feature plots for every visualization type — no feature is silently dropped. Filenames are automatically sanitized, and features are processed in sorted order for deterministic output.

# Batch save — one call, all per-feature files
from trustlens.visualization import plot_module
plot_module("bias", report.results["bias"], save_dir="plots/")

# Or via report object
report.plot_bias(mode="all")
Visual Intelligence: Summary Plot
TrustLens Report - Model A · Trust Score 28/100 | Grade D | Low Trust - Blocked by diagnostic risk
[Summary plot panels: Trust Score, Calibration, Confidence Gap, Error Rate by Class, Class Distribution, Sub-score Breakdown]
Who It's For

Built for practitioners
who ship.

Zero-friction by design. If your model has .predict(), you're already ready. No TrustLens-specific concepts to learn before getting results.

ML Engineers
CI/CD Release Gating
Building mission-critical production systems where a high-confidence mistake has real consequences. Use TrustLens as a CI gate — block deploys when calibration regresses or bias spikes between model versions.
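A CI gate around a report might look like the sketch below. The `is_blocked` flag appears in the library's feature list, but the gate function and the mock report here are illustrative:

```python
# Sketch: a CI release gate. The report object is a mock stand-in for a
# TrustReport; only the is_blocked flag and a score attribute are assumed.

def ci_gate(report):
    """Return a process exit code: 0 to allow the deploy, 1 to block it."""
    if getattr(report, "is_blocked", False):
        print(f"BLOCKED: trust score {report.score}/100")
        return 1
    print(f"APPROVED: trust score {report.score}/100")
    return 0

class MockReport:                 # stand-in for a real TrustReport
    score = 28
    is_blocked = True

code = ci_gate(MockReport())
print(code)  # 1
# In a real pipeline: sys.exit(ci_gate(report))
```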
Data Scientists
Stakeholder Evidence
When you need to justify model decisions to stakeholders or regulators. TrustLens provides the visual and quantitative evidence needed — reliability diagrams, subgroup tables, confidence gap plots — not just a number.
Researchers
Beyond-Leaderboard Benchmarking
Benchmarking reliability of new architectures. Go beyond accuracy leaderboards — compare how different models represent classes geometrically or handle difficult edge cases in high-confidence regions.
AI Governance
Regulatory Transparency
Focused on safety, fairness, and regulatory compliance (EU AI Act, NIST AI RMF). Subgroup performance and bias reports provide the demographic transparency required for auditing and accountability documentation.
Architecture

Layered. Decoupled.
Extensible.

Each layer has a single responsibility. Metric computation, scoring logic, and report interpretation are decoupled — independently testable and swappable. The plugin system means new capabilities never touch core files.

Orchestration Layer
api.py · 307 lines
Validates inputs, resolves probabilities, dispatches enabled modules, assembles the final results payload. Single entry point: analyze(). No configuration classes or builder patterns required.
analyze() · compare() · quick_analyze() · Plugin dispatch
Metrics Layer
metrics/ · 4 modules
Independent, testable compute nodes. Calibration, Failure, Bias, Representation — each a self-contained module with no cross-module imports. Returns structured dicts for scoring and visualization.
calibration.py (~200) · failure.py (~140) · bias.py (~340) · representation.py (~180) · faithfulness.py
Scoring Layer
trust_score.py · 485 lines
Computes sub-scores and weighted composite. Redistributes weights for missing dimensions. Applies risk penalties and deployment blockers. All rules are explicit, traceable, and documented.
TrustScoreResult · Penalty engine · Blocker hierarchy · Weight redistribution
Reporting Layer
report.py · 1,263 lines
The central result container. Packages run metadata, module outputs, generates textual summaries, surfaces detected risk patterns, and renders visualizations. Also handles JSON/TXT serialization and artifact export.
show() · plot_bias() · summary_plot() · save() · to_dict()
Plugin Layer
plugins/ · Extensible
Singleton registry pattern. New capabilities extend the core without modifying it. Implement BasePlugin ABC, register with the registry — no changes to analyze() signature required. Entry-point ready for third-party distribution.
BasePlugin ABC · PluginRegistry · register() · list_plugins()
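The pattern described here, an abstract base class plus a registry, can be sketched in miniature. Everything beyond the names `BasePlugin`, `register()`, and `list_plugins()` is illustrative, not TrustLens's actual plugin API:

```python
# Sketch: ABC + registry plugin pattern. Method names and the example
# plugin are illustrative assumptions.
from abc import ABC, abstractmethod

class BasePlugin(ABC):
    name: str

    @abstractmethod
    def run(self, model, X, y):
        """Return a structured dict of results for the report."""

class PluginRegistry:
    def __init__(self):
        self._plugins = {}

    def register(self, plugin_cls):
        self._plugins[plugin_cls.name] = plugin_cls
        return plugin_cls          # usable as a class decorator

    def list_plugins(self):
        return sorted(self._plugins)

registry = PluginRegistry()

@registry.register
class NoisePlugin(BasePlugin):     # hypothetical third-party plugin
    name = "noise_robustness"
    def run(self, model, X, y):
        return {"score": 1.0}

print(registry.list_plugins())  # ['noise_robustness']
```

Because dispatch is by name string, registering a new plugin never changes the core entry point's signature.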
Visualization Layer
visualization/ · 7 modules
Accepts dicts — not TrustReport objects. Fully decoupled from the report class. Every metric has a corresponding visualization. Plots annotated with metric values. Matplotlib Agg backend for CI/server compatibility. 150 DPI minimum.
plot_module() · fairness.py (~400) · summary_plot (~400) · bias_plots · calibration_plots
Design Principles

Every decision in this library
was made on purpose.

New contributors read this before writing any code. These principles guide every API decision, every tradeoff, every line.

1
Simplicity > Complexity
A correct library that is never used has zero impact. TrustLens competes with the user's time. The primary API is a single function — no configuration classes, no session objects, no builders. Default parameters produce a useful result for 80% of users. Errors must be actionable, not cryptic.
// rules out: overengineered abstractions, TrustLens-specific concepts users must learn first
2
Modular by Design
Users who only want Brier Score should not pay the import cost of Grad-CAM. Each analysis area lives in its own module. No circular imports. Deep learning dependencies are optional extras. The plugin system ensures new capabilities extend — not couple — the core.
// rules out: monolithic files, mandatory heavy dependencies for lightweight features
3
Visual-First Outputs
An ECE of 0.042 means nothing until you see the reliability diagram and notice overconfidence specifically at high confidence. Every metric has a corresponding visualization. Every plot answers a specific question, not just displays data. Plots are self-contained with annotated metric values.
// rules out: raw number dumps, non-informative plots that don't annotate the metric they convey
4
Research + Practical Balance
A library used only in papers or only in production is half a library. TrustLens sits at the intersection — rigorous enough for researchers (correct math, citations, edge-case handling), accessible enough for practitioners (sklearn API, sensible defaults, fast runtimes).
// every metric links to its original paper in module docstring
5
Extensibility Without Fragility
New capabilities should not break existing ones. The plugin system is the primary extension mechanism. analyze() dispatches to modules by name string — adding a new module never changes the function signature. Backward compatibility is maintained within a major version.
// rules out: tight coupling, hardcoded module lists that must be updated on every addition
6
Test Everything
TrustLens is a trust tool — it must itself be trustworthy. Minimum 80% branch coverage. Every metric tested for perfect predictor and random predictor cases. Edge cases (empty input, single class, NaN, infinite values) are explicitly tested. 16 test files, end-to-end integration coverage.
// rules out: merging untested code, skipping tests for "obviously correct" functions
7
Performance is Not Optional
A tool that takes 10 minutes to run won't be run. Silhouette score uses subsampling for n > 5000. Faithfulness tests support configurable n_steps. Expensive operations are lazy — only run when the module is requested. Matplotlib Agg backend for CI/server compatibility.
// no unnecessary computation, no display requirement in CI environments
Why TrustLens

What standard tools
don't give you.

Most evaluation pipelines stop at one or two dimensions. TrustLens combines them all into a single deployment decision with traceable reasoning.

Capability | sklearn metrics | fairlearn | TrustLens
Calibration (ECE, Brier) | ✓ partial | – | ✓ Full + penalties
Failure / confidence gap analysis | – | – | ✓ Core module
Subgroup fairness + equalized odds | – | ✓ | ✓ + multi-feature
Embedding representation analysis | – | – | ✓ Optional module
Composite Trust Score + verdict | – | – | ✓ 0–100 with grade
Deployment blockers | – | – | ✓ Hard-stop rules
Multi-candidate model comparison | – | – | ✓ compare()
Plugin extensibility | – | – | ✓ BasePlugin ABC
Exportable reports (JSON, TXT, plots) | – | – | ✓ report.save()
CI/CD gate integration | – | – | ✓ is_blocked flag
By the Numbers

Built seriously.
Open source.

~10K
Lines of Code
25
Source Files
16
Test Files
7
Contributors
4
Core Modules
0.3.0
Current Version
Citation

Using TrustLens in research?

If TrustLens contributed to your work, cite it so others can find it. Even a GitHub star helps with discovery.

@software{trustlens2026,
  author = {Shahid Ul Islam},
  title = {TrustLens: Debug your ML models beyond accuracy},
  year = {2026},
  url = {https://github.com/Khanz9664/TrustLens},
}
Get Started

Three ways to start

shell
# Option 1: Install and audit
$ pip install trustlens

# Option 2: Full demo
$ python demo.py

# Option 3: Comprehensive audit
$ python examples/comprehensive_audit.py