Open-source Python library · v0.3.0 · pip install trustlens

Your model has 92% accuracy.
That may still be unsafe.

TrustLens evaluates model reliability beyond accuracy — and produces a deployment decision backed by evidence, not instinct.

~10K Lines of Code
25 Source Files
16 Test Files
7 Contributors
4 Diagnostic Modules
MIT License
Failure Deep Dive: show_failures()

Confidently Wrong: The dangerous mistakes.

Accuracy hides the danger of failures. TrustLens isolates "Confidently Wrong" samples where the model is 95%+ certain but incorrect. These are the samples most likely to bypass human review and cause production disasters.

96.4% Mean confidence on top failures
20.1% Total error rate
Model A | Worst Offenders
#  Sample  True  Pred  Confidence  Danger
1  234     1     0     96.7%      CRITICAL
2  659     1     0     96.7%      CRITICAL
3  740     1     0     95.7%      CRITICAL

[Insight]: High-confidence mistakes detected. The model is certain it is right, but it is wrong. Overconfidence detected — consider calibration.
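The filter behind this view is easy to sketch. Below is a minimal, self-contained illustration of isolating confidently-wrong samples, not the actual `show_failures()` implementation; the function name and sorting are illustrative:

```python
# Sketch: isolate "confidently wrong" samples — high confidence, wrong label.
# Illustrative stand-in for TrustLens internals, not the real implementation.

def confidently_wrong(y_true, y_pred, confidence, threshold=0.95):
    """Return (index, confidence) pairs where the model is >= threshold
    confident but incorrect, most dangerous first."""
    hits = [
        (i, c)
        for i, (t, p, c) in enumerate(zip(y_true, y_pred, confidence))
        if p != t and c >= threshold
    ]
    return sorted(hits, key=lambda ic: -ic[1])   # highest confidence first

y_true = [1, 0, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1]
conf   = [0.97, 0.60, 0.96, 0.99, 0.55]
print(confidently_wrong(y_true, y_pred, conf))  # [(0, 0.97), (2, 0.96)]
```

The low-confidence mistake at index 4 is excluded: it is the high-confidence errors that bypass human review.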
Multi-Model Benchmarking

Compare candidates.
Pick the safest.

Accuracy is a shallow metric. TrustLens allows you to benchmark multiple models across calibration, failure risk, and bias — ensuring you ship the most reliable candidate, not just the one with the highest accuracy.

Model A · BLOCKED
Accuracy: 79.88%
Trust Score: 28/100
Grade: [D]
Primary Risks: Fairness violation · High failure risk
Model B · BLOCKED
Accuracy: 73.38%
Trust Score: 27/100
Grade: [D]
Primary Risks: Failure risk · Calibration error
Model C · BLOCKED
Accuracy: 78.62%
Trust Score: 42/100
Grade: [D]
Primary Risks: Failure risk · Fairness violation
BENCHMARK VERDICT: DO NOT DEPLOY

All candidates triggered critical diagnostic blocks. While Model C has the highest trust score, its fairness violations exceed safety thresholds. Recommendation: Retrain with class-weighted loss and bias mitigation.

$ pip install trustlens
v0.3.0 Latest · Python ≥ 3.9 · MIT License · CI Passing · Development Status: 4 – Beta
The Problem

Three ways your
metrics lie to you.

You trained a model. It hits 92% accuracy. You ship it. Three months later, a minority-class user gets consistently wrong predictions, the model is 90% confident on its worst mistakes, and when a regulator asks "why did it make that decision?", you have no answer.

01
⚠ Miscalibration
Confident on the Wrong Answers
A model saying "I'm 99% sure" when it's right only 60% of the time. ECE > 0.05 is a deployment red flag. You cannot set thresholds if the probabilities are wrong.
02
⚠ Silent Bias
High Overall, Broken for Some
High overall accuracy that masks significant performance drops for minority classes. The aggregate hides systematic subgroup failure — until it's too late and someone notices in production.
03
⚠ Fragile Representations
Classes Too Close to Each Other
Latent spaces where classes are so closely packed that slight input noise causes classification flips. Silhouette score reveals this. Accuracy never will.
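The ECE flagged in item 01 can be computed with a simple binning scheme. A minimal sketch, assuming confidences in [0, 1] and equal-width bins; TrustLens's own binning choices may differ:

```python
# Sketch: Expected Calibration Error (ECE) via equal-width confidence bins.
# Illustrative only — bin count and weighting scheme are assumptions.

def ece(confidence, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidence, correct):
        idx = min(int(c * n_bins), n_bins - 1)   # confidence assumed in [0, 1]
        bins[idx].append((c, ok))
    n = len(confidence)
    total = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        total += (len(b) / n) * abs(acc - avg_conf)
    return total

# A model that says "0.9" but is right only half the time is badly
# miscalibrated — ECE of about 0.4, far above the 0.05 red flag.
print(ece([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))
```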

Accuracy tells you how often your model is right.
It tells you nothing about when it fails, why it fails, or for whom it fails. TrustLens makes those failures visible before they reach production — and turns raw diagnostics into a deployment decision.

Quickstart

From pip install to
verdict in 30 seconds.

TrustLens is zero-friction. If your model has .predict() and .predict_proba(), you're ready to go. One function call. One report.

trustlens_demo.py
# Install once, run anywhere
$ pip install trustlens

from trustlens import analyze

# 1. Run the audit — that's it.
report = analyze(
    model, X_test, y_test,
    y_prob=model.predict_proba(X_test),
    sensitive_features={
        "gender": gender_test,
        "age_group": age_test,
    },
)

# 2. Inspect the verdict.
report.show()

# 3. Compare candidates.
from trustlens import compare
compare([report_a, report_b, report_c])

# 4. Export artifacts.
report.save("trust_report")   # JSON + plots
report.save("report.txt")     # human-readable
TRUST SCORE: 88 / 100
Grade: [B]
Assessment: Good Trust — minor issues to address
Base Score: 92
Penalties: −4.0 [Calibration]
Final Score: 88
APPROVED FOR DEPLOYMENT — review calibration before release
What this one call does
1
Resolves probabilities
Calls predict_proba if not supplied. Falls back gracefully.
2
Dispatches modules
Calibration, Failure, Bias, Representation — in parallel.
3
Scores & penalizes
Weighted composite score with penalties and blockers applied.
4
Returns TrustReport
Score, grade, verdict, insights, plots, exportable artifacts.
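Step 1's probability resolution can be sketched as follows. This is an illustrative stand-in, not TrustLens's actual code; the exact fallback behavior shown is an assumption:

```python
# Sketch: resolve probabilities as step 1 describes — use supplied y_prob,
# else call predict_proba, else fall back gracefully. Illustrative only.

def resolve_probabilities(model, X, y_prob=None):
    if y_prob is not None:
        return y_prob, "user-supplied"
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X), "predict_proba"
    # Graceful fallback: no probabilities available — probability-based
    # modules (e.g. calibration) would be skipped downstream.
    return None, "unavailable"

class HardClassifier:             # toy model with no predict_proba
    def predict(self, X):
        return [0 for _ in X]

class SoftClassifier:             # toy model with predict_proba
    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]

probs, source = resolve_probabilities(HardClassifier(), [[1], [2]])
print(source)  # unavailable

probs2, source2 = resolve_probabilities(SoftClassifier(), [[1]])
print(source2)  # predict_proba
```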
One-line demo
from trustlens import quick_analyze
quick_analyze(dataset="breast_cancer")
Trust Score Engine

One score. Traceable rules.
No black boxes.

The Trust Score combines diagnostic modules with explicit, auditable logic. Weights, penalties, and blockers are all visible — you can trace every deduction back to a specific metric failure.

// Trust Score Output
88 / 100
[B] · Good Trust
Calibration Sub-score: 84
Failure Sub-score: 91
Bias Sub-score: 95
Weighted Base Score: 92
Penalties Applied: −4.0
Final Trust Score: 88
// Dimension Weights
Calibration: 35% (0.35)
Failure: 30% (0.30)
Bias: 25% (0.25)
Representation: 10% (0.10)

When Representation is unavailable, remaining weights are renormalized to sum to 1.0.
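The renormalization rule is simple enough to show directly. A minimal sketch, assuming the weights listed above:

```python
# Sketch: renormalize dimension weights when a module (e.g. Representation)
# is unavailable, so the remaining weights again sum to 1.0. Illustrative.

WEIGHTS = {"calibration": 0.35, "failure": 0.30, "bias": 0.25, "representation": 0.10}

def renormalize(weights, available):
    active = {k: w for k, w in weights.items() if k in available}
    total = sum(active.values())
    return {k: w / total for k, w in active.items()}

# No embeddings supplied -> Representation drops out, others scale up.
w = renormalize(WEIGHTS, {"calibration", "failure", "bias"})
print(w)  # calibration ~0.389, failure ~0.333, bias ~0.278
```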

// Penalty Triggers
Failure sub-score below safety threshold · up to −20 · trigger: failure_score < 60
ECE exceeds calibration tolerance · up to −15 · trigger: ECE > 0.05
Severe fairness violations · up to −15 · trigger: subgroup disparity > limit
Total penalty cap · −35 max · preserves score interpretability
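The trigger table above can be sketched as a penalty function. The thresholds mirror the table; the per-trigger penalty curves and the disparity limit of 0.10 are illustrative assumptions, not TrustLens's actual formulas:

```python
# Sketch: apply the documented penalty triggers with the −35 total cap.
# Trigger thresholds match the table; penalty slopes are assumptions.

def apply_penalties(base_score, failure_score, ece, max_disparity,
                    disparity_limit=0.10):        # limit value is an assumption
    penalties = 0.0
    if failure_score < 60:
        penalties += min(20, (60 - failure_score) * 0.5)          # up to −20
    if ece > 0.05:
        penalties += min(15, (ece - 0.05) * 100)                  # up to −15
    if max_disparity > disparity_limit:
        penalties += min(15, (max_disparity - disparity_limit) * 50)  # up to −15
    penalties = min(penalties, 35)                # total cap of −35
    return round(base_score - penalties, 1)

# Mirrors the quickstart report: base 92, −4.0 calibration penalty.
print(apply_penalties(92, failure_score=80, ece=0.09, max_disparity=0.05))  # 88.0
```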
// Grade Bands
A · 90–100 · Production ready
B · 75–89 · Good, address issues
C · 55–74 · Investigate first
D · 0–54 · Block deployment
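The bands map to grades with plain threshold checks. A minimal sketch (blockers, covered below, can force Grade D regardless):

```python
# Sketch: map a 0–100 trust score to the documented grade bands.

def grade(score):
    if score >= 90: return "A"   # Production ready
    if score >= 75: return "B"   # Good, address issues
    if score >= 55: return "C"   # Investigate first
    return "D"                   # Block deployment

print([grade(s) for s in (95, 88, 60, 40)])  # ['A', 'B', 'C', 'D']
```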

Deployment Blockers — Override the Score

🚫
Confidently Wrong
Strong confidently-wrong behavior detected. Model is systematically overconfident on misclassified samples. Forces Grade D regardless of overall score.
🚫
Failure Below Critical
Failure sub-score < 40. Error concentration in high-confidence regions is too severe to ship. Verdict forced to do-not-deploy.
🚫
Severe Fairness Violation
A protected subgroup sees catastrophic performance gaps that exceed safety thresholds. Deployment would cause discriminatory outcomes.
🚫
Very Poor Calibration
ECE > 0.10. Predicted probabilities are so miscalibrated that downstream threshold-based decisions are unreliable. Blocks deployment.
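Blocker evaluation is independent of the weighted score: any single trigger forces a do-not-deploy verdict. A sketch with illustrative trigger names and function signature:

```python
# Sketch: blockers override the numeric score. The four checks mirror the
# list above; names and inputs are illustrative, not TrustLens's API.

def check_blockers(failure_score, ece, confidently_wrong, severe_bias):
    blockers = []
    if confidently_wrong:
        blockers.append("confidently_wrong")
    if failure_score < 40:
        blockers.append("failure_below_critical")
    if severe_bias:
        blockers.append("severe_fairness_violation")
    if ece > 0.10:
        blockers.append("very_poor_calibration")
    return blockers               # non-empty -> Grade D, do not deploy

b = check_blockers(failure_score=35, ece=0.12,
                   confidently_wrong=False, severe_bias=False)
print(b)  # ['failure_below_critical', 'very_poor_calibration']
```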
Diagnostic Modules

Four lenses on
model reliability.

Each module diagnoses a different failure mode. Together they produce one deployment-oriented verdict while preserving full diagnostic detail for root-cause analysis.

Calibration w = 0.35
Are probabilities trustworthy?
Checks whether predicted probabilities match real-world frequencies. A model saying "95% confident" should be right 95% of the time. ECE > 0.05 triggers a penalty. Essential for any downstream decision that depends on threshold quality.
Brier Score · ECE · Reliability Curve
Failure Analysis w = 0.30
Where are mistakes concentrated?
Focuses on risk concentration, not total error rate. Identifies "Confidently Wrong" behavior — where misclassifications occur in high-confidence regions. This is the most dangerous failure mode for production models and the second-highest weighted dimension.
Confidence Gap · Misclassification Summary · High-Conf Error Rate
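The confidence gap at the heart of this module compares mean confidence on correct versus incorrect predictions. A minimal illustration, not the module's actual code:

```python
# Sketch: the "confidence gap" — mean confidence when right minus mean
# confidence when wrong. A gap near zero means errors hide among the
# model's most confident predictions. Illustrative computation.

def confidence_gap(confidence, correct):
    right = [c for c, ok in zip(confidence, correct) if ok]
    wrong = [c for c, ok in zip(confidence, correct) if not ok]
    if not right or not wrong:
        return None               # gap undefined without both groups
    return sum(right) / len(right) - sum(wrong) / len(wrong)

# Healthy model: confident when right, hesitant when wrong -> large gap.
g1 = confidence_gap([0.95, 0.90, 0.60, 0.55], [1, 1, 0, 0])
# Dangerous model: equally confident either way -> gap near zero.
g2 = confidence_gap([0.95, 0.90, 0.96, 0.94], [1, 1, 0, 0])
print(g1, g2)
```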
Bias & Fairness w = 0.25
Does it fail some groups more?
Detects performance disparity across subgroups. Pass multiple sensitive features — gender, age, income — and TrustLens generates per-feature plots automatically. Equalized-odds checks expose group-wise TPR/FPR disparity. Supports EU AI Act transparency requirements.
Class Imbalance · Subgroup Performance · Equalized Odds · Multi-Feature Plots
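Equalized-odds checking reduces to comparing per-group TPR and FPR. A self-contained sketch of the idea for a binary task; TrustLens's exact aggregation may differ:

```python
# Sketch: group-wise TPR/FPR and the equalized-odds disparity (max gap
# across groups). Illustrative, binary labels only.

def rates(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

def equalized_odds_gap(y_true, y_pred, groups):
    per_group = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        per_group[g] = rates([y_true[i] for i in idx],
                             [y_pred[i] for i in idx])
    tprs = [r[0] for r in per_group.values()]
    fprs = [r[1] for r in per_group.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = equalized_odds_gap(y_true, y_pred, groups)
print(gap)  # 0.5 — group "b" sees worse TPR and FPR than group "a"
```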
Representation w = 0.10
Are embeddings geometrically sound?
Evaluates the geometry of latent spaces when embeddings are provided. Silhouette-based separability estimates how well-separated classes are. CKA utility enables representation similarity studies across architectures. Optional — only runs when embeddings are supplied.
Silhouette Score · Within/Between Distance · CKA Utility
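The within/between-distance intuition behind silhouette separability can be shown in pure Python. A toy illustration, not the module's implementation (which uses silhouette score, with subsampling for large n):

```python
# Sketch: mean within-class vs. between-class pairwise distances in an
# embedding space. Well-separated classes have within << between.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def within_between(embeddings, labels):
    within, between = [], []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(embeddings[i], embeddings[j])
            (within if labels[i] == labels[j] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)

# Two tight clusters far apart -> geometrically sound representation.
emb = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
lab = [0, 0, 1, 1]
w, b = within_between(emb, lab)
print(w < b)  # True: classes are well separated
```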
Fairness Visualization — Multi-Feature

Pass multiple sensitive features and TrustLens generates per-feature plots for every visualization type — no feature is silently dropped. Filenames are automatically sanitized, and features are processed in sorted order for deterministic output.

# Batch save — one call, all per-feature files
from trustlens.visualization import plot_module
plot_module("bias", report.results["bias"], save_dir="plots/")

# Or via report object
report.plot_bias(mode="all")
Visual Intelligence: Summary Plot
TrustLens Report - Model A · Trust Score 28/100 | Grade D | Low Trust - Blocked by diagnostic risk
[Summary plot panels: Trust Score, Calibration, Confidence Gap, Error Rate by Class, Class Distribution, Sub-score Breakdown]
Who It's For

Built for practitioners
who ship.

Zero-friction by design. If your model has .predict(), you're already ready. No TrustLens-specific concepts to learn before getting results.

ML Engineers
CI/CD Release Gating
Building mission-critical production systems where a high-confidence mistake has real consequences. Use TrustLens as a CI gate — block deploys when calibration regresses or bias spikes between model versions.
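A CI gate around a report might look like the sketch below. The `is_blocked` flag appears in the library's feature list, but the gate function and the mock report here are illustrative:

```python
# Sketch: a CI release gate. The report object is a mock stand-in for a
# TrustReport; only the is_blocked flag and a score attribute are assumed.

def ci_gate(report):
    """Return a process exit code: 0 to allow the deploy, 1 to block it."""
    if getattr(report, "is_blocked", False):
        print(f"BLOCKED: trust score {report.score}/100")
        return 1
    print(f"APPROVED: trust score {report.score}/100")
    return 0

class MockReport:                 # stand-in for a real TrustReport
    score = 28
    is_blocked = True

code = ci_gate(MockReport())
print(code)  # 1
# In a real pipeline: sys.exit(ci_gate(report))
```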
Data Scientists
Stakeholder Evidence
When you need to justify model decisions to stakeholders or regulators. TrustLens provides the visual and quantitative evidence needed — reliability diagrams, subgroup tables, confidence gap plots — not just a number.
Researchers
Beyond-Leaderboard Benchmarking
Benchmarking reliability of new architectures. Go beyond accuracy leaderboards — compare how different models represent classes geometrically or handle difficult edge cases in high-confidence regions.
AI Governance
Regulatory Transparency
Focused on safety, fairness, and regulatory compliance (EU AI Act, NIST AI RMF). Subgroup performance and bias reports provide the demographic transparency required for auditing and accountability documentation.
Architecture

Layered. Decoupled.
Extensible.

Each layer has a single responsibility. Metric computation, scoring logic, and report interpretation are decoupled — independently testable and swappable. The plugin system means new capabilities never touch core files.

Orchestration Layer
api.py · 307 lines
Validates inputs, resolves probabilities, dispatches enabled modules, assembles the final results payload. Single entry point: analyze(). No configuration classes or builder patterns required.
analyze() · compare() · quick_analyze() · Plugin dispatch
Metrics Layer
metrics/ · 4 modules
Independent, testable compute nodes. Calibration, Failure, Bias, Representation — each a self-contained module with no cross-module imports. Returns structured dicts for scoring and visualization.
calibration.py (~200) · failure.py (~140) · bias.py (~340) · representation.py (~180) · faithfulness.py
Scoring Layer
trust_score.py · 485 lines
Computes sub-scores and weighted composite. Redistributes weights for missing dimensions. Applies risk penalties and deployment blockers. All rules are explicit, traceable, and documented.
TrustScoreResult · Penalty engine · Blocker hierarchy · Weight redistribution
Reporting Layer
report.py · 1,263 lines
The central result container. Packages run metadata, module outputs, generates textual summaries, surfaces detected risk patterns, and renders visualizations. Also handles JSON/TXT serialization and artifact export.
show() · plot_bias() · summary_plot() · save() · to_dict()
Plugin Layer
plugins/ · Extensible
Singleton registry pattern. New capabilities extend the core without modifying it. Implement BasePlugin ABC, register with the registry — no changes to analyze() signature required. Entry-point ready for third-party distribution.
BasePlugin ABC · PluginRegistry · register() · list_plugins()
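The pattern described here, an abstract base class plus a registry, can be sketched in miniature. Everything beyond the names `BasePlugin`, `register()`, and `list_plugins()` is illustrative, not TrustLens's actual plugin API:

```python
# Sketch: ABC + registry plugin pattern. Method names and the example
# plugin are illustrative assumptions.
from abc import ABC, abstractmethod

class BasePlugin(ABC):
    name: str

    @abstractmethod
    def run(self, model, X, y):
        """Return a structured dict of results for the report."""

class PluginRegistry:
    def __init__(self):
        self._plugins = {}

    def register(self, plugin_cls):
        self._plugins[plugin_cls.name] = plugin_cls
        return plugin_cls          # usable as a class decorator

    def list_plugins(self):
        return sorted(self._plugins)

registry = PluginRegistry()

@registry.register
class NoisePlugin(BasePlugin):     # hypothetical third-party plugin
    name = "noise_robustness"
    def run(self, model, X, y):
        return {"score": 1.0}

print(registry.list_plugins())  # ['noise_robustness']
```

Because dispatch is by name string, registering a new plugin never changes the core entry point's signature.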
Visualization Layer
visualization/ · 7 modules
Accepts dicts — not TrustReport objects. Fully decoupled from the report class. Every metric has a corresponding visualization. Plots annotated with metric values. Matplotlib Agg backend for CI/server compatibility. 150 DPI minimum.
plot_module() · fairness.py (~400) · summary_plot (~400) · bias_plots · calibration_plots
Design Principles

Every decision in this library
was made on purpose.

New contributors read this before writing any code. These principles guide every API decision, every tradeoff, every line.

1
Simplicity > Complexity
A correct library that is never used has zero impact. TrustLens competes with the user's time. The primary API is a single function — no configuration classes, no session objects, no builders. Default parameters produce a useful result for 80% of users. Errors must be actionable, not cryptic.
// rules out: overengineered abstractions, TrustLens-specific concepts users must learn first
2
Modular by Design
Users who only want Brier Score should not pay the import cost of Grad-CAM. Each analysis area lives in its own module. No circular imports. Deep learning dependencies are optional extras. The plugin system ensures new capabilities extend — not couple — the core.
// rules out: monolithic files, mandatory heavy dependencies for lightweight features
3
Visual-First Outputs
An ECE of 0.042 means nothing until you see the reliability diagram and notice overconfidence specifically at high confidence. Every metric has a corresponding visualization. Every plot answers a specific question, not just displays data. Plots are self-contained with annotated metric values.
// rules out: raw number dumps, non-informative plots that don't annotate the metric they convey
4
Research + Practical Balance
A library used only in papers or only in production is half a library. TrustLens sits at the intersection — rigorous enough for researchers (correct math, citations, edge-case handling), accessible enough for practitioners (sklearn API, sensible defaults, fast runtimes).
// every metric links to its original paper in module docstring
5
Extensibility Without Fragility
New capabilities should not break existing ones. The plugin system is the primary extension mechanism. analyze() dispatches to modules by name string — adding a new module never changes the function signature. Backward compatibility is maintained within a major version.
// rules out: tight coupling, hardcoded module lists that must be updated on every addition
6
Test Everything
TrustLens is a trust tool — it must itself be trustworthy. Minimum 80% branch coverage. Every metric tested for perfect predictor and random predictor cases. Edge cases (empty input, single class, NaN, infinite values) are explicitly tested. 16 test files, end-to-end integration coverage.
// rules out: merging untested code, skipping tests for "obviously correct" functions
7
Performance is Not Optional
A tool that takes 10 minutes to run won't be run. Silhouette score uses subsampling for n > 5000. Faithfulness tests support configurable n_steps. Expensive operations are lazy — only run when the module is requested. Matplotlib Agg backend for CI/server compatibility.
// no unnecessary computation, no display requirement in CI environments
Why TrustLens

What standard tools
don't give you.

Most evaluation pipelines stop at one or two dimensions. TrustLens combines them all into a single deployment decision with traceable reasoning.

Capability | sklearn metrics | fairlearn | TrustLens
Calibration (ECE, Brier) | ✓ partial | – | ✓ Full + penalties
Failure / confidence gap analysis | – | – | ✓ Core module
Subgroup fairness + equalized odds | – | ✓ | ✓ + multi-feature
Embedding representation analysis | – | – | ✓ Optional module
Composite Trust Score + verdict | – | – | ✓ 0–100 with grade
Deployment blockers | – | – | ✓ Hard-stop rules
Multi-candidate model comparison | – | – | ✓ compare()
Plugin extensibility | – | – | ✓ BasePlugin ABC
Exportable reports (JSON, TXT, plots) | – | – | ✓ report.save()
CI/CD gate integration | – | – | ✓ is_blocked flag
By the Numbers

Built seriously.
Open source.

~10K
Lines of Code
25
Source Files
16
Test Files
7
Contributors
4
Core Modules
0.3.0
Current Version
Citation

Using TrustLens in research?

If TrustLens contributed to your work, cite it so others can find it. Even a GitHub star helps with discovery.

@software{trustlens2026,
  author = {Shahid Ul Islam},
  title = {TrustLens: Debug your ML models beyond accuracy},
  year = {2026},
  url = {https://github.com/Khanz9664/TrustLens},
}
Get Started

Three ways to start

shell
# Option 1: Install and audit
$ pip install trustlens

# Option 2: Full demo
$ python demo.py

# Option 3: Comprehensive audit
$ python examples/comprehensive_audit.py