Case Study · Dec 2024

Breast Cancer
Diagnostic Support
System

Transforming FNA cytology data into clinically accountable, explainable AI decisions — 98.2% accuracy with full SHAP transparency.

Python 3.10+ · Random Forest · SHAP XAI · Streamlit · WDBC Dataset
98.2% Accuracy · 0.999 AUC-ROC · 569 Samples · 2 False Negatives

Executive Summary

Breast cancer is the most commonly diagnosed cancer in women globally. Early, accurate classification of a tumor as benign or malignant is the pivotal moment in a patient's care pathway — catching malignancy early dramatically improves survival outcomes while avoiding unnecessary surgical interventions for benign cases.

This project delivers a production-grade clinical decision support system that classifies breast tumors in real-time from 30 nuclear morphology measurements derived from Fine Needle Aspiration (FNA) cytology.

The system wraps its prediction inside a fully transparent, explainable, and auditable platform — making it suitable for medical researchers and clinical informaticians who need to trust and interpret every decision.

Design philosophy: A prediction is only clinically useful if the clinician can understand, challenge, and reproduce the reasoning behind it. Accuracy alone is insufficient — transparency is mandatory.

// SYSTEM AT A GLANCE
Input
30 FNA cytology features numeric
Dataset
WDBC — 569 clinical samples
Engine
Random Forest Ensemble v2 sklearn
Accuracy
98.2% · AUC 0.999
XAI
SHAP + Rule-based clinical engine
Output
Benign / Malignant + confidence + PDF report
Ops
Z-score drift detection · Audit log

The Clinical & Engineering Gap

Pathologists face a recurring challenge: manual interpretation of FNA cytology is subjective, time-consuming, and prone to inter-observer variability. Existing AI tools compound the problem with black-box outputs and zero operational transparency.

False Negative
Missing a malignancy delays life-saving treatment and leads to poorer prognosis. Every missed case is a catastrophic failure in the care pathway.
vs.
False Positive
Flagging a benign tumor as malignant causes unnecessary surgery, patient anxiety, and significant financial burden — with no clinical benefit.

Problem Statement: How can we build an AI system that not only classifies breast tumors with near-perfect accuracy, but is also transparent, operationally monitored, clinically actionable, and ethically documented — such that a clinician can trust and use it with confidence?

Engineering Gap Analysis

Pain Point | Legacy Tools | This System
Black-box predictions | No explanation for the decision | SHAP feature attribution per prediction
Fixed threshold | 0.50 default, ignores clinical stakes | Adjustable live threshold slider
No drift monitoring | Silent model degradation | Z-score drift detection per feature
No audit trail | No persistent prediction log | Timestamped CSV audit log
No PDF report | Manual copy-paste documentation | Auto-generated structured PDF
No robustness check | Single-point probability estimate | Tree-level variance + 95% CI
No research utility | Static production tool only | Synthetic data + PCA research lab
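The adjustable threshold called out above reduces to a single comparison against the model's malignancy probability. A sketch (the function name and example probabilities are hypothetical):

```python
import numpy as np

def classify_with_threshold(proba_malignant: np.ndarray, threshold: float = 0.5):
    """Apply a clinician-chosen decision threshold to malignancy probabilities.

    Lowering the threshold trades specificity for sensitivity: fewer
    missed malignancies at the cost of more false alarms.
    """
    return np.where(proba_malignant >= threshold, "Malignant", "Benign")

probs = np.array([0.08, 0.35, 0.62, 0.97])
print(classify_with_threshold(probs, threshold=0.5))  # default operating point
print(classify_with_threshold(probs, threshold=0.3))  # higher-sensitivity setting
```

At the 0.3 setting, the borderline 0.35 case flips from Benign to Malignant, which is exactly the kind of stakes-driven control a fixed 0.50 default denies the clinician.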

Dataset & Domain Context

The Wisconsin Diagnostic Breast Cancer (WDBC) dataset represents digitized nuclear morphology measurements from Fine Needle Aspiration imaging, sourced from the UCI Machine Learning Repository.

Total Samples
569
Digitized FNA cytology measurements from breast mass aspirates.
Features
30
Numeric, continuous. No missing values. No imputation required.
Class Ratio
1.68:1
Manageable imbalance. No SMOTE or oversampling was necessary.
Provenance
1995
Wolberg, Street & Mangasarian — UCI ML Repository benchmark dataset.
569 SAMPLES
Benign
357 samples · 62.7%
Malignant
212 samples · 37.3%
Split
80% train · 20% test (stratified)

Feature Architecture

10 features
Mean Measurements
  • Radius
  • Texture
  • Perimeter
  • Area
  • Smoothness
  • Compactness
  • Concavity
  • Concave Points
  • Symmetry
  • Fractal Dim.
10 features
Standard Error
  • Radius SE
  • Texture SE
  • Perimeter SE
  • Area SE
  • Smoothness SE
  • Compactness SE
  • Concavity SE
  • Concave Pts SE
  • Symmetry SE
  • Fractal Dim. SE
10 features · highest predictive power
Worst-Case Values
  • Radius Worst
  • Texture Worst
  • Perimeter Worst
  • Area Worst
  • Smoothness W.
  • Compactness W.
  • Concavity W.
  • Concave Pts W.
  • Symmetry W.
  • Fractal Dim. W.
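The three 10-feature groups above can be recovered programmatically from the dataset's feature names (scikit-learn spells them "mean radius", "radius error", "worst radius", and so on):

```python
from sklearn.datasets import load_breast_cancer

names = list(load_breast_cancer().feature_names)

# Partition the 30 features into the three groups described above.
mean_feats  = [n for n in names if n.startswith("mean ")]
error_feats = [n for n in names if n.endswith(" error")]
worst_feats = [n for n in names if n.startswith("worst ")]

print(len(names), len(mean_feats), len(error_feats), len(worst_feats))  # 30 10 10 10
```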

Solution Architecture

A modular, 11-module utility layer feeds a Streamlit multi-tab frontend. Each component is independently testable and swappable — enabling seamless model upgrades, feature additions, and compliance audits.

// PREDICTION PIPELINE
Input
30 FNA measurements via sidebar sliders
Validate
Feature alignment & range validation
Scale
StandardScaler normalization
Predict
Random Forest Ensemble v2
Explain
SHAP + tree variance + CI
Report
PDF export + audit log
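The Explain step reports tree-level variance and a 95% CI. The project's exact formula isn't shown here; one common approach treats each tree's predicted probability as an independent vote and builds a normal-approximation interval around the ensemble mean:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Probability of the benign class from every individual tree, for one sample.
sample = X[:1]
per_tree = np.array([t.predict_proba(sample)[0, 1] for t in clf.estimators_])

# Normal-approximation 95% CI around the ensemble-mean probability.
mean_p = per_tree.mean()
se = per_tree.std(ddof=1) / np.sqrt(len(per_tree))
ci = (mean_p - 1.96 * se, mean_p + 1.96 * se)
print(f"p(benign) = {mean_p:.3f}, 95% CI [{ci[0]:.3f}, {ci[1]:.3f}]")
```

A wide interval signals that the trees disagree, which is itself clinically useful: it flags the prediction as less robust than the point probability alone suggests.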
// APPLICATION TABS
TAB 01
AI Analysis
Classification result, SHAP waterfall, radar chart, PCA projection
TAB 02
Model Monitoring
Z-score drift per feature, system health gauge, prediction log
TAB 03
Model Card & Ethics
Intended use, limitations, fairness disclosures, accountability
TAB 04
Error Analysis
Interactive ROC, AUC=0.999, confusion matrix, FN/FP log
TAB 05
Research Lab
Gaussian synthetic generation, PCA manifold overlay
TAB 06
Dataset Explorer
Filterable scatter plots, marginal histograms, raw data
// STACK
Python 3.10+ · scikit-learn · SHAP · Streamlit · Plotly · ReportLab PDF · Pandas · NumPy

Model Results

The Random Forest ensemble achieves near-perfect discrimination, validated across all standard clinical AI metrics on a held-out stratified 20% test set.

98.2%
Accuracy
Test set · stratified split
0.999
AUC-ROC
Near-perfect discrimination
~97%
Sensitivity
At threshold T = 0.50
100%
Specificity
At threshold T = 0.50 · both test-set errors were false negatives
2
False Negatives
At T=0.50 on test set
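Sensitivity and specificity follow directly from confusion-matrix counts. A sketch with purely hypothetical counts (not the project's actual test-fold tallies):

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int):
    """Clinical screening metrics from confusion-matrix counts.

    Sensitivity = TP / (TP + FN): how many malignancies were caught.
    Specificity = TN / (TN + FP): how many benign cases were cleared.
    """
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical counts for a 114-sample fold with 2 false negatives.
sens, spec = sensitivity_specificity(tp=41, fn=2, tn=71, fp=0)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```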
// TOP SHAP CONTRIBUTORS → MALIGNANT DIRECTION
Concave Points Worst
+0.82
Radius Worst
+0.78
Perimeter Worst
+0.75
Area Worst
+0.62
Concavity Mean
+0.55
Compactness Worst
+0.47
Concave Points Mean
+0.44
Radius Mean
+0.41
Texture Worst
+0.36
Area Mean
+0.31

Note: SHAP values are approximate illustrative representations from ensemble analysis.
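As that note says, the SHAP values above are illustrative; exact attributions require the shap package. As a lighter-weight sanity check that the "worst" features dominate, the forest's impurity-based importances give a global (not per-prediction) ranking:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
clf = RandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(data.data, data.target)

# Impurity-based importance: a global ranking, unlike SHAP's
# per-prediction attributions, but a quick check on which features dominate.
order = np.argsort(clf.feature_importances_)[::-1][:5]
for i in order:
    print(f"{data.feature_names[i]:<25} {clf.feature_importances_[i]:.3f}")
```

On WDBC, "worst"-group features reliably appear near the top of this ranking, consistent with the SHAP table above.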

Clinical Impact

The system redefines the clinical workflow from manual, subjective, and time-intensive to real-time, explainable, and audit-ready — across every dimension that matters to clinicians and institutions.

Faster Diagnosis
From hours or days of manual cytology review to real-time classification. Immediate structured second opinion available at any point in the workflow.
Reduced Missed Malignancies
Only 2 false negatives at T=0.50 on the test set. The adjustable threshold allows clinicians to tune sensitivity even lower when clinical stakes demand it.
Explainable Decisions
SHAP attribution + rule engine makes the AI's reasoning visible, quantified, and challengeable — not a black-box probability score.
Audit-Ready Documentation
Auto-generated PDF report + CSV prediction log with timestamps supports regulatory adherence and institutional review without extra documentation work.
Research Acceleration
Synthetic data generation + PCA manifold lab enables safe hypothesis testing without requiring access to new patient data or ethics approvals.
Operational Reliability
Z-score drift detection monitors every feature against training distribution, protecting against silent model degradation in production environments.
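The Z-score drift check described above compares each incoming feature value against its training-set mean and standard deviation. A sketch (the 3-sigma alert limit and the statistics shown are illustrative, not the project's stored values):

```python
import numpy as np

def drift_alerts(x_new, train_mean, train_std, z_limit: float = 3.0):
    """Flag features whose incoming values fall more than z_limit standard
    deviations from the training distribution (per-feature z-score)."""
    z = (np.asarray(x_new, dtype=float) - train_mean) / train_std
    return np.abs(z) > z_limit

# Illustrative training statistics for two features (mean radius, mean texture).
train_mean = np.array([14.1, 19.3])
train_std  = np.array([3.5, 4.3])

print(drift_alerts([15.0, 19.0], train_mean, train_std))  # [False False]
print(drift_alerts([40.0, 19.0], train_mean, train_std))  # [ True False]
```

A single out-of-range feature does not invalidate a prediction, but a persistent pattern of alerts indicates the production population has drifted from the training distribution.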

Baseline vs. System Comparison

Impact Dimension | Without AI | With This System
Time to classification | Hours to days (manual) | Seconds (real-time)
Sensitivity at default threshold | ~85–90% (manual cytology) | ~97% at T=0.50
Explainability | Subjective operator judgment | SHAP-backed, quantified attribution
Documentation time | Manual write-up | Auto-generated PDF in one click
Model degradation visibility | None — silent failure | Z-score drift alerts per feature
Second opinion access | Requires specialist availability | Instant AI second opinion
Research feasibility | Requires new patient data | Gaussian synthetic generation

Stakeholder Impact Map

Pathologist / Cytologist
  • Faster structured second opinion
  • SHAP explanations validate intuition
  • PDF report reduces documentation effort
Clinical Institution
  • Audit trail supports regulatory compliance
  • Drift monitoring catches model degradation
  • Reproducible, explainable decisions
Medical Researcher
  • Synthetic data lab for safe hypothesis testing
  • Dataset Explorer for population insights
  • Sensitivity curves for feature-level analysis
AI / ML Practitioner
  • Reference XAI-in-healthcare implementation
  • End-to-end MLOps pattern: train → serve → monitor
  • Ethical AI documentation via Model Card

Ethical Considerations

Clinical AI demands more than accuracy. This system ships with a structured Model Card covering intended use, known limitations, fairness disclosures, and explicit accountability documentation.

Intended Use
Educational and research decision support tool only. Explicitly NOT a replacement for physician diagnosis or clinical judgment.
Known Limitations
WDBC dataset (1995) may not represent all contemporary populations or imaging protocols. Temporal generalizability is not guaranteed.
Fairness Disclosure
No demographic stratification is available in WDBC — bias analysis across age, ethnicity, or institution is not possible. Explicitly disclosed in the Model Card.
Explainability Mandate
Every prediction is accompanied by SHAP attribution. No silent black-box outputs. The AI's reasoning is always visible and challengeable by the clinician.
Audit Trail
All predictions are logged with timestamp, input features, and outcome to a persistent CSV file. Full provenance for every clinical decision.
Threshold Transparency
The clinician retains full control over the sensitivity/specificity trade-off. The system never enforces a fixed 0.50 default for clinical decisions.
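The audit trail described above amounts to appending one timestamped row per prediction to a persistent CSV file. A minimal sketch (the filename and column layout are hypothetical, not the project's actual schema):

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("prediction_audit_log.csv")  # hypothetical filename

def log_prediction(features: dict, label: str, probability: float) -> None:
    """Append one timestamped prediction record to the CSV audit log."""
    is_new = not AUDIT_LOG.exists()
    with AUDIT_LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:  # write the header once, on first use
            writer.writerow(["timestamp_utc", "label", "probability",
                             *features.keys()])
        writer.writerow([datetime.now(timezone.utc).isoformat(), label,
                         f"{probability:.4f}", *features.values()])

log_prediction({"mean radius": 14.2, "worst concave points": 0.12},
               "Benign", 0.9312)
```

Because each row carries the full input feature vector, any historical prediction can be replayed against a newer model version during an audit.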

Lessons Learned

Building a clinical AI tool surfaces priorities that pure ML benchmarking obscures. These are the five most important architectural and design insights from this project.

01
Explainability is not optional in clinical AI
SHAP transforms a prediction from a number into a clinically reasoned argument — this is what separates a tool a clinician will trust from one they will ignore. Accuracy without transparency is clinically worthless.
02
Threshold flexibility matters more than raw accuracy
A fixed 0.50 threshold optimizes for accuracy, not for clinical stakes. Giving control to the clinician is both ethically correct and practically superior — the sensitivity/specificity trade-off is a clinical decision, not a technical default.
03
Monitoring is as important as modeling
A model that degrades silently is clinically dangerous. Z-score drift detection is a non-negotiable MLOps requirement for healthcare AI — not an optional operational nicety.
04
Modular architecture enables future-proofing
Each of the 11 utility modules is independently testable and swappable — this design enables seamless model upgrades, feature additions, and compliance audits without touching unrelated components.
05
Synthetic data unlocks safe research
The ability to generate controlled synthetic clinical profiles without accessing real patient data opens research workflows that would otherwise be ethically or logistically blocked — a significant accelerator for clinical AI development.
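Gaussian synthetic generation of this kind can be sketched by fitting a per-class multivariate normal to the real samples; the project's exact generator isn't shown here, so this is one plausible implementation:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

rng = np.random.default_rng(0)
data = load_breast_cancer()

def synthesize(target_class: int, n: int) -> np.ndarray:
    """Draw synthetic 30-feature profiles from a multivariate Gaussian
    fitted to the real WDBC samples of the given class."""
    X_c = data.data[data.target == target_class]
    mean = X_c.mean(axis=0)
    cov = np.cov(X_c, rowvar=False)  # preserves feature correlations
    return rng.multivariate_normal(mean, cov, size=n)

synthetic_benign = synthesize(target_class=1, n=100)  # 1 = benign in sklearn's copy
print(synthetic_benign.shape)  # (100, 30)
```

Sampling from the full covariance matrix, rather than per-feature marginals, keeps the strong radius/perimeter/area correlations of real cytology intact in the synthetic cohort.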

Future Directions

The current system provides a robust, production-ready foundation. The following enhancements would extend its clinical utility, fairness coverage, and institutional applicability.

Enhancement | Rationale
Multi-class support | Extension to BI-RADS multi-category classification (Benign, Prob. Benign, Suspicious, Malignant), providing more granular clinical mapping.
EHR Integration (FHIR/HL7) | Direct integration with hospital electronic health records, enabling seamless clinical orders and result ingestion.
Demographic Stratification | Benchmarking across larger, contemporary multi-ethnic datasets, improving the system's global reliability and fairness validation.
Federated Learning | Multi-institution model training without data sharing, enabling broader datasets while preserving patient privacy.
Model Versioning (MLflow) | Production-grade experiment tracking and automatic retraining pipelines, keeping the deployed model current.
LIME Comparison | Adding LIME explanations alongside SHAP, giving clinicians multiple cross-validated perspectives on AI reasoning.