Research Paper · v2.4.0 · April 2026 · Clinical Validation Active

CardioSense AI

An Integrated eXplainable Clinical Decision Support System (X-CDSS)
for Precision Cardiovascular Risk Assessment

Cardiovascular diseases remain the leading cause of global mortality. We present CardioSense AI — a state-of-the-art system that integrates optimized XGBoost with multi-modal interpretability layers (SHAP + LIME), ACC/AHA safety guardrails, and a novel Least Effort Path (LEP) optimization algorithm. High performance and full interpretability are not mutually exclusive.

Metric	Value
Clinical Accuracy	88.52%
ROC-AUC	0.9621
Recall (Sensitivity)	92.86%
Brier Score	0.0814
PR-AUC	0.9553
Security Audit	100%

Shahid Ul Islam · ML Engineer | CV Engineer · Keywords: XAI, XGBoost, SHAP, LIME, Risk Optimization, CDSS

§1 — Introduction

The Interpretability Gap in Clinical AI

Cardiovascular medicine is inherently data-rich. Traditional risk calculators — Framingham Score, ASCVD Estimator — rely on linear assumptions that fail to capture high-dimensional non-linear dependencies. AI offers a solution, but deployment is hampered by the "Black Box" problem.

"A High Risk notification from a model, without a supporting clinical rationale, is often viewed with scepticism. The clinician is ethically and legally responsible for every diagnosis they provide." A passive model that provides a risk score without explanation — and without intervention guidance — is clinically inert.

Trust via Multi-Modal XAI

Global (SHAP) and local (LIME) explainability techniques provide a "glass-box" view of every prediction. Every inference is accompanied by a feature-level waterfall decomposition and a sensitivity analysis.

Safety via Guideline Integration

AHA/ACC Hypertension Guidelines are embedded directly into a Safety Engine as deterministic "Hard-Stop" guardrails that act before the probabilistic ML model's output is presented to the clinician.

Active Decision Support (LEP)

The Least Effort Path (LEP) optimization algorithm identifies the most feasible clinical interventions for a specific patient — transforming passive prediction into an actionable clinical roadmap.

§2 — Data

UCI Cleveland Clinical Dataset

CardioSense AI is trained and validated on the internationally recognized UCI Cleveland Heart Disease dataset — 303 patient records, 13 clinical features, binary cardiovascular disease target. VIF analysis confirms all features exhibit VIF < 2.5 (low multicollinearity).

Feature	Description	Clinical Significance	Range / Type
age	Patient age in years	Primary risk factor for vascular decay	29–77 yrs
sex	Biological sex	Biological variance in coronary anatomy	0: Female, 1: Male
cp	Chest pain type (1–4)	Qualitative indicator of ischemic stress	Categorical
trestbps	Resting systolic BP	Hemodynamic marker of vascular pressure	94–200 mmHg
chol	Serum cholesterol	Risk factor for lipid-driven plaque formation	126–564 mg/dl
fbs	Fasting blood sugar > 120mg/dl	Metabolic indicator of diabetic risk	0/1 Boolean
restecg	Resting ECG results	Electric signal evidence of hypertrophy/ischemia	0, 1, 2
thalach	Maximum heart rate achieved	Marker of cardiac reserve and fitness	71–202 bpm
exang	Exercise induced angina	Direct evidence of coronary insufficiency	0/1 Boolean
oldpeak	ST depression via exercise	Metric for myocardial repolarization delay	0.0–6.2
slope	Peak exercise ST slope	Clinical indicator of ischemia severity	Upsloping/Flat/Down
ca	Number of major vessels (0–3)	Structural marker of coronary calcification	0–3
thal	Thallium stress test result (often mislabelled as thalassemia)	Perfusion marker of myocardial blood flow	3: Normal, 6: Fixed defect, 7: Reversible defect

Clinical Safety Thresholds

Condition	Threshold	Feature	Action
Hypertensive Crisis	≥ 180 mmHg	trestbps	Triggers immediate risk escalation
Critical ST Depression	> 3.0	oldpeak	Ischemic severity override
Multivessel Disease	≥ 2	ca (major vessels)	Structural override trigger
Tachycardia Risk	> 190 bpm	thalach	Cardiac stress override

§3 — Methodology

Mathematical Foundations

CardioSense AI's clinical intelligence rests on four mathematical pillars: a robust preprocessing pipeline, the XGBoost objective, Bayesian hyperparameter optimization, and sigmoid probability calibration.

§3.1 Robust Preprocessing Pipeline

Numerical vitals \((x_{\text{num}} \in \{\text{age, trestbps, chol, thalach, oldpeak}\})\) are standardised via Z-score normalisation. Categorical features are transformed via One-Hot Encoding. Parameters are fitted exclusively on training data and persisted in preprocessor.joblib to eliminate training-serving skew.

Z-Score Normalisation (StandardScaler)

\[ z = \frac{x - \mu}{\sigma} \]

where \(\mu\) and \(\sigma\) are the channel-wise mean and standard deviation computed exclusively from the training set, preventing data leakage into the clinical validation pipeline.
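The leakage-free fitting described above can be illustrated with a minimal standard-library sketch. It is not the project's `preprocessor.joblib` pipeline; the function names and the blood-pressure values are made up for illustration. The population standard deviation is used because that is what scikit-learn's StandardScaler computes.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Estimate (mu, sigma) from the training split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)  # population std (ddof=0), as in StandardScaler
    if sigma == 0:
        sigma = 1.0  # constant column: avoid division by zero
    return mu, sigma


def transform(values, mu, sigma):
    """Apply z = (x - mu) / sigma using previously fitted parameters."""
    return [(x - mu) / sigma for x in values]


# Illustrative resting blood pressure values (mmHg), not taken from the dataset.
train_bp = [120, 130, 140, 150, 160]
test_bp = [140, 180]

mu, sigma = fit_scaler(train_bp)       # fitted on training data only
z_test = transform(test_bp, mu, sigma)  # test data reuses the training parameters
print(mu, round(sigma, 3), [round(z, 3) for z in z_test])
```

Because the test values never enter `fit_scaler`, the same `mu` and `sigma` can be persisted and reused at serving time, which is the property the pipeline relies on.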

§3.2 Core Intelligence Engine: XGBoost Objective

XGBoost optimises a second-order Taylor expansion of the loss function, enabling rapid convergence on the \(N=303\) clinical dataset while controlling overfitting through explicit regularisation of tree structure.

Regularised XGBoost Objective

\[ \mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k) \]

where \(\sum_i l(\hat{y}_i, y_i)\) is the differentiable convex loss (binary cross-entropy for clinical classification), and the regularisation term \(\Omega(f) = \gamma T + \frac{1}{2}\lambda\|w\|^2\) penalises the number of leaves \(T\) and leaf weights \(w\), preventing overfitting to the small clinical cohort.
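To make the role of \(\gamma\) and \(\lambda\) concrete, the sketch below evaluates the closed-form leaf weight \(w^* = -G/(H+\lambda)\) and the split gain from Chen and Guestrin's derivation for a five-sample toy node. The labels, the starting margin of 0, and \(\lambda = 1\) are assumptions chosen for illustration; \(\gamma = 2.192\) is the tuned value reported in §3.3.

```python
import math


def grad_hess(margin, y):
    """First and second derivatives of binary cross-entropy w.r.t. the raw margin."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p - y, p * (1.0 - p)


def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)


def split_gain(GL, HL, GR, HR, lam, gamma):
    """Objective reduction from a split, minus the per-leaf penalty gamma."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma


# Toy node: three positives would go left, two negatives right (illustrative only).
left = [grad_hess(0.0, y) for y in (1, 1, 1)]
right = [grad_hess(0.0, y) for y in (0, 0)]
GL, HL = sum(g for g, _ in left), sum(h for _, h in left)
GR, HR = sum(g for g, _ in right), sum(h for _, h in right)

lam, gamma = 1.0, 2.192  # lam is an assumed value; gamma is the reported one
w_left = leaf_weight(GL, HL, lam)
w_right = leaf_weight(GR, HR, lam)
gain = split_gain(GL, HL, GR, HR, lam, gamma)
print(round(w_left, 4), round(w_right, 4), round(gain, 4))
```

In this toy node the gain is negative even though the split separates the classes perfectly, which shows how a large \(\gamma\) suppresses splits supported by only a handful of patients.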

§3.3 Bayesian Hyperparameter Optimization (Optuna TPE)

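Unlike grid or random search, Optuna's Tree-structured Parzen Estimator (TPE) models which regions of the hyperparameter space have produced good and poor trials so far, and samples new candidates where good results are more likely. We executed 50 trials, each scored with 5-fold Stratified Cross-Validation.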

Optimised Hyperparameters (v2.4.0)

Parameter	Value
n_estimators	[missing]
max_depth	[missing]
learning_rate	0.1477
subsample	0.7827
min_child_weight	[missing]
gamma	2.192

The scale_pos_weight is automatically set as \(N_{\text{neg}}/N_{\text{pos}}\) to address inherent class imbalance in cardiac datasets.
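The two bookkeeping steps in this subsection, stratified fold assignment and the class-weight ratio, can be sketched without any ML library. The helper names and the toy label vector are hypothetical, and the fold assignment here is deterministic (no shuffling), which a production cross-validator would normally add.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign sample indices to k folds so each fold keeps a similar class ratio."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # deal each class out round-robin
    return folds


def scale_pos_weight(labels):
    """Ratio N_neg / N_pos used to re-weight the positive class."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


# Toy labels (15 negatives, 10 positives), not the Cleveland class counts.
labels = [0] * 15 + [1] * 10
folds = stratified_folds(labels, k=5)
spw = scale_pos_weight(labels)
print([len(f) for f in folds], spw)
```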

§3.4 Clinical Probability Calibration (Platt Scaling)

Raw XGBoost probabilities are pushed away from 0 and 1 due to the boosting process. We apply Sigmoid Calibration via CalibratedClassifierCV. A predicted 20% risk should correspond to an actual 20% frequency in the clinical population.

Brier Score — Calibration Measure

\[ \text{BS} = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2 = 0.0814 \]

The Brier Score is the mean squared error of the predicted probabilities: 0 is a perfect score, and a model that always predicts 0.5 scores 0.25. Our score of 0.0814 indicates strong calibration, which supports presenting the clinical "Risk Pulse" percentage as a probability.
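As a worked illustration of the formula, the following sketch computes the Brier Score for four made-up predictions and for an uninformative baseline that always predicts 0.5. The probabilities and outcomes are illustrative and are not drawn from the study data.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and observed outcome."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equally long")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


outcomes = [1, 1, 0, 0]
confident = [0.9, 0.8, 0.2, 0.1]   # illustrative, well-separated predictions
baseline = [0.5, 0.5, 0.5, 0.5]    # uninformative reference model

bs_confident = brier_score(confident, outcomes)
bs_baseline = brier_score(baseline, outcomes)
print(round(bs_confident, 4), bs_baseline)
```

Comparing a reported score against the 0.25 baseline gives a quick sense of scale for a value such as 0.0814.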

§4 — Explainability

Multi-Modal XAI Framework

A dual-engine interpretability layer ensures every prediction is explainable from multiple mathematical perspectives — global consistency via SHAP and local sensitivity via LIME. Neither technique alone is sufficient for clinical trust.

SHAP · Global Consistency

Shapley Additive Explanations

Based on cooperative game theory, SHAP provides a mathematically consistent allocation of "blame" or "credit" to each feature. We utilize TreeSHAP — a fast and exact algorithm for tree ensembles that satisfies Efficiency, Symmetry, and Additivity axioms.

SHAPLEY VALUE FORMULA

\[\phi_i = \sum_{S \subseteq \{x_1,\dots,x_p\}\setminus\{x_i\}} \frac{|S|!\,(p-|S|-1)!}{p!}\left[f(S \cup \{x_i\}) - f(S)\right]\]

The Shapley value \(\phi_i\) is the fair marginal contribution of feature \(x_i\), averaged over all coalitions \(S\) of the remaining features, where \(p\) is the total number of features and \(f(S)\) is the model output when only the features in \(S\) are known.
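The formula can be checked by brute force on a small example. The sketch below enumerates every coalition for three features and a hypothetical set function `toy_risk`; it is exponential in the number of features and is meant only to illustrate the definition, not the TreeSHAP algorithm the system uses.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values for a set function value(frozenset) -> float."""
    p = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(p):  # coalition sizes 0 .. p-1
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_risk(coalition):
    """Hypothetical risk when only the features in `coalition` are known."""
    score = 0.10  # base rate
    if "oldpeak" in coalition:
        score += 0.20
    if "ca" in coalition:
        score += 0.15
    if "oldpeak" in coalition and "ca" in coalition:
        score += 0.10  # interaction, split equally by the Shapley weights
    if "age" in coalition:
        score += 0.05
    return score


features = ["oldpeak", "ca", "age"]
phi = shapley_values(features, toy_risk)
print({name: round(v, 4) for name, v in phi.items()})
```

The attributions sum to the difference between the full-coalition output and the base rate, which is the Efficiency axiom that makes waterfall plots add up.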

LIME · Local Sensitivity

Local Interpretable Model-agnostic Explanations

LIME generates a linear approximation of the complex XGBoost model in the immediate vicinity of a specific patient's data point. It reveals how small changes in vitals (e.g., a 5 mmHg drop in BP) would shift the AI's risk assessment.

LIME OBJECTIVE

\[\xi(x) = \arg\min_{g \in G} \mathcal{L}(f, g, \pi_x) + \Omega(g)\]

where \(g\) is an interpretable linear surrogate, \(\pi_x\) is a proximity measure weighting perturbed samples by closeness to the patient \(x\), and \(\Omega(g)\) controls the surrogate's complexity.
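A one-feature version of this objective can be written in a few lines: sample perturbations around the patient, weight them by proximity, and fit a weighted least-squares line. This is a simplified sketch rather than the `lime` library's implementation; the Gaussian kernel, its width, the sample count, and the logistic `toy_model` are all assumptions, and \(\Omega(g)\) is omitted because the surrogate has a single coefficient.

```python
import math
import random


def lime_1d(black_box, x0, width=5.0, n_samples=500, seed=0):
    """Fit a proximity-weighted linear surrogate g(x) = slope * x + intercept near x0."""
    rng = random.Random(seed)
    xs = [x0 + rng.gauss(0.0, width) for _ in range(n_samples)]  # perturbed samples
    ys = [black_box(x) for x in xs]
    ws = [math.exp(-((x - x0) ** 2) / (width ** 2)) for x in xs]  # proximity pi_x
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    return slope, my - slope * mx


def toy_model(bp):
    """Hypothetical black box: risk as a logistic function of systolic BP."""
    return 1.0 / (1.0 + math.exp(-(bp - 140.0) / 10.0))


slope, intercept = lime_1d(toy_model, x0=140.0)
shift_for_5mmhg_drop = slope * -5.0  # surrogate's estimate of the local risk change
print(round(slope, 4), round(shift_for_5mmhg_drop, 4))
```

Multiplying the local slope by a proposed change, such as a 5 mmHg drop, is how a linear surrogate turns into a "what if" sensitivity statement for one patient.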

Diagnostic "X-Ray" — SHAP Waterfall Decomposition

[Figure: SHAP waterfall for an illustrative high-risk patient, f(X) = 0.92, E[f(X)] = 0.46]

[Figure: Risk Optimization Radar, patient vs. target]

Shannon Entropy Confidence Engine

\[ H(p) = -(p\log_2 p + (1-p)\log_2(1-p)) \]

\[ C = 1 - H(p) \]

When \(p \approx 0.5\), entropy is maximal and \(C\) approaches 0, which triggers a LOW confidence warning.
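The confidence score follows directly from the two formulas. The sketch below computes it for a few probabilities; the cut-offs that map \(C\) to High, Moderate, or Low are not stated in this paper, so none are assumed here.

```python
import math


def binary_entropy(p):
    """Shannon entropy H(p) in bits for a Bernoulli probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0  # limit of p * log2(p) as p -> 0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def confidence(p):
    """Confidence C = 1 - H(p): 0 at p = 0.5, 1 at p = 0 or p = 1."""
    return 1.0 - binary_entropy(p)


for p in (0.5, 0.75, 0.9234):
    print(p, round(confidence(p), 3))
```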

§5 — Safety & Trust

Clinical Safety & Trust Framework

In medical AI, "Black Box" models are clinically unusable. CardioSense AI implements six interlocking layers of trust — from hard-stop clinical overrides to cryptographic audit hashes.

Clinical Overrides (ACC/AHA Alignment)

Hard-stop rules based on ACC/AHA guidelines trigger alerts regardless of AI probability. If trestbps ≥ 180 mmHg (Hypertensive Crisis), the system overrides any model output below 90% and escalates immediately. Similar triggers exist for ca ≥ 2 (multivessel disease), oldpeak > 3.0 (ischemic severity), and thalach > 190 bpm (cardiac stress).

Deterministic Hard-Stop · ACC/AHA 2017 Guidelines
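A deterministic guardrail of this kind can be sketched as a pure function applied to the model output. The thresholds are the ones listed in §2; the function name, the returned dictionary, and the choice to apply the same 90% floor to every trigger are assumptions, since the text specifies the floor only for the hypertensive-crisis rule.

```python
def apply_guardrails(vitals, model_probability, floor=0.90):
    """Raise the displayed risk to `floor` when any hard-stop threshold is met."""
    triggers = []
    if vitals["trestbps"] >= 180:
        triggers.append("hypertensive_crisis")
    if vitals["oldpeak"] > 3.0:
        triggers.append("critical_st_depression")
    if vitals["ca"] >= 2:
        triggers.append("multivessel_disease")
    if vitals["thalach"] > 190:
        triggers.append("tachycardia_risk")

    overridden = bool(triggers) and model_probability < floor
    displayed = floor if overridden else model_probability
    return {"risk": displayed, "overridden": overridden, "triggers": triggers}


crisis = apply_guardrails(
    {"trestbps": 185, "oldpeak": 1.0, "ca": 0, "thalach": 150}, model_probability=0.35
)
routine = apply_guardrails(
    {"trestbps": 140, "oldpeak": 2.5, "ca": 1, "thalach": 130}, model_probability=0.9234
)
print(crisis)
print(routine)
```

Keeping the rule set free of model state means it can be unit-tested against guideline thresholds independently of any retraining.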

Out-of-Distribution (OOD) Monitoring

Compares input data against the statistical bounds of the training set (age ranges, BP maximums). If a patient's vitals fall outside the training distribution — e.g., age outside [29, 77] or cholesterol above 564 mg/dl — a clinical alert is surfaced before inference.

Statistical Bounds · Training Distribution Guard
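A range check against the training bounds is straightforward to sketch. The bounds below are copied from the feature table in §2; the function name and the alert wording are hypothetical, and a real guard might add checks on categorical codes as well.

```python
# Numeric ranges observed in the training data, as listed in the feature table.
TRAINING_BOUNDS = {
    "age": (29, 77),
    "trestbps": (94, 200),
    "chol": (126, 564),
    "thalach": (71, 202),
    "oldpeak": (0.0, 6.2),
}


def out_of_distribution(vitals, bounds=TRAINING_BOUNDS):
    """Return one alert string per numeric vital outside the training range."""
    alerts = []
    for name, (low, high) in bounds.items():
        value = vitals.get(name)
        if value is None:
            continue  # missing values are left to schema validation
        if value < low or value > high:
            alerts.append(f"{name}={value} outside training range [{low}, {high}]")
    return alerts


alerts = out_of_distribution({"age": 82, "trestbps": 120, "chol": 600})
print(alerts)
```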

Entropy-Based Confidence Quantification

Every prediction includes a Confidence Gauge derived from \(C = 1 - H(p)\) where \(H(p) = -(p\log_2 p + (1-p)\log_2(1-p))\). A high-entropy prediction (\(p \approx 0.5\)) yields a LOW confidence warning, signalling the patient sits on a statistical boundary and requires closer human investigation.

Shannon Entropy · High / Moderate / Low

Cryptographic Audit Hashes

Every inference result is cryptographically linked to the model version and timestamp via SHA-256 hashing (usedforsecurity=False flagged for audit compliance). This allows clinicians to verify that the decision support engine has not been altered since its last validated training run.

SHA-256 Audit Hash · Full Traceability
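One way such a fingerprint could be produced with the standard library is shown below. The payload fields and their serialisation are assumptions; the `usedforsecurity=False` keyword is available on `hashlib` constructors from Python 3.9 and marks the digest as an integrity fingerprint rather than a security primitive.

```python
import hashlib
import json


def audit_hash(result, model_version, timestamp):
    """SHA-256 fingerprint binding an inference result to a model version and time."""
    payload = json.dumps(
        {"result": result, "model_version": model_version, "timestamp": timestamp},
        sort_keys=True,              # canonical key order so the hash is reproducible
        separators=(",", ":"),
    ).encode("utf-8")
    return hashlib.sha256(payload, usedforsecurity=False).hexdigest()


result = {"prediction": 1, "risk_probability": 0.9234}
digest = audit_hash(result, "2.4.0", "2026-04-01T12:00:00Z")  # timestamp is illustrative
print(digest)
```

An unkeyed hash like this detects accidental mismatches between a report and the model version that produced it; resisting deliberate tampering would need a keyed construction such as an HMAC, which is outside what the paper describes.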

Adaptive Monitoring Gateway (Evidently AI)

Detects Data Drift using Kolmogorov-Smirnov (K-S) tests on clinical feature distributions and Performance Decay via Recall Stability monitoring from real-world clinician feedback. Uses an Adaptive Search pattern for 99.9% telemetry uptime across diverse hosting environments.

K-S Test · Drift Share · Concept Drift Alert
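The two-sample K-S statistic behind this check is the largest gap between two empirical distribution functions. The sketch below computes it directly and derives a drift share from it. Thresholding the statistic itself is an assumption made to keep the example dependency-free, as is the 0.4 cut-off; it is not a description of how Evidently AI decides drift.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic D = sup |F_ref(x) - F_cur(x)|."""
    ref, cur = sorted(reference), sorted(current)
    d = 0.0
    for x in ref + cur:  # the supremum is attained at an observed value
        f_ref = bisect_right(ref, x) / len(ref)
        f_cur = bisect_right(cur, x) / len(cur)
        d = max(d, abs(f_ref - f_cur))
    return d


def drift_share(reference_cols, current_cols, threshold):
    """Fraction of features whose K-S statistic exceeds a (hypothetical) threshold."""
    drifted = [
        name for name in reference_cols
        if ks_statistic(reference_cols[name], current_cols[name]) > threshold
    ]
    return len(drifted) / len(reference_cols)


# Illustrative values only.
reference = {"trestbps": [120, 130, 140, 150], "chol": [200, 220, 240, 260]}
current = {"trestbps": [121, 131, 139, 151], "chol": [300, 320, 340, 360]}
share = drift_share(reference, current, threshold=0.4)
print(share)
```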

Security Audit (Bandit SAST + Safety SCA)

Every release is audited with Bandit (Static Application Security Testing) for insecure Python patterns (CWE-mapped reports) and Safety (Software Composition Analysis) to verify all dependencies are free from known CVEs. Current status: 100% PASS — No High/Medium vulnerabilities detected.

100% PASS · Bandit + Safety · OWASP Aligned

§6 — Risk Optimization

The Least Effort Path Algorithm

The LEP algorithm is the most significant innovation in CardioSense AI: it moves the system from passive prediction to active intervention planning by identifying the minimum patient effort required to reach a clinician-set target risk level.

LEP Optimization Objective

\[ \arg\min_{\Delta X}\; \mathrm{Risk}(X + \Delta X) + \lambda \sum_i w_i |\Delta x_i| \]

where \(X\) is the current patient vitals vector, \(\Delta X\) is the proposed intervention, \(w_i\) is the clinical cost weight (difficulty) of modifying feature \(i\), and \(\lambda\) controls the effort-risk tradeoff. The result is a prioritized Treatment Roadmap ranking interventions by their risk-reduction ROI.

Clinical Effort Weights — w_i

Cost-Weighted Feasibility Matrix

Weight (w_i)	Feature	Feasibility
1.0	trestbps	High feasibility: medication or dietary intervention
1.5	chol	Moderate: statin therapy + lifestyle
2.0	thalach	Lower feasibility: sustained conditioning required
3.5	oldpeak	Structural effort: cardiologic intervention

The greedy optimizer identifies which modifications yield the greatest risk reduction relative to their clinical effort — producing an actionable, prioritized clinical sequence for the treating physician.
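The greedy search can be sketched as follows. The effort weights come from the feasibility matrix; everything else is hypothetical: `toy_risk` is a stand-in logistic model rather than the calibrated XGBoost pipeline, the step sizes and feasible limits are invented, and effort is counted as \(w_i\) per step, which amounts to measuring \(\Delta x_i\) in step units.

```python
import math

# Effort weights w_i from the feasibility matrix.
EFFORT_WEIGHTS = {"trestbps": 1.0, "chol": 1.5, "thalach": 2.0, "oldpeak": 3.5}
# Hypothetical signed change per intervention step and hypothetical feasible limits.
STEPS = {"trestbps": -5.0, "chol": -10.0, "thalach": 5.0, "oldpeak": -0.5}
LIMITS = {"trestbps": (120.0, 200.0), "chol": (200.0, 564.0),
          "thalach": (71.0, 170.0), "oldpeak": (0.0, 6.2)}


def toy_risk(v):
    """Hypothetical logistic stand-in for the calibrated model."""
    z = (0.03 * (v["trestbps"] - 130) + 0.005 * (v["chol"] - 200)
         - 0.02 * (v["thalach"] - 150) + 0.8 * v["oldpeak"])
    return 1.0 / (1.0 + math.exp(-z))


def least_effort_path(vitals, risk_fn, target, max_steps=100):
    """Greedily apply the step with the best risk reduction per unit effort."""
    current, roadmap = dict(vitals), []
    for _ in range(max_steps):
        risk = risk_fn(current)
        if risk <= target:
            break
        best = None
        for name, step in STEPS.items():
            trial = dict(current)
            trial[name] += step
            low, high = LIMITS[name]
            if not low <= trial[name] <= high:
                continue  # step would leave the feasible range
            roi = (risk - risk_fn(trial)) / EFFORT_WEIGHTS[name]
            if roi > 0 and (best is None or roi > best[0]):
                best = (roi, name, trial)
        if best is None:
            break  # no feasible step lowers the risk
        _, name, current = best
        roadmap.append((name, current[name], round(risk_fn(current), 4)))
    return roadmap, risk_fn(current)


patient = {"trestbps": 160.0, "chol": 260.0, "thalach": 130.0, "oldpeak": 2.0}
roadmap, final_risk = least_effort_path(patient, toy_risk, target=0.60)
for name, new_value, risk in roadmap:
    print(f"{name} -> {new_value}: risk {risk}")
```

Each roadmap entry records the feature changed, its new value, and the resulting risk, so the sequence can be read as a prioritised list of interventions.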

§7 — Experimental Results

Clinical Performance Metrics

The model was validated on a Hold-Out Test Set (20% of the data, N=61 per the demographic breakdown in §7.3), with Stratified 5-Fold Cross-Validation used during optimization. Metrics represent the system's state after Sigmoid Calibration and Target-Enriched Optuna optimization.

Clinical Performance, v2.4.0 (n=303, UCI Cleveland Dataset)

Metric	Value
Accuracy	88.52%
ROC-AUC	0.9621
PR-AUC	0.9553
Recall (Sensitivity)	92.86%
Precision	83.87%
F1-Score	0.8814

Metric	Score	Professional Interpretation
Model Version	v2.4.0	Professional Optuna-calibrated clinical ensemble
Clinical Accuracy	88.52%	High fidelity across all diagnostic classes
ROC-AUC Score	0.9621	Exceptional class discrimination power
PR-AUC Score	0.9553	Precise performance in unbalanced medical sets
Recall (Sensitivity)	92.86%	Critical safety metric — minimising false negatives
F1-Score	0.8814	Robust harmonic balance of precision and recall
Brier Score	0.0814	Strong probability calibration (Platt Scaling verified)
Test Coverage	63.00%	40 verified clinical scenarios across core logic
Security Audit	100% Pass	Bandit (SAST) & Safety (SCA) verified release
Data Drift	Monitored	Adaptive Evidently AI K-S monitoring gateway
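The threshold-based metrics can be cross-checked with simple arithmetic. The paper does not report a confusion matrix, so the counts below are an inference: they are one set of counts for a 61-patient hold-out set that reproduces the reported accuracy, recall, precision, and F1 to the stated precision.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Inferred counts (not reported in the paper): 28 positives, 33 negatives.
metrics = classification_metrics(tp=26, fn=2, fp=5, tn=28)
print({name: round(value, 4) for name, value in metrics.items()})
```

Under these counts the model misses 2 of 28 positive patients and raises 5 false alarms among 33 negatives, which is the trade-off implied by prioritising sensitivity.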

§7.3 Bias Audit & Demographic Parity

Clinical Priority: We prioritize Recall (Sensitivity) in senior and female populations to reduce the chance that a high-risk patient is missed because of algorithmic bias. On the hold-out set the system reaches 100% Recall for the Senior (≥65) cohort, the highest baseline risk population, and 85.71% for the Female cohort.

Demographic Group	N	Accuracy	Recall (Sensitivity)	F1-Score
Gender: Female	20	95.00%	85.71%	92.31%
Gender: Male	41	87.80%	95.24%	88.89%
Age: Young (<45)	13	100.0%	100.0%	100.0%
Age: Middle (45–64)	42	90.48%	90.91%	90.91%
Age: Senior (≥65) ★	6	66.67%	100.0% ★	75.00%

★ Recall of 100% for the Senior (≥65) population is clinically vital, but with only N=6 patients both this figure and the 66.67% accuracy carry wide uncertainty. The accuracy dip reflects the small sample and the deliberate prioritisation of sensitivity over specificity in the highest-risk cohort.

§8 — Architecture

System Architecture

A four-layer decoupled architecture for maximum scalability and auditability. Each layer is independently testable and integrates via well-defined interfaces.

Core Intelligence Layer

Python · XGBoost · SHAP · LIME

The predictive and explainability engines. HeartDiseasePredictor wraps the XGBoost pipeline. SHAP Engine provides TreeSHAP waterfall plots. LIME Engine generates local linear surrogates. Intervention Simulator runs the LEP coordinate descent optimizer.

src/models/ src/explainability/ src/simulation/

Inference Gateway (FastAPI)

FastAPI · Pydantic · Uvicorn · Lifespan

RESTful API handling real-time risk assessments with Pydantic validation for medical data integrity. Implements Lifespan context management for robust initialization. X-Request-ID injection for full clinical audit traceability. Rotating JSON logger (5MB, 3 backups).

api/main.py /predict · /monitoring X-Request-ID

Clinical Dashboard (Streamlit)

Streamlit · Plotly · fpdf

Premium clinician-focused interface rendering SHAP waterfall plots, optimization radar charts, LIME sensitivity analysis, and global population-level importance. Generates professional clinical PDF reports with embedded Audit Hash. CI/CD via GitHub Actions (5-job pipeline).

app/main.py PDF Reports Audit Hash Display

Data & Persistence

SQLite · Evidently AI · joblib · JSON

SQLite database for persistent inference telemetry enabling longitudinal drift analysis. model_metadata.json embeds calibration metrics, fairness audit results, and feature importance rankings. Evidently AI runs K-S tests on the distributions of all 25 monitored features, a count consistent with the 13 clinical variables after one-hot encoding.

SQLite · history.db model_metadata.json reports/monitoring/

Automated Clinical Pipeline — GitHub Actions (5 Jobs)

Job	Stage	Tooling
1	Linting	flake8 (code quality)
2	Clinical Testing	pytest, 40 scenarios
3	Model Ingest Audit	ColumnTransformer verification
4	Security Audit	Bandit + Safety scan
5	Docker Build	FastAPI container

§9 — Production API

FastAPI Inference Gateway

CardioSense AI exposes a production-grade FastAPI REST interface for seamless integration with Electronic Health Record (EHR) systems. Every request receives a unique X-Request-ID for full clinical audit traceability.

Method	Endpoint	Description
POST	/predict	Primary inference endpoint — submits patient vitals, returns risk probability with X-Request-ID audit trace
GET	/monitoring/status	Returns data drift (Evidently K-S) and concept drift (Recall Stability) summary with timestamps
POST	/feedback/{id}	Clinician endpoint for ground-truth outcome labeling — feeds the Concept Drift monitoring loop
GET	/health	System health check — returns model version, uptime heartbeat, and artifact load status
GET	/docs	Interactive Swagger UI for live endpoint testing and documentation

REQUEST — POST /predict

// Patient vitals payload
{
  "age": 55,
  "sex": 1,
  "cp": 4,
  "trestbps": 140,
  "chol": 250,
  "fbs": 0,
  "restecg": 1,
  "thalach": 130,
  "exang": 1,
  "oldpeak": 2.5,
  "slope": 2,
  "ca": 1,
  "thal": 7
}

RESPONSE — 200 OK

{
  "prediction": 1,
  "risk_probability": 0.9234,
  "status": "Positive (High Risk)",
  "model_version": "2.4.0",
  "request_id": "550e8400-e2..."
}

// /monitoring/status excerpt
{
  "drift_share": 0.0,
  "dataset_drift": false,
  "current_recall": 0.881,
  "recall_drop": 0.0119,
  "concept_drift": false
}
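A client call to the primary endpoint might be assembled as in the sketch below, using only the standard library. The base URL `http://localhost:8000` and the helper name are assumptions; the payload mirrors the request example, and the request is built but not sent so the snippet runs without a live server.

```python
import json
from urllib.request import Request


def build_predict_request(vitals, base_url="http://localhost:8000"):
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    body = json.dumps(vitals).encode("utf-8")
    return Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


vitals = {
    "age": 55, "sex": 1, "cp": 4, "trestbps": 140, "chol": 250, "fbs": 0,
    "restecg": 1, "thalach": 130, "exang": 1, "oldpeak": 2.5, "slope": 2,
    "ca": 1, "thal": 7,
}
request = build_predict_request(vitals)
print(request.get_method(), request.full_url)

# Against a running gateway, the call would look like:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       result = json.loads(response.read())
```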

§10 — Monitoring & Sustainability

Adaptive Monitoring Gateway

A medical CDSS must remain accurate as the underlying patient population evolves. CardioSense AI integrates adaptive monitoring that detects both distributional drift and predictive performance decay in real time.

📊

Data Drift (Evidently AI)

Kolmogorov-Smirnov tests on all 25 monitored clinical feature distributions. Drift Share reports the percentage of features showing statistically significant deviation from the validated training baseline.

🎯

Concept Drift (Recall Stability)

Tracks real-world Recall via clinician ground-truth feedback loops (/feedback/{id}). Baseline Recall = 92.86%. Alert triggered when decay exceeds acceptable clinical safety thresholds.
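The recall-stability check can be sketched as a comparison between the validated baseline and the recall observed in clinician feedback. The baseline of 0.9286 is the reported value; the 0.05 alert threshold, the function names, and the record format are hypothetical, since the paper does not state the threshold it uses.

```python
BASELINE_RECALL = 0.9286  # reported baseline


def recall_from_feedback(records):
    """Recall over a list of (predicted_label, true_label) feedback pairs."""
    tp = sum(1 for pred, truth in records if truth == 1 and pred == 1)
    fn = sum(1 for pred, truth in records if truth == 1 and pred == 0)
    if tp + fn == 0:
        return None  # no confirmed positives yet
    return tp / (tp + fn)


def concept_drift(records, baseline=BASELINE_RECALL, max_drop=0.05):
    """Flag drift when recall falls more than `max_drop` below the baseline."""
    current = recall_from_feedback(records)
    if current is None:
        return {"current_recall": None, "recall_drop": None, "concept_drift": False}
    drop = baseline - current
    return {
        "current_recall": round(current, 4),
        "recall_drop": round(drop, 4),
        "concept_drift": drop > max_drop,
    }


stable = concept_drift([(1, 1)] * 9 + [(0, 1)] + [(0, 0)] * 5)
decayed = concept_drift([(1, 1)] * 7 + [(0, 1)] * 3)
print(stable)
print(decayed)
```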

🗄️

Persistent Telemetry (SQLite)

All inference events — probability, vitals, audit hash, timestamp — logged to local SQLite database. Powers longitudinal drift analysis and performance audit dashboards.

📝

Structured JSON Logging

Rotating file-based JSON logger (5MB per file, max 3 backups) at logs/cardiosense.log. Contains API access logs, inference probability distributions, and internal trace IDs.

§11–12 — Discussion & Future

Conclusions & Future Roadmap

CardioSense AI demonstrates that the perceived tradeoff between accuracy and interpretability is a false dichotomy. Post-hoc attribution (SHAP) alongside a high-capacity model (XGBoost) achieves strong hold-out performance (88.52% accuracy, 0.9621 ROC-AUC) while keeping every prediction explainable at the feature level.

§11.1 Interpretability-Accuracy Tradeoff: By using SHAP as a post-hoc attribution layer over XGBoost, we achieve clinical transparency without sacrificing predictive capacity. LIME sensitivity analysis and the entropy-based confidence gauge further help flag boundary-case patients rather than letting them be silently misclassified.

🌐

Federated Learning

Training across multiple institutions without compromising PHI (Protected Health Information) — enabling larger, more representative cohorts without centralising sensitive patient data.

🏥

FHIR-Compliant API

Seamless integration into Electronic Health Record systems (Epic, Cerner) via HL7 FHIR-compliant API endpoints — eliminating manual data entry from clinical workflows.

⌚

Real-Time ECG Analysis

Incorporating deep temporal features from wearable sensors — moving from a static snapshot of 13 vitals to a continuous, longitudinal risk assessment framework.

References & Technical Bibliography

[1] Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[2] Lundberg, S. M., & Lee, S. I. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems (NeurIPS).

[3] Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. Proceedings of the 22nd ACM SIGKDD (KDD '16).

[4] ACC/AHA Task Force. (2017). Guidelines for the Prevention, Detection, Evaluation, and Management of High Blood Pressure in Adults. Journal of the American College of Cardiology.

[5] Dua, D., & Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. [Cleveland Heart Disease Dataset, N=303].