Research Paper · v2.4.0 · April 2026 · Clinical Validation Active

CardioSense AI

An Integrated eXplainable Clinical Decision Support System (X-CDSS)
for Precision Cardiovascular Risk Assessment

Cardiovascular diseases remain the leading cause of global mortality. We present CardioSense AI — a state-of-the-art system that integrates optimized XGBoost with multi-modal interpretability layers (SHAP + LIME), ACC/AHA safety guardrails, and a novel Least Effort Path (LEP) optimization algorithm. High performance and full interpretability are not mutually exclusive.

Clinical Accuracy 88.52% · ROC-AUC 0.9621 · Recall (Sensitivity) 92.86% · Brier Score 0.0814 · PR-AUC 0.9553 · Security Audit 100%
Shahid Ul Islam · ML Engineer | CV Engineer · Keywords: XAI, XGBoost, SHAP, LIME, Risk Optimization, CDSS
§1 — Introduction

The Interpretability Gap in Clinical AI

Cardiovascular medicine is inherently data-rich. Traditional risk calculators — Framingham Score, ASCVD Estimator — rely on linear assumptions that fail to capture high-dimensional non-linear dependencies. AI offers a solution, but deployment is hampered by the "Black Box" problem.

"A High Risk notification from a model, without a supporting clinical rationale, is often viewed with scepticism. The clinician is ethically and legally responsible for every diagnosis they provide." A passive model that provides a risk score without explanation — and without intervention guidance — is clinically inert.

01. Trust via Multi-Modal XAI
Global (SHAP) and local (LIME) explainability techniques provide a "glass-box" view of every prediction. Every inference is accompanied by a feature-level waterfall decomposition and a sensitivity analysis.
02. Safety via Guideline Integration
AHA/ACC Hypertension Guidelines are embedded directly into a Safety Engine as deterministic "Hard-Stop" guardrails that act before the probabilistic ML model's output is presented to the clinician.
03. Active Decision Support (LEP)
The Least Effort Path (LEP) optimization algorithm identifies the most feasible clinical interventions for a specific patient, transforming passive prediction into an actionable clinical roadmap.
§2 — Data

UCI Cleveland Clinical Dataset

CardioSense AI is trained and validated on the internationally recognized UCI Cleveland Heart Disease dataset — 303 patient records, 13 clinical features, binary cardiovascular disease target. VIF analysis confirms all features exhibit VIF < 2.5 (low multicollinearity).

| Feature | Description | Clinical Significance | Range / Type |
|---|---|---|---|
| age | Patient age in years | Primary risk factor for vascular decay | 29–77 yrs |
| sex | Biological sex | Biological variance in coronary anatomy | 0: Female, 1: Male |
| cp | Chest pain type (1–4) | Qualitative indicator of ischemic stress | Categorical |
| trestbps | Resting systolic BP | Hemodynamic marker of vascular pressure | 94–200 mmHg |
| chol | Serum cholesterol | Risk factor for lipid-driven plaque formation | 126–564 mg/dl |
| fbs | Fasting blood sugar > 120 mg/dl | Metabolic indicator of diabetic risk | 0/1 Boolean |
| restecg | Resting ECG results | Electrical signal evidence of hypertrophy/ischemia | 0, 1, 2 |
| thalach | Maximum heart rate achieved | Marker of cardiac reserve and fitness | 71–202 bpm |
| exang | Exercise-induced angina | Direct evidence of coronary insufficiency | 0/1 Boolean |
| oldpeak | ST depression induced by exercise (relative to rest) | Metric for myocardial repolarization delay | 0.0–6.2 |
| slope | Peak exercise ST slope | Clinical indicator of ischemia severity | Upsloping / Flat / Downsloping |
| ca | Number of major vessels (0–3) | Structural marker of coronary calcification | 0–3 |
| thal | Thallium stress test result (3 = normal, 6 = fixed defect, 7 = reversible defect) | Marker of myocardial blood flow (perfusion) | 3 / 6 / 7 |

Clinical Safety Thresholds

| Condition | Trigger | Feature | Action |
|---|---|---|---|
| Hypertensive Crisis | ≥ 180 mmHg | trestbps | Triggers immediate risk escalation |
| Critical ST Depression | > 3.0 | oldpeak | Ischemic severity override |
| Multivessel Disease | ≥ 2 | ca (major vessels) | Structural override trigger |
| Tachycardia Risk | > 190 bpm | thalach | Cardiac stress override |
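A minimal sketch of how these deterministic thresholds could be evaluated before any model output is shown. The threshold values come from the list above; the function name `check_hard_stops`, the `SafetyResult` container, and the rule names are hypothetical and are not taken from the CardioSense AI codebase.

```python
from dataclasses import dataclass, field

# Thresholds as listed under "Clinical Safety Thresholds".
SBP_CRISIS_MMHG = 180    # trestbps >= 180
OLDPEAK_CRITICAL = 3.0   # oldpeak > 3.0
CA_MULTIVESSEL = 2       # ca >= 2
THALACH_MAX_BPM = 190    # thalach > 190


@dataclass
class SafetyResult:
    """Hypothetical container for the outcome of the guardrail check."""
    triggered: list = field(default_factory=list)  # names of rules that fired
    hard_stop: bool = False                        # True if any rule fired


def check_hard_stops(vitals: dict) -> SafetyResult:
    """Evaluate the deterministic guardrails on one patient's vitals."""
    rules = {
        "hypertensive_crisis": vitals["trestbps"] >= SBP_CRISIS_MMHG,
        "critical_st_depression": vitals["oldpeak"] > OLDPEAK_CRITICAL,
        "multivessel_disease": vitals["ca"] >= CA_MULTIVESSEL,
        "tachycardia_risk": vitals["thalach"] > THALACH_MAX_BPM,
    }
    triggered = [name for name, fired in rules.items() if fired]
    return SafetyResult(triggered=triggered, hard_stop=bool(triggered))


# No rule fires for these vitals; raising trestbps to 180 fires exactly one.
sample = {"trestbps": 140, "oldpeak": 2.5, "ca": 1, "thalach": 130}
print(check_hard_stops(sample))
print(check_hard_stops(dict(sample, trestbps=180)))
```

Because the rules are plain comparisons, each one can be unit-tested independently of the probabilistic model.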
§3 — Methodology

Mathematical Foundations

CardioSense AI's clinical intelligence rests on four mathematical pillars: a robust preprocessing pipeline, the XGBoost objective, Bayesian hyperparameter optimization, and sigmoid probability calibration.

§3.1 Robust Preprocessing Pipeline

Numerical vitals \((x_{\text{num}} \in \{\text{age, trestbps, chol, thalach, oldpeak}\})\) are standardised via Z-score normalisation. Categorical features are transformed via One-Hot Encoding. Parameters are fitted exclusively on training data and persisted in preprocessor.joblib to eliminate training-serving skew.

Z-Score Normalisation (StandardScaler)
\[ z = \frac{x - \mu}{\sigma} \]
where \(\mu\) and \(\sigma\) are the channel-wise mean and standard deviation computed exclusively from the training set, preventing data leakage into the clinical validation pipeline.
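The train-only fitting rule can be illustrated with a small standard-library sketch. The cholesterol values are hypothetical, and the helper names `fit_scaler` and `transform` are illustrative stand-ins for the fit/transform split that StandardScaler provides.

```python
from statistics import fmean, pstdev


def fit_scaler(train_values):
    """Estimate mu and sigma from training data only."""
    mu = fmean(train_values)
    sigma = pstdev(train_values)  # population std (ddof = 0), as StandardScaler uses
    if sigma == 0:
        raise ValueError("constant feature: sigma is zero")
    return mu, sigma


def transform(values, mu, sigma):
    """Apply z = (x - mu) / sigma using the already-fitted parameters."""
    return [(x - mu) / sigma for x in values]


# Hypothetical cholesterol readings (mg/dl).
train_chol = [200.0, 240.0, 280.0]
test_chol = [260.0]

mu, sigma = fit_scaler(train_chol)         # fitted on the training split only
z_test = transform(test_chol, mu, sigma)   # test data never influences mu or sigma
print(mu, sigma, z_test)
```

Persisting `mu` and `sigma` alongside the model is what keeps the serving path numerically identical to the training path.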

§3.2 Core Intelligence Engine: XGBoost Objective

XGBoost optimises a second-order Taylor expansion of the loss function, enabling rapid convergence on the \(N=303\) clinical dataset while controlling overfitting through explicit regularisation of tree structure.

Regularised XGBoost Objective
\[ \mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k) \]
where \(\sum_i l(\hat{y}_i, y_i)\) is the differentiable convex loss (binary cross-entropy for clinical classification), and the regularisation term \(\Omega(f) = \gamma T + \frac{1}{2}\lambda\|w\|^2\) penalises the number of leaves \(T\) and leaf weights \(w\), preventing overfitting to the small clinical cohort.
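To make the role of \(\gamma\) and \(\lambda\) concrete, the sketch below evaluates the closed-form leaf weight \(w^* = -G/(H+\lambda)\) and the split gain from Chen and Guestrin's derivation, where \(G\) and \(H\) are the summed first and second derivatives of the loss in a leaf. The four-sample example and the choice \(\lambda = 1\) are assumptions for illustration; \(\gamma = 2.192\) is the tuned value reported in §3.3.

```python
import math


def grad_hess(y_true, margin):
    """Gradient and Hessian of binary cross-entropy w.r.t. the raw margin."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p - y_true, p * (1.0 - p)


def leaf_weight(G, H, lam):
    """Optimal leaf weight under the second-order approximation."""
    return -G / (H + lam)


def leaf_score(G, H, lam):
    """Structure score G^2 / (H + lambda) of a single leaf."""
    return G * G / (H + lam)


def split_gain(GL, HL, GR, HR, lam, gamma):
    """Loss reduction from a split, minus the per-leaf penalty gamma."""
    return 0.5 * (leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam)
                  - leaf_score(GL + GR, HL + HR, lam)) - gamma


# Hypothetical node: two positives go left, two negatives go right, all at margin 0.
left = [grad_hess(1, 0.0) for _ in range(2)]
right = [grad_hess(0, 0.0) for _ in range(2)]
GL, HL = sum(g for g, _ in left), sum(h for _, h in left)
GR, HR = sum(g for g, _ in right), sum(h for _, h in right)

print(leaf_weight(GL, HL, lam=1.0))                     # positive weight for the positive leaf
print(split_gain(GL, HL, GR, HR, lam=1.0, gamma=2.192))  # negative: split would be pruned
```

With so few samples in the node, the gain does not clear \(\gamma\), which is how the penalty discourages splits that isolate a handful of patients.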

§3.3 Bayesian Hyperparameter Optimization (Optuna TPE)

Unlike grid or random search, Optuna uses a Tree-structured Parzen Estimator (TPE) to intelligently navigate the hyperparameter space. We executed 50 trials with 5-fold Stratified Cross-Validation.

Optimised Hyperparameters (v2.4.0)
| Hyperparameter | Value |
|---|---|
| n_estimators | 80 |
| max_depth | 12 |
| learning_rate | 0.1477 |
| subsample | 0.7827 |
| min_child_weight | 9 |
| gamma | 2.192 |

The scale_pos_weight is automatically set as \(N_{\text{neg}}/N_{\text{pos}}\) to address inherent class imbalance in cardiac datasets.
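As a sketch of how the class-imbalance weight could be derived and combined with the tuned values, the snippet below computes \(N_{\text{neg}}/N_{\text{pos}}\) from a label vector. The label vector is hypothetical; the dictionary keys follow XGBoost's scikit-learn parameter names, and how the project actually assembles its parameters is not stated in the source.

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive examples in the training labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive examples in training labels")
    return n_neg / n_pos


# Tuned values reported for v2.4.0.
params = {
    "n_estimators": 80,
    "max_depth": 12,
    "learning_rate": 0.1477,
    "subsample": 0.7827,
    "min_child_weight": 9,
    "gamma": 2.192,
}

# Hypothetical training labels: 6 negatives, 4 positives.
train_labels = [0] * 6 + [1] * 4
params["scale_pos_weight"] = scale_pos_weight(train_labels)
print(params["scale_pos_weight"])
```

Computing the ratio on the training split only keeps the weight free of information from held-out records.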

§3.4 Clinical Probability Calibration (Platt Scaling)

Boosting tends to distort raw XGBoost probabilities, pushing them away from 0 and 1. We correct this with sigmoid calibration (Platt scaling) via CalibratedClassifierCV, so that a predicted 20% risk corresponds to an observed frequency of about 20% in the clinical population.

Brier Score — Calibration Measure
\[ \text{BS} = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2 = 0.0814 \]
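A Brier Score of zero corresponds to perfect probabilistic predictions, while an uninformative constant prediction of 0.5 scores 0.25. Our score of 0.0814 therefore indicates strong calibration on the hold-out set and supports reading the clinical "Risk Pulse" percentage as a probability estimate.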
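The Brier Score definition can be checked by hand with a few lines of Python. The probabilities and outcomes below are hypothetical and are not drawn from the study data.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical calibrated probabilities and observed outcomes.
probs = [0.9, 0.2, 0.7, 0.1]
outcomes = [1, 0, 1, 0]

print(brier_score(probs, outcomes))             # (0.01 + 0.04 + 0.09 + 0.01) / 4
print(brier_score([0.5] * 4, outcomes))         # uninformative baseline
```

Comparing against the constant-0.5 baseline gives a quick sense of how much better than chance a reported score is.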
End-to-end pipeline:
1. Clinical Input: 13 patient vitals via Streamlit / API
2. Preprocessor: StandardScaler + OHE, persisted with joblib
3. Safety Engine: ACC/AHA overrides · OOD · Entropy
4. XGBoost: Optuna-calibrated ensemble
5. SHAP + LIME: Waterfall · sensitivity · physician summary
6. LEP Engine: Least Effort Path roadmap
7. PDF Report: Audit hash · clinical export
§4 — Explainability

Multi-Modal XAI Framework

A dual-engine interpretability layer ensures every prediction is explainable from multiple mathematical perspectives — global consistency via SHAP and local sensitivity via LIME. Neither technique alone is sufficient for clinical trust.

SHAP · Global Consistency
Shapley Additive Explanations
Based on cooperative game theory, SHAP provides a mathematically consistent allocation of "blame" or "credit" to each feature. We utilize TreeSHAP — a fast and exact algorithm for tree ensembles that satisfies Efficiency, Symmetry, and Additivity axioms.
SHAPLEY VALUE FORMULA
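\[\phi_i = \sum_{S \subseteq \{x_1,\dots,x_M\}\setminus\{x_i\}} \frac{|S|!\,(M-|S|-1)!}{M!}\left[f(S \cup \{x_i\}) - f(S)\right]\]
The Shapley value \(\phi_i\) is the weighted average marginal contribution of feature \(x_i\) over all coalitions \(S\) of the remaining features, where \(M\) is the total number of features and \(f(S)\) is the model output when only the features in \(S\) are known. The weights are equivalent to averaging over all \(M!\) feature orderings.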
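The formula can be evaluated by brute force on a toy value function to see the Efficiency axiom in action. This is not TreeSHAP, which reaches the same values far more efficiently for tree ensembles; the value function `toy_value` and its coefficients are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    M = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(M):
            weight = factorial(size) * factorial(M - size - 1) / factorial(M)
            for S in combinations(others, size):
                coalition = frozenset(S)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi[i] = total
    return phi


def toy_value(S):
    """Hypothetical model output given only the features in S."""
    v = 0.0
    if "exang" in S:
        v += 0.20
    if "oldpeak" in S:
        v += 0.10
    if "exang" in S and "oldpeak" in S:
        v += 0.10   # interaction term, shared equally by both features
    if "chol" in S:
        v -= 0.05
    return v


features = ["exang", "oldpeak", "chol"]
phi = shapley_values(features, toy_value)
print(phi)
# Efficiency: contributions sum to f(all features) - f(empty set).
print(sum(phi.values()), toy_value(frozenset(features)) - toy_value(frozenset()))
```

The interaction credit is split evenly between the two interacting features, which is the behaviour the Symmetry axiom guarantees.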
LIME · Local Sensitivity
Local Interpretable Model-agnostic Explanations
LIME generates a linear approximation of the complex XGBoost model in the immediate vicinity of a specific patient's data point. It reveals how small changes in vitals (e.g., a 5 mmHg drop in BP) would shift the AI's risk assessment.
LIME OBJECTIVE
\[\xi(x) = \arg\min_{g \in G} \mathcal{L}(f, g, \pi_x) + \Omega(g)\]
where \(g\) is an interpretable linear surrogate, \(\pi_x\) is a proximity measure weighting perturbed samples by closeness to the patient \(x\), and \(\Omega(g)\) controls the surrogate's complexity.

Diagnostic "X-Ray" — SHAP Waterfall Decomposition

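SHAP waterfall, illustrative high-risk patient: E[f(X)] = 0.46, f(X) = 0.92

| Feature | SHAP value | Direction |
|---|---|---|
| exang = 1 | +0.22 | Risk factor |
| oldpeak = 2.5 | +0.18 | Risk factor |
| ca = 1 | +0.15 | Risk factor |
| cp = 4 | +0.12 | Risk factor |
| age = 55 | +0.08 | Risk factor |
| thalach = 130 | −0.09 | Protective factor |
| chol = 250 | −0.05 | Protective factor |

The seven contributions shown sum to +0.61, whereas the gap between E[f(X)] and f(X) is +0.46; the figure is illustrative and does not itemise every term of the decomposition.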
[Figure: Risk Optimization Radar, Patient vs Target. Axes: trestbps, chol, oldpeak, thalach, ca, age. Series: Current, LEP Target.]
Shannon Entropy Confidence Engine
\[ H(p) = -\left(p\log_2 p + (1-p)\log_2(1-p)\right), \qquad C = 1 - H(p) \]
Confidence \(C\) is high when \(p\) is close to 0 or 1 and low when \(p \approx 0.5\), where entropy is maximal and a LOW confidence warning is raised.
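A short sketch of the confidence computation, using the definitions of \(H(p)\) and \(C\) given above. The function names are illustrative, and no band cut-offs are applied because the source does not state them.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli(p) prediction."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0  # limit of p*log2(p) as p -> 0 is 0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def confidence(p):
    """Confidence C = 1 - H(p): 0 at p = 0.5, 1 at p = 0 or p = 1."""
    return 1.0 - binary_entropy(p)


for p in (0.5, 0.75, 0.9234):
    print(p, round(confidence(p), 4))
```

Because \(H(p)\) is symmetric around 0.5, a 10% and a 90% prediction receive the same confidence value.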
§5 — Safety & Trust

Clinical Safety & Trust Framework

In medical AI, "Black Box" models are clinically unusable. CardioSense AI implements six interlocking layers of trust — from hard-stop clinical overrides to cryptographic audit hashes.

1. Clinical Overrides (ACC/AHA Alignment)
Hard-stop rules based on ACC/AHA guidelines trigger alerts regardless of AI probability. If trestbps ≥ 180 mmHg (Hypertensive Crisis), the system overrides any model output below 90% and escalates immediately. Similar triggers exist for ca ≥ 2 (multivessel disease) and oldpeak > 3.0 (ischemic severity).
Tags: Deterministic Hard-Stop · ACC/AHA 2017 Guidelines
2. Out-of-Distribution (OOD) Monitoring
Compares input data against the statistical bounds of the training set (age ranges, BP maximums). If a patient's vitals fall outside the training distribution, for example age outside [29, 77] or cholesterol above 564 mg/dl, a clinical alert is surfaced before inference.
Tags: Statistical Bounds · Training Distribution Guard
3. Entropy-Based Confidence Quantification
Every prediction includes a Confidence Gauge derived from \(C = 1 - H(p)\) where \(H(p) = -(p\log_2 p + (1-p)\log_2(1-p))\). A high-entropy prediction (\(p \approx 0.5\)) yields a LOW confidence warning, signalling that the patient sits on a statistical boundary and requires closer human investigation.
Tags: Shannon Entropy · High / Moderate / Low
4. Cryptographic Audit Hashes
Every inference result is cryptographically linked to the model version and timestamp via SHA-256 hashing (with usedforsecurity=False flagged for audit compliance). This allows clinicians to verify that the decision support engine has not been altered since its last validated training run.
Tags: SHA-256 Audit Hash · Full Traceability
5. Adaptive Monitoring Gateway (Evidently AI)
Detects Data Drift using Kolmogorov-Smirnov (K-S) tests on clinical feature distributions, and Performance Decay via Recall Stability monitoring from real-world clinician feedback. Uses an Adaptive Search pattern for 99.9% telemetry uptime across diverse hosting environments.
Tags: K-S Test · Drift Share · Concept Drift Alert
6. Security Audit (Bandit SAST + Safety SCA)
Every release is audited with Bandit (Static Application Security Testing) for insecure Python patterns (CWE-mapped reports) and with Safety (Software Composition Analysis) to verify that all dependencies are free from known CVEs. Current status: 100% PASS, with no High/Medium vulnerabilities detected.
Tags: 100% PASS · Bandit + Safety · OWASP Aligned
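One way the audit hash in layer 4 could be produced is sketched below. Which fields are hashed, and how they are serialised, is an assumption: the source only states that the result is linked to the model version and timestamp via SHA-256. The function name `audit_hash` is hypothetical.

```python
import hashlib
import json


def audit_hash(model_version, timestamp, vitals, risk_probability):
    """SHA-256 digest over a canonical JSON encoding of one inference record."""
    record = {
        "model_version": model_version,
        "timestamp": timestamp,
        "vitals": vitals,
        "risk_probability": risk_probability,
    }
    # sort_keys and fixed separators make the encoding deterministic.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


digest = audit_hash("2.4.0", "2026-04-01T09:30:00Z", {"age": 55, "trestbps": 140}, 0.9234)
print(digest)
```

Any change to the model version, the inputs, or the reported probability yields a different digest, which is what makes the hash usable as a tamper check in the exported report.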
§6 — Risk Optimization

The Least Effort Path Algorithm

The most significant innovation in CardioSense AI is the move from passive prediction to active intervention planning. The LEP algorithm identifies the minimum patient effort required to reach a clinician-set target risk level.

LEP Optimization Objective
\[ \arg\min_{\Delta X}\; \mathrm{Risk}(X + \Delta X) + \lambda \sum_i w_i |\Delta x_i| \]
where \(X\) is the current patient vitals vector, \(\Delta X\) is the proposed intervention, \(w_i\) is the clinical cost weight (difficulty) of modifying feature \(i\), and \(\lambda\) controls the effort-risk tradeoff. The result is a prioritized Treatment Roadmap ranking interventions by their risk-reduction ROI.
Clinical Effort Weights — w_i
Cost-Weighted Feasibility Matrix
| Weight w_i | Feature | Feasibility |
|---|---|---|
| 1.0 | trestbps | High feasibility: medication or dietary intervention |
| 1.5 | chol | Moderate: statin therapy + lifestyle |
| 2.0 | thalach | Lower feasibility: sustained conditioning required |
| 3.5 | oldpeak | Structural effort: cardiologic intervention |
The greedy optimizer identifies which modifications yield the greatest risk reduction relative to their clinical effort — producing an actionable, prioritized clinical sequence for the treating physician.
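The greedy search can be sketched as follows. The effort weights are the ones listed above and the bounds reuse the dataset ranges from §2; everything else is an assumption made for illustration: the step sizes, the stand-in risk model `toy_risk` with its coefficients, and the choice to measure effort per step as \(w_i\) rather than \(w_i|\Delta x_i|\) in raw units. It is not the project's Intervention Simulator.

```python
import math

# Effort weights w_i from the feasibility matrix.
EFFORT_WEIGHTS = {"trestbps": 1.0, "chol": 1.5, "thalach": 2.0, "oldpeak": 3.5}
# Hypothetical step per iteration; the sign is the assumed beneficial direction.
STEPS = {"trestbps": -5.0, "chol": -10.0, "thalach": 5.0, "oldpeak": -0.25}
# Dataset ranges, used to keep proposals inside the training distribution.
BOUNDS = {"trestbps": (94.0, 200.0), "chol": (126.0, 564.0),
          "thalach": (71.0, 202.0), "oldpeak": (0.0, 6.2)}


def toy_risk(x):
    """Stand-in logistic risk model; coefficients are illustrative only."""
    z = (0.03 * (x["trestbps"] - 120) + 0.01 * (x["chol"] - 200)
         - 0.02 * (x["thalach"] - 150) + 0.8 * x["oldpeak"] - 0.5)
    return 1.0 / (1.0 + math.exp(-z))


def least_effort_path(x, risk_fn, target, max_steps=50):
    """Greedily apply the step with the best risk reduction per unit effort."""
    x = dict(x)
    current = risk_fn(x)
    path = []
    while current > target and len(path) < max_steps:
        best = None
        for feat, step in STEPS.items():
            value = x[feat] + step
            lo, hi = BOUNDS[feat]
            if not lo <= value <= hi:
                continue  # proposal would leave the observed range
            trial = dict(x, **{feat: value})
            gain = current - risk_fn(trial)
            roi = gain / EFFORT_WEIGHTS[feat]
            if gain > 0 and (best is None or roi > best[0]):
                best = (roi, feat, trial)
        if best is None:
            break  # no feasible step lowers the risk
        _, feat, x = best
        current = risk_fn(x)
        path.append((feat, x[feat], current))
    return path, current


patient = {"trestbps": 140.0, "chol": 250.0, "thalach": 130.0, "oldpeak": 2.5}
roadmap, final_risk = least_effort_path(patient, toy_risk, target=0.5)
for feat, value, risk in roadmap:
    print(f"{feat:9s} -> {value:7.2f}  risk={risk:.4f}")
```

In this toy setting the low-cost features are exhausted first, and the high-cost `oldpeak` is only touched once cheaper options have reached their bounds, which mirrors the ordering the effort weights are meant to induce.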
§7 — Experimental Results

Clinical Performance Metrics

Validated using a Hold-Out Test Set (20%) and Stratified 5-Fold Cross-Validation during optimization. Metrics represent the system's state after Sigmoid Calibration and Target-Enriched Optuna optimization.

Clinical Performance, v2.4.0 (n = 303, UCI Cleveland Dataset)

| Metric | Score | Professional Interpretation |
|---|---|---|
| Model Version | v2.4.0 | Professional Optuna-calibrated clinical ensemble |
| Clinical Accuracy | 88.52% | High fidelity across all diagnostic classes |
| ROC-AUC Score | 0.9621 | Exceptional class discrimination power |
| PR-AUC Score | 0.9553 | Precise performance in unbalanced medical sets |
| Recall (Sensitivity) | 92.86% | Critical safety metric: minimising false negatives |
| Precision | 83.87% | |
| F1-Score | 0.8814 | Robust harmonic balance of precision and recall |
| Brier Score | 0.0814 | Strong probability calibration (Platt Scaling verified) |
| Test Coverage | 63.00% | 40 verified clinical scenarios across core logic |
| Security Audit | 100% Pass | Bandit (SAST) & Safety (SCA) verified release |
| Data Drift | Monitored | Adaptive Evidently AI K-S monitoring gateway |
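As a consistency check, the reported accuracy, recall, precision, and F1 can all be reproduced from a single confusion matrix on a 61-record hold-out set (20% of 303). The counts below are inferred from the published percentages and are not stated in the source.

```python
# Inferred confusion-matrix counts (assumption, not reported by the authors).
TP, FN, FP, TN = 26, 2, 5, 28

total = TP + FN + FP + TN
accuracy = (TP + TN) / total
recall = TP / (TP + FN)
precision = TP / (TP + FP)
f1 = 2 * precision * recall / (precision + recall)

print(f"n={total}  accuracy={accuracy:.2%}  recall={recall:.2%}  "
      f"precision={precision:.2%}  f1={f1:.4f}")
```

Under these counts the two false negatives are the cases the Recall metric is designed to minimise, while the five false positives account for the lower Precision.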

§7.3 Bias Audit & Demographic Parity

Clinical Priority: We prioritize Recall (Sensitivity) in senior and female populations to ensure no high-risk patient is "missed" due to algorithmic bias. The system maintains 100% Recall for the Senior (≥65) cohort — the highest baseline risk population.

| Demographic Group | N | Accuracy | Recall (Sensitivity) | F1-Score |
|---|---|---|---|---|
| Gender: Female | 20 | 95.00% | 85.71% | 92.31% |
| Gender: Male | 41 | 87.80% | 95.24% | 88.89% |
| Age: Young (<45) | 13 | 100.0% | 100.0% | 100.0% |
| Age: Middle (45–64) | 42 | 90.48% | 90.91% | 90.91% |
| Age: Senior (≥65) ★ | 6 | 66.67% | 100.0% ★ | 75.00% |

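★ Recall of 100% for the Senior (≥65) population is clinically vital. With N = 6, an accuracy of 66.67% corresponds to 4 of 6 correct predictions; because recall is 100%, both errors are false positives, consistent with the deliberate prioritisation of sensitivity over specificity in the highest-risk cohort. Estimates from a cohort this small carry wide uncertainty.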

§8 — Architecture

System Architecture

A four-layer decoupled architecture for maximum scalability and auditability. Each layer is independently testable and integrates via well-defined interfaces.

01. Core Intelligence Layer
Stack: Python · XGBoost · SHAP · LIME
The predictive and explainability engines. HeartDiseasePredictor wraps the XGBoost pipeline, the SHAP Engine provides TreeSHAP waterfall plots, the LIME Engine generates local linear surrogates, and the Intervention Simulator runs the LEP coordinate descent optimizer.
Paths: src/models/ · src/explainability/ · src/simulation/
02. Inference Gateway (FastAPI)
Stack: FastAPI · Pydantic · Uvicorn · Lifespan
A RESTful API that handles real-time risk assessments, with Pydantic validation for medical data integrity. It uses Lifespan context management for robust initialization, injects an X-Request-ID for full clinical audit traceability, and writes to a rotating JSON logger (5MB, 3 backups).
Tags: api/main.py · /predict · /monitoring · X-Request-ID
03. Clinical Dashboard (Streamlit)
Stack: Streamlit · Plotly · fpdf
Premium clinician-focused interface rendering SHAP waterfall plots, optimization radar charts, LIME sensitivity analysis, and global population-level importance. It generates professional clinical PDF reports with an embedded Audit Hash. CI/CD runs via GitHub Actions (5-job pipeline).
Tags: app/main.py · PDF Reports · Audit Hash Display
04. Data & Persistence
Stack: SQLite · Evidently AI · joblib · JSON
A SQLite database stores persistent inference telemetry, enabling longitudinal drift analysis. model_metadata.json embeds calibration metrics, fairness audit results, and feature importance rankings. Evidently AI monitors K-S statistical distributions across all 25 features.
Tags: SQLite · history.db · model_metadata.json · reports/monitoring/

Automated Clinical Pipeline — GitHub Actions (5 Jobs)

Job 1 · Linting: flake8 (code quality)
Job 2 · Clinical Testing: pytest, 40 scenarios
Job 3 · Model Ingest Audit: ColumnTransformer verification
Job 4 · Security Audit: Bandit + Safety scan
Job 5 · Docker Build: FastAPI container
§9 — Production API

FastAPI Inference Gateway

CardioSense AI exposes a production-grade FastAPI REST interface for seamless integration with Electronic Health Record (EHR) systems. Every request receives a unique X-Request-ID for full clinical audit traceability.

| Method | Endpoint | Description |
|---|---|---|
| POST | /predict | Primary inference endpoint: submits patient vitals, returns risk probability with X-Request-ID audit trace |
| GET | /monitoring/status | Returns data drift (Evidently K-S) and concept drift (Recall Stability) summary with timestamps |
| POST | /feedback/{id} | Clinician endpoint for ground-truth outcome labeling; feeds the Concept Drift monitoring loop |
| GET | /health | System health check: returns model version, uptime heartbeat, and artifact load status |
| GET | /docs | Interactive Swagger UI for live endpoint testing and documentation |
REQUEST — POST /predict
// Patient vitals payload
{
  "age": 55,
  "sex": 1,
  "cp": 4,
  "trestbps": 140,
  "chol": 250,
  "fbs": 0,
  "restecg": 1,
  "thalach": 130,
  "exang": 1,
  "oldpeak": 2.5,
  "slope": 2,
  "ca": 1,
  "thal": 7
}
RESPONSE — 200 OK
{
  "prediction": 1,
  "risk_probability": 0.9234,
  "status": "Positive (High Risk)",
  "model_version": "2.4.0",
  "request_id": "550e8400-e2..."
}

// /monitoring/status excerpt
{
  "drift_share": 0.0,
  "dataset_drift": false,
  "current_recall": 0.881,
  "recall_drop": 0.0119,
  "concept_drift": false
}
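A minimal client sketch for the /predict endpoint, using only the Python standard library. The base URL `http://localhost:8000` and the helper names are assumptions; the payload fields are the ones shown in the request example above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local development server


def build_predict_request(vitals, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for /predict."""
    body = json.dumps(vitals).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(vitals, base_url=BASE_URL, timeout=10.0):
    """Send the request and decode the JSON response body."""
    request = build_predict_request(vitals, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


vitals = {
    "age": 55, "sex": 1, "cp": 4, "trestbps": 140, "chol": 250, "fbs": 0,
    "restecg": 1, "thalach": 130, "exang": 1, "oldpeak": 2.5, "slope": 2,
    "ca": 1, "thal": 7,
}
request = build_predict_request(vitals)
print(request.get_method(), request.full_url)
# result = predict(vitals)  # requires a running CardioSense AI gateway
```

Separating request construction from sending makes the payload easy to inspect and test without a live server.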
§10 — Monitoring & Sustainability

Adaptive Monitoring Gateway

A medical CDSS must remain accurate as the underlying patient population evolves. CardioSense AI integrates adaptive monitoring that detects both distributional drift and predictive performance decay in real time.

Data Drift (Evidently AI)
Kolmogorov-Smirnov tests on all 25 monitored clinical feature distributions. Drift Share reports the percentage of features showing statistically significant deviation from the validated training baseline.
Concept Drift (Recall Stability)
Tracks real-world Recall via clinician ground-truth feedback loops (/feedback/{id}). Baseline Recall = 92.86%. An alert is triggered when decay exceeds acceptable clinical safety thresholds.
Persistent Telemetry (SQLite)
All inference events (probability, vitals, audit hash, timestamp) are logged to a local SQLite database. This powers longitudinal drift analysis and performance audit dashboards.
Structured JSON Logging
Rotating file-based JSON logger (5MB per file, max 3 backups) at logs/cardiosense.log. Contains API access logs, inference probability distributions, and internal trace IDs.
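The rotating JSON logger described above could be configured along these lines with the standard library. The `JsonFormatter` class, the field names in each log entry, and the logger name are hypothetical; only the 5MB size limit, the three backups, and the log path come from the text.

```python
import json
import logging
from logging.handlers import RotatingFileHandler


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional trace ID passed via logger.info(..., extra={"request_id": ...}).
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            entry["request_id"] = request_id
        return json.dumps(entry)


def build_logger(path="logs/cardiosense.log", name="cardiosense"):
    """Create a logger that rotates at 5 MB and keeps 3 backup files."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = RotatingFileHandler(
            path, maxBytes=5 * 1024 * 1024, backupCount=3, encoding="utf-8"
        )
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```

A call such as `build_logger().info("risk=%s", 0.92, extra={"request_id": "abc"})` would then append one JSON line to the log file, provided the `logs/` directory exists.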
§11–12 — Discussion & Future

Conclusions & Future Roadmap

CardioSense AI demonstrates that the perceived tradeoff between accuracy and interpretability is a false dichotomy. Post-hoc attribution (SHAP) alongside a high-capacity model (XGBoost) achieves state-of-the-art accuracy with full clinical transparency.

§11.1 Interpretability-Accuracy Tradeoff: By using SHAP as a post-hoc attribution layer over XGBoost, we achieve clinical transparency without sacrificing predictive capacity. The LIME sensitivity analysis further ensures that boundary-case patients are flagged — not silently misclassified.

Federated Learning
Training across multiple institutions without compromising PHI (Protected Health Information), enabling larger, more representative cohorts without centralising sensitive patient data.
FHIR-Compliant API
Seamless integration into Electronic Health Record systems (Epic, Cerner) via HL7 FHIR-compliant API endpoints, eliminating manual data entry from clinical workflows.
Real-Time ECG Analysis
Incorporating deep temporal features from wearable sensors, moving from a static snapshot of 13 vitals to a continuous, longitudinal risk assessment framework.

References & Technical Bibliography

[1] Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[2] Lundberg, S. M., & Lee, S. I. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems (NeurIPS).
[3] Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16).
[4] ACC/AHA Task Force. (2017). Guidelines for the Prevention, Detection, Evaluation, and Management of High Blood Pressure in Adults. Journal of the American College of Cardiology.
[5] Dua, D., & Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. [Cleveland Heart Disease Dataset, N=303].