Research Case Study · 2026 · Under Peer Review

Beyond Visual
Plausibility

Faithfulness-Aware Comparison of CNNs and Vision Transformers
for Chest X-Ray Classification

The model that produces the most visually convincing heatmaps is the one whose explanations are causally hollow.

Department of Computer Science · Islamic University of Science and Technology, Awantipora
Dr. Owais  ·  Shahid Ul Islam
VGG16
Transfer Learning · 83% acc.
ViT-B/16
Transfer Learning · 82% acc.
Custom CNN
From Scratch · 74% acc.
00 — Context

What Makes This Study Different

Most deep learning studies ask "how accurate is the model?" — this study asks the harder question: can we trust why the model made that decision?

Conventional Study | This Study
Reports accuracy only | Reports accuracy plus faithfulness of explanations
One CAM method per model | Two CAM methods with inter-method agreement evaluation
Visual inspection of heatmaps | Quantitative pixel deletion — AOPC and AUC curves
Single metric evaluation | Six-dimensional explainability framework
No statistical correction | Bonferroni-corrected non-parametric testing (α = 0.0083)
Transfer Learning
Compare pretrained architectures (VGG16, ViT-B/16) against a custom baseline to quantify the advantage of ImageNet pretraining in the medical domain.
Explainable AI
Apply GradCAM++ and EigenCAM to both top models. Evaluate heatmaps across six independent quantitative dimensions with rigorous statistical testing.
Faithfulness Testing
Use progressive pixel deletion to ask whether highlighted pixels causally drive model confidence — revealing a paradox that visual inspection alone cannot detect.
01 — Data

Dataset & Class Structure

6,432 posterior-anterior chest X-rays from the Kaggle Pneumonia & COVID-19 Image Dataset — a challenging four-class classification problem with meaningful class imbalance.

Total Images
6,432
PA chest X-rays · stratified split
Classes
4
Normal · Bacterial · Viral · COVID-19
Training Split
80%
10% val · 10% test (withheld)
Imbalance Handling
w_c = N / (K · n_c)
Class-weighted cross-entropy
Class Distribution in Training Set
Bacterial Pneumonia
~40%
Viral Pneumonia
~23%
Normal
~22%
COVID-19
~15%
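The weighting formula can be sketched in a few lines. The per-class counts below are hypothetical, chosen only to follow the ~40/23/22/15 split; the actual counts are not given in this summary.

```python
# Hypothetical per-class training counts following the ~40/23/22/15 split.
counts = {"bacterial": 2060, "viral": 1184, "normal": 1133, "covid19": 772}

N = sum(counts.values())   # total training images
K = len(counts)            # number of classes (4)

# w_c = N / (K * n_c): under-represented classes receive larger weights.
weights = {c: N / (K * n) for c, n in counts.items()}

# With perfectly balanced classes every weight would be 1.0; here COVID-19,
# the rarest class, gets the largest weight. In PyTorch these values would
# be passed as torch.nn.CrossEntropyLoss(weight=torch.tensor([...])).
```

This is why the minority COVID-19 class is not drowned out during training despite making up only ~15% of the data.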
02 — Transfer Learning

Three Architectures, One Question

Does ImageNet pretraining confer a decisive advantage for medical image classification? Two pretrained models face a custom baseline trained from scratch on the same data.

Stage 1 — Frozen Backbone
Classification head adapts rapidly to the 4-class target label space. The pretrained encoder's ImageNet representations are preserved intact, preventing catastrophic forgetting.
Stage 2 — Fine-Tuning
Full network trained end-to-end at η = 10⁻⁴. Incremental domain adaptation refines — rather than overwrites — the ImageNet initialization advantage.
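A PyTorch-style sketch of the two-stage schedule, using a tiny stand-in model rather than VGG16 itself. The Stage 1 head learning rate is an assumption; the Stage 2 rate follows the stated η = 10⁻⁴.

```python
import torch
import torch.nn as nn

# Tiny stand-in for a pretrained backbone plus classification head
# (a real VGG16 or ViT-B/16 would be handled identically).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten()
)
head = nn.Linear(8, 4)          # 4-class target label space
model = nn.Sequential(backbone, head)

# Stage 1 — frozen backbone: only the new head receives gradients,
# so the pretrained representations are preserved intact.
for p in backbone.parameters():
    p.requires_grad = False
stage1_opt = torch.optim.Adam(head.parameters(), lr=1e-3)  # head LR assumed

# Stage 2 — unfreeze everything and fine-tune end-to-end at eta = 1e-4,
# refining rather than overwriting the initialization.
for p in backbone.parameters():
    p.requires_grad = True
stage2_opt = torch.optim.Adam(model.parameters(), lr=1e-4)
```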
Early Stopping
Patience P = 3 epochs, threshold δ = 10⁻⁴. Checkpoint restored at minimum validation loss. Prevents overfitting without a predefined epoch ceiling.
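The stopping rule can be expressed compactly. This is a sketch of the stated policy (P = 3, δ = 10⁻⁴), not the authors' training loop.

```python
def early_stop_train(val_losses, patience=3, min_delta=1e-4):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training halts after `patience` consecutive epochs without an
    improvement of at least `min_delta`; the checkpoint from `best_epoch`
    (minimum validation loss) is the one restored."""
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:        # meaningful improvement
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:           # patience exhausted
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Illustrative loss curve: best at epoch 3, stops at epoch 6.
stop, best = early_stop_train([0.9, 0.6, 0.41, 0.42, 0.43, 0.44])
```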
Differential Augmentation
VGG16 tolerates aggressive augmentation (±30°, blur, erasing). ViT-B/16 requires a moderated pipeline, discovered empirically: aggressive transforms destabilize transformer training.
VGG16 — Aggressive
  • ±30° random rotation
  • Scale [0.8, 1.0] crop
  • Horizontal flip p=0.5
  • Gaussian blur
  • Random erasing
  • ImageNet normalize
ViT-B/16 — Moderated
  • ±15° rotation (half of VGG16)
  • Colour jitter (B, C, S)
  • Horizontal flip p=0.5
  • Random erasing ↓ prob.
  • ImageNet normalize
  • No blur — patch stability
Custom CNN — Standard
  • Random rotation
  • Horizontal flip
  • Colour jitter
  • No erasing
  • Random initialize
  • Runs all 15 epochs
// Training Outcomes
Model | Stop Epoch | Best Val Acc | Best Val Loss | Train Time
VGG16 | 6 | 84.48% | 0.4021 | ~3.3 min
ViT-B/16 | 7 | 82.38% | 0.4169 | ~5.4 min
Custom CNN | 15 | 74.37% | 0.8472 | ~11.1 min

→ Transfer learning delivers a 9-point accuracy advantage and 2–3× faster convergence over training from scratch.

03 — Classification Results

Model Performance

Overall classification performance across all three architectures, measured on the withheld stratified 10% test split after class-weighted cross-entropy training.

Transfer Learning
VGG16
ImageNet Pretrained · Aggressive Augmentation
83%
Accuracy
0.84
Macro F1
  • Excels at focal, spatially-bounded patterns
  • Bacterial Pneumonia F1 = 0.83 (strong)
  • COVID-19 F1 = 0.98 (near-perfect)
  • 13 convolutional layers · VGG-pure design
⏱ Stopped at epoch 6 · inference 1.068s
Transfer Learning
ViT-B/16
HuggingFace Pretrained · Moderated Augmentation
82%
Accuracy
0.84
Macro F1
  • Global self-attention — no spatial bias
  • COVID-19 recall = 0.99 (best of all models)
  • 14×14 patch grid attention maps
  • 196 patches of 16×16 from 224×224 input
⏱ Stopped at epoch 7 · inference 0.977s
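The patch arithmetic behind the 14×14 attention grid is worth making explicit; the scores below are placeholders for real per-patch attention values.

```python
# Patch arithmetic for ViT-B/16 on a 224x224 input.
image_size, patch_size = 224, 16
grid = image_size // patch_size      # 14 patches per side
num_patches = grid * grid            # 196 tokens (plus one [CLS] token)

# A per-patch score vector folds back into the 14x14 grid for overlay
# on the X-ray (placeholder scores; real ones come from attention/CAM).
scores = [0.0] * num_patches
heatmap = [scores[r * grid:(r + 1) * grid] for r in range(grid)]
```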
From Scratch · Baseline
Custom CNN
Random Init · 4 Conv Blocks · 15 Epochs
74%
Accuracy
0.75
Macro F1
  • Quantifies pretraining advantage directly
  • 4 × (Conv→BN→ReLU×2 → MaxPool → Dropout)
  • Viral Pneumonia hardest class (F1 = 0.55)
  • Needs all 15 epochs to converge
⏱ All 15 epochs · inference 0.768s
// Class-wise F1 Scores
Class | VGG16 | ViT-B/16 | Custom CNN
Normal | 0.91 | 0.91 | 0.80
Bacterial Pneumonia | 0.83 | 0.81 | 0.74
Viral Pneumonia | 0.64 | 0.64 | 0.55
COVID-19 | 0.98 | 0.99 | 0.92

→ Viral Pneumonia remains the hardest class across all models due to its radiographic overlap with bacterial pneumonia.

"The model that produces the most visually convincing heatmaps is the one whose explanations are causally hollow."
— The Explainability Paradox · Central finding of this study
04 — The Paradox

Progressive Pixel Deletion

If a heatmap genuinely reflects the model's reasoning, removing the highlighted pixels should reduce confidence. We test exactly this — and discover opposite behavior in the two models.

// CONFIDENCE vs. PIXELS REMOVED
VGG16 — confidence rises
ViT-B/16 — confidence falls
[Chart: model confidence (0.2–1.0) vs. fraction of highlighted pixels removed (0–90%) · the two curves cross at ~33%]

n=32 stratified images · T=10 deletion steps · pixels replaced with channel-wise ImageNet mean
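The deletion protocol and the AOPC score it yields can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' code: the toy inputs reuse the five reported confidence checkpoints rather than the full T = 10 steps averaged over n = 32 images, so the resulting values differ from the paper's AOPC numbers while showing the same signs.

```python
def delete_top_pixels(image, heatmap, frac, fill):
    """Replace the top-`frac` fraction of heatmap-ranked pixels with `fill`
    (the channel-wise ImageNet mean in the study). Flat-list sketch."""
    order = sorted(range(len(heatmap)), key=lambda i: -heatmap[i])
    k = int(frac * len(heatmap))
    out = list(image)
    for i in order[:k]:
        out[i] = fill
    return out

def aopc(confidences):
    """Area Over the Perturbation Curve for one image.

    `confidences[k]` is the model's confidence in its original prediction
    after the k-th deletion step (confidences[0] = unperturbed). One common
    formulation: the mean confidence drop over all steps. Positive AOPC
    means deletion hurts confidence, i.e. the explanation is faithful."""
    c0 = confidences[0]
    drops = [c0 - c for c in confidences[1:]]
    return sum(drops) / len(drops)

# Reported confidence checkpoints (0%, 10%, 33%, 70%, 90% deleted):
vgg = [0.47, 0.51, 0.74, 0.91, 0.98]   # confidence RISES -> negative AOPC
vit = [0.99, 0.96, 0.63, 0.44, 0.41]   # confidence FALLS -> positive AOPC
```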

Causally Unfaithful
VGG16
Visually convincing heatmaps. Anatomically broad. Clinically intuitive. Yet confidence rises as highlighted pixels are removed.
0.828
AUC ↑ (bad)
−0.012
AOPC ↓ (bad)
Confidence progression as pixels removed:
0% → 0.47 · 10% → 0.51 · 33% → 0.74 · 70% → 0.91 · 90% → 0.98
Causally Faithful
ViT-B/16
Patchier, less visually consistent heatmaps. Some clinically puzzling. Yet confidence falls sharply as highlighted pixels are removed.
0.588
AUC ↓ (good)
+0.199
AOPC ↑ (good)
Confidence progression as pixels removed:
0% → 0.99 · 10% → 0.96 · 33% → 0.63 · 70% → 0.44 · 90% → 0.41
05 — Root Causes

Three Interlocking Causes

VGG16's explanation infidelity is not a single failure — it arises from three compounding structural properties of convolutional gradient geometry.

01
Negative Inter-Method Agreement
ρ(VGG16) = −0.309
GradCAM++ and EigenCAM identify different regions for the same VGG16 prediction. When two independently derived methods actively contradict each other, neither can be trusted to represent the model's true reasoning.
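Inter-method agreement is a rank correlation between two flattened heatmaps. A minimal sketch, assuming no tied CAM values (scipy.stats.spearmanr, which handles ties, is the robust choice in practice):

```python
def spearman(a, b):
    """Spearman rank correlation between two flattened heatmaps.
    Sketch: assumes no tied values, adequate for continuous CAM scores."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    ra, rb = ranks(a), ranks(b)
    n = len(a)
    mean = (n - 1) / 2
    cov = sum((x - mean) * (y - mean) for x, y in zip(ra, rb))
    var = sum((x - mean) ** 2 for x in ra)  # same for rb (permuted ranks)
    return cov / var
```

A negative ρ, like VGG16's −0.309 between GradCAM++ and EigenCAM, means the two methods rank pixel importance in opposite orders for the same prediction.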
02
Gradient Landscape Instability
Robustness ρ(VGG16) = 0.542
Thirteen stacked convolutional layers with nonlinear activations produce a highly non-smooth gradient landscape. Small input perturbations cause large heatmap shifts. The CAM reflects a snapshot of local gradient geometry, not stable feature use.
03
Non-Pathological Decision Basis
Sparsity 0.466 · Entropy 5.159
High sparsity and high entropy — superficially contradictory — are reconciled by a model relying on distributed, low-level statistical image properties (background, scanner artifacts) rather than coherent pathological structures.
VGG16 — The Hollow Explainer
Plausible but not causal
Heatmaps concentrate over bilateral lower lung zones for COVID-19 — anatomically consistent, clinically familiar. Yet confidence rises monotonically as these pixels are erased. The actual decision signal lives in background regions the heatmap does not highlight.
Visual plausibility ≠ causal faithfulness
ViT-B/16 — The Honest Explainer
Imperfect but genuine
Patchier attention maps. Some edge-focused, some puzzling to a radiologist. But inter-method agreement is positive (+0.301), robustness is high (0.809), and confidence falls sharply when highlighted patches are removed — causal structure is real.
Imperfect appearance, genuine causality
06 — Explainability Framework

Six-Dimensional Evaluation

Heatmap quality measured across six independent dimensions with Bonferroni-corrected statistical testing — all six comparisons reach significance at α = 0.0083.

// EXPLAINABILITY PROFILE — GRADCAM++ · EIGENCAM
Dimension | VGG16 Score | ViT-B/16 Score
Visual Intuitiveness | 0.90 | 0.60
Activation Contrast | 0.45 | 0.85
Spatial Selectivity | 0.80 | 0.45
Perturbation Robustness ↑ | 0.35 | 0.95
Inter-Method Agreement ↑ | −0.31 | 0.75
Causal Faithfulness ↑ | −0.01 | 0.80

↑ = higher is better · VGG16 leads on surface metrics · ViT-B/16 leads on every trustworthiness dimension

// Quantitative Metric Values
Metric | VGG16 | ViT-B/16 | Winner
Shannon Entropy | 5.159 ± 0.034 | 4.987 ± 0.092 |
Activation Std Dev | 0.216 ± 0.018 | 0.250 ± 0.024 | ViT-B/16
Sparsity | 0.466 ± 0.148 | 0.252 ± 0.116 |
Top-k Mass Concentration | 16.350 ± 0.874 | 16.197 ± 0.861 | Tie
Perturbation Robustness ↑ | 0.542 ± 0.215 | 0.809 ± 0.217 | ViT-B/16
Inter-Method Agreement ↑ | −0.309 ± 0.483 | +0.301 ± 0.406 | ViT-B/16
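The entropy and sparsity metrics plausibly follow definitions like the ones below; the paper's exact formulas are not reproduced here, so both functions (including the sparsity threshold) are assumptions.

```python
import math

def heatmap_entropy(h, eps=1e-12):
    """Shannon entropy (bits) of a heatmap normalized to a distribution.
    High entropy = activation spread over many pixels (assumed definition)."""
    total = sum(h) + eps * len(h)
    p = [(v + eps) / total for v in h]
    return -sum(pi * math.log2(pi) for pi in p)

def heatmap_sparsity(h, threshold=0.1):
    """Fraction of pixels below `threshold` after max-normalization
    (one plausible definition; the threshold is an assumption)."""
    m = max(h)
    return sum(1 for v in h if v / m < threshold) / len(h)
```

Under definitions like these, VGG16's combination of high sparsity and high entropy is possible only when activation is scattered over many weak, distributed regions rather than one coherent focus.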

Statistical testing: Mann-Whitney U with Bonferroni correction. All six comparisons significant at p < 0.0083.
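The testing setup can be sketched as follows. The `mann_whitney_u` helper uses the normal approximation without tie correction and is illustrative only (scipy.stats.mannwhitneyu is what one would use in practice); the sample values are made up.

```python
import math

# Bonferroni correction over the six explainability dimensions.
ALPHA, M = 0.05, 6
threshold = ALPHA / M            # 0.00833..., the paper's alpha = 0.0083

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U p-value via the normal approximation,
    without tie correction (illustrative sketch)."""
    n1, n2 = len(a), len(b)
    # U = number of (a_i, b_j) pairs with a_i > b_j (+0.5 for ties).
    u = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))

# Made-up per-image robustness scores for two models, clearly separated:
p = mann_whitney_u([0.80, 0.90, 0.85, 0.95, 0.88, 0.92, 0.90, 0.87],
                   [0.30, 0.40, 0.35, 0.45, 0.38, 0.42, 0.40, 0.37])
significant = p < threshold      # clears the corrected threshold
```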

07 — Clinical Risk

Why This Matters Clinically

Explanation infidelity is not an academic curiosity. When a hollow explanation accompanies a wrong prediction, it actively prevents the clinician from detecting the error.

Real Case — Identified in Qualitative Analysis

VGG16 predicted Normal with 90.8% confidence on an image whose ground truth was Viral Pneumonia. The GradCAM++ heatmap concentrated activation on the vertebral column and chest wall — non-pathological regions that appeared anatomically reasonable. A clinician relying on visual heatmap validation would have no signal that the explanation was misleading. Only the AOPC faithfulness test reveals the failure.

False Confidence
A model with AOPC ≤ 0 actively deceives clinicians — the heatmap provides a false sense of transparency that is more dangerous than acknowledged uncertainty.
Visual Inspection Is Insufficient
The dominant validation practice — presenting GradCAM heatmaps and asking if they look clinically reasonable — can systematically endorse causally hollow models.
AOPC as Safety Requirement
Faithfulness evaluation via AOPC should be treated as a clinical safety requirement, not an optional post-hoc analysis. AOPC ≤ 0 should block deployment.
Accuracy–Faithfulness Tradeoff
VGG16 (83%) vs ViT-B/16 (82%) — a 1-point accuracy gap. But VGG16's AOPC = −0.012 vs ViT-B/16's +0.199. In clinical deployment, faithfulness wins.
08 — Proposed Framework

Three-Layer Validation

We propose that heatmap-based explanation evaluation must operate across three ascending layers of rigor. Each layer catches failures the previous cannot detect.

I
Visual Inspection
Fast and interpretable. Detects gross anatomical failures — heatmaps on backgrounds, uninformative uniform activations. But it cannot distinguish visually plausible from causally faithful. In some cases it actively misleads.
+ Fast · interpretable · detects gross failures
− Cannot detect subtle unfaithfulness · actively misleading in some cases
NECESSARY · NOT SUFFICIENT
II
Inter-Method Agreement (ρ)
Computationally cheap — requires only two CAM passes per image. A negative correlation score is an immediate deployment red flag: if GradCAM++ and EigenCAM disagree, neither method reliably captures the model's reasoning.
+ Cheap · requires only 2 CAM passes · negative ρ = red flag
− Does not prove causal faithfulness — only method consistency
ρ < 0 → REVIEW REQUIRED
III
Faithfulness Testing — AOPC
Progressive pixel deletion is the only causal test available without ground-truth saliency masks. Quantifiable, standardizable, and reveals hollow explanations definitively. The computational cost (10 forward passes per image) is a clinical safety overhead worth paying.
+ Only causal test · quantifiable · definitively reveals hollow explanations
− Requires ~10 forward passes per image · computationally heavier
AOPC ≤ 0 → DEPLOYMENT BLOCKED
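Under the thresholds stated above, the three layers reduce to a small gating policy. The function name and return labels are illustrative; the decision rules are the framework's own.

```python
def deployment_gate(visual_ok, inter_method_rho, aopc):
    """Sketch of the proposed three-layer validation policy.

    Layer I:   gross anatomical failure on visual inspection -> blocked.
    Layer III: AOPC <= 0 (causally hollow explanation)       -> blocked.
    Layer II:  negative inter-method correlation              -> review."""
    if not visual_ok:
        return "BLOCKED"
    if aopc <= 0:
        return "BLOCKED"
    if inter_method_rho < 0:
        return "REVIEW"
    return "PASS"

# Applying the study's own numbers:
vgg_verdict = deployment_gate(True, -0.309, -0.012)   # blocked by Layer III
vit_verdict = deployment_gate(True, +0.301, +0.199)   # passes all layers
```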
09 — Summary

Key Findings at a Glance

83%
VGG16 Accuracy
82%
ViT-B/16 Accuracy
−0.012
VGG16 AOPC
+0.199
ViT-B/16 AOPC
9pt
TL Accuracy Gain
6
Eval. Dimensions
Transfer Learning Wins
9-point accuracy gap and 2–3× faster convergence. Both pretrained models stop in 6–7 epochs; the custom CNN needs all 15 and still trails significantly.
Architecture–Task Alignment
VGG16 excels at focal, spatially-bounded patterns (bacterial pneumonia). ViT-B/16 excels at global patterns (COVID-19 recall = 0.99). Task structure determines which architecture fits.
The Explainability Paradox
VGG16's AOPC = −0.012. Confidence rises as important pixels are removed. The explanation is causally hollow despite being visually convincing.
ViT as the Faithful Explainer
ViT-B/16 wins all three trustworthiness dimensions: robustness (0.809), inter-method agreement (+0.301), and causal faithfulness (AOPC +0.199).
// Citation
@misc{owais2026beyondvisual,
  author    = {Owais and {Shahid Ul Islam}},
  title     = {Beyond Visual Plausibility: A Faithfulness-Aware Comparison of
               CNNs and Vision Transformers for Multi-Class Chest X-Ray Classification},
  year      = {2026},
  note      = {Manuscript under peer review},
  institution = {Islamic University of Science and Technology, Awantipora}
}