The model that produces the most visually convincing heatmaps is the one whose explanations are causally hollow.
Most deep learning studies ask "how accurate is the model?" — this study asks the harder question: can we trust why the model made that decision?
| Conventional Study | This Study |
|---|---|
| Reports accuracy only | Reports accuracy plus faithfulness of explanations |
| One CAM method per model | Two CAM methods with inter-method agreement evaluation |
| Visual inspection of heatmaps | Quantitative pixel deletion — AOPC and AUC curves |
| Single metric evaluation | Six-dimensional explainability framework |
| No statistical correction | Bonferroni-corrected non-parametric testing (α = 0.05/6 ≈ 0.0083) |
6,432 posterior-anterior chest X-rays from the Kaggle Pneumonia & COVID-19 Image Dataset — a challenging four-class classification problem with meaningful class imbalance.
Does ImageNet pretraining confer a decisive advantage for medical image classification? Two pretrained models face a custom baseline trained from scratch on the same data.
| Model | Stop Epoch | Best Val Acc | Best Val Loss | Train Time |
|---|---|---|---|---|
| VGG16 | 6 | 84.48% | 0.4021 | ~3.3 min |
| ViT-B/16 | 7 | 82.38% | 0.4169 | ~5.4 min |
| Custom CNN | 15 | 74.37% | 0.8472 | ~11.1 min |
→ Transfer learning delivers an 8–10-point validation-accuracy advantage (VGG16 and ViT-B/16 vs. the scratch CNN) and 2–3× faster convergence over training from scratch.
Overall classification performance across all three architectures, measured on a held-out stratified 20% test set with class-weighted cross-entropy training.
→ Viral Pneumonia remains the hardest class across all models due to its radiographic overlap with bacterial pneumonia.
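The class-weighted cross-entropy used in training can be sketched with inverse-frequency weights. The per-class counts below are hypothetical (the summary gives only the 6,432 total, not the per-class breakdown); the weighting scheme itself is the standard `total / (n_classes × count)` form.

```python
import numpy as np

# Hypothetical per-class counts for the four-class problem (sum = 6,432);
# the real class breakdown is not stated in this summary.
counts = np.array([1200, 3000, 1400, 832])  # e.g. COVID, Normal, Bacterial, Viral

# Inverse-frequency weights: rarer classes contribute more to the loss,
# counteracting the class imbalance during training.
weights = counts.sum() / (len(counts) * counts)
print(weights)
```

With these weights the rarest class receives the largest loss multiplier, and the weighted class totals are rebalanced to equal shares.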
If a heatmap genuinely reflects the model's reasoning, removing the highlighted pixels should reduce confidence. We test exactly this — and discover opposite behavior in the two models.
n=32 stratified images · T=10 deletion steps · pixels replaced with channel-wise ImageNet mean
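The deletion test above can be sketched as follows. This is a minimal toy, not the study's pipeline: `toy_confidence` stands in for a trained classifier, and deleted pixels are filled with 0.0 to keep the demo deterministic (the study fills with the channel-wise ImageNet mean, ≈ [0.485, 0.456, 0.406] for RGB).

```python
import numpy as np

# Toy stand-in for a trained classifier's confidence in its predicted
# class; any callable image -> float slots in here.
def toy_confidence(img):
    return float(img[8:16, 8:16].mean())  # driven by one fixed patch

def aopc(img, heatmap, confidence_fn, steps=10, fill=0.0):
    """Area Over the Perturbation Curve: mean confidence drop as the
    most-attributed pixels are deleted in `steps` equal batches."""
    base = confidence_fn(img)
    order = np.argsort(heatmap, axis=None)[::-1]  # most salient pixels first
    per_step = order.size // steps
    perturbed, drops = img.copy(), []
    for t in range(steps):
        perturbed.flat[order[t * per_step:(t + 1) * per_step]] = fill
        drops.append(base - confidence_fn(perturbed))
    return float(np.mean(drops))

rng = np.random.default_rng(0)
img = rng.random((32, 32))
faithful = np.zeros((32, 32)); faithful[8:16, 8:16] = 1.0  # highlights the patch
unfaithful = 1.0 - faithful                                # highlights everything else
print(aopc(img, faithful, toy_confidence), aopc(img, unfaithful, toy_confidence))
```

A heatmap that truly tracks the model's evidence yields a high AOPC (confidence collapses early); a hollow one yields a low or near-zero AOPC, which is exactly the asymmetry the study reports between the two architectures.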
VGG16's explanation infidelity is not a single failure — it arises from three compounding structural properties of convolutional gradient geometry.
Heatmap quality measured across six independent dimensions with Bonferroni-corrected statistical testing — all six comparisons reach significance at α = 0.0083.
↑ = higher is better · VGG16 leads on surface metrics · ViT-B/16 leads on every trustworthiness dimension
| Metric | VGG16 | ViT-B/16 | Winner |
|---|---|---|---|
| Shannon Entropy | 5.159 ± 0.034 | 4.987 ± 0.092 | — |
| Activation Std Dev | 0.216 ± 0.018 | 0.250 ± 0.024 | ViT-B/16 |
| Sparsity | 0.466 ± 0.148 | 0.252 ± 0.116 | — |
| Top-k Mass Concentration | 16.350 ± 0.874 | 16.197 ± 0.861 | Tie |
| Perturbation Robustness ↑ | 0.542 ± 0.215 | 0.809 ± 0.217 | ViT-B/16 |
| Inter-Method Agreement ↑ | −0.309 ± 0.483 | +0.301 ± 0.406 | ViT-B/16 |
Statistical testing: Mann-Whitney U with Bonferroni correction. All six comparisons significant at p < 0.0083.
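Three of the six dimensions can be sketched directly from a heatmap. The formulas below are plausible readings of the metric names, not the paper's exact definitions: the sparsity threshold is an assumption, and Spearman rank correlation is one reasonable choice for scoring inter-method agreement between two CAM variants. The Bonferroni threshold follows from six family-wise tests at α = 0.05.

```python
import numpy as np

def shannon_entropy(hm):
    """Entropy of the heatmap treated as a probability distribution."""
    p = hm.ravel() / hm.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def sparsity(hm, thresh=0.1):
    """Fraction of pixels below `thresh` after min-max normalisation
    (the exact threshold used in the study is an assumption here)."""
    norm = (hm - hm.min()) / (hm.max() - hm.min() + 1e-12)
    return float((norm < thresh).mean())

def spearman(a, b):
    """Rank correlation between two heatmaps -- one way to quantify
    inter-method agreement between two CAM methods."""
    ra = np.argsort(np.argsort(a.ravel())).astype(float)
    rb = np.argsort(np.argsort(b.ravel())).astype(float)
    ra -= ra.mean(); rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra**2).sum() * (rb**2).sum()))

# Bonferroni correction: six metric comparisons at a family-wise 0.05
alpha_per_test = 0.05 / 6  # ≈ 0.0083, matching the study's threshold

rng = np.random.default_rng(1)
hm1 = rng.random((14, 14))
hm2 = hm1 + 0.1 * rng.random((14, 14))  # a strongly correlated second method
print(shannon_entropy(hm1), sparsity(hm1), spearman(hm1, hm2))
```

Per-metric scores like these, collected over the evaluation set for each model, are what feed the Mann-Whitney U comparisons in the table above.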
Explanation infidelity is not an academic curiosity. When a hollow explanation accompanies a wrong prediction, it actively prevents the clinician from detecting the error.
VGG16 predicted Normal with 90.8% confidence on an image whose ground truth was Viral Pneumonia. The GradCAM++ heatmap concentrated activation on the vertebral column and chest wall — non-pathological regions that appeared anatomically reasonable. A clinician relying on visual heatmap validation would have no signal that the explanation was misleading. Only the AOPC faithfulness test reveals the failure.
We propose that heatmap-based explanation evaluation must operate across three ascending layers of rigor. Each layer catches failures the previous cannot detect.
```bibtex
@misc{owais2026beyondvisual,
  author      = {Owais and {Shahid Ul Islam}},
  title       = {Beyond Visual Plausibility: A Faithfulness-Aware Comparison of
                 CNNs and Vision Transformers for Multi-Class Chest X-Ray
                 Classification},
  year        = {2026},
  note        = {Manuscript under peer review},
  institution = {Islamic University of Science and Technology, Awantipora}
}
```