The model that produces the most visually convincing heatmaps is the one whose explanations are causally hollow.
Most deep learning studies ask "how accurate is the model?" — this study asks the harder question: can we trust why the model made that decision?
| Conventional Study | This Study |
|---|---|
| Reports accuracy only | Reports accuracy plus faithfulness of explanations |
| One CAM method per model | Two CAM methods with inter-method agreement evaluation |
| Visual inspection of heatmaps | Quantitative pixel deletion — AOPC and AUC curves |
| Single metric evaluation | Six-dimensional explainability framework |
| No statistical correction | Bonferroni-corrected non-parametric testing (α = 0.05/6 ≈ 0.0083) |
6,432 posterior-anterior chest X-rays from the Kaggle Pneumonia & COVID-19 Image Dataset — a challenging four-class classification problem with meaningful class imbalance.
Does ImageNet pretraining confer a decisive advantage for medical image classification? Two pretrained models face a custom baseline trained from scratch on the same data.
| Model | Stop Epoch | Best Val Acc | Best Val Loss | Train Time |
|---|---|---|---|---|
| VGG16 | 6 | 84.48% | 0.4021 | ~3.3 min |
| ViT-B/16 | 7 | 82.38% | 0.4169 | ~5.4 min |
| Custom CNN | 15 | 74.37% | 0.8472 | ~11.1 min |
→ Transfer learning delivers an 8–10-point validation-accuracy advantage (VGG16 and ViT-B/16 vs. the scratch CNN) and 2–3× faster convergence over training from scratch.
Overall classification performance across all three architectures, measured on a held-out stratified 20% test set with class-weighted cross-entropy training.
→ Viral Pneumonia remains the hardest class across all models due to its radiographic overlap with bacterial pneumonia.
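The class-weighted cross-entropy used in training can be sketched with inverse-frequency weights. The per-class counts below are hypothetical (the summary gives only the 6,432 total, not the per-class breakdown); the weighting scheme itself is the standard `total / (n_classes × count)` form.

```python
import numpy as np

# Hypothetical per-class counts for the four-class problem (sum = 6,432);
# the real class breakdown is not stated in this summary.
counts = np.array([1200, 3000, 1400, 832])  # e.g. COVID, Normal, Bacterial, Viral

# Inverse-frequency weights: rarer classes contribute more to the loss,
# counteracting the class imbalance during training.
weights = counts.sum() / (len(counts) * counts)
print(weights)
```

With these weights the rarest class receives the largest loss multiplier, and the weighted class totals are rebalanced to equal shares.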
If a heatmap genuinely reflects the model's reasoning, removing the highlighted pixels should reduce confidence. We test exactly this — and discover opposite behavior in the two models.
n=32 stratified images · T=10 deletion steps · pixels replaced with channel-wise ImageNet mean
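The deletion test above can be sketched as follows. This is a minimal toy, not the study's pipeline: `toy_confidence` stands in for a trained classifier, and deleted pixels are filled with 0.0 to keep the demo deterministic (the study fills with the channel-wise ImageNet mean, ≈ [0.485, 0.456, 0.406] for RGB).

```python
import numpy as np

# Toy stand-in for a trained classifier's confidence in its predicted
# class; any callable image -> float slots in here.
def toy_confidence(img):
    return float(img[8:16, 8:16].mean())  # driven by one fixed patch

def aopc(img, heatmap, confidence_fn, steps=10, fill=0.0):
    """Area Over the Perturbation Curve: mean confidence drop as the
    most-attributed pixels are deleted in `steps` equal batches."""
    base = confidence_fn(img)
    order = np.argsort(heatmap, axis=None)[::-1]  # most salient pixels first
    per_step = order.size // steps
    perturbed, drops = img.copy(), []
    for t in range(steps):
        perturbed.flat[order[t * per_step:(t + 1) * per_step]] = fill
        drops.append(base - confidence_fn(perturbed))
    return float(np.mean(drops))

rng = np.random.default_rng(0)
img = rng.random((32, 32))
faithful = np.zeros((32, 32)); faithful[8:16, 8:16] = 1.0  # highlights the patch
unfaithful = 1.0 - faithful                                # highlights everything else
print(aopc(img, faithful, toy_confidence), aopc(img, unfaithful, toy_confidence))
```

A heatmap that truly tracks the model's evidence yields a high AOPC (confidence collapses early); a hollow one yields a low or near-zero AOPC, which is exactly the asymmetry the study reports between the two architectures.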
VGG16's explanation infidelity is not a single failure — it arises from three compounding structural properties of convolutional gradient geometry.
Heatmap quality measured across six independent dimensions with Bonferroni-corrected statistical testing — all six comparisons reach significance at α = 0.0083.
↑ = higher is better · VGG16 leads on surface metrics · ViT-B/16 leads on every trustworthiness dimension
| Metric | VGG16 | ViT-B/16 | Winner |
|---|---|---|---|
| Shannon Entropy | 5.159 ± 0.034 | 4.987 ± 0.092 | — |
| Activation Std Dev | 0.216 ± 0.018 | 0.250 ± 0.024 | ViT-B/16 |
| Sparsity | 0.466 ± 0.148 | 0.252 ± 0.116 | — |
| Top-k Mass Concentration | 16.350 ± 0.874 | 16.197 ± 0.861 | Tie |
| Perturbation Robustness ↑ | 0.542 ± 0.215 | 0.809 ± 0.217 | ViT-B/16 |
| Inter-Method Agreement ↑ | −0.309 ± 0.483 | +0.301 ± 0.406 | ViT-B/16 |
Statistical testing: Mann-Whitney U with Bonferroni correction. All six comparisons significant at p < 0.0083.
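Three of the six dimensions can be sketched directly from a heatmap. The formulas below are plausible readings of the metric names, not the paper's exact definitions: the sparsity threshold is an assumption, and Spearman rank correlation is one reasonable choice for scoring inter-method agreement between two CAM variants. The Bonferroni threshold follows from six family-wise tests at α = 0.05.

```python
import numpy as np

def shannon_entropy(hm):
    """Entropy of the heatmap treated as a probability distribution."""
    p = hm.ravel() / hm.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def sparsity(hm, thresh=0.1):
    """Fraction of pixels below `thresh` after min-max normalisation
    (the exact threshold used in the study is an assumption here)."""
    norm = (hm - hm.min()) / (hm.max() - hm.min() + 1e-12)
    return float((norm < thresh).mean())

def spearman(a, b):
    """Rank correlation between two heatmaps -- one way to quantify
    inter-method agreement between two CAM methods."""
    ra = np.argsort(np.argsort(a.ravel())).astype(float)
    rb = np.argsort(np.argsort(b.ravel())).astype(float)
    ra -= ra.mean(); rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra**2).sum() * (rb**2).sum()))

# Bonferroni correction: six metric comparisons at a family-wise 0.05
alpha_per_test = 0.05 / 6  # ≈ 0.0083, matching the study's threshold

rng = np.random.default_rng(1)
hm1 = rng.random((14, 14))
hm2 = hm1 + 0.1 * rng.random((14, 14))  # a strongly correlated second method
print(shannon_entropy(hm1), sparsity(hm1), spearman(hm1, hm2))
```

Per-metric scores like these, collected over the evaluation set for each model, are what feed the Mann-Whitney U comparisons in the table above.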
Explanation infidelity is not an academic curiosity. When a hollow explanation accompanies a wrong prediction, it actively prevents the clinician from detecting the error.
VGG16 predicted Normal with 90.8% confidence on an image whose ground truth was Viral Pneumonia. The GradCAM++ heatmap concentrated activation on the vertebral column and chest wall — non-pathological regions that appeared anatomically reasonable. A clinician relying on visual heatmap validation would have no signal that the explanation was misleading. Only the AOPC faithfulness test reveals the failure.
We propose that heatmap-based explanation evaluation must operate across three ascending layers of rigor. Each layer catches failures the previous cannot detect.
```bibtex
@misc{owais2026beyondvisual,
  author      = {Owais and {Shahid Ul Islam}},
  title       = {Beyond Visual Plausibility: A Faithfulness-Aware Comparison of
                 CNNs and Vision Transformers for Multi-Class Chest X-Ray
                 Classification},
  year        = {2026},
  note        = {Manuscript under peer review},
  institution = {Islamic University of Science and Technology, Awantipora}
}
```