This work studies a challenging and practical problem, multi-class unsupervised anomaly detection (MUAD), which requires only normal images for training while testing both normal and anomalous images across multiple classes. Existing reconstruction-based methods typically adopt pyramidal networks as encoders and decoders to obtain multi-resolution features, together with elaborate sub-modules that demand heavy handcrafted engineering. In contrast, a plain Vision Transformer (ViT) with a more straightforward architecture has proven effective in multiple domains, including detection and segmentation, while remaining simpler and more elegant. Following this spirit, we explore plain ViT features for MUAD. We first abstract a Meta-AD concept by induction from current reconstruction-based methods, and then instantiate a novel ViT-based ViTAD structure, designed step by step from global and local perspectives. In addition, this paper reveals several interesting findings for further exploration. Finally, we benchmark various approaches comprehensively and fairly on eight metrics. With a naive training recipe using only an MSE loss, ViTAD achieves state-of-the-art results and efficiency on the MVTec-AD, VisA, and Uni-Medical datasets, obtaining 85.4 mAD that surpasses UniAD by +3.0, while requiring only 1.1 hours and 2.3 GB of GPU memory to train on a single V100 on MVTec-AD.
Existing reconstruction-based methods typically adopt pyramid networks as encoders/decoders to obtain multi-resolution features, together with elaborate sub-modules that require heavy handcrafted engineering for more precise localization.
1. We first abstract a Meta-AD concept by induction from current reconstruction-based methods.
2. Then, we instantiate a novel ViT-based ViTAD structure, designed step by step from global and local perspectives.
3. Based on a naive training recipe (only an MSE loss), ViTAD achieves SoTA results and efficiency on the MVTec-AD, VisA, and Uni-Medical datasets.
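To make the reconstruction-based idea concrete, below is a minimal NumPy sketch of how a per-pixel anomaly map can be derived from a plain ViT: a frozen encoder extracts patch features, a decoder reconstructs them, and the per-patch discrepancy (trained with MSE; scored here with cosine distance, a common choice in feature-reconstruction AD) is upsampled to image resolution. The function name, patch layout, and scoring choice are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def anomaly_map(enc_feats, dec_feats, img_size=224, patch=16):
    """Per-patch discrepancy between frozen-encoder features and
    decoder-reconstructed features, upsampled to image resolution.

    enc_feats, dec_feats: (N_patches, C) arrays of ViT patch tokens
    (e.g. 196 x 768 for a ViT-B/16 at 224x224 input).
    NOTE: illustrative sketch; the real model reconstructs features with
    a ViT decoder trained via MSE on normal images only.
    """
    side = img_size // patch  # patches per side, e.g. 224 // 16 = 14
    # cosine distance per patch; low for well-reconstructed (normal) regions
    a = enc_feats / (np.linalg.norm(enc_feats, axis=1, keepdims=True) + 1e-8)
    b = dec_feats / (np.linalg.norm(dec_feats, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - (a * b).sum(axis=1)          # (N_patches,)
    amap = dist.reshape(side, side)
    # nearest-neighbour upsample each patch score to its pixel block
    return np.kron(amap, np.ones((patch, patch)))  # (img_size, img_size)

# toy usage: identical features -> (near-)zero anomaly everywhere;
# the image-level score is typically the max over the map
f = np.random.rand(196, 768)
amap = anomaly_map(f, f.copy())
```

In practice the encoder stays frozen while only the decoder (and any fusion module) is trained, so unseen anomalous regions reconstruct poorly and light up in the map.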
Left: Illustration of abstracted Meta-AD. Right: Details of instantiated plain ViT-based ViTAD.
@article{vitad,
title={Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection},
author={Jiangning Zhang and Xuhai Chen and Yabiao Wang and Chengjie Wang and Yong Liu and Xiangtai Li and Ming-Hsuan Yang and Dacheng Tao},
journal={arXiv preprint arXiv:2312.07495},
year={2023}
}