Published October 2023 | v1
Conference Paper

Fully Attentional Networks with Self-emerging Token Labeling

Abstract

Recent studies indicate that Vision Transformers (ViTs) are robust against out-of-distribution scenarios. In particular, the Fully Attentional Network (FAN), a family of ViT backbones, has achieved state-of-the-art robustness. In this paper, we revisit the FAN models and improve their pretraining with a self-emerging token labeling (STL) framework. Our method consists of two training stages. Specifically, we first train a FAN token labeler (FAN-TL) to generate semantically meaningful patch token labels, then train a FAN student model using both the token labels and the original class label. With the proposed STL framework, our best model based on FAN-L-Hybrid (77.3M parameters) achieves 84.8% Top-1 accuracy and 42.1% mCE on ImageNet-1K and ImageNet-C, and sets a new state of the art for ImageNet-A (46.1%) and ImageNet-R (56.6%) without using extra data, outperforming the original FAN counterpart by significant margins. The proposed framework also demonstrates significantly enhanced performance on downstream tasks such as semantic segmentation, with up to 1.7% improvement in robustness over the counterpart model.
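To illustrate the second stage described above, the following is a minimal, hypothetical PyTorch sketch of the student's joint objective: a standard cross-entropy on the class token combined with a soft token-labeling term driven by a frozen FAN-TL. All names (stl_student_loss, fan_token_labeler, fan_student) and the balancing weight alpha are illustrative assumptions, not the authors' released code or exact loss formulation.

import torch
import torch.nn.functional as F

def stl_student_loss(student_cls_logits, student_patch_logits,
                     class_labels, teacher_patch_logits, alpha=0.5):
    """Joint STL-style objective (sketch): ground-truth class label on the
    class token plus soft per-patch labels from the frozen token labeler.
    `alpha` is an assumed balancing hyperparameter."""
    # Standard cross-entropy on the class token, shape (B, C).
    cls_loss = F.cross_entropy(student_cls_logits, class_labels)
    # Soft cross-entropy between the labeler's per-patch predictions and
    # the student's patch logits, shape (B, N, C).
    token_targets = teacher_patch_logits.softmax(dim=-1)
    token_loss = torch.sum(
        -token_targets * student_patch_logits.log_softmax(dim=-1),
        dim=-1,
    ).mean()
    return cls_loss + alpha * token_loss

# Stage-2 training step (stage 1 trains FAN-TL with class labels only):
# with torch.no_grad():
#     teacher_patch_logits = fan_token_labeler(images)  # frozen FAN-TL
# cls_logits, patch_logits = fan_student(images)
# loss = stl_student_loss(cls_logits, patch_logits, labels, teacher_patch_logits)

The key design point the abstract emphasizes is that the patch-level supervision is self-emerging: the token labels come from a FAN model trained without any dense annotations, so no extra data is required.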

Additional details

Created:
February 13, 2024
Modified:
February 13, 2024