
Vision Transformers Are Good Mask Auto-Labelers

Lan, Shiyi and Yang, Xitong and Yu, Zhiding and Wu, Zuxuan and Alvarez, Jose M. and Anandkumar, Anima (2023) Vision Transformers Are Good Mask Auto-Labelers. (Unpublished)

PDF - Submitted Version
Creative Commons Attribution.


We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations. MAL takes box-cropped images as inputs and conditionally generates their mask pseudo-labels. We show that Vision Transformers are good mask auto-labelers. Our method significantly reduces the gap between auto-labeling and human annotation in mask quality. Instance segmentation models trained on MAL-generated masks nearly match their fully supervised counterparts, retaining up to 97.4% of the fully supervised performance. The best model achieves 44.1% mAP on COCO instance segmentation (test-dev 2017), outperforming state-of-the-art box-supervised methods by significant margins. Qualitative results indicate that masks produced by MAL are, in some cases, even better than human annotations.

Item Type: Report or Paper (Discussion Paper)
Related URLs:
URL, URL Type, Description
Paper
ORCID:
Yang, Xitong: 0000-0003-4372-241X
Yu, Zhiding: 0000-0003-1776-996X
Anandkumar, Anima: 0000-0002-6974-6797
Additional Information: Attribution 4.0 International (CC BY 4.0).
Record Number: CaltechAUTHORS:20230316-153757695
Persistent URL:
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 120089
Deposited By: George Porter
Deposited On: 16 Mar 2023 17:57
Last Modified: 16 Mar 2023 17:57
