RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning

Ma, Xiaojian and Nie, Weili and Yu, Zhiding and Jiang, Huaizu and Xiao, Chaowei and Zhu, Yuke and Zhu, Song-Chun and Anandkumar, Anima (2022) RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning. (Unpublished) https://resolver.caltech.edu/CaltechAUTHORS:20220714-212522171

PDF - Submitted Version (5MB)
Creative Commons Attribution.

Abstract

Reasoning about visual relationships is central to how humans interpret the visual world. This task remains challenging for current deep learning algorithms because it requires jointly addressing three key technical problems: 1) identifying object entities and their properties, 2) inferring semantic relations between pairs of entities, and 3) generalizing to novel object-relation combinations, i.e., systematic generalization. In this work, we use vision transformers (ViTs) as our base model for visual reasoning and make better use of concepts, defined as object entities and their relations, to improve the reasoning ability of ViTs. Specifically, we introduce a novel concept-feature dictionary that allows flexible image feature retrieval at training time using concept keys. This dictionary enables two new concept-guided auxiliary tasks: 1) a global task for promoting relational reasoning, and 2) a local task for facilitating semantic object-centric correspondence learning. To examine the systematic generalization of visual reasoning models, we introduce systematic splits for the standard HICO and GQA benchmarks. We show that the resulting model, Concept-guided Vision Transformer (RelViT for short), significantly outperforms prior approaches on HICO and GQA by 16% and 13% in the original split, and by 43% and 18% in the systematic split. Our ablation analyses also show that the model is compatible with multiple ViT variants and robust to hyper-parameters.
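
As a rough illustration of the concept-feature dictionary idea described in the abstract, the sketch below keeps a bounded queue of image features per concept key and retrieves one of them on request. The class name `ConceptFeatureDictionary`, the `queue_size` parameter, the first-in-first-out eviction, and the uniform random sampling are all assumptions made for illustration; the abstract does not specify these details, and this is not the released RelViT code.

```python
import random
from collections import defaultdict, deque


class ConceptFeatureDictionary:
    """Toy concept-keyed feature store (hypothetical; not the RelViT code)."""

    def __init__(self, queue_size=4, seed=0):
        # One bounded FIFO queue per concept key (assumed design choice):
        # once a queue is full, appending evicts its oldest feature.
        self.queues = defaultdict(lambda: deque(maxlen=queue_size))
        self.rng = random.Random(seed)

    def add(self, concept, feature):
        """Store an image feature under a concept key, e.g. "ride bicycle"."""
        self.queues[concept].append(feature)

    def retrieve(self, concept):
        """Return one stored feature for `concept`, or None if none exists."""
        queue = self.queues.get(concept)  # .get() does not create a new key
        if not queue:
            return None
        return self.rng.choice(list(queue))


# Usage with made-up two-dimensional features.
store = ConceptFeatureDictionary(queue_size=2)
store.add("ride bicycle", [0.1, 0.9])
store.add("ride bicycle", [0.2, 0.8])
store.add("ride bicycle", [0.3, 0.7])  # evicts [0.1, 0.9], the oldest entry
sampled = store.retrieve("ride bicycle")
missing = store.retrieve("hold cup")
```

In a training loop, a feature retrieved this way could serve as a target for an auxiliary loss on an image that shares the same concept; how the two auxiliary tasks actually use such features is described in the paper itself, not in this sketch.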


Item Type: Report or Paper (Discussion Paper)
Related URLs:
  https://doi.org/10.48550/arXiv.2204.11167 (URL Type: arXiv; Description: Discussion Paper)
  https://github.com/NVlabs/RelViT (URL Type: Related Item; Description: Code)
ORCID:
  Zhu, Yuke: 0000-0002-9198-2227
  Anandkumar, Anima: 0000-0002-6974-6797
Additional Information: Attribution 4.0 International (CC BY 4.0)
Record Number: CaltechAUTHORS:20220714-212522171
Persistent URL: https://resolver.caltech.edu/CaltechAUTHORS:20220714-212522171
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 115586
Collection: CaltechAUTHORS
Deposited By: George Porter
Deposited On: 15 Jul 2022 22:40
Last Modified: 15 Jul 2022 22:40
