
Long-Short Transformer: Efficient Transformers for Language and Vision

Zhu, Chen and Ping, Wei and Xiao, Chaowei and Shoeybi, Mohammad and Goldstein, Tom and Anandkumar, Anima and Catanzaro, Bryan (2021) Long-Short Transformer: Efficient Transformers for Language and Vision. In: 35th Conference on Neural Information Processing Systems (NeurIPS 2021). Neural Information Processing Systems Foundation, La Jolla, CA, pp. 1-14. ISBN 9781713845393.

Full text is not posted in this repository. Consult Related URLs below.


Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic time and memory complexity with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters of the previous method, while being faster and able to handle sequences 3x as long as its full-attention version on the same hardware. On ImageNet, it obtains state-of-the-art results (e.g., a moderate-sized model with 55.8M parameters, trained solely on 224 x 224 ImageNet-1K, reaches 84.1% Top-1 accuracy), while being more scalable on high-resolution images. The source code and models are released at
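The combination described in the abstract (a projected long-range branch, a local window branch, and separate normalization of the two before they are merged) can be illustrated with a small NumPy sketch. This is an illustrative reading of the abstract, not the released implementation: the function name `long_short_attention`, the projection weights `w_proj`, the softmax over sequence positions, the parameter-free `layer_norm`, and the symmetric window are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Parameter-free layer normalization over the feature axis (assumption).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def long_short_attention(q, k, v, w_proj, window):
    """Single-head, bidirectional sketch. q, k, v: (n, d); w_proj: (d, r)."""
    n, d = q.shape
    # Long-range branch: an input-dependent ("dynamic") projection that
    # compresses the n keys/values into r summary rows.
    p = softmax(k @ w_proj, axis=0)          # (n, r), normalized over positions
    # Dual normalization (assumed form): normalize each branch separately
    # so that their scales match before concatenation.
    k_long, v_long = layer_norm(p.T @ k), layer_norm(p.T @ v)   # (r, d) each
    k_loc, v_loc = layer_norm(k), layer_norm(v)                 # (n, d) each
    out = np.empty_like(q)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Each query sees at most 2 * window + 1 local keys plus r projected
        # keys, so the total cost grows linearly with n.
        keys = np.concatenate([k_loc[lo:hi], k_long], axis=0)
        vals = np.concatenate([v_loc[lo:hi], v_long], axis=0)
        weights = softmax(keys @ q[i] / np.sqrt(d), axis=0)
        out[i] = weights @ vals
    return out

rng = np.random.default_rng(0)
n, d, r, window = 64, 16, 8, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
w_proj = rng.standard_normal((d, r))
out = long_short_attention(q, k, v, w_proj, window)
print(out.shape)  # (64, 16)
```

With `window` and `r` held fixed, doubling `n` roughly doubles the work, whereas full attention would quadruple it. An autoregressive variant would additionally need causal masking in both branches, which this sketch omits.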

Item Type:Book Section
Related URLs:
URL Type:Item
Description:Discussion Paper
ORCID:
Zhu, Chen: 0000-0002-3103-8752
Xiao, Chaowei: 0000-0002-7043-4926
Anandkumar, Anima: 0000-0002-6974-6797
Catanzaro, Bryan: 0000-0003-0034-7728
Record Number:CaltechAUTHORS:20221222-230911029
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:118600
Deposited By: George Porter
Deposited On:23 Dec 2022 16:20
Last Modified:23 Dec 2022 16:20
