
Tensor Contractions with Extended BLAS Kernels on CPU and GPU

Shi, Yang and Niranjan, U. N. and Anandkumar, Animashree and Cecka, Cris (2016) Tensor Contractions with Extended BLAS Kernels on CPU and GPU. In: 2016 IEEE 23rd International Conference on High Performance Computing. IEEE , Piscataway, NJ, pp. 193-202. ISBN 978-1-5090-5411-4.

PDF - Submitted Version. See Usage Policy.


Tensor contractions constitute a key computational ingredient of numerical multi-linear algebra. However, as the order and dimension of tensors grow, the time and space complexities of tensor-based computations grow quickly. In this paper, we propose and evaluate new BLAS-like primitives that are capable of performing a wide range of tensor contractions on CPU and GPU efficiently. We begin by focusing on single-index contractions involving all the possible configurations of second-order and third-order tensors. Then, we discuss extensions to more general cases. Existing approaches for tensor contractions spend large amounts of time restructuring the data, which typically involves explicit copy and transpose operations. In this work, we summarize existing approaches and present library-based approaches that avoid memory movement. Through systematic benchmarking, we demonstrate that our approach can achieve 10x speedup on a K40c GPU and 2x speedup on dual-socket Haswell-EP CPUs, over cuBLAS and MKL respectively, for small and moderate tensor sizes. This is relevant in many machine learning applications such as deep learning, where tensor sizes tend to be small, but numerous tensor contraction operations must be performed successively. Concretely, we implement a Tucker decomposition and show that using our kernels yields at least an order of magnitude speedup compared to state-of-the-art libraries.
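As an illustration of the idea in the abstract (not code from the paper itself), a single-index contraction of a third-order tensor with a matrix leaves one mode untouched; fixing that mode reduces the contraction to a batch of ordinary GEMMs over tensor slices, which is exactly the structure that batched BLAS kernels can exploit without copying or transposing the data. A minimal NumPy sketch, with hypothetical sizes:

```python
import numpy as np

# Hypothetical small sizes, in the regime the paper targets.
M, K, N, P = 8, 8, 8, 8
rng = np.random.default_rng(0)
A = rng.standard_normal((M, K, P))   # third-order tensor
B = rng.standard_normal((K, N))      # matrix

# Single-index contraction: C[m, n, p] = sum_k A[m, k, p] * B[k, n].
C = np.einsum('mkp,kn->mnp', A, B)

# Equivalent view: for each fixed p, the slice A[:, :, p] @ B is an
# ordinary GEMM, so the whole contraction is a batch of P GEMMs over
# strided slices of A -- no restructuring of A is required.
C_ref = np.stack([A[:, :, p] @ B for p in range(P)], axis=-1)
assert np.allclose(C, C_ref)
```

In a BLAS setting, this batch-of-GEMMs view maps onto a strided-batched GEMM call (each slice at a fixed stride in memory), which is the kind of extended primitive the paper benchmarks against copy-and-transpose approaches.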

Item Type: Book Section
Additional Information: © 2016 IEEE. The authors would like to thank Aparna Chandramowlishwaran for providing the computation resources and suggestions. Animashree Anandkumar is supported in part by Microsoft Faculty Fellowship, NSF Career Award CCF-1254106, ONR Award N00014-14-1-0665, ARO YIP Award W911NF-13-1-0084, and AFOSR YIP FA9550-15-1-0221. Yang Shi is supported by NSF Career Award CCF-1254106 and ONR Award N00014-15-1-2737; Niranjan is supported by NSF BigData Award IIS-1251267 and ONR Award N00014-15-1-2737.
Funding Agency | Grant Number
Microsoft Research | UNSPECIFIED
Office of Naval Research (ONR) | N00014-14-1-0665
Army Research Office (ARO) | W911NF-13-1-0084
Air Force Office of Scientific Research (AFOSR) | FA9550-15-1-0221
Office of Naval Research (ONR) | N00014-15-1-2737
Subject Keywords: Parallelism; BLAS; GPU; Tensor
Record Number: CaltechAUTHORS:20170920-112945692
Persistent URL:
Official Citation: Y. Shi, U. N. Niranjan, A. Anandkumar and C. Cecka, "Tensor Contractions with Extended BLAS Kernels on CPU and GPU," 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), Hyderabad, 2016, pp. 193-202. doi: 10.1109/HiPC.2016.031 URL:
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 81622
Deposited By: Tony Diaz
Deposited On: 20 Sep 2017 18:57
Last Modified: 03 Oct 2019 18:45
