CaltechAUTHORS: A Caltech Library Service

hepaccelerate: Fast Analysis of Columnar Collider Data

Pata, J. and Spiropulu, M. (2019) hepaccelerate: Fast Analysis of Columnar Collider Data. (Unpublished)

PDF - Submitted Version (see Usage Policy)


At HEP experiments, processing terabytes of structured numerical event data down to a few statistical summaries is a common task. This step involves selecting events and objects within the event, reconstructing high-level variables, evaluating multivariate classifiers with up to hundreds of variations, and creating thousands of low-dimensional histograms. Currently, this is done using multi-step workflows and batch jobs. Based on the CMS search for H(μμ), we demonstrate that it is possible to carry out significant parts of a real collider analysis at a rate of up to a million events per second on a single multicore server with optional GPU acceleration. This is achieved by representing HEP event data as memory-mappable sparse arrays, and by expressing common analysis operations as kernels that can be parallelized across the data using multithreading. We find that only a small number of relatively simple kernels are needed to implement significant parts of this Higgs analysis. Therefore, the analysis of real collider datasets of billions of events could be completed within minutes to a few hours using simple multithreaded code, reducing the need to manage distributed workflows in the exploratory phase. This approach could speed up the cycle for delivering physics results at HEP experiments. We release the hepaccelerate prototype library as a demonstrator of such accelerated computational kernels. We look forward to discussion, further development and use of efficient and easy-to-use software for terabyte-scale high-level data analysis in the physical sciences.
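The abstract's core idea of jagged event data stored as flat arrays plus per-event offsets, processed by simple per-event kernels, can be illustrated with a minimal sketch. This is not the hepaccelerate API; the function name and the toy muon-pT data are invented for illustration, and in practice such a kernel would be compiled and multithreaded, for example with Numba's `@njit(parallel=True)` and `prange`.

```python
import numpy as np

def sum_pt_per_event(offsets, pt):
    """Toy analysis kernel over a jagged array.

    The objects of event i live in the flat array pt at indices
    offsets[i]:offsets[i+1]; looping once per event makes the kernel
    trivially parallelizable across events.
    """
    n_events = len(offsets) - 1
    out = np.zeros(n_events)
    for iev in range(n_events):
        out[iev] = pt[offsets[iev]:offsets[iev + 1]].sum()
    return out

# Three events containing 2, 0 and 3 muons respectively.
offsets = np.array([0, 2, 2, 5])
pt = np.array([30.0, 25.0, 50.0, 40.0, 10.0])
print(sum_pt_per_event(offsets, pt))  # per-event summed pT: 55, 0, 100
```

Because the flat `content` and `offsets` arrays are plain contiguous buffers, they can also be memory-mapped from disk, which is what makes the representation suitable for terabyte-scale inputs.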

Item Type: Report or Paper (Discussion Paper)
Related URLs: Paper
ORCID: Spiropulu, M.: 0000-0001-8172-7081
Additional Information: We would like to thank Jim Pivarski and Lindsey Gray for helpful feedback at the start of this project. We are grateful to Nan Lu and Irene Dutta for providing a reference implementation of the H(μμ) analysis that could be adapted to vectorized code. We would like to thank Christina Reissel for being an independent early tester of these approaches and for helpful feedback on this report. The availability of the excellent Python libraries uproot, awkward, coffea, Numba, cupy and numpy was imperative for this project and we are grateful to the developers of those projects. Part of this work was conducted at "iBanks", the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro and the Kavli Foundation for their support of "iBanks".
Record Number: CaltechAUTHORS:20190923-091626883
Persistent URL:
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 98788
Deposited By: Tony Diaz
Deposited On: 23 Sep 2019 16:38
Last Modified: 03 Oct 2019 21:44
