
Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection

Beery, Sara and Wu, Guanhang and Rathod, Vivek and Votel, Ronny and Huang, Jonathan (2020) Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Piscataway, NJ, pp. 13072-13082. ISBN 9781728171685. https://resolver.caltech.edu/CaltechAUTHORS:20200806-153947935

Full text is not posted in this repository. Consult Related URLs below.

Use this Persistent URL to link to this item: https://resolver.caltech.edu/CaltechAUTHORS:20200806-153947935

Abstract

In static monitoring cameras, useful contextual information can stretch far beyond the few seconds typical video understanding models might see: subjects may exhibit similar behavior over multiple days, and background objects remain static. Due to power and storage constraints, sampling frequencies are low, often no faster than one frame per second, and are sometimes irregular due to the use of a motion trigger. In order to perform well in this setting, models must be robust to irregular sampling rates. In this paper we propose a method that leverages temporal context from the unlabeled frames of a novel camera to improve performance at that camera. Specifically, we propose an attention-based approach that allows our model, Context R-CNN, to index into a long term memory bank constructed on a per-camera basis and aggregate contextual features from other frames to boost object detection performance on the current frame. We apply Context R-CNN to two settings: (1) species detection using camera traps, and (2) vehicle detection in traffic cameras, showing in both settings that Context R-CNN leads to performance gains over strong baselines. Moreover, we show that increasing the contextual time horizon leads to improved results. When applied to camera trap data from the Snapshot Serengeti dataset, Context R-CNN with context from up to a month of images outperforms a single-frame baseline by 17.9% mAP, and outperforms S3D (a 3D-convolution-based baseline) by 11.2% mAP.
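
The attention mechanism sketched in the abstract can be illustrated in a few lines: single-head scaled dot-product attention from the current frame's box features into a per-camera long-term memory bank, with the aggregated context folded back into the features before they reach the detection head. This is a minimal illustrative sketch, not the authors' released implementation; the function names, feature dimensions, random projection weights, and single-head formulation are assumptions made for clarity.

    # Minimal sketch (not the authors' code): attention-based aggregation of
    # per-camera memory features, as described in the abstract.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def aggregate_context(box_features, memory_bank, d_attn=128, rng=None):
        """Attend from current-frame box features (N, d) into a long-term
        per-camera memory bank (M, d); return context-enhanced features (N, d)."""
        rng = np.random.default_rng(0) if rng is None else rng
        d = box_features.shape[1]
        # Learned projections in the real model; random here for illustration.
        w_q = rng.standard_normal((d, d_attn)) / np.sqrt(d)
        w_k = rng.standard_normal((d, d_attn)) / np.sqrt(d)
        w_v = rng.standard_normal((d, d)) / np.sqrt(d)

        q = box_features @ w_q                                   # (N, d_attn)
        k = memory_bank @ w_k                                    # (M, d_attn)
        v = memory_bank @ w_v                                    # (M, d)

        weights = softmax(q @ k.T / np.sqrt(d_attn), axis=-1)    # (N, M)
        context = weights @ v                                    # (N, d)
        # The context-enhanced features would then be passed to the detection head.
        return box_features + context

    # Example: 5 proposals on the current frame, a month-long memory of 2000 entries.
    current = np.random.default_rng(1).standard_normal((5, 256))
    memory = np.random.default_rng(2).standard_normal((2000, 256))
    print(aggregate_context(current, memory).shape)  # (5, 256)

Because each memory entry is just a feature vector plus its attention weight, the memory bank can cover an arbitrarily long and irregularly sampled time horizon without the fixed temporal stride that 3D-convolution baselines such as S3D assume.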


Item Type: Book Section
Related URLs:
  https://doi.org/10.1109/cvpr42600.2020.01309 (URL Type: DOI; Description: Article)
ORCID:
  Beery, Sara: 0000-0002-2544-1844
Additional Information: © 2020 IEEE. We would like to thank Pietro Perona, David Ross, Zhichao Lu, Ting Yu, Tanya Birch and the Wildlife Insights Team, Joe Marino, and Oisin MacAodha for their valuable insight. This work was supported by NSF GRFP Grant No. 1745301; the views are those of the authors and do not necessarily reflect the views of the NSF.
Funders:
  NSF Graduate Research Fellowship: DGE-1745301
DOI: 10.1109/cvpr42600.2020.01309
Record Number: CaltechAUTHORS:20200806-153947935
Persistent URL: https://resolver.caltech.edu/CaltechAUTHORS:20200806-153947935
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 104786
Collection: CaltechAUTHORS
Deposited By: George Porter
Deposited On: 10 Aug 2020 16:49
Last Modified: 16 Nov 2021 18:35
