The Mouse Action Recognition System (MARS) software pipeline for automated analysis of social behaviors in mice
Cristina Segalin¹, Jalani Williams¹, Tomomi Karigo², May Hui², Moriel Zelikowsky²†, Jennifer J Sun¹, Pietro Perona¹, David J Anderson²,³, Ann Kennedy²*

¹Department of Computing & Mathematical Sciences, California Institute of Technology, Pasadena, United States; ²Division of Biology and Biological Engineering 156-29, TianQiao and Chrissy Chen Institute for Neuroscience, California Institute of Technology, Pasadena, United States; ³Howard Hughes Medical Institute, California Institute of Technology, Pasadena, United States
Abstract
The study of naturalistic social behavior requires quantification of animals' interactions. This is generally done through manual annotation—a highly time-consuming and tedious process. Recent advances in computer vision enable tracking the pose (posture) of freely behaving animals. However, automatically and accurately classifying complex social behaviors remains technically challenging. We introduce the Mouse Action Recognition System (MARS), an automated pipeline for pose estimation and behavior quantification in pairs of freely interacting mice. We compare MARS's annotations to human annotations and find that MARS's pose estimation and behavior classification achieve human-level performance. We also release the pose and annotation datasets used to train MARS to serve as community benchmarks and resources. Finally, we introduce the Behavior Ensemble and Neural Trajectory Observatory (BENTO), a graphical user interface for analysis of multimodal neuroscience datasets. Together, MARS and BENTO provide an end-to-end pipeline for behavior data extraction and analysis in a package that is user-friendly and easily modifiable.
Introduction
The brain evolved to guide survival-related behaviors, which frequently involve interaction with other animals. Gaining insight into brain systems that control these behaviors requires recording and manipulating neural activity while measuring behavior in freely moving animals. Recent technological advances, such as miniaturized imaging and electrophysiological devices, have enabled the recording of neural activity in freely behaving mice (Remedios et al., 2017; Li et al., 2017; Falkner et al., 2020)—however, to make sense of the recorded neural activity, it is also necessary to obtain a detailed characterization of the animals' actions during recording. This is usually accomplished via manual scoring of the animals' actions (Yang et al., 2011; Silverman et al., 2010; Winslow, 2003). A typical study of freely behaving animals can produce tens to hundreds of hours of video that require manual behavioral annotation (Zelikowsky et al., 2018; Shemesh et al., 2013; Branson et al., 2009). Scoring social behaviors often takes human annotators 3–4× the video's duration; for long recordings, there is also a risk of drops in annotation quality due to drifting annotator attention. It is unclear to what extent individual human annotators within and between different labs agree on the definitions of behaviors, especially the precise timing of behavior onset/offset. When behavior is being analyzed alongside neural recording data, it is also often unclear whether the set of social behaviors chosen for annotation is a good fit for explaining the activity of a neural population, or whether other, unannotated behaviors with clearer neural correlates may have been missed.
An accurate, sharable, automated approach to scoring social behavior is thus needed. Such a pipeline would enable social behavior measurements in large-scale experiments (e.g., genetic or drug screens) and comparison of datasets generated across the neuroscience community, by using a common set of definitions and classification methods for behaviors of interest. Automating behavior classification with machine learning methods offers a potential solution both to the time demand of annotation and to the risk of inter-individual and inter-lab differences in annotation style.

We present the Mouse Action Recognition System (MARS), a quartet of software tools for automated behavior analysis, training and evaluation of novel pose estimation and behavior classification models, and joint visualization of neural and behavioral data (Figure 1). This software is accompanied by three datasets aimed at characterizing inter-annotator variability for both pose and behavior annotation. Together, the software and datasets introduced in this paper provide a robust computational pipeline for the analysis of social behavior in pairs of interacting mice and establish essential measures of reliability and sources of variability in human annotations of animal pose and behavior.
Contributions
The contributions of this paper are as follows:
Data
MARS pose estimators are trained on a novel corpus of manual pose annotations in top- and front-view video (Figure 1—figure supplement 1) of pairs of mice engaged in a standard resident-intruder assay (Thurmond, 1975). These data include a variety of experimental manipulations of the resident animal, including mice that are unoperated, cannulated, or implanted with fiberoptic cables, fiber photometry cables, or a head-mounted microendoscope, with one or more cables leading from the animal's head to a commutator feeding out the top of the cage. All MARS training datasets can be found at https://neuroethology.github.io/MARS/ under 'datasets.'
Multi-annotator pose dataset
Anatomical landmarks ('keypoints' in the following) in this training set are manually annotated by five human annotators, whose labels are combined to create a 'consensus' keypoint location for each image. Nine anatomical keypoints are annotated on each mouse in the top view, and 13 in the front view (two keypoints, corresponding to the midpoint and end of the tail, are included in this dataset but were omitted in training MARS due to high annotator noise).
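The text here does not specify how the five annotators' labels are merged; as a simple illustration of the idea of a 'consensus' keypoint, the sketch below takes a per-keypoint median across annotators. The array shapes and the choice of the median are assumptions for illustration, not a description of MARS's actual rule.

```python
import numpy as np

def consensus_keypoints(labels: np.ndarray) -> np.ndarray:
    """Combine several annotators' keypoint labels into one consensus set.

    labels : (n_annotators, n_keypoints, 2) array of (x, y) pixel coordinates
             for a single image (e.g. 5 annotators x 9 top-view keypoints).
    Returns an (n_keypoints, 2) array of consensus locations.
    """
    # A per-keypoint median is robust to a single annotator's stray click;
    # this is an illustrative choice, not necessarily the rule used by MARS.
    return np.median(labels, axis=0)

# Hypothetical usage: five annotators labeling nine keypoints on one mouse.
labels = np.random.rand(5, 9, 2) * 1024
consensus = consensus_keypoints(labels)   # shape (9, 2)
```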
Behavior classifier training/testing dataset
MARS includes three supervised classifiers trained to detect attack, mounting, and close investigation behaviors in tracked animals. These classifiers were trained on 6.95 hr of behavior video, 4 hr of which were obtained from animals with a cable-attached device such as a microendoscope. Separate evaluation (3.85 hr) and test (3.37 hr) sets of videos were used to constrain training and evaluate MARS performance, giving a total of over 14 hr of video (Figure 1—figure supplement 2). All videos were manually annotated on a frame-by-frame basis by a single trained human annotator. Most videos in this dataset are a subset of the recent CalMS mouse social behavior dataset (Sun et al., 2021a) (specifically, from Task 1).
Multi-annotator behavior dataset
To evaluate inter-annotator variability in behavior classification, we also collected frame-by-frame manual labels of animal actions by eight trained human annotators on a dataset of ten 10-min videos. Two of these videos were annotated by all eight annotators a second time, a minimum of 10 months later, for evaluation of annotator self-consistency.
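To make 'inter-annotator variability' on frame-wise labels concrete, one simple way to compare two annotators is to score one against the other with frame-wise precision, recall, and F1 for a given behavior. This is only an illustrative metric sketch, not necessarily the analysis performed in this paper.

```python
import numpy as np

def framewise_agreement(annotator_a: np.ndarray, annotator_b: np.ndarray) -> dict:
    """Frame-wise agreement between two annotators' binary labels for one behavior.

    Treats annotator_b as the reference: precision is the fraction of frames
    labeled by A that B also labeled; recall is the fraction of B's frames
    that A also labeled.
    """
    a = annotator_a.astype(bool)
    b = annotator_b.astype(bool)
    true_pos = np.sum(a & b)
    precision = true_pos / max(np.sum(a), 1)
    recall = true_pos / max(np.sum(b), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical usage: two annotators' labels over a 10-min video at 30 Hz.
a = np.random.rand(18000) > 0.9
b = np.random.rand(18000) > 0.9
print(framewise_agreement(a, b))
```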
Software
This paper is accompanied by four software tools, all of which can be found on the MARS project website at https://neuroethology.github.io/MARS/.
[Figure 1 image. (A) Use strategies: basic (collect behavior videos and neural data, run MARS to get pose and behavior annotations, validate annotations and correlate with neural activity in BENTO), intermediate (annotate new behaviors in BENTO and train new behavior classifiers), and advanced (collect data in a novel arena, crowdsource pose annotation, train new detection, pose, or behavior models with MARS_Developer, and run the new models from within MARS). (B) Data extraction (raw video → detect mice → estimate poses → extract pose features → classify behaviors; imaging data (Ca) → filter and motion correct → extract Ca2+ traces) and data analysis (joint visualization, joint statistical models, neural decoding, event-triggered averaging), with elements attributed to MARS, BENTO, other existing tools, or future modules. (C) The four MARS stages: detect white and black mouse, crop and estimate pose of each mouse, extract features from pose data, classify actions.]
Figure 1. The Mouse Action Recognition System (MARS) data pipeline. (A) Sample use strategies of MARS, including either out-of-the-box application or fine-tuning to custom arenas or behaviors of interest. (B) Overview of data extraction and analysis steps in a typical neuroscience experiment, indicating contributions to this process by MARS and Behavior Ensemble and Neural Trajectory Observatory (BENTO). (C) Illustration of the four stages of data processing included in MARS.
The online version of this article includes the following figure supplement(s) for figure 1:
Figure supplement 1. Mouse Action Recognition System (MARS) camera positioning and sample frames.
Figure supplement 2. The Mouse Action Recognition System (MARS) annotation dataset.
Figure supplement 3. Mouse Action Recognition System (MARS) graphical user interface.
MARS
An open-source, Python-based tool for running trained detection, pose estimation, and behavior classification models on video data. MARS can be run on a desktop computer equipped with TensorFlow and a graphics processing unit (GPU), and supports both Python command-line and graphical user interface (GUI)-based usage (Figure 1—figure supplement 3). The MARS GUI allows users to select a directory containing videos, and produces as output a folder containing bounding boxes, pose estimates, features, and predicted behaviors for each video in the directory.
MARS_Developer
A Python suite for training MARS on new datasets and behaviors. It includes the following components: (1) a module for collecting crowdsourced pose annotation datasets, (2) a module for training a MultiBox detector, (3) a module for training a stacked hourglass network for pose estimation, and (4) a module for training new behavior classifiers. It is accompanied by a Jupyter notebook guiding users through the training process.
MARS_pycocotools
A fork of the popular COCO API for evaluation of object detection and pose estimation models (Lin et al., 2014), used within MARS_Developer. It extends the original COCO API with scripts for quantifying the performance of keypoint-based pose estimates, as well as with support for computing object keypoint similarity (OKS) scores in laboratory mice (see Materials and methods).
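For reference, the standard COCO definition of OKS scores each predicted keypoint by a Gaussian falloff of its distance from ground truth, scaled by the object's area and a per-keypoint constant. The sketch below is a minimal implementation of that standard formula; the falloff constants and the use of bounding-box area as the scale are assumptions for illustration (see Materials and methods for how these are set for mouse keypoints).

```python
import numpy as np

def oks(pred: np.ndarray, gt: np.ndarray, visible: np.ndarray,
        area: float, kappa: np.ndarray) -> float:
    """Object keypoint similarity, following the standard COCO definition.

    pred, gt : (n_keypoints, 2) predicted and ground-truth (x, y) locations
    visible  : (n_keypoints,) boolean mask of labeled keypoints
    area     : scale of the animal (here, bounding-box area in pixels^2)
    kappa    : (n_keypoints,) per-keypoint falloff constants
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)               # squared pixel distances
    similarity = np.exp(-d2 / (2.0 * area * kappa ** 2))
    return float(similarity[visible].mean())

# Hypothetical values: 7 top-view keypoints with a uniform falloff constant.
kappa = np.full(7, 0.05)
gt = np.random.rand(7, 2) * 500
pred = gt + np.random.randn(7, 2) * 5
score = oks(pred, gt, np.ones(7, dtype=bool), area=200.0 * 150.0, kappa=kappa)
```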
The Behavior Ensemble and Neural Trajectory Observatory (BENTO)
A MATLAB-based GUI for synchronous display of neural recording data, multiple videos, human/automated behavior annotations, spectrograms of recorded audio, pose estimates, and 270 'features' extracted from MARS pose data—such as animals' velocities, joint angles, and relative positions. It features an interface for fast frame-by-frame manual annotation of animal behavior, as well as a tool to create annotations programmatically by applying thresholds to combinations of the MARS pose features. BENTO also provides tools for exploratory neural data analysis, such as PCA and event-triggered averaging. While BENTO can be linked to MARS to annotate and train classifiers for behaviors of interest, BENTO may also be used independently, and with plug-in support can be used to display pose estimates from other systems such as DeepLabCut (Mathis et al., 2018).
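BENTO itself is MATLAB-based, so the snippet below is only a Python sketch of the idea behind its programmatic annotation tool: label frames on which a user-chosen combination of pose features crosses thresholds, then discard very short bouts. The feature names, threshold values, and bout-length filter are hypothetical.

```python
import numpy as np

def threshold_annotation(features: dict, min_bout: int = 5) -> np.ndarray:
    """Label frames where a hand-chosen combination of feature thresholds holds.

    features : dict mapping feature names to per-frame numpy arrays
               (e.g. features derived from MARS pose output).
    Returns a boolean array marking frames assigned to the candidate behavior.
    """
    # Hypothetical rule: flag a frame when the resident's nose is close to the
    # intruder and the resident is moving slowly (units and cutoffs invented).
    candidate = (features["nose_to_intruder_dist"] < 20.0) & \
                (features["resident_speed"] < 5.0)

    # Drop bouts shorter than min_bout frames to suppress single-frame flicker.
    labels = candidate.copy()
    bout_start = None
    for i, on in enumerate(np.append(candidate, False)):
        if on and bout_start is None:
            bout_start = i
        elif not on and bout_start is not None:
            if i - bout_start < min_bout:
                labels[bout_start:i] = False
            bout_start = None
    return labels

# Hypothetical usage with two per-frame feature traces.
feats = {"nose_to_intruder_dist": np.random.rand(1000) * 100,
         "resident_speed": np.random.rand(1000) * 20}
close_investigation = threshold_annotation(feats)
```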
Related work
Automated tracking and behavior classification can be broken into a series of computational steps, which may be implemented separately, as we do, or combined into a single module. First, animals are detected, producing a 2D/3D centroid, blob, or bounding box that captures the animal's location, and possibly its orientation. When animals are filmed in an empty arena, a common approach is to use background subtraction to segment animals from their environments (Branson et al., 2009). Deep networks for object detection (such as Inception Resnet [Szegedy et al., 2017], Yolo [Redmon et al., 2016], or Mask R-CNN [He et al., 2017]) may also be used. Some behavior systems, such as Ethovision (Noldus et al., 2001), MoTr (Ohayon et al., 2013), idTracker (Pérez-Escudero et al., 2014), and previous work from our group (Hong et al., 2015), classify behavior from this location and movement information alone. MARS uses the MSC-MultiBox approach to detect each mouse prior to pose estimation; this architecture was chosen for its combined speed and accuracy.
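To illustrate the classical background-subtraction approach mentioned above (as opposed to MARS's learned MultiBox detector), the sketch below estimates a static background as the per-pixel median of sampled frames and returns bounding boxes of large foreground blobs. It assumes a fixed camera and a static, empty-arena background; the difference threshold and blob-size cutoff are arbitrary illustrative choices.

```python
import cv2
import numpy as np

def background_subtraction_boxes(video_path: str, diff_thresh: int = 40,
                                 min_area: float = 500.0):
    """Classical detection baseline: segment animals from a static arena.

    Estimates the background as the per-pixel median of sampled frames, then
    thresholds each frame's absolute difference from it and returns bounding
    boxes of sufficiently large foreground blobs (one list per frame).
    """
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    # Build a median background from 25 frames sampled across the video.
    samples = []
    for idx in np.linspace(0, max(n_frames - 1, 0), num=25, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            samples.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    background = np.median(samples, axis=0).astype(np.uint8)

    # Threshold each frame's difference from the background; keep large blobs.
    cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
    boxes_per_frame = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        foreground = (cv2.absdiff(gray, background) > diff_thresh).astype(np.uint8)
        contours, _ = cv2.findContours(foreground, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        boxes_per_frame.append([cv2.boundingRect(c) for c in contours
                                if cv2.contourArea(c) >= min_area])
    cap.release()
    return boxes_per_frame
```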
The tracking of multiple animals raises problems not encountered in single-animal tracking systems. First, each animal must be detected, located, and identified consistently over the duration of the video. Altering the appearance of individuals using paint or dye, or selecting animals with differing coat colors, facilitates this task (Ohayon et al., 2013; Gal et al., 2020). In cases where these manipulations are not possible, animal identity can in some cases be tracked by identity-matching algorithms (Branson et al., 2009). The pretrained version of MARS requires using animals of differing coat colors (black and white).
Second, the posture ('pose') of the animal, including its orientation and body part configuration, is computed for each frame and tracked across frames. A pose estimate comprises the position and identity of multiple tracked body parts, either in terms of a set of anatomical 'keypoints' (Toshev and Szegedy, 2014), shapes (Dankert et al., 2009; Dollár et al., 2010), or a dense 2D or 3D mesh (Güler et al., 2018).
Keypoints are typically defined based on anatomical landmarks (nose, ears, paws, digits), and their selection is determined by the experimenter depending on the recording setup and type of motion being tracked.
Animal tracking and pose estimation systems have evolved in step with the field of computer vision. Early computer vision systems relied on specialized data acquisition setups using multiple cameras and/or depth sensors (Hong et al., 2015), and were sensitive to minor changes in experimental conditions. More recently, systems for pose estimation based on machine learning and deep neural networks, including DeepLabCut (Mathis et al., 2018), LEAP (Pereira et al., 2019), and DeepPoseKit (Graving et al., 2019), have emerged as flexible and accurate tools in behavioral and systems neuroscience. These networks, like MARS's pose estimator, are more accurate and more adaptable to recording changes than their predecessors (Sturman et al., 2020), although they require an initial investment in creating labeled training data before they can be used.
Third, once raw animal pose data are acquired, a classification or identification of behavior is required. Several methods have been introduced for analyzing the actions of animals in an unsupervised or semi-supervised manner, in which behaviors are identified by extracting features from the animal's pose and performing clustering or temporal segmentation based on those features, including Moseq (Wiltschko et al., 2015), MotionMapper (Berman et al., 2014), and multiscale unsupervised structure learning (Vogelstein et al., 2014). Unsupervised techniques are said to identify behaviors in a 'user-unbiased' manner (although the behaviors identified do depend on how pose is preprocessed prior to clustering). Thus far, they are most successful when studying individual animals in isolation.
Our goal is to detect complex and temporally structured social behaviors that were previously determined to be of interest to experimenters; therefore, MARS takes a supervised learning approach to behavior detection. Recent examples of supervised approaches to detection of social behavior include Giancardo et al., 2013, MiceProfiler (de Chaumont et al., 2012), SimBA (Nilsson, 2020), and Hong et al., 2015. Like MARS, SimBA uses a keypoint-based representation of animal pose, obtained via separate software (supported pose representations include DeepLabCut [Mathis et al., 2018], DeepPoseKit [Graving et al., 2019], SLEAP [Pereira et al., 2020b], and MARS itself). In contrast, Giancardo et al., Hong et al., and MiceProfiler are pre-deep-learning methods that characterize animal pose in terms of geometrical primitives (Hong et al., 2015; Giancardo et al., 2013) or contours extracted using background subtraction (de Chaumont et al., 2012). Following pose estimation, all five systems extract a set of handcrafted spatiotemporal features from animal pose: features common to all systems include relative position, animal shape (typically body area), animal movement, and inter-animal orientation. MARS and Hong et al. use additional handcrafted features capturing the orientation and minimum distances between interacting animals. Both MARS and SimBA adopt the rolling feature-windowing method introduced by JAABA (Kabra et al., 2013), although the choice of windowing differs modestly: SimBA computes raw and normalized feature median, mean, and sum within five rolling time windows, whereas MARS computes feature mean, standard deviation, minimum, and maximum values, and uses three windows. Finally, most methods use these handcrafted features as inputs to trained ensemble-based classifiers: Adaptive Boosting in Hong et al., Random Forests in SimBA, Temporal Random Forests in Giancardo et al., and Gradient Boosting in MARS; MiceProfiler instead identifies behaviors using handcrafted functions. While there are many similarities between the approaches of these tools, direct comparison of performance is challenging due to the lack of standardized evaluation metrics. We have attempted to address this issue in a separate paper (Sun et al., 2021a).
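As an illustration of the windowing-plus-classifier recipe described above, the sketch below augments per-frame pose features with their mean, standard deviation, minimum, and maximum over three rolling windows and feeds the result to a gradient-boosted classifier. The window sizes, feature count, and scikit-learn estimator are assumptions for illustration; they are not MARS's actual feature set or hyperparameters.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def window_features(feats: np.ndarray, windows=(3, 11, 21)) -> np.ndarray:
    """Expand per-frame features with rolling-window statistics.

    feats : (n_frames, n_features) array of pose-derived features.
    For each window size, append the per-window mean, std, min, and max,
    in the spirit of JAABA-style windowing (window sizes here are illustrative).
    """
    n_frames, _ = feats.shape
    out = [feats]
    for w in windows:
        half = w // 2
        stats = []
        for t in range(n_frames):
            chunk = feats[max(0, t - half):t + half + 1]
            stats.append(np.concatenate([chunk.mean(0), chunk.std(0),
                                         chunk.min(0), chunk.max(0)]))
        out.append(np.asarray(stats))
    return np.concatenate(out, axis=1)

# Hypothetical training data: per-frame features and human behavior labels.
X = window_features(np.random.randn(1000, 27))
y = np.random.randint(0, 2, size=1000)        # e.g. attack vs. not-attack
clf = GradientBoostingClassifier(n_estimators=50).fit(X, y)
frame_predictions = clf.predict(X)
```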
A last difference between these five supervised approaches is their user interface and flexibility. Three are designed for out-of-the-box use in single, fixed settings: Giancardo et al. and Hong et al. in the resident-intruder assay, and MiceProfiler in a large open-field arena. SimBA is fully user-defined, functioning in diverse experimental arenas but requiring users to train their own pose estimation and behavior models; a GUI is provided for this purpose. MARS takes a hybrid approach: whereas the core 'end-user' version of MARS provides pretrained pose and behavior models that function in a standard resident-intruder assay, MARS_Developer allows users to train MARS pose and behavior models for their own applications. Unique to MARS_Developer is a novel library for collecting crowdsourced pose annotation datasets, including tools for quantifying inter-human variability in pose labels and using this variability to evaluate trained pose models. The BENTO GUI accompanying MARS is also unique: while BENTO does support behavior annotation and (like SimBA) behavior classifier training,