Robust Estimation Framework with Semantic Measurements
Karena X. Cai¹, Alexei Harvard², Richard M. Murray, Soon-Jo Chung
Abstract
— Conventional simultaneous localization and map-
ping (SLAM) algorithms rely on geometric measurements and
require loop-closure detections to correct for drift accumulated
over a vehicle trajectory. Semantic measurements can add
measurement redundancy and provide an alternative form of
loop closure. We propose two different estimation algorithms
that incorporate semantic measurements provided by vision-
based object classifiers. An a priori map of regions where
the objects can be detected is assumed. The first estimation
framework is posed as a maximum-likelihood problem, where
the likelihood function for semantic measurements is derived
from the confusion matrices of the object classifiers. The
second estimation framework comprises two parts: 1) a
continuous-state estimation formulation that includes semantic
measurements as a form of state constraints and 2) a discrete-
state estimation formulation used to compute the certainty of
object detection measurements using a Hidden Markov Model
(HMM). The advantages of incorporating semantic measure-
ments in these frameworks are demonstrated in numerical
simulations. In particular, the proposed estimation algorithms
improve upon the robustness and accuracy of conventional
SLAM algorithms.
I. INTRODUCTION
The widespread availability of vision-based object classifi-
cation tools has opened up the possibility of including seman-
tic data into simultaneous localization and mapping (SLAM)
algorithms. The data extracted from object classifiers, which
we refer to as semantic data, can complement the nonlinear
continuous measurements traditionally used in SLAM algorithms.
Semantic data, or object detection events, can be modeled
as binary measurements that have state-dependent probabilis-
tic likelihood functions [1], [2], [3]. The probability of a
positive detection measurement is modeled as an inverse-
exponential function of the distance to the detected object
in [1], meaning positive object detections occur with higher
probability when the vehicle is close to the detected object.
The authors in [4] compute the likelihood function of an
object detection event as a function of the object classifier
confusion matrix, and solve the coupled data association and
estimation problem by iteratively solving an expectation-
maximization problem. In these algorithms, however, the
likelihood functions cannot capture false positive or false
negative detections. A likelihood function that captures
these types of errors is derived in [3], but it requires additional
*This work was in part supported by AeroVironment, Inc., Boeing, and Caltech's Center for Autonomous Systems and Technologies (CAST).
¹Karena X. Cai is a graduate student in Control and Dynamical Systems at the California Institute of Technology, kcai@caltech.edu.
²Alexei Harvard is a senior research engineer at the California Institute of Technology, alexh@caltech.edu.
assumptions on the probability of false positive detections
generated by clutter.
Since object classifiers are prone to false positive and
negative detections, an accurate estimation algorithm that
integrates semantic data requires the rejection of false mea-
surements. Loop closure events, which are the detection of
returning to a previously visited location, are similar to object
detection events, since they add a state constraint and the
inclusion of false loop closure measurements can cause large
errors in the state estimate [5].
Several algorithms have been developed to ensure the
factor-graph formulation for SLAM problems is robust to
loop-closure errors [5], [6], [7]. In particular, switchable
constraint variables, which scale the covariance associated
with loop-closure measurements, are introduced into the
optimization framework to improve robustness to
false detections [7], [8]. In this paper, we extend such meth-
ods to robustly handle false object detection measurements
as well. This requires deriving a method for computing
the certainty of object detection measurements. The method
we derive for computing this certainty metric leverages
information from pose estimation.
Pose estimation has been shown to improve the accuracy
of object classification algorithms [9], [10]. These methods
take into account the motion profile of the vehicle as it
observes an object, but they do not consider the dynamics
of the vehicle between object detection events [9]. In our
paper, we exploit the dynamics of the robot between object
detection events to help quantify the certainty of an object
detection measurement.
In this paper, we propose a novel framework that simulta-
neously uses object classification to improve pose estimation
and pose estimation to improve the certainty of measure-
ments from object classifiers. In order to robustly include
semantic measurements into the estimation formulation, we
introduce a higher-layer estimation framework that parses
the vehicle trajectory into object detection events, and uses
Hidden Markov Model (HMM) algorithms to estimate the
certainty of the object detection events. This computed
certainty is used to improve the robustness of our semantic
estimation algorithm since it allows for the rejection of false
positive measurements.
This paper is structured as follows: in Section II, we
review the problem formulation and introduce the format of
the semantic data. We show the maximum-likelihood formu-
lation of the SLAM problem in Section III. An alternative
formulation for including the semantic data into the factor
graph is derived in Section IV. In Section V, we introduce
switch variables to improve the robustness of our algorithm.
We propose a higher-layer estimation framework modeled
as an HMM in Section VI to further improve the robustness
of the formulation. Finally, the algorithm is summarized in
Section VII and simulation results are presented in Section
VIII.
II. PROBLEM FORMULATION
Consider the traditional localization and mapping algorithm, where the goal is to simultaneously estimate the vehicle poses $X \triangleq \{x_t\}_{t=0}^{T}$ and the positions of a set of landmarks in the environment denoted by $L \triangleq \{l_m\}_{m=1}^{M}$, given a set of continuous measurements $Z_c \triangleq \{z_{c,t}\}_{t=0}^{T}$. The landmarks are features in the environment that can easily be recognized, and the continuous measurements are range or bearing measurements to those landmarks. Note that $x_t \in \mathbb{R}^n$, and $M$ is the number of landmarks in the environment.
For this model, we assume we have a set of odometry measurements given by $B \triangleq \{b_t\}_{t=0}^{T}$ to approximate the vehicle dynamics, where $b_t$ gives the vehicle translation and rotation between discrete time points of the vehicle trajectory. These odometry measurements can be given by methods like the Iterative Closest Point (ICP) algorithm. This estimation problem is typically formulated as the following maximum-likelihood problem:

$$\hat{X}, \hat{L} = \operatorname*{argmax}_{X,L}\; \log p(Z_c \mid X, L). \tag{1}$$
In our problem, we consider a vision-based object classification algorithm that can detect $K$ different objects given by $O \triangleq \{o_k\}_{k=1}^{K}$. We assume that we have an a priori map that defines the position of each object $o_k$ and the corresponding region $R_k$ where the object can be detected. We define the set of semantic measurements as $Z_s \triangleq \{z_{s,t}\}_{t=0}^{T}$, where $z_{s,t} \in \mathbb{R}^K$.
The measurement corresponding to the object detector of object $o_k$ can be represented as a binary variable $z^k_{s,t} \in \{0,1\}$, where a measurement of 1 indicates that the object $o_k$ has been detected and 0 indicates that it has not. Each object detection measurement has a corresponding confusion matrix $C^k \in \mathbb{R}^{2\times 2}$ that captures the rates of false positive and false negative detections. Let the variable $v^k$ be an indicator variable representing whether the object $o_k$ is in the field of view of the camera and can be detected. The elements of the confusion matrix are defined as $c^k_{i v^k} = p(z^k_{s,t} = i \mid v^k)$. We assume these error statistics can be computed offline. In the case of a perfect classification algorithm, the confusion matrix would be the identity.
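As a concrete illustration of this measurement model, the following Python sketch builds a confusion matrix for a single hypothetical detector and looks up $p(z^k_{s,t} \mid v^k)$; the error rates and the helper name p_measurement are our own illustrative choices, not values or code from the paper.

```python
import numpy as np

# Hypothetical confusion matrix for one object detector o_k.
# Rows are indexed by the measurement value z in {0, 1}, columns by the
# visibility indicator v in {0, 1}, so C_k[i, v] = p(z = i | v).
false_negative_rate = 0.05   # p(z = 0 | v = 1)
false_positive_rate = 0.10   # p(z = 1 | v = 0)
C_k = np.array([
    [1.0 - false_positive_rate, false_negative_rate],       # z = 0
    [false_positive_rate,       1.0 - false_negative_rate],  # z = 1
])

# Each column is a probability distribution over the measurement value.
assert np.allclose(C_k.sum(axis=0), 1.0)

def p_measurement(z: int, v: int) -> float:
    """Return p(z_{s,t}^k = z | v^k = v) by indexing the confusion matrix."""
    return C_k[z, v]

print(p_measurement(1, 1))  # probability of a true positive detection
```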
The goal of this paper is to formulate an estimation
algorithm that improves the robustness and accuracy of
conventional localization and mapping algorithms by incor-
porating these semantic measurements.
III. MAXIMUM-LIKELIHOOD FORMULATION WITH SEMANTIC DATA
The estimation problem with semantic measurements can
be formulated as a maximum-likelihood problem. The for-
mulation is given as follows:
$$\hat{X}, \hat{L} = \operatorname*{argmax}_{X,L}\; \log p(Z_c, Z_s \mid X, L). \tag{2}$$
Assuming the semantic measurements are independent
from the continuous range measurements, the maximization
problem can be rewritten as the following minimization
problem:
$$\hat{X}, \hat{L} = \operatorname*{argmin}_{X,L}\; -\sum_{t=0}^{T}\sum_{k=1}^{K} \log\!\left(p(z^k_{s,t} \mid x_t)\right) - \log\!\left(p(Z_c \mid X, L)\right) - \log\!\left(p(B \mid X)\right), \tag{3}$$
where the first term in the formulation corresponds to the
likelihood function of the semantic measurements, and the
second and third terms represent the nonlinear least-squares
terms associated with the continuous measurements and
odometry measurements respectively. The likelihood function for the semantic measurements from object detector $k$ can be derived from the object detector's confusion matrix $C^k$ as follows:
$$p(z^k_{s,t} \mid x_t) = \prod_{a\in\{0,1\}}\; \prod_{b\in\{0,1\}} \left(c^k_{a,b}\right)^{\mathbb{1}(z^k_{s,t}=a)\,\mathbb{1}(v^k_t=b)}. \tag{4}$$
This likelihood function selects each element of the confusion matrix based on the semantic measurement $z^k_{s,t}$ and $v^k_t$, which indicates whether the object is in the field of view. The indicator variable $v^k_t$ is a function of the pose estimate $x_t$ and the rotation matrix associated with the camera.
Depending on whether the measurement $z^k_{s,t}$ is a 1 or a 0, the likelihood function derived in (14) takes on a different shape. The negative log-likelihood function of (4), shown in Fig. 1, assumes the object is in the field of view when inside the region $R_k$ associated with object $o_k$. Under this assumption, we can see how the likelihood function promotes the region corresponding to the detected object when the measurement $z^k_{s,t} = 1$ and demotes the region when $z^k_{s,t} = 0$.
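The following sketch evaluates the likelihood in (4) under the same simplifying assumption used for Fig. 1, namely that the object is visible exactly when the pose lies inside a circular region $R_k$; the function name, the circular-region geometry, and the numbers are our own illustrative assumptions.

```python
import numpy as np

def semantic_likelihood(z: int, x: np.ndarray, region_center: np.ndarray,
                        region_radius: float, C_k: np.ndarray) -> float:
    """Evaluate p(z_{s,t}^k | x_t) as in (4), assuming the object is in the
    field of view exactly when the pose lies inside the circular region R_k."""
    v = int(np.linalg.norm(x - region_center) <= region_radius)
    # The product of indicator-weighted powers simply selects C_k[z, v].
    likelihood = 1.0
    for a in (0, 1):
        for b in (0, 1):
            likelihood *= C_k[a, b] ** (int(z == a) * int(v == b))
    return likelihood

# Example: negative log-likelihood of a positive detection far from the region.
C_k = np.array([[0.9, 0.05], [0.1, 0.95]])
x_t = np.array([5.0, 0.0])
print(-np.log(semantic_likelihood(1, x_t, np.array([0.0, 0.0]), 2.0, C_k)))
```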
For the likelihood estimation formulation, where we solve for (3), we only include measurements where $z^k_{s,t} = 1$. The positive detection events are the measurements that give the most information, since they promote a very small region when they occur. Further, a measurement where $z^k_{s,t} = 0$ does not imply the vehicle is not in the region associated with the object. Instead, the vehicle could simply not have the object in its field of view, but still be in the region $R_k$.
Note that the likelihood function is a discrete, nonlinear
function that must be approximated by a smooth function
in order to be implemented in factor-graph estimation
libraries like gtsam [11], which rely on gradient-based
methods for solving the optimization problem. The details
of this approximation are given in the Appendix.
Although this model improves the robustness of the esti-
mation algorithm, the likelihood function does not take into
consideration higher-level details about the measurements
like their persistence over time. In the next section, we
therefore introduce an alternative formulation for including
object detection events as nonlinear factors that impose state
constraints similar to loop closure detections.
IV. FACTOR-GRAPH FORMULATION WITH SEMANTIC DATA AS STATE CONSTRAINTS
In this section, we treat each object detection measurement $z^k_{s,t} = 1$ as a state constraint. We do not consider measurements where $z^k_{s,t} = 0$ in our factor-graph formulation for the same reasons we excluded them in the maximum-likelihood formulation. Since the standard factor-graph formulation does not allow for explicit state constraints, we introduce a relaxation, and use the following nonlinear least-squares factor to represent the constraint imposed by a positive object detection measurement:

$$f(z^k_{s,t}, x_t) = z^k_{s,t}\, f^k_1(x_t). \tag{5}$$
This semantic factor is defined to reflect the same properties
as the discrete likelihood function described in Section
III. The comparison between the factors derived for the
likelihood function and the factor derived here can be seen in
Fig. 1. The factor $f^k_1(x_t)$ is defined as the following piecewise function:
$$f^k_1(x_t) = \begin{cases} 0 & d^k_h(x_t) = 0 \\ \alpha \exp\!\left(\beta\, d^k_h(x_t)\right) & d^k_h(x_t) > 0, \end{cases} \tag{6}$$
where $d^k_h(x_t)$ is the shortest distance from $x_t$ to the boundary of the region corresponding to object $k$ given by $R_k$. Although the factor representing the likelihood function and the customized factor are similar, the customized factor improves the estimation accuracy further because of properties of its gradient.
Fig. 1. The nonlinear least-squares factors added to the graph corresponding to the semantic measurement $z^k_{s,t} = 1$ (top) and $z^k_{s,t} = 0$ (bottom), plotted against the radial distance from the center of the object. The factors corresponding to the negative log-likelihood function described in (14) and the customized factors formed by piecewise inverse-exponential functions are shown in blue and red respectively.
In particular, the gradient of the function $f^k_1(x_t)$ is nonzero even when the estimate is far from the object detection region. Therefore, a positive detection event acts towards improving the estimate even when the initial estimate is bad. Further, the parameters $\alpha$ and $\beta$ can be modified to change the scale and rate of the inverse-exponential functions respectively. With these additional features, the new formulation with semantic measurements becomes the following minimization problem:
$$\hat{X}, \hat{L} = \operatorname*{argmin}_{X,L}\; \sum_{t=0}^{T}\sum_{k=1}^{K} \left\| f(z^k_{s,t}, x_t) \right\| - \log\!\left(p(Z_c \mid X, L)\right) - \log\!\left(p(B \mid X)\right). \tag{7}$$
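A minimal sketch of the customized semantic factor in (5) and (6) is given below, assuming circular detection regions; the values of $\alpha$ and $\beta$ and the region geometry are illustrative choices rather than parameters reported in the paper, and the exponential form follows our reconstruction of (6).

```python
import numpy as np

def boundary_distance(x: np.ndarray, center: np.ndarray, radius: float) -> float:
    """Shortest distance d_h^k(x) from the pose to the boundary of a circular
    region R_k; zero when the pose is inside the region."""
    return max(0.0, np.linalg.norm(x - center) - radius)

def semantic_factor(z: int, x: np.ndarray, center: np.ndarray, radius: float,
                    alpha: float = 0.5, beta: float = 0.4) -> float:
    """Residual f(z_{s,t}^k, x_t) = z * f_1^k(x_t) from (5)-(6). The residual
    vanishes for negative measurements and inside the region, and grows with
    the distance to the region boundary otherwise, so its gradient remains
    nonzero even for poor initial estimates."""
    d = boundary_distance(x, center, radius)
    if z == 0 or d == 0.0:
        return 0.0
    return alpha * np.exp(beta * d)

# A positive detection reported 6 m outside the region produces a large
# residual that pulls the pose estimate toward R_k during optimization.
print(semantic_factor(1, np.array([8.0, 0.0]), np.array([0.0, 0.0]), 2.0))
```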
Even with this formulation, false positive measurements will
cause the wrong state-constraint factors to be imposed and
will result in poor estimation results. In the next section we
introduce switchable constraints, taken from the loop-closure
literature, to account for the possibility of bad measurements.
V. ROBUST FACTOR-GRAPH FORMULATION WITH SEMANTIC DATA AND SWITCHABLE CONSTRAINTS
In traditional SLAM algorithms switchable constraints are
introduced into the optimization formulation to improve the
algorithm’s performance when false positive data associa-
tions or loop-closure detections occur [7]. We propose the
following addition of switchable constraints to improve the
robustness of our algorithm to false semantic measurements:
$$\hat{X}, \hat{L}, \hat{\Gamma} = \operatorname*{argmin}_{X,L,\Gamma}\; \sum_{t=0}^{T}\sum_{k=1}^{K} \left( \left\| \Psi(\gamma^k_t)\, f(z^k_{s,t}, x_t) \right\|_{\Sigma} + \left\| \gamma^k_t - \bar{\gamma}^k_t \right\|_{\Lambda} \right) - \log\!\left(p(Z_c \mid X, L)\right) - \log\!\left(p(B \mid X)\right), \tag{8}$$
where $\Gamma \triangleq \{\gamma_t\}_{t=0}^{T}$ is the set of all switch variables, with each $\gamma_t \in \mathbb{R}^K$, and $\bar{\Gamma} \triangleq \{\bar{\gamma}_t\}_{t=0}^{T}$ is the set of priors on the switch variables. The variables $\Sigma$ and $\Lambda$ are optimization hyperparameters that determine the weight of the factors corresponding to the state constraints and the switch variable priors. The function $\Psi : \mathbb{R} \mapsto [0,1]$ is a function which takes a real value and maps it to the closed interval between 0 and 1. We choose $\Psi(\gamma^k_t) = \gamma^k_t$ to be a linear function of the switch variables and constrain the switch variables to $0 \le \gamma^k_t \le 1$, since these choices have been empirically shown to work well [7].
Each switch variable $\gamma^k_t$ quantifies the certainty of its associated semantic measurement $z^k_{s,t}$. When the switch variable $\gamma^k_t$ is set to 0, the certainty in the measurement is extremely low, and the influence of the state-constraint factor associated with the measurement $z^k_{s,t}$ is disregarded. Probabilistically, the switch variable modifies the information matrix associated with the semantic factor such that $\hat{\Sigma}^{-1} = \Psi(\gamma^k_t)^2\, \Sigma^{-1}$ [7]. This means the covariance of the semantic measurement is unchanged from $\Sigma$ when $\Psi(\gamma^k_t) = 1$ and the certainty in the measurement is high, but scales with $\Psi(\gamma^k_t)^{-2}$ when $\Psi(\gamma^k_t) < 1$, meaning the uncertainty can grow very high. For typical object classifiers, since the rate of false positive measurements is relatively low, we can default to trusting the measurements, so the switch priors $\bar{\gamma}^k_t$ are set to 1. Both the certainty of each object detection measurement and the pose estimates are variables in this optimization problem formulation.
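To make the role of a switch variable concrete, the sketch below evaluates the two terms that one semantic measurement contributes to the objective in (8), with $\Psi$ chosen linear as in [7]; the quadratic (Mahalanobis-style) weighting and the default values of $\Sigma$ and $\Lambda$, which echo the choices reported in Section VIII, are our own simplifications rather than the authors' implementation.

```python
import numpy as np

def switched_semantic_cost(gamma: float, factor_value: float,
                           gamma_prior: float,
                           sigma: float = 0.5, lam: float = 0.01) -> float:
    """Cost contributed by one semantic measurement in (8): a switch-scaled
    state-constraint term weighted by Sigma plus a penalty for deviating from
    the switch prior weighted by Lambda (both treated here as scalar noise
    standard deviations)."""
    psi = np.clip(gamma, 0.0, 1.0)          # Psi(gamma) is linear, clamped to [0, 1]
    constraint_term = (psi * factor_value) ** 2 / sigma ** 2
    prior_term = (gamma - gamma_prior) ** 2 / lam ** 2
    return constraint_term + prior_term

# Trusting the measurement (gamma near 1) keeps the full constraint;
# driving gamma toward 0 disables the constraint but pays the prior penalty.
for gamma in (1.0, 0.5, 0.0):
    print(gamma, switched_semantic_cost(gamma, factor_value=2.0, gamma_prior=1.0))
```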
Since the switch variables and the semantic detections are
single links in the factor graph formulation, they add a linear
cost when computing the sparse matrix solutions on each
iteration. However, they also increase the complexity of the
underlying space and can increase the number of iterations it
takes to converge to a solution, with the number of iterations
highly dependent on the specific data. Furthermore, although
the formulation in (8) is much more robust to semantic
measurement errors, setting the prior on all switch variables
to 1 will sometimes cause the optimization to converge to
the wrong solution. If we can compute the certainty of
each object detection measurement by leveraging both the
error statistics of the object classifier algorithms and the
vehicle dynamics, we can construct a more accurate prior
on the switch variables. In the next section, we propose a
formulation where we use an HMM to compute the marginal
probabilities of object detection events. These marginal prob-
abilities can be used to set the prior on the switch variables
in the optimization framework.
VI. HMM FORMULATION
Higher-level properties of the estimation formulation, like
the persistence of semantic measurements over time and
the relative vehicle dynamics, can be used to improve the
certainty of object detection measurements. In this section,
we propose a discrete-state estimation framework in the
form of a Hidden Markov Model (HMM) that provides a
measure on the certainty of the semantic measurements that
occur during object detection events, thereby allowing us to
identify and reject false positive detection measurements.
We consider a discrete-state representation of the vehicle trajectory in terms of object detection events. The vehicle trajectory can be parsed into different object detection events based on the persistence of semantic measurements in $Z_s$ over time. The continuous-state estimation, which we refer to as the lower-layer estimation framework, occurs on the time-scale of $t$, whereas the discrete-state estimation, which we refer to as the higher-layer estimation framework, occurs on the time-scale of $\tau$. This is also shown more clearly in Fig. 3. Once an object detection event has been detected, we represent the detection event with a discrete state $s_\tau$. This discrete state $s_\tau$ has a time interval in the continuous-state estimation time domain given by $t_\tau = [t_{\tau,i}, t_{\tau,f}]$ and a set of semantic measurements $Y_\tau \triangleq \{z_{s,t}\}_{t=t_{\tau,i}}^{t_{\tau,f}}$ that occur over the time interval $t_\tau$. The semantic measurements associated with the object $o_k$ during this time interval are defined as $Y^k_\tau \triangleq \{z^k_{s,t}\}_{t=t_{\tau,i}}^{t_{\tau,f}}$.
The vehicle trajectory can then be represented as a sequence of states $S \triangleq \{s_\tau\}_{\tau=1}^{Q}$, where $Q$ is the number of object detection events that occur over the trajectory and $s_\tau \in \{o_1, o_2, \ldots, o_K\}$. The notation $s_\tau = o_i$ means that object $o_i$ has been seen during the detection event $s_\tau$.
Fig. 2. This figure shows the relation between the discrete-state HMM and the continuous-state estimation. The semantic measurements $Z_s$ are illustrated by the grid, where each row represents the measurements received by an object classifier. The grey boxes represent $z^k_{s,t} = 1$ and the white boxes represent $z^k_{s,t} = 0$.
The time-sequence of object detection events can be modeled as an HMM since the memoryless Markov property holds, i.e., $p(s_\tau \mid s_{0:\tau-1}) = p(s_\tau \mid s_{\tau-1})$. The HMM estimation formulation and the algorithms for computing the switch prior, which are used to improve the optimization formulation in (8), are described in the following sections.
A. Parsing Trajectory into Object Detection Events
In this model we assume only one object detection event occurs at a given time. The trajectory $X$ can be parsed into different object detection events based on the semantic measurements $Z_s$. Each object classifier is associated with a sequence of measurements given by $\{z^k_{s,t}\}_{t=0}^{T}$ and a confusion matrix $C^k$ that describes the algorithm's error statistics. In the event that the object $o_k$ is visible, the frequency of nonzero measurements can be approximated by $C^k_{11} = p(z^k_{s,t} = 1 \mid v^k_t = 1)$. We therefore define an object detection event for the object $o_k$ as occurring when the proportion of nonzero measurements over a minimum time interval exceeds the threshold value set by $C^k_{11} - \varepsilon$. The value of $\varepsilon$ depends on the certainty of the statistics given by the confusion matrix. The object detection event is terminated when the proportion of nonzero measurements decreases to less than the threshold value of $C^k_{11} - \varepsilon$.
B. Transition and Observation Matrices
HMMs are typically defined by a single transition matrix and a single observation matrix. The hybrid nature of our estimation formulation means continuous states have elapsed between the discrete states representing object detection events, and that a set of semantic measurements, denoted by $Y_\tau$, have elapsed during each object detection event. Since we want to incorporate continuous-state pose estimates and semantic measurements into our HMM formulation, the transition and observation matrices are time-varying and dependent on the lower-layer estimates and measurements. In particular, the transition matrices are a function of the pose estimates between discrete states representing object detection events, and the observation matrices are a function of $Y_\tau$, the semantic measurements that have elapsed during the time interval $t_\tau$ corresponding to the discrete state $s_\tau$.
Each element of the transition matrix between discrete states $s_{\tau-1}$ and $s_\tau$ is defined as follows:

$$A(s_{\tau-1}, s_\tau)_{ij} \triangleq p(s_\tau = o_j \mid s_{\tau-1} = o_i, \hat{x}_{t_{\tau,i}}, \hat{x}_{t_{\tau-1,f}}) \propto \exp\!\left(-\tfrac{1}{2}\left| d_{ij} - \hat{d}_{\tau-1,\tau} \right|\right), \tag{9}$$
where $d_{ij} = \|p_i - p_j\|_2$, and $p_i$ and $p_j$ denote the positions of the centers of mass (COM) of the objects $o_i$ and $o_j$ respectively. The distance $\hat{d}_{\tau-1,\tau} \triangleq \|\hat{x}_{t_{\tau,i}} - \hat{x}_{t_{\tau-1,f}}\|_2$ is the estimated distance traveled between object detection events. The rows of the transition probability matrix are normalized to sum to one. The transition probability is defined by the error between the actual distance of two objects from each other and the estimated distance traveled from one object detection event to another. The definition of the transition matrix would have to be modified to accommodate objects whose regions are not centered around the objects' COM, because the distance traveled between object detection events could vary considerably. Examples of such objects include sidewalk or road detectors. This will be considered in future work.
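The sketch below assembles the time-varying transition matrix of (9) for a set of objects, assuming the proportionality in (9) and the row normalization described above; the object COM positions and the estimated traveled distance are hypothetical.

```python
import numpy as np

def transition_matrix(object_positions: np.ndarray, d_hat: float) -> np.ndarray:
    """Build the time-varying transition matrix of (9). Entry (i, j) is
    proportional to exp(-0.5 * |d_ij - d_hat|), where d_ij is the distance
    between the centers of mass of objects o_i and o_j and d_hat is the
    estimated distance traveled between the two detection events.
    Rows are normalized to sum to one."""
    diffs = object_positions[:, None, :] - object_positions[None, :, :]
    d = np.linalg.norm(diffs, axis=-1)                 # pairwise COM distances
    A = np.exp(-0.5 * np.abs(d - d_hat))
    return A / A.sum(axis=1, keepdims=True)

# Hypothetical COM positions of three objects and an estimated 10 m of travel.
positions = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 6.0]])
print(transition_matrix(positions, d_hat=10.0))
```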
Each element of the observation matrix for the discrete state $s_\tau$ is defined as follows:

$$O(\tau)_{ij} \triangleq p(y_\tau = o_i, Y_\tau \mid s_\tau = o_j) = p(y_\tau = o_i \mid Y_\tau)\, p(Y_\tau \mid s_\tau = o_j). \tag{10}$$
The probability $p(Y_\tau \mid s_\tau = o_j)$ is the likelihood of a sequence of semantic measurements over the time interval $t_\tau$ given that the object detection event corresponds to object $o_j$. When conditioned on $s_\tau = o_j$, each measurement in the sequence $Y^j_\tau$ can be modeled as a Bernoulli random variable with the probability of a nonzero measurement given by $C^j_{11}$. Thus, the probability $p(Y_\tau \mid s_\tau = o_j)$ can be approximated by how well the sequence of measurements $Y^j_\tau$ fits a Bernoulli distribution with parameter $p = C^j_{11}$. This probability can be computed with a Chi-Squared goodness-of-fit test [12].
Since there is no discrete-state observation of the system, we define $y_\tau$ to be a function of the sequence of measurements $Y_\tau$. The discrete-state observation is the object that corresponds to the maximum likelihood for the sequence of semantic observations, which means $\bar{y}_\tau = \operatorname*{argmax}_{o_k} p(Y_\tau \mid s_\tau = o_k)$ for $k = 1, \ldots, K$. Thus, the probability of a discrete-time measurement conditioned on the continuous-state observations is defined as follows:

$$p(y_\tau = o_i \mid Y_\tau) = \begin{cases} 1 & \text{if } o_i = \bar{y}_\tau \\ 0 & \text{if } o_i \neq \bar{y}_\tau. \end{cases} \tag{11}$$

Therefore, an observation matrix for each discrete state $s_\tau$ can be derived for the HMM as a function of the sequence of semantic measurements $Y_\tau$.
C. Computing Marginal Probabilities
The Viterbi algorithm can be used to determine the most
probable sequence of states given a set of observations. We
use a modified version of the Viterbi formulation given as
follows:
$$\hat{S} = \operatorname*{argmax}_{S}\; \sum_{\tau=1}^{\tau_f} \log\!\left(p(y_\tau, Y_\tau \mid s_\tau)\right) + \log\!\left(p(s_\tau \mid s_{\tau-1}, \hat{d}_{\tau-1,\tau})\right), \tag{12}$$
where $\hat{d}_{\tau-1,\tau}$ is the distance traveled between object detection events and depends on the vehicle state estimates. This equation accounts for the dependencies of the transition and observation matrices on the time-varying lower-layer estimates and measurements. Once the most probable sequence of states for the HMM is derived from the Viterbi algorithm in (12), the marginal probabilities $p(s_\tau = o_i \mid Y_{0:\tau,f})$ can be computed with dynamic programming using a modified version of the Forwards-Backwards algorithm. The variable $Y_{0:\tau,f}$ denotes all the measurement sequences corresponding to object detection events that have been observed. Details of this computation can be found in the Appendix.
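Since the Appendix is not reproduced here, the sketch below shows a generic Forwards-Backwards pass for an HMM with event-dependent transition and observation models, which is the structure described in this section; it is our own minimal version, not the authors' modified algorithm.

```python
import numpy as np

def forward_backward_marginals(obs_lik: np.ndarray, transitions: np.ndarray,
                               prior: np.ndarray) -> np.ndarray:
    """Compute p(s_tau = o_i | Y_{0:tau,f}) for an HMM whose transition and
    observation models vary with tau. obs_lik[t, i] is the likelihood of the
    measurements at event t given state o_i, and transitions[t] is the matrix
    used between events t and t+1."""
    Q, K = obs_lik.shape
    alpha = np.zeros((Q, K))          # forward messages (normalized)
    beta = np.ones((Q, K))            # backward messages
    alpha[0] = prior * obs_lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, Q):
        alpha[t] = obs_lik[t] * (transitions[t - 1].T @ alpha[t - 1])
        alpha[t] /= alpha[t].sum()
    for t in range(Q - 2, -1, -1):
        beta[t] = transitions[t] @ (obs_lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    marginals = alpha * beta
    return marginals / marginals.sum(axis=1, keepdims=True)

# Three detection events over three objects with hypothetical models.
obs = np.array([[0.1, 0.8, 0.1], [0.05, 0.1, 0.85], [0.9, 0.05, 0.05]])
A = np.tile(np.full((3, 3), 1.0 / 3.0), (2, 1, 1))    # uninformative transitions
print(forward_backward_marginals(obs, A, prior=np.full(3, 1.0 / 3.0)))
```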
The Forwards-Backwards algorithm therefore defines a certainty associated with the most probable sequence of object detection events. The switch priors associated with the semantic measurements that occur over the time intervals corresponding to these object detection events are computed in the following section.
D. Switch Prior Derivation
The switch prior $\bar{\gamma}^k_t$ associated with a semantic measurement $z^k_{s,t}$ quantifies the reliability of the measurement. When an object detection event corresponding to object $o_k$ occurs, the marginal probability from the Forwards-Backwards algorithm gives us $p(s_\tau = o_k \mid Y_{0:\tau,f})$, which is a metric for the certainty that object $o_k$ was detected over the time interval $t_\tau$ associated with the detection event. This means that over the time interval $t_\tau$, the certainty of the measurements where $z^k_{s,t} = 1$ should be proportional to the certainty of the detection event. Thus, we define the switch priors for the time interval $t_\tau$ where $s_\tau = o_k$, for every measurement for which $z^k_{s,t} = 1$, as follows:
$$\bar{\gamma}^k_t = p(s_\tau = o_k \mid Y_{0:\tau,f}). \tag{13}$$
In the case where the certainty in the object detection event $s_\tau = o_k$ is high, the switch priors corresponding to $z^k_{s,t} = 1$ will be very close to 1.

The HMM formulation only gives a certainty metric for positive object detection events. The switch prior must also be derived for any semantic measurements where $z^k_{s,t} = 1$ that do not occur during a time interval specified by an object detection event. These measurements occur during a non-detection event, which occurs over the time interval in the continuous-state estimation time domain given by $t_{\tau_\emptyset} = [t_{\tau_\emptyset,i}, t_{\tau_\emptyset,f}]$ and has semantic measurements given by $Y^k_{\tau_\emptyset} = \{z^k_{s,t}\}_{t=t_{\tau_\emptyset,i}}^{t_{\tau_\emptyset,f}}$. To compute the probability that all measurements during the time interval correspond to a non-detection event, which we define as $p(s_{\tau_\emptyset} = \emptyset \mid Y^k_{\tau_\emptyset})$, we can compute how well the sequence of measurements in $Y^k_{\tau_\emptyset}$ fits a Bernoulli distribution with parameter $C^k_{10}$. The switch priors associated with the measurements outside the object detection events are set to have a certainty proportional to $1 - p(s_{\tau_\emptyset} = \emptyset \mid Y^k_{\tau_\emptyset})$. Thus, when the certainty that a non-detection event has occurred is high, the switch priors corresponding to $z^k_{s,t} = 1$ will be very close to 0.
VII. SYSTEM ARCHITECTURE
In this section, we summarize the final estimation architecture. There are two estimation processes occurring on different time-scales: the continuous-state estimation process with switchable constraints and the discrete-state estimation of object detection events. Each process iteratively improves the other, and the dependencies of the two processes can be seen in Fig. 3. The continuous-state estimation framework with switchable constraints is given by a modified version of the maximum-likelihood formulation; the continuous-state optimization problem is given by (8). The discrete-state estimation framework operates on the time scale of object detection events given by $\tau$. Once an object detection event has been classified, as described in Section VI-A, the pose estimates and semantic measurements from (8) are used to derive the transition and observation matrices of the HMM.
This HMM is used to model the discrete-state representation
of the trajectory. The most probable sequence of object detection events, represented by the set of discrete states $S$, is then computed using the Viterbi algorithm. The Forwards-Backwards algorithm is then used to compute the marginal probabilities associated with the maximum-likelihood sequence of object detection events.
Fig. 3. The system architecture comprises two layers: the lower layer, shown in the top box, represents the factor-graph formulation with switchable constraints and updates at every time step $t$, whereas the higher layer, shown in the bottom box, represents the HMM estimation framework and computes switch prior variables after every object detection event $\tau$. The switch variables are denoted by $\Gamma$ and the switch priors by $\bar{\Gamma}$.
These marginal probabilities are used to compute priors on
the switch variables associated with the semantic measure-
ments in (8). Therefore, the higher and lower-layer estimation
processes are simultaneously improving the pose estimate
and the certainty of object detection events.
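The interaction between the two layers can be summarized by the following skeleton (our own paraphrase of Fig. 3, with hypothetical callables standing in for the factor-graph solver, the event parser, and the HMM); it is meant only to show the flow of pose estimates and switch priors between the layers.

```python
from typing import Callable, List

def two_layer_estimation(optimize_factor_graph: Callable, parse_events: Callable,
                         run_hmm: Callable, update_switch_priors: Callable,
                         measurements: List, num_iterations: int = 5):
    """Skeleton of the two-layer architecture in Fig. 3. The lower layer solves
    the switched factor graph (8) at the continuous time-scale t; the higher
    layer runs the HMM of Section VI at the event time-scale tau and feeds
    marginal probabilities back as switch priors."""
    switch_priors = None
    for _ in range(num_iterations):
        # Lower layer: continuous-state estimate with the current switch priors.
        poses, switch_vars = optimize_factor_graph(measurements, switch_priors)
        # Higher layer: discrete-state estimate over parsed detection events.
        events = parse_events(measurements)
        marginals = run_hmm(events, poses)
        switch_priors = update_switch_priors(events, marginals)
    return poses, switch_priors

# Minimal smoke test with stand-in callables.
poses, priors = two_layer_estimation(
    optimize_factor_graph=lambda m, p: (["pose"] * len(m), p),
    parse_events=lambda m: [(0, len(m) - 1)],
    run_hmm=lambda e, x: [0.9] * len(e),
    update_switch_priors=lambda e, marg: dict(enumerate(marg)),
    measurements=[0, 1, 1, 1, 0],
)
print(poses, priors)
```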
VIII. SIMULATION RESULTS
We investigate the performance of our algorithms in sim-
ulation. We consider an object classifier that can detect three
different objects, each of which is associated with a known
radial region in a 2-D map that is shown in Fig. 4, and each of
which has a confusion matrix whose parameters are defined
in Table I.
TABLE I
CONFUSION MATRIX PARAMETERS

                                      o_1     o_2     o_3
  p(z^k_{s,t} = 0 | v^k = 1)          0.02    0.03    0.05
  p(z^k_{s,t} = 1 | v^k = 0)          0.2     0.15    0.1
There are four landmarks that provide the vehicle with range measurements when detected. The data association problem for the landmarks is assumed to be solved in this formulation. The vehicle traverses an elliptical trajectory and passes through each of the object detection regions along its path. We introduce noisy odometry measurements in our simulation. The range measurements to the landmarks mitigate the estimation error when included in the traditional SLAM algorithms. We investigate how the introduction of semantic measurements improves the estimation even further.
In our simulations, we run three different algorithms to
estimate the vehicle trajectory. First, we run the algorithm
with range measurements to the four different landmarks,
which is what is conventionally available in the SLAM com-
munity. Second, we use the maximum-likelihood formulation
with the smoothed likelihood function described in Section
III. Finally, we test the two-layer estimation framework
with the switch variables and the priors from the HMM. In
our simulations, the estimation frameworks are implemented
using gtsam, a factor-graph library commonly used for solving pose-graph estimation problems [13], [11].
For the different formulations, the noise model must be
chosen for the factors corresponding to the state constraints
imposed by the semantic measurements. In the algorithm
involving the likelihood function, we use the identity matrix
to define the covariance on the likelihood factor so that
the Bayesian representation of the likelihood function is
preserved.
Fig. 4. The objects $o_1$, $o_2$, $o_3$ and their corresponding regions of detection $R_k$ are shown in blue, and the positions of the landmarks that are visible during the vehicle trajectory are shown in orange. The gray line represents the true vehicle path.
For the formulation with switch variables, the noise on the prior for the switch variables, which is given by $\Lambda$ in (8), is chosen to be 0.01 when the switch prior values are chosen by the HMM and 10 when the switch prior value is set to the default value of 1. This choice reflects our increased certainty in the switch prior when we use the HMM formulation. The noise on the inverse-exponential factors, which is represented by $\Sigma$ in (8), is chosen to be 0.5.
Fig. 5. The true trajectory can be compared to the estimated trajectories from the three different algorithms (landmarks only, likelihood, and switch with HMM), plotted in the x-y plane. The algorithms that incorporate the semantic measurements improve the estimate significantly.
The estimated trajectories resulting from the different estimation algorithms are shown in Fig. 5. The squared error
between the estimated trajectories and the true trajectory is
shown in Fig. 6. We can see that the two algorithms that in-
corporate semantic measurements outperform the traditional
SLAM algorithm. We also see that the estimation framework
with the smooth approximation of the likelihood function
and the estimation framework with the switch variables and
HMM-derived switch priors converge to very similar local
minima.
Since different noise on the odometry measurements will
contribute to different estimation results, we compare our es-
timation algorithms on a set of randomly generated odometry
measurements. This way, we can evaluate the performance
of the different algorithms over many different trials.
Fig. 6. The squared error between the true trajectory and the estimated trajectories from the three different estimation algorithms (landmarks only, likelihood, and switch with HMM), plotted against the iteration.
The comparison of the different estimation algorithms
is captured in Fig. 7 and Fig. 8. We see how the mean
of the squared error over the entire trajectory for all the
trials is notably smaller when the semantic measurements are
included, using either the likelihood algorithm or the two-
layer algorithm.
Fig. 7. The mean squared error over the trajectory path is computed for each of the different algorithms, and the results from the trials can be compared in this histogram. The likelihood estimation algorithm and the two-layer estimation algorithm perform similarly, and both perform significantly better than the algorithm without semantic measurements.
The estimation framework with the likelihood function
has more trials with very low average-squared errors but
is less consistent than the two-layer estimation framework,
since its distribution has higher variance. If we look at
the squared error of the final position estimate, which is
the value we are iteratively estimating, Fig. 8 shows that
the HMM algorithm performs marginally better than the
likelihood algorithm. Note that the HMM algorithm would show a clearer benefit in cases where there are more false positives (i.e., a worse object classifier), or when negative detection measurements are included.
One of the advantages of using the HMM formulation is to guarantee a higher certainty for the measurements corresponding to object detection events by leveraging the persistence of the semantic measurements as well as the vehicle dynamics. In this simulation, the sequence of events corresponding to the maximum likelihood given by the Viterbi algorithm was $s_0 = o_2$, $s_1 = o_3$, and $s_2 = o_1$, meaning object 2 was detected first, followed by object 3 and then object 1. The corresponding marginal probabilities of these object detection events are $p(s_0 = o_2 \mid Y_0) = 0.943$, $p(s_1 = o_3 \mid Y_1) = 0.994$, and $p(s_2 = o_1 \mid Y_2) = 0.997$. These certainty levels for the object detection events are much greater than the accuracy guaranteed by the $c^k_{11}$ element of the confusion matrix for each object detector, which was approximately 0.8 for each of the object classifiers in our simulation.
Fig. 8. For every trial corresponding to different odometry noise measurements, we use the three algorithms (landmarks only, likelihood, and switch with HMM) to estimate the trajectory, and the squared error of the final position estimate is computed. The final position estimate is marginally better using the two-layer estimation framework.
The effectiveness of semantic measurements will depend
on the frequency at which these objects are detected, but
when objects have been detected, the estimation architectures
proposed in this paper provide a robust way to incorporate
these semantic measurements.
IX. CONCLUSION AND FUTURE WORK
In this work we introduced a robust estimation frame-
work for incorporating probabilistic binary measurements,
which are used to model data from vision-based object
classification algorithms. We first introduced a formulation
for solving a maximum-likelihood problem with the semantic
measurement likelihood function modeled after the confusion
matrix of object classifiers. We then derived a two-part
estimation framework where the lower-layer is formulated as
a factor-graph estimation problem, with each measurement
corresponding to a state-constraining factor modeled after the
discrete likelihood function and a switchable constraint. We
also presented a higher-layer estimation framework that takes
into account measurement persistence and vehicle dynamics
to compute the certainty of sets of semantic measurements
corresponding to object detection events. These certainties
capture which measurements are false positives, and are used
to compute the switch priors in the lower-layer estimation
algorithm. The advantage of including the higher-layer esti-
mation framework is demonstrated in the presented numer-
ical simulation. We showed in simulation how the addition
of semantic measurements in this framework improves the
robustness and accuracy of the estimated trajectory.
In future work we will model the probabilistic likelihood
factors such that they account for a lower detection proba-
bility with increasing distance from the detected object. We
will investigate extensions to include 3-D object data that
incorporates height information, and test such algorithms on
aerial vehicles. Finally, we also hope to incorporate object