From Google Maps to a Fine-Grained Catalog of Street Trees

Steve Branson [a,1], Jan Dirk Wegner [b,1], David Hall [a], Nico Lang [b], Konrad Schindler [b], Pietro Perona [a]

[a] Computational Vision Laboratory, California Institute of Technology, USA
[b] Photogrammetry and Remote Sensing, ETH Zürich, Switzerland
Abstract

Up-to-date catalogs of the urban tree population are of importance for municipalities to monitor and improve quality of life in cities. Despite much research on the automation of tree mapping, mainly relying on dedicated airborne LiDAR or hyperspectral campaigns, tree detection and species recognition is still mostly done manually in practice. We present a fully automated tree detection and species recognition pipeline that can process thousands of trees within a few hours, using publicly available aerial and street view images of Google Maps™. These data provide rich information from different viewpoints and at different scales, from global tree shapes to bark textures. Our work-flow is built around a supervised classification that automatically learns the most discriminative features from thousands of trees and corresponding, publicly available tree inventory data. In addition, we introduce a change tracker that recognizes changes of individual trees at city scale, which is essential to keep an urban tree inventory up to date. The system takes street-level images of the same tree location at two different times and classifies the type of change (e.g., tree has been removed). Drawing on recent advances in computer vision and machine learning, we apply convolutional neural networks (CNNs) for all classification tasks. We propose the following pipeline: download all available panoramas and overhead images of an area of interest; detect trees per image and combine multi-view detections in a probabilistic framework, adding prior knowledge; recognize fine-grained species of detected trees. In a later, separate module, track trees over time, detect significant changes, and classify the type of change. We believe this is the first work to exploit publicly available image data for city-scale street tree detection, species recognition, and change tracking, exhaustively over several square kilometers, i.e., many thousands of trees. Experiments in the city of Pasadena, California, USA show that we can detect >70% of the street trees, assign correct species to >80% for 40 different species, and correctly detect and classify changes in >90% of the cases.
Keywords: deep learning, image interpretation, urban areas, street trees, very high resolution
[1] Joint first authorship.
1. Introduction

Urban forests in the USA alone contain around 3.8 billion trees (Nowak et al., 2002). A relatively small but prominent element of the urban forest is street trees. Street trees grow along public streets and are managed by cities and counties. The most recent estimate is that there are 9.1 million trees lining the streets of California, about one street tree for every 3.4 people [2] living in an urban area, with an estimated replacement value of $2.5 billion (McPherson et al., 2016). However, the greatest value of a street tree is not its replacement value but its ecosystem services value, i.e., all economic benefits that a tree provides for a community. These benefits include a reduction in energy use, improvements in air and water quality, increased carbon capture and storage, increased property values, and an improvement in individual and community wellbeing (Nowak et al., 2002; McPherson et al., 2016) [3]. Still, inventories are often lacking or outdated, due to the cost of surveying and monitoring the trees.

[2] A rough estimate for Europe is given in (Pauleit et al., 2005). The number of people per street tree varies strongly across European cities, between 10 and 48 inhabitants per street tree. However, it is unclear (and in fact unlikely) whether US and European census numbers rely on the same definitions.
[3] The most recent estimate of the ecosystem services value of the street trees in California is $1 billion per year or $111 per tree, respectively $29 per inhabitant.
We propose an automated, image-based system to build up-to-date tree inventories at
large scale, using publicly available aerial images and panoramas at street-level. The sys-
tem automatically detects trees from multiple views and recognizes their species. It draws
on recent advances in machine learning and computer vision, in particular deep learning
for object recognition (Krizhevsky et al., 2012; Szegedy et al., 2015; Simonyan and Zisser-
man, 2015; Girshick, 2015; Ren et al., 2015), fine-grained object categorization (Wah et al.,
2011; Angelova and Zhu, 2013; Branson et al., 2013; Deng et al., 2013; Duan et al., 2013;
Krause et al., 2014; Zhang et al., 2014b), and analysis of publicly available imagery at large
scale (Hays and Efros, 2008; Agarwal et al., 2009; Anguelov et al., 2010; Majdik et al.,
2013; Russakovsky et al., 2015). The method is built around a supervised classification that uses deep convolutional neural networks (CNNs) to learn tree identification and species classification from existing inventories.
Our method is motivated by TreeMapLA [4], which aims to build a publicly available tree inventory for the greater Los Angeles area. Its goal is to collect and combine already existing tree inventories acquired by professional arborists. In case no reasonably up-to-date data is available, which is often the case, a smartphone app is used to task users in a crowd-sourcing effort to fill in data gaps. Unfortunately, only few people (so-called citizen scientists) participate. Only a small number of trees, e.g., ≈1,000 out of more than 80,000 in Pasadena, have been mapped within the last 3 years. And often entries are incomplete (e.g., missing species, trunk diameter) or inaccurate (e.g., GPS position grossly wrong). It turns out that determining a tree's species is often the hardest and most discouraging part for citizen scientists. The average person does not know many species of tree, and even with tree identification tools, the prospect of choosing one option among tens or even hundreds is daunting.

[4] https://www.opentreemap.org/latreemap/map/
We propose to automate tree detection and species recognition with the help of publicly available street-level panoramas and very high-resolution aerial images. The hope is that such a system, which comes at virtually no cost and enables immediate inventory generation from scratch, will allow more cities to gain access to up-to-date tree inventories. This will help to ascertain the diversity of the urban forest by identifying tree species, a key determinant for urban forest management (e.g., if a pest arrives, an entire street could potentially lose its trees). Another benefit of a homogeneous inventory across large urban areas would be to fill in the gaps between neighboring municipalities and different agencies, allowing for more holistic, larger-scale urban forest planning and management. Each city's Tree Master Plan would no longer exist in a vacuum, but account for the fact that the urban forest, in larger metropolitan areas, spreads out across multiple cities and agencies.
Our system works as follows: It first downloads all available aerial images and street
view panoramas of a specified region from a repository, in our example implementation
Google Maps. A tree detector that distinguishes trees from all other scene parts and a tree
species classifier are separately trained on areas where ground truth is available. Often, a
limited, but reasonably recent tree inventory does exist nearby or can be generated, which
has similar scene layout and the same tree species. The trained detector predicts new trees
in all available images, and the detector predictions are projected from image space to
true geographic positions, where all individual detections are fused. We use a probabilistic
conditional random field (CRF) formulation to combine all detector scores and add further
(learned) priors to make results more robust against false detections. Finally, we recognize
species for all detected trees. Moreover, we introduce a change classifier that compares
images of individual trees acquired at two different points in time. This allows for automated
updating of tree inventories.
2. Related work

There has been a steady flow of research into automated tree mapping over the last decades. A multitude of works exist and a full review is beyond the scope of this paper (see, e.g., (Larsen et al., 2011; Kaartinen et al., 2012) for a detailed comparison of methods).

Tree delineation in forests is usually accomplished with airborne LiDAR data (Reitberger et al., 2009; Lähivaara et al., 2014; Zhang et al., 2014a) or a combination of LiDAR point clouds and aerial imagery (Qin et al., 2014; Paris and Bruzzone, 2015). LiDAR point clouds have the advantage of directly delivering height information, which is beneficial to tell apart single tree crowns in dense forests. On the downside, the acquisition of dense LiDAR point clouds requires dedicated, expensive flight campaigns. Alternatively, height information can be obtained through multi-view matching of high-resolution aerial images (Hirschmugl et al., 2007), but it is usually less accurate than LiDAR due to matching artefacts over forest.
Only a few studies attempt segmentation of individual trees from a single aerial image. Lafarge et al. (2010) propose marked point processes (MPPs) that fit circles to individual trees. This works quite well in planned plantations and forest stands with reasonably well-separated trees. However, MPPs are notoriously brittle and difficult to tune, and their inference methods, like simulated annealing or reversible jump Markov Chain Monte Carlo, are computationally expensive. Simpler approaches rely on template matching, hierarchies of heuristic rules, or scale-space analysis (see Larsen et al. (2011) for a comparison).
Tree detection in cities has gained attention since the early 2000s. Early methods for single tree delineation in cities were inspired by scale-space theory (initially also developed for forests by Brandtberg and Walter (1998)). A common strategy is to first segment
data into homogeneous regions, respectively 3D clusters in point clouds, and then classify
regions/clusters into tree or background, possibly followed by a refinement of the boundaries
with predefined tree shape priors or active contours. For example, Straub (2003) segments
aerial images and height models into consistent regions at multiple scales, then performs
refinement with active contours. Recent work in urban environments (Lafarge and Mallet,
2012) creates 3D city models from dense aerial LiDAR point clouds, and reconstructs not
only trees but also buildings and the ground surface. After an initial semantic segmentation
with a breakline-preserving MRF, 3D templates consisting of a cylindrical trunk and an
ellipsoidal crown are fitted to the data points. Similarly, tree trunks have been modeled as
cylinders also at smaller scales but higher resolution, using LiDAR point clouds acquired ei-
ther from UAVs (Jaakkola et al., 2010) or from terrestrial mobile mapping vehicles (Monnier
et al., 2012).
We are aware of only one recent approach for urban tree detection that, like our method, needs neither height information nor an infra-red channel. Yang et al. (2009) first
roughly classify aerial RGB images with a CRF into tree candidate regions and background.
Second, single tree templates are matched to candidate regions and, third, a hierarchical
rule set greedily selects best matches while minimizing overlap of adjacent templates. This
detection approach (tree species recognition is not addressed) is demonstrated on a limited
data set and it remains unclear whether it will scale to entire cities with strongly varying
tree shapes.
Tree species classification from remote sensing data uses either multi-spectral aerial (Leckie et al., 2005; Waser et al., 2011) or satellite images (Pu and Landry, 2012), hyperspectral data (Clark et al., 2005; Roth et al., 2015), dense (full-waveform) LiDAR point
clouds (Brandtberg, 2007; Yao et al., 2012), or a combination of LiDAR and multispectral
images (Heikkinen et al., 2011; Korpela et al., 2011; Heinzel and Koch, 2012). Methods
that rely on full-waveform LiDAR data exploit species-specific waveforms due to specific
penetration into the canopy, and thus different laser reflectance patterns, of different tree
species; whereas hyperspectral data delivers species-specific spectral patterns. Most works
follow the standard classification pipeline: extract a small set of texture and shape features
from images and/or LiDAR data, and train a classifier (e.g., Linear Discriminant Analysis,
Support Vector Machines) to distinguish between a small number of species (3 in (Leckie
et al., 2005; Heikkinen et al., 2011; Korpela et al., 2011), 4 in (Heinzel and Koch, 2012), 7
in (Waser et al., 2011; Pu and Landry, 2012)).
Most state-of-the-art remote sensing pipelines (except (Yang et al., 2009)) have in common that they need dedicated, expensive LiDAR, hyperspectral, or RGB-NIR imaging campaigns. They exploit the physical properties of these sensors like species-specific spectral
signatures, height distributions, or LiDAR waveforms. As a consequence, methods are hard
to generalize beyond a specific sensor configuration, data sets tend to be limited in size, and
temporal coverage is sparse. Tree detection and species ground truth has to be annotated
anew for each test site to train the classifier. It thus remains unclear if such methods can
scale beyond small test sites to have practical impact at large scale (of course, sampling de-
signs can be used, but these deliver only statistical information, not individual tree locations
and types).
An alternative to remote sensing is in-situ interpretation, or the gathering of tree leaves that are then matched to a reference database (Du et al., 2007; Kumar et al., 2012; Mouine et al., 2013; Goëau et al., 2013, 2014). Anyone can recognize particular plant species with smart-phone apps like Pl@ntNet (Goëau et al., 2013, 2014) and Leafsnap (Kumar et al., 2012), which are primarily meant to educate users about plants. Experience with a similar app to populate the web-based tree catalog opentreemap has shown that it is difficult to collect a homogeneous and complete inventory with in-situ measurements. Each tree must be visited by at least one person, which is very time consuming and expensive, and often the resulting data is incomplete (e.g., missing species) and also inaccurate when amateurs are employed.
In this paper we advocate relying solely on standard RGB imagery that is publicly available via online map services at world-wide scale (or from mapping agencies in some countries). We compensate for the lack of pixel-wise spectral or height signatures with the power of deep CNNs that can learn discriminative texture patterns directly from large amounts of training data. We use publicly available tree inventories as ground truth to train our models. The method presented in this paper extends the approach originally presented in Wegner et al. (2016) by integrating the detection and species recognition components into a single system. We further add a change tracking module, which utilizes a Siamese CNN architecture to compare individual tree states at two different times at city scale. This makes it possible to recognize newly planted trees as well as removed trees, so as to update existing tree inventories. Additionally, all steps of the pipeline are evaluated in detail on a large-scale dataset, and potential failure cases are thoroughly discussed.
3. Detection of trees

We detect trees in all available aerial and panorama images to automatically generate a catalog of geographic locations with corresponding species annotations. In this section we describe the detection component of the processing pipeline and our adaptations to include multiple views per object and map data [5].

We use the open source classification library Caffe [6] (Jia et al., 2014). Several of the most recent and best-performing CNN models build on Caffe, including Faster R-CNN (Ren et al., 2015), which forms the basis of our tree detector.

[5] We refer the reader to Appendix A for details on projections between images and geographic coordinates.
[6] http://caffe.berkeleyvision.org/
Figure 1: Multi-view detection. We begin with an input region (left image), where red dots show available street view locations. Per-view detectors are run in each image (top middle) and converted to a common geographic coordinate system. The combined proposals are converted back into each view (bottom middle), such that we can compute detection scores with known alignment between each view. Multi-view scores are combined with semantic map data and spatial reasoning to generate combined detections (right).
Faster R-CNN computes a set of region proposals and detection scores $R = \{(b_j, s_j)\}_{j=1}^{|R|}$ per test image $X$, where each $b_j = (x_j, y_j, w_j, h_j)$ is a bounding box and $s_j$ is the corresponding detection score over features extracted from that bounding box. A CNN is trained to both generate region proposals and compute detection scores for them. The method is a faster extension of the earlier R-CNN (Girshick et al., 2014) and Fast R-CNN (Girshick, 2015) methods. Note that we start training the tree detector from a Faster R-CNN version pre-trained on Pascal VOC (Everingham et al., 2010).
A minor complication to using conventional object detection methods is that our target outputs and training annotations are geographic coordinates (latitude/longitude), i.e., points rather than bounding boxes. A simple solution is to interpret boxes as regions of interest for feature extraction rather than as physical bounding boxes around an object. At training time we convert geographic coordinates to pixel coordinates using the appropriate projection function $P_v(\ell, c)$ and create boxes with size inversely proportional to the distance of the object to the camera. At test time, we convert the pixel location of the center of a bounding box back to geographic coordinates using $P_v^{-1}(\ell', c)$. Doing so makes it possible to train single-image detectors.
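To make this step concrete, the following illustrative Python sketch (not part of the original implementation) converts a geographic tree annotation into a training box whose size is inversely proportional to the camera-to-tree distance. The equirectangular stand-in for $P_v$, the pixel scale, and the base box size are assumptions made only so that the snippet runs on its own.

```python
import math

EARTH_RADIUS_M = 6378137.0

def project_to_pixels(lat, lng, cam_lat, cam_lng, px_per_meter=10.0,
                      img_w=1664, img_h=832):
    """Very rough stand-in for P_v: local metric offsets scaled to pixels."""
    dx = math.radians(lng - cam_lng) * EARTH_RADIUS_M * math.cos(math.radians(cam_lat))
    dy = math.radians(lat - cam_lat) * EARTH_RADIUS_M
    return img_w / 2 + dx * px_per_meter, img_h / 2 - dy * px_per_meter

def training_box(lat, lng, cam_lat, cam_lng, base_size_px=800.0):
    """Box centred on the projected tree, size inversely proportional to distance."""
    x, y = project_to_pixels(lat, lng, cam_lat, cam_lng)
    dx = math.radians(lng - cam_lng) * EARTH_RADIUS_M * math.cos(math.radians(cam_lat))
    dy = math.radians(lat - cam_lat) * EARTH_RADIUS_M
    dist_m = max(math.hypot(dx, dy), 1.0)            # avoid division by zero
    size = base_size_px / dist_m                     # closer trees get bigger boxes
    return (x - size / 2, y - size / 2, size, size)  # (x, y, w, h) as in b_j

print(training_box(34.1478, -118.1445, 34.1477, -118.1446))
```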
3.1. Multi-view detection

Establishing sparse (let alone dense) point-to-point correspondences between overhead aerial imagery and panoramas taken from a street view perspective is a complicated wide-baseline matching problem. Although promising approaches for ground-to-aerial image matching (of buildings) have recently emerged (e.g., (Shan et al., 2014; Lin et al., 2015)), finding correspondences for trees is an unsolved problem. We circumvent exact, point-wise correspondence search by combining tree detector outputs.
For a given test image $X$, the algorithm produces a set of region proposals and detection scores $R = \{(b_j, s_j)\}_{j=1}^{|R|}$, where each $b_j = (x_j, y_j, w_j, h_j)$ is a bounding box and $s_j = \mathrm{CNN}(X, b_j; \gamma)$ is a corresponding detection score over CNN features extracted from image $X$ at location $b_j$. The region proposals can be understood as a short list of bounding boxes that might contain valid detections. In standard practice, a second stage thresholds the detection scores and applies non-maximum suppression to remove overlapping detections. Naively performing this full detection pipeline is problematic when combining multiple views (i.e., aerial and street views): bounding box locations in one view are not directly comparable to another, and problems occur when an object is detected in one view but not the other. We therefore do the following (Fig. 1):
1. Run the Faster R-CNN detector with a liberal detection threshold to compute an over-complete set of region proposals $R_v$ per view $v$.
2. Project the detections of each view to geographic coordinates, $\{P_v^{-1}(\ell_{vj}, c_v)\}_{j=1}^{|R_v|}$, with $\ell_{vj}$ the pixel location of the $j$-th region center.
3. Collect all detections in geographic coordinates in the multi-view region proposal set $R$ by taking the union of all individual view proposals $R_v$.
4. Project each detection region $\ell_k$ back into each view $v$ (with image coordinates $P_v(\ell_k, c)$).
5. Evaluate all detection scores in each view $v$ of the combined multi-view proposal set $R$.
6. Compute a combined detection score over all views, apply a detection threshold $\tau_2$, and suppress overlapping regions to obtain detections in geographic coordinates.
This work-flow is robust to detections initially missed in single views, because it collects all individual detections in geographic space and projects them back into all views, i.e., scores are also evaluated in views where nothing was detected in the first place.
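The following Python sketch (illustrative only, not the original code) mirrors the control flow of the six steps above with stand-ins for the detector scores and the projections $P_v$ / $P_v^{-1}$; all function names, the combination rule, and the suppression radius are assumptions.

```python
import math

def geo_dist_m(lat1, lng1, lat2, lng2):
    """Small-offset metric distance (equirectangular approximation)."""
    r = 6378137.0
    dx = math.radians(lng2 - lng1) * r * math.cos(math.radians(lat1))
    dy = math.radians(lat2 - lat1) * r
    return math.hypot(dx, dy)

def fuse_multi_view(per_view_proposals, score_fn, combine=sum, tau=0.0, min_sep_m=2.0):
    """per_view_proposals: {view_id: [(lat, lng), ...]} already in geographic coords."""
    # Steps 2-3: union of all single-view proposals in geographic space
    candidates = [p for props in per_view_proposals.values() for p in props]

    # Steps 4-6: re-score every candidate in every view and combine the scores
    scored = []
    for lat, lng in candidates:
        scores = [score_fn(view, lat, lng) for view in per_view_proposals]
        scored.append((combine(scores), lat, lng))

    # Greedy suppression of near-duplicate geographic locations
    kept = []
    for s, lat, lng in sorted(scored, reverse=True):
        if s < tau:
            break
        if all(geo_dist_m(lat, lng, k[1], k[2]) > min_sep_m for k in kept):
            kept.append((s, lat, lng))
    return kept

# Toy usage with a dummy scorer that prefers a fixed location
dummy = lambda view, lat, lng: -geo_dist_m(lat, lng, 34.0, -118.0)
print(fuse_multi_view({"aerial": [(34.0, -118.0)], "sv_4": [(34.00002, -118.0)]},
                      dummy, tau=-10.0))
```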
3.2. Probabilistic model

Simply summing the detection scores per tree seems inadequate, because some views may provide more reliable information than others. In this section we describe how we combine and weight detections in a probabilistic framework that is also capable of including prior information.
Conditional Random Fields (CRFs) provide a discriminative, probabilistic framework to meaningfully integrate different sources of information. They allow for the construction of expressive prior terms (e.g., (Kohli et al., 2009; Krähenbühl and Koltun, 2011; Wegner et al., 2015)) to model long-range contextual object knowledge that cannot be learned from local pixel neighborhoods alone. Here, we formulate a CRF to combine tree detections from multiple views and to add further knowledge about typical distances from roads and the spacing of adjacent trees. More formally, we aim at choosing the optimal set of tree objects $T$, based on evidence from aerial imagery, street view imagery, semantic map data (e.g., the location of roads), and the spatial context of neighboring objects (each $t_i \in T$ represents a metric object location in geographic coordinates):
\log p(T) = \sum_{t \in T} \Big( k_1 \underbrace{\Psi(t, \mathrm{av}(t); \gamma)}_{\text{aerial view image}} + k_2 \sum_{s \in \mathrm{sv}(t)} \underbrace{\Phi(t, s; \delta)}_{\text{street view images}} + k_3 \underbrace{\Lambda(t, T; \alpha)}_{\text{spatial context}} + k_4 \underbrace{\Omega(t, \mathrm{mv}(t); \beta)}_{\text{map image}} \Big) - Z    (1)
where $\mathrm{lat}(t)$, $\mathrm{lng}(t)$ are shorthand for the latitude and longitude of $t$; $\mathrm{av}(t)$, $\mathrm{mv}(t)$ are the aerial and map view image IDs where tree $t$ is visible, and $\mathrm{sv}(t)$ is the street view image set ID that contains tree $t$ (with associated meta-data for the camera position). Potential functions $\Psi(\cdot)$ and $\Phi(\cdot)$ represent detection scores from aerial and street view images, whereas $\Lambda(\cdot)$ and $\Omega(\cdot)$ encode prior knowledge. Parameters $\alpha$, $\beta$, $\delta$, $\gamma$ of the individual potentials are learned from training data, whereas the scalars $k_1$, $k_2$, $k_3$, $k_4$ that weight each potential term separately are trained on validation data (more details in Sec. 3.3). $Z$ is a normalization constant that turns the overall scores into probabilities.
The aerial view potential $\Psi(t, \mathrm{av}(t); \gamma)$ is the detection score evaluated at the aerial image $X(\mathrm{av}(t))$:

\Psi(t, \mathrm{av}(t); \gamma) = \mathrm{CNN}\big(X(\mathrm{av}(t)), P_{\mathrm{av}}(t); \gamma\big)    (2)

where $P_{\mathrm{av}}(t)$ transforms between pixel locations in the image and geographic coordinates, and $\gamma$ represents all weights of the aerial detection CNN learned from training data.
The street view potential $\Phi(t, s; \delta)$ for a street view image $X(s)$ is

\Phi(t, s; \delta) = \mathrm{CNN}\big(X(s), P_{\mathrm{sv}}(t, c(s)); \delta\big)    (3)

where $P_{\mathrm{sv}}(t, c)$ (defined in Eq. A.5) projects a pixel location in the image to geographic coordinates. We empirically found that simply taking the closest street view image per tree worked best to evaluate detections [7]. $\delta$ encodes all weights of the street view detection CNN learned from training data.

[7] A straightforward combination of detections of the same tree in multiple street view images performed worse, probably due to occlusions. A more elaborate approach would be to weight detections inversely to the distance between acquisition position and tree candidate, for example. We leave this for future work.
The spatial context potential $\Lambda(t, T; \alpha)$ encodes prior knowledge about the spacing of adjacent trees along the road. Two different trees can hardly grow at exactly the same location, and even extremely closely located trees are rather unlikely, because one would obstruct the other from the sunlight. The large majority of trees along roads have been artificially planted by the city administration, which results in relatively regular tree intervals parallel to the road. We formulate this prior on the spacing between trees as an additional potential, where the distribution of distances between neighboring objects is learned from the training data set:

\Lambda(t, T; \alpha) = \alpha \cdot Q_s\big(d_s(t, T)\big)    (4)

where $d_s(t, T) = \min_{t' \in T} \|t - t'\|_2$ is the distance to the closest neighboring tree. $Q_s(d_s(t, T))$ is a quantized version of $d_s$, i.e., a vector in which each element is 1 if $d_s$ lies within a given distance range and 0 otherwise. We then learn a vector of weights $\alpha$ (Fig. 2, top), where each element $\alpha_i$ can be interpreted as the likelihood that the closest object is within the corresponding distance range.
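As an illustration of Eq. 4, the sketch below (not from the original implementation) quantizes the nearest-neighbor distance into a one-hot bin vector and takes the dot product with a weight vector $\alpha$; the bin edges follow Fig. 2 (top), while the $\alpha$ values are invented for the demo.

```python
import numpy as np

BIN_EDGES_M = [0.25, 0.5, 1, 2, 4, 8, 16, 32, 64]    # 10 bins incl. ">64 m"

def quantize(dist_m, edges=BIN_EDGES_M):
    q = np.zeros(len(edges) + 1)
    q[np.searchsorted(edges, dist_m)] = 1.0           # one-hot bin indicator
    return q

def spatial_potential(tree, other_trees, alpha):
    """Lambda(t, T; alpha) = alpha . Q_s(d_s(t, T)) for metric 2-D positions."""
    others = np.asarray(other_trees, dtype=float)
    d_s = np.min(np.linalg.norm(others - np.asarray(tree, dtype=float), axis=1))
    return float(alpha @ quantize(d_s))

# Made-up weights: penalize very close and very far nearest neighbours
alpha = np.array([-5, -3, -1, 0.5, 1.0, 1.0, 0.5, 0.2, -1, -3], dtype=float)
print(spatial_potential((10.0, 0.0), [(2.0, 0.0), (25.0, 0.0)], alpha))  # d_s = 8 m
```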
A second contextual prior $\Omega(t, \mathrm{mv}(t); \beta)$, based on map data, models the common knowledge that trees rarely grow in the middle of a road but are usually planted alongside it at a fixed distance. We compute the distance of each pixel to the closest road based on downloaded maps [8]. We formulate the spatial prior potential as

\Omega(t, \mathrm{mv}(t); \beta) = \beta \cdot Q_m\big(d_m(t)\big)    (5)

where $d_m(t)$ is the distance in meters between a tree $t$ and the closest road and, similar to the spatial context term $\Lambda(t, T; \alpha)$, the function $Q_m(\cdot)$ quantizes this distance into a histogram. The weight vector $\beta$ is learned from the training data set and its entries can be viewed as likelihoods of tree-to-road distances (Fig. 2, center).

[8] Roads in Google maps are white. Morphological opening removes other small symbols with grey-scale value 255, and morphological closing removes text written on roads.
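The following sketch (again illustrative, with made-up thresholds and pixel size) shows one way the distance-to-road feature $d_m(t)$ could be derived from a rendered map tile along the lines of footnote [8], using morphological opening/closing and a Euclidean distance transform from SciPy.

```python
import numpy as np
from scipy import ndimage

def distance_to_road(map_tile_gray, meters_per_pixel=0.6):
    road = map_tile_gray >= 250                        # near-white pixels = roads
    road = ndimage.binary_opening(road, iterations=1)  # drop isolated white symbols
    road = ndimage.binary_closing(road, iterations=1)  # close gaps from text on roads
    # The EDT measures the distance to the nearest zero, so invert the road mask
    dist_px = ndimage.distance_transform_edt(~road)
    return dist_px * meters_per_pixel                  # metres to the closest road

tile = np.zeros((64, 64), dtype=np.uint8)
tile[:, 30:34] = 255                                   # a vertical "road"
print(distance_to_road(tile)[32, 0])                   # ~18 m from column 0
```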
3.3. Training and inference of the full model

Inspired by (Shotton et al., 2006), we use piecewise training to learn the CRF parameters $\alpha^*, \beta^*, \delta^*, \gamma^* = \arg\max_{\alpha, \beta, \delta, \gamma} \log p(T)$ that maximize Eq. 1 (with $T$ the set of objects in our training set). The Pasadena data set is subdivided into training, validation, and test sets. We first learn the parameters of each potential term separately, optimizing conditional probabilities:

\gamma^* = \arg\max_{\gamma} \sum_{t \in D_t} \log p(t \mid \mathrm{av}(t)), \qquad \log p(t \mid \mathrm{av}(t)) = \Psi(t, \mathrm{av}(t); \gamma) - Z_3    (6)

\delta^* = \arg\max_{\delta} \sum_{t \in D_t} \sum_{s \in \mathrm{sv}(t)} \log p(t \mid s), \qquad \log p(t \mid s) = \Phi(t, s; \delta) - Z_4    (7)

\alpha^* = \arg\max_{\alpha} \sum_{t \in D_t} \log p(t \mid T), \qquad \log p(t \mid T) = \Lambda(t, T; \alpha) - Z_1    (8)

\beta^* = \arg\max_{\beta} \sum_{t \in D_t} \log p(t \mid \mathrm{mv}(t)), \qquad \log p(t \mid \mathrm{mv}(t)) = \Omega(t, \mathrm{mv}(t); \beta) - Z_2    (9)

where the normalization terms $Z_{1 \ldots 4}$ are computed for each training example individually to make the probabilities sum to 1. Note that the first two problems (Eq. 6 & 7) match the learning problems used in Faster R-CNN training (which optimizes a log-logistic loss), and the last two are simple logistic regression problems (Eq. 8 & 9).
Figure 2: Visualization of the learned spatial context parameters (top), map potential parameters (center), and (bottom) the scalars $k_1$ (Aerial), $k_2$ (Street View), $k_3$ (Spatial), $k_4$ (Map) for combining the detection CRF potentials (Eq. 1).
Next, we fix $\alpha, \beta, \delta, \gamma$ and use the validation set to learn the scalars $k_1, k_2, k_3, k_4$ (Eq. 1) that weight each potential term separately. Here, we optimize the detection loss (measured in terms of average precision) induced by our greedy inference algorithm. This allows us to learn a combination of the different sources of information while optimizing a discriminative loss. We iteratively select each scalar $k_i$ using brute-force search.

In Figure 2 we visualize components of the learned model. The first histogram shows the learned weights $\alpha$ of the spatial context potential. Intuitively, we see that the model most strongly penalizes trees that are closer than 2 m or further than 32 m from the nearest tree. The second histogram shows the learned map weights $\beta$: the model penalizes trees that are too close ($<$0.25 m) or too far ($>$8 m) from the road. The last histogram shows the learned weights $k_1, k_2, k_3, k_4$ of the CRF potential terms; these match the earlier finding that street view and aerial images are most important.
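To illustrate the scalar search, the sketch below (not the original code) performs a coordinate-wise brute-force sweep over a fixed grid; the grid, the number of sweeps, and the dummy validation objective are assumptions standing in for running greedy inference and measuring average precision.

```python
import numpy as np

def coordinate_search(val_ap, n_terms=4, grid=np.linspace(0.0, 2.0, 21), sweeps=3):
    k = np.ones(n_terms)                       # start from equal weights
    for _ in range(sweeps):                    # repeat until (roughly) stable
        for i in range(n_terms):
            scores = []
            for v in grid:                     # brute force over one scalar at a time
                trial = k.copy()
                trial[i] = v
                scores.append(val_ap(trial))
            k[i] = grid[int(np.argmax(scores))]
    return k

# Dummy validation objective peaking at k = (1.2, 1.6, 0.4, 0.6)
target = np.array([1.2, 1.6, 0.4, 0.6])
val_ap = lambda k: -np.sum((k - target) ** 2)
print(coordinate_search(val_ap))               # -> approximately the target
```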
At test time, the trained model is applied to previously unseen data. The aim is to select the subset of tree detections $T^* = \arg\max_T \log p(T)$ that maximizes Eq. 1, which is in general a challenging (NP-hard) problem because all possible combinations would have to be tried. Here, we resort to a greedy approach that begins with $T = \emptyset$ and iteratively appends the new candidate detection

t' = \arg\max_t \log p(T \cup t)    (10)

until no tree candidate remains that would increase $\log p(T)$. Despite being greedy, this procedure is quite efficient, because we compute the combined detection score $\Omega(t, \mathrm{mv}(t); \beta) + \Psi(t, \mathrm{av}(t); \gamma) + \sum_{s \in \mathrm{sv}(t)} \Phi(t, s; \delta)$ only once for each location $t$ in the combined multi-view region proposal set $R$. That is, we can pre-compute three out of four potentials and only have to update the spatial term $\Lambda(t, T; \alpha)$ every time we add a new detection $t'$. Note that, in a probabilistic interpretation, the greedy accumulation is known to approximate non-maximum suppression (Blaschko, 2011; Sun and Batra, 2015), with known approximation guarantees for some choices of $\Lambda(\cdot)$.
4. Tree species classification

Tree species recognition can be viewed as an instance of fine-grained object recognition, which has recently received a lot of attention in computer vision (Wah et al., 2011; Angelova and Zhu, 2013; Branson et al., 2013; Deng et al., 2013; Duan et al., 2013; Krause et al., 2014; Zhang et al., 2014b; Russakovsky et al., 2015). Fine-grained object recognition extends traditional object detectors (e.g., birds versus background) to distinguish subclasses of a particular object category (e.g., species of birds (Branson et al., 2013)). In this sense, given detections of tree objects, our goal is to predict their fine-grained species. For this purpose we again use a CNN.
In contrast to our first work on this topic (Wegner et al., 2016), we no longer run two separate detection and species recognition modules [9], but directly recognize species at previously detected tree locations. More precisely, given the tree detections, we download one aerial image and three cropped regions at increasing zoom levels of the closest street-view panorama for each tree. Examples of image sets for four different tree species are shown in Fig. 5. Here, we use the VGG16 network (Simonyan and Zisserman, 2015), in contrast to (Wegner et al., 2016), which used the shallower and less powerful GoogLeNet (Szegedy et al., 2015).

Four VGG16 CNNs are trained, separately per view and zoom level, to extract features for each image. We follow the standard procedure and deploy a log-logistic loss function and stochastic gradient descent for training, where model parameters pre-trained on ImageNet (Russakovsky et al., 2015) are refined using our tree data set. The fully-connected layers of each CNN are discarded, and the features of all four views' networks are concatenated into one feature vector per tree to train a linear SVM.

[9] In (Wegner et al., 2016), tree species recognition was demonstrated for ground-truth tree locations.
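For illustration, the sketch below reproduces the flow of this stage with PyTorch/torchvision and scikit-learn as stand-ins for the Caffe pipeline actually used; the random weights, input sizes, and toy training data are assumptions, and in practice each branch would be initialized from ImageNet and fine-tuned on the tree data before feature extraction.

```python
import numpy as np
import torch
from torchvision import models
from sklearn.svm import LinearSVC

def make_extractor():
    vgg = models.vgg16()                       # fully-connected layers are not used
    vgg.eval()
    def extract(img_tensor):                   # img_tensor: (3, 224, 224)
        with torch.no_grad():
            x = vgg.features(img_tensor.unsqueeze(0))
            x = vgg.avgpool(x)
        return torch.flatten(x, 1).squeeze(0).numpy()
    return extract

views = ["aerial", "sv40", "sv80", "sv110"]    # one CNN per view / zoom level
extractors = {v: make_extractor() for v in views}

def tree_feature(images_per_view):
    """Concatenate per-view CNN features into one descriptor per tree."""
    return np.concatenate([extractors[v](images_per_view[v]) for v in views])

# Toy training run on random data (2 trees, 2 species) just to show the flow
torch.manual_seed(0)
X = np.stack([tree_feature({v: torch.rand(3, 224, 224) for v in views})
              for _ in range(2)])
svm = LinearSVC().fit(X, [0, 1])               # linear SVM on concatenated features
print(svm.predict(X))
```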
5. Tree change tracking

The simplest way to track changes would be to run the detection framework twice and compare the results. However, preliminary tests showed that much manual filtering was still necessary, for two main reasons:

• Inaccuracies in the heading measurements of street-view panoramas, together with the accumulated detection inaccuracies of two detector runs, lead to the same tree often being mapped to two slightly different positions (by ≈1 m to 8 m). This in turn leads to many situations wrongly detected as changes.

• The revisit frequency of the street-view mapping car is very inhomogeneous across the city. Main roads are mapped frequently whereas side roads have only very few images in total. Fig. 3 shows the revisit frequency of the Google street view car in Pasadena as a heat map. As a consequence, it seems impossible to just run the detector for the entire scene at two different, exact points in time. A better strategy is to compare each individual tree anytime new street-view data is acquired.
We use another CNN for change tracking per individual tree, a so-called Siamese CNN. A
Siamese architecture consists of two or more identical CNN branches that extract features
separately from multiple input images. The branches share their weights, which implements
the assumption that the inputs have identical statistics, and at the same time significantly
reduces the number of free, learnable parameters. Two main strategies exist for similarity
assessment via Siamese CNNs. The first passes the extracted features to a contrastive loss (Hadsell et al., 2006), which forces the network to learn a feature representation where similar samples are projected to close locations in feature space whereas samples that show significantly different appearance are projected far apart (e.g., (Lin et al., 2015; Simo-Serra et al., 2015)). The second strategy uses Siamese CNNs as feature extraction branches that are followed by fully-connected layers to classify similarity (e.g., (Han et al., 2015; Altwaijry et al., 2016)). We follow this change-classifier strategy because it best fits our problem of classifying change types of individual trees at different points in time. The network branches are based on the VGG16 architecture (Simonyan and Zisserman, 2015). We use VGG16 branches because we can then start from our tree species classifier, which amounts to a
pre-trained model already adapted to the visual appearance of street trees. Naturally, the
top three fully-connected layers of the pre-trained VGG16 network are discarded, retaining
only the convolutional and pooling layers as “feature extractor”. All features per network
branch are concatenated and passed to the change classifier module, consisting of three fully-
connected layers. The output of the top layer represents the class scores, which are passed
through a SoftMax to get class probabilities. An illustration of our Siamese CNN architec-
ture is shown in Figure 4. Starting from the pre-trained species classifier, the Siamese CNN
is trained end-to-end for tree change classification. We use standard stochastic gradient
descent in order to minimize the SoftMax cross-entropy loss. Batch-size is set to 16, im-
plying that 16 image pairs are drawn per iteration, after having shuffled the data such that
successive data samples within one batch never belong to a single class, so as to maximize
the information gain per iteration.
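As an illustration of Fig. 4, the following PyTorch sketch (not the original Caffe implementation) builds two weight-sharing VGG16 convolutional branches whose concatenated features feed a three-layer fully-connected change classifier; the hidden-layer sizes and the random initialization are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class SiameseChangeNet(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        vgg = models.vgg16()                      # in practice: species-classifier weights
        self.branch = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten())
        self.classifier = nn.Sequential(          # change classifier head
            nn.Linear(2 * 512 * 7 * 7, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, n_classes),            # class scores (pre-SoftMax)
        )

    def forward(self, img_t0, img_t1):
        f0 = self.branch(img_t0)                  # shared weights: same module twice
        f1 = self.branch(img_t1)
        return self.classifier(torch.cat([f0, f1], dim=1))

model = SiameseChangeNet()
loss_fn = nn.CrossEntropyLoss()                   # SoftMax cross-entropy loss
x0, x1 = torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224)
logits = model(x0, x1)
print(logits.shape, loss_fn(logits, torch.tensor([0, 2])).item())
```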
Figure 3: Temporal street view coverage of Pasadena with the number of images taken per 50 m² cell between 2006 and 2016. Empty cells are shown in white.
Note that this first version of the tree change tracker needs given geographic positions in order to compare images per tree location. It can already be useful in its present form if geographic coordinates per tree are available from GIS data; the Siamese approach can then be applied directly.
6. Experiments

We evaluate our method on the Pasadena Urban Trees data set (Sec. 6.1). First, we describe our evaluation strategy and performance measures. Then, we report tree detection, tree species classification, and tree change classification performance separately [10].

[10] Interactive demos of results for a part of Pasadena and the greater Los Angeles area (>1 million trees detected and species recognized) can be explored on our project website http://www.vision.caltech.edu/registree/.
Figure 4: Siamese CNN network architecture: two weight-sharing feature extraction branches (convolution and max-pooling layers) whose concatenated feature vectors are passed to a fully-connected change classifier producing class scores, trained with a SoftMax cross-entropy loss.
6.1. Data and Test Site
We densely download all relevant, publicly available aerial and street view images (and maps) from Google Maps within a given area. As test region we choose Pasadena, because a reasonably recent inventory with annotated species is available from 2013. The image data are from October 2014 (street view) and March 2015 (aerial images), respectively. The Pasadena tree inventory is made available as a kml-file that contains rich information (geographic position, genus, species, trunk diameter, street address, etc.) for >140 different tree species and about 80,000 trees in total in the city of Pasadena, California, USA. We found this to be a very valuable data source to serve as ground truth. The original inventory does not only contain trees but also potential planting sites, shrubs, etc.; we filtered the data set such that only trees remain. Densely downloading all images of Pasadena results in 46,321 street-level panoramas of size 1664 × 832 px (and corresponding meta data and sensor locations), 28,678 aerial image tiles of size 256 × 256 px (at ≈15 cm resolution), and 28,678 map tiles of size 256 × 256 px, i.e., over 100,000 images in total for our test region in the city of Pasadena.
Panorama images are downloaded at medium resolution for the detection task, to speed up download and processing. A limitation of the Pasadena inventory is that it only includes trees on public ground, which we estimate constitute only ≈20% of all trees in Pasadena. For training the tree detector we therefore crowd-source labels for all trees (including those on private ground) in a subset of 1,000 aerial images and 1,000 street view panoramas, via Amazon Mechanical Turk™. Note that we follow the standard approach and apply zero-mean centering and normalization to all CNN input data sets (i.e., also for species recognition and change classification). More precisely, we first subtract the mean of all samples and normalize with the standard deviation of the training data. As a consequence, the values of the pre-processed training images range from 0 to 1, whereas values of the validation and test data sets may slightly exceed that range, but still have almost the same scale.
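A minimal sketch of this normalization, assuming image arrays in memory (the statistics are computed on the training split only and then applied unchanged to validation and test data):

```python
import numpy as np

def fit_normalizer(train_images):
    mean = train_images.mean()                 # statistics from the training split only
    std = train_images.std()
    return lambda x: (x - mean) / std          # applied to train, validation, and test

train = np.random.randint(0, 256, size=(100, 64, 64, 3)).astype(np.float32)
test = np.random.randint(0, 256, size=(10, 64, 64, 3)).astype(np.float32)
normalize = fit_normalizer(train)
print(normalize(train).mean().round(3), normalize(test).min().round(3))
```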
For species recognition of detected trees, we use four images per tree: one aerial image and three street-level images at different resolution (zoom) levels. When requesting street view data for a specific location, the Google server selects the closest panorama and automatically sets the heading towards the given location. Various parameters can be used to select particular images, the most important ones being the zoom level (i.e., level of detail versus area covered) and the pitch. Preliminary tests showed that the pitch could be fixed to 20 degrees, which corresponds to a slightly upward looking camera with respect to the horizontal plane (as measured during image acquisition by the Google camera car); this makes sense for flat ground and trees higher than the camera rig. In order to gather as many characteristic and discriminative appearance properties per tree as possible, street view imagery was downloaded at three different zoom levels (40, 80, 110) from the same point of view. This ensures that usually the entire tree (i.e., its overall, characteristic shape) as well as smaller details like branch structures, bark texture, and sometimes even leaves can be recognized. See example images for four different trees in Fig. 5 and for the 20 most dominant species in Fig. 6.
Figure 5: Four images of the same tree (row-wise, top to bottom: American Sweet Gum, Bottle Tree, California Fan Palm, Canary Island Date Palm) automatically downloaded from the Google servers when queried with a specific geographic coordinate. From left to right: aerial image, zoom levels 40, 80, and 110 of the corresponding Google street view panorama.
Figure 6: Example street-view images and total number of instances in our dataset for the 20 most frequent tree species: Mex. F. Palm (2533), Camph. Tree (2250), Live Oak (1653), Holly Oak (1558), South. Magn. (1284), Ca. I. D. Palm (593), Bottle Tree (566), Cal. Fan Palm (522), Ind. Laur. Fig (335), Chinese Elm (330), Jacaranda (315), Carob (314), Brush Cherry (313), Brisbane Box (309), Carrotwood (305), Italian Cypress (270), Date Palm (170), Shamel Ash (166), Fern Pine (160), Am. Sweetgum (155).
For tree change classification we download two images per tree, acquired in 2011 and 2016, respectively. Both sets were processed with the detection algorithm. We then cut out an image patch at the medium zoom level 80, as a compromise between level of detail and global tree shape, at each location in both image sets. We do this in both directions of time, i.e., we detect trees in the 2011 panoramas and extract image patches for 2011 and 2016, and we detect trees in the 2016 panoramas and extract patches for 2011 and 2016, suppressing double entries. This procedure ensures that we obtain image pairs showing both removed and newly planted trees, which show up at only one of the two times, while also minimising the loss due to omission errors of the detector. We manually went through 11,000 image pairs to come up with a useful subset for experiments. Changes are relatively rare; the large majority of trees remains unchanged. From all manually checked image pairs we generate a balanced data subset of 479 image pairs with the three categories unchanged tree (200 samples), new tree (131), and removed tree (148).
6.2. Evaluation strategy

We randomly split the data into 70% training, 20% validation, and 10% test subsets for the experiments.

Evaluation of the tree detector uses precision and recall to assess detection performance by comparing results with the ground truth. Precision is the fraction of all detected trees that match corresponding entries in the ground truth. Recall is the fraction of ground truth trees detected by our method. Precision-recall curves are generated by sweeping through possible thresholds of the detection score, where each detection and each ground truth tree can be matched at most once. More precisely, if two detected trees were to match the same ground truth tree equally well, only one would be counted as a true positive, the other as a false positive. To summarize the performance of each tree detection variant in a single number, we report mean average precision (mAP) over all levels of recall [11].

[11] The standard measure used in the VOC Pascal Detection Challenge (Everingham et al., 2010).
We count a tree detection within a 4 meter radius of a ground truth tree as a true positive. This may seem high, but discussions with arborists showed that many tree inventories today, at least in the US, do not come with geographic coordinates but with less accurate street addresses. The major requirement in terms of positioning accuracy is that two neighboring trees can be distinguished in the field. Under these circumstances a 4 meter buffer seemed the most reasonable choice. Empirically, most of our detections (≈70%) are within 2 m of the ground truth, in part due to limited GPS and geo-coding accuracy.
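For illustration, the following sketch (not the original evaluation code) implements such a greedy one-to-one matching within a 4 m radius and derives precision and recall from it; the score-ordered matching rule is an assumption consistent with the description above.

```python
import math

def evaluate(detections, ground_truth, radius_m=4.0):
    """detections: [(score, x, y)] in metric coords; ground_truth: [(x, y)]."""
    used = set()
    tp = 0
    for score, x, y in sorted(detections, reverse=True):   # highest score first
        best, best_d = None, radius_m
        for j, (gx, gy) in enumerate(ground_truth):
            d = math.hypot(x - gx, y - gy)
            if j not in used and d <= best_d:
                best, best_d = j, d
        if best is not None:                                # match at most once
            used.add(best)
            tp += 1
    fp = len(detections) - tp
    fn = len(ground_truth) - tp
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return precision, recall

dets = [(0.9, 0.0, 0.0), (0.8, 1.0, 0.0), (0.5, 50.0, 0.0)]  # two near one GT tree
print(evaluate(dets, [(0.5, 0.0), (49.0, 0.0)]))             # -> (0.667, 1.0)
```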
For species classification we report dataset precision and average class precision. Dataset precision measures the global hit rate across the entire test set, regardless of the number of instances per tree species. Consequently, the species that occur most frequently dominate this measure. In contrast, average class precision first evaluates precision separately per species and then returns the average of all separate values, i.e., it accounts for the long-tailed distribution of tree instances per species. For the tree change classification experiments we report full confusion matrices as well as mean overall accuracy (mOA).
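The two species-classification measures can be illustrated with the short sketch below (toy data, not from our experiments): dataset precision is the overall hit rate, while average class precision averages per-species hit rates and thus weights rare species equally.

```python
import numpy as np

def dataset_and_class_precision(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    dataset_prec = np.mean(y_true == y_pred)                       # global hit rate
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(dataset_prec), float(np.mean(per_class))          # (dataset, class-avg)

# Imbalanced toy example: 8 trees of species 0, 2 trees of species 1
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]            # all of species 0 correct, half of species 1
print(dataset_and_class_precision(y_true, y_pred))   # -> (0.9, 0.75)
```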
Figure 7: Precision-recall curves for tree detection: the full model (0.706 AP) compared to reduced models that are each missing one of the terms (No Spatial 0.690 AP, No Map 0.667 AP, No Aerial 0.619 AP, No Streetview 0.462 AP, No CRF Learning 0.660 AP) and to detectors using only aerial images (0.420 AP) or only street-level panoramas (0.581 AP).
6.3. Tree detection results
We plot precision-recall curves of our results in Figure 7. It turns out that the combination of multiple views significantly improves results. A pure aerial image detector that applies the Faster R-CNN detection system (Ren et al., 2015) achieves only 0.42 mAP (dotted rose curve in Fig. 7), compared to 0.706 mAP for our full model (green curve in Fig. 7). Visual inspection of the detections reveals that aerial images are very useful for identifying trees, but less so for accurately determining the location of the (occluded) trunks. Detections from aerial images can at best assume that a tree trunk is situated below the crown center, which might often not be the case. Particularly in cases of dense canopy, where individual trees can hardly be told apart from overhead views alone, this implicit assumption becomes invalid and results deteriorate. This situation changes when using street level panoramas, because tree trunks can be seen directly in the images. A pure street view detector (dotted gray curve in Fig. 7) that combines multiple street views achieves better performance (0.581 mAP) than the aerial detector. This baseline omits all non-street-view terms from Eq. 1 and replaces the spatial context potential (Eq. 4) with a much simpler local non-maximum suppression that forbids neighboring objects to be closer than $\tau_{\mathrm{nms}}$:

\Lambda_{\mathrm{nms}}(t, T; \alpha) = \begin{cases} -\infty & \text{if } d_s(t, T) < \tau_{\mathrm{nms}} \\ 0 & \text{otherwise} \end{cases}    (11)
In contrast to this simpler approach with a hard threshold, the learned approach has the
advantage that it softly penalizes trees from being too close. It also learns that it is highly
unlikely that a certain tree is completely isolated from all other trees.
We also find that in many cases detections evaluated as false positives are in fact trees,
but located on private land near the road. The tree inventory of public street trees that we
use as ground truth does not include these trees, which consequently count as false positives
if detected by our system. Adding a GIS layer to the system that specifies public/private land boundaries would help discard tree detections on private land from the evaluation. We
view this as an important next step, and estimate that performance of the full model would
increase by at least 5% in mAP.
To verify the impact of each potential term on the detection results, we run lesion studies in which individual parts of the full model are left out. Results of this experiment are shown in Figure 7. "No Aerial", "No Streetview", and "No Map" remove the corresponding potential term from the full model in Eq. 1. "No Spatial" replaces the learned spatial context term (Eq. 4) with the more conventional non-maximum suppression term (Eq. 11) introduced earlier in this section. We see the biggest loss in performance if we drop street view images (0.706 → 0.462 mAP) or aerial images (0.706 → 0.619 mAP). Dropping the map term results in a smaller drop in performance (0.706 → 0.667 mAP). Replacing the learned spatial context potential with non-maximum suppression results in only a small drop (0.706 → 0.690 mAP). For each lesioned version of the system we re-learn an appropriate weight for each potential function on the validation set. The method "No CRF Learning" shows results if we use the full model but omit learning these scaling factors and set them all to 1 (a 0.706 → 0.660 mAP drop).
6.3.1. Detection Error Analysis

In this subsection we present a detailed analysis of the full detection system. We manually inspect a subset of 520 error cases to analyze the most frequent failure reasons (see the summary in Tab. 1). Examples of several typical failure cases are shown in Figures 8-11. In the top row of each figure, the first column shows the input region, with blue circles representing the locations of available street view cameras. The 2nd column shows results and error analysis of our full detection system, with combined aerial, street view, and map images and spatial context. Here, true positives are shown in green, false positives in red, and false negatives in magenta. The 3rd column shows single-view detection results using just aerial images. The bottom two rows show two selected street view images; the images are numbered according to their corresponding blue circle in the 1st row, 1st column. The 2nd row shows single-view detection results using just street view images. The bottom row visualizes the same results and error analysis as the 1st row, 2nd column, with the numbers in the center of each box matching across views.

We identify 8 main categories that can explain all 520 errors, and manually assign each error to one of the categories. Table 1 details the failure reasons, the total number of errors per category, the percentage out of the 520 investigated cases, and references to examples in Figures 8-11. We also include a likely explanation for each failure case in the image captions. Each example is denoted by a letter and a number, where the letter denotes the figure and the number denotes the bounding box.
Error name: description (number of errors, % of the 520 inspected cases, detection examples)

Private tree: Detection corresponds to a real tree. The tree is on private land, whereas the inventory only includes trees on public land, resulting in a false positive. (56, 10.8%, F10, F11)

Missing tree: A tree on public land appears to be missing from the inventory, which is older than the Google imagery (results in a false positive). Usually a recently planted tree. (39, 7.5%, F7, N10, N11)

Extra tree: An extra tree appears in the inventory. Often the case if a tree has been cut down since the inventory. (66, 12.7%, -)

Telephone pole: False positive because a telephone pole or lamp post resembles the trunk of a tree. This usually happens when foliage of a nearby tree also appears near the pole. (49, 9.4%, B6, B10, B11, F6, F9, F14)

Duplicate detection: A single tree is detected as 2 trees. (19, 3.6%, B7, B9, F13)

Localization error: A detected tree is close to ground truth, but not within the necessary distance threshold, resulting in a false positive and a false negative. This usually happens when the camera position and heading associated with a Google street view panorama are slightly noisy. Another reason are inaccurate GPS positions in the inventory. (40, 7.7%, N7/N6, N8/N1)

Occluding object: A tree is occluded (e.g., by a car or truck) in street view, resulting in a false positive or an error localizing the base of the trunk. (120, 23.1%, F3)

Other false negatives: An existing and clearly visible tree (included in the inventory) remains undetected. Often happens for very small, recently planted trees. Another reason is if the closest street view panorama is rather far away from a tree, which then appears small and blurry at low resolution in the images. (131, 25.2%, -)

Table 1: Analysis of a subset of 520 detection errors. Detection examples are shown overlaid on images in Figures 8-11.
For example, B3 corresponds to bounding box 3 in example B (see the numbered boxes in Figure 9).
We note that at least 31% of measured errors (Private Tree, Missing Tree, Extra Tree)