Scaling Object Recognition: Benchmark of Current State of the Art Techniques
Mohamed Aly (1)    Peter Welinder (1)    Mario Munich (2)    Pietro Perona (1)

(1) Computational Vision Group, Caltech, Pasadena, CA 91106
    vision.caltech.edu    {malaa, welinder, perona}@caltech.edu

(2) Evolution Robotics, Pasadena, CA 91106
    www.evolution.com    mario@evolution.com
Abstract
Scaling from hundreds to millions of objects is the next challenge in visual recognition. We investigate and benchmark the scalability properties (memory requirements, run-time, recognition performance) of the state-of-the-art object recognition techniques: the forest of k-d trees, the locality sensitive hashing (LSH) method, and the approximate clustering procedure with the tf-idf inverted index. The characterization of the images was performed with SIFT features. We conduct experiments on two new datasets of more than 100,000 images each, and quantify the performance using artificial and natural deformations. We analyze the results and point out the pitfalls of each of the compared methodologies, suggesting potential new research avenues for the field.
1. Introduction
Techniques for visual recognition of individual objects
have come of age in the past few years bringing to life com-
mercial deployments of high performance, reliable recogni-
tion systems that are able to recognize hundreds, sometimes
thousands, of objects. Since the seminal work on SIFT fea-
tures by David Lowe [11], there has been much research
activity in the area of feature detectors and descriptors with
embedded invariance to image transformations. Systems
based on SIFT have already been commercialized [13] in products that successfully use object recognition in real-life
conditions. Most recently, a number of companies have started using object recognition in mobile search, see [bzhjkd]^1.
If one wished, say, to build a system to recognize any building of interest to tourists around the world or any CD cover ever published, one would have to deal with at least millions of images. Therefore, it is easy to conceive of applications that require recognition of 10^6 or 10^8 individual objects. Some researchers have already acknowledged the fact that current techniques can only handle up to a few thousand objects. The challenge is then to scale up these techniques by three to five orders of magnitude. Kd-trees are used canonically to search for approximate nearest neighbors [12]. Indyk [7] approximates nearest neighbor search with a set of efficient hash functions. Nistér and Stewénius [14] propose hierarchical clustering of descriptors to produce a vocabulary which is scored with the vector space model. Philbin et al. [15] improve on this method by replacing the hierarchical vocabulary with a flat k-means clustering. These different methods all attempt to tackle scalability by trading off memory usage and time complexity. Some approaches propose clustering vocabularies to save memory by not having to store all descriptors. Others propose simpler matching models, such as the vector space model, to reduce complexity.

^1 We use tinyurl.com throughout the paper, thus [bzhjkd] means http://tinyurl.com/bzhjkd
We are interested in exploring the computational cost
and the precision-recall performance of these methods as
the database size increases. We explore why, and with what tradeoffs, performance decreases under different recognition models. We have three contributions: (a) we collected two large databases of images for testing the scaling of recognition algorithms, which we will make public as a benchmark; (b) we developed a number of performance metrics that analyze both the end-to-end performance and the per-module performance of recognition algorithms; (c) we measured and compared the performance of the most promising current approaches to scalable recognition, using off-the-shelf implementations when possible.
2. Related Work
Scaling object recognition to millions of objects has a
dual in the problem of specific object retrieval in large
image collections. Indeed, many techniques developed to
solve one of the problems may be applied successfully to
the other. This section summarizes the latest research in
these fields.
Many effective object retrieval approaches are based on
techniques developed in the textual information retrieval community. The idea was introduced by Sivic et al. [18], and later improved on by [14, 17, 15, 3, 16]. In particular, the importance of large vocabularies of visual words, the analog of textual words, for retrieval performance was studied by [14]. When dealing with vocabularies of 1 million words, exact clustering methods such as K-means become infeasible, as their computational cost is O(N · K) for K words, where N is the size of the training set. Philbin et al. [15] improved upon the hierarchical K-means scheme used by Nistér and Stewénius [14] by using a Kd-tree based clustering algorithm. Vocabularies built taking into account the information gain of features in the vocabulary construction perform better [17]. Further gains in classification rates can be achieved by considering the contextual distance metric proposed by [9], which adapts the metric for comparing bag-of-words vectors to better discriminate between database images. The problem with the latter method is that it requires pre-computing the nearest neighbors of all images in the database, which takes O(N^2) time for large databases. Recent work [9, 16] has shown that assigning many words to a visual feature, so-called soft word assignment, can further improve retrieval performance.
A related class of techniques uses hashing to group similar images in large image databases [4, 19]. Here the whole image is represented by a single short descriptor, similar to the bag-of-words representation. The hashing techniques then become an alternative for finding nearest neighbors in the image database. Although still in the early stages, hashing-based algorithms might be very important for recognition in large databases given their O(N^c), c < 1, search time, whereas current methods such as the popular vector space search have O(N) search time.
As shown in this section, there is a deep connection between object recognition and object retrieval in image collections. However, it is important to reiterate that the emphasis of this paper is to benchmark the best techniques developed for both problems specifically on the object recognition task with large databases.
3. Datasets
We used three testing scenarios, each consisting of a model set and a probe set. Each model set consists of a large number of objects, any of which can be recognized. Each probe set consists of a smaller set of additional images of objects in the model set, which are used to index into the model set. Probe sets are used to benchmark recognition performance. We collected two model sets and three probe sets.
3.1. CD/DVD Covers Dataset
The first model set is a set of 197,311 medium resolution (640 × 480) CD/DVD covers from [kgaxg]. The set included 1,979 covers of software packages, 11,444 game covers, 102,353 movie and TV show covers, and 81,535 music record covers. A peculiarity of this dataset is that it contains duplicates and highly similar images, e.g. covers of the same game on different consoles or different language versions of the same movie, see Fig. 1. We pruned the duplicates from the dataset to avoid having competing matches when trying to recognize an input probe image. Duplicates were eliminated (see below), yielding the final dataset of 132,380 unique images.
Fig. 2 explains the database cleaning algorithm. We used SIFT [11] features^2, and a set of 4 randomized Kd-trees [10] as the dictionary. The thresholds were determined empirically for the dataset, and were set to T_m = T_a = 100 features, where T_m is the minimum number of matching features an image must have to be further considered, and T_a is the maximum number of matching features an image must have to be considered unique. The dictionary keeps only a set of T_d = 50 unique images, making use of the fact that the images were in alphabetical order and that we only need to match images to the previous images already in the dictionary. If a new image matches one in the dictionary, it is marked as similar; otherwise it is considered a unique image and is added to the list of unique images. Figure 1 shows examples of similar images found by the pruning algorithm.
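To make the bookkeeping concrete, the following is a condensed sketch of the cleaning loop of Fig. 2. It is our paraphrase rather than the original code: OpenCV's brute-force matcher stands in for the randomized Kd-trees used as the dictionary, and the helper names (prune_duplicates, ransac_inliers) are ours.

```python
import cv2
import numpy as np

T_m, T_a, T_d = 100, 100, 50   # thresholds from the text

def ransac_inliers(kp_q, kp_t, matches):
    # number of matches consistent with an affine transform (RANSAC verification)
    if len(matches) < 3:
        return 0
    src = np.float32([kp_q[m.queryIdx].pt for m in matches])
    dst = np.float32([kp_t[m.trainIdx].pt for m in matches])
    _, inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)
    return 0 if inliers is None else int(inliers.sum())

def prune_duplicates(features):
    # features: list of (keypoints, descriptors), one entry per image, in dataset order
    bf = cv2.BFMatcher(cv2.NORM_L2)
    unique, dictionary = [], []          # dictionary: indices of the last T_d unique images
    for i, (kp_i, des_i) in enumerate(features):
        scores = {}
        for j in dictionary:
            kp_j, des_j = features[j]
            matches = bf.match(des_i, des_j)           # m_j before verification
            if len(matches) >= T_m:
                scores[j] = ransac_inliers(kp_i, kp_j, matches)
        best = max(scores, key=scores.get) if scores else None
        if best is None or scores[best] < T_a:
            unique.append(i)                            # image i is unique
            dictionary = (dictionary + [i])[-T_d:]      # keep only the most recent T_d
        # else: image i is marked as similar to image `best` and dropped
    return unique
```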
We used two probe sets with the model set above. The first probe set was obtained by applying 5 synthetic transformations to 100 randomly chosen images from the model set, which gives a set of 500 probe images. Having the ground truth transformations allows us to quantify the performance of the nearest neighbor algorithms, see Fig. 8. The transformations are: 1) identity (the image without any change, to test retrieval performance), 2) subsampling to half the size, 3) rotation and shear using the affine transformation defined by ([0.5, -0.87]^T, [0.5, 0.87]^T), 4) rotation and shear using the affine transformation defined by ([0.87, -0.5]^T, [0.87, 0.5]^T), and 5) adding salt-and-pepper noise with a noise density of 0.1.
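For concreteness, the following sketch shows how such a probe set could be regenerated with OpenCV and NumPy. It is our reconstruction of the five transformations listed above (the warp matrices encode the two column vectors given in the text), not the authors' script.

```python
import cv2
import numpy as np

def synthetic_probes(img):
    # returns the five probe variants of a model image described above
    h, w = img.shape[:2]
    A1 = np.float32([[0.5,  0.5, 0], [-0.87, 0.87, 0]])   # columns [0.5, -0.87]^T and [0.5, 0.87]^T
    A2 = np.float32([[0.87, 0.87, 0], [-0.5,  0.5, 0]])   # columns [0.87, -0.5]^T and [0.87, 0.5]^T
    noisy = img.copy()
    r = np.random.rand(h, w)
    noisy[r < 0.05] = 0                                    # pepper
    noisy[r > 0.95] = 255                                  # salt (total density 0.1)
    return [img,                                           # 1) identity
            cv2.resize(img, (w // 2, h // 2)),             # 2) subsample to half size
            cv2.warpAffine(img, A1, (w, h)),               # 3) rotation and shear
            cv2.warpAffine(img, A2, (w, h)),               # 4) rotation and shear
            noisy]                                         # 5) salt-and-pepper noise
```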
The second probe set was obtained by selecting photographs of 97 CDs from the dataset used in [14], available from [ddoht2]. This dataset has 4 medium resolution photos of each object taken from different viewpoints, providing 4 × 97 = 388 probe images in total.

^2 A. Vedaldi's implementation: [argwtg]
Figure 1. Examples of images from the CD/DVD dataset. The first two rows show examples of similar images in the dataset: two similar images of a 007 game on Xbox and PlayStation 2 (first row), and an English and a French version of the movie Ratatouille (second row). The third row shows a CD cover in the dataset (left) and its photographed image (right) from the dataset of [14].
3.2. Pasadena Buildings
A second model-probe set was based on 750 photos of facades of 103 houses in the Pasadena area and 22 buildings on the Caltech campus. Each building was photographed six times: three times in the afternoon, and three times in the morning of the next day. Thus, the lighting conditions vary between the two sets. Each time, the buildings were photographed from the front, and from the left and right at approximately 30°. For increased generality, the photographs were taken with two different digital cameras. For each building, one image was added to the model database (frontal view, afternoon), and the remaining five were used as its test images, which gives us a total of 125 × 5 = 625 images in the probe set.
Extract local features from all images
foreach image i do
  • Set m_j = 0 for all images j in the dictionary
  • foreach feature in image i: get the matching feature f_i and its image label l_i from the dictionary, and update m_j = Σ_i δ(l_i, j)
  • foreach image j with m_j ≥ T_m: perform RANSAC affine verification and update m_j with the number of features consistent with the affine transform
  • Get m = max_k {m_k} and j = argmax_k {m_k}
  • if m < T_a then add image i to the dictionary, else mark image i as similar to image j
  • if size(dictionary) ≥ T_d then remove the earliest images from the dictionary and rebuild
Figure 2. Dataset Cleaning Algorithm
Figure 3. Small subset of the Pasadena buildings dataset. Each row shows a different building. Photos in the first two columns were taken in the afternoon, while photos in the last column were taken the next morning with a different camera.
In order to test the algorithms' performance on large database sizes, we diluted the Pasadena buildings model set by adding to it 99,599 photos downloaded from Flickr by searching for the 145 most popular tags. All photos were downsampled to a size of 640 × 480 pixels.
4. Recognition Methods
We focus on three methods that fall under two broad approaches to object recognition. The first, which we call the SIFT/Match/Hough/RANSAC pipeline, is the one proposed by Lowe [11], which represents each image by a set of SIFT features extracted at interest points. During training, the features extracted from all the images in the training set (one per object) are placed in a database. During recognition, the features of the test image are extracted, and for each such feature the database is searched for its (approximate) nearest neighbors. Potential object match candidates are checked for spatial consistency by using the Generalized Hough Transform followed by a RANSAC affine fitting. The database image with the maximum number of inliers is considered the matching image. We consider two methods for building a database that supports fast approximate nearest neighbors: Kd-trees and Locality Sensitive Hashing (LSH).
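As an illustration, here is a minimal sketch of this pipeline built from OpenCV primitives. It is not the code benchmarked in this paper: simple per-image vote counting stands in for the Generalized Hough Transform, and an OpenCV FLANN matcher stands in for the feature database; all function names and parameter values are ours.

```python
import cv2
import numpy as np
from collections import defaultdict

sift = cv2.SIFT_create()
matcher = cv2.FlannBasedMatcher(dict(algorithm=1, trees=4),   # FLANN kd-forest index
                                dict(checks=512))
db_keypoints = []

def add_model_image(path):
    # training: index the model image's SIFT descriptors
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    kp, des = sift.detectAndCompute(img, None)
    db_keypoints.append(kp)
    matcher.add([des])

def recognize(path, min_matches=10):
    # recognition: approximate NN per feature, vote per model image, verify with RANSAC
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    kp, des = sift.detectAndCompute(img, None)
    votes = defaultdict(list)                    # model id -> list of (test point, model point)
    for m in matcher.match(des):                 # nearest neighbor over all indexed images
        votes[m.imgIdx].append((kp[m.queryIdx].pt, db_keypoints[m.imgIdx][m.trainIdx].pt))
    best, best_inliers = None, 0
    for label, pairs in votes.items():
        if len(pairs) < min_matches:
            continue
        src = np.float32([p for _, p in pairs])  # model image points
        dst = np.float32([p for p, _ in pairs])  # test image points
        _, inl = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)
        n = 0 if inl is None else int(inl.sum())
        if n > best_inliers:                     # image with the most inliers wins
            best, best_inliers = label, n
    return best, best_inliers
```

In a real run one would call add_model_image for every model image and then matcher.train() once before calling recognize on each probe image.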
The second approach is the bag-of-words approach, which we call the SIFT/Quantize/Rank pipeline [14, 15]. Here SIFT features are extracted and quantized into a codebook of visual words. Each image is represented by a histogram of word occurrences. Given a test image, its features are extracted and quantized using the codebook computed during training, and its histogram is used to compute
its similarity to all images in the database, which are thus
ranked. Next we explain these methods in more detail.
4.1. Kd-tree and Kd-forest
Exhaustive nearest neighbor search scales linearly with the number of features in the database, and the run time becomes unacceptable when the size of the database exceeds millions of features. Building the Kd-tree scales as O(dN), where N is the number of features in the database and d is the number of dimensions (there are 2N − 1 nodes in the tree with N leaves). Memory requirements scale as O(N), as we need to store split information at every node. However, searching through the Kd-tree scales as O(log N), where log N is the depth of the tree.
Kd-trees work best in low dimensions [6] (up to 10 rather
than SIFT’s 128), so following [11] we use an approximate
version called Best-Bin-First Kd-tree [2] (we just call it Kd-
tree hereinafter for simplicity), where promising branches
are placed in a heap-based priority queue, and only a cer-
tain number of branches popped off the top of the queue are
processed. It is approximate because it is not guaranteed to return the exact nearest neighbor in the database; however, it provides a speed-up of several orders of magnitude over exhaustive search while producing an acceptable number of false matches [11].
The number of false matches can be reduced by using a set of Kd-trees [10] (a Kd-forest^3). In order to build a Kd-forest, the individual Kd-trees are randomized at each step when choosing a dimension upon which to split the data, so that a random dimension among those with the highest variance is chosen, rather than always picking the dimension with maximum variance (as is the case with a single Kd-tree). When searching the Kd-forest, there is a single priority queue into which branches from all the trees are pushed. This improves performance considerably by decreasing the chance of missing the true nearest neighbor.

^3 We use Kd-tree and Kd-forest interchangeably in the paper, and the distinction should be clear from context.
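The randomization just described can be sketched in a few lines. The following is our illustrative reading of [10], not the implementation used in the experiments; the leaf size and the number of candidate dimensions are arbitrary choices.

```python
import numpy as np

def build_randomized_kdtree(points, indices=None, n_candidate_dims=5, leaf_size=16, rng=None):
    # each node splits on a dimension drawn at random from the top-variance dimensions,
    # instead of always taking the single highest-variance dimension
    rng = np.random.default_rng() if rng is None else rng
    indices = np.arange(len(points)) if indices is None else indices
    if len(indices) <= leaf_size:
        return {"leaf": indices}                      # store point ids at the leaf
    var = points[indices].var(axis=0)
    candidates = np.argsort(var)[-n_candidate_dims:]  # top-variance dimensions
    dim = int(rng.choice(candidates))                 # randomized split dimension
    values = points[indices, dim]
    threshold = float(np.median(values))
    left = indices[values <= threshold]
    right = indices[values > threshold]
    if len(left) == 0 or len(right) == 0:             # degenerate split, stop here
        return {"leaf": indices}
    return {"dim": dim, "threshold": threshold,
            "left": build_randomized_kdtree(points, left, n_candidate_dims, leaf_size, rng),
            "right": build_randomized_kdtree(points, right, n_candidate_dims, leaf_size, rng)}

# A forest is simply several such trees over the same data; at query time a single
# best-bin-first priority queue (not shown) holds unexplored branches from all trees.
descriptors = np.random.rand(1000, 128).astype(np.float32)   # stand-in for SIFT features
forest = [build_randomized_kdtree(descriptors) for _ in range(5)]
```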
4.2. Locality Sensitive Hashing
The key idea of the LSH approximate nearest neighbor
(NN) algorithm is to construct a set of hash functions such
that the probability of nearby points being close after trans-
formation with the hash function is larger than the proba-
bility of two distant points being close after the same trans-
formation. The range space of the function is discretized
into buckets and we say that there is a ‘collision’ when two
points end up in the same bucket. LSH has been shown to
work well on high-dimensional datasets, and has a query
time that scales sub-linearly with the dataset size under cer-
tain conditions [7].
A locality sensitive family H of hash functions is defined such that for any two points p, q ∈ R^d and h ∈ H chosen uniformly at random, Pr[h(q) = h(p)] ≥ P_1 if ‖p − q‖ ≤ R, where R and P_1 are parameters specific to the application and dataset. In order to achieve the desired collision probability, we choose L functions g_j(q) = (h_{1,j}(q), ..., h_{k,j}(q)), where the h_{t,j} (1 ≤ t ≤ k, 1 ≤ j ≤ L) are chosen uniformly at random from H [7]. The parameters k and L are chosen to minimize the search time and maximize the collision probability of nearby points while minimizing collisions of distant points. We have omitted some of the details of the definitions; see [5] for a precise definition and review of LSH.
Given a query point q, all points stored in the buckets g_1(q), ..., g_L(q) are retrieved. In the final step, the distance between q and all the points in all the buckets is computed to determine the NN of q. This scheme could in the worst case make the query time grow like O(N).
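To make the preceding definitions concrete, here is a small self-contained sketch of an l2 LSH index of this form: elementary hashes h(x) = ⌊(a·x + b)/w⌋ with Gaussian a and uniform b, and L tables keyed by the k-tuples g_j. The class name and parameter values are illustrative, not the ones tuned in §4.2.

```python
import numpy as np
from collections import defaultdict

class L2LSH:
    def __init__(self, dim, k=10, L=20, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.standard_normal((L, k, dim))      # random projection directions
        self.b = rng.uniform(0.0, w, size=(L, k))      # random offsets
        self.w = w
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = None

    def _keys(self, x):
        # one k-tuple bucket key per table: g_j(x) = (h_1(x), ..., h_k(x))
        return [tuple(np.floor((self.a[j] @ x + self.b[j]) / self.w).astype(int))
                for j in range(len(self.tables))]

    def index(self, points):
        self.points = np.asarray(points, dtype=np.float64)
        for i, x in enumerate(self.points):
            for table, key in zip(self.tables, self._keys(x)):
                table[key].append(i)

    def query(self, q, radius=None):
        # gather all colliding points, then check exact distances (the final LSH step)
        candidates = set()
        for table, key in zip(self.tables, self._keys(q)):
            candidates.update(table.get(key, []))
        if not candidates:
            return None
        ids = np.fromiter(candidates, dtype=int)
        d = np.linalg.norm(self.points[ids] - q, axis=1)
        best = int(ids[np.argmin(d)])
        return None if (radius is not None and d.min() > radius) else best
```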
We use the E2LSH implementation available from [1], which measures distances according to the l_2 norm. For optimization reasons, that implementation defines a parameter m such that L = m · (m − 1)/2. We tuned m and k by hand on a separate training set to keep the average search time per query point low (on the order of 1 ms), even for large datasets, while keeping the collision probability for nearby points as high as possible.
It is crucial to use a training set that resembles the an-
ticipated test set when tuning the parameters in LSH. This
is particularly important when using LSH to find NNs of
SIFT descriptors, as we show in the following paragraphs.
The final step of the LSH pipeline is to use RANSAC to fit an affine transformation from the database image to the test image, and then keep only the descriptors in the test image with NNs in the database image that are consistent with this transformation. Therefore, we focus on tuning the parameters of LSH to find true NNs for descriptors in test images, i.e. descriptors which survive the RANSAC spatial verification step. The task thus becomes to separate the set of all features, F_t, in the test image into two subsets: features that have the potential of surviving RANSAC, F_t+, and all other features, F_t- (that may still be matched to the correct image in the database, but will not survive RANSAC). The NNs of F_t+ and F_t- in the database image are denoted by F_d+ and F_d- respectively. We optimize LSH to find accurate NNs for features in F_t+.
The two sets F_t+ and F_t- are created by taking a test image and its corresponding database image and finding the NNs of the test image features in the database image. After the NNs are found, we apply RANSAC to fit an affine transformation, H. Using H we back-project the database features to the test image and prune all the features which are inconsistent with the transformation.
[Figure 4: probability density functions of descriptor distances for (a) verified points (Q) and (b) pruned points (B): NN in the ground-truth image, NN in the database D, and the overall distance distribution.]
Figure 4. Distance distributions for SIFT descriptors in Q (left) and B (right). The curves show the distance distributions to the NNs in the ground truth database image (black), in the whole database D (green), and the overall distance to all points (red). The distance distribution from the pruned points has been inset as a blue curve in the left plot to show how the shapes differ (see text for analysis).
The features that remain (those that "survived" RANSAC) and their corresponding NNs in the test image make up F_d+ and F_t+, while all other features belong to F_d- or F_t-. We repeat the procedure on ~100 pairs of images, and merge all the F_t+'s into a query set Q. Similarly, we take all the features from the F_t-'s and call them B. We take all the features from all the database images (whether or not they survived RANSAC), and call them the database set, D. Fig. 4a (black curve) shows the distances of descriptors in Q to the NNs in their corresponding database images. The green curve shows the distances from descriptors in Q to their NNs in the database, D. The red curve shows the overall distance distribution from Q to all points in D, be they NNs or not. Fig. 4b shows the corresponding curves for features in B. Fig. 4 shows that features in Q have NNs in the matching database image at a shorter distance than features in B (black curves). Thus, LSH should be tuned to retrieve NNs only if they are "close enough" to the query point; otherwise the NN will probably not survive the final RANSAC step anyway.
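A small sketch of this pruning step is given below, under the assumption that the affine fit H (here a 2×3 matrix mapping database-image coordinates into the test image) has already been estimated with RANSAC and that a reprojection threshold of a few pixels decides survival; the function name and threshold are ours.

```python
import numpy as np

def split_survivors(test_pts, db_pts, H, reproj_thresh=5.0):
    # test_pts[i] is the test-image location whose NN in the database image is db_pts[i]
    db_h = np.hstack([db_pts, np.ones((len(db_pts), 1))])   # homogeneous coordinates
    projected = db_h @ H.T                                   # back-project into the test image
    err = np.linalg.norm(projected - test_pts, axis=1)
    survived = err < reproj_thresh
    F_t_plus, F_t_minus = test_pts[survived], test_pts[~survived]
    return F_t_plus, F_t_minus
```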
Fig. 5 highlights the differences in performance when using descriptors from Q and B when querying D. Fig. 5a shows the percentage of query points for which LSH found a NN, as the parameter m is varied. The next plot shows that even though NNs are found for a similar fraction of the query points in the two query sets, the points in Q consistently return higher quality neighbors (i.e. true nearest neighbors).
We picked R by examining the curves in Fig. 4, and optimized the other parameters to minimize the query time while keeping a high collision probability for points in Q. We found R = 250, m = 5, k = 30 to yield good results.
4.3. Bag-of-Words Search
Textual information retrieval algorithms such as the vec-
tor space search method were introduced in image retrieval
by [18, 14].

[Figure 5: percentage of query points with reported NNs (a) and percentage of correct NNs (b), as the parameter m is varied, for the query sets Q and B.]
Figure 5. Importance of an appropriate training set when tuning LSH. The black and red curves show results for points in Q and B respectively. (a) The number of NNs reported, as a fraction of the size of the query set, as m is increased. (b) The percentage of the reported NNs that are also exact nearest neighbors, for the same data as used in (a).

In this scheme, each image in the database
is represented as a sparse vector of “visual words” occur-
rences. The bag-of-words vector space method was further
improved by [15] and shown to work well on photos of Ox-
ford buildings. We implement the method outlined by [15],
with minor modifications, as described briefly below.
In [14, 15] it was found that vocabularies on the order of 10^5 or more words are necessary for good retrieval performance. Thus, our vocabularies are built using approximate K-means (AKM), which scales like O(N log K) and was shown to give the best results in [15]. The approximate nearest neighbor search is performed using the publicly available FLANN package [12] with a forest of 8 Kd-trees and 512 checks. The number of features used for training the codebooks with 10^4 and 5 × 10^5 codewords was 5 and 20 million respectively.
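A compact sketch of AKM under these assumptions: each iteration re-indexes the current centers and assigns every descriptor to its nearest center through that index, so assignment costs roughly O(N log K) instead of O(N · K). SciPy's cKDTree stands in here for the randomized Kd-forest of [15]; its queries are exact, whereas AKM uses approximate ones.

```python
import numpy as np
from scipy.spatial import cKDTree

def approximate_kmeans(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(np.float64)
    for _ in range(n_iter):
        tree = cKDTree(centers)            # rebuild the center index every iteration
        _, assign = tree.query(X, k=1)     # nearest center id for every descriptor
        for j in range(k):                 # recompute means; empty clusters keep old center
            members = X[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, assign
```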
Each image in the database is represented as a normalized histogram of its visual word occurrences using the tf-idf weighting scheme, which downweights frequently occurring, less discriminative words [18]. Test images are compared to all images in the index using the dot product, measuring the cosine similarity between the query and database vectors.
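A sketch of this weighting and scoring step (illustrative, not the exact implementation benchmarked here): counts of quantized visual words become idf-weighted, L2-normalized sparse vectors, and ranking a query against the database is a single sparse matrix-vector product.

```python
import numpy as np
from scipy.sparse import csr_matrix

def bow_histograms(word_ids_per_image, vocab_size):
    # word_ids_per_image: one array of quantized descriptor ids per image
    rows, cols, vals = [], [], []
    for i, words in enumerate(word_ids_per_image):
        ids, counts = np.unique(words, return_counts=True)
        rows.extend([i] * len(ids)); cols.extend(ids); vals.extend(counts)
    return csr_matrix((vals, (rows, cols)), shape=(len(word_ids_per_image), vocab_size))

def tfidf(counts):
    n_images = counts.shape[0]
    df = np.asarray((counts > 0).sum(axis=0)).ravel()           # document frequency per word
    idf = np.log(n_images / np.maximum(df, 1))
    weighted = counts.multiply(idf).tocsr()                     # tf * idf
    norms = np.sqrt(np.asarray(weighted.multiply(weighted).sum(axis=1)).ravel())
    inv = 1.0 / np.maximum(norms, 1e-12)
    return weighted.multiply(inv[:, None]).tocsr(), idf

# db, idf = tfidf(bow_histograms(db_word_ids, vocab_size))
# A query histogram is weighted with the same idf and normalized the same way;
# scores = db @ q.T then ranks all database images by cosine similarity.
```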
Enhanced retrieval performance may be achieved by a spatial, RANSAC-based re-ranking of the M most similar database images [15]. We have purposely omitted this step here, since its purpose is not to re-rank, but to find the true positive amongst the top M results from the vector space search. Because M is an application-specific parameter that represents a tradeoff between classification accuracy and computational efficiency, we will mention it only briefly in this paper.
Although a query expansion method can improve perfor-
mance in image retrieval [3], it does not help in the object
recognition setting this study is concerned with. This is be-
cause we assume that there is only one example image per
object in the database, and so there are no near-identical ex-
amples to expand the query against.
                     Model     #feat          Probe          #ims
Scenario 1 (§3.1)    CD/DVD    ~10^8          Synthetic      500
Scenario 2 (§3.1)    CD/DVD    ~10^8          Photographed   388
Scenario 3 (§3.2)    Flickr    ~0.7 × 10^8    Buildings      625
Table 1. The three testing scenarios investigated. The third column lists the total number of features in the model set, and the fifth column lists the number of images in the probe set.
The big advantage of the vector space indexing method is its efficient memory storage. Because of the quantization into visual words, the storage requirement in RAM is O(N_img · F), where F is the average number of features per database image and N_img is the number of images in the database. This is completely independent of the dimensionality of the feature vectors, unlike the requirements for the Kd-forest and LSH algorithms, which use O(N_img · F · d) memory, where d is the dimensionality of the descriptor (d = 128 in the case of SIFT).
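A back-of-the-envelope comparison makes the difference concrete; the per-feature byte counts and the dataset size below are illustrative assumptions, not measurements from our experiments.

```python
n_img, F, d = 100_000, 1000, 128             # images, avg. features/image, SIFT dimensions
raw_descriptors = n_img * F * d * 1          # Kd-forest / LSH: ~1 byte per descriptor dimension
inverted_index  = n_img * F * 4              # bag-of-words: ~one 4-byte word id per feature
print(f"descriptor storage: {raw_descriptors / 2**30:.1f} GiB")   # ~11.9 GiB
print(f"inverted index:     {inverted_index  / 2**30:.2f} GiB")   # ~0.37 GiB
```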
5. Experiments
5.1. Setup
We performed extensive experiments comparing the three methods of §4.1-4.3 on the two datasets described in §3, in the three testing scenarios summarized in Table 1. In each of the three scenarios, we benchmarked the recognition performance by increasing the problem size using 1, 4, 8, 16, 64, and 128 thousand images in the model set. Experiments were run on a quad-core Intel Xeon 2.83 GHz machine with 32 GB of memory. We used the standard SIFT feature descriptor with the DoG interest point detector [11], extracted with the code written by Vedaldi and Fulkerson [20]. Benchmark results for these three scenarios are shown in Fig. 6.
[Figure 7: mean query time per feature (msec) vs. number of features in the database, log-log axes, for Kd-tree, Kd-forest, LSH, and Bag-of-words 500K.]
Figure 7. Comparison of how the query time per feature varies with database size for Kd-trees and LSH for the three scenarios. The curve for the Bag-of-words method shows the query time divided by the number of words used in the query vector. The error bars show the maximum and minimum times, while the point indicates the median time. The curves have been shifted slightly horizontally to make the error bars more visible.
[Figure 8: (a) percent of ground-truth feature matches vs. number of features in the database for Kd-tree, Kd-forest, and LSH; (b) vector space search NN distance ratio vs. number of features in the database, for correct and incorrect matches.]
Figure 8. Degradation at large database sizes. (a) The percentage of matching feature pairs that are consistent with the ground truth transformation, for Kd-trees and LSH. (b) The ratio of the next-NN and NN distances when querying with bag-of-words histograms in the vector space search. The blue and green points show the correctly and incorrectly classified images respectively. Points were artificially shifted horizontally to be more visible.
5.2. Kd-tree and Kd-forest
In our experiments we considered two implementa-
tions of Kd-forests: FLANN, the implementation available
from [12], and our implementation. The FLANN imple-
mentation only accepts floating-point inputs, so we could
only fit in memory a bit over 1 million features. Our imple-
mentation of the kd-forest takes integer-valued features, en-
abling us to reach the maximum number of features that we
have in our databases. Thus, we used our implementation
in mex/Matlab for most of the experiments. We compared
our implementation with FLANN and we found they give
comparable results, see Fig. 9(a).
We also checked the effect of the Kd-forest size on the
recognition performance. Fig. 9(b) shows results of com-
paring Kd-forests with 1, 5, 10 and 15 trees. Performance
increases with the number of trees in the forest, though there
is almost no gain from increasing the size above 10 trees.
In our experiments we used single Kd-trees and Kd-forests
with 5 trees, which provide a significant improvement while
being manageable computationally.
The Kd-forest provides the best recognition performance in all three test scenarios, as shown in Fig. 6. The Kd-tree provides results comparable to the other methods, but is worse than the Kd-forest. Performance of both degrades as the database size increases, which is expected, as the probability of finding the true nearest neighbor decreases as the Kd-tree size increases. Due to the limited size of the RAM available, we were not able to run the Kd-forest for database sizes of 64K images and above (over 6 × 10^7 features).
5.3. Locality Sensitive Hashing
LSH was used to find the nearest neighbor in the database to each feature in a probe image. In all the figures we used Lowe's [11] suggestion of using the Hough transform to find the correct model in the database.
[Figure 6: percent of images correctly classified vs. number of features in the database, for Kd-tree, Kd-forest, LSH, and Bag-of-words with 10K and 500K codebooks, with and without (*), plus Bag-of-words 500K (*) top 100. Panels: (a) Scenario 1 (Synthetic Distortions), (b) Scenario 2 (Photographed CDs), (c) Scenario 3 (Pasadena Buildings), (d) Scenario 3' (Reduced Probe Set).]
Figure 6. Benchmark of classification performance for the different techniques in the different testing scenarios. (*) denotes the instances where the codebook was trained on images of the same type as the test set. (d) shows results on scenario 3 with a reduced probe set, where for every house image we use the two test images that are taken at the same time of day, so as to neutralize the effect of lighting change.
The performance of the LSH algorithm was comparable to, but slightly worse than, that of the Kd-trees; see Fig. 6. However, LSH seems to handle larger database sizes better: its performance stays approximately constant. The available implementation does not scale to experiments larger than 16,000 images. This does not appear to be an intrinsic limit of the algorithm.
5.4. Bag-of-words Search
When benchmarking the bag-of-words nearest neighbor search with the tf-idf weighted index, we have defined a classification as correct when the test image gives the highest ranking score to the ground truth training image. It may be argued that this is an unfair comparison with the feature-based methods, which also include a spatial verification step (the Hough transform). Indeed, it has been shown that spatial verification of the top M images does improve recognition results [17]. However, this re-ranking technique would have to be compared to spatially verifying the M models with the highest numbers of matched features in the Kd-tree and LSH methods. The only fair comparison would be with a method where geometric verification has been incorporated into the indexing scheme [8]. As an upper bound, we have included in Fig. 6 the results when
an image is considered as correctly classified if it is ranked among the top 100 images, i.e. M = 100.

[Figure 9: percent of images correctly classified vs. number of database features in Scenario 1; (a) FLANN vs. our Kd-forest; (b) effect of Kd-forest size, for 1, 5, 10, and 15 trees.]
Figure 9. Kd-tree recognition performance vs. number of database features in Scenario 1. (a) Comparison of our implementation of the Kd-forest with FLANN shows comparable results. (b) Effect of Kd-forest size on recognition performance, comparing Kd-forests with 1, 5, 10, and 15 trees. Performance increases with the number of trees.
Fig. 6 shows that the method performs as well as the Kd-trees on both scenarios 1 & 3. However, the bag-of-words representation fares much worse on scenario 2. This is most likely due to the fact that the vocabulary used for quantization was trained only on the scanned DVD covers, so background features in the test image make the query vector noisier. Indeed, as shown in Fig. 6(b), training the vocabulary on a combination of other photographs from the dataset in [14] and the scanned CD/DVD covers yields better results, although not as good as the Kd-trees and LSH.
6. Discussion
On both scenarios 1 & 3, Kd-forest outperformed LSH
and Bag-of-words for smaller databases. This difference
in performance was reduced as the database size was in-
creased. It is worth reiterating here that it may be possible
to tune the individual algorithms more extensively to do bet-
ter on one or more of the datasets. However, our aim was
to use the most basic implementation of each algorithm, in
order to benchmark the core performance of the methods.
We noticed that the performance was very poor on scenario 3, and we believe this is because some of the photos in the probe set were taken under different lighting conditions (at different times of the day). To verify this, we plotted performance for a reduced probe set with photos taken in the same lighting conditions as the database image, see Fig. 6(d). It shows an improvement of about 25%.
Fig. 7 shows the mean query time per feature as the database size increases. It is interesting to see that the query time for the Kd-tree is almost constant, and in fact slightly decreasing with increasing database size. This is because as the Kd-tree size increases, the number of nodes deep down the tree increases. This increases the chance of having these nodes at the top of the priority queue, which decreases the search time with increased database sizes. On the other
hand, the searching time for LSH increases with increasing database size, as expected.
The performance of both the Kd-trees and the bag-of-words search degrades as the size of the database is increased. This is not the case for the LSH algorithm, whose performance is more or less constant for all database sizes we tried it on. Since both LSH and Kd-trees require the same amount of memory, this could make LSH preferable for larger databases, if the trend continues. Fig. 8(a) shows that the number of feature matches obtained with LSH decreases as database size increases (similar to Kd-trees). However, the performance of LSH stays constant while the performance of the Kd-trees deteriorates. We have no explanation for this behavior, but we hypothesize that the NNs provided by LSH are more likely to create a proper match given the special tuning of the LSH parameters (§4.2).
The performance of the bag-of-words search also falls off with increasing database size, but for a slightly different reason. In the case of bag-of-words, the nearest neighbor search is between the images, represented as weighted bag-of-words histograms, and not between individual features. The bag-of-words representation breaks down as the distance between vectors in the database approaches the noise level for the query vectors. Fig. 8(b) shows the ratio of the next-NN and the NN distances in the index for the query images in scenario 2 (photographed CDs). For correctly classified images, this ratio is much higher (albeit with large variance) than for misclassified query images, for which the ratio is ~1, independent of the database size. However, as the size of the database is increased, the ratio decreases even for correctly classified images, and for more than 100,000 images the ratio approaches unity. One solution to this problem might be to apply a local distance metric to the images in the database [9].
In summary, our main findings are:
1. The performance of all algorithms is very different in
the three scenarios. In particular, performance on syn-
thetic distortions (scenario 1) overestimates performance
on ‘real’ data, and is overall quite uninformative. Fur-
thermore, differences in performance between CD covers
and buildings show that the statistics of images count for
a lot. Thus, one should use a diverse collection of bench-
mark datasets of real images for the purpose of evaluating
recognition algorithms.
2. Performance of bag-of-words techniques, which were de-
signed to scale recognition to large datasets, degrades
sharply with increasing database size in both real image
scenarios. Our experiments do not show recognition per-
formance advantages of these techniques with respect to
Lowe’s original method. There is, however, a significant
advantage in memory usage, allowing larger experiments
to be carried out. This suggests that a spatial consistency
check among the top M images is crucial for this method, as indicated by the upper-bound curve in Fig. 6.
3. Kd-trees and LSH scale differently. Recognition time
with Kd-trees remains virtually constant as database
size increases, while recognition performance decreases
sharply. Instead, LSH's performance decreases slowly with database size, while its recognition time increases quickly. The query time for the Bag-of-words search also grows quickly.
4. RAM size will be a serious bottleneck if scaling to millions of images is an objective; it appears essential to develop approaches that can use RAM more efficiently and effectively, ideally allowing the user to set an upper bound on memory usage. Solutions that distribute the database across multiple servers would of course be useful, but only if the number of servers grows sublinearly with the number of objects to be recognized.
References
[1] A. Andoni and P. Indyk. E2LSH: Exact Euclidean locality-sensitive hashing, 2004. http://web.mit.edu/andoni/www/LSH.
[2] S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. Journal of the ACM, 45:891–923, 1998.
[3] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In ICCV, 2007.
[4] O. Chum, J. Philbin, and A. Zisserman. Near duplicate image detection: min-hash and tf-idf weighting. In BMVC, 2008.
[5] T. Darrell, P. Indyk, and G. Shakhnarovich, editors. Nearest Neighbor Methods in Learning and Vision: Theory and Practice. MIT Press, 2006.
[6] J. Friedman, J. Bentley, and R. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Trans. on Mathematical Software, 3:209–226, 1977.
[7] P. Indyk and R. Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. In Symposium on Theory of Computing, 1998.
[8] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, volume I, pages 304–317, Oct 2008.
[9] H. Jegou, H. Harzallah, and C. Schmid. A contextual dissimilarity measure for accurate and efficient image search. In CVPR, 2007.
[10] V. Lepetit, P. Lagger, and P. Fua. Randomized trees for real-time keypoint recognition. In CVPR, pages 775–781, 2005.
[11] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60:91–110, 2004.
[12] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In Int. Conf. on Comp. Vision Theory and Applications, 2009.
[13] M. Munich, P. Pirjanian, E. Di Bernardo, L. Goncalves, N. Karlsson, and D. Lowe. Application of visual pattern recognition to robotics and automation. IEEE Robotics & Automation Magazine, pages 72–77, 2006.
[14] D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree. In CVPR, volume 2, pages 2161–2168, June 2006.
[15] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
[16] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
[17] G. Schindler, M. Brown, and R. Szeliski. City-scale location recognition. In CVPR, Minneapolis, June 2007.
[18] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
[19] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large databases for recognition. In CVPR, 2008.
[20] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.