CaltechAUTHORS
  A Caltech Library Service

Disorder inequality: a combinatorial approach to nearest neighbor search

Goyal, Navin and Lifshits, Yury and Schütze, Hinrich (2008) Disorder inequality: a combinatorial approach to nearest neighbor search. In: WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining. ACM , New York, NY, pp. 25-32. ISBN 978-1-59593-927-2. https://resolver.caltech.edu/CaltechAUTHORS:20161027-131947661

Full text is not posted in this repository. Consult Related URLs below.

Use this Persistent URL to link to this item: https://resolver.caltech.edu/CaltechAUTHORS:20161027-131947661

Abstract

We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use the triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. In this paper we introduce a special property of the similarity function on a set S that leads to efficient combinatorial algorithms for S. The disorder constant D(S) of a set S is defined to ensure the following inequality: if x is the a'th most similar object to z and y is the b'th most similar object to z, then x is among the D(S) (a + b) most similar objects to y. Assuming that disorder is small we present the first two known combinatorial algorithms for nearest neighbors whose query time has logarithmic dependence on the size of S. The first one, called Ranwalk, is a randomized zero-error algorithm that always returns the exact nearest neighbor. It uses space quadratic in the input size in preprocessing, but is very efficient in query processing. The second algorithm, called Arwalk, uses near-linear space. It uses random choices in preprocessing, but the query processing is essentially deterministic. For an arbitrary query q, there is only a small probability that the chosen data structure does not support q. Finally, we show that for the Reuters corpus average disorder is indeed quite small and that Ranwalk efficiently computes the nearest neighbor in most cases.


Item Type:Book Section
Related URLs:
URLURL TypeDescription
http://dx.doi.org/10.1145/1341531.1341538DOIPaper
http://dl.acm.org/citation.cfm?doid=1341531.1341538PublisherPaper
Additional Information:© 2008 ACM. Supported by the Lee Center of Advanced Networking and the Center for the Mathematics of Information. Supported by DFG grant SFB732.
Funders:
Funding AgencyGrant Number
Caltech Lee Center for Advanced NetworkingUNSPECIFIED
Center for the Mathematics of Information, CaltechUNSPECIFIED
Deutsche Forschungsgemeinschaft (DFG)SFB732
Subject Keywords:Algorithms, Theory, Experimentation, Nearest neighbor search, similarity search, proximity search, random walk, randomized algorithms, disorder inequality, disorder dimension, disorder constant
Classification Code:E.1 [ Data Structures ]; F.2.2 [ Analysis of Algorithms and Problem Complexity ]: Nonnumerical Algorithms and Problems: computations on discrete structures, sortin g and searching; H.2.4 [ Database Management ]: Systems: query processing; H.3.3 [ Informa
DOI:10.1145/1341531.1341538
Record Number:CaltechAUTHORS:20161027-131947661
Persistent URL:https://resolver.caltech.edu/CaltechAUTHORS:20161027-131947661
Official Citation:Navin Goyal, Yury Lifshits, and Hinrich Schütze. 2008. Disorder inequality: a combinatorial approach to nearest neighbor search. In Proceedings of the 2008 International Conference on Web Search and Data Mining (WSDM '08). ACM, New York, NY, USA, 25-32. DOI=http://dx.doi.org/10.1145/1341531.1341538
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:71552
Collection:CaltechAUTHORS
Deposited By: Kristin Buxton
Deposited On:27 Oct 2016 20:51
Last Modified:11 Nov 2021 04:47

Repository Staff Only: item control page