Goyal, Navin and Lifshits, Yury and Schütze, Hinrich
(2008)
*Disorder inequality: a combinatorial approach to nearest neighbor search.*
In:
WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining.
ACM
, New York, NY, pp. 25-32.
ISBN 978-1-59593-927-2.
https://resolver.caltech.edu/CaltechAUTHORS:20161027-131947661

Full text is not posted in this repository. Consult Related URLs below.

Use this Persistent URL to link to this item: https://resolver.caltech.edu/CaltechAUTHORS:20161027-131947661

## Abstract

We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use the triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. In this paper we introduce a special property of the similarity function on a set S that leads to efficient combinatorial algorithms for S. The disorder constant D(S) of a set S is defined to ensure the following inequality: if x is the a'th most similar object to z and y is the b'th most similar object to z, then x is among the D(S) (a + b) most similar objects to y. Assuming that disorder is small we present the first two known combinatorial algorithms for nearest neighbors whose query time has logarithmic dependence on the size of S. The first one, called Ranwalk, is a randomized zero-error algorithm that always returns the exact nearest neighbor. It uses space quadratic in the input size in preprocessing, but is very efficient in query processing. The second algorithm, called Arwalk, uses near-linear space. It uses random choices in preprocessing, but the query processing is essentially deterministic. For an arbitrary query q, there is only a small probability that the chosen data structure does not support q. Finally, we show that for the Reuters corpus average disorder is indeed quite small and that Ranwalk efficiently computes the nearest neighbor in most cases.

Item Type: | Book Section | |||||||||
---|---|---|---|---|---|---|---|---|---|---|

Related URLs: |
| |||||||||

Additional Information: | © 2008 ACM. Supported by the Lee Center of Advanced Networking and the Center for the Mathematics of Information. Supported by DFG grant SFB732. | |||||||||

Funders: |
| |||||||||

Subject Keywords: | Algorithms, Theory, Experimentation, Nearest neighbor search, similarity search, proximity search, random walk, randomized algorithms, disorder inequality, disorder dimension, disorder constant | |||||||||

Classification Code: | E.1 [ Data Structures ]; F.2.2 [ Analysis of Algorithms and Problem Complexity ]: Nonnumerical Algorithms and Problems: computations on discrete structures, sortin g and searching; H.2.4 [ Database Management ]: Systems: query processing; H.3.3 [ Informa | |||||||||

DOI: | 10.1145/1341531.1341538 | |||||||||

Record Number: | CaltechAUTHORS:20161027-131947661 | |||||||||

Persistent URL: | https://resolver.caltech.edu/CaltechAUTHORS:20161027-131947661 | |||||||||

Official Citation: | Navin Goyal, Yury Lifshits, and Hinrich Schütze. 2008. Disorder inequality: a combinatorial approach to nearest neighbor search. In Proceedings of the 2008 International Conference on Web Search and Data Mining (WSDM '08). ACM, New York, NY, USA, 25-32. DOI=http://dx.doi.org/10.1145/1341531.1341538 | |||||||||

Usage Policy: | No commercial reproduction, distribution, display or performance rights in this work are provided. | |||||||||

ID Code: | 71552 | |||||||||

Collection: | CaltechAUTHORS | |||||||||

Deposited By: | Kristin Buxton | |||||||||

Deposited On: | 27 Oct 2016 20:51 | |||||||||

Last Modified: | 11 Nov 2021 04:47 |

Repository Staff Only: item control page