The K-armed dueling bandits problem
We study a partial-information online-learning problem where actions are restricted to noisy comparisons between pairs of strategies (also known as bandits). In contrast to conventional approaches that require the absolute reward of the chosen strategy to be quantifiable and observable, our setting assumes only that (noisy) binary feedback about the relative reward of two chosen strategies is available. This type of relative feedback is particularly appropriate in applications where absolute rewards have no natural scale or are difficult to measure (e.g., user-perceived quality of a set of retrieval results, taste of food, product attractiveness), but where pairwise comparisons are easy to make. We propose a novel regret formulation in this setting, as well as present an algorithm that achieves information-theoretically optimal regret bounds (up to a constant factor).
© 2012 Elsevier. Received 28 March 2010, Received in revised form 31 January 2011, Accepted 22 December 2011, Available online 21 January 2012. The work is funded by NSF Awards IIS-0812091 and IIS-0905467. The first author is also supported by a Microsoft Research Graduate Fellowship and a Yahoo! Key Scientific Challenges Award. The third author is supported by NSF Awards CCF-0643934 and AF-0910940, an Alfred P. Sloan Foundation Fellowship, and a Microsoft Research New Faculty Fellowship.