
Dueling Posterior Sampling for Preference-Based Reinforcement Learning

Novoseller, Ellen R. and Sui, Yanan and Yue, Yisong and Burdick, Joel W. (2019) Dueling Posterior Sampling for Preference-Based Reinforcement Learning. (Unpublished) https://resolver.caltech.edu/CaltechAUTHORS:20190905-154320945

PDF - Submitted Version, 1247Kb. See Usage Policy.


Abstract

In preference-based reinforcement learning (RL), an agent interacts with the environment while receiving preferences instead of absolute feedback. While there is increasing research activity in preference-based RL, the design of formal frameworks that admit tractable theoretical analysis remains an open challenge. Building upon ideas from preference-based bandit learning and posterior sampling in RL, we present Dueling Posterior Sampling (DPS), which employs preference-based posterior sampling to learn both the system dynamics and the underlying utility function that governs the user's preferences. Because preference feedback is provided on trajectories rather than individual state/action pairs, we develop a Bayesian approach to solving the credit assignment problem, translating user preferences to a posterior distribution over state/action reward models. We prove an asymptotic no-regret rate for DPS with a Bayesian logistic regression credit assignment model; to our knowledge, this is the first regret guarantee for preference-based RL. We also discuss possible avenues for extending this proof methodology to analyze other credit assignment models. Finally, we evaluate the approach empirically, showing competitive performance against existing baselines.
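The credit assignment idea described in the abstract, turning a preference between two trajectories into evidence about per-state/action rewards, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and is not taken from the paper: the function names, the toy trajectories, the visit-count features, and the Gaussian prior. The sketch only scores candidate reward vectors under a logistic preference likelihood; it does not sample from the posterior or learn the dynamics, both of which DPS does.

```python
import math
from collections import Counter


def visit_counts(trajectory, pairs):
    """Count how often each (state, action) pair occurs in a trajectory."""
    counts = Counter(trajectory)
    return [counts[p] for p in pairs]


def preference_prob(reward, traj_a, traj_b, pairs):
    """P(traj_a preferred over traj_b): logistic link on the total-reward gap.

    Assumption: a trajectory's utility is the sum of per-pair rewards.
    """
    xa = visit_counts(traj_a, pairs)
    xb = visit_counts(traj_b, pairs)
    gap = sum(r * (a - b) for r, a, b in zip(reward, xa, xb))
    return 1.0 / (1.0 + math.exp(-gap))


def log_posterior(reward, comparisons, pairs, prior_var=1.0):
    """Unnormalized log posterior over a reward vector.

    Gaussian prior (hypothetical choice) plus logistic likelihood;
    comparisons is a list of (preferred_trajectory, other_trajectory).
    """
    log_prior = -sum(r * r for r in reward) / (2.0 * prior_var)
    log_lik = sum(
        math.log(preference_prob(reward, won, lost, pairs))
        for won, lost in comparisons
    )
    return log_prior + log_lik


# Toy problem (hypothetical): two states, two actions.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
traj_a = [(0, 1), (1, 1), (1, 1)]
traj_b = [(0, 0), (0, 0), (1, 0)]
comparisons = [(traj_a, traj_b)]  # the user preferred traj_a

# Two candidate reward vectors with equal prior weight.
favors_a = [-1.0, 1.0, -1.0, 1.0]
favors_b = [1.0, -1.0, 1.0, -1.0]

score_a = log_posterior(favors_a, comparisons, pairs)
score_b = log_posterior(favors_b, comparisons, pairs)
```

In this toy setup the reward vector that assigns higher reward to the pairs visited by the preferred trajectory receives the higher posterior score, which is how a single trajectory-level preference shifts belief about individual state/action rewards.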


Item Type: Report or Paper (Discussion Paper)
Related URLs:
URL                              URL Type  Description
http://arxiv.org/abs/1908.01289  arXiv     Discussion Paper
ORCID:
Author               ORCID
Novoseller, Ellen R. 0000-0001-5263-0598
Sui, Yanan           0000-0002-9480-627X
Yue, Yisong          0000-0001-9127-1989
Additional Information: This work was supported by NIH grant EB007615 and an Amazon graduate fellowship.
Funders:
Funding Agency  Grant Number
NIH             EB007615
Amazon          UNSPECIFIED
Record Number: CaltechAUTHORS:20190905-154320945
Persistent URL: https://resolver.caltech.edu/CaltechAUTHORS:20190905-154320945
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 98462
Collection: CaltechAUTHORS
Deposited By: George Porter
Deposited On: 05 Sep 2019 22:54
Last Modified: 03 Oct 2019 21:41
