Azizzadenesheli, Kamyar and Lazaric, Alessandro and Anandkumar, Animashree (2016) Reinforcement Learning in Rich-Observation MDPs using Spectral Methods. (Unpublished) https://resolver.caltech.edu/CaltechAUTHORS:20190327-085718507
PDF (Submitted Version), 433kB. See Usage Policy.
Use this Persistent URL to link to this item: https://resolver.caltech.edu/CaltechAUTHORS:20190327-085718507
Abstract
Reinforcement learning (RL) in Markov decision processes (MDPs) with large state spaces is a challenging problem. The performance of standard RL algorithms degrades drastically with the dimensionality of the state space. However, in practice, these large MDPs typically incorporate a latent or hidden low-dimensional structure. In this paper, we study the setting of rich-observation Markov decision processes (ROMDPs), where a small number of hidden states are mapped to a large number of observation states. In other words, every observation state is generated by a single hidden state, and this mapping is unknown a priori. We introduce a spectral decomposition method that consistently learns this mapping and, more importantly, does so with low regret. The estimated mapping is integrated into an optimistic RL algorithm (UCRL), which operates on the estimated hidden space. We derive finite-time regret bounds for our algorithm with only a weak dependence on the dimensionality of the observed space. In fact, our algorithm asymptotically achieves the same average regret as the oracle UCRL algorithm, which has knowledge of the mapping from hidden to observed spaces. Thus, we obtain an efficient spectral RL algorithm for ROMDPs.
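To make the two-phase structure described in the abstract concrete, here is a minimal, self-contained Python sketch: a synthetic ROMDP in which each observation is owned by exactly one hidden state, a simple spectral step (an SVD embedding plus k-means, standing in for the paper's tensor decomposition with confidence bounds) that recovers the observation-to-hidden-state mapping from exploration data, and a placeholder comment for running UCRL on the recovered hidden space. Everything in the snippet, including the environment, the random-exploration phase, and the clustering, is an illustrative assumption, not the paper's actual algorithm or its regret guarantees.

```python
# Illustrative sketch (not the paper's algorithm): cluster rich observations
# back to a few hidden states, then run tabular RL on the small hidden space.
import itertools
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
n_hidden, n_obs, n_actions = 3, 30, 2

# Hidden-state dynamics (unknown to the learner): P[x, a] is a distribution
# over next hidden states.
P = rng.dirichlet(np.ones(n_hidden), size=(n_hidden, n_actions))

# Injective observation model: each observation is owned by exactly one hidden
# state (the first n_hidden assignments guarantee every hidden state is used).
owner = np.concatenate([np.arange(n_hidden),
                        rng.integers(n_hidden, size=n_obs - n_hidden)])
emit = [np.flatnonzero(owner == x) for x in range(n_hidden)]

def step(x, a):
    """One environment step: hidden transition, then emit an observation."""
    x2 = rng.choice(n_hidden, p=P[x, a])
    y2 = int(rng.choice(emit[x2]))
    return x2, y2

# Phase 1: explore with uniformly random actions and record
# observation-to-observation transition counts.
counts = np.zeros((n_obs, n_obs))
x, y = 0, int(rng.choice(emit[0]))
for _ in range(100_000):
    a = int(rng.integers(n_actions))
    x2, y2 = step(x, a)
    counts[y, y2] += 1
    x, y = x2, y2

# Under the random policy, P(next obs | current obs) depends only on the hidden
# state owning the current observation, so observations from the same hidden
# state have near-identical normalized count rows; a rank-n_hidden SVD embedding
# separates the clusters. (A simplified stand-in for the spectral method.)
rows = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
U, S, _ = np.linalg.svd(rows, full_matrices=False)
embed = U[:, :n_hidden] * S[:n_hidden]
_, labels = kmeans2(embed, n_hidden, minit="++", seed=0)

# Sanity check: the learned labels should be a relabeling of the true owners.
purity = max(
    np.mean(np.asarray(perm)[labels] == owner)
    for perm in itertools.permutations(range(n_hidden))
)
print(f"cluster purity vs. true hidden states: {purity:.2f}")

# Phase 2 (not shown): run an optimistic algorithm such as UCRL on the small
# clustered state space instead of the raw observation space.
```

With enough exploration steps the empirical rows concentrate and the purity printed is 1.00; the paper's contribution is precisely to interleave this clustering with learning so that the mapping is recovered without paying a long exploration phase in regret.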
| Item Type | Report or Paper (Discussion Paper) |
| --- | --- |
| Additional Information | © 2018 Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. K. Azizzadenesheli is supported in part by NSF Career Award CCF-1254106 and AFOSR YIP FA9550-15-1-0221. A. Lazaric is supported in part by a grant from CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015-2020, CRIStAL (Centre de Recherche en Informatique et Automatique de Lille), and the French National Research Agency (ANR) under project ExTra-Learn n.ANR-14-CE24-0010-01. A. Anandkumar is supported in part by a Microsoft Faculty Fellowship, a Google faculty award, an Adobe grant, NSF Career Award CCF-1254106, AFOSR YIP FA9550-15-1-0221, and Army Award No. W911NF-16-1-0134. The work was partially developed while the first author, K. Azizzadenesheli, was visiting INRIA Lille and the Simons Institute for the Theory of Computing, UC Berkeley. |
| Subject Keywords | Tensor Method, Regret, Confidence Bound, Rich Observability, Clustering |
| Record Number | CaltechAUTHORS:20190327-085718507 |
| Persistent URL | https://resolver.caltech.edu/CaltechAUTHORS:20190327-085718507 |
| Usage Policy | No commercial reproduction, distribution, display or performance rights in this work are provided. |
| ID Code | 94165 |
| Collection | CaltechAUTHORS |
| Deposited By | George Porter |
| Deposited On | 28 Mar 2019 22:23 |
| Last Modified | 11 Nov 2020 00:59 |