Batch Policy Learning under Constraints

Creators: Le, Hoang M.; Voloshin, Cameron; Yue, Yisong

Abstract

When learning policies for real-world domains, two important questions arise: (i) how to efficiently use pre-collected off-policy, non-optimal behavior data; and (ii) how to mediate among different competing objectives and constraints. We thus study the problem of batch policy learning under multiple constraints, and offer a systematic solution. We first propose a flexible meta-algorithm that admits any batch reinforcement learning and online learning procedure as subroutines. We then present a specific algorithmic instantiation and provide performance guarantees for the main objective and all constraints. To certify constraint satisfaction, we propose a new and simple method for off-policy policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves strong empirical results in different domains, including in a challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving. We also show experimentally that our OPE method outperforms other popular OPE techniques on a standalone basis, especially in a high-dimensional setting.

Additional Information

Attached Files

Published - le19a.pdf

Submitted - 1903.08738.pdf

Supplemental Material - le19a-supp.pdf

Files

Files (4.7 MB)

Name	Checksum	Size
1903.08738.pdf	md5:5968ba09cb47f84072cd49526aa88047	2.3 MB
le19a.pdf	md5:757e819749e05d52a5640ab600ac9912	1.2 MB
le19a-supp.pdf	md5:65f675b2f620324907ef53d68d274a61	1.2 MB
