Information Aggregation for Constrained Online Control

We consider a two-controller online control problem where a central controller chooses an action from a feasible set that is determined by time-varying and coupling constraints, which depend on all past actions and states. The central controller's goal is to minimize the cumulative cost; however, the controller has access to neither the feasible set nor the dynamics directly, which are determined by a remote local controller. Instead, the central controller receives only an aggregate summary of the feasibility information from the local controller, which does not know the system costs. We show that it is possible for an online algorithm using feasibility information to nearly match the dynamic regret of an online algorithm using perfect information whenever the feasible sets satisfy a causal invariance criterion and there is a sufficiently large prediction window size. To do so, we use a form of feasibility aggregation based on entropic maximization in combination with a novel online algorithm, named Penalized Predictive Control (PPC).


PROBLEM STATEMENT
We consider a general dynamical model over a discrete time horizon [T] := {1, . . . , T} with time-varying and time-coupling constraints:

x_{t+1} = f_t(x_t, u_t),  t ∈ [T],  (1)

where the deterministic function f_t represents the transition of the state. The dynamical system is governed by a local controller, which manages a large fleet of controllable units. The collection of the states of the units at time t is represented by x_t in a state space X ⊆ R^n. Both the state and action at each time are confined by safety sets that may be time-varying and time-coupling, i.e., x_t ∈ X_t(x_{<t}, u_{<t}) and u_t ∈ U_t(x_{<t}, u_{<t}). For simplicity, we denote them by U_t and X_t in what follows.
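As a concrete toy illustration of this model, the sketch below simulates the transition x_{t+1} = f_t(x_t, u_t) under stand-in safety sets. The linear transition, the box constraints, and the horizon are illustrative assumptions, not the paper's concrete model:

```python
import numpy as np

# Minimal sketch of the constrained dynamics x_{t+1} = f_t(x_t, u_t).
# The transition f_t, the safety sets X_t and U_t, and the horizon T are
# illustrative stand-ins, not the paper's model.

T = 5                      # horizon length
n, m = 2, 2                # state and action dimensions

def f(t, x, u):
    # A simple linear time-varying transition used only for illustration.
    return 0.9 * x + (1.0 / (t + 1)) * u

def in_safety_set(t, x, u):
    # Hypothetical box constraints standing in for X_t(x_{<t}, u_{<t})
    # and U_t(x_{<t}, u_{<t}); the paper's sets may couple across time.
    return np.all(np.abs(x) <= 1.0) and np.all(np.abs(u) <= 0.5)

x = np.zeros(n)            # initial state x_0 at the origin
for t in range(T):
    u = np.full(m, 0.1)    # some candidate action
    assert in_safety_set(t, x, u), "action leaves the safety set"
    x = f(t, x, u)
```

In the real two-controller setting, only the local controller can evaluate `in_safety_set`; the central controller must rely on aggregated feasibility feedback instead.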
There is a distant central controller that communicates with the local controller. The central controller selects an action u_t at each time t ∈ [T]. The actions must be selected from a closed and bounded domain U ⊆ R^m. The initial point x_0 is assumed to be the origin without loss of generality. The central controller receives time-varying cost functions online from an external environment, and each c_t(·) : U → R_+ depends only on the action chosen by the central controller. We assume that the local controller does not know the costs and has to apply the action given by the central controller, while the central controller cannot access the constraints directly; instead, some information about the constraints, summarized as a density function p_t(·) : U → [0, 1], can be transmitted during the control process. The goal of an online control policy in this setting is to make the local and central controllers jointly minimize a cumulative cost c(u) := Σ_{t=1}^T c_t(u_t) while satisfying (1). We suppose the online controller has (perfect) predictions of the cost functions for the current and the next W − 1 time slots, where W denotes the prediction window size. Throughout this paper, we make the following regularity and smoothness assumptions on the model.
Assumption 3. For each t ∈ [T], the cost function c_t(·) : U → R_+ is Lipschitz continuous; i.e., we assume that there exists a Lipschitz constant G > 0 such that |c_t(u) − c_t(v)| ≤ G ||u − v||_2 for all u, v ∈ U.

Session: Bandits and Friends, SIGMETRICS '21 Abstracts, June 14-18, 2021, Virtual Event, China

Our work is motivated by settings where a local controller governs a large-scale system and a central controller operates remotely. In many situations, full information about the local controller's dynamics and constraints is not available to the central controller due to complexity or privacy concerns, and the local controller cannot access the system's costs. In such a two-controller system, the central and local controllers each have part of the information needed to control the whole dynamical system online. The task of designing algorithms is therefore even more challenging than in the single-controller case. Note that a wide variety of situations face this challenge, including operator-aggregator coordination in smart grids [2], data center scheduling [3] and fog computing [1].

ALGORITHM AND MAIN RESULTS
Our proposed design, termed Penalized Predictive Control (PPC), is a combination of Model Predictive Control (MPC), which is a competitive policy for online optimization with predictions, and the idea of using feasibility information about the safety sets as a penalty term. This design makes a connection between the well-known MPC scheme and maximum entropy feedback (MEF) [2], a special design of the density p_t(·) in which p_t(u) is determined by the feasible volume ν(S(u_{≤t−1}, u)), where ν(·) denotes the Lebesgue measure and S(u_{≤t}) consists of all feasible (T − t)-length sequences of actions for times t + 1, . . . , T, given the past actions u_{≤t}. The MEF, as a feedback function, contains only feasibility information about the dynamical system on the local controller's side. We present PPC in Algorithm 1. The novel use of MEF as a penalty term in MPC allows PPC to achieve nearly optimal dynamic regret (defined in (3)). It is important to note that without any restrictions on U and X, Regret(u) can be no better than Ω(T) for any deterministic online policy, even with predictions, as the following theorem shows.
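The penalty idea can be sketched as follows. This is a hypothetical one-dimensional rendering, not the paper's Algorithm 1: the penalty weight beta, the discretized action grid, the toy costs, and the feasibility density are all assumed for illustration; the feedback p_t enters the lookahead objective through the penalty term −β log p_t(u):

```python
import numpy as np

# Sketch of the Penalized Predictive Control idea: augment an MPC-style
# lookahead objective with a penalty built from the feasibility density
# p_t supplied by the local controller. The penalty weight beta, the
# action grid, and the toy costs are illustrative choices.

beta = 1.0
grid = np.linspace(-1.0, 1.0, 41)      # 1-D discretization of U

def ppc_step(pred_costs, p_t):
    """Pick the action minimizing predicted cost plus a feasibility penalty.

    pred_costs: cost functions c_t, ..., c_{t+W-1} (the prediction window);
    p_t: feasibility density over U from the local controller. Penalizing
    with -beta * log p_t(u) heavily discourages actions the local
    controller marks as (nearly) infeasible, i.e., p_t(u) -> 0.
    """
    best_u, best_val = None, np.inf
    for u in grid:
        # Crude lookahead: hold the same action over the window, standing
        # in for optimizing a full action sequence.
        lookahead = sum(c(u) for c in pred_costs)
        val = lookahead - beta * np.log(max(p_t(u), 1e-12))
        if val < best_val:
            best_u, best_val = u, val
    return best_u

# Toy example: quadratic costs pulling toward 0.8, but feasibility mass
# concentrated on [-0.5, 0.5]; PPC settles at the feasible boundary.
costs = [lambda u: (u - 0.8) ** 2 for _ in range(3)]
p = lambda u: 1.0 if abs(u) <= 0.5 + 1e-9 else 1e-9
u_star = ppc_step(costs, p)
```

The point of the sketch is the objective shape, cost plus feasibility penalty; the real algorithm optimizes over action sequences within the prediction window rather than a held action on a grid.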
Theorem 2.1 (Fundamental limit). Suppose I is a collection of safety sets satisfying Assumption 2. For any sequence of actions u ∈ S generated by a deterministic online policy that has full information about the safety sets, Regret(u) = Ω(D(T − W)) for any W ≥ 1, where D := diam(U) := sup{||u − v||_2 : u, v ∈ U} is the diameter of the action space U, W is the prediction window size and T is the total number of time slots.
Therefore, the focus of this paper is to find conditions on (U, X) so that, given enough predictions, the regret can be bounded by a function sublinear in T. This motivates the following causal invariance criterion. Let u*_{≤t} = (u*_1, . . . , u*_t) be a subsequence of optimal actions that maximizes the volume of the set of feasible actions, defined as u*_{≤t} := arg sup_{u∈U} ν(S(u)). Define the maximizing length-k subsequence of actions as u*_{t+1:t+k} := arg sup_{u∈U} ν(S(u*_{≤t}, u)).
(1) For all t ∈ [T] and sequences of actions u_{≤t} and v_{≤t}, where B denotes the unit ball in R^{m×(T−t)}.
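The volume quantity ν(S(u_{≤t}, u)) behind the MEF can be approximated numerically. Below is a Monte Carlo sketch with a toy budget-style coupling constraint standing in for the paper's safety sets; the candidate grid and the normalization across candidates are illustrative choices, not the paper's exact MEF construction:

```python
import numpy as np

# Monte Carlo sketch of the feasible-volume quantity nu(S(u_{<=t}, u)):
# for each candidate next action u, estimate the Lebesgue measure of the
# set of feasible future action sequences, then normalize across the
# candidates to obtain a density-like feedback. The running-budget
# feasibility rule is a toy stand-in for time-coupling safety sets.

rng = np.random.default_rng(0)
T_remaining = 3            # number of future time slots
budget = 1.0               # toy coupling: running action totals stay bounded

def feasible(past_sum, future_seq):
    # A sequence is feasible if every running total stays within budget.
    totals = past_sum + np.cumsum(future_seq)
    return np.all(np.abs(totals) <= budget)

def volume_estimate(past_sum, n_samples=2000):
    # Sample future sequences uniformly from [-1, 1]^{T_remaining} and
    # scale the hit rate by the cube's volume, 2^{T_remaining}.
    seqs = rng.uniform(-1.0, 1.0, size=(n_samples, T_remaining))
    hits = sum(feasible(past_sum, s) for s in seqs)
    return (2.0 ** T_remaining) * hits / n_samples

candidates = np.linspace(-1.0, 1.0, 5)
vols = np.array([volume_estimate(u) for u in candidates])
p_t = vols / vols.sum()    # normalized feedback over candidate actions
```

Actions that leave more room for feasible futures receive more feedback mass, which is exactly the information the penalty in PPC exploits.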
(2) For all t ∈ [T] and sequences of actions u_{≤t},
We are now ready to present our main result, which bounds the dynamic regret by a decreasing function of the prediction window size W, under the assumption that the safety sets are causally invariant. In the regret bound, D denotes the diameter of the action space U, W is the prediction window size and T is the total number of time slots.
This theorem implies that, with additional assumptions on the safety constraints, a sublinear (in T) dynamic regret is achievable given a sufficiently large prediction window size W = ω(1) (in T). The effectiveness of our online algorithm for closed-loop coordination between central and local controllers is validated via an electric vehicle charging application in power systems.