Learning-Based Predictive Control via Real-Time Aggregate Flexibility

Aggregators have emerged as crucial tools for the coordination of distributed, controllable loads. To be used effectively, an aggregator must be able to communicate the available flexibility of the loads it controls, known as the aggregate flexibility, to a system operator. However, most existing aggregate flexibility measures are slow-timescale estimations, and much less attention has been paid to real-time coordination between an aggregator and an operator. In this paper, we consider solving an online optimization in a closed-loop system and present a design of real-time aggregate flexibility feedback, termed the maximum entropy feedback (MEF). In addition to deriving analytic properties of the MEF, we show, combining learning and control, that it can be approximated using reinforcement learning and used as a penalty term in a novel control algorithm -- the penalized predictive control (PPC), which modifies vanilla model predictive control (MPC). The benefits of our scheme are: (1) Efficient communication: an operator running PPC does not need to know the exact states and constraints of the loads, but only the MEF. (2) Fast computation: the PPC often has far fewer variables than an MPC formulation. (3) Lower costs: we show that under certain regularity assumptions, the PPC is optimal. We illustrate the efficacy of the PPC using a dataset from an adaptive electric vehicle charging network and show that PPC outperforms classical MPC.


I. INTRODUCTION
The uncertainty and volatility of renewable sources such as wind and solar power have created a need to exploit the flexibility of distributed energy resources (DERs), and aggregators have emerged as dominant players for coordinating these loads [1], [2]. An aggregator can coordinate a large pool of DERs and be a single point of contact for independent system operators (ISOs) to call on for flexibility. This enables ISOs to minimize cost, respond to unexpected fluctuations of renewables, and even mitigate failures quickly and reliably. Typically, an ISO communicates a time-varying signal to an aggregator, e.g., a desired power profile, that optimizes ISO objectives, and the aggregator coordinates with the DERs to collectively respond to the time-varying signal as faithfully as possible, e.g., by shaping their aggregate power consumption to follow the ISO's power profile, while satisfying DER constraints. These constraints are often private to the loads, e.g., satisfying the energy demands of electric vehicles before their deadlines. They limit the flexibility available to the aggregator, so the aggregator must also communicate with the ISO by providing a feedback signal that quantifies its available flexibility. This feedback provides the ISO with crucial information for determining the signal it sends to the aggregator. Thus the aggregator and the ISO form a closed-loop control system to manage the aggregate flexibility of DERs.
This paper focuses on the design of this closed-loop system and, in particular, the design of the real-time feedback signal from the aggregator to the ISO quantifying the available flexibility. The design of the aggregate flexibility feedback signal is complex and has been the subject of significant research over the last decade, e.g., [3], [4], [5], [6], [7], [8], [9], [10], [11]. Any feedback design must balance a variety of conflicting goals. Given the scale, complexity, and privacy of the load constraints, it may be neither possible nor desirable to communicate precise information about every load. Instead, aggregate flexibility feedback must be a concise summary of a system's constraints, and it must limit the leakage of information about specific load constraints. On the other hand, the feedback sent by an aggregator needs to be informative enough to allow the ISO to achieve operational objectives, e.g., minimizing cost, and, most importantly, it must contain feasibility information for the whole system with respect to the private load constraints. Moreover, a design for a flexibility feedback signal must be general enough to be applicable to a wide variety of controllable loads, e.g., electric vehicles (EVs), heating, ventilation, and air conditioning (HVAC) systems, energy storage units, thermostatically controlled loads, residential loads, and pool pumps. It is impractical to design different feedback signals for each load, so the same design must work for all DERs.
The challenge and importance of the design of flexibility feedback signals has led to the emergence of a rich literature. In many cases, the literature focuses on specific classes of controllable loads, such as EVs [12], heating, ventilation, and air conditioning (HVAC) systems [13], [14], energy storage units [10], thermostatically controlled loads [4], or residential loads and pool pumps [5], [15]. In the context of these applications, a variety of approaches have been suggested, e.g., convex geometric approximations via virtual battery models [4], [6], hyper-rectangles [8], and graphical interpretations [10]; scheduling-based aggregation [16], [17]; linear combinations of demand bid curves [14]; and probability-based characterizations [5], [15]. These approaches have all yielded some success, especially in terms of quantifying available aggregate flexibility (see Section I-B for more detail on related work). However, nearly all prior work focuses on slower-timescale estimation and does not meet the goal of providing real-time aggregate flexibility feedback. The fast-changing environment and the uncertainties of the DERs, however, demand real-time flexibility feedback. For example, in an EV charging facility, it is notoriously challenging to predict future EV arrivals and their battery capacities. With on-site solar generation, the aggregator's dynamical system can be time-varying and non-stationary, so it is crucial that real-time feedback be defined and approximated for use in online feedback-based applications. Furthermore, most of the existing frameworks are designed for specific tasks, such as managing HVAC systems [13], [14], and therefore may not be applicable to other applications. Reinforcement learning (RL), especially deep RL, has been widely used as an approximation tool in smart grid applications. Joint pricing and EV charging scheduling for a single EV charger is considered in [18] using state-action-reward-state-action (SARSA).
However, it is unclear how the proposed method in [18] can be extended to multiple chargers. Q-learning is used in [19] to estimate the residual energy in an energy storage system at the end of each day, and in [20] to determine the aggregate action for thermostatically controlled loads (TCLs). The authors of [21] combine evolution strategies and model predictive control (MPC) to coordinate heterogeneous TCLs. Most existing studies, including the aforementioned works, typically use RL for a "central controller" (an operator in our context). Instead, we use it on the aggregator's side to learn flexibility representations.
To the best of our knowledge, no paper has focused on the design of real-time coordination between an aggregator and a system operator that achieves the goals laid out above, except for some preliminary results in [22], [23]. Those results rely on a novel design of a real-time feedback signal that can be used to quantify the aggregate flexibility and coordinate real-time control. In this paper, we extend the design of the feedback signal to a more general dynamic system with time-varying and non-stationary constraints, and we mainly focus on how to apply the real-time feedback to practical applications (e.g., EV charging) in power systems. Towards this goal, we propose a reinforcement learning based approach to approximate this feedback and further incorporate the feedback into a penalized predictive control (PPC) scheme. On the theory side, we prove the optimality of the proposed PPC scheme, and through extensive numerical tests, we validate the superior empirical performance of PPC over classic benchmarks, such as MPC.

A. Contributions.
In summary, to complement previous research, this paper considers a closed-loop control model formed by a system operator (central controller) and an aggregator (local controller) and proposes a novel design of real-time aggregate flexibility feedback, called the maximum entropy feedback (MEF), that quantifies the flexibility available to an aggregator. Based on the definition of the MEF, we design a reward function that allows the MEF to be efficiently learned by model-free RL algorithms. Our main contributions are: 1) We introduce a model of the real-time closed-loop control system formed by a system operator and an aggregator. This work is the first to close the loop, both defining a concise measure of aggregate flexibility and showing how it can be used by the system operator in an online manner to optimize system objectives while respecting the constraints of the aggregator's loads. 2) Within this model, we define the "optimal" real-time flexibility feedback as the solution to an optimization problem that maximizes the entropy of the feedback vector. The use of entropy in this context is novel, and to the best of our knowledge, this article is among the first to rigorously define a notion of real-time aggregate flexibility with provable properties. In particular, we show that the exact MEF allows the system operator to maintain feasibility and enhance flexibility. 3) Furthermore, we propose a novel combination of control and learning by integrating model predictive control (MPC) and the defined MEF. Using the MEF as a penalty term, we introduce an algorithm called the penalized predictive control (PPC), which only requires the system operator to receive the MEF at each time, without knowing the states and dynamics of the aggregator. We also prove that, under certain regularity conditions, the actions given by the PPC are optimal. 4) Finally, we demonstrate the efficacy of the proposed scheme using real EV charging data from Caltech's ACN-Data [24].
Our experiments show that by sending simple action signals generated by the PPC, a system operator is able to coordinate with an EV charging aggregator to satisfy almost all EV charging demands, while only knowing the MEF learned by a model-free, off-policy RL algorithm. The PPC is also shown to achieve lower cost than MPC, which additionally needs access to the complete state of the loads.
B. Related literature.
The growing importance of aggregators for the integration of controllable loads and the challenge of defining and quantifying the flexibility provided by aggregators has led to the emergence of a rich literature. Broadly, this work can be separated into three approaches.
Convex geometric approximation. The idea of representing the set of aggregate loads as a virtual battery model dates back to [3], [4]. In [6], flexibility of an aggregation of thermostatically controlled loads (TCLs) was defined as the Minkowski sum of individual polytopes, which is approximated by the homothets of a virtual battery model using linear programming. The recent paper [8] takes a different approach and defines the aggregate flexibility as upper and lower bounds so that each trajectory to be tracked between the bounds is disaggregatable and thus feasible. However, convex geometric approaches cannot be extended to generate real-time flexibility signals because the approximated sets cannot be decomposed along the time axis. In [11], a belief function of setpoints is introduced for real-time control. However, feasibility can only be guaranteed when each setpoint is in the belief set and this may not be the case for systems with memory.
Scheduling algorithm-driven analysis. Scheduling algorithms that enable the aggregation of loads have been studied in depth over the past decade. The authors of [25], [26] introduced a decentralized algorithm with a real-time implementation for EV charging to track a given load profile. The authors of [27] considered the feasibility of matching a given power trajectory and showed that causal optimal policies do not exist. In this work, aggregate flexibility was implicitly considered as the set of all feasible power trajectories. Three heuristic causal scheduling policies were compared, and the results were extended to the aggregation of deferrable loads and storage in [16]. Furthermore, decentralized participation of flexible demand from heat pumps and EVs was addressed in [17]. Notably, the flexibility signals that have emerged from this literature are generally applicable only to specific policies and DERs.
Probability-based characterization. There is much less work on probabilistic methods. The aggregate flexibility of residential loads was defined based on positive and negative pattern variations by analyzing collective behaviour of aggregate users [5]. A randomized and decentralized control architecture for systems of deferrable loads was proposed in [15], with a linear time-invariant system approximation of the derived aggregate nonlinear model. Flexibility in this work was defined as an estimate of the proportion of loads that are operating. Our work falls into this category, but differs from previous papers in that entropy maximization for a closed-loop control system yields an interpretable signal that can be informative for operator objectives in real-time, as well as guarantee feasibility of the private constraints of loads (if the signal is accurate). In our previous work [23], we study the problem of real-time coordination of an aggregator and a system operator under the paradigm of a control framework and provide regret analysis assuming feasibility predictions are available.
Other approaches. Beyond the works described above, there are many other suggestions for metrics of aggregate flexibility, e.g., graphical-based measures [28] and data-driven approaches [28]. Most of these, and the approaches described above, are evaluated on the aggregator side only, and much less attention has been paid to the question of real-time coordination between an ISO and an aggregator that controls decentralized loads.
The assessment and enhancement of aggregate flexibility are often considered independently of the operational objectives. For instance, in a reserve market, an aggregator reports to the ISO a day in advance an offline notion of aggregate flexibility based on forecasts, which the ISO uses to compute an energy and reserve schedule for the following day, e.g., [3], [29], [8], [7], with notable exceptions such as [12], which considered charging and discharging of EV fleet batteries for tracking a sequence of automatic generation control (AGC) signals. However, this approach has several limitations. First, in large-scale systems, knowing the exact states of each load is not realistic. Second, classical flexibility representations often rely on a precise state-transition model on the aggregator's side. Third, traditional ISO market designs, such as a day-ahead energy market, often make use of ex ante estimates of future system states. The forecasts of the future states can sometimes be far from reality, either because an inaccurate model is used or because an unexpected event occurs. In contrast, a real-time energy market [30], [31] provides more robust system control when facing uncertainty in the environment, e.g., from fast-changing renewable resources or human behavioral parameters. This further highlights the need for real-time flexibility feedback, and serves to differentiate the approach in our paper.

Notation and Conventions. We use P(·) and E(·) to denote the probability distribution and expectation of random variables. The (differential) entropy function is denoted by H(·). To distinguish random variables and their realizations, we follow the convention of denoting the former by capital letters (e.g., U) and the latter by lower-case letters (e.g., u). Furthermore, we denote the length-t prefix of a vector u by u_{≤t} := (u_1, . . . , u_t). Similarly, u_{<t} := (u_1, . . . , u_{t−1}) and u_{a→b} := (u_a, . . . , u_b). The concatenation of two vectors u and v is denoted by (u, v). Given two vectors u, v ∈ R^N, we write u ⪯ v if u_i ≤ v_i for all i = 1, . . . , N. For x ∈ R, denote [x]^+ := max{0, x}. The set of non-negative real numbers is denoted by R_+.

The rest of the paper is organized as follows. We present our closed-loop control model in Section II. We define real-time aggregate flexibility, called the MEF, and prove its properties in Section III. An RL-based approach for estimating the MEF is provided in Section IV. Combining the MEF and model predictive control (MPC), we propose an algorithm, termed the PPC, in Section V-B. Numerical results are given in Section VI. Finally, we conclude the paper in Section VII.

II. PROBLEM FORMULATION
In this paper, we consider a real-time control problem involving two parties -- a load aggregator and an independent system operator (ISO), simply called an operator -- that interact over a discrete time horizon [T] := {1, . . . , T}.

A. Load aggregator
A load aggregator is a device, often considered a local controller, that controls a fleet of controllable loads. In this part, we formally state the model of an aggregator and its objective. Let x_t denote the aggregator state at time t, taking values in a set X ⊆ R^m. At each time t, the aggregator receives an action u_t ∈ U from a system operator (formally defined in Section II-B), where U ⊆ R denotes a closed and bounded set of actions. The action space U and state space X are fixed in advance and are common knowledge to both the aggregator and the system operator. The goal of the aggregator is to accomplish a certain task over the horizon [T], e.g., delivering energy to a set of EVs by their deadlines while minimizing costs, subject to system constraints. Mathematically, the constraints are represented by two collections of time-varying and time-coupled sets

X_t(x_{<t}, u_{<t}) ⊆ X  and  U_t(x_{<t}, u_{<t}) ⊆ U,  t ∈ [T].

For notational simplicity, we denote X_t(x_{<t}, u_{<t}) by X_t and U_t(x_{<t}, u_{<t}) by U_t in the remaining contexts. The states and actions must satisfy x_t ∈ X_t and u_t ∈ U_t for all t ∈ [T]. The action changes the aggregator state x_t according to a state transition function f_t:

x_{t+1} = f_t(x_t, u_t),  t ∈ [T].   (1)

The initial state x_1 is assumed to be the origin without loss of generality. We make the following model assumptions: the aggregator has flexibility in its actions u_t for accomplishing its task and, we assume for this paper, is indifferent among these actions as long as the task is accomplished by time T. At each time t, based on its current state x_t, the aggregator needs to send flexibility feedback p_t, a probability density function from a collection of feedback signals P, to the system operator; the feedback describes the flexibility of the aggregator for accepting different actions u_t.
We formally define p t and P in Section III-A. Designing p t is one of the central problems considered in this paper (see Section III for more details). Below we state the aggregator's goal in the real-time control system.
Aggregator's Objective. The goal of the aggregator is twofold: (1) maintain the feasibility of the system, guaranteeing that x_t ∈ X_t and u_t ∈ U_t for all t ∈ [T]; and (2) generate flexibility feedback p_t and send it to the operator at each time t ∈ [T].
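The aggregator model above can be sketched in code. This is a minimal illustration under our own assumptions: the class and method names (Aggregator, step, feasible_actions) are ours, not the paper's, and the toy dynamics stand in for a generic transition function f_t.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Aggregator:
    transition: Callable        # f_t: (t, x_t, u_t) -> x_{t+1}
    feasible_actions: Callable  # (t, x_t) -> the time-varying action set U_t
    state: float = 0.0          # x_1 is the origin, without loss of generality
    history: List[float] = field(default_factory=list)

    def step(self, t: int, u: float) -> float:
        """Apply the operator's action u_t and update the state via f_t."""
        assert u in self.feasible_actions(t, self.state), "u_t must lie in U_t"
        self.state = self.transition(t, self.state, u)
        self.history.append(u)
        return self.state

# Toy instance: the state tracks total delivered energy, capped at 2 units,
# so flexibility shrinks as the task nears completion.
agg = Aggregator(
    transition=lambda t, x, u: x + u,
    feasible_actions=lambda t, x: [0, 1] if x < 2 else [0],
)
agg.step(1, 1)
agg.step(2, 1)
```

Once the cap is reached, `feasible_actions` collapses to {0}: the aggregator has no remaining flexibility to report, which is exactly the information the feedback p_t is meant to convey.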
Remark 1. We assume that the action space U is a continuous set in R only for simplicity of presentation. The results and definitions in the paper can be extended to discrete setting by changing the integrals to summations, and replacing the differential entropy functions by discrete entropy functions, e.g., see the definition of maximum entropy feedback (Definition III.1) and Lemma 2. In practical systems e.g., an electric system consisting of an EV aggregator and an operator, U often represents the set of power levels and when the gap between power levels is small, U can be modeled as a continuous set.

B. System operator
A system operator is a central controller that operates the power grid. Knowing the flexibility feedback p_t from the aggregator, the operator sends an action u_t, chosen from U, to the aggregator at each time t ∈ [T]. Each action is associated with a cost function c_t(·) : U → R_+, e.g., capturing the load that the aggregate EV charging rate places on the electricity grid. The operator's objective is stated as follows.
Operator's Objective. The goal of the system operator is to provide an action u_t ∈ U at each time t ∈ [T] to the aggregator so as to minimize the cumulative system cost C_T(u_1, . . . , u_T) := ∑_{t=1}^{T} c_t(u_t).

C. Real-time operator-aggregator coordination
Overall, considering the aggregator's and operator's objectives, the goal of the closed-loop system is to solve the following problem in real time, by coordinating the operator and aggregator via {p_t : t ∈ [T]} and {u_t : t ∈ [T]}:

min_{u_1, . . . , u_T}  C_T(u_1, . . . , u_T)   (2a)
s.t.  x_{t+1} = f_t(x_t, u_t),  t ∈ [T],   (2b)
      x_t ∈ X_t,  t ∈ [T],   (2c)
      u_t ∈ U_t,  t ∈ [T],   (2d)

i.e., the operator aims to minimize its cost C_T in (2a) while the load aggregator needs to fulfill its obligations in the form of constraints (2b)-(2d). This is an offline problem that involves global information at all times t ∈ [T]. Remark 2. For simplicity, we describe our model in an offline setting where the cost and the constraints in the optimization problem (2) are expressed in terms of the entire trajectories of states and actions. The goal of the closed-loop control system is, however, to solve an online optimization via operator-aggregator coordination.
The challenges are: (i) the aggregator and operator need to solve the online version of (2) jointly, and (ii) the cost function C_T is private to the operator while the constraints (2b)-(2d) are private to the aggregator. It is impractical for the aggregator to communicate the constraints to the operator because of privacy concerns or computational effort. Moreover, in an online setting, even the aggregator does not know the constraints that involve future information, e.g., future EV arrivals at an EV charging station. Formally, at each time t ∈ [T], we assume that the operator and aggregator have access to the following information, respectively: 1) the operator knows the costs (c_1, . . . , c_t) and feedback (p_1, . . . , p_t), but not the future costs (c_{t+1}, . . . , c_T) and feedback (p_{t+1}, . . . , p_T); 2) the aggregator knows the state transition functions (f_1, . . . , f_T), the initial state x_1, and the actions (u_1, . . . , u_t).
System's Goal. Overall, the goal of an aggregator-operator system is to jointly solve the online version of (2a)-(2d), of which the aggregator and the operator each know only partial information.

D. Necessities of combining learning and control
With the assumptions above, on the one hand, the aggregator cannot solve the problem independently because it does not have cost information from the operator (the costs are often sensitive and only of interest to the operator), and even if it did, it may not have enough computational power to solve an optimization to obtain an action. On the other hand, the operator has to receive flexibility information from the aggregator in order to act. Well-known methods in pure learning or control cannot be applied to this problem directly. From a learning perspective, the aggregator cannot simply use reinforcement learning and transmit the parameters of a learned Q-function or an actor-critic model to the operator, because the aggregator does not know the costs. From a control perspective, although model predictive control (MPC) is widely used for EV charging scheduling in practical charging systems [32], [24], it requires precise state information of the electric vehicle supply equipment (EVSE). Thus, to solve the induced MPC problem, the system operator or aggregator needs to solve an online optimization at each time step that involves hundreds or even thousands of variables. Not only is this a complex problem, but the state information of the controllable units is potentially sensitive. This combination makes controlling sub-systems using precise information impractical for a future smart grid [22], [23]. In this work, we explore a solution where the system operator and the aggregator jointly solve an online version of (2) in a closed loop, as illustrated in Figure 1.
The real-time operator-aggregator coordination illustrated in Figure 1 combines learning and control approaches. It does not require the aggregator to know the system operator's objective in (2a), but only the action u t at each time t ∈ [T ] from the operator. In addition, it does not require the system operator to know the aggregator constraints in (2b), but only a feedback signal p t (to be designed) from the aggregator. After receiving flexibility feedback p t , which could be generated by machine learning algorithms, the system operator outputs an action u t using a causal operator function φ t (·) : P → U. Knowing the state x t , the aggregator generates its feedback p t using a causal aggregator function ψ t (·) : X → P where P denotes the domain of flexibility feedback that will be formally defined in Section III-A. By an "online feedback" solution, we mean that these functions (φ t , ψ t ) use only information available locally at time t ∈ [T ].
In summary, the closed-loop control system in our model proceeds as follows. At each time t, the aggregator learns or computes a length-|U| vector p_t based on the previously received action trajectory u_{<t} = (u_1, . . . , u_{t−1}) and sends it to the system operator. The system operator then computes a (possibly random) action u_t = φ_t(p_t) based on the flexibility feedback p_t and sends it to the aggregator. The operator chooses its signal u_t in order to solve the time-t problem in an online version of (2), so the function φ_t denotes the mapping from the flexibility feedback p_t to an optimal solution of the time-t problem; see Section V-B for examples. The aggregator then computes the next feedback p_{t+1} and the cycle repeats; see Algorithm 1.

Algorithm 1: Closed-loop online control framework of a system operator (central controller) and an aggregator (local controller). At each time t, the operator (central controller) generates an action using the PPC; the aggregator (local controller) updates the system state and computes the estimated MEF. Return the total cost C_T.

The goal of this paper is to provide concrete constructions of an aggregator function ψ (as an MEF generator; see Section III) and an operator function φ (via the PPC scheme; see Section V-B).
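The closed-loop protocol above can be sketched schematically. This is an illustration under simplifying assumptions: U is a small discrete set, and `aggregator_feedback` (the aggregator function ψ_t) and `operator_policy` (the operator function φ_t) are placeholders of our own design, standing in for the MEF generator (Section III) and the PPC scheme (Section V-B); time is 0-indexed for convenience.

```python
U = [0, 1]   # discrete action set, common knowledge to both parties
T = 3
costs = [lambda u: u, lambda u: 2 * u, lambda u: 0.5 * u]  # c_t, operator-private

def aggregator_feedback(t, u_hist):
    # placeholder MEF-like rule: one unit must be delivered by time T, so the
    # last slot loses all flexibility if nothing has been delivered yet
    if t == T - 1 and sum(u_hist) == 0:
        return {0: 0.0, 1: 1.0}
    return {u: 1.0 / len(U) for u in U}

def operator_policy(t, p_t):
    # pick the cheapest action among those the feedback marks as flexible
    feasible = [u for u in U if p_t[u] > 0]
    return min(feasible, key=costs[t])

u_hist, total_cost = [], 0.0
for t in range(T):
    p_t = aggregator_feedback(t, u_hist)  # aggregator -> operator: feedback p_t
    u_t = operator_policy(t, p_t)         # operator -> aggregator: action u_t
    total_cost += costs[t](u_t)
    u_hist.append(u_t)                    # aggregator updates its state
```

Here the feedback steers the operator to defer the mandatory unit of energy while never allowing an infeasible trajectory, which is precisely the role the MEF plays in the PPC.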
In the sequel, we demonstrate our system model using an EV charging application, as an example of the problem stated in (2).

E. An EV Charging Example
Consider an aggregator that is an EV charging facility with N accepted users. Each user j has a private vector (a(j), d(j), e(j), r(j)) ∈ R^4, where a(j) denotes its arrival (connection) time; d(j) its departure (disconnection) time, normalized according to the time indices in [T]; e(j) the total energy to be delivered; and r(j) its peak charging rate. Fixing a set of N users with their private vectors (a(j), d(j), e(j), r(j)), the aggregator state at time t collects the pair (e_t(j), d_t(j)) for each EV that has arrived and has not departed by time t.
Here e_t(j) is the remaining energy demand of user j at time t and d_t(j) is the remaining charging time. The decision s_t(j) is the energy delivered to user j at time t, determined by a scheduling policy π_t such as the well-known earliest-deadline-first, least-laxity-first, etc. Let s_t := (s_t(1), . . . , s_t(N)); then s_t = π_t(u_t), where u_t in this example is the aggregate substation power level, chosen from a small discrete set U.
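As an illustration, one possible disaggregation policy π_t is least-laxity-first. The sketch below is ours, not the paper's implementation; variable names mirror the text (e: remaining energy e_t(j), d: remaining time d_t(j), r: peak rates r(j)), and a unit time slot ∆ = 1 is assumed.

```python
def least_laxity_first(u_t, e, d, r):
    """Split the aggregate power u_t: return s_t(j) with sum_j s_t(j) <= u_t."""
    # laxity = remaining time minus the time needed when charging at peak rate
    order = sorted(e, key=lambda j: d[j] - e[j] / r[j])
    s, remaining = {j: 0.0 for j in e}, u_t
    for j in order:
        s[j] = min(r[j], e[j], remaining)  # respect peak rate and demand
        remaining -= s[j]
    return s

# Two EVs with equal demands: EV 0 has zero laxity, EV 1 has two slots of slack.
e = {0: 2.0, 1: 2.0}   # remaining energy demands e_t(j)
d = {0: 2.0, 1: 4.0}   # remaining charging times d_t(j)
r = {0: 1.0, 1: 1.0}   # peak rates r(j)
s = least_laxity_first(1.0, e, d, r)  # only one unit of power to allocate
```

The urgent EV receives the full unit (s[0] == 1.0) while the flexible one waits, illustrating how a scheduling policy turns the scalar signal u_t into per-EV decisions.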
The aggregator decision s_t(j) ∈ R_+ at each time t updates the state, in particular e_t(j), such that

e_{t+1}(j) = e_t(j) − s_t(j)∆,   (3a)
d_{t+1}(j) = d_t(j) − 1,   (3b)

where ∆ denotes the time unit and we assume that there is no energy loss. The laws (3a)-(3b) are examples of the generic transition functions f_1, . . . , f_T in (1). Suppose, in the context of demand response, the system operator (a local utility company, or a building management) sends a signal u_t that is the aggregate energy that can be allocated to EV charging. The aggregator makes charging decisions s_t(j) to track the signal u_t received from the system operator, as long as they meet the energy demands of all users before their deadlines. Then the constraints in (2b)-(2d) become the following constraints on the charging decisions s_t, as a function of u_t:

s_t(j) = 0,  if t < a(j) or t ≥ d(j),  j ∈ [N],   (4a)
0 ≤ s_t(j) ≤ r(j),  j ∈ [N],   (4b)
∑_{j=1}^{N} s_t(j) = u_t,  t ∈ [T],   (4c)
∑_{t=1}^{T} s_t(j)∆ = e(j),  j ∈ [N].   (4d)

In the above, constraint (4c) ensures that the aggregator decision s_t tracks the signal u_t at each time t ∈ [T], constraint (4d) guarantees that EV j's energy demand is satisfied, and the other constraints say that the aggregator cannot charge an EV before its arrival, after its departure, or at a rate that exceeds its limit.

III. REAL-TIME AGGREGATE FLEXIBILITY

In this section we propose a specific function ψ_t in the class defined by (6) for computing flexibility feedback that quantifies the aggregator's future flexibility. We will justify our proposal by showing that the proposed ψ_t has several desirable properties for solving an online version of (2) using the real-time feedback-based approach described in Section II.

A. Definition of Flexibility Feedback p t
A major challenge in our problem is that the operator has direct access to neither the feasible set nor the dynamics. Therefore, a notion termed aggregate flexibility has to be designed. It is often a "simplified" summary of the constraints in (2b)-(2d), as reviewed in Section I-B. Notably, existing aggregate flexibility definitions (for instance, in [3], [4], [5], [6], [7], [8], [9], [10]) all focus on the offline version of (2). It remains unclear, first, what the right notion of real-time aggregate flexibility is, i.e., what the right form of the flexibility feedback p_t is, and second, how this p_t can be used by an operator.
In the following, we present a design of the flexibility feedback p_t, which was first proposed in our previous work [22] for discrete U and [23] for continuous U. It quantifies the future flexibility that will be enabled by an operator action u_t. The feedback p_t is therefore a surrogate for the aggregator constraints (2b)-(2d) that guides the operator's decision. Let u := (u_1, . . . , u_T). Specifically, define the set of all feasible action trajectories for the aggregator as:

S := {u ∈ U^T : x_{t+1} = f_t(x_t, u_t), x_t ∈ X_t, u_t ∈ U_t, t ∈ [T]}.

The following property of the set S is useful; its proof can be found in Appendix A.
Lemma 1. The set of feasible action trajectories S is Borel measurable.
Existing aggregate flexibility definitions focus on approximating S, e.g., finding a convex approximation of it (see Section I-B for more details). Our problem formulation instead needs a real-time approximation of this set, i.e., a decomposition of S along the time axis t = 1, . . . , T. Throughout, we assume that S is non-empty. Next, we define the space of flexibility feedback p_t. Formally, we let P denote a set of density functions such that each p_t(·|u_{<t}) : U → [0, 1] is a conditional density function in P, generated by a causal aggregator function:

p_t(·|u_{<t}) = ψ_t(x_t),  t ∈ [T].   (6)

We refer to p_t as the flexibility feedback sent at time t ∈ [T] from the aggregator to the system operator. In this sense, (6) does not specify a particular aggregator function ψ_t, but a class of possible functions ψ_t. Every function in this class is causal in that it depends only on information available to the aggregator at time t. In contrast to most aggregate flexibility notions in the literature [3], [4], [5], [6], [7], [8], [9], [10], the flexibility feedback here is specifically designed for an online feedback control setting.

B. Maximum entropy feedback
The intuition behind our proposal is to use the conditional probability p_t(u_t|u_{<t}) to measure the future flexibility of the aggregator that remains if the system operator chooses u_t as the signal at time t, given the action trajectory up to time t − 1. The sum of the conditional entropies of the p_t is thus a measure of how informative the overall feedback is. This suggests choosing conditional distributions p_t that maximize the conditional entropy. Consider the optimization problem:

max  ∑_{t=1}^{T} H(U_t | U_{<t})   (7a)
s.t.  P(U ∈ S) = 1,   (7b)
over  p_t(·|u_{<t}) ∈ P,  t ∈ [T],   (7c)

where the variables in (7c) are conditional density functions, U ∈ U^T is a random variable distributed according to the joint distribution ∏_{t=1}^{T} p_t, and H(U_t|U_{<t}) is the differential conditional entropy of p_t, defined as:

H(U_t | U_{<t}) := −E[log p_t(U_t | U_{<t})].

By definition, a quantity conditioned on "u_{<1}" means an unconditional quantity, so in the above, H(U_1|U_{<1}) := H(U_1) := H(p_1).

Definition III.1 (Maximum entropy feedback). The optimal solution (p*_1, . . . , p*_T) of (7a)-(7c) is called the maximum entropy feedback (MEF).
The chain rule gives ∑_{t=1}^T H(U_t|U_{<t}) = H(U). Hence (7) can be interpreted as maximizing the entropy H(U) of a random trajectory U sampled according to the joint distribution ∏_{t=1}^T p_t, conditioned on U ∈ S, where the maximization is over the collection of conditional distributions (p_1, ..., p_T).
Remark 3. Even though the optimization problem (7) involves variables p_t for the entire time horizon [T], each individual variable p_t in (7c) is a conditional probability that depends only on information available to the aggregator at time t. Therefore the maximum entropy feedback p*_t in Definition III.1 is indeed causal and belongs to the class defined in (6). The existence of p*_t is guaranteed by Lemma 2 below, which also implies that p*_t is unique. We demonstrate Definition III.1 with a toy example.
Example III.1 (Maximum entropy feedback p*). Consider the following instance of the EV charging example in Section II-E. Suppose the number of charging time slots is T = 3 and there is one customer whose private vector is (1, 3, 1, 1), and the possible energy levels are 0 kWh and 1 kWh, i.e., U ≡ {0, 1}. Since there is only one EV, the scheduling (disaggregation) policy assigns all power to this single EV. For this particular choice, the set of feasible trajectories is S = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}, shown in Figure 2 together with the corresponding optimal conditional distributions given by (7).
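For discrete U, with µ the counting measure, the optimal conditional distributions of this toy example can be computed by direct enumeration, counting feasible continuations of each prefix. A minimal sketch:

```python
# Toy instance from Example III.1: U = {0, 1}, T = 3, and the only
# feasible trajectories deliver exactly 1 kWh in one of the three slots.
S = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}

def mef(prefix, u):
    """Maximum entropy feedback p*_t(u | u_<t): the fraction of feasible
    trajectories extending `prefix` that continue with action `u`."""
    t = len(prefix)
    parent = sum(1 for s in S if s[:t] == tuple(prefix))
    child = sum(1 for s in S if s[:t + 1] == tuple(prefix) + (u,))
    return child / parent if parent else 0.0

# At t = 1 (empty prefix), two of the three feasible trajectories start with 0.
print(mef((), 0), mef((), 1))       # 0.666... 0.333...
# Conditioned on u_1 = 0, the remaining flexibility splits evenly.
print(mef((0,), 0), mef((0,), 1))   # 0.5 0.5
```

Choosing u_1 = 0 leaves two feasible continuations while u_1 = 1 leaves only one, which is exactly what the larger conditional probability signals to the operator.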

C. Properties of p*_t
We now show that the proposed maximum entropy feedback p*_t has several desirable properties, starting with an explicit expression for p*_t. Given any action trajectory u_{≤t}, define the set of subsequent feasible trajectories as

$$S(u_{\le t}) := \{ (u_{t+1}, \dots, u_T) \in U^{T-t} : (u_{\le t}, u_{t+1}, \dots, u_T) \in S \}.$$

As a corollary of Lemma 1, each S(u_{≤t}) is also Borel measurable. Its measure µ(S(u_{≤t})) quantifies the future flexibility, conditioned on u_{≤t}. Our first result justifies calling p*_t the optimal flexibility feedback: p*_t is a measure of the future flexibility that will be enabled by the operator's action u_t, and it attains a measure of the system's capacity for flexibility. By definition, p*_1(u_1|u_{<1}) := p*_1(u_1).

Lemma 2. Let µ(·) denote the Lebesgue measure. The MEF, i.e., the optimal solution of the maximization (7a)-(7c), is given by

$$p_t^*(u_t \mid u_{<t}) = \frac{\mu(S(u_{\le t}))}{\mu(S(u_{<t}))}, \quad t \in [T], \qquad \text{(8)}$$

with the convention S(u_{<1}) := S. Moreover, the optimal value of (7a)-(7c) equals log µ(S).
Remark 4. When the denominator µ(S(u_{<t})) is zero, the numerator µ(S((u_{<t}, u))) must also be zero. In this case, we set p*_t(u|u_{<t}) = 0, which does not affect the optimality of (7a)-(7c).
The proof can be found in Appendix B. The volume µ(S) is a measure of the flexibility inherent in the aggregator; we hence call log µ(S) the system capacity. Lemma 2 then says that the optimal value of (7) is the system capacity log µ(S). Moreover, the maximum entropy feedback (p*_1, ..., p*_T) is the unique collection of conditional distributions that attains the system capacity in (7). This is intuitive, since the entropy of a random trajectory in S is maximized by the uniform distribution q* in (21), which is induced by the conditional distributions (p*_1, ..., p*_T). Lemma 2 implies the following important properties of the maximum entropy feedback.

Corollary III.1. 1) For any u ∈ U^T, u ∈ S if and only if p*_t(u_t|u_{<t}) > 0 for all t ∈ [T]. 2) For all u_t, u'_t ∈ U at each time t ∈ [T], if p*_t(u_t|u_{<t}) ≥ p*_t(u'_t|u_{<t}), then µ(S((u_{<t}, u_t))) ≥ µ(S((u_{<t}, u'_t))).
The proof is provided in Appendix C. We next elaborate on the implications of Corollary III.1 for our online feedback-based solution approach.
Remark 5 (Feasibility and flexibility). Corollary III.1 says that the proposed optimal flexibility feedback p*_t provides the right information for the system operator to choose its action u_t at time t.
1) (Feasibility) Specifically, the first statement of the corollary says that if the operator always chooses an action u_t with positive conditional probability p*_t(u_t|u_{<t}) > 0 at each time t, then the resulting action trajectory is guaranteed to be feasible, u ∈ S, i.e., the system remains feasible at every time t ∈ [T] along the way.
2) (Flexibility) Moreover, by the second statement of the corollary, if the system operator chooses an action u_t with a larger p*_t(u_t|u_{<t}) value at time t, then the system will be more flexible going forward than if it had chosen another signal u'_t with a smaller p*_t(u'_t|u_{<t}) value, in the sense that there are more feasible trajectories in S((u_{<t}, u_t)).
As noted in Remark 2, despite characterizations that involve the whole action trajectory u, such as u ∈ S, these are online properties. This guarantees the feasibility of the online closed-loop control system depicted in Figure 1 and confirms the suitability of p*_t for online applications.

IV. APPROXIMATING MAXIMUM ENTROPY FEEDBACK VIA REINFORCEMENT LEARNING
For real-world applications, computing the maximum entropy feedback (MEF) exactly can be computationally intensive, so instead of computing it precisely, it is desirable to approximate it. In this section, we discuss the use of model-free reinforcement learning (RL) to generate an aggregator function ψ. For practical implementation, we switch to the case where U is a discrete set and reuse the notation P to denote the probability simplex that contains all possible discrete MEF:

$$P := \Big\{ p \in [0,1]^{|U|} : \sum_{u \in U} p(u) = 1 \Big\}. \qquad \text{(9)}$$

We demonstrate that RL can be used to train a generator that outputs an approximate MEF given the state of the system. More precisely, the learned aggregator function ψ : X → P outputs an estimate of the MEF given the state x_t at each time t ∈ [T], where X is the state space and P is the set of all possible MEF. Note that the aggregator does not know the cost functions, so it cannot directly run an RL algorithm and transmit the learned Q-function or actor-critic model to the operator. Moreover, even if the aggregator knew the cost functions, generating actions with RL requires solving two conflicting tasks, optimizing rewards while penalizing feasibility violations, which makes the design of the reward function and reward clipping challenging. In our approach, we separate the tasks of enforcing feasibility and minimizing costs: we generate the MEF as feasibility signals via reinforcement learning, and optimize the operator's objective via an MPC-based method (introduced in Section V-B). It is also worth noting that a number of effective heuristics may be available, such as the greedy approximation in [22] and other gradient-based or density estimation [33] methods. We leave the question of finding an optimal approximation algorithm to future work.
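Mapping a raw network output back onto the simplex (9) can be done in several ways; the sketch below uses the standard sort-based Euclidean projection, which is our assumption here, since any normalization producing a valid distribution would serve the same purpose:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1}
    (sort-based algorithm, O(n log n))."""
    n = v.size
    u = np.sort(v)[::-1]                  # sort descending
    css = np.cumsum(u)
    # largest index rho with u_rho + (1 - css_rho) / (rho + 1) > 0
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, n + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

raw = np.array([0.5, -0.2, 1.3, 0.1])     # raw network output, not a distribution
p = project_to_simplex(raw)
print(p)           # [0.1 0.  0.9 0. ]
print(p.sum())     # 1.0
```

The projection zeroes out components that are far below the rest, so an output that strongly favors a few actions stays sparse after clipping.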

A. Offline learning of aggregator functions
To learn an aggregator function ψ for estimating the MEF, we use an actor-critic architecture [34] with separate policy and value-function networks, enabling the learning of policies on continuous action and state spaces. The actor-critic architecture is presented in Figure 3, which shows the information flow between the actor and critic networks. Note that in practical actor-critic algorithms, the policy, Q-function(s) and value function(s) are typically modeled by deep neural networks whose parameters are updated via policy iteration with stochastic gradient descent. We omit these details in Figure 3.

B. Training process
During the training process, the data used for defining the training dynamics are the episodes (U_t, X_t, f_t), t = 1, ..., T. For example, for the EV charging application in Section II-E, the training data of each episode (day) consist of the historical private vectors (a(j), d(j), e(j), r(j)) specified by the users who visited the charging station on the corresponding day. Among actor-critic-based RL algorithms, off-policy actor-critic methods such as deep deterministic policy gradient (DDPG) [35] and soft actor-critic (SAC) [36] are known to attain better data efficiency in many applications. Below we take SAC, a maximum entropy deep RL algorithm, as an example to demonstrate the offline learning of an aggregator function ψ. In particular, for learning ψ, the objective of SAC is to maximize both the expected return and the expected entropy of the policy:

$$J(\psi) = \sum_{t=1}^{T} \mathbb{E}_{(\bar{x}_t, p_t) \sim \rho_\psi} \big[ r(\bar{x}_t, p_t) + \alpha H(\psi(\cdot \mid \bar{x}_t)) \big], \qquad \text{(10)}$$

where x̄_t := (x_t, u_t); ρ_ψ denotes the state-action marginals of the trajectory distribution induced by a policy ψ; and r(x̄_t, p_t) is a customized reward function. To estimate the MEF, we need to specify the reward function r(x̄_t, p_t) in (10). We adopt the following reward function, which incorporates both the constraints and the definition of the MEF:

$$r(\bar{x}_t, p_t) = H(p_t) + g(\bar{x}_t), \qquad \text{(11)}$$

where the first term is critical: it maximizes the entropy of the probability distribution p_t, based on the definition of the MEF in Definition III.1; g(x̄_t) = g(x_t, u_t) is a function that rewards the state and action if they satisfy the constraints x_t ∈ X_t and u_t ∈ U_t. The reward function is independent of the cost functions, which are synthetic in the training stage. A concrete example of g(x̄_t) is given in Section VI. We clip the output MEF given by the policy to ensure it is a probability vector in the simplex defined in (9). Figure 4 shows a training curve, displaying how the reward changes with the number of training episodes.
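A minimal illustration of the reward structure in (11), with an indicator-style g of our own choosing (the concrete g used in the experiments is given in Section VI):

```python
import math

def entropy(p):
    """Shannon entropy H(p), with the convention 0 * log(0) = 0."""
    return -sum(q * math.log(q) for q in p if q > 0)

def reward(p_t, state_feasible, action_feasible, bonus=1.0):
    """Sketch of the training reward (11): r = H(p_t) + g(x_t, u_t).
    This g is an assumed indicator bonus: +bonus when (x_t, u_t)
    satisfies x_t in X_t and u_t in U_t, -bonus otherwise."""
    g = bonus if (state_feasible and action_feasible) else -bonus
    return entropy(p_t) + g

# Uniform feedback over 10 levels with a feasible state-action pair:
print(round(reward([0.1] * 10, True, True), 4))   # ln(10) + 1 = 3.3026
```

The entropy term pushes the policy toward informative (spread-out) feedback, while g penalizes any trajectory that leaves the feasible sets.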

C. Testing process
With a trained aggregator function ψ that approximately optimizes J(ψ) in (10), we test the closed-loop system on new episodes defined by the testing data, as shown in Figure 3. The trained aggregator function (parameterized by a deep neural network) is used as a "black box" that maps each state x_t to feedback p_t. Note that the real costs used in the testing process may not be the same as the synthetic costs used in the training process, because the aggregator has no access to the costs, as assumed in Section II.
In the sequel, with the learned MEF, we introduce a closed-loop framework that combines model predictive control (MPC) and RL to coordinate a system operator and an aggregator in real time. It is worth noting that the learned MEF may differ from the exact MEF in Definition III.1. However, we show in Section VI that with the learned MEF, the constraints on the aggregator's side can almost be satisfied with a reasonable tuning parameter. In the EV charging example described in Section II-E, this means the EVs' batteries are fully charged; see Figure 6 for details.

V. PENALIZED PREDICTIVE CONTROL

Consider the system model in Section II. In this setting, the operator seeks to minimize the cost in an online manner, i.e., at time t ∈ [T] the operator knows only the objective functions c_1, ..., c_t and the flexibility feedback p_1, ..., p_t. The task of the operator is, given the maximum entropy feedback, to design a sequence of operator functions φ_1, ..., φ_T that generate actions u_1, ..., u_T that are always feasible with respect to the constraints and minimize the cumulative cost.

A. Key Idea: Maximum entropy feedback as a penalty term
There is in general a trade-off between ensuring future flexibility and minimizing the current system cost in predictive control. The action u_t that guarantees the maximal future flexibility, i.e., has the largest p*_t(u_t|u_{<t}), may not be the one that minimizes the current cost function c_t, and vice versa. Therefore, the online algorithm for the central controller must balance future flexibility against current cost. The key idea is to use the MEF as a penalty term in the offline optimization problem. Note that Corollary III.1 guarantees that the online agent can always find a feasible action u ∈ S. Indeed, knowing the MEF p*_t for every t ∈ [T] is equivalent to knowing the set S of all admissible action sequences. To see this, consider the unique maximum entropy feedback (p*_1, ..., p*_T) guaranteed by Lemma 2 and let q(u) = ∏_{t=1}^T p*_t(u_t|u_{<t}) denote the joint distribution of the action trajectory u. Then (8) implies that the joint distribution q is the uniform distribution over the set S of all feasible trajectories:

$$q(u) = \frac{1}{\mu(S)} \ \text{for } u \in S, \qquad q(u) = 0 \ \text{otherwise}.$$

Using this observation, the constraints (2b)-(2d) in the offline optimization can be rewritten as a penalty in the objective of (2a). We present a useful lemma that both motivates our online control algorithm and underpins the optimality analysis in Section V-D.
Lemma 3. The offline optimization (2a)-(2d) is equivalent to the following unconstrained minimization for any β > 0:

$$\min_{u \in U^T} \ \sum_{t=1}^{T} \Big( c_t(u_t) - \beta \log p_t^*(u_t \mid u_{<t}) \Big). \qquad \text{(13)}$$

The proof of Lemma 3 can be found in Appendix E. It draws a clear connection between the MEF and the offline optimum, which we exploit in the design of an online system operator in the next section.

B. Algorithm: Penalized Predictive Control via MEF
Our proposed design, termed penalized predictive control (PPC), combines model predictive control (MPC) (cf. [37]), a competitive policy for online optimization with predictions, with the idea of using the MEF as a penalty term. This design connects the MEF to the well-known MPC scheme. The MEF, as a feedback function, contains only limited information about the dynamical system on the local controller's side (only the feasibility information for the current and future time slots, as explained in Section III). The PPC scheme is therefore itself a novel contribution: it shows that even when only feasibility information is available, this limited information can still be incorporated into MPC as a penalty term.
We present the PPC in Algorithm 2, using the following notation. Let β_t > 0 be a tuning parameter in predictive control that trades off future flexibility against minimization of the current system cost at each time t ∈ [T]. The next corollary follows; its proof is in Appendix C.
Corollary V.1 (Feasibility of PPC). When p_t = p*_t for all t ∈ [T], i.e., the feedback is the MEF defined in Definition III.1, the sequence of actions u = (u_1, ..., u_T) generated by the PPC in (14) always satisfies u ∈ S, for any sequence of tuning parameters (β_1, ..., β_T).
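For a discrete action set U, one PPC step reduces to a one-dimensional search. The sketch below is illustrative (the cost function, feedback values, and β are made up) and exhibits the feasibility mechanism behind Corollary V.1: actions with zero feedback probability receive an infinite penalty and are never selected.

```python
import math

def ppc_step(U, cost, p_t, beta):
    """One penalized predictive control step in the spirit of (14):
    u_t = argmin_u  c_t(u) - beta * log p_t(u).
    Actions with p_t(u) = 0 incur an infinite penalty, so the chosen
    action always has positive MEF (Corollary V.1)."""
    def objective(u):
        p = p_t(u)
        return cost(u) + (math.inf if p <= 0.0 else -beta * math.log(p))
    return min(U, key=objective)

# Toy example: quadratic tracking cost, feedback that rules out u = 0.
U = [0.0, 15.0, 30.0, 45.0]
p_t = {0.0: 0.0, 15.0: 0.5, 30.0: 0.3, 45.0: 0.2}.__getitem__
cost = lambda u: (u - 10.0) ** 2
print(ppc_step(U, cost, p_t, beta=100.0))   # 15.0: u = 0 is infeasible
```

A larger β biases the choice toward actions with larger feedback probability (more remaining flexibility), while a small β lets the current cost dominate among the feasible actions.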

C. Framework: Closed-loop control between local and central controllers
Given the PPC scheme described above, we can now formally present our online control framework for the distant central controller and local controller (defined in Section II). Recall that an overview of the closed-loop control framework has been given in Algorithm 1, where φ denotes an operator function and ψ is an aggregator function. To the best of our knowledge, this paper is the first to consider such a closed-loop control framework with limited information communicated in real-time between two geographically separate controllers seeking to solve an online control problem. We present the framework below.
At each time t ∈ [T ], the local controller first efficiently generates estimated MEF p t ∈ P using an aggregator function ψ t trained by a reinforcement learning algorithm. After receiving the current MEF p t and cost function c t (future w MEF and costs if predictions are available), the central controller uses the PPC scheme in Algorithm 2 to generate an action u t ∈ U and sends it back to the local controller. The local controller then updates its state x t ∈ X to a new state x t+1 based on the system dynamic in (1) and repeats this procedure again. In the next Section, we use an EV charging example to verify the efficacy of the proposed method.

D. Optimality Analysis
To end our discussion of the PPC, we focus on optimality. For ease of analysis, we assume that the action space U is the set of real numbers R; however, as noted in Remark 1, our system and the definition of the MEF are also consistent with a discrete action space.
To understand the optimality of the PPC, we adopt standard regularity assumptions on the cost functions and the time-varying constraints. We assume the cost functions are strictly convex and differentiable, which is common in practice. Further, let µ(·) denote the Lebesgue measure. Note that the set of subsequent feasible action trajectories S(u_{≤t}) is Borel measurable for all t ∈ [T], as implied by the proof of Corollary 1. We also assume that the measure µ(S(u_{≤t})) is differentiable and strictly logarithmically concave with respect to the subsequence of actions u_{≤t} = (u_1, ..., u_t) for all t ∈ [T]; this is also common in practice, e.g., it holds for inventory constraints of the form ∑_{t=1}^T u_t² ≤ B with a budget B > 0. Finally, recall the definition of the set of subsequent feasible action trajectories:

$$S(u_{\le t}) := \{ (u_{t+1}, \dots, u_T) \in U^{T-t} : (u_{\le t}, u_{t+1}, \dots, u_T) \in S \}.$$

Putting the above together, we can state our assumption formally as follows. Crucially, Theorem V.1 shows that there exists a sequence of "good" tuning parameters such that the PPC scheme generates optimal actions under reasonable assumptions. However, note that the assumption U = R is fundamental. When the action space U is discrete, or U is high-dimensional, it is in general impossible to generate the optimal actions because, for fixed t, the differential equations in the proof of Theorem V.1 (see Appendix F) do not have the same solution for all β_t > 0. A detailed regret analysis is therefore needed in such cases, which is a challenging task for future work.

VI. NUMERICAL RESULTS
In this section, we present experimental results for the case of online EV charging, introduced in Section II-E as an example of our system model (see Section II). The notation used in this section, if not defined, can be found in Section II-E.

A. Experimental setups
In the following, we present settings of parameters and useful metrics in our experiments.
1) Dataset and hardware: We use real EV charging data from ACN-Data [24], which is a dataset collected from adaptive EV charging networks (ACNs) at Caltech and JPL. The detailed hardware setup for that EV charging network structure can be found in [38].
2) Error metrics: Recall the EV charging example in optimization (5a)-(5b). We first introduce two error metrics to measure EV charging constraint violations. Note that constraints (4a), (4b) and (4e) are hard constraints that depend only on the scheduling policy, not on the actions and energy demands; they are automatically satisfied in our experiments by fixing a scheduling policy, such as least-laxity-first, that respects them. Violations may occur for constraints (4c) and (4d). To measure the violation of (4c), we use the (normalized) mean squared error (MSE) as the tracking error, where u_t^{(k)} is the t-th power signal for the k-th test and s_t^{(k)}(j) is the energy scheduled to the j-th charging session at time t in the k-th test; the total number of tests is L and the total number of charging sessions is N. To better approximate real-world conditions, we impose an additional operational constraint on the operator (central controller) and require u_t ≤ ξ (kWh) for every t ∈ [T]. Additionally, we define the mean percentage error (MPE) with respect to the undelivered energy, corresponding to (4d), where e_j is the energy requested by charging session j ∈ [N].
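Since the exact normalizations of (15)-(16) are not reproduced here, the sketch below is our reading of the two metrics: the MSE averages the squared tracking error over tests and slots, and the MPE averages the relative undelivered energy over tests and sessions. The array shapes and the clipping of negative undelivered energy are assumptions.

```python
import numpy as np

def tracking_mse(u, s):
    """Tracking error for constraint (4c): mean squared difference between
    the operator's power signal u_t^(k) and the delivered aggregate power.
    u: (L, T) array; s: (L, T, N) array of per-session schedules s_t^(k)(j)."""
    delivered = s.sum(axis=2)          # aggregate delivered power, shape (L, T)
    return np.mean((u - delivered) ** 2)

def undelivered_mpe(e, s):
    """Mean percentage error for constraint (4d): average over tests and
    sessions of the undelivered fraction (e_j - sum_t s_t(j)) / e_j."""
    delivered = s.sum(axis=1)          # energy per session, shape (L, N)
    return np.mean(np.clip(e - delivered, 0.0, None) / e)

# Toy check: L = 1 test, T = 2 slots, N = 2 sessions.
u = np.array([[3.0, 1.0]])
s = np.array([[[2.0, 1.0],             # slot 1: session 1 gets 2, session 2 gets 1
               [1.0, 0.0]]])           # slot 2: session 1 gets 1, session 2 gets 0
print(tracking_mse(u, s))                          # 0.0 (signal tracked exactly)
print(undelivered_mpe(np.array([3.0, 2.0]), s))    # 0.25 (session 2 half served)
```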
3) Hyper-parameters: The detailed parameters used in our experiments are shown in Table I.
a) MDP states: the state vector encodes the private information of the active charging sessions (see Section II-E); when there are no active sessions, the vector is an all-zero vector. The control action space is U = {0, 15, 30, . . . , 150} (unit: kW) with |U| = 10, unless explicitly stated. The scheduling policy π is fixed to be least-laxity-first (LLF).
b) RL spaces: The RL action space of the Markov decision process used in the RL algorithm is [0, 1]^10. The outputs of the neural networks are then clipped into the probability simplex (the space of MEF) P.
c) RL rewards: We use a reward function specific to our EV charging scenario as a concrete example of (11); in it, σ_1, σ_2 and σ_3 are positive constants; N is the number of EVs being charged; φ_t is the operator function, specified by (14); I(·) denotes an indicator function; and a(j_i) is the arrival time of the i-th EV at the charging station, with j_i being the index of this EV among all accepted charging sessions [N]. The entropy term H(p_t) is a greedy approximation of the definition of MEF (see Definition III.1). The second term further enhances charging performance, and the last two terms are realizations of the last term in (11) for constraints (4c) and (4d). Note that the other constraints in the example of Section II-E are automatically satisfied by enforcing them in the fixed scheduling policy π. With the settings described above, Figure 4 shows a typical training curve of the reward function in (17). We observe policy convergence for a wide range of choices of the hyper-parameters σ_1, σ_2 and σ_3. In our experiments we do not optimize them, but fix σ_1 = 0.1, σ_2 = 0.2 and σ_3 = 2 in (17).
d) Cost functions: We consider the specific form of costs in (5a).
In the RL training process, we train an aggregator function ψ using linear price functions c_t = 1 − t/24, where t ∈ [0, 24] (unit: hours) is the time index, and we test the trained system with real price functions c_1, ..., c_T taken as the average locational marginal prices (LMPs) on the CAISO (California Independent System Operator) day-ahead market in 2016 (depicted at the bottom of Figure 7).
e) Tuning parameters: In the PPC defined in Algorithm 2, there is a sequence of tuning parameters (β_1, ..., β_T). In our experiments, we fix β_t = β for all t ∈ [T], where β > 0 is a universal tuning parameter that we vary in our experiments.
B. Experimental results
1) Sensitivity of β: We first show how changes of the tuning parameter β affect the total cost and feasibility. Figure 5 compares the results for varying β. The agents are trained on data collected from Nov. 1, 2018 to Dec. 1, 2019 and tested on data from Dec. 2, 2019 to Jan. 1, 2020. Weekends and days with fewer than 30 charging sessions are removed from both the training and testing data. For charging performance, Figure 6 shows the battery state of each session after the charging cycle ends, tested with tuning parameters β = 2 × 10^3, 4 × 10^3 and 6 × 10^3. The results indicate that with a sufficiently large tuning parameter, the charging actions given by the PPC satisfy the EVs' charging demands; in practice, there is a trade-off between costs and feasibility that depends on the choice of the tuning parameter.
2) Charging curves: Figure 7 shows substation charging rates (in kW). The charging rates generated by the PPC correspond to the trajectory (∑_j s_1(j)/∆, ..., ∑_j s_T(j)/∆), i.e., the aggregate charging power given by the PPC for all EVs at each time t = 1, ..., T. The agent is trained on data collected at Caltech from Nov. 1, 2018 to Dec. 1, 2019 and tested on Dec. 16, 2019 at Caltech using real LMPs from the CAISO day-ahead market in 2016. We use a tuning parameter β = 4 × 10^3 for both training and testing. The figure highlights that, with a suitable choice of tuning parameter, the operator schedules charging in time slots where prices are low and avoids charging at price peaks, as desired. In particular, it achieves a lower cost than the commonly used MPC scheme described in (18a)-(18f). The offline optimal charging rates are also provided.
3) Comparison of PPC and MPC: Figure 8 shows how the cumulative costs change as the mean percentage error (MPE) with respect to undelivered energy, defined in (16), varies. In total, K = 14 episodes are tested, for days selected from Dec. 2, 2019 to Jan. 1, 2020 (days with fewer than 30 charging sessions are removed, i.e., we require N ≥ 30). Note that 0 ≤ MPE ≤ 1, and the larger the MPE, the higher the level of constraint violation. We allow constraint violations and modify parameters in the MPC and PPC to obtain varying MPE values. For the PPC, we vary the tuning parameter β to obtain the corresponding costs and MPE. For the MPC in our tests, we solve the following optimization at each time to obtain the charging decisions, where at time t the integer N denotes the number of EVs being charged at the charging station; the time horizon of the online optimization runs from τ = t to t′, the latest departure time of the present charging sessions; a(i) and d(i) are the arrival and departure times of the i-th session; and γ > 0 relaxes the energy demand constraints, thereby changing the attainable MPE region for the MPC. The offline cost-energy curve is obtained by varying the energy demand constraints in (4d) in a similar way. We assume there is no admission control, and an arriving EV takes a charger whenever one is idle, for both MPC and PPC. Note that this MPC framework is widely studied [40] and used in EV charging applications [32].
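As a concrete illustration, a stripped-down single-session instance of such an MPC step reduces to a small linear program. The prices, rate limit, and demand below are illustrative only; the actual formulation (18a)-(18f) stacks N sessions and includes the relaxation γ.

```python
import numpy as np
from scipy.optimize import linprog

# One-EV MPC step: schedule charging over the remaining horizon to meet
# the energy demand at minimum cost (slot duration taken as 1 for brevity).
prices = np.array([0.30, 0.10, 0.20])     # c_tau over remaining slots
rate_max = 6.6                            # per-slot charging limit (kW)
demand = 10.0                             # remaining energy demand (kWh)

res = linprog(
    c=prices,                             # minimize sum_tau c_tau * s_tau
    A_eq=np.ones((1, 3)), b_eq=[demand],  # deliver the full demand
    bounds=[(0.0, rate_max)] * 3,         # 0 <= s_tau <= rate limit
    method="highs",
)
print(res.x)    # cheapest slots filled first: [0.  6.6 3.4]
```

Note how the optimizer needs the full per-session data (arrival, deadline, demand, rate limit) for every active EV; in contrast, the PPC only needs the MEF vector over U at each step.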
The MPC requires precise knowledge of a 108-dimensional state vector for the 54 chargers at each time step. We observe that, with only feasibility information, the PPC outperforms the MPC for all 0 ≤ MPE ≤ 1. The main reason is that the PPC uses the MEF as its input, generated by a pre-trained aggregator function; the MEF may therefore contain useful future feasibility information that vanilla MPC does not have, even though the aggregator function is trained and tested on separate datasets.

VII. CONCLUDING REMARKS
This paper formalizes and studies the closed-loop control framework created by the interaction between a system operator and an aggregator. Our focus is on the feedback signal provided by the aggregator to the operator, which summarizes the real-time availability of flexibility among the loads controlled by the aggregator. We present the design of a maximum entropy feedback (MEF) signal based on entropy maximization. We prove a close connection between the MEF signal and the system capacity, and show that when the signal is used, the system operator can perform online cost minimization while provably respecting the private constraints of the loads controlled by the aggregator, with optimality under certain regularity assumptions. Further, we illustrate the effectiveness of these designs using simulation experiments on an EV charging facility.
There is much left to explore about the MEF signal presented in this work. In particular, computing it exactly is computationally intensive, and we use reinforcement learning to approximate it; improving the learning design and developing other approximations are of particular interest. Further, exploring the use of flexibility feedback for operational objectives beyond cost minimization and capacity estimation is an important goal. Finally, applying the proposed real-time aggregate flexibility in other settings, such as multi-aggregator systems, frequency regulation and real-time pricing, is an exciting direction.