Iterative Model Predictive Control for Piecewise Systems

In this paper, we present an iterative Model Predictive Control (MPC) design for piecewise nonlinear systems. We consider finite time control tasks where the goal of the controller is to steer the system from a starting configuration to a goal state while minimizing a cost function. First, we present an algorithm that leverages a feasible trajectory that completes the task to construct a control policy which guarantees that state and input constraints are recursively satisfied and that the closed-loop system reaches the goal state in finite time. Utilizing this construction, we present a policy iteration scheme that iteratively generates safe trajectories which have non-decreasing performance. Finally, we test the proposed strategy on a discretized Spring Loaded Inverted Pendulum (SLIP) model with massless legs. We show that our methodology is robust to changes in initial conditions and disturbances acting on the system. Furthermore, we demonstrate the effectiveness of our policy iteration algorithm in a minimum time control task.


I. INTRODUCTION
Robots performing complex tasks can be described as hybrid systems, which are characterized by continuous dynamics and discrete events. Therefore, controllers designed for such systems can take control actions based on continuous and discrete decision variables. Yet the presence of discrete variables makes planning and control problems challenging, as the controller is required to reason about all possible combinations of discrete events. This challenge can be mitigated by designing hierarchical strategies, where a high-level planner computes the discrete variables and a low-level controller optimizes the system trajectory described by continuous variables [1]-[4].
A popular methodology to synthesize policies which can jointly plan over discrete and continuous states is Model Predictive Control (MPC) [5]-[8]. MPC is a control strategy which systematically uses forecasts to compute control actions. At each time step, the MPC plans a trajectory over a short time window; then the first control action is applied to the system and the process is repeated at the next time step based on new measurements. When the system dynamics are hybrid, the MPC planning problem is a Mixed Integer Program (MIP) that is hard to solve online with limited computational resources. For this reason, significant work has focused on explicit MPC strategies, where the MIP is solved offline as a parametric optimization problem [7]-[9]. For hybrid systems described by piecewise affine dynamics, the parametric optimization problem can be solved exactly. Once the solution is computed offline, the MPC policy is given by a look-up table of feedback gains that can be efficiently implemented online in real time [10], [11]. However, computing the explicit solution to hybrid MPC problems is computationally demanding.
Another strategy to speed up the computation of the MPC policy is to leverage warm-starting, where the optimization algorithm is initialized using a candidate solution. Several strategies have been proposed for warm-starting hybrid MPC problems [12]-[15]. These approaches leverage the trajectory computed at the previous time step to warm-start both the continuous and discrete variables. As the complexity of solving MIPs stems from computing the optimal integer variables, recent works have investigated the possibility of leveraging learning algorithms to predict the set of active discrete variables used to warm-start the MPC [16]-[18].
In this work, we focus on control tasks where the goal is to steer the system from a starting configuration to a goal state in finite time, while satisfying state and input constraints. We assume that a feasible trajectory which performs the task is available. Then, we synthesize a control policy which plans the system trajectory over a finite horizon that is shorter than the control task duration; this shorter horizon may cause the controller to take unsafe shortsighted control actions. Thus, building upon [19], [20], we present a methodology to construct the MPC terminal components in order to guarantee satisfaction of the safety constraints and finite-time convergence of the closed-loop system to the goal set.
Compared to previous works [19], [20], we show how to handle piecewise systems by warm-starting the integer variables, and we present a shrinking horizon strategy tailored to finite time control tasks. We present an algorithm which solves at most M smooth optimization problems and is guaranteed to find a feasible solution to the original MIP planning problem. Our approach is based on a sub-optimal trajectory that completes the task, and the quality of this trajectory affects the closed-loop performance of the controller. Therefore, we present a policy iteration algorithm where simulations are used to iteratively update the controller. We prove that our algorithm returns safe policies with non-decreasing performance. Finally, we demonstrate the effectiveness of our approach on the discretized Spring Loaded Inverted Pendulum (SLIP) [21].

II. PROBLEM FORMULATION
In this section, we describe the system model and the control synthesis objectives. We consider discrete time piecewise nonlinear systems defined over R disjoint regions D_i ⊆ R^n for i ∈ {1, ..., R}:

    x_{t+1} = f_i(x_t, u_t) if x_t ∈ D_i,    (1)

where the state x_t ∈ R^n and the input u_t ∈ R^d. In the above equation f_i : R^n × R^d → R^n represents the system dynamics, which describe the evolution of the discrete time system when the state x_t belongs to the region D_i ⊆ R^n. Furthermore, the system is subject to the following state and input constraints:

    x_t ∈ X and u_t ∈ U, ∀t ∈ {0, ..., T},    (2)

where T ∈ {0, 1, ...} is the duration of the control task. Notice that several robotic systems can be described by piecewise constrained nonlinear models defined over disjoint regions, such as the SLIP model presented in Section V-A.

Objective: Our goal is to design a control policy π : R^n → R^d which maps states to actions, i.e.,

    u_t = π(x_t).    (3)

The above control policy should guarantee that the state and input constraints from (2) are satisfied and that the closed-loop system (1) and (3) converges in finite time to a goal set G ⊂ X. More formally, the control policy (3) should guarantee that, for an initial condition x_I in a neighborhood of a starting state x_S ∈ X, the trajectory of the closed-loop system (1) and (3) is a feasible solution to the following finite time optimal control problem (FTOCP):

    min  Σ_{t=0}^{T-1} l(x_t, u_t)
    s.t. x_{t+1} = f_{i_t}(x_t, u_t), ∀t ∈ {0, ..., T-1},
         x_t ∈ D_{i_t} ∩ X, u_t ∈ U, i_t ∈ {1, ..., R}, ∀t ∈ {0, ..., T-1},
         x_0 = x_I, x_T ∈ G,    (4)

where the optimization is carried out over the inputs [u_0, ..., u_{T-1}] and the region indices [i_0, ..., i_{T-1}], and the stage cost l : R^n × R^d → R. Notice that in the above problem the system dynamics are a function of the integer variables i_t ∈ {1, ..., R}. Therefore, for a feasible set of continuous inputs [u_0, ..., u_{T-1}] and integer variables [i_0, ..., i_{T-1}], the resulting vector of states [x_0, ..., x_T] must satisfy x_t ∈ D_{i_t} ∩ X, ∀t ∈ {0, ..., T-1}.
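To make the piecewise structure in (1) concrete, the following Python sketch simulates such a system by pairing each region D_i with a membership test and its dynamics f_i; the one-dimensional regions and dynamics are illustrative placeholders, not the model used in this paper.

```python
import numpy as np

# Two illustrative regions split at x = 0, each paired with its dynamics f_i.
regions = [
    (lambda x: x[0] < 0.0,  lambda x, u: x + 0.1 * (2.0 * x + u)),  # f_1 on D_1
    (lambda x: x[0] >= 0.0, lambda x, u: x + 0.1 * (-x + u)),       # f_2 on D_2
]

def step(x, u):
    """One step of x_{t+1} = f_i(x_t, u_t), with i such that x_t in D_i."""
    for contains, f in regions:
        if contains(x):
            return f(x, u)
    raise ValueError("state outside all regions")

x = np.array([-1.0])
for t in range(5):
    x = step(x, np.array([0.5]))
    print(t, x)
```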
Throughout the paper we make the following assumptions.
Assumption 1. We are given the state-input trajectories

    x^0 = [x^0_0, ..., x^0_T] and u^0 = [u^0_0, ..., u^0_{T-1}],

which are feasible for the FTOCP (4) with x_I = x_S.

Assumption 2. For any x ∈ G, the stage cost l(x, u) = 0, for all u ∈ U. Moreover, the set G is control invariant, i.e., for all x ∈ G, there exist a control u ∈ U and an index i ∈ {1, ..., R} such that f_i(x, u) ∈ G and x ∈ D_i.

Remark 1. The proposed methodology requires only feasibility of the trajectory x^0. However, the optimality of the trajectory x^0 affects the performance of the proposed control synthesis strategy. For this reason, in Section III-C we present a policy iteration scheme that may be used to iteratively improve the closed-loop performance of the policy.

III. CONTROL DESIGN
In this section, we first introduce an FTOCP which can be recast as a non-linear program (NLP). Afterwards, we present the proposed strategy which leverages this NLP and the feasible trajectory from Assumption 1. Finally, we present a policy iteration scheme which can be used to improve the performance of the control policy.

A. Model Predictive Control
The FTOCP (4) is challenging to solve, as at each time t the system dynamics change as a function of the state x_t, i.e., x_{t+1} = f_{i_t}(x_t, u_t) with x_t ∈ D_{i_t}. However, the computational complexity may be reduced when searching for a feasible sub-optimal solution. In particular, a feasible solution to the FTOCP (4) may be computed by fixing a priori a sequence of regions {D_{i_0}, ..., D_{i_{T-1}}} in which the system should be confined at each time t. Clearly, a trajectory which steers the system from the starting state x_I to the goal set G, while visiting the sequence of regions {D_{i_0}, ..., D_{i_{T-1}}}, is a feasible trajectory for the original FTOCP (4).
In order to reduce the computational complexity, we introduce an FTOCP defined over a horizon N shorter than T and for a set of indices

    I_t = {i_t, ..., i_{t+N-1}}.

In particular, given a set of indices I_t, a terminal state x_F, and an associated terminal cost q_F ∈ R, we define the FTOCP:

    J(x_t, x_F, q_F, I_t, N) = min  Σ_{k=t}^{t+N-1} l(x_{k|t}, u_{k|t}) + q_F
    s.t. x_{k+1|t} = f_{i_k}(x_{k|t}, u_{k|t}), ∀k ∈ {t, ..., t+N-1},
         x_{k|t} ∈ D_{i_k} ∩ X, u_{k|t} ∈ U, ∀k ∈ {t, ..., t+N-1},
         x_{t|t} = x_t, x_{t+N|t} = x_F,    (5)

where the optimization is carried out over the inputs [u_{t|t}, ..., u_{t+N-1|t}]. The optimal state-input sequences of the above FTOCP steer the system from the state x_t to the terminal state x_F while satisfying state and input constraints. In the FTOCP (5), at each predicted time k the system state x_{k|t} ∈ D_{i_k} ∩ X, and therefore problem (5) can be recast as an NLP, which is easier to solve than problem (4), where the optimization is carried out over continuous and integer variables. Next, we present the proposed algorithm, which chooses the set I_t = {i_t, ..., i_{t+N-1}} associated with the sequence of regions {D_{i_t}, ..., D_{i_{t+N-1}}}, the terminal state x_F, and the terminal cost q_F that are used in the FTOCP (5).
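Once the index set I_t is fixed, (5) is a smooth NLP. The sketch below, in the spirit of the CasADi/IPOPT implementation used in Section V-B, sets up such an NLP for a scalar toy system; the dynamics, halfspace regions, and quadratic stage cost are assumptions made for illustration.

```python
import casadi as ca

N = 5                                     # prediction horizon
f = [lambda x, u: x + 0.1 * u,            # f_1, active on D_1 = {x <= 0}
     lambda x, u: x + 0.1 * (u - x)]      # f_2, active on D_2 = {x >= 0}
I_t = [0, 0, 1, 1, 1]                     # fixed region sequence
x0, xF, qF = -1.0, 0.5, 0.0               # initial state, terminal state/cost

opti = ca.Opti()
X = opti.variable(N + 1)                  # states x_{t|t}, ..., x_{t+N|t}
U = opti.variable(N)                      # inputs u_{t|t}, ..., u_{t+N-1|t}

cost = qF
for k in range(N):
    cost += X[k] ** 2 + U[k] ** 2                             # placeholder stage cost
    opti.subject_to(X[k + 1] == f[I_t[k]](X[k], U[k]))        # dynamics of region i_k
    opti.subject_to(opti.bounded(-10, U[k], 10))              # input constraints
    opti.subject_to(X[k] <= 0 if I_t[k] == 0 else X[k] >= 0)  # x_{k|t} in D_{i_k}

opti.subject_to(X[0] == x0)               # initial condition
opti.subject_to(X[N] == xF)               # terminal constraint x_{t+N|t} = x_F
opti.minimize(cost)
opti.solver("ipopt")
sol = opti.solve()
print(sol.value(U[0]))                    # first control action
```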

B. Policy Synthesis
This section describes the proposed strategy. For each state x^0_t of the feasible trajectory x^0 from Assumption 1, we define the vector

    i^0 = [i^0_0, ..., i^0_{T-1}],    (6)

where i^0_t ∈ {1, ..., R} identifies the region containing the system's state at time t, i.e., x^0_t ∈ D_{i^0_t}. Moreover, for each state x^0_t we introduce the cost-to-go q^0_t given by the recursion:

    q^0_t = l(x^0_t, u^0_t) + q^0_{t+1},    (7)

for q^0_T = 0 and the stage cost l : R^n × R^d → R from (4). Finally, we define the cost vector q^0 = [q^0_0, ..., q^0_T].

The cost vector q^0, the feasible trajectory x^0, and the vector of indices i^0, together with the parameters M ∈ {0, 1, 2, ...} and N ∈ {0, 1, ...}, are used to initialize the proposed control policy, which is described by Algorithm 1. At each time t, Algorithm 1 takes as input the state of the system x_t and returns the best cost and the control action u_t, which is applied to system (1). First, given the state x_t, we identify the region's index i_t such that x_t ∈ D_{i_t} (line 1). Afterwards, we solve the FTOCP (5) up to M times, each time with a different terminal state x_F = x^0_{t_F}, terminal cost q_F = q^0_{t_F}, and set of indices I_t (lines 4-18). Note that the set of indices I_t (line 8) is computed by appending to the current region's index i_t a sequence of N_t − 1 indices selected backward from time t_F, which is chosen independently of N_t for t > 0. As a result, at each time t the controller solves the FTOCP (5) using a sequence of indices which may differ from the one associated with the first feasible trajectory^1. As we will discuss in Section IV, at time t and for m = 0 the FTOCP (5) is feasible when the terminal components are defined as in lines 5-7. Therefore, if for m > 0 we have that c_{m-1} < c_m, we stop solving the set of M NLPs and we save the best cost and the associated index m*_t. Then, we update the parameter k_t that is used to define the terminal MPC components, and we shrink the prediction horizon if the terminal predicted state x̄_{F,m*_t} equals the terminal state x^0_T. Finally, we return the optimal control action u_t = ū_{m*_t}.

^1 For example, at time t = 0, for m = 2, k_0 = N = 4 and T = 100, we have that t_F = k_0 + m = 6.
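The following Python sketch mirrors the structure described above: the cost-to-go is computed via the backward recursion (7), and the policy step tries up to M candidate terminal points along the stored trajectory, stopping once the candidate cost increases. The helper solve_ftocp stands in for an NLP solver for (5) (e.g., the CasADi setup sketched earlier); its signature and returned dictionary are assumptions made for illustration.

```python
import numpy as np

def cost_to_go(x_traj, u_traj, stage_cost):
    """Backward recursion (7): q_t = l(x_t, u_t) + q_{t+1}, with q_T = 0."""
    T = len(u_traj)
    q = np.zeros(T + 1)
    for t in reversed(range(T)):
        q[t] = stage_cost(x_traj[t], u_traj[t]) + q[t + 1]
    return q

def policy_step(x_t, i_t, x0, i0, q0, k_t, N_t, M, solve_ftocp):
    """One policy evaluation: try up to M terminal candidates x^0_{t_F}."""
    best = None
    for m in range(M):
        t_F = min(k_t + m, len(x0) - 1)             # candidate terminal time
        # I_t: current index, then N_t - 1 indices selected backward from t_F
        I_t = [i_t] + list(i0[t_F - (N_t - 1):t_F])
        sol = solve_ftocp(x_t, x_F=x0[t_F], q_F=q0[t_F], I_t=I_t, N=N_t)
        if sol is None:                             # candidate infeasible
            continue
        if best is not None and best["cost"] < sol["cost"]:
            break                                   # cost increased: stop early
        best = {"cost": sol["cost"], "u": sol["u0"], "m": m}
    return best
```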

C. Policy Iteration
The control policy from Algorithm 1 leverages the feasible trajectory x^0 to compute the terminal components used in the MPC problem (5) and, as a result, the performance of the proposed methodology is affected by the optimality of x^0. In this section, we discuss a policy iteration strategy that may be used to improve the performance of the control policy from Algorithm 1. In particular, we simulate the closed-loop system and we iteratively update the control policy.
At iteration j ≥ 1, we define the control policy π^j : R^n → R^d, which is given by Algorithm 1 initialized using the feasible trajectory x^{j-1}, the vector of indices i^{j-1}, and the cost vector q^{j-1}. For an initial condition x^j_0 = x_S, the policy π^j may be used to simulate the trajectory of the closed-loop system:

    x^j_{t+1} = f_i(x^j_t, π^j(x^j_t)) if x^j_t ∈ D_i,    (8)

which is then used to update the policy π^{j+1} at the next iteration j + 1, as shown in Algorithm 2. In what follows, we show that the closed-loop performance of the control policy π^j is non-decreasing at each jth policy update.
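A minimal sketch of this iteration loop is given below, assuming a helper simulate that rolls out the closed-loop system (8) from x_S until the goal is reached and returns the visited states, inputs, and region indices, and a helper make_policy that instantiates Algorithm 1; both helper names are hypothetical.

```python
def policy_iteration(x0, u0, i0, stage_cost, make_policy, simulate, n_iters):
    """Algorithm 2 sketch: roll out the policy, then re-initialize it."""
    trajs = [(x0, u0, i0)]                   # iteration 0: the given trajectory
    for j in range(n_iters):
        x, u, i = trajs[-1]
        q = cost_to_go(x, u, stage_cost)     # recursion (7), sketched above
        policy = make_policy(x, i, q)        # Algorithm 1 with (x^j, i^j, q^j)
        trajs.append(simulate(policy))       # closed-loop rollout (8)
    return trajs
```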

IV. PROPERTIES

A. Recursive Feasibility and Finite-Time Convergence
We show that at each time t Algorithm 1 in closed-loop with system (1) guarantees constraint satisfaction, and that the closed-loop system converges in finite time to the set G.

Theorem 1. Consider the closed-loop system (1) and (3), where the policy π is given by Algorithm 1. Let Assumptions 1-2 hold. If at time t = 0 the initial condition x_0 = x_S, then the closed-loop system (1) and (3) satisfies the state and input constraints from (2) and it reaches the terminal set G at time T, i.e., x_t ∈ X, u_t ∈ U, ∀t ∈ {0, ..., T-1} and x_T ∈ G.
Proof: The proof follows from standard MPC arguments [8] and the construction of the time-varying components from Algorithm 1. Assume that at time t Algorithm 1 returns a feasible control action u_t, and let

    [x*_{t|t}, ..., x*_{t+N_t|t}] and [u*_{t|t}, ..., u*_{t+N_t-1|t}]    (9)

be the optimal solution associated with the m*_t-th FTOCP (5) solved at time t. Next, we consider three cases to show that the time-varying components defining the FTOCPs solved at line 9 of Algorithm 1 guarantee that Algorithm 1 returns a feasible control action u_{t+1} at the next time step t + 1.

Case 1: If x̄_{F,m*_t} = x^0_T and the horizon N_t = 1, then k_{t+1} = T, N_{t+1} = 1 and x_{t+1} = x^0_T; therefore, by the invariance of G, for m = 0 the FTOCP J(x_{t+1}, x_F, q_F, I_{t+1}, N_{t+1}) with x_F = x_{k_{t+1}} = x^0_T is feasible.

Case 2: If x̄_{F,m*_t} = x^0_T and the horizon N_t > 1, then k_{t+1} = T, and for m = 0 the shifted state-input sequences [x*_{t+1|t}, ..., x*_{t+N_t|t}] and [u*_{t+1|t}, ..., u*_{t+N_t-1|t}] from (9) are feasible for the FTOCP solved at time t + 1 with the shrunken horizon N_{t+1} = N_t − 1.

Case 3: If x̄_{F,m*_t} ≠ x^0_T, then x*_{t+N_t|t} = x^0_{k_t+m*_t} and k_{t+1} = k_t + m*_t + 1; therefore, for m = 0 the state-input sequences [x*_{t+1|t}, ..., x*_{t+N_t|t}, x^0_{k_{t+1}}] and [u*_{t+1|t}, ..., u*_{t+N_t-1|t}, u^0_{k_t+m*_t}], obtained by appending one step of the stored feasible trajectory, are feasible for the FTOCP solved at time t + 1.

From Cases 1-3, we have that if at time t Algorithm 1 returns a feasible action u_t, then at time t + 1 Algorithm 1 returns a feasible control action u_{t+1}. Now, we notice that at time t = 0 the sequence of actions [u^0_0, ..., u^0_{N_0-1}] from Assumption 1 is feasible for the FTOCP (5) solved for m = 0, which in turn implies that Algorithm 1 returns a feasible action u_t at all times and that state and input constraints are satisfied.
Finally, we show finite time convergence of the closed-loop system to the goal set G. From Algorithm 1, we have that k_t increases at each time step until k_t = T, which happens after at most T − N time steps, and that afterwards the horizon shrinks. Therefore, at time T − 1 we have that N_{T-1} = 1, t_F = k_{T-1} = T and x_F = x^0_T; thus the predicted state at time T − 1 of the optimal trajectory from (9) satisfies x*_{T|T-1} = x^0_T, which in turn implies that x_T = x*_{T|T-1} = x^0_T ∈ G.

Corollary 1. Consider the closed-loop system (1) and (3), where the policy π is given by Algorithm 1. Let Assumptions 1-2 hold. If at time t Algorithm 1 returns a feasible control action u_t ∈ U, then the closed-loop system (1) and (3) satisfies the constraints (2) and it converges to the goal set G.
The above corollary highlights the advantage of computing a policy that maps states to actions, which can be used to deal with perturbed initial conditions and uncertainties. In the results section, we perform an empirical study where we test the robustness of the proposed methodology by changing initial conditions and simulating disturbances acting on the system.

B. Iterative Improvement
This section discusses the properties of the iterative Algorithm 2. In particular, we show that at each policy update the cumulative cost associated with the closed-loop trajectories from (8) is non-increasing. Notice that our strategy guarantees non-increasing cost at each update, but no guarantees are given about the optimality of the trajectory at convergence.

V. RESULTS

A. Spring Loaded Inverted Pendulum Model
This section describes the Spring Loaded Inverted Pendulum (SLIP) model with massless legs [21]. The system state is x = [p_x, p_y, v_x, v_y, z^l_x, z^r_x, z^l_y, z^r_y], where (p_x, p_y) and (v_x, v_y) denote the position and velocity of the center of mass (CoM), and (z^k_x, z^k_y) the position of the kth foot for k ∈ {l, r}. The input is u = [δ, v_z, v^l_y, v^r_y], where δ represents the change of leg stiffness, v_z the velocity along the x-axis of the foot not in contact with the ground, and (v^l_y, v^r_y) the velocities of the two feet on the y-axis. Given the state of the system, we can compute the angle θ_k = tan^{-1}((p_x − z^k_x)/p_y), ∀k ∈ {l, r}, and the length l_k = l_0 − √((p_x − z^k_x)^2 + p_y^2), ∀k ∈ {l, r}, where l_0 = 0.55 is the rest leg length; for more details please refer to [21]. These quantities are used to define the dynamics, in which the integer variable γ_k equals one if the kth foot is in contact with the ground. In particular, we have that (γ_l, γ_r) = (1, 1) if x ∈ D_ds, (γ_l, γ_r) = (1, 0) if x ∈ D_l, and (γ_l, γ_r) = (0, 1) if x ∈ D_r, where the regions are defined as follows:

    D_ds = {x ∈ R^n | l_l ≤ l_max, l_r ≤ l_max, z^l_y = 0 and z^r_y = 0},
    D_l  = {x ∈ R^n | l_l ≤ l_max, z^l_y = 0 and z^r_y > 0},
    D_r  = {x ∈ R^n | l_r ≤ l_max, z^l_y > 0 and z^r_y = 0}.

Finally, we notice that when a sequence of regions is fixed, the vertical motion of the feet can be computed independently from the other states.
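As an illustration, the Python sketch below evaluates the leg geometry and the contact flags (γ_l, γ_r) implied by the foot heights in the region definitions above; the state layout [p_x, p_y, v_x, v_y, z^l_x, z^r_x, z^l_y, z^r_y] follows the formulation in this section, while the tolerance handling is an assumption made for illustration.

```python
import numpy as np

L0 = 0.55  # rest leg length

def leg_geometry(x, foot):
    """theta_k = atan((p_x - z^k_x)/p_y), l_k = l_0 - sqrt((p_x - z^k_x)^2 + p_y^2)."""
    px, py = x[0], x[1]
    zx = x[4] if foot == "l" else x[5]
    theta = np.arctan2(px - zx, py)
    length = L0 - np.sqrt((px - zx) ** 2 + py ** 2)
    return theta, length

def contact_flags(x, tol=1e-9):
    """(gamma_l, gamma_r): 1 if the corresponding foot height is zero."""
    zly, zry = x[6], x[7]
    return int(abs(zly) <= tol), int(abs(zry) <= tol)

x = np.array([0.0, 0.85, 0.0, 0.0, -0.1, 0.2, 0.0, 0.3])
print(contact_flags(x))   # (1, 0): left-stance region D_l
```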

B. Simulations
First, we compute a feasible trajectory that steers the system from standing still at (p_x, p_y) = (0, 0.85) to the goal state x_g = (10, 0.85, 0, 0, 9.9, 10.2, 0, 0). In order to compute a feasible trajectory x^0, we fixed a sequence of regions {D_t}_{t=0}^{T} in which the system should be at each time t, and we solved the resulting NLP with IPOPT [22] using CasADi [23]. Furthermore, we added a slack variable to the terminal constraint, and we used the stage cost (14) and the input constraints δ ∈ {δ : ||δ|| ≤ 10} and v_z ∈ {v_z : ||v_z|| ≤ 10}. Code is available at https://github.com/urosolia/SLIP.

Figure 1 shows the feasible trajectory, which steers the system from the starting configuration to the goal position while transitioning through phases of double support (black) and single support with the left (red) and right (blue) legs. The figure also shows the locations where the feet are in contact with the ground. This feasible trajectory is used to initialize the control policy from Algorithm 1. We tested the proposed strategy with N = 30 on 10 different initial conditions in the neighborhood of the starting state x_S. We set M = 1 to test the robustness of Algorithm 1 to disturbances and changes in initial conditions when only one FTOCP is solved at each time t. Figure 2 shows that for all initial conditions the controller is able to steer the system to the goal state. Notice that, in order to stabilize the system, the proposed strategy is able to plan a sequence of foot steps which differs from the one associated with the feasible trajectory. Finally, it takes on average less than 0.05s to run Algorithm 1, with a maximum computation time of 0.114s across all simulations.
Furthermore, we compared the proposed strategy with a tracking controller, which is defined by removing the terminal constraint from (5) and using a tracking cost instead of (14). Both our method and the tracking controller are able to complete the task. However, as shown in Figure 3, the tracking controller fails to reach the goal when a disturbance hits the system. In Figure 3, it is interesting to notice that the proposed approach initially deviates more from the offline trajectory than the tracking controller does. This deviation allows the controller to compensate for the disturbance and to stabilize the system back to a periodic gait.
Fig. 3. Comparison between an MPC tracking controller and the proposed strategy when a disturbance hits the system.

Finally, we tested Algorithm 2 to iteratively update the control policy. We set j = 40 and we changed the stage cost to encode the objective of steering the system from the starting state to the goal state in minimum time. In particular, we defined the stage cost l̄(x, u) = 1_G(x) + 0.0001 l(x, u), where the function 1_G(x) = 0 if x = x^0_T and 1_G(x) = 1 otherwise. Moreover, we set M = 40 to allow Algorithm 1 to solve multiple instances of the FTOCP (5). Note that as M gets larger the controller can better explore the state space, but as a trade-off the computational cost increases. In this example it takes on average less than 0.4s and at most 0.9s to run Algorithm 1. We initialized the proposed policy iteration strategy with the feasible trajectory x^0 from Figure 1, which steers the system from the starting point to the goal state in 396 time steps, and our algorithm returned a policy which completes the task in 193 time steps. The closed-loop trajectory is shown in Figure 4. First, the controller accelerates and, as a result, the CoM oscillates more than in the first feasible trajectory from Figure 1. Finally, the controller slows down and reduces the oscillation to reach the goal position with zero speed and both feet on the ground.
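For reference, a direct transcription of this minimum-time stage cost l̄ reads as below; goal_state stands for x^0_T and base_stage_cost for the original stage cost l, both assumed to be available.

```python
import numpy as np

def min_time_stage_cost(x, u, goal_state, base_stage_cost):
    """l_bar(x, u) = 1_G(x) + 0.0001 * l(x, u), with 1_G(x) = 0 at the goal."""
    at_goal = np.allclose(x, goal_state)
    return (0.0 if at_goal else 1.0) + 1e-4 * base_stage_cost(x, u)
```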

VI. CONCLUSIONS
We presented an algorithm to synthesize control policies for hybrid systems by leveraging a feasible trajectory for the control task. We showed that the proposed methodology guarantees constraint satisfaction and finite-time convergence. Building upon the proposed synthesis strategy, we presented a policy iteration algorithm which guarantees that at each policy update the closed-loop performance is non-decreasing. Finally, we tested the proposed strategy on a discretized SLIP model.