Episodic Koopman Learning of Nonlinear Robot Dynamics with Application to Fast Multirotor Landing

This paper presents a novel episodic method to learn a robot's nonlinear dynamics model and an increasingly optimal control sequence for a set of tasks. The method is based on the {\em Koopman operator} approach to nonlinear dynamical systems analysis, which models the flow of {\em observables} in a function space, rather than a flow in a state space. Practically, this method estimates a nonlinear diffeomorphism that lifts the dynamics to a higher dimensional space where they are linear. Efficient Model Predictive Control methods can then be applied to the lifted model. This approach allows for real time implementation in on-board hardware, with rigorous incorporation of both input and state constraints during learning. We demonstrate the method in a real-time implementation of fast multirotor landing, where the nonlinear ground effect is learned and used to improve landing speed and quality.


I. INTRODUCTION
While modeling and identification (ID) techniques are well developed for some robotic mechanisms, such as manipulators, there are an increasing number of applications where modeling and ID are difficult. Consider a multirotor drone, whose basic flight mechanics in open air are well understood [1], [2]. However, when multirotors fly close to the ground or a wall, or inside a narrow tunnel (e.g., for noninvasive inspection), the unmodeled effects of the complex vehicle-air-environment interaction can substantially reduce the drone's path tracking accuracy, and perhaps its stability. While ground effect models can be incorporated, their accuracy is limited, and their parameters must still be estimated in a slow process. This particular problem motivates this paper. There are many other applications, such as soft robotic structures or robotic manipulation of soft materials, where first principles modeling and parameter identification remain challenging in practice. Moreover, one must still design a controller for the nonlinear mechanics that are identified.
Learning can capture the salient aspects of a robot's complex mechanics and environmental interactions. Gaussian process dynamical systems models [3] can identify nonlinear affine control models in a non-parametric way. Yet, effective nonlinear control design after identifying the model can be challenging. Model-free reinforcement learning (MFRL) [4] learns feedback policies that implicitly incorporate the robot's dynamics. However, sample efficiency is very low. Moreover, while safety during MFRL is now possible [5], [6], one cannot yet guarantee that learned policies will satisfy performance requirements or state and actuator limits.  This paper presents a new method, based on Koopman spectral analysis, to learn nonlinear robot dynamical models. Conventionally, a system's behavior is characterized via its state space flows. In contrast, Koopman-based approaches study the evolution of observables, which are functions over the state-space. In this space, the system can be represented by a linear (but possibly infinite dimensional) operator [7], [8]. Practically, the nonlinear dynamics are lifted to a higher dimensional space where they are linear. Efficient linear and optimal control principles can be applied to the lifted system.
Koopman inspired modeling and identification techniques have received substantial recent attention [9]. The Dynamic Mode Decomposition (DMD) and extended DMD (EDMD) methods [10] have been successfully used in the field of fluid mechanics to capture low-dimensional structure in complex flows. More recently, Koopman-style modeling has been extended to controlled nonlinear systems [11], [12].
In practice, to model and identify a nonlinear system in the Koopman framework, one can identify a finite dimensional approximation to the linear Koopman operator, or identify the Koopman eigenfunctions. We use the latter approach (see Section II-B). Previous methods for identifying Koopman eigenfunctions (e.g., [13], [14]) depend upon assumptions that are problematical for robotic systems: the ID data is gathered while the robot operates under open loop controls, which can lead to catastrophic system damage.
In very recent work [15], the authors and collaborators have developed a Koopman Eigenfunction Extended Dynamic Mode Decomposition (KEEDMD) method to identify/learn an unknown nonlinear dynamical system from a batch of data gathered while the (robotic) system operates under a stabilizing (but not necessarily optimal) linear controller. This paper substantially extends our prior work [15] to make it practically useful for robotics. First, the method gathers data while the system operates under any nonlinear stabilizing controller. This enables input vector field nonlinearities to be captured, unlike prior Koopman-based model ID approaches. Second, we introduce an episodic learning procedure, by considering the closed-loop dynamics obtained with a non-linear controller as the autonomous dynamics for the next episode. This feature increases sample efficiency (i.e., fewer learning trials) for improving specific tasks, and enables nonlinear actuation effects, which are important in robotics, to be captured in the Koopman eigenfunctions. Third, it should be noted that data collected from robots executing trajectories formally violates the i.i.d. assumption underlying the performance guarantees of most learning paradigms. In practice, this fact can lead to error cascades and poor performance guarantees. Episodic learning mitigates this problem [16]. Finally, our method integrates Model Predictive Control (MPC) [17] into its structure, thereby allowing control and state constraints to be satisfied during the learning process.
Previous approaches have implemented MPC in real time on modest computational hardware [18], on multirotors using an explicit solution in simulation [19], and designed feedback linearizing controllers for multirotors in real experiments [20], [21]. Bouffard et al. [22] also used MPC to learn ground effects using an experimental multirotor, but used an Extended Kalman Filter (EKF) in combination with Learning-Based MPC (LBMPC). Shi et al. [23] experimentally demonstrated using a spectrally-normalized neural network to learn the ground effect and improve drone landing by designing a feedback linearizing controller utilizing the learned model. We introduce a new approach to solve this problem and aim to demonstrate that our method represents a first step towards practical Koopman-based learning and control of real-world robotic systems.
Section II reviews relevant facts about the Koopman operator and KEEDMD [15]. Sections III and IV introduce and develop the first main paper contribution-the episodic KEEDMD algorithm. Section V presents our second contribution, the experimental demonstration that our method can learn the ground effect in a fast landing multirotor.

A. The Koopman Operator
Consider the autonomous dynamical system: with the state x ∈ X ⊂ R n and f Lipschitz continuous on X . We assume that the system (1) has a fixed point at the origin, f (0) = 0. The flow of this dynamical system is denoted by S t (x) and is defined as for all x ∈ X and all t ≥ 0. The Koopman operator semigroup (U t ) t≥0 , from now on simply denoted as the Koopman operator, is defined as for all g ∈ C(X ), where • denotes the function composition and C(X ) is the space of all continuous functions on X . Each element of the Koopman operator maps continuous functions to continuous functions, U t : C(X ) → C(X ). Crucially, each U t is a linear operator. An eigenfunction of the Koopman operator associated to an eigenvalue e λ ∈ C is any function φ ∈ C(X ) that defines a coordinate evolving linearly along the flow of (1) satisfying

B. Data-driven Construction of Koopman Eigenfunctions
For any sufficiently smooth autonomous dynamical system that is asymptotically stable to a fixed point, Koopman eigenfunctions can be constructed by finding the eigenfunctions of the system's linearization around the fixed point and then composing them with a diffeomorphism [13]. The linearization of the dynamics (1) around the origin iṡ The following proposition describes how to construct eigenfunction-eigenvalue pairs for the linearized system (5). Proposition 1. LetÂ denote the linearization (5) of nonlinear system (1) and let {v 1 , . . . , v n } be a basis of the eigenvectors ofÂ corresponding to nonzero eigenvalues {λ 1 , . . . , λ n }. Let {w 1 , . . . , w n } be an adjoint basis to {v 1 , . . . , v n } such that v q , w r = δ qr and w q is an eigenvector ofÂ * at eigenvalueλ q . Then, the linear functional ψ q (y) = y, w q (6) is a nonzero eigenfunction of UÂ, the Koopman operator associated toÂ. Furthermore, for any (m 1 , . . . , is an eigenpair of the Koopman operator UÂ. These linear functionals (6), termed principal eigenfunctions, are used to construct the eigenfunctions associated with the Koopman operator of the nonlinear dynamics through the use of a conjugacy map (See [9], Prop. 7).
Proposition 2. Assume that the nonlinear system (1) is topologically conjugate to the linearized system (5) via the diffeomorphism h : X → Y. Let B ∈ X be a simply connected, bounded, positively invariant open set in X such that h(B) ⊂ Q r ⊂ Y, where Q r is a cube in Y. Scaling Q r to the unit cube Q 1 via the smooth diffeomorphism g : Q r → Q 1 gives (g • h)(B) ⊂ Q 1 . Then, if ψ is an eigenfunction for UÂ at e λ , then ψ • g • h is an eigenfunction for U f at eigenvalue e λ , where U f is the Koopman operator associated with the nonlinear dynamics (1).
An extension of the Hartman-Grobman theorem ( [7], Theorem 2.3) guarantees the existence of a C 1 diffeomorphism between the linearized and nonlinear systems in the entire basin of attraction of a fixed point, such that Dc(0) = I.

C. KEEDMD with Trajectory-tracking Control Laws
Data-driven construction of Koopman eigenfunctions for nonlinear dynamics of the formẋ = a(x) + Bu is based on Section II-B, and summarized in Algorithm 1. General nonlinear dynamics will be considered in Section III. For trajectory-tracking control laws, the algorithm assumes that a linearized nominal dynamics modelẏ = A nom y +B nom u is known, along with a nominal stabilizing feedback control law Products and powers (7) generate arbitrarily many eigenpairs of the linearized system before applying the diffeomorphism (8) between the nonlinear and linearized dynamics. This diffeomorphism h : X → Y is learned in a supervised way (e.g. a neural network trained with gradient descent) by performing empirical risk minimization (ERM) of an appropriate loss function over a model class H h . For the trajectory-tracking case, the loss function is of the form (see [15] for details). Finally, a function scaling Y ⊂ Q r into Q 1 , where Q r is a hyper cube of the same dimension as Y with radius r, is constructed (i.e. by scaling each coordinate into a unit cube) and approximate Koopman eigenpairs for the unknown, nonlinear dynamics are constructed: (e λi , φ i ) = (ẽ λ i ,ψ i (g(h(y)))), whereẽ λ i ,ψ i are the eigenpairs constructed with (7).
. . , N Construct principal eigenpairs for the linearized dynamics: (e λ j , ψj(y)) ← (e λ j , y, wj ), j = 1, . . . , n Construct N eigenpairs from the principal eigenpairs: Construct scaling function: g(y) ← g : Qr → Q1 Construct N eigenpairs for the nonlinear dynamics: (ẽ λ i , φi) ← (ẽ λ i ,ψi(g(h(y)))), i = 1, . . . , N Output: Λ = diag(ẽ λ 1 , . . . ,ẽ λ N ), φ = [φ1, . . . , φN ] T KEEDMD uses the constructed eigenfunctions to lift the system states to a higher dimensional space where a linear dynamical model of the formż = Az+Bu can be identified. For Lagrangian dynamics, as the example in Section V, we use z = [x, φ(x)] T as the lifted state, where x = [p v] T , p the position,ṗ = v the velocity, and φ is a vector of the eigenfunctions. While helping data efficiency, the method does not generally require any a priori information of the structure of the dynamics. Since the time evolution of the eigenfunctions is dictated by their eigenvalues, we can show that the lifted state space model has the structure where 0, I, Λ, K nom are fixed matrices and A vp , A vv , A vφ , B p , B v , B φ are determined from data, and the term −B φ K nom accounts for the nominal controller's state feedback effect on the evolution of the eigenfunctions. The desired trajectory effect of the controller is captured by the learning framework. The unknown elements of A and B can be found using linear regression on the data collected under the nominal control law (see [15] for details). We define L z as the mean squared error for the regression problem, where different forms of regularization can be included if needed. As the feedback controller is statedependent, it is not possible to disambiguate its effect from the passive uncontrolled system dynamics in the learning process. To avoid an ill-conditioned KEEDMD regression problem, Brownian noise is added to perturb the nominal controller [24]. Brownian noise is chosen in this instance because pure sampling from e.g. a Gaussian distribution leads to perturbations that have too high frequency to perturb the movement of the multirotor. This perturbation is also used by our episodic learning framework (Section IV). When the lifted state space model is identified, state estimates can be obtained as x = Cz, where C = [I 0]. C is denoted the projection matrix of the lifted state space model.

A. Problem Setup and Dynamics Modeling
Assume that we have selected a fixed trajectory τ to be tracked by the robot during episodic learning. Further assume a nominal controllerû(x, τ , t) that can stabilize the system to τ within a region of attraction Ω around the trajectory. This controller might be the outcome of a previous learning episode (see below), or the simple linear nominal controller from KEEDMD (Section II). Finally, the system's governing dynamics are assumed to be unknowṅ where x ∈ X ⊂ R n , u ∈ U ⊂ R m . and f (x, u) is assumed to be Lipschitz continuous on X × U. KEEDMD (Section II) requires batch training data to be collected from system executions under a nominal linear control law: u nom (x) = K nom (x − τ (t)). A main contribution of this paper is to iteratively learn an improving sequence of eigenfunctions and nonlinear controllers. Specifically, we will iteratively use the lifted state-space model to design an MPC-controller to track learning trajectories (see Figure 2).
If a candidate nonlinear controllerû(x, τ , t) can stabilize system (10) to a given trajectory τ , the controlled system can be described by the autonomous dynamicṡ Importantly, for the autonomous dynamics (11), there exists an associated Koopman operator U Fû ,τ that depends on control lawû and trajectory τ . Therefore, approximate eigenpairs for U Fû ,τ can be constructed (see Section II-B) from the gathered state and control samples. A lifted statespace model can be constructed from these eigenpairs. However, unlike the framework reviewed in Section II, we aim to learn a dynamical model that assumes that the system is already regulated by the nominal controllerû(x, τ , t). As a result, the A-matrix of the lifted state space model captures the autonomous dynamics under the nominal control law (Eq. 11), and the B-matrix captures the effect of control variations around the nominal controller: This model is used in an MPC framework below to design an augmenting control law that adds optimal control actions to the nominal controller. The augmenting controller leverages the improved system model to make corrections to suboptimal actions taken by the nominal controller.

C. Modifications to Allow the Diffeomorphism to Capture Nonlinear Control and Dynamics Effects
To enable the learning framework to capture nonlinear effects caused by the nonlinear controller and actuated dynamics, a minor modification to the function approximator of h is necessary. Namely, since the diffeomorphism is affected by the forcing signal τ (t) it must be included in the inputs of h. This is motivated by the form of the diffeomorphism loss function (9). In the case considered in the preliminaries however, the actuated dynamics and controller are assumed to be linear. This causes the effect of the forcing signal τ (t) to cancel out such that the diffeomorphism is independent of the desired trajectory. In the general nonlinear case however, the effect is not canceled out and must be captured by the diffeomorphism. As a result, the diffeomorphism is modified such that h : X × X → Y (see [15] for details).

IV. EPISODIC EIGENFUNCTION CONSTRUCTION AND KEEDMD INFERENCE
This section describes the main contribution of this paper, a substantial extension of the KEEDMD framework to allow iterative learning and improvement of the lifted state-space model and its associated controller.

A. Overview of the Episodic Learning Algorithm
Algorithm 2 summarizes our episodic learning approach, which applies three key steps per episode. In each episode, e, the first key step starts when an initial condition is sampled from set X 0 and an experiment is executed with the controller that results from the previous episode u e−1 (x, τ , t).
The state x, control actions u e−1 , Brownian noise control perturbationsũ, and the desired position dictated by the trajectory at the time associated with the i-th sample τ i are sampled. State data can be differentiated numerically to find estimatesẋ. The resulting data set is: denotes the i-th timestep of the e-th episode and T s denotes the number of samples in the episode. From D (e) x we estimate the diffeomorphism h and construct the eigenfunctions φ (e) (x) with associated eigenvalues Λ (e) , via Algorithm 1. Since changes in the control law between episodes are expected to be small, we warm start the learning algorithm with model coefficients from the previous episode.
The second key step is to use the constructed eigenpairs to build a lifted data set D which is the same data as D (e) x , but with the state and its derivative, x z . The lifted state-space model is constructed from this data using the framework of Section III-B. This results in a model of the form (12).
In the third and final step, an augmenting MPC is designed (see Section IV-B) for the lifted state-space model. The evaluation of the previous iteration's controllers is necessitated by the fact that the eigenfunctions depend on the dynamics under closed loop control with the controller deployed in the previous episodes. The controller augmentations are weighted and added to the previous episode's control law: u e = u 0 + e j=1 w j u j , where w e is a weighting factor indicating the confidence in the augmenting controller. The weighting factors can be any monotonically increasing sequence on the interval [0, 1] which allows the augmenting controller to have a bigger impact after a sufficiently rich data set has been collected.

B. Model Predictive Controller Details
Inspired by [25], we transform the original non-linear optimization problem into an efficient quadratic program (QP). The QP formulation requires us to discretize the previously learned linear continuous dynamics. We assume a known objective function of states and controls only. For simplicity, we use a quadratic objective function with respect to the state error and control action, but other objective functions can be used by simply adding it to the lifting functions. We assume known control bounds u min , u max ∈ R m and state bounds x min , x max ∈ R n . Because the control input for each MPC problem refers to the change from the the previous controller, we have to correct for this change in the control bounds. All these assumptions define the following optimization problem that is solved at each time step: where Q ∈ R n×n and R ∈ R m×m are positive semidefinite cost matrices, τ ∈ R n×Np is the reference trajectory, A j ∈ R N ×N and B j ∈ R N ×m are the discrete time versions of (12) for controller j, C j ∈ R n×N is the j th controller's projection matrix, and φ j ∈ R N are the j th controller's eigenfunctions. See Figure 2 to see how each controller is used as more episodes are being executed. In addition, we add a smoothing regularizer to avoid chatter that may arise from optimization-based controllers [26] of the form Np p=1 α R (u p − u p−1 ) 2 where u 0 is the deployed control action at the previous timestep.

V. IMPROVING FAST MULTIROTOR DESCENT AND LANDING BY LEARNING THE GROUND EFFECT
To validate our methodology, we apply it to fast descent and landing of a multirotor 1 . As the vehicle approaches the landing plane, a ground effect from the interaction of the prop downwash and the landing surface becomes prominent. This effect induces added upward thrust on the drone, which can lead to poor tracking performance for control designs that rely on models which omit these fluid flow interactions.

A. Modeling and Problem Statement
To simplify the discussion, we consider a 1-dimensional nominal model of the multirotor's altitude dynamics, consisting of a point mass model having altitude and its derivative, [p z ,ṗ z ] T , as states, mass m, and total thrust, T , as input: Using this model we design a nominal MPC as described in Section IV-B with the goal of reaching a fixed point of 0.05 m above ground at zero velocity. A nominal MPC stabilizes the drone to a fixed point, but uses more control effort and time to reach that point as a result of its simplified model. Importantly, the nominal dynamics model does not capture the ground effect. Our goal is to iteratively learn a better dynamics model (and associated MPC) that will improve speed and tracking performance in both the air and near-ground regimes.

B. Implementation and Experimental Details
Our experiments use the Intel Aero RTF Drone. Drone position is measured using an OptiTrack motion capture system and is fused with the drone's IMU (stock PX4 v1.8) to estimate the state. The diffeomorphism, h, is parameterized by a neural network and implemented with PyTorch [27], and the KEEDMD regression is implemented with elastic net regularization in Scikit-learn [28]. A dense form MPCcontroller is implemented in Python using the QP solver OSQP [29], and commands are sent to the PX4 flight controller via ROS. All computation for learning and control is done on board the drone. Each neural network and MPC evaluation takes 5 ms, limiting us to 5 episodes as update rates below 60 hz lead to poor performance on our hardware. The experiment's key parameters are summarized in Table I. We execute Algorithm 2 as discussed in Section IV on the drone for three episodes in each campaign. Each episode starts with 3 repetitions of the following: (1) the drone takes off and moves to an initial point under PX4 control; (2) the lifted controller takes over to stabilize the fixed point and hovers at that point for a second. After 3 repetitions, the drone lands under lifted control, fits the diffeomorphism and KEEDMD models, and repeats the episode. An additional landing sequence is executed to evaluate the performance of the current episode controller. Figure 3 depicts the drone's trajectory and control effort under the nominal controller (Episode 0), and then final landing for three episodes of a single learning campaign. Episode 0 represents the nominal performance before learning, while episodes 1-3 show the learning effect. Tracking error is reduced by 19.3 percent by the end of the last episode while the total control effort increases 4.5 percent as a consequence of the chosen MPC penalty matrices. Importantly, the thrust constraint is rigorously satisfied, and this constraint is active for longer duration. As the system learns more accurate dynamic models, it relies more on the open-loop bang-bang characteristic, as would be expected from an optimal solution, and less from closed loop control. Less control effort is needed towards the end of the trajectory, indicating that our methodology captures the ground effect. The mean and standard deviation of five independent learning campaigns are reported in Fig. 4. The tracking performance improves in every episode. Furthermore, the methodology has low variance between campaigns.

VI. CONCLUSIONS AND FUTURE WORK
This paper presented a novel episodic method based on a Koopman eigenfunction framework to learn a robotic system's nonlinear dynamics, and learn a near optimal control strategy for given tasks. By using a Koopman approach, we are able to implement a real-time MPC framework for optimal system control during the learning process. The approach improves performance as it gathers more data, augmenting the controller to avoid constant actuation matrix limitations, while respecting state and control input bounds. A current limitation is the addition of a controller in each episode leading to prohibitive computational complexity as the number of episodes grows. Current work is addressing this by consolidating the controllers into a finite set, and by optimally choosing when to switch to the next episode, reducing the number of episodes needed.