Technical Report: Adaptive Control for Linearizable Systems Using On-Policy Reinforcement Learning

This paper proposes a framework for adaptively learning a feedback linearization-based tracking controller for an unknown system using discrete-time model-free policy-gradient parameter update rules. The primary advantage of the scheme over standard model-reference adaptive control techniques is that it does not require the learned inverse model to be invertible at all instances of time. This enables the use of general function approximators to approximate the linearizing controller for the system without having to worry about singularities. However, the discrete-time and stochastic nature of these algorithms precludes the direct application of standard machinery from the adaptive control literature to provide deterministic stability proofs for the system. Nevertheless, we leverage these techniques alongside tools from the stochastic approximation literature to demonstrate that with high probability the tracking and parameter errors concentrate near zero when a certain persistence of excitation condition is satisfied. A simulated example of a double pendulum demonstrates the utility of the proposed theory. 1


I. INTRODUCTION
Many real-world control systems display nonlinear behaviors which are difficult to model, necessitating the use of control architectures which can adapt to the unknown dynamics online while maintaining certificates of stability. There are many successful model-based strategies for adaptively constructing controllers for uncertain systems [1], [2], [3], but these methods often require a simple, reasonably accurate parametric model of the system dynamics. Recently, however, there has been a resurgence of interest in the use of model-free reinforcement learning techniques to construct feedback controllers without the need for a reliable dynamics model [4], [5], [6]. As these methods begin to be deployed in real-world settings, a new theory is needed to understand the behavior of these algorithms as they are integrated into safety-critical control loops.
However, the majority of the theory for adaptive control is stated in continuous-time [2], while reinforcement learning algorithms are typically implemented and studied in discrete-time settings [7], [8]. There have been several attempts to define and study policy-gradient algorithms in continuous-time [9], [10], yet many real-world systems have actuators which can only be updated at a fixed maximum sampling frequency. Thus, we find it more natural and practically applicable to unify these methods in the sampled-data setting.
Specifically, this paper addresses the model mismatch issue by combining continuous-time adaptive control techniques with discrete-time model-free reinforcement learning algorithms to learn a feedback linearization-based tracking controller for an unknown system, online. Unfortunately, it is well-known that sampling can destroy the affine relationship between system inputs and outputs which is usually assumed and then exploited in the stability proofs from the adaptive control literature [11]. To overcome this challenge, we first ignore the effects of sampling and design an idealized continuous-time behavior for the system's tracking and parameter error dynamics which employs a least-squares gradient-following update rule. In the sampled-data setting, we then use an Euler approximation of the continuous-time reward signal and implement a policy-gradient parameter update rule to produce a noisy approximation to the ideal continuous-time behavior. Our framework is closely related to that of [12]; however, in this paper we address the problem of online adaptation of the learned parameters, whereas [12] considers a fully offline setting.
Beyond naturally bridging continuous-time and sampled-data settings, the primary advantage of our approach is that it does not suffer from the "loss of controllability" phenomenon which is a core challenge in the model-reference adaptive control literature [1], [13]. This issue arises when the parameterized estimate for the system's decoupling matrix becomes singular, in which case either the learned linearizing control law or the associated parameter update scheme may break down. To circumvent this issue, projection-based parameter update rules are used to keep the parameters in a region in which the estimate for the decoupling matrix is known to be invertible. In practice, the construction of these regions requires that a simple parameterization of the system's nonlinearities is available [14]. In contrast, the model-free approach we introduce does not suffer from singularities and can naturally incorporate 'universal' function approximators such as radial basis functions or bases of polynomials.
The authors are with the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley.
1 This draft corrects an important error which appeared in an earlier draft. In particular, the right-hand side of (52) has been amended from the original expression C_2 M ∆t ln(2/λ) to display the correct dependence on the confidence parameter λ.

arXiv:2004.02766v1 [cs.LG] 6 Apr 2020
However, due to the non-deterministic nature of our sampled-data control law and parameter update scheme, the deterministic guarantees usually found in the adaptive control literature do not apply here. Indeed, policy-gradient parameter updates are known to suffer from high variance [15]. Nevertheless, we demonstrate that when a standard persistence of excitation condition is satisfied, the tracking and parameter errors of the system concentrate around the origin with high probability, even when the most basic policy-gradient update rule is used. Our analysis technique is derived from the adaptive control literature and the theory of stochastic approximations [8], [16]. Proofs of the claims made can be found in the Appendix of the document. Finally, a simulation of a double pendulum demonstrates the utility of the approach.

A. Related Work
A number of approaches have been proposed to avoid the "loss of controllability" problem discussed above. One approach is to perturb the estimated linearizing control law to avoid singularities [13], [17], [18]. However, this method never learns the exact linearizing controller during operation and hence sacrifices some tracking performance. Other approaches avoid the need to invert the input-output dynamics by driving the system states to a sliding surface [3]. Unfortunately, these methods require high-gain feedback which may lead to undesirable effects such as actuator saturation. Several model-free approaches similar to the one we consider here have been proposed in the literature [19], [20], but these focus on actor-critic methods and, to the best of our knowledge, do not provide any proofs of convergence. Recently, non-parametric function approximators have been used to learn a linearizing controller [21], [22], but these methods still require structural assumptions to avoid singularities.
While our parameter-update scheme is most closely related to the policy gradient literature, e.g., [7], we believe that recent work in meta-learning [23], [24] is also similar to our own work, at least in spirit. Meta-learning aims to learn priors on the solution to a given machine learning problem, and thereby speed up online fine-tuning when presented with a slightly different instance of the problem [25]. Meta-learning is used in practice to apply reinforcement learning algorithms in hardware settings [26], [27].

B. Preliminaries
Next, we fix mathematical notation and review some definitions used extensively in the paper. Given a random variable X, if they exist, the expectation of X is denoted E[X] and its variance is denoted Var(X). Our analysis relies heavily on the notion of a sub-Gaussian distribution. We say that a random variable X ∈ R^n is sub-Gaussian if there exists a constant C > 0 such that for each t ≥ 0 we have P{|X|_2 ≥ t} ≤ 2 exp(−t^2/C^2). Informally, a distribution is sub-Gaussian if its tail is dominated by the tail of some Gaussian distribution. We endow the space of sub-Gaussian distributions with the sub-Gaussian norm ‖X‖_ψ2, defined for scalar X as inf{t > 0 : E exp(X^2/t^2) ≤ 2} and extended to random vectors through their one-dimensional projections [16]. As an example, if X ∼ N(0, σ^2 I) is a zero-mean Gaussian random variable with covariance σ^2 I (with I the n-dimensional identity), then X is sub-Gaussian with norm ‖X‖_ψ2 ≤ Cσ, where the constant C > 0 does not depend on σ^2.
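As a quick numerical sanity check (our own illustration, not part of the paper), the Gaussian tail bound above can be verified directly for a standard Gaussian; the constant C = √2 below is an illustrative choice:

```python
import math

# Check that a standard Gaussian X ~ N(0, 1) satisfies the sub-Gaussian
# tail bound P{|X| >= t} <= 2 exp(-t^2 / C^2) with the (illustrative)
# constant C = sqrt(2).
def gaussian_tail(t):
    """Exact two-sided tail P{|X| >= t} for X ~ N(0, 1)."""
    return math.erfc(t / math.sqrt(2))

def sub_gaussian_bound(t, C=math.sqrt(2)):
    """Sub-Gaussian envelope 2 exp(-t^2 / C^2)."""
    return 2.0 * math.exp(-t ** 2 / C ** 2)

for t in [0.5, 1.0, 2.0, 4.0]:
    assert gaussian_tail(t) <= sub_gaussian_bound(t)
```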

II. FEEDBACK LINEARIZATION
Throughout the paper we will focus on constructing output tracking controllers for systems of the form ẋ = f(x) + g(x)u, y = h(x), (1) where x ∈ R^n is the state, u ∈ R^q is the input and y ∈ R^q is the output. The mappings f : R^n → R^n, g : R^n → R^{n×q} and h : R^n → R^q are each assumed to be smooth, and we assume without loss of generality that the origin is an equilibrium point of the undriven system, i.e., f(0) = 0. Throughout the paper, we will also assume that the state x and the output y can both be measured.

A. Single-input single-output systems
We begin by introducing feedback linearization for single-input, single-output (SISO) systems (i.e., q = 1). We begin by examining the first time derivative of the output: ẏ = L_f h(x) + L_g h(x)u. (2) Here the terms L_f h(x) and L_g h(x) are known as Lie derivatives [2]. In the case that L_g h(x) ≠ 0 for each x ∈ R^n, we can apply u(x, v) = (L_g h(x))^{-1}(−L_f h(x) + v), (3) which exactly 'cancels out' the nonlinearities of the system and enforces the linear relationship ẏ = v, with v some arbitrary, auxiliary input. However, if the input does not affect the first time derivative of the output, that is, if L_g h ≡ 0, then the control law (3) will be undefined. In general, we can differentiate y multiple times, until the input shows up in one of the higher derivatives of the output. Assuming that the input does not appear the first γ − 1 times we differentiate the output, the γ-th time derivative of y will be of the form y^(γ) = L_f^γ h(x) + L_g L_f^{γ−1} h(x)u. (4) Here, L_f^γ h(x) and L_g L_f^{γ−1} h(x) are higher-order Lie derivatives, and we direct the reader to [2, Chapter 9] for further details.
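To make the SISO construction concrete, the following minimal sketch (our own toy example, not from the paper) linearizes a pendulum with dynamics ẋ1 = x2, ẋ2 = −sin(x1) + u and output y = x1, which has relative degree 2; the feedback gains are illustrative choices:

```python
import math

# SISO feedback linearization sketch for x1' = x2, x2' = -sin(x1) + u,
# y = x1 (relative degree 2). The law u = sin(x1) + v cancels the
# nonlinearity, leaving y'' = v; gains k1, k2 are illustrative.
def simulate(x1, x2, k1=4.0, k2=4.0, dt=1e-3, steps=10_000):
    for _ in range(steps):
        v = -k1 * x1 - k2 * x2       # linear feedback for the chain y'' = v
        u = math.sin(x1) + v         # exact cancellation of the -sin term
        x1, x2 = x1 + dt * x2, x2 + dt * (-math.sin(x1) + u)
    return x1, x2

x1, x2 = simulate(1.0, 0.0)
assert abs(x1) < 1e-2 and abs(x2) < 1e-2  # tracking error decays to zero
```

The closed loop behaves like ÿ = −k1 y − k2 ẏ, so the state converges exponentially regardless of the pendulum nonlinearity.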
Assuming that L_g L_f^{γ−1} h(x) ≠ 0 for each x ∈ R^n, the control law u(x, v) = (L_g L_f^{γ−1} h(x))^{-1}(−L_f^γ h(x) + v) (5) enforces the trivial linear relationship y^(γ) = v. We refer to γ as the relative degree of the nonlinear system, which is simply the order of its input-output relationship.

B. Multiple-input multiple-output systems
Next, we consider square multiple-input, multiple-output (MIMO) systems where q > 1. As in the SISO case, we differentiate each of the output channels until at least one input appears. Let γ_j be the number of times we need to differentiate y_j (the j-th entry of y) for at least one input to appear. Combining the resulting expressions for each of the outputs yields an input-output relationship of the form y^(γ) = b(x) + A(x)u, (6) where we have adopted the shorthand y^(γ) = [y_1^(γ_1), ..., y_q^(γ_q)]^T. Here, the matrix A(x) ∈ R^{q×q} is known as the decoupling matrix and the vector b(x) ∈ R^q is known as the drift term. If A(x) is non-singular for each x ∈ R^n, then we observe that the control law u(x, v) = A^{-1}(x)(−b(x) + v), (7) where v ∈ R^q, yields the decoupled linear system y_k^(γ_k) = v_k, (8) where v_k is the k-th entry of v and y_j^(γ_j) is the γ_j-th time derivative of the j-th output. We refer to γ = (γ_1, γ_2, ..., γ_q) as the vector relative degree of the system, with |γ| = Σ_i γ_i the total relative degree of all dimensions. The decoupled dynamics (8) can be compactly represented with the LTI system ξ̇_r = Aξ_r + Bv_r, (9) which we will hereafter refer to as the reference model. Here, A ∈ R^{|γ|×|γ|} and B ∈ R^{|γ|×q} is constructed so that B^T B = I_{q×q}, where I_{q×q} is the q-dimensional identity matrix. Note that (9) collects ξ_r = (y_1, ẏ_1, ..., y_1^(γ_1−1), ..., y_q, ..., y_q^(γ_q−1)). It can be shown [2, Chapter 9] that there exists a change of coordinates x → (ξ, η) such that, in the new coordinates and after application of the linearizing control law, the dynamics of the system are of the form ξ̇ = Aξ + Bv, η̇ = q(ξ, η). (10) That is, the ξ ∈ R^{|γ|} coordinates represent the portion of the system that has been linearized, while the η ∈ R^{n−|γ|} coordinates represent the remaining coordinates of the nonlinear system. The undriven dynamics η̇ = q(ξ, η) (11) are referred to as the zero dynamics. Conditions which ensure that the η coordinates remain bounded during operation will be discussed below.
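The decoupling control law (7) can be sketched for a hypothetical two-input, two-output example; the particular A(x) and b(x) below are made up for illustration and are not the paper's:

```python
import math

# Hypothetical 2x2 example of the decoupling law u = A(x)^{-1}(v - b(x)).
def A(x):
    """Illustrative invertible decoupling matrix."""
    return [[2.0 + math.cos(x), 0.5],
            [0.0, 1.0 + 0.5 * math.sin(x)]]

def b(x):
    """Illustrative drift term."""
    return [math.sin(x), x]

def decoupling_law(x, v):
    a, drift = A(x), b(x)
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    inv = [[ a[1][1] / det, -a[0][1] / det],
           [-a[1][0] / det,  a[0][0] / det]]
    rhs = [v[0] - drift[0], v[1] - drift[1]]
    return [inv[0][0] * rhs[0] + inv[0][1] * rhs[1],
            inv[1][0] * rhs[0] + inv[1][1] * rhs[1]]

# Sanity check: pushing u back through y^(gamma) = b(x) + A(x) u recovers v.
x, v = 0.7, [1.0, -2.0]
u = decoupling_law(x, v)
a, drift = A(x), b(x)
y_gamma = [drift[i] + a[i][0] * u[0] + a[i][1] * u[1] for i in range(2)]
assert all(abs(y_gamma[i] - v[i]) < 1e-9 for i in range(2))
```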

C. Inversion & exact tracking for min-phase MIMO systems
Let us assume that we are given a desired reference signal y_d(•) = (y_{1,d}(•), ..., y_{q,d}(•)). Our goal is to construct a tracking controller for the nonlinear system using the linearizing controller (7), along with a linear controller designed for the reference model (9) which makes use of both feedforward and feedback terms. We will assume that the first γ_j derivatives of y_{j,d}(•) are well defined, and that the signals y_{j,d}(•), y_{j,d}^(1)(•), ..., y_{j,d}^(γ_j)(•) can be bounded uniformly. For compactness of notation, we collect ξ_d(•) = (y_{1,d}, ẏ_{1,d}, ..., y_{1,d}^(γ_1−1), ..., y_{q,d}, ..., y_{q,d}^(γ_q−1)) and y_d^(γ)(•) = (y_{1,d}^(γ_1), ..., y_{q,d}^(γ_q)). Here, ξ_d(•) is used to capture the desired trajectory of the linear reference model, and y_d^(γ)(•) will be used in a feedforward term in the tracking controller. To construct the feedback term, we define the error e(t) = ξ(t) − ξ_d(t), where ξ(•) is the actual trajectory of the linearized coordinates as in (10). Altogether, the tracking controller for the system is then obtained by applying (7) with the auxiliary input v = y_d^(γ) + Ke, where K ∈ R^{q×|γ|} is a linear feedback matrix designed so that (A + BK) is Hurwitz. Under the application of this control law the closed-loop error dynamics become ė = (A + BK)e, (14) and it becomes apparent that e → 0 exponentially quickly. However, while the tracking error decays exponentially, the η coordinates may become unbounded during operation, in which case the linearizing control law will break down. One sufficient condition for η to remain bounded is for the zero dynamics to be globally exponentially stable and for ξ_d(•) and y_d(•) to remain bounded [1, Chapter 9]. When the zero dynamics satisfy this condition we say the nonlinear system is exponentially minimum phase.

III. ADAPTIVE CONTROL
From here on, we will aim to learn a feedback linearization-based tracking controller for the unknown plant ẋ_p = f_p(x_p) + g_p(x_p)u_p, y_p = h_p(x_p), (15) in an adaptive fashion. We assume that we have access to an approximate dynamics model ẋ_m = f_m(x_m) + g_m(x_m)u_m, y_m = h_m(x_m), (16) which incorporates any prior information available about the plant. It is assumed that the states (x_m and x_p) of both systems belong to R^n, that the inputs and outputs of both systems belong to R^q, and that each of the mappings in (15) and (16) is smooth. We make the following assumption about the model and plant: Assumption 1: The plant and model have the same well-defined relative degree γ = (γ_1, γ_2, ..., γ_q) on all of R^n.
Assumption 2: The model and plant are both exponentially minimum phase.
With these assumptions in place, we know that there are globally-defined linearizing controllers for the plant and model, which respectively take the following form: u_p(x, v) = β_p(x) + α_p(x)v, u_m(x, v) = β_m(x) + α_m(x)v. (17) While u_m can be calculated using the model dynamics and the procedures outlined in the previous section, the terms comprising u_p are unknown to us. However, we do know that they may be expressed as u_p(x, v) = u_m(x, v) + ∆β(x) + ∆α(x)v, (18) where ∆β : R^n → R^q and ∆α : R^n → R^{q×q} are unknown but continuous functions. Thus we construct an estimate for u_p of the form û(θ, x, v) = u_m(x, v) + β_{θ_1}(x) + α_{θ_2}(x)v, (19) where β_{θ_1} : R^n → R^q is a parameterized estimate for ∆β, and α_{θ_2} : R^n → R^{q×q} is a parameterized estimate for ∆α. The parameters θ_1 ∈ R^{K_1} and θ_2 ∈ R^{K_2} are to be learned during online operation of the plant, and the total set of parameters θ ∈ R^{K_1+K_2} is collected by stacking θ_1 on top of θ_2. Our theoretical results will assume that the estimates are of the form β_{θ_1}(x) = Σ_{k=1}^{K_1} θ_1^k β_k(x) and α_{θ_2}(x) = Σ_{k=1}^{K_2} θ_2^k α_k(x), where {β_k}_{k=1}^{K_1} and {α_k}_{k=1}^{K_2} are linearly independent bases of functions, such as polynomials or radial basis functions.

A. Idealized continuous-time behavior
We now introduce a continuous-time update rule for the parameters of the learned linearizing controller which assumes that we know the functional form of the nonlinearities of the system.In Section III-B, we demonstrate how to approximate this ideal behavior in the sampled data setting using a policy gradient update rule which requires no information about the structure of the plant's nonlinearities.
We begin by assuming that there exists a set of "true" parameters θ* = (θ*_1, θ*_2) ∈ R^{K_1+K_2} for the plant so that for each x ∈ R^n and v ∈ R^q we have û(θ*, x, v) ≡ u_p(x, v). In this case, we can write our parameter estimation error as φ = (φ_1, φ_2) = (θ_1 − θ*_1, θ_2 − θ*_2), so that θ = φ + θ*. With the gain matrix K constructed as in Section II-C, an estimate for the feedback linearization-based tracking controller is of the form u = û(θ, x, y_d^(γ) + Ke).
When this control law is applied to the system, the closed-loop error dynamics take the form ė = (A + BK)e + BWφ, (20) where W is a complicated function of x, y_d^(γ) and e which contains terms involving b_p(x), A_p(x), β_m(x), α_m(x), β_p(x) and α_p(x). The exact form of this function can be found in the technical report. The term BWφ captures the effect that the parameter estimation error φ has on the closed-loop error dynamics. As we have done here, we will frequently drop the arguments of W to simplify notation. We will also write W(t) for W(x(t), y_d^(γ)(t), e(t)) when we wish to emphasize the dependence of the function on time.
Ideally, we would like to drive BWφ → 0 as t → ∞ so that we obtain the desired closed-loop error dynamics (14). Recalling from Section II-B that the reference model is designed such that B^T B = I, this suggests applying the least-squares cost signal R(t) = (1/2)‖B^T(ė − (A + BK)e)‖^2 = (1/2)‖Wφ‖^2 (21) and following the negative gradient of the cost with the update rule φ̇ = −∇_φ R = −W^T Wφ. (22) Least-squares gradient-following algorithms of this sort are well studied in the adaptive control literature [1, Chapter 2]. Since θ̇ = φ̇, this suggests that the parameters should also be updated according to θ̇ = −W^T Wφ. Altogether, we can represent the tracking and parameter error dynamics with the linear time-varying system [ė; φ̇] = A(t)[e; φ], where A(t) is the block matrix with first row (A + BK, BW(t)) and second row (0, −W(t)^T W(t)). (23) Letting X = (e^T, φ^T)^T, the solution to this system is given by X(t) = Φ(t, t_0)X(t_0), (24) where for each t_1, t_2 ∈ R the state transition matrix Φ(t_1, t_2) is the solution to the matrix differential equation (d/dt)Φ(t, t_2) = A(t)Φ(t, t_2) with initial condition Φ(t_2, t_2) = I, where I is the identity matrix of appropriate dimension. From the adaptive control literature, it is well known that if W(t)^T W(t) is "persistently exciting," in the sense that there exists δ > 0 such that for each t_0 ≥ 0 we have c_1 I ≤ ∫_{t_0}^{t_0+δ} W(t)^T W(t) dt ≤ c_2 I (25) for some c_1, c_2 > 0, then the time-varying system (23) is exponentially stable, provided that W(t) also remains bounded. Intuitively, this condition simply ensures that the regressor term W^T W is "rich enough" during the learning process to drive φ → 0 exponentially quickly. Observing (20), we also see that if φ → 0 exponentially quickly then e → 0 exponentially as well.
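The gradient flow (22) and the role of persistence of excitation can be illustrated with a toy simulation; the regressor W(t) = [cos t, sin t] below is our own illustrative choice of a persistently exciting signal, not the paper's:

```python
import math

# Euler simulation of the least-squares gradient flow phi' = -W(t)^T W(t) phi
# with the persistently exciting 1x2 regressor W(t) = [cos t, sin t]
# (illustrative). At any instant only one direction is excited, but over a
# period the excitation covers R^2, so phi -> 0 exponentially.
def simulate(phi, dt=1e-3, T=40.0):
    t = 0.0
    while t < T:
        w = [math.cos(t), math.sin(t)]
        wphi = w[0] * phi[0] + w[1] * phi[1]      # scalar W(t) phi
        phi = [phi[0] - dt * w[0] * wphi,         # phi' = -W^T (W phi)
               phi[1] - dt * w[1] * wphi]
        t += dt
    return phi

phi = simulate([2.0, -3.0])
assert abs(phi[0]) < 1e-2 and abs(phi[1]) < 1e-2  # parameter error vanishes
```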
We formalize this point with the following lemma: Lemma 1: Let the persistence of excitation condition (25) hold, and assume that there exists C > 0 such that ‖W(t)‖ < C for each t ∈ R. Then there exist M > 0 and ζ > 0 such that for each t_1 ≥ t_2 ≥ 0 we have ‖Φ(t_1, t_2)‖ ≤ M e^{−ζ(t_1 − t_2)}, with Φ(t_1, t_2) defined as above.
A proof of this result can be found in the Appendix, but variations of it can be found in standard adaptive control texts [1]. Unfortunately, the update rule (22) cannot be directly implemented, since we know neither W nor φ. In the next section we introduce a model-free update rule for the parameters of the learned controller which approximates the continuous update (22) without requiring direct knowledge of W or φ.

B. Sampled-data parameter updates with policy gradients
Hereafter, we will assume that the control supplied to the plant can only be updated every ∆t seconds. While this setting provides a more realistic model for many robotic systems, sampling has the unfortunate effect of destroying the affine relationship between the plant's inputs and outputs [11] which was key to the continuous-time design techniques discussed above. Nevertheless, we now introduce a framework for approximately matching the ideal tracking and parameter error dynamics introduced in the previous section in the sampled-data setting, using an Euler discretization of the continuous-time reward (21) and a policy-gradient based parameter update rule.
Before introducing our sampled-data control law and adaptation scheme, we first fix notation and discuss a few key assumptions our analysis will employ. To begin, we let t_k = k∆t for each k ∈ N denote the sampling times for the system. Letting x(•) denote the trajectory of the plant, we let x_k = x(t_k) ∈ R^n denote the state of the plant at the k-th sample. Similarly, we let ξ(•) denote the trajectory of the outputs and their derivatives as in (10), and we set ξ_k = ξ(t_k) ∈ R^{|γ|} (not to be confused with the k-th entry of ξ). Next, we let u_k ∈ R^q denote the input applied to the plant on the interval [t_k, t_{k+1}). The parameters for our learned controller will be updated only at the sampling times, and we let θ_k ∈ R^K denote the value of the parameters on [t_k, t_{k+1}). We again let y_d(•), ξ_d(•) and y_d^(γ)(•) be continuous and bounded. In the sampled-data setting, we require the continuity of y_{j,d}^(γ_j)(•) to ensure that it does not vary too much within a given sampling period.
After sampling, the discrete-time tracking error dynamics obey a difference equation of the form e_{k+1} = H_k(x_k, e_k, u_k), (27) where H_k : R^n × R^{|γ|} × R^q → R^{|γ|} is obtained by integrating the dynamics of the nonlinear system and reference trajectory over [t_k, t_{k+1}). Generally, H_k will no longer be affine in the input. However, the relationship is approximately affine for small values of ∆t. Indeed, with Assumptions 3 and 5 in place, if we apply the control law u_k = û(θ_k, x_k, y_{d,k}^(γ) + Ke_k), (28) then an Euler discretization of the continuous-time error dynamics (20) yields e_{k+1} ≈ (I + ∆t(A + BK))e_k + ∆t BW_k φ_k, (29) where we have set W_k = W(x_k, ξ_k, y_{d,k}^(γ) + Ke_k). Thus, letting Ā = (I + ∆t(A + BK)), for small ∆t > 0 the continuous-time cost is well approximated by R_k = (1/(2∆t^2))‖B^T(e_{k+1} − Āe_k)‖^2, (30) where we note that e_k and e_{k+1} are both quantities which can be measured by numerically differentiating the outputs of the plant. Intuitively, the sampled-data cost R_k provides a measure of how well the control u_k matches the desired change in the tracking error (20) over the interval [t_k, t_{k+1}).
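A small numerical check of this approximation, under our hedged reading of the sampled cost (30) and with illustrative matrices of our own choosing: when e_{k+1} follows the Euler model exactly, the sampled cost recovers the continuous-time quantity (1/2)‖W_k φ_k‖^2:

```python
# Illustrative check that R_k = ||B^T(e_{k+1} - Abar e_k)||^2 / (2 dt^2)
# equals (1/2)||W_k phi_k||^2 when e_{k+1} follows the Euler model
# e_{k+1} = Abar e_k + dt * B * (W_k phi_k). Matrices are toy examples.
dt = 0.01
A_cl = [[0.0, 1.0], [-1.0, -2.0]]           # stand-in for A + BK (Hurwitz)
Abar = [[1.0 + dt * A_cl[0][0], dt * A_cl[0][1]],
        [dt * A_cl[1][0], 1.0 + dt * A_cl[1][1]]]
e_k = [0.5, -0.2]
W_phi = 0.7                                  # scalar stand-in for W_k phi_k
# B = [0; 1], so B^T B = 1 and the disturbance enters the second component.
e_next = [Abar[0][0] * e_k[0] + Abar[0][1] * e_k[1],
          Abar[1][0] * e_k[0] + Abar[1][1] * e_k[1] + dt * W_phi]
resid = e_next[1] - (Abar[1][0] * e_k[0] + Abar[1][1] * e_k[1])  # B^T(e+ - Abar e)
R_k = resid ** 2 / (2 * dt ** 2)
assert abs(R_k - 0.5 * W_phi ** 2) < 1e-9
```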
Next, we add probing noise to the control law (28) to ensure that the input is sufficiently exciting and to enable the use of policy-gradient methods for estimating the gradient of the discrete-time cost signal. In particular, we draw the input according to u_k ∼ π_k(•|θ_k, x_k, e_k) = û(θ_k, x_k, y_{d,k}^(γ) + Ke_k) + w_k, (31) where w_k ∼ N(0, σ^2 I) is additive zero-mean Gaussian noise. Methods for selecting the variance-scaling term σ^2 will be discussed below; for now it is sufficient to assume that σ^2 is bounded. With the addition of the random noise we now define the objective J_k(θ_k) = E_{u_k ∼ π_k}[R_k(x_k, e_k, u_k)], (32) noting that it is also common for policy-gradient methods to use an expected "cost-to-go" as the objective. Regardless, using the policy-gradient theorem [28], the gradient of J_k can be written as ∇_{θ_k} J_k = E_{u_k ∼ π_k}[R_k(x_k, e_k, u_k) ∇_{θ_k} log π_k(u_k|θ_k, x_k, e_k)], (33) where the expectation accounts for randomness due to the input. Moreover, a noisy, unbiased estimate of ∇J_k is given by Ĵ_k = R_k(x_k, e_k, u_k) ∇_{θ_k} log π_k(u_k|θ_k, x_k, e_k), (34) where u_k ∼ π_k is the actual input applied to the plant over the k-th time interval. Recall that R_k(x_k, e_k, u_k) can be directly calculated using e_k, e_{k+1} and (30), and ∇_{θ_k} log π_k(u_k|θ_k, x_k, e_k) can also be computed since the derivatives of û (and thus of log π_k) are known to us. Thus, Ĵ_k can be computed using values that we have assumed we can measure. However, since the input u_k is random, the gradient estimate is drawn according to Ĵ_k ∼ ∆Ĵ_k(•|θ_k, x_k, e_k), (35) where the random variable ∆Ĵ_k is constructed using the relationship (34). Using our estimate of the gradient of the discrete-time reward, we propose the following noisy update rule for the parameters of our learned controller: θ_{k+1} = θ_k − ∆t Ĵ_k. (36) Putting it all together, the sampled-data stochastic version of our error dynamics becomes e_{k+1} = H_k(x_k, e_k, u_k), φ_{k+1} = φ_k − ∆t Ĵ_k, (37) where u_k ∼ π_k and Ĵ_k is calculated as in (34). We make the following assumptions about this stochastic process: Assumption 4: There exists a constant C > 0 such that sup_{k≥0} ‖w_k‖ < C almost surely.
Assumption 5: There exists a constant C > 0 such that sup_{k≥0} ‖x_k‖ < C and sup_{k≥0} ‖θ_k‖ < C almost surely.
Assumption 4 ensures that the additive noise does not drive the state to be unbounded during a single sampling interval, while Assumption 5 ensures that the gradient estimate does not become undefined during the learning process. These important technical assumptions are common in the theory of stochastic approximations [8], and allow us to characterize the estimator for the gradient as follows: Lemma 2: Let Assumptions 3-5 hold. Then ∆Ĵ_k(•|θ_k, x_k, e_k) is a sub-Gaussian distribution, with bias and sub-Gaussian norm bounds quantified in the Appendix. The lemma demonstrates a trade-off between the bias and variance of the gradient estimate that has been observed in the reinforcement learning literature [29], [15]. Specifically, the bias of the gradient estimate decreases as σ^2 → 0, but this causes the variance of the estimator to blow up, as indicated by the increasing sub-Gaussian norm. However, the bias of the gradient estimate has a term which is O(∆t) and which does not depend on the amount of noise added to the system. This term comes from the fact that we have resorted to using a finite-difference approximation (30) to approximate the gradient of the continuous-time reward in the sampled-data setting. Due to this inherent bias, little is gained by decreasing σ^2 past the point where σ^2 = O(∆t). Next, we analyze the overall behavior of (37).
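As an aside, the unbiasedness of the score-function estimator (34) for a Gaussian policy can be checked with a quick Monte Carlo experiment; the quadratic reward and the parameters below are illustrative choices of ours, not the paper's:

```python
import random

# Monte Carlo check that the score-function (policy-gradient) estimator
#   g = R(u) * (u - mu) / sigma^2,   u ~ N(mu, sigma^2),
# is an unbiased estimate of d/dmu E[R(u)]. For the illustrative reward
# R(u) = u^2 the true gradient is d/dmu (mu^2 + sigma^2) = 2*mu.
random.seed(0)
mu, sigma, N = 0.5, 1.0, 200_000
total = 0.0
for _ in range(N):
    u = random.gauss(mu, sigma)
    total += (u ** 2) * (u - mu) / sigma ** 2   # one score-function sample
estimate = total / N
assert abs(estimate - 2 * mu) < 0.1             # close to the true gradient
```

The per-sample variance here is large relative to the gradient itself, which is the high-variance behavior the lemma quantifies.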

C. Convergence analysis
The main idea behind our analysis is to model our sampled-data error dynamics (37) as a perturbation to the idealized continuous-time error dynamics (23), as is commonly done in the stochastic approximation literature [8]. Under the assumption that W^T W is persistently exciting, the nominal continuous-time dynamics are exponentially stable, and we observe that the total perturbation accumulated over each sampling interval decays exponentially as time goes on. Due to space constraints, we outline the main points of the analysis here but leave the details to the technical report.
Our analysis makes use of the piecewise-linear curve φ : R → R^K which is constructed by interpolating between φ_k and φ_{k+1} along each interval [t_k, t_{k+1}). That is, we define φ(t) = φ_k + ((t − t_k)/∆t)(φ_{k+1} − φ_k) for t ∈ [t_k, t_{k+1}). Combining the tracking error and the interpolated parameter error into the state X = (e^T, φ^T)^T, we may write Ẋ(t) = A(t)X(t) + δ(t), where for each t ∈ R the dynamics matrix A(t) is constructed as in (23), and the disturbance δ : R → R^{|γ|+K} captures the deviation from the idealized continuous dynamics caused at each instant of time by the sampling, the additive noise, and the process of interpolating the parameter error. Again letting Φ(t, τ) denote the solution to (d/dt)Φ(t, τ) = A(t)Φ(t, τ) with initial condition Φ(s, s) = I, for each t ≥ s ∈ R we have X(t) = Φ(t, s)X(s) + ∫_s^t Φ(t, τ)δ(τ) dτ. Now, if we let X_k = X(t_k) for each k ∈ N, we can instead write X_{k+1} = Φ(t_{k+1}, t_k)X_k + δ_k, where the term δ_k ∈ R^{|γ|+K} is the total disturbance accumulated over the interval [t_k, t_{k+1}). We separate the effects the disturbance has on the tracking and parameter error dynamics by letting δ_k^e ∈ R^{|γ|} denote the first |γ| elements of δ_k and letting δ_k^φ ∈ R^K denote the remaining entries. On the interval [t_k, t_{k+1}) the disturbance δ(t) can be written as a function of u_k, x_k and e_k. Since u_k is a random function of x_k, for fixed x_k, e_k and θ_k the two components of δ_k are distributed according to δ_k^e ∼ ∆_k^e(•|θ_k, x_k, e_k) and δ_k^φ ∼ ∆_k^φ(•|θ_k, x_k, e_k). These random variables are constructed by integrating the disturbance over [t_k, t_{k+1}), and an explicit representation of these variables can be found in the proof of the following lemma, which is given in the Appendix.
Next, for each k ∈ N we put M_k = E[δ_k] and ε_k = δ_k − E[δ_k], where the expectation is over the randomness in u_k. Our overall discrete-time process can then be written as X_{k+1} = Φ(t_{k+1}, t_k)X_k + ε_k + M_k, where ε_k ∈ R^{|γ|+K} is constructed by stacking ε_k^e on top of ε_k^φ, and M_k is constructed by stacking M_k^e on top of M_k^φ. Now, if we assume that W^T W is persistently exciting, then for each k_1 ≤ k_2 ∈ N we have ‖Φ(t_{k_2}, t_{k_1})‖ ≤ Mρ^{k_2−k_1}, where M > 0 and ζ > 0 are as in Lemma 1 and we have put ρ = e^{−ζ∆t} < 1. Thus, under this assumption, we may use the triangle inequality to bound ‖X_k‖ ≤ Mρ^k‖X_0‖ + Σ_{i=0}^{k−1} Mρ^{k−i}(‖ε_i‖ + ‖M_i‖). Thus, when W^T W is persistently exciting, we see that the effects of the disturbance accumulated at each time step decay exponentially as time goes on, along with the effects of the initial tracking and parameter error. A full proof of the following theorem is given in the Appendix, but the main idea is to use properties of geometric series to bound Σ_{i=0}^{k−1} ρ^{k−i}‖M_i‖ over time, and to use the concentration inequality from [16, Theorem 2.6.3] to bound the deviation of Σ_{i=0}^{k−1} ρ^{k−i}ε_i. Theorem 1: Let Assumptions 3-5 hold. Further assume that W^T W is persistently exciting, and let M > 0 and ζ > 0 be defined as in Lemma 1. Then there exist numerical constants C_1 > 0 and C_2 > 0 such that (51) holds, and for each λ > 0, with probability 1 − λ, (52) holds. Despite the high variance of the simple policy-gradient parameter update analyzed so far, the theorem demonstrates that with high probability our tracking and parameter errors concentrate around the origin. As ∆t decreases, the bias introduced by the sampling and additive noise diminishes, as does the radius of our high-probability bound. These bounds also become tighter as the exponential rate of decay for the idealized continuous-time dynamics increases. The theorem again displays the trade-off between the bias and variance of the learning scheme observed in Section III-B. However, we still observe in equation (51) that the bias introduced by the noise is relatively small, meaning σ^2 does not have to be made prohibitively small so as to degrade the bound in (52).
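The geometric-series argument behind Theorem 1 can be illustrated on a scalar contraction; the contraction factor and noise level below are illustrative:

```python
import random

# Sketch of the perturbation bound used in Theorem 1: for a contraction
# x_{k+1} = rho * x_k + eps_k with |eps_k| <= eps, the triangle inequality
# and a geometric series give |x_k| <= rho^k |x_0| + eps / (1 - rho).
# rho, eps and the horizon are illustrative values.
random.seed(1)
rho, eps, x0 = 0.9, 0.1, 5.0
x = x0
bound_violated = False
for k in range(1, 500):
    x = rho * x + random.uniform(-eps, eps)
    if abs(x) > rho ** k * abs(x0) + eps / (1 - rho) + 1e-12:
        bound_violated = True
assert not bound_violated  # the state concentrates in a ball of radius eps/(1-rho)
```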

D. Variance Reduction via Baselines
It is common for policy gradients to be implemented with a baseline [30]. In this case, the gradient estimator in (34) may become biased, though it often has lower variance [7], [31]. The expression with a baseline is Ĵ_k = (R_k(x_k, e_k, u_k) − S_k) ∇_{θ_k} log π_k(u_k|θ_k, x_k, e_k), (53) where S_k is the baseline. If S_k does not depend on u_k, then the addition of the baseline does not add any bias to the gradient estimate [7]. For example, in our numerical example below we use a simple sum-of-past-rewards baseline by setting S_k = (1/k) Σ_{i=0}^{k−1} R_i, where R_i is the i-th reward recorded. We consider it a matter of future work to rigorously study the effects of this and other common baselines from the reinforcement learning literature within the theoretical framework we have developed.
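The variance-reduction effect of a baseline can be seen in a small Monte Carlo experiment; the reward R(u) = (u − 2)^2 and the mean-reward baseline below are illustrative stand-ins for the paper's R_k and S_k:

```python
import random

# Illustration that subtracting a baseline that does not depend on u leaves
# the score-function estimator (approximately) unbiased while reducing its
# variance. Reward R(u) = (u - 2)^2 and policy u ~ N(mu, 1) are illustrative.
random.seed(0)
mu, sigma, N = 0.0, 1.0, 100_000
samples = [random.gauss(mu, sigma) for _ in range(N)]
rewards = [(u - 2.0) ** 2 for u in samples]
baseline = sum(rewards) / N                      # mean-reward baseline

def stats(use_baseline):
    g = [((r - baseline) if use_baseline else r) * (u - mu) / sigma ** 2
         for u, r in zip(samples, rewards)]
    mean = sum(g) / N
    var = sum((gi - mean) ** 2 for gi in g) / N
    return mean, var

mean_nb, var_nb = stats(False)
mean_b, var_b = stats(True)
# True gradient of E[(u - 2)^2] with respect to mu is 2*(mu - 2) = -4.
assert abs(mean_nb + 4.0) < 0.2 and abs(mean_b + 4.0) < 0.2
assert var_b < var_nb                            # baseline shrinks the variance
```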

IV. NUMERICAL EXAMPLE
Our numerical example examines the application of our method to the double pendulum depicted in Figure 1(a), whose dynamics can be found in [32]. With a slight abuse of notation, the system has generalized coordinates q = (θ_1, θ_2) which represent the angles the two arms make with the vertical. Letting x = (x_1, x_2, x_3, x_4) = (q, q̇), the system can be represented with a state-space model of the form (1) where the angles of the two joints are chosen as outputs. It can be shown that the vector relative degree is (2, 2), so the system can be completely linearized by state feedback. The dynamics of the system depend on the parameters m_1, m_2, l_1, l_2, where m_i is the mass of the i-th link and l_i its length. For the purposes of our simulation, we set the true parameters of the plant to be m_1 = m_2 = l_1 = l_2 = 1. However, to set up the learning problem, we assume that we have inaccurate measurements of each of these parameters, namely m̂_1 = m̂_2 = l̂_1 = l̂_2 = 1.3. That is, each estimated parameter is scaled to 1.3 times its true value. Our nominal model-based linearizing controller u_m is constructed by computing the linearizing controller for the dynamics model which corresponds to the inaccurate parameter estimates. The learned component of the controller is then constructed by using radial basis functions to populate the entries of {β_k}_{k=1}^{K_1} and {α_k}_{k=1}^{K_2}. In total, 250 radial basis functions were used. For the online learning problem we set the sampling interval to ∆t = 0.05 seconds and set the level of probing noise at σ^2 = 0.1. The reward was regularized using an average sum-of-rewards baseline as described in Section III-D. The reference trajectories for each of the output channels were constructed by summing together sinusoidal functions whose frequencies are non-integer multiples of each other, to ensure that the entire region of operation was explored. The feedback gain matrix K ∈ R^{2×4} was designed so that each of the eigenvalues of (A + BK) is equal to −1.5, where A ∈ R^{4×4} and B ∈ R^{4×2} are the appropriate matrices in the reference model for the system.
Figure 1(b) shows the norm of the tracking error of the learning scheme over time, while Figure 1(c) shows the norm of the tracking error for the nominal model-based controller with no learning. Note that the learning-based approach is able to steadily reduce the tracking error over time while keeping the system stable.

V. CONCLUSION
This paper developed an adaptive framework which employs model-free policy-gradient parameter update rules to construct a feedback-linearization-based tracking controller for systems with unknown dynamics. We combined analysis techniques from the adaptive control literature with the theory of stochastic approximation to provide high-confidence tracking guarantees for the closed-loop system, and demonstrated the utility of the framework through a simulation experiment. Beyond the immediate utility of the proposed framework, we believe the analysis tools we developed provide a foundation for studying the use of reinforcement learning algorithms for online adaptation.

APPENDIX
The following appendices contain items which were too long to present in the main body of the document. Appendix A contains two auxiliary lemmas which are used extensively throughout the main proofs of Lemma 2 in Appendix B, Lemma 3 in Appendix C, and Theorem 1 in Appendix D. Appendix E introduces the explicit form of the error equations in equation (20), and finally Appendix F provides a proof of Lemma 1.

A. Auxiliary Lemmas
Lemma 4: Let Assumptions 3-5 hold. Then there exists a constant C > 0 such that the bounds in (54) and (55) hold. Proof: The bound in (54) follows directly from Assumption 5 and the smoothness of the vector field of the plant. The bound in (55) follows from Assumption 3 and the continuity of the basis elements β_k and α_k.
Lemma 5: Let Assumptions 3-5 hold. Then there exists C > 0 such that for each t ∈ [t_k, t_{k+1}) we have ‖ξ(t) − ξ_k‖ ≤ C∆t, ‖x(t) − x_k‖ ≤ C∆t, and ‖φ(t) − φ_k‖ ≤ C∆t. Proof: First, we have that ẋ = f_p(x(t)) + g_p(x(t))u_k on the interval [t_k, t_{k+1}). By our standing Assumptions and the continuity of f_p and g_p, there exists a finite constant bounding ‖ẋ‖ on the interval, which yields the bound on ‖x(t) − x_k‖; the bound on ‖ξ(t) − ξ_k‖ follows by an analogous argument. To prove the bound for ‖φ(t) − φ_k‖, recall that φ(t) = Ĵ_k (t − t_k)/∆t + φ_k. The expression for Ĵ_k is given in equation (75) below and is bounded under our standing Assumptions. Thus, there exists K > 0 such that ‖Ĵ_k‖ ≤ K, and hence ‖φ(t) − φ_k‖ ≤ K∆t. The desired result follows from the above observations.
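The first two bounds of Lemma 5 say that, under a zero-order-hold input, the state drifts by at most C∆t over one sampling interval. This can be checked numerically; the sketch below uses simple illustrative pendulum-like dynamics (not the paper's double pendulum) and verifies that halving ∆t roughly halves the deviation.

```python
import numpy as np

def flow(x0, u, dt, steps=1000):
    """Integrate x' = f(x) + g(x)*u with the input held constant over [0, dt].

    f and g below are illustrative smooth dynamics standing in for the
    plant vector fields f_p and g_p; they are not taken from the paper.
    """
    x = np.array(x0, dtype=float)
    h = dt / steps
    for _ in range(steps):
        f = np.array([x[1], -np.sin(x[0])])
        g = np.array([0.0, 1.0])
        x = x + h * (f + g * u)
    return x

x0 = np.array([0.3, -0.2])
u = 0.5
devs = [np.linalg.norm(flow(x0, u, dt) - x0) for dt in (0.1, 0.05, 0.025)]
# Successive deviations shrink roughly by a factor of 2 as dt is halved,
# consistent with the O(dt) bound ||x(t) - x_k|| <= C*dt.
ratios = [devs[i] / devs[i + 1] for i in range(2)]
```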

B. Proof of Lemma 2
Next, we note that we may rewrite the reward, where for convenience of notation we have defined and suppressed the dependence on x_k, e_k, and y^γ_{d,k}, and where we introduce the terms T_j. The gradient of the reward with respect to the learned parameters is then given by (64). Thus, to obtain the desired bound we need only bound the terms involving the T_j.
First, we produce bounds for the T_j terms. By Lemma 5 and the continuity of the vector field of the plant, we know that there exists C_1 > 0 such that the corresponding bound holds for each t ∈ [t_k, t_{k+1}). Finally, by the continuity of y^γ_d we know that there exists C_3 > 0 such that for each t ∈ [t_k, t_{k+1}) we have ‖y^γ_d(t) − y^γ_{d,k}‖ ≤ C_3∆t. Putting these facts together yields (65). Next, we bound the terms of the form (∂/∂θ_k)T_j. Since the T_j depend on x(·) and ξ(·), we first bound how much these trajectories vary over the interval [t_k, t_{k+1}) as the learned parameter is changed. This follows by [33, Theorem 5.6.2], where g_{p,i}(x) is the i-th column of g_p(x) and (û_{θ_k} + w_k)_i is the i-th entry of (û_{θ_k} + w_k). Now, by Assumptions 3-5 and the smoothness of g_p and f_p, there exist K_1 > 0 and K_2 > 0 such that for each t ∈ [t_k, t_{k+1}) we have ‖A(t)‖ ≤ K_1 and ‖B(t)‖ ≤ K_2 for any choice of matrix norm. Furthermore, the corresponding bounds hold for each i = 1, . . ., K_1 and j = 1, . . ., K_2. Returning to our expression for ∇_{θ_k}R_k in (64), using the bound on terms of the form T_j from (65), and additionally using the bounds of the form (∂/∂θ_k)T_i from (70), we may bound the gradient. Putting together the above bounds with (64), and using the facts that E_{w_k∼W_k}[‖w_k‖] = O(σ) and E_{w_k∼W_k}[‖w_k‖²] = O(σ²), we obtain the desired result for the bias of the gradient estimate.
Next, we bound the sub-Gaussian norm of the estimator. We omit some details in the interest of brevity, but the main idea is to first show that the gradient estimate is a Lipschitz continuous function of w_k, where w_k is the realization of W_k. We then use [16, Theorem 2.6.3] to demonstrate that ∆Ĵ_k is a sub-Gaussian random variable. When specialized to our setting, the cited theorem says that if X ∼ N(μ, σ²I) is a Gaussian random variable with finite mean μ and variance σ², then the random variable T(X), where T is a Lipschitz continuous map, is sub-Gaussian with norm ‖T(X)‖_{ψ2} ≤ CLσ, where C > 0 is an absolute constant and L > 0 is a Lipschitz constant for T.
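The concentration fact invoked here, that a Lipschitz function of a Gaussian vector fluctuates on the order of Lσ, can be illustrated empirically. The map T below is an assumed example with Lipschitz constant L (it is not a quantity from the paper); the sketch checks that the standard deviation of T(X) stays below Lσ and that deviations beyond 3Lσ are essentially never observed.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, L = 0.5, 2.0   # illustrative noise level and Lipschitz constant

def T(x):
    """An illustrative map, Lipschitz with constant L in the l2 norm.

    The gradient of sum(|x_i|)/sqrt(n) has l2 norm 1, so scaling by L
    gives Lipschitz constant exactly L.
    """
    return L * np.abs(x).sum(axis=-1) / np.sqrt(x.shape[-1])

# X ~ N(0, sigma^2 I) in R^4; T(X) concentrates with fluctuations O(L*sigma).
X = sigma * rng.standard_normal((100000, 4))
Y = T(X)
emp_std = Y.std()
tail_frac = np.mean(np.abs(Y - Y.mean()) > 3.0 * L * sigma)
```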
Next, letting R_k be as defined above, the estimate for the gradient can be expanded as shown. Here, we have used the fact that u_k ∼ π_k(·|θ_k, x_k, e_k) = N(û_θ, σ²I) together with the formula for the logarithm of the normal density. Noting that u^i_k − û^i_{θ_k} = w^i_k, the above expression can be rewritten and further expanded. Thus we may further bound the expression, where we have used the fact that ‖φ_k − φ(t)‖ ≤ C_1∆t, the fact that W_k is bounded by Lemma 4, and the fact that W(t) is a bounded continuous function of time by Assumptions 3 and 5, so that ‖W_k − W(t)‖ = O(∆t) for each t ∈ [t_k, t_{k+1}). Combining these decompositions with (77) and applying the above bounds yields the stated estimate, where in the final step we have used the fact that E_{w_k∼W_k}[‖w_k‖] = O(σ).
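The structure of this estimator, reward times Gaussian score, is the standard score-function (REINFORCE-style) gradient estimate: ∇_θ log N(u; û_θ, σ²I) = (1/σ²)(u − û_θ)ᵀ ∂û_θ/∂θ. A minimal sketch of its unbiasedness, using an assumed linear-in-parameters policy mean and an assumed quadratic reward (neither taken from the paper), can compare the Monte Carlo average against the analytic gradient of the expected reward:

```python
import numpy as np

rng = np.random.default_rng(2)
phi = np.array([1.0, -0.5, 2.0])      # illustrative features phi(x_k)
theta = np.array([0.2, 0.1, -0.3])    # illustrative learned parameters
sigma, u_star = 0.3, 1.0              # probing-noise level and target, illustrative

u_hat = theta @ phi                   # mean of the Gaussian policy u ~ N(u_hat, sigma^2)
w = sigma * rng.standard_normal(200000)
u = u_hat + w
R = -(u - u_star) ** 2                # illustrative reward

# Score-function estimate: R * (1/sigma^2) * (u - u_hat) * d u_hat / d theta.
grad_est = (R * w / sigma ** 2)[:, None] * phi
grad_mean = grad_est.mean(axis=0)

# Analytic gradient of E[R] = -((u_hat - u_star)^2 + sigma^2):
grad_true = -2.0 * (u_hat - u_star) * phi
```

With enough samples the two agree, illustrating why the bias analysis above concerns only the discretization and noise terms rather than the estimator's form.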
C. Proof of Lemma 3
Next, we demonstrate how to calculate δφ_k, the disturbance to the parameter error over the interval [t_k, t_{k+1}), where in the last equality we have used Lemma 2, and where the remaining term is bounded by the same argument used above. Next, we bound the sub-Gaussian norms of ∆e_k and ∆φ_k. As was done in the proof of Lemma 2, we omit some details in the interest of brevity, since the following arguments closely follow those given above. We see that the map w_k → δe_k is Lipschitz continuous with constant L = C∆t for some constant C > 0 which is independent of σ. Thus, by [16, Theorem 2.6.3] there exists a constant K_1 > 0 such that ‖∆e_k‖_{ψ2} ≤ K_1C∆tσ. Next, we bound the sub-Gaussian norm of ∆φ_k. First, we bound the corresponding term, which is O(∆t/σ). Next, we need to bound the sub-Gaussian norm of ∫_{t_k}^{t_{k+1}} W(t)^T W(t)φ(t) dt. Now, W(t) = W(x(t), y^γ_d(t), e(t)) depends on x(t) and e(t), which both depend on w_k. However, by Theorem 5.6.2 from [33], our standing Assumptions ensure that the maps w_k → x(t) and w_k → e(t) are Lipschitz continuous for each t ∈ [t_k, t_{k+1}). By the continuity of W and Assumption 5, this allows us to conclude that for each t ∈ [t_k, t_{k+1}) the map w_k → W(t) is Lipschitz continuous, and thus the map w_k → W(t)^T W(t)φ(t) is also Lipschitz continuous, since φ(t) is bounded by Assumption 5. Letting L denote a single common Lipschitz constant for the family of maps {w_k → W(t)^T W(t)φ(t)}_{t∈[t_k, t_{k+1})}, we see that the map w_k → ∫_{t_k}^{t_{k+1}} W(t)^T W(t)φ(t) dt is Lipschitz continuous with constant ∆t·L, by integrating the pointwise bound over the length of the interval. Thus, using [16, Theorem 2.6.3], we obtain the desired sub-Gaussian bound. In the statement of the Lemma, we drop the O(∆tσ) term since we have assumed that σ² is finite and are interested in small values of σ.

D. Proof of Theorem 1
The proof will use the following well-known bound on geometric series. However, since ρ → 1 as ∆t → 0, this bound becomes very large for small sampling intervals. To make the dependence on ∆t more explicit we note that, since 1 − e^{−ζ∆t} ≥ ζ∆t/2 for ζ∆t ∈ (0, 1], the stated bound holds. Combining the above bounds with (50), we have the displayed inequality, where K = sup_{0 ≤ j ≤ k} ‖ε_j‖. By Lemma 3 we have that K = O(∆t²(1 + σ + σ²)) and also that |E[Σ_{i=0}^{k−1} ρ^{k−i} M_i]| = 0, which when combined with the above equation implies (51). Next, we characterize the deviation from the mean caused by the M_i using the inequality from [16, Theorem 2.6.3] in (97). Plugging (97) into (95) and combining the result with (50) provides the desired high-confidence bound.
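Both inequalities used here, the closed-form geometric-series bound Σ_{i=0}^{k−1} ρ^{k−i} ≤ ρ/(1 − ρ) and the ∆t-explicit bound 1/(1 − ρ) ≤ 2/(ζ∆t) for ρ = e^{−ζ∆t} with ζ∆t ≤ 1, can be checked numerically. The decay rate ζ below is an illustrative value, not one from the paper.

```python
import math

zeta = 2.0   # illustrative decay rate of the reference error dynamics
checks = []
for dt in (0.5, 0.1, 0.01):          # zeta*dt stays in (0, 1]
    rho = math.exp(-zeta * dt)
    k = 200
    partial = sum(rho ** (k - i) for i in range(k))   # rho + rho^2 + ... + rho^k
    geo_bound = rho / (1.0 - rho)                     # closed-form geometric bound
    dt_bound = 2.0 / (zeta * dt)                      # via 1 - e^{-x} >= x/2 on (0, 1]
    checks.append(partial <= geo_bound <= dt_bound)
```

The comparison makes the trade-off in the proof visible: the geometric bound is tight but blows up like 1/∆t as the sampling interval shrinks, which is why the error terms K must themselves be O(∆t²) for the product to vanish.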

Fig. 1: (a) Schematic representation of the double pendulum model used in the simulation study. (b) The norm of the tracking error for the adaptive learning scheme. (c) The norm of the tracking error for the nominal model-based controller with no learning.

∫_{t_k}^{t_{k+1}} Ĵ_k dt = ∆t Ĵ_k = ∆t · R_k (1/σ²) w_k^T (∂/∂θ_k) û_{θ_k}, where we have used the same notation as in the proof of Lemma 2. Using the same arguments as in the proof of Lemma 2, we see that the map w_k → ∆t · R_k (1/σ²) w_k^T (∂/∂θ_k) û_{θ_k} is Lipschitz continuous with a Lipschitz constant on the order of O(∆t/σ²). Thus, using [16, Theorem 2.6.3], we see that ‖∆t · R_k (1/σ²) w_k^T (∂/∂θ_k) û_{θ_k}‖_{ψ2} = O(∆t/σ).