Adaptive Control for Linearizable Systems Using On-Policy Reinforcement Learning

This paper proposes a framework for adaptively learning a feedback linearization-based tracking controller for an unknown system using discrete-time, model-free policy-gradient parameter update rules. The primary advantage of the scheme over standard model-reference adaptive control techniques is that it does not require the learned inverse model to be invertible at all instances of time. This enables the use of general function approximators to approximate the linearizing controller for the system without having to worry about singularities. Because the policy-gradient updates are random, the overall learning system is stochastic; we therefore combine analysis techniques commonly employed in the machine learning literature with stability arguments from adaptive control to demonstrate that, under a standard persistency of excitation condition, the tracking and parameter errors concentrate near zero with high probability. A simulated example of a double pendulum demonstrates the utility of the proposed theory.


I. INTRODUCTION
Many real-world control systems display nonlinear behaviors which are difficult to model, necessitating the use of control architectures which can adapt to the uncertainty while maintaining certificates of stability. There are many successful model-based strategies for adaptively constructing controllers for uncertain systems [1][2][3], but these methods often require a simple, reasonably accurate parametric model of the system dynamics. In an effort to circumvent this challenge for systems with large nonparametric uncertainties, model-free policy optimization algorithms [4][5][6][7] have enjoyed renewed interest in recent years. However, the convergence and reliability of these stochastic algorithms in the context of nonlinear control remain poorly understood, especially when they are used for online adaptation.
In an effort to close this gap, this paper proposes an online framework for provably learning a feedback linearization-based tracking controller [2, Chapter 9] using policy gradient methods to overcome nonlinear, nonparametric uncertainty. While the final algorithm is model-free, it is inspired by model-based gradient-following update rules from the adaptive control literature [1]. This combination of model-free and model-based methodologies enables us to provide probabilistic safety guarantees for the overall adaptive system by integrating analysis techniques from adaptive control and the stochastic approximation literature [8], which is frequently used to analyze the convergence of machine learning algorithms through the use of concentration inequalities [9, Chapter 2].
Concretely, we first design a continuous-time gradient-following update rule for the parameters of the learned linearizing controller. The overall continuous-time learning system is exponentially stable under a standard persistency of excitation condition; however, direct implementation of the update rule would require knowledge of the true dynamics of the plant. Thus, we then approximate this ideal update rule with a model-free policy-gradient update scheme which operates in discrete time. To explore the dynamics of the unknown system, the update rule requires that probing noise be injected into the control, and it then produces a noisy estimate of the continuous-time gradient. The overall learning system can be viewed as a noisy discretization of the idealized continuous-time process, agreeing with the typical stochastic approximation perspective [8]. We then obtain our probabilistic tracking guarantees by bounding the difference between these two processes.
The analysis of our method reveals several key trade-offs between model-free and model-based methods in the context of online adaptation. The primary advantage of our method over standard Model Reference Adaptive Control (MRAC) approaches [1, Chapter 7] is that our approach does not require the estimated inverse model to remain invertible at each instance of time. This condition is often required for model-based methods to avoid singularities in the learned control law or the associated parameter update scheme. To avoid these issues, MRAC approaches often employ a projection-based update rule to keep the learned parameters in a regime which is known a priori to be free of singularities. In practice, the construction of these regions requires that a simple parameterization of the system's nonlinearities is available [10]. In contrast, the model-free approach enables the use of more general function approximation schemes to capture the unknown nonlinearities in the desired linearizing controller, without the need to ensure that the learned controller remains invertible at each instance of time.
The primary disadvantage of the model-free approach is the loss of deterministic safety guarantees and the high variance of policy gradient methods, which has been documented extensively in the reinforcement learning literature [11]. Much of the recent work from this community has sought to reduce the variance of these methods, possibly at the cost of introducing some bias into the gradient estimate, using techniques such as baselining [12], advantage estimation [13], and various forms of gradient regularization. Forthcoming work will seek to incorporate these methods into the theoretical framework presented in this paper.
Our theoretical results also require a persistency of excitation condition to ensure that the tracking error remains bounded with high probability. Many MRAC approaches [1, 14] do not require such a condition to bound the tracking error, but do require it to ensure proper convergence of the parameters. Our analysis of the model-free method relies on showing that the parameters converge in order to then argue that the tracking error converges. It is an important matter for future work to determine whether this condition can be removed.
Due to space constraints, proofs of claims made in the paper can be found in the technical report [15].

A. Related Work
We note that this work extends our previous efforts [7], where linearizing controllers were learned in an offline setting. A number of approaches have been proposed to avoid the issues with singularities discussed above. One approach is to perturb the estimated linearizing control law to avoid singularities [16][17][18] and use high-gain feedback to bound the added disturbance. Other approaches avoid the need to invert the input-output dynamics by driving the system states to a sliding surface [3] or by using other forms of high-gain feedback to reject higher-order disturbances in the learning scheme [19]. However, these methods often require extra structural assumptions about the system (e.g. that the system is truly Lagrangian). In general, the use of high-gain feedback may also be undesirable for many physical systems, especially in the face of actuator saturation. The method proposed in this paper leads to bounded tracking, due to the constant amount of noise injected into the system during operation. Future work will investigate more sophisticated update rules which gradually decay the amount of injected noise to zero, leading the tracking bounds to continually tighten as time goes on.

B. Preliminaries
Next, we fix mathematical notation and review some definitions used extensively in the paper. Given a random variable X, E[X] denotes its expectation. Our analysis relies on the notion of a sub-Gaussian distribution [9, Chapter 2]. We say that a random variable X ∈ R is sub-Gaussian if there exists a constant C > 0 such that for each t ≥ 0 we have P{|X| ≥ t} ≤ 2 exp(−t²/C²). Informally, the tails of a sub-Gaussian distribution are dominated by those of some Gaussian distribution. We endow the space of sub-Gaussian random variables with the norm ‖·‖_{ψ2} defined by ‖X‖_{ψ2} = inf{t > 0 : E[exp(X²/t²)] ≤ 2}. We say a multi-dimensional random variable X ∈ R^n is sub-Gaussian if the one-dimensional marginal ⟨X, x⟩ is a sub-Gaussian random variable for each x ∈ R^n. In this case the sub-Gaussian norm is defined by ‖X‖_{ψ2} = sup_{‖x‖=1} ‖⟨X, x⟩‖_{ψ2}.

II. FEEDBACK LINEARIZATION
Throughout the paper we will focus on constructing output tracking controllers for systems of the form

ẋ = f(x) + g(x)u,  y = h(x),    (1)

where x ∈ R^n is the state, u ∈ R^q is the input and y ∈ R^q is the output. The mappings f : R^n → R^n, g : R^n → R^{n×q} and h : R^n → R^q are each assumed to be smooth, and we assume without loss of generality that the origin is an equilibrium point of the undriven system, i.e., f(0) = 0. Throughout the paper, we will also assume that the state x and the output y can both be measured.

A. Calculating a Linearizing Controller
Our introduction to feedback linearization will be minimal, as it is covered by a number of standard texts [2, Chapter 9]. The main idea is to take time derivatives of each of the output channels until at least one input appears. Let γ_j be the number of times we need to differentiate y_j (the j-th entry of y) for at least one input to appear. Combining the resulting expressions for each of the outputs yields an input-output relationship of the form

y^{(γ)} = b(x) + A(x)u,    (2)

where we have adopted the shorthand y^{(γ)} = [y_1^{(γ_1)}, . . . , y_q^{(γ_q)}]^T and y_j^{(γ_j)} is the γ_j-th time derivative of y_j. Here, the matrix A(x) ∈ R^{q×q} is known as the decoupling matrix and the vector b(x) ∈ R^q is known as the drift term. If A(x) is non-singular for each x ∈ R^n then we observe that the control law

u(x, v) = A^{-1}(x)(−b(x) + v),    (3)

where v ∈ R^q, yields the decoupled linear system

y_k^{(γ_k)} = v_k,  k = 1, . . . , q,    (4)

where v_k is the k-th entry of v. We refer to γ = (γ_1, γ_2, . . . , γ_q) as the vector relative degree of the system, with |γ| = Σ_i γ_i the total relative degree of all dimensions. The vector relative degree captures the order of the relationship between the inputs and outputs. The decoupled dynamics (4) can be compactly represented with the LTI system

ξ̇ = Aξ + Bv,    (5)

which we will hereafter refer to as the reference model. Here the state of the reference model ξ = (y_1, ẏ_1, . . . , y_1^{(γ_1−1)}, . . . , y_q, . . . , y_q^{(γ_q−1)}) ∈ R^{|γ|} collects the outputs and their derivatives. The matrices A and B are then constructed with the appropriate dimensions to represent (4).
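To make the construction concrete, the following sketch implements the linearizing control law (3) for generic callables b and A. The function names and the toy pendulum example are illustrative choices of ours, not part of the paper's development.

```python
import numpy as np

def linearizing_control(b, A, x, v):
    """Feedback linearizing control u = A(x)^{-1} (v - b(x)), as in (3).

    b : callable, x -> (q,) drift term b(x)
    A : callable, x -> (q, q) decoupling matrix, assumed nonsingular at x
    v : (q,) virtual input; substituting u into (2) yields y^(gamma) = v
    """
    return np.linalg.solve(A(x), v - b(x))

# Toy single-output example: a pendulum with dynamics ydd = -sin(y) + u,
# so b(x) = [-sin(x[0])] and A(x) = [[1.0]] (relative degree gamma = 2).
b = lambda x: np.array([-np.sin(x[0])])
A = lambda x: np.array([[1.0]])
u = linearizing_control(b, A, x=np.array([0.3, 0.0]), v=np.array([1.0]))
```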
It can be shown [2, Chapter 9] that there exists a change of coordinates x → (ξ, η) such that, in the new coordinates and after application of the linearizing control law (3), the dynamics of the system are of the form

ξ̇ = Aξ + Bv,  η̇ = q(ξ, η).    (6)

That is, the ξ ∈ R^{|γ|} coordinates represent the portion of the system that has been linearized, while the η ∈ R^{n−|γ|} coordinates represent the remaining coordinates of the nonlinear system. The undriven dynamics

η̇ = q(0, η)    (7)

are referred to as the zero dynamics. We say that the system is exponentially minimum phase if the zero dynamics are exponentially stable.

B. Exact tracking for min-phase MIMO systems
Given a desired reference signal y_d(·) = (y_{1,d}(·), . . . , y_{q,d}(·)), our goal is to design a tracking controller which drives y(t) → y_d(t). We will assume that the first γ_j derivatives of y_{j,d}(·) are well defined and uniformly bounded.
For compactness of notation, we will collect the desired outputs and their derivatives into the signals ξ_d(·) = (y_{1,d}, ẏ_{1,d}, . . . , y_{1,d}^{(γ_1−1)}, . . . , y_{q,d}, . . . , y_{q,d}^{(γ_q−1)}) ∈ R^{|γ|} and y_d^{(γ)}(·) = (y_{1,d}^{(γ_1)}, . . . , y_{q,d}^{(γ_q)}) ∈ R^q. Next we define the tracking error

e(t) = ξ(t) − ξ_d(t),

where ξ(·) is the actual trajectory of the linearized coordinates as in (6). Altogether, the feedback linearizing tracking controller for the system is then given by

u(x, t) = A^{-1}(x)(−b(x) + y_d^{(γ)}(t) + Ke(t)),

where K ∈ R^{q×|γ|} is a linear feedback matrix designed so that (A + BK) is Hurwitz. Under this control law the closed-loop error dynamics become

ė = (A + BK)e,    (8)

and it becomes apparent that e → 0 exponentially quickly. However, while the tracking error decays exponentially, the η coordinates may become unbounded during operation, in which case the linearizing control law will break down. One sufficient condition for η to remain bounded is for the zero dynamics to be globally exponentially stable and for ξ_d(·) and y_d^{(γ)}(·) to remain bounded [2, Chapter 9].
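A minimal sketch of this tracking law, under the same conventions as the previous snippet; the argument names are ours.

```python
import numpy as np

def tracking_control(b, A, K, x, xi, xi_d, yd_gamma):
    """Exact tracking law u = A(x)^{-1}(-b(x) + yd_gamma + K e).

    K        : (q, |gamma|) feedback gain with (A + B K) Hurwitz
    xi, xi_d : (|gamma|,) actual and desired linearized coordinates
    yd_gamma : (q,) the gamma_j-th derivatives of the reference outputs
    """
    e = xi - xi_d                 # tracking error as defined above
    v = yd_gamma + K @ e          # virtual input for the reference model
    return np.linalg.solve(A(x), v - b(x))
```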

III. ADAPTIVE CONTROL
From here on, we will aim to learn a feedback linearization-based tracking controller for the unknown plant

ẋ = f_p(x) + g_p(x)u,  y = h_p(x)    (9)

in an adaptive fashion. We assume that we have access to an approximate dynamics model for the system,

ẋ = f_m(x) + g_m(x)u,  y = h_m(x),    (10)

which incorporates any prior information available about the plant. Both systems are assumed to satisfy the basic assumptions imposed on (1) and to have the same dimensionality.

Assumption 1: The plant (9) and the model (10) both have the same well-defined vector relative degree γ on all of R^n.
Assumption 2: The model and plant are both exponentially minimum phase.
With these assumptions in place, we know that there are globally-defined linearizing controllers for the plant and the model, which respectively take the form

u_p(x, v) = β_p(x) + α_p(x)v,  u_m(x, v) = β_m(x) + α_m(x)v,    (11)

where, following (3), β_p(x) = −A_p^{-1}(x)b_p(x) and α_p(x) = A_p^{-1}(x), with β_m and α_m defined analogously. While the terms in u_p are unknown, they are of the form

β_p(x) = β_m(x) + ∆β(x),  α_p(x) = α_m(x) + ∆α(x),

where ∆β and ∆α capture the effects of model mismatch. Thus we construct an estimate for u_p of the form

û(x, v, θ) = [β_m(x) + β_{θ_1}(x)] + [α_m(x) + α_{θ_2}(x)]v,

where β_{θ_1} : R^n → R^q is a parameterized estimate for ∆β and α_{θ_2} : R^n → R^{q×q} is a parameterized estimate for ∆α. The parameters θ_1 = (θ_1^1, θ_1^2, . . . , θ_1^{K_1}) ∈ R^{K_1} and θ_2 = (θ_2^1, θ_2^2, . . . , θ_2^{K_2}) ∈ R^{K_2} are to be learned, and are combined into the total set of learned parameters θ = (θ_1, θ_2) ∈ R^{K_1+K_2}. By directly adding the learned component to the controller derived from our nominal dynamics model, our approach is able to directly incorporate prior information the system designer has about the plant. Our theoretical results will assume that the estimates are of the form

β_{θ_1}(x) = Σ_{k=1}^{K_1} θ_1^k β_k(x),  α_{θ_2}(x) = Σ_{k=1}^{K_2} θ_2^k α_k(x),

where {β_k}_{k=1}^{K_1} and {α_k}_{k=1}^{K_2} are linearly independent bases of functions, such as polynomials or radial basis functions. Our theoretical analysis will assume that the learned controller can exactly reconstruct the true linearizing controller for the plant:

Assumption 3: There exists a unique nominal set of learned parameters θ* = (θ_1*, θ_2*) such that β_{θ_1*} = ∆β and α_{θ_2*} = ∆α.

This is a rather strong assumption, since we have assumed that we do not have access to a parametric model for the plant. Frequently in the adaptive control literature it is instead assumed that the constructed learning component can uniformly reconstruct the true tracking controller for the system up to some pre-specified accuracy [19]. A forthcoming article will extend the results in this article to the case where the learned component is not able to exactly reconstruct the true linearizing controller for the system.
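For instance, with scalar radial basis functions shared across output channels, the learned correction terms can be realized as in the sketch below. The feature shapes and weight layout are one illustrative choice of ours, not prescribed by the paper.

```python
import numpy as np

def rbf_features(x, centers, width):
    """Evaluate K scalar Gaussian RBFs at the state x."""
    d2 = np.sum((centers - x) ** 2, axis=1)          # (K,)
    return np.exp(-d2 / (2.0 * width ** 2))

def learned_control(theta1, theta2, x, v, beta_m, alpha_m, centers, width):
    """u_hat(x, v, theta) = [beta_m + beta_theta1](x) + [alpha_m + alpha_theta2](x) v.

    theta1 : (q, K) weights populating the entries of beta_theta1(x)
    theta2 : (q, q, K) weights populating the entries of alpha_theta2(x)
    """
    phi = rbf_features(x, centers, width)            # (K,) shared features
    beta_hat = beta_m(x) + theta1 @ phi              # (q,)
    alpha_hat = alpha_m(x) + theta2 @ phi            # (q, q)
    return beta_hat + alpha_hat @ v
```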

A. Idealized continuous-time behavior
We now introduce an ideal continuous-time update rule for the parameters of the learned controller which does not rely on the learned controller remaining invertible but assumes that we know the true dynamics of the plant. In Section III-B, we demonstrate how to approximate this ideal behavior in the sampled data setting using our model-free method.
Letting θ* ∈ R^{K_1+K_2} be defined as in Assumption 3, we define the parameter estimation error φ = θ − θ*, where θ is the current estimate for θ*. With the gain matrix K constructed as in Section II-B, an estimate for the feedback linearization-based tracking controller is of the form

u(x, t) = û(x, y_d^{(γ)}(t) + Ke(t), θ).

When this control law is applied to the system, the closed-loop error dynamics take the form

ė = (A + BK)e + BW(x, y_d^{(γ)}, e)φ,    (12)

where W is a complicated function of x, y_d^{(γ)} and e which contains terms involving b_p(x), A_p(x), β_m(x), α_m(x), β_p(x) and α_p(x). The exact form of this function can be found in the technical report; however, we note that error equations of this form are commonplace in the adaptive control literature. The term Wφ captures the effect that the parameter estimation error φ has on the closed-loop error dynamics. As we have done here, we will frequently drop the arguments of W to simplify notation, and will sometimes write W(t) to emphasize the dependence of the signal on time.
Ideally, we would like to drive Wφ → 0 so that we obtain the desired closed-loop error dynamics (8). This suggests using the least-squares cost signal

V(t) = (1/2)‖W(t)φ(t)‖²    (13)

and following the negative gradient of the cost with the update rule

φ̇ = −W^T Wφ.    (14)

Least-squares gradient-following algorithms of this sort are well studied in the adaptive control literature [1, Chapter 2]. Since θ̇ = φ̇, this suggests that the parameters should be updated according to θ̇ = −W^T Wφ. Altogether, we can represent the tracking and parameter error dynamics with the linear time-varying system

d/dt [e; φ] = [ (A + BK)  BW(t) ; 0  −W^T(t)W(t) ] [e; φ] =: A(t)[e; φ].    (15)

Letting X̂ = (e^T, φ^T)^T, we have

X̂(t_1) = Φ(t_1, t_2)X̂(t_2),    (16)

where for each t_1, t_2 ∈ R the state transition matrix Φ(t_1, t_2) is the solution to the matrix differential equation (d/dt)Φ(t, t_2) = A(t)Φ(t, t_2) with initial condition Φ(t_2, t_2) = I, where I is the identity matrix of appropriate dimension. From the adaptive control literature, it is well known that if W is "persistently exciting", in the sense that there exist δ > 0 and c_1, c_2 > 0 such that for each t_0 ≥ 0

c_1 I ⪯ ∫_{t_0}^{t_0+δ} W^T(τ)W(τ) dτ ⪯ c_2 I,    (17)

then the time-varying system (15) is exponentially stable, provided that W(t) also remains bounded. Intuitively, this condition simply ensures that the regressor W is "rich enough" during the learning process to drive φ → 0 exponentially quickly. Observing (12), we also see that if φ → 0 exponentially quickly then e → 0 exponentially as well. We formalize this point with the following Lemma:

Lemma 1: Let the persistent excitation condition (17) hold and assume that there exists C > 0 such that ‖W(t)‖ < C for each t ∈ R. Then there exist M > 0 and ζ > 0 such that for each t_1 ≥ t_2

‖Φ(t_1, t_2)‖ ≤ M e^{−ζ(t_1 − t_2)},    (18)

with Φ(t_1, t_2) defined as above.
A proof of this result can be found in the technical report, and variations of it appear in standard adaptive control texts [1]. Unfortunately, the update rule (14) cannot be implemented directly, since we know neither φ nor W. In the next section we introduce a model-free update rule for the parameters of the learned controller which approximates the continuous update (14) without requiring direct knowledge of W or φ.
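To illustrate the role of excitation, the following sketch Euler-integrates the idealized update (14) for a hypothetical 1 × 2 sinusoidal regressor of our choosing. Because the two entries are linearly independent signals, ∫ W^T W dτ is positive definite over any window, so the PE condition (17) holds and ‖φ‖ decays roughly exponentially, as Lemma 1 predicts.

```python
import numpy as np

dt, T = 1e-3, 20.0
phi = np.array([1.0, -0.5])                # initial parameter error
norms = []
for k in range(int(T / dt)):
    t = k * dt
    # Hypothetical regressor: W(t)^T W(t) is rank one at each instant,
    # but its average over any window is positive definite (PE holds).
    W = np.array([[np.sin(t), np.cos(2.0 * t)]])
    phi -= dt * (W.T @ (W @ phi))          # Euler step of (14)
    norms.append(np.linalg.norm(phi))
# norms[-1] is several orders of magnitude below norms[0]
```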

B. Sampled-data parameter updates with policy gradients
While there have been efforts to define continuous-time reinforcement learning algorithms [20, 21], the majority of reinforcement learning algorithms implemented by practitioners are formulated in discrete time. Moreover, the actuators on many real-world systems can only update the control signal applied to the system after some minimum sampling period ∆t has elapsed. Thus, hereafter we will assume that we can only update the control at the maximum frequency defined by these physical limitations, and we formulate our learning problem over the resulting discrete-time system.
To begin, we first fix some notation. We let t_k = k∆t for each k ∈ N denote the sampling times for the system. Next we let u_k ∈ R^q denote the input applied to the plant on the interval [t_k, t_{k+1}). The parameters of our learned controller will be updated only at the sampling times, and we let θ_k ∈ R^{K_1+K_2} denote the value of the parameters on [t_k, t_{k+1}), and similarly set φ_k = θ_k − θ*. Letting x(·) denote the trajectory of the plant, we let x_k = x(t_k) ∈ R^n denote the state of the plant at the k-th sample. Similarly, we let ξ(·) denote the trajectory of the outputs and their derivatives as in (6), and we set ξ_k = ξ(t_k) ∈ R^{|γ|}. We again let y_d(·), ξ_d(·) and y_d^{(γ)}(·) denote the desired trajectory for the outputs and their appropriate derivatives, and let ξ_{d,k} = ξ_d(t_k) ∈ R^{|γ|}, y_{d,k}^{(γ)} = y_d^{(γ)}(t_k) ∈ R^q, and e_k = ξ_k − ξ_{d,k} ∈ R^{|γ|}. We will assume that we have access to each of these signals at every iterate of the process.
Assumption 4: For each j ∈ {1, . . . , q}, the signal y_{j,d}^{(γ_j)}(·) is continuous, and the desired outputs and their derivatives are uniformly bounded.

Remark 1: Typical convergence proofs in the continuous-time adaptive control literature only require that (y_{j,d}(·), ẏ_{j,d}(·), . . . , y_{j,d}^{(γ_j−1)}(·)) be continuous and bounded, but these methods also assume that the input to the plant can be updated continuously. In the sampled-data setting, we additionally require the continuity of y_{j,d}^{(γ_j)}(·) to ensure that it does not vary too much within a given sampling period.
Under this assumption, if we apply the sampled control law

u_k = û(x_k, y_{d,k}^{(γ)} + Ke_k, θ_k),    (19)

then a Taylor expansion of the continuous-time error dynamics (12) yields

e_{k+1} = e_k + ∆t[(A + BK)e_k + BW_kφ_k] + O(∆t²),    (20)

where W_k = W(t_k). Thus, when this controller is applied and ∆t is small, the continuous-time cost signal (13) is well approximated at time t_k by

R_k = (1/∆t²)‖e_{k+1} − Āe_k‖²,    (21)

where Ā = I + ∆t(A + BK). Intuitively, R_k provides a measure of how well the error dynamics match the desired linear, exponentially stable behavior in (12) when the control u_k is applied over the interval [t_k, t_{k+1}). However, if we want to understand how the reward changes as a function of u_k (and, more importantly, of θ_k), then we need to ensure that the control injected into the system is sufficiently exciting. Towards this end, during the learning process we draw the input according to u_k ∼ π_k(·|θ_k, x_k, e_k), where

u_k = û(x_k, y_{d,k}^{(γ)} + Ke_k, θ_k) + w_k

and w_k ∼ N(0, σ²I) is additive zero-mean Gaussian noise, so that π_k(·|θ_k, x_k, e_k) is a Gaussian distribution centered at the deterministic control (19). The effect of choosing different values for σ will be revealed in our analysis below, but for now we restrict our attention to the case where 0 < σ < 1.
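The sampled-data pieces (19) and (21) are short enough to state in code; the reward normalization below follows our reconstruction above, and the names are ours.

```python
import numpy as np

def sampled_input(rng, u_hat, theta, x_k, e_k, yd_gamma_k, K, sigma):
    """Draw u_k ~ N(u_hat(x_k, yd_gamma_k + K e_k, theta), sigma^2 I), as in (19)."""
    u_mean = u_hat(x_k, yd_gamma_k + K @ e_k, theta)
    u_k = u_mean + sigma * rng.standard_normal(u_mean.shape)
    return u_k, u_mean

def sampled_reward(e_next, e_k, A_cl, dt):
    """R_k = ||e_{k+1} - Abar e_k||^2 / dt^2, as in (21), with A_cl = A + B K."""
    Abar = np.eye(len(e_k)) + dt * A_cl
    diff = e_next - Abar @ e_k
    return float(diff @ diff) / dt ** 2
```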
With the addition of the random noise, we now define

J_k(θ_k) = E_{u_k ∼ π_k(·|θ_k, x_k, e_k)}[R_k],    (22)

which is the expected reward when π_k(·|θ_k, x_k, e_k) is applied to the system. We want to move the learned parameters in directions which decrease this cost. The policy gradient theorem [22] enables us to calculate the gradient of this expectation with respect to our learned parameters with the formula

∇_{θ_k} J_k = E[R_k ∇_{θ_k} log π_k(u_k|θ_k, x_k, e_k)],    (23)

where the expectation accounts for the randomness in the input u_k ∼ π_k(·|θ_k, x_k, e_k). Since the components of the learned controller are known to us and we can observe R_k, this quantity can be computed using only information we have access to. In particular, (23) yields a noisy but unbiased estimate of ∇_{θ_k}J_k given by

Ĵ_k = R_k ∇_{θ_k} log π_k(u_k|θ_k, x_k, e_k),    (24)

where u_k is the actual input applied to the plant over the k-th time interval. The gradient estimate is drawn according to

Ĵ_k ∼ J_k(·|θ_k, x_k, e_k),    (25)

where the distribution J_k(·|θ_k, x_k, e_k) is defined through the relationship (24). Using our estimate of the gradient of the discrete-time reward, we propose the following noisy update rule for the parameters of our learned controller:

θ_{k+1} = θ_k − ∆t Ĵ_k.    (26)

Putting it all together, our noisy discretization of (15) is of the form

e_{k+1} = H(e_k, φ_k, u_k),  φ_{k+1} = φ_k − ∆t Ĵ_k,    (27)

where u_k is random, Ĵ_k is calculated as in (24), and the mapping H is obtained by integrating the error dynamics over the interval [t_k, t_{k+1}). Although H is well approximated by the Taylor expansion in (20), it will generally be a complicated nonlinear expression with no closed-form solution [23]. In the following section we analyze the evolution of this stochastic process.
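For the Gaussian policy above, the score function has the closed form ∇_θ log π_k(u_k) = (∂û/∂θ)^T (u_k − û)/σ², which gives the following sketch of the estimator (24) and update (26). How the Jacobian of the controller is exposed is an implementation assumption on our part.

```python
import numpy as np

def pg_estimate(R_k, u_k, u_mean, du_dtheta, sigma):
    """Score-function gradient estimate J_hat_k = R_k * grad log pi_k(u_k), as in (24).

    du_dtheta : (q, K1 + K2) Jacobian of u_hat with respect to theta; for
                the linear bases introduced above this is just the matrix
                of stacked basis-function evaluations.
    """
    score = du_dtheta.T @ (u_k - u_mean) / sigma ** 2
    return R_k * score

def update_parameters(theta, J_hat_k, dt):
    """One step of the noisy parameter update (26)."""
    return theta - dt * J_hat_k
```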

IV. CONVERGENCE ANALYSIS
In this section, we provide high-probability bounds for both the tracking and parameter errors in (27) by bounding the deviation of this process from (15).

A. Bias and Variance of Gradient Estimation
Before bounding the difference between the continuous-time and discrete-time processes, we first need to characterize how closely the noisy gradient estimate (24) approximates the idealized continuous-time update (14). We would like E[Ĵ_k] ≈ W_k^T W_kφ_k, so that equation (26) closely approximates an Euler discretization of (14). In our analysis of the gradient estimator we make the following assumption:

Assumption 5: There exists a constant C > 0 such that sup_{k≥0} ‖x_k‖ < C and sup_{k≥0} ‖θ_k‖ < C almost surely.
This assumption ensures that the reward incurred at each step, and thus also the gradient estimator (24), remains bounded. Similar assumptions are standard in the stochastic approximation literature [8, Chapter 2], which also provides estimates for the probability of the event holding [8, Chapter 3]. Thus, it is to be understood that the ultimate tracking result in Theorem 1 holds conditioned on the event prescribed by Assumption 5. Future work will focus on providing estimates of the probability with which this assumption holds. Nonetheless, under this assumption we obtain the following characterization of the gradient estimator:

Lemma 2: Let Assumptions 4-5 hold. Then J_k(·|θ_k, x_k, e_k) is a sub-Gaussian distribution with

‖E[Ĵ_k] − W_k^T W_kφ_k‖ ≤ O(∆t) + O(σ²)    (28)

and

‖Ĵ_k‖_{ψ2} ≤ O(1 + 1/σ²).    (29)

The hidden constants in equations (28) and (29) depend on the constant in Assumption 5 in a complex manner, and a forthcoming article will more carefully characterize this dependence. The Lemma demonstrates a trade-off between the bias and the variance of the gradient estimate that has been observed in the reinforcement learning literature [24, 25]. Specifically, the bias of the gradient estimate decreases as σ² → 0, but this causes the variance of the estimator to blow up, as indicated by the growing sub-Gaussian norm. The bias of the gradient estimate also has a term which is O(∆t) and does not depend on the amount of noise added to the system. This term arises because we have resorted to the finite-difference approximation (21) of the continuous-time cost in the sampled-data setting. Next, we use this result to study the evolution of (27).
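The variance half of this trade-off is easy to see numerically. In the scalar toy problem below (ours, not the paper's) the score-function estimator happens to be unbiased, so only the blow-up of the noise as σ → 0 is visible; the O(∆t) bias term in (28), which comes from the finite-difference reward, is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.0, 200_000
for sigma in [1.0, 0.3, 0.1, 0.03]:
    u = theta + sigma * rng.standard_normal(n)   # u ~ N(theta, sigma^2)
    R = u ** 2                                   # toy quadratic cost, minimized at u = 0
    J_hat = R * (u - theta) / sigma ** 2         # score-function estimate
    # the mean stays near dJ/dtheta = 2 * theta; the std grows like 1/sigma
    print(f"sigma={sigma:5.2f}  mean={J_hat.mean():6.3f}  std={J_hat.std():9.2f}")
```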

B. Probabilistic Tracking Bounds
The main idea behind our safety analysis is to model our sampled data error dynamics (27) as a perturbation to the idealized continuous-time error dynamics (15), as is commonly done in the stochastic approximation literature [8]. Under the assumption that W (t) is persistently exciting, the nominal continuous time dynamics are exponentially stable and we observe that the total perturbation accumulated over each sampling interval decays exponentially as time goes on. Due to space constraints, we outline the main points of the analysis here but leave the details to the technical report.
Our analysis makes use of the piecewise-linear curve φ̄ : R → R^{K_1+K_2} which is constructed by interpolating between φ_k and φ_{k+1} along the interval [t_k, t_{k+1}). That is, we define

φ̄(t) = φ_k + ((t − t_k)/∆t)(φ_{k+1} − φ_k)  for t ∈ [t_k, t_{k+1}).

Combining the tracking and interpolated parameter errors into the state X = (e^T, φ̄^T)^T, we may write

Ẋ(t) = A(t)X(t) + δ(t),

where for each t ∈ R the dynamics matrix A(t) is constructed as in (15) and the disturbance δ : R → R^{|γ|+K_1+K_2} captures the deviation from the idealized continuous dynamics caused at each instant by the sampling, the additive noise, and the process of interpolating the parameter error. A full expression for this term is provided in [15].
Letting Φ(·, ·) again denote the state transition matrix for the continuous-time system (15), we have

X(t_{k+1}) = Φ(t_{k+1}, t_k)X(t_k) + δ_k,

where δ_k = ∫_{t_k}^{t_{k+1}} Φ(t_{k+1}, τ)δ(τ) dτ is the total disturbance accumulated over the interval [t_k, t_{k+1}). We separate the effects the disturbance has on the tracking and parameter error dynamics by letting δ_k^e ∈ R^{|γ|} denote the first |γ| elements of δ_k and letting δ_k^φ ∈ R^{K_1+K_2} denote the remaining entries. The two components of δ_k depend on the randomly drawn input u_k and are distributed according to

δ_k^e ∼ ∆_k^e(·|θ_k, x_k, e_k),  δ_k^φ ∼ ∆_k^φ(·|θ_k, x_k, e_k).    (30)

These random variables are constructed by integrating the disturbance over [t_k, t_{k+1}), and an explicit representation of them can be found in the technical report.
Next, for each k ∈ N we put ε_k^e = E[∆_k^e(·|θ_k, x_k, e_k)] ∈ R^{|γ|} and ε_k^φ = E[∆_k^φ(·|θ_k, x_k, e_k)] ∈ R^{K_1+K_2}, and then define the zero-mean random variables M_k^e = δ_k^e − ε_k^e and M_k^φ = δ_k^φ − ε_k^φ. Our overall discrete-time process can then be written as

X_{k+1} = Φ(t_{k+1}, t_k)X_k + ε_k + M_k,    (31)

where ε_k is constructed by stacking ε_k^e on top of ε_k^φ and M_k is constructed by stacking M_k^e on top of M_k^φ. Now, if we assume that W is persistently exciting, then for each k_1 ≥ k_2

‖Φ(t_{k_1}, t_{k_2})‖ ≤ Mρ^{k_1−k_2},    (32)

where M > 0 and ζ > 0 are as in Lemma 1 and we have put ρ = e^{−ζ∆t} < 1. Thus, under this assumption we may use the triangle inequality to bound

‖X_k‖ ≤ Mρ^k‖X_0‖ + M Σ_{i=0}^{k−1} ρ^{k−i}(‖ε_i‖ + ‖M_i‖).    (33)

Thus, when W is persistently exciting, we see that the effects of the disturbance accumulated at each time step decay exponentially as time goes on, along with the effects of the initial tracking and parameter error. The following intermediate characterization of the bias and noise terms is crucial for our final analysis:

Lemma 3: Let Assumptions 4-5 hold. Then for each k ∈ N

‖ε_k^e‖ ≤ O(∆t²(1 + σ + σ²)),    (34)
‖ε_k^φ‖ ≤ O(∆t²(1 + σ + σ²)),    (35)
‖M_k^e‖_{ψ2} ≤ O(∆t σ),    (36)
‖M_k^φ‖_{ψ2} ≤ O(∆t(1 + 1/σ²)),    (37)

where the terms ε_k^e, ε_k^φ, M_k^e and M_k^φ are defined as above.
The fact that the terms in (34) and (35) are quadratic in ∆t should not be surprising, since the disturbance is integrated over a time interval of that length. The bound in equation (36) comes from the continuity of the trajectories of the plant with respect to the input (and hence the additive probing noise), in combination with [9, Theorem 5.2.2], which bounds the sub-Gaussian norm of Lipschitz functions of Gaussian variables. Finally, the bound in (37) is obtained by integrating the bound in (29) over the sampling interval.
A full proof of the following result is given in the technical report, but the main idea is to bound (33), using properties of geometric series to control Σ_{i=0}^{k−1} ρ^{k−i}‖ε_i‖ over time and using the concentration inequality from [9, Theorem 2.6.3] to bound the deviation of the accumulated noise Σ_{i=0}^{k−1} ρ^{k−i}M_i from zero.

Theorem 1: Let Assumptions 4-5 hold. Further assume that W is persistently exciting, and let M > 0 and ζ > 0 be defined as in Lemma 1. Then there exist numerical constants C_1 > 0 and C_2 > 0 such that

E‖X_k‖ ≤ Mρ^k‖X_0‖ + (M C_1/ζ)∆t(1 + σ + σ²),    (38)

and for each λ > 0, with probability 1 − λ,

‖X_k − E[X_k]‖ ≤ C_2(1 + σ + 1/σ²)√(∆t/ζ)√(log(2/λ)).    (39)

The bound in (38) demonstrates that the mean of the process converges exponentially to a ball of radius M C_1∆t(1 + σ + σ²)/ζ. The rate of exponential decay is governed by the rate of decay of the ideal continuous-time process, and the radius of the ball grows with ∆t and σ but is inversely proportional to the rate of exponential decay. The bound in (39) provides high-confidence guarantees which bound deviations of the process from its mean, and here we see the bound degrade as σ → 0, again illustrating the trade-off between the bias and the variance of the method. The two bounds can be combined to obtain high-probability tracking guarantees for the system.
The Theorem highlights the apparent necessity of our persistency of excitation condition. Such conditions may be difficult to verify in practice, and it is an important matter for future work to determine whether this requirement is an artifact of our analysis or a fundamental limitation of model-free methods. Future work will also aim to refine the constants in the Theorem so that their dependence on the persistency of excitation condition and the bound in Assumption 4 becomes more transparent, which would make the result more readily applicable to real-world applications.

V. NUMERICAL EXAMPLE
Our numerical example examines the application of our method to the fully actuated double pendulum depicted in Figure 1(a), whose dynamics can be found in [26]. With a slight abuse of notation, the system has generalized coordinates q = (θ_1, θ_2) which represent the angles the two arms make with the vertical. Letting x = (x_1, x_2, x_3, x_4) = (q, q̇) and u = (τ_1, τ_2) ∈ R², with τ_1 and τ_2 the torques applied at the joints, the system can be represented with a state-space model of the form (1). With the outputs chosen to be the two configuration variables, the system has vector relative degree (2, 2).
The dynamics of the system depend on the parameters m_1, m_2, l_1, l_2, where m_i is the mass of the i-th link and l_i its length. For the purposes of our simulation, we set the true parameters of the plant to be m_1 = m_2 = l_1 = l_2 = 1. However, to set up the learning problem, we assume that we only have inaccurate estimates of these parameters, namely m̂_1 = m̂_2 = l̂_1 = l̂_2 = 1.3. That is, each estimated parameter is 1.3 times its true value. Our nominal model-based linearizing controller u_m is constructed by computing the linearizing controller for the dynamics model corresponding to these inaccurate parameter estimates. The learned component of the controller is then constructed by using radial basis functions to populate the entries of {β_k}_{k=1}^{K_1} and {α_k}_{k=1}^{K_2}. In total, 250 radial basis functions were used.
For the online learning problem we set the sampling interval to ∆t = 0.05 seconds and set the level of probing noise at σ² = 0.1. The reference trajectory for each of the output channels was constructed by summing sinusoids whose frequencies are non-integer multiples of each other, to ensure that the entire region of operation was explored. The feedback gain matrix K ∈ R^{2×4} was designed so that each of the eigenvalues of (A + BK) equals −1.5, where A ∈ R^{4×4} and B ∈ R^{4×2} are the appropriate matrices in the reference model for the system. Figure 1(b) shows the norm of the tracking error of the learning scheme over time, while Figure 1(c) shows the norm of the tracking error for the nominal model-based controller with no learning. Note that the learning-based approach is able to steadily reduce the tracking error over time while keeping the system stable. The periods of the sinusoids used in the reference trajectory are very small compared to the length of the simulations, which explains the apparent fast oscillations seen in both plots.
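For reference, the experiment can be organized as the following online loop, which strings together the pieces sketched in earlier sections; the plant integrator, reference generator, and controller Jacobian are placeholders for the double pendulum setup, not the exact code used to produce Figure 1.

```python
import numpy as np

def online_adaptation(plant_step, measure_xi, u_hat, du_dtheta, ref,
                      K, A_cl, theta0, x0, dt, sigma, n_steps, seed=0):
    """Run the sampled-data learning scheme of Section III-B (a sketch).

    plant_step(x, u, dt) -> next state   : black-box plant integrator
    measure_xi(x)        -> (|gamma|,)   : outputs and their derivatives
    ref(t) -> (xi_d, yd_gamma)           : reference signals at time t
    """
    rng = np.random.default_rng(seed)
    theta, x = theta0.copy(), x0.copy()
    xi_d, yd_gamma = ref(0.0)
    e = measure_xi(x) - xi_d
    Abar = np.eye(len(e)) + dt * A_cl              # A_cl = A + B K
    for k in range(n_steps):
        u_mean = u_hat(x, yd_gamma + K @ e, theta)
        u = u_mean + sigma * rng.standard_normal(u_mean.shape)  # probing noise
        x = plant_step(x, u, dt)                   # hold u over [t_k, t_{k+1})
        xi_d, yd_gamma = ref((k + 1) * dt)
        e_next = measure_xi(x) - xi_d
        diff = e_next - Abar @ e
        R = float(diff @ diff) / dt ** 2           # reward (21)
        score = du_dtheta(x, theta).T @ (u - u_mean) / sigma ** 2
        theta = theta - dt * R * score             # policy-gradient update (26)
        e = e_next
    return theta
```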

VI. DISCUSSION
The method proposed in this paper can be seen as analogous to the REINFORCE algorithm [27], perhaps the simplest of model-free algorithms. However, much of the recent success in the RL literature can be attributed to algorithms which overcome the high variance of this "vanilla" reinforcement learning algorithm using techniques such as advantage estimation, baselining, and the reuse of previously seen data. The reuse of data may also enable one to remove our persistency of excitation condition [28]. Future work will aim to include these techniques within the theoretical framework developed in the paper. Throughout the paper we have set the step size used for each gradient update (26) to be constant during the whole learning process. Many of the convergence proofs from the stochastic approximation literature [8] use a decreasing sequence of step sizes to drive the stochastic process to a limit point. Future work will investigate how the learning rate for our scheme can be decreased over time (along with the variance term σ) to make the learning system less noisy as time progresses. It will also be important to provide estimates for the probability of the condition in Assumption 5 holding.
Many adaptive control algorithms make use of filtering techniques to reduce the effect that sensor noise in the physical system has on the parameter updates. Our method, which relies on having accurate measurements of higher-order derivatives of the outputs, may need to take advantage of these techniques for practical implementation. More broadly, developing a theory which incorporates filtering into modern policy gradient methods remains an important open problem.
While there remain many pressing avenues for future work, we hope that the theoretical framework developed in this paper will serve as a basis for these advances.