Feedback Capacity of MIMO Gaussian Channels

Finding a computable expression for the feedback capacity of channels with colored, additive Gaussian noise is a long-standing open problem. In this paper, we solve this problem for channels with multiple inputs and multiple outputs (MIMO) whose noise process is generated as the output of a time-invariant state-space model. Our main result is a computable expression for the feedback capacity in terms of a finite-dimensional convex optimization. The solution to the feedback capacity problem is obtained by formulating the finite-block counterpart of the capacity problem as a \emph{sequential convex optimization problem}, which in turn leads to a single-letter upper bound. This converse derivation integrates tools and ideas from information theory, control, filtering and convex optimization. A tight lower bound is realized by optimizing over a family of time-invariant policies, thus showing that time-invariant inputs are optimal even when the noise process is not stationary. The optimal time-invariant policy is used to construct a simple, capacity-achieving coding scheme for scalar channels, and its analysis reveals an interesting relation between a smoothing problem and the feedback capacity expression.


I. INTRODUCTION
We consider the feedback capacity of a multiple-input multiple-output (MIMO) Gaussian channel where Λ ∈ R^{p×m} is a deterministic matrix, y_i is the channel output and x_i is the channel input. The noise is a colored Gaussian process generated by a vector state-space model (a hidden Markov model) in which the sequence (w_i, v_i) is i.i.d. with a Gaussian distribution. Our assumptions on the state-space model are mild and include, for instance, non-stationary noise processes (when the spectral radius of F is greater than 1). Particular realizations of the state-space model recover well-known random processes such as the auto-regressive moving-average (ARMA) noise process. Most related to our setting is the framework for channels with general additive Gaussian noise processes by Cover and Pombra [2]. They showed that the feedback capacity is equal to the limit of

C_n(P) = max (1/2n) log [ det((B + I) K_{Z^n} (B + I)^T + K_V) / det(K_{Z^n}) ],    (3)

where K_{Z^n} is the covariance of the Gaussian noise, and the maximum is over strictly-causal linear operators B (lower-triangular matrices) and pairs (K_V, B) that satisfy the power constraint Tr(K_V + B K_{Z^n} B^T) ≤ nP. Their general methodology applies to arbitrary Gaussian processes and can be extended to MIMO channels, resulting in a formula similar to (3), but the computation of such expressions remains non-trivial. In this paper, we show that imposing a state-space structure on the Gaussian noise leads to a computable characterization of the infinite-blocklength limit of the optimization problem (3).
Our main result is a computable expression for the feedback capacity, formulated as a finite-dimensional convex optimization problem. The optimization is a maximal-determinant (max-det) problem subject to linear matrix inequality (LMI) constraints, a class of convex optimization problems that often appears in the control literature [3]-[6] and recently also in information theory [7]-[10]. The LMIs are interpretable, and one of them corresponds to a tight relaxation of a Riccati equation. Several aspects of the feedback capacity solution, such as computability, comparison with non-feedback rates, and the optimal input distribution, are discussed by studying the capacity of the moving-average (MA) and the auto-regressive (AR) noise processes.
The literature on the feedback capacity of the scalar Gaussian channel is rich, e.g., [11]-[20], and the focus here is on the works most related to ours (a detailed survey can be found in [21]). In [22], an explicit lower bound for the ARMA(1,1) noise was derived. This lower bound was shown to be optimal for the MA(1) noise in [23], and the conjecture was proven for the ARMA(1,1) noise in [21]. The capacity characterization for the ARMA(1,1) noise in [21] relies on their general result that stationary channel input processes achieve the feedback capacity when the noise is stationary. Based on this fundamental result, [21] also studied the special case of our channel in (1)-(2) where the channel is scalar, the noise is stable, and the hidden state is available to the encoder (w_i = v_i), and formulated its capacity as a finite-dimensional, non-convex optimization problem. In contrast, we provide a convex optimization for the feedback capacity of the general channel in (1)-(2) under mild conditions (see Section II below).

(The authors are with the Department of Electrical Engineering at the California Institute of Technology (e-mails: {oron,vkostina,hassibi}@caltech.edu). Part of this work was published in [1].)

In [10], a change of variable applied to the non-convex optimization in [21], combined with the novel idea of using LMIs,
showed that the capacity can be formulated as a convex optimization problem. However, the change of variable relied on an erroneous claim (see Remark 1). Our paper studies colored Gaussian noise described by the general state-space model, where the hidden state of the noise may or may not be available to the encoder, the channel may be scalar or vector (MIMO), and the noise may or may not be stationary. We express the feedback capacity as a convex optimization problem in this general setting. A recent conference paper also studies MIMO channels, extending the convex optimization in [15] to MIMO channels with ISI [24]. The intersection of [24] and our setting is a MIMO channel with stable, colored Gaussian noise; the capacity results in the current paper (initially published as [1]) and in [24] were developed independently and published concurrently. Each work considers a different extension of the MIMO channel with stable noise: [24] studies channels with ISI, whereas we study colored Gaussian noise that can be either stationary or non-stationary. A major technical contribution of our work is showing the optimality of time-invariant inputs for non-stationary noise processes. This fact and its proof may be of independent interest since they do not rely on the frequency-domain methods of [21].
The starting point of our derivations is the general Cover-Pombra characterization in (3), and we develop a time-domain methodology that yields a computable expression for the feedback capacity. We derive a novel formulation of the n-letter capacity in (3) as a sequential convex optimization problem (SCOP). In particular, we formulate an optimization problem whose decision variable is a sequence of length n, where at each time fixed-dimensional matrices are optimized. In the SCOP formulation, the LMI constraints have a sequential nature and depend only on two consecutive times. This sequential property, combined with the convexity of the problem, is the key to obtaining a single-letter upper bound on the limit of the n-letter capacity in (3). For the lower bound, a family of time-invariant channel input distributions is optimized and shown to achieve the upper bound. An outcome of our derivation is a new methodology showing that time-invariant inputs are sufficient to achieve the feedback capacity even when the noise is non-stationary.
An optimal time-invariant policy can be computed directly from the feedback capacity convex optimization. Using this policy, we also construct an explicit coding scheme that achieves the feedback capacity for scalar channels. The derived scheme generalizes the coding scheme proposed in [21] and simplifies its encoding by showing that the message can be encoded in a single dimension rather than the multi-dimensional variant proposed in [21]. We also derive an explicit decoding rule by studying a related smoothing problem [25]. The analysis of the smoothing problem reveals an interesting relation between the volume reduction of its error covariance and the capacity solution. The analysis is performed for the general MIMO channel, and a possible MIMO scheme is discussed in Section VII.
The rest of the paper is organized as follows. In Section II, we present the setting and the preliminaries. Section III includes our main result on the feedback capacity of the MIMO Gaussian channel along with several examples. In Section IV, we present the optimal input distribution and the capacity-achieving coding scheme. In Section V, the main ideas and the technical lemmas used to prove our main result are presented, while their detailed proofs appear in Section VI.

II. THE SETTING AND PRELIMINARIES
This section includes the communication setting. We also present the Kalman filter and the Riccati equation that are required for the presentation of the main result.

A. The setting
We consider a MIMO additive Gaussian channel where the channel input is x_i ∈ R^m, y_i ∈ R^p is the channel output, the additive noise is z_i ∈ R^p, and Λ ∈ R^{p×m} is a fixed, known matrix. The encoder has access to noiseless, strictly-causal feedback so that the input x_i is a function of the message and all previous channel outputs y^{i−1} := (y_1, . . . , y_{i−1}). For a fixed blocklength n, the channel input should satisfy the average power constraint. Definitions of the average probability of error, achievable rates and the feedback capacity are standard and can be found, for instance, in [21]. The feedback capacity with a power constraint P is denoted by C_fb(P).
For MIMO channels, the capacity can be expressed as the multi-letter expression in (3) by modifying all the matrices to their corresponding block matrices with appropriate dimensions. An equivalent characterization of the n-letter objective in (3) is the directed information I(x^n → y^n), which characterizes the feedback capacity of point-to-point channels [26]-[29].
The additive noise is a colored Gaussian process generated as the output of the state-space model (5), where the i.i.d. Gaussian sequence (w_i, v_i) and the initial state s_1 ∼ N(0, Σ_1) are mutually independent. Note that the encoder has strictly-causal access to the noise z_i, but not to the hidden state s_i. For this case, we can use Kalman filtering in order to write the state-space model in an observer form [30]. This pre-Kalman-filtering step for the state-space model (5) allows one to define a new channel state that is available to the encoder, as presented in the next section.

B. The Kalman filter and the innovations process
The Kalman filter is a simple, recursive method to compute the minimum mean-square error estimate of the hidden state s_i based on the measurements z_1, . . . , z_{i−1}. The predicted estimate of the state and its error covariance are defined as

ŝ_i := E[s_i | z^{i−1}],   Σ_i := E[(s_i − ŝ_i)(s_i − ŝ_i)^T].    (6)

The Kalman filter is given by the recursion

ŝ_{i+1} = F ŝ_i + K_{p,i}(z_i − H ŝ_i),    (7)

with the initialization ŝ_1 = 0, and the constants are K_{p,i} = (F Σ_i H^T + G L) Ψ_i^{−1} and Ψ_i = H Σ_i H^T + V, where the error covariance Σ_i is described by the Riccati recursion

Σ_{i+1} = F Σ_i F^T + G W G^T − K_{p,i} Ψ_i K_{p,i}^T,    (9)

with the initial condition Σ_1 ⪰ 0. The estimate in (7) can be computed using the innovations process, defined as e_i = z_i − H ŝ_i and distributed according to N(0, Ψ_i). It is also known that the innovation e_i is orthogonal to (statistically independent of) the previous measurements z^{i−1} [31]. Thus, we can write a new equivalent channel as

y_i = Λ x_i + H ŝ_i + e_i,

where ŝ_i plays the role of the channel state and is available to the encoder. Note that this is a valid channel due to the Markov structure of the state-space model. The innovations process also characterizes the entropy rate of Gaussian random processes, since h(z_i | z^{i−1}) = h(e_i) = (1/2) log((2πe)^p det Ψ_i). Here, it is assumed that Ψ_i ≻ 0 for all i. This is a natural assumption since otherwise the capacity is infinite. Namely, if Ψ_i is only positive semidefinite, a coordinate of the noise vector z_i is a deterministic function of the past noise instances z^{i−1}. Building an infinite-rate scheme is then straightforward: the encoder transmits x_j = 0 so that y_j = z_j for j ≤ i − 1. Then, having z^{i−1}, the encoder and the decoder both know a coordinate of z_i, and can communicate an infinite number of bits on this vector coordinate (assuming the image of Λ is not degenerate in this particular direction).
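The recursions above can be illustrated numerically. The following is a minimal sketch of the Kalman predictor (7) and the Riccati recursion (9) for a scalar state-space model with illustrative parameters (the values are not taken from the paper); it checks that, at steady state, the empirical variance of the innovations e_i matches Ψ = HΣH^T + V:

```python
import numpy as np

# Scalar state-space: s_{i+1} = F s_i + G w_i, z_i = H s_i + v_i,
# with E[w^2] = W, E[v^2] = V, E[w v] = L (illustrative values).
F, G, H = 0.9, 1.0, 1.0
W, V, L = 1.0, 1.0, 0.0

rng = np.random.default_rng(0)
n = 2000
s = 0.0
s_hat, Sigma = 0.0, 1.0            # initialization: s_hat_1 = 0, Sigma_1 >= 0
innovations = []
for _ in range(n):
    w = rng.normal(scale=np.sqrt(W))
    v = rng.normal(scale=np.sqrt(V))
    z = H * s + v                  # measurement z_i
    Psi = H * Sigma * H + V        # innovations variance Psi_i
    Kp = (F * Sigma * H + G * L) / Psi
    innovations.append(z - H * s_hat)          # e_i = z_i - H s_hat_i
    s_hat = F * s_hat + Kp * (z - H * s_hat)   # recursion (7)
    Sigma = F * Sigma * F + G * W * G - Kp * Psi * Kp   # recursion (9)
    s = F * s + G * w

Psi_ss = H * Sigma * H + V         # steady-state innovations variance
print(Psi_ss, np.var(innovations[100:]))
```

For these parameters the Riccati fixed point solves Σ² = F²Σ + W, so Σ = (0.81 + √4.6561)/2 ≈ 1.484 and Ψ ≈ 2.484, which the sample variance of the innovations approaches.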

C. The Riccati equation
Consider the function

f(Σ) := F Σ F^T + G W G^T − K_p(Σ) Ψ(Σ) K_p(Σ)^T − Σ,

where K_p(Σ) = (F Σ H^T + G L) Ψ(Σ)^{−1} and Ψ(Σ) = H Σ H^T + V. The Riccati equation is defined as

f(Σ) = 0.    (12)

The stabilizing solution to the Riccati equation (if it exists) not only solves f(Σ) = 0, but is also the unique solution whose corresponding closed-loop matrix F − K_p(Σ)H is stable. In the rest of the paper, we refer to

K_p := K_p(Σ),   Ψ := Ψ(Σ),    (13)

as the constants evaluated at the stabilizing solution. The corresponding time-invariant Kalman filter is

ŝ_{i+1} = F ŝ_i + K_p(z_i − H ŝ_i).    (14)

We move on to present assumptions on the state-space model. The stability of F determines the stationarity of the noise process.
Definition 1. The matrix F is stable if its spectral radius satisfies ρ(F) < 1.
Without further assumptions, our results hold for the stationary case, i.e., when F is stable (and L = 0). Thus, a reader whose interest is limited to the stationary case may skip the following assumptions. Assumption 1. The pair (F, H) is detectable. That is, there exists a matrix K such that ρ(F − KH) < 1.
Assumption 2. The pair (F_s, W_s) is controllable on the unit circle. That is, for any x and λ such that xF_s = xλ, if |λ| = 1, then xW_s^{1/2} ≠ 0.
Assumption 1 asserts that all eigenvectors of F that have unstable eigenvalues (outside the unit circle) can be observed via the matrix H. Indeed, without loss of generality, it can even be assumed that the pair (F, H) is observable (for all eigenvectors), since the unobserved eigenvectors have no effect on the channel noise. Assumptions 1 and 2 are necessary and sufficient conditions for the existence of a unique stabilizing solution to the Riccati equation in (12).
We further need to assume that the initial covariance matrix Σ_1 converges to the stabilizing solution. Advanced discussions on the convergence of Riccati recursions can be found in [25, App. E], and here we aim to provide several alternatives in order to obtain a general framework. The first condition is stabilizability of (F_s, W_s): for any x and λ such that xF_s = xλ, if |λ| ≥ 1, then xW_s^{1/2} ≠ 0. This condition guarantees that the stabilizing solution is the only positive semidefinite solution to the Riccati equation, which implies that any initial state covariance converges to the stabilizing solution. Another useful condition is Σ_1 ⪰ Σ, where Σ is the stabilizing solution. Beyond these two sufficient conditions, in simple cases (such as the moving-average noise in Section III-A), the convergence can be verified manually. We can assume without loss of generality that the initial covariance Σ_1 is equal to the stabilizing solution of the Riccati equation in (12) by letting x_i = 0 before the transmission begins.
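The stabilizing solution and its closed-loop property can be obtained by simply iterating the Riccati recursion (9). The sketch below uses an illustrative 2×2 state-space model (not taken from the paper) whose F is unstable, so the noise process is non-stationary, yet the stabilizing solution exists and ρ(F − K_pH) < 1:

```python
import numpy as np

# Illustrative detectable pair (F, H) with an unstable F (rho(F) = 1.1 > 1).
F = np.array([[0.5, 0.2], [0.0, 1.1]])
G = np.eye(2)
H = np.array([[1.0, 0.3]])
W = np.eye(2)
V = np.array([[1.0]])
L = np.zeros((2, 1))

# Iterate the Riccati recursion (9) until convergence to the
# stabilizing solution of f(Sigma) = 0.
Sigma = np.eye(2)
for _ in range(500):
    Psi = H @ Sigma @ H.T + V
    Kp = (F @ Sigma @ H.T + G @ L) @ np.linalg.inv(Psi)
    Sigma = F @ Sigma @ F.T + G @ W @ G.T - Kp @ Psi @ Kp.T

# The closed-loop matrix of the stabilizing solution is stable,
# even though the noise process itself is non-stationary.
rho = max(abs(np.linalg.eigvals(F - Kp @ H)))
print(rho)
```

The fixed-point residual can be checked by applying one more Riccati step, which leaves Sigma unchanged up to numerical precision.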

III. MAIN RESULT AND DISCUSSION
In this section we present the feedback capacity of the MIMO channel and its particularization to scalar channels. We discuss different aspects of our main results via several examples. The following is our main result.
Theorem 1 (Feedback capacity of MIMO channels). The feedback capacity of the MIMO Gaussian channel in (4)-(5) is given by the convex optimization in (15), where K_p and Ψ are the constants given in (13), and the optimization variables are the matrices Π ∈ R^{m×m}, Σ̂ ∈ R^{n×n}, and Γ ∈ R^{m×n}.
Note that, by the first LMI constraint, the optimization variables Π and Σ̂ are positive semidefinite. The objective has the structure of a difference between the entropy rate of the channel output process and that of the noise process. Note that the objective is a concave function of the decision variables since Ψ_Y is a linear function of the decision variables, while Ψ is a constant. Thus, the optimization problem is a convex optimization that can be computed with standard software, e.g., [32].
In Section IV-A, the decision variables will be given a straightforward interpretation by showing that they induce a time-invariant, optimal input distribution. Here, we briefly remark on the LMIs in (15). The decision variable Π corresponds to the covariance matrix of the channel input, so that the first LMI in (15) verifies that it forms a valid covariance matrix with a correlated variable whose covariance matrix is Σ̂. For the second LMI in (15), its Schur complement implies the Riccati inequality

Σ̂ ⪯ F Σ̂ F^T + K_p Ψ K_p^T − K_Y Ψ_Y K_Y^T,    (16)

with K_Y = (F(Γ^T Λ^T + Σ̂ H^T) + K_p Ψ) Ψ_Y^{−1}. In Lemma 7 (Section V), it is shown that there always exist optimal decision variables (Π, Σ̂, Γ) that satisfy the Riccati inequality (16) with equality, i.e., it is a Riccati equation. This reveals that the origin of the explicit capacity formulae expressed as functions of roots of certain polynomials [21]-[23] is the Riccati equation. We demonstrate this interesting fact in Section III-A for the MA noise process.
If the channel outputs, inputs, and the additive noise are scalars, but the hidden state of the noise s i can still be a vector, the capacity in Theorem 1 can be simplified as follows.
Theorem 2 (Feedback capacity of scalar channels). The feedback capacity of the scalar Gaussian channel (4)-(5) with Λ = 1 is given by the convex optimization problem in (17), where K_p and Ψ are the constants given in (13).
Choosing H = 0 in (17) recovers the capacity formula of the additive white Gaussian noise (AWGN) channel, C_fb(P) = (1/2) log(1 + P).
Remark 1. The state-space model studied in [10] can be recovered from the setting in Theorem 2 with W = V = L = 1 and a stable F. In this case, the constants are Σ = 0, K_p = G, Ψ = 1, and the capacity expression in (17) and that in [10, Th. 4] are almost in full agreement. In particular, they write the optimization problem with a supremum, and there is a difference in the sign of the first LMI in (17), which reads as a strict LMI (≻) in [10].

A. Moving average (MA) noise
In this section, we study a scalar channel with the MA(1) noise process z_i = w_i + α w_{i−1}, with i.i.d. w_i ∼ N(0, 1) and α ∈ R. In [23], the feedback capacity of the MA noise with |α| ≤ 1 was shown to be equal to the expression in (19), where x_0 is the unique positive root of the polynomial in (20). The MA noise can be realized by the state-space model (5) with F = 0, H = α, G = W = V = L = 1. We derive here the feedback capacity expression for all α.
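As a quick sanity check (a numerical sketch, not part of the paper's derivation), the stated state-space realization indeed produces the MA(1) process: with L = 1 the process and measurement noises coincide (v_i = w_i), so s_{i+1} = w_i and z_i = αs_i + w_i = w_i + αw_{i−1}:

```python
import numpy as np

alpha = 0.7  # illustrative value
rng = np.random.default_rng(1)
w = rng.normal(size=1000)

# State-space realization (5): F = 0, H = alpha, G = W = V = L = 1.
s = 0.0
z_state_space = []
for i in range(1000):
    z_state_space.append(alpha * s + w[i])  # z_i = H s_i + v_i, with v_i = w_i
    s = w[i]                                # s_{i+1} = F s_i + G w_i = w_i

# Direct MA(1) definition: z_i = w_i + alpha * w_{i-1}.
z_ma = [w[0]] + [w[i] + alpha * w[i - 1] for i in range(1, 1000)]
print(np.allclose(z_state_space, z_ma))
```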
Theorem 3 (Moving-average noise). The feedback capacity of the Gaussian channel with MA noise is given by (21), where SNR is the maximal positive root of the polynomial in (22). The capacity expression in (21)-(22) for |α| ≤ 1 coincides with the feedback capacity expression in (19).
For |α| ≤ 1, the feedback capacity is independent of the initial state covariance Σ_1 but, for |α| > 1, we assume Σ_1 ≠ 0. This assumption avoids the singular case in which Σ_i = 0 for all i, which does not converge to the stabilizing solution of the Riccati equation. If |α| > 1 and Σ_1 = 0 (i.e., the initial state is deterministic), the capacity can still be computed, but with the solution to the first polynomial in (22).
To compare the capacity expression in Theorem 3 with (19) when |α| ≤ 1, we can use a change of variable that maps the polynomial in (22) to a polynomial in x_0. It is interesting to note that the latter polynomial and (20) are different. However, the second part of Theorem 3 confirms that the feedback capacities are in agreement for |α| ≤ 1 by showing that their positive roots coincide. We proceed to prove Theorem 3.
Proof of Theorem 3. We first compute the capacity expression in Theorem 2, and then verify that the required conditions are met.
The Schur complement of the first LMI in (17),

[[P, Γ], [Γ^T, Σ̂]] ⪰ 0,

evaluated at the optimal variables is shown to be zero by contradiction. Assume that P − Γ²Σ̂^{−1} = p for some p > 0. Then, we can replace Γ with Γ′ = Γ(1 + Γ^{−2} p Σ̂)^{1/2} to obtain a larger objective; the Riccati LMI can be verified to remain satisfied under this replacement. This step shows that the LMI cannot be strict at the optimum. For the other LMI in Theorem 2, a similar reasoning shows that the Schur complement of the Riccati LMI (16) can be achieved with equality (see Lemma 7), and the Schur complement simplifies to the Riccati equation Σ̂ = K_p Ψ K_p − (Ψ K_p)² Ψ_Y^{−1}. In order to compute the capacity expression in Theorem 2, we compute the Riccati constants K_p and Ψ from the stabilizing solution of the Riccati equation in (12). The Riccati equation has the two solutions Σ = 0 and Σ = 1 − 1/α². For |α| < 1, the stabilizing solution is Σ = 0, which implies K_p = Ψ = 1, while for |α| > 1, the stabilizing solution is Σ = 1 − 1/α², which implies Ψ = α² and K_p = α^{−2}. We note that in both cases K_p Ψ = 1, so that the Riccati equation above simplifies to Σ̂ = Ψ^{−1} − Ψ_Y^{−1}. The decoder innovation can be written as in (23), where the sign of Γ is chosen to maximize Ψ_Y. We denote Ψ_Y Ψ^{−1} = 1 + SNR and substitute the latter into both sides of (23) to obtain the fixed-point equations in (22). It remains to verify the conditions for Theorem 2. The pair (F = 0, H = α) is detectable (Assumption 1) for all α, and (F_s, W_s) = (−α, 0) is controllable on the unit circle for all |α| ≠ 1 (Assumption 2). Thus, for |α| ≠ 1, the stabilizing solution of the Riccati equation exists and is equal to Σ = max{0, 1 − 1/α²}. It is easy to check that the Riccati recursion in (9), Σ_{i+1} = 1 − 1/(1 + α²Σ_i), converges to the stabilizing solution unless |α| > 1 and Σ_1 = 0.
If |α| = 1, the only solution to the Riccati equation is Σ = 0, but it is not a stabilizing solution (it is the maximal solution). Although Theorem 1 concerns noise processes whose Riccati equations have stabilizing solutions, the upper bound extends to non-stabilizing solutions as well, and for the particular instance of the MA noise with |α| = 1, we verified that the lower bound in Lemma 6 holds as well.
Finally, we show the equivalence of our capacity expression and (19) for |α| < 1 by proving that the positive roots of (20) and (22) coincide. The positive root x_0 of (20) satisfies a relation involving 1 − |α|x_0, and by substituting this relation into the second polynomial in (22), we obtain that the root of (20) is also a root of (22). The other direction can be shown similarly.
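The convergence behavior of the Riccati recursion used in the proof, Σ_{i+1} = 1 − 1/(1 + α²Σ_i), can be checked numerically (a sketch with illustrative values of α; the stabilizing limit is max{0, 1 − 1/α²}):

```python
def riccati_limit(alpha, sigma1=1.0, iters=5000):
    """Iterate the MA(1) Riccati recursion Sigma_{i+1} = 1 - 1/(1 + alpha^2 Sigma_i)."""
    sigma = sigma1
    for _ in range(iters):
        sigma = 1.0 - 1.0 / (1.0 + alpha ** 2 * sigma)
    return sigma

# Stationary regime |alpha| < 1: the limit is the stabilizing solution 0.
print(riccati_limit(0.5))              # → close to 0
# Non-stationary regime |alpha| > 1 with Sigma_1 != 0: limit is 1 - 1/alpha^2.
print(riccati_limit(2.0))              # → close to 0.75
# The singular case |alpha| > 1 with Sigma_1 = 0 stays at 0 forever.
print(riccati_limit(2.0, sigma1=0.0))  # → 0.0
```

The last line illustrates why Σ_1 ≠ 0 is assumed for |α| > 1: the recursion started at Σ_1 = 0 never reaches the stabilizing solution.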

B. Auto-regressive (AR) noise
The first-order auto-regressive (AR) noise process is given by

z_i = β z_{i−1} + w_i,    (24)

where w_i ∼ N(0, 1) is an i.i.d. sequence. This is one of the simplest instances of colored Gaussian noise and was studied in [11]-[13]. A closed-form feedback capacity expression for the stationary case |β| < 1 was derived in [21]. We present next the AR noise with a general β. The AR process can be realized by (5) with F = H = β and G = L = W = V = 1.
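As with the MA noise, the stated realization can be verified directly (a numerical sketch with an illustrative β in the non-stationary regime): with L = 1 the two noises coincide (v_i = w_i), so s_{i+1} = βs_i + w_i equals z_i, and hence z_{i+1} = βz_i + w_{i+1}:

```python
import numpy as np

beta = 1.3  # illustrative value in the non-stationary regime (|beta| > 1)
rng = np.random.default_rng(2)
w = rng.normal(size=500)

# State-space realization (5): F = H = beta, G = L = W = V = 1, v_i = w_i.
s, z_ss = 0.0, []
for i in range(500):
    z_ss.append(beta * s + w[i])   # z_i = H s_i + v_i
    s = beta * s + w[i]            # s_{i+1} = F s_i + G w_i (equals z_i)

# Direct AR(1) recursion (24): z_i = beta * z_{i-1} + w_i.
z_ar = [w[0]]
for i in range(1, 500):
    z_ar.append(beta * z_ar[-1] + w[i])
print(np.allclose(z_ss, z_ar))
```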
In Fig. 1, the feedback capacity in Theorem 1, the feedback capacity expression for |β| < 1 from [21], and the non-feedback capacity (obtained via a water-filling solution) are plotted. Additionally, we plot the maximal achievable rate with i.i.d. inputs, obtained by adding the constraint Γ = 0 to (17); this rate can be computed explicitly as (25). First, it is interesting to note that the black and blue curves coincide for some non-stationary noise with 1 ≤ β ≲ 1.5. This means that the capacity expression in [21] holds true even for some values beyond the stationary regime. Second, the rate achieved with i.i.d. inputs (green curve) approaches the feedback capacity as the regression parameter grows large, so the feedback link has a negligible contribution in terms of capacity in this regime. From an operational perspective, the feedforward capacity lies between the i.i.d. achievable rate (green curve) and the feedback capacity (black curve). Thus, for large β, the capacity can be well-approximated with simple i.i.d. inputs, i.e., codewords with memory are not needed. These phenomena are related to the structure of the optimal input distribution, which is presented and discussed for the AR noise in Section IV-A.

Fig. 1. The feedback capacity of the Gaussian channel with a first-order auto-regressive noise and a unit-power input constraint (black curve). The blue curve describes the feedback capacity expression for the stationary case (|β| < 1) from [21], but is plotted here for greater values of β, as it numerically coincides with the feedback capacity as long as M = 0 (see Fig. 2 below). The red curve corresponds to the feedforward capacity (without feedback) obtained via a water-filling solution, and the green curve corresponds to an i.i.d. coding law (feedback-independent) for the channel inputs in (25).

IV. OPTIMAL INPUTS DISTRIBUTION AND CODING SCHEME
In this section, we present a capacity-achieving, time-invariant input distribution that can be computed from the convex optimization in Theorem 1. We then use this input distribution to construct a capacity-achieving coding scheme for scalar channels and discuss a possible extension to MIMO channels.

A. Optimal inputs distribution
The optimal decision variables in Theorem 1 induce a time-invariant, capacity-achieving input distribution²:

x_i = Γ Σ̂^†(ŝ_i − ŝ̂_i) + m_i,    (26)

where ŝ_i is defined in (6), and its estimate at the decoder, ŝ̂_i, is defined by (27). The optimal policy is composed of the sum of two signaling components. The first component, ΓΣ̂^†(ŝ_i − ŝ̂_i), corresponds to the decoder's estimation error and is a function of the feedback. Its transmission refines the decoder's knowledge of ŝ_i by transmitting the state innovation (the vector ŝ_i can also be regarded as the channel state; see Section II). The covariance of the innovation (ŝ_i − ŝ̂_i) is Σ̂, so that the covariance of the first component is cov(ΓΣ̂^†(ŝ_i − ŝ̂_i)) = ΓΣ̂^†Γ^T. The second component, m_i, is independent of (x^{i−1}, y^{i−1}) (and thus is feedback-independent), and has an i.i.d. Gaussian distribution with the remaining covariance, i.e., M := Π − ΓΣ̂^†Γ^T. The transmission of the vector m_i increases the uncertainty about the channel state ŝ_i at the decoder, but it can be used to transmit new information (about the message). In the AWGN channel, for instance, the entire power is allocated to the second component m_i. We proceed to illustrate the policy behavior for the AR noise.
² More precisely, a capacity-achieving policy in Lemma 6 is the time-invariant law in (26) for t > 1, and a different coding law for t = 1. Combined with the i.i.d. coding-law curve in Fig. 1 (green curve), this shows that the feedback link has a negligible contribution to the capacity solution.
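The covariance split between the two components can be made concrete with a small numerical sketch. The decision-variable values below are hypothetical (for illustration only, not taken from the paper's optimization); the sketch checks that the feedback-component power and the residual covariance M sum to the input power tr(Π):

```python
import numpy as np

# Hypothetical optimal decision variables (illustration only).
Pi = np.array([[1.0]])           # input covariance, tr(Pi) <= P = 1
Sigma_hat = np.array([[0.4]])    # innovation covariance of the state estimate
Gamma = np.array([[0.5]])

# Power of the feedback-dependent component Gamma Sigma^+ (innovation):
fb_power = np.trace(Gamma @ np.linalg.pinv(Sigma_hat) @ Gamma.T)
# Remaining covariance is allocated to the i.i.d. component m_i:
M = Pi - Gamma @ np.linalg.pinv(Sigma_hat) @ Gamma.T
print(fb_power, M)   # the two powers sum to tr(Pi) = P
```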
In Fig. 2, the power allocated to each of the signal components in (26) is plotted for the AR noise in (24) with a power constraint P = 1. For 0 < β ≤ 1, Fig. 2 agrees with the claim in [21, Th. 4.6] on the sufficiency of input distributions with m_i = 0 for scalar channels with stationary noise (see also the next paragraph). However, beyond the stationary regime, there is a sharp phase transition, and the power allocated to m_i increases as β grows. The phase-transition location, β ≈ 1.5, explains the gap between the feedback capacity in Theorem 1 and the capacity expression in [21], since the latter uses a policy with m_i = 0. Fig. 2 also shows that the rate achieved with i.i.d. inputs in (25) approaches the feedback capacity for growing β. This implies that the Schalkwijk-Kailath (SK) encoding law is close to optimal in this regime [34].
The role of the second component m_i has been discussed in several papers [10], [21], [33]. In [21, Cor. 4.4], it is claimed that for scalar channels with stationary noise, the capacity can be achieved with M = 0. Recently, [33] showed that the proof of the claim in [21, Cor. 4.4] relies on an erroneous calculation and thus is invalid. Our capacity derivation relies on a general policy with M ⪰ 0 (with a different coding law at the first time t = 1). As illustrated in the examples above, for general noise processes, M can be either positive or zero; for the MA noise, we prove in Theorem 3 that M = 0 is necessary to achieve the capacity, and for the AR noise, it is illustrated that M > 0 in the non-stationary regime (Fig. 2). When specialized to stationary noise processes, our capacity expression may be utilized to find a counterexample to [21, Cor. 4.4]. We ran extensive simulations specializing the capacity expression in Theorem 2 to various stationary noise processes and computing the optimal M, yet we did not find a counterexample to [21, Cor. 4.4]. Thus, the claim in [21, Cor. 4.4] may still be true.
As mentioned in Remark 1, the fact that M = 0 does not imply that the achievable rate is the one erroneously claimed in [10]. If M = 0, it simply implies that the message is encoded at the first time with m_1 ≠ 0, and for t > 1 the encoder follows the rule x_i = ΓΣ̂^†(ŝ_i − ŝ̂_i). In the next section, we show that this explicit coding scheme is capacity-achieving with a doubly-exponential decay of the error probability for any rate below capacity.

B. Coding scheme for scalar channels
In this section, we construct a capacity-achieving coding scheme for scalar channels (with a vector hidden state) based on the optimal input distribution in (26). Throughout this section, it is assumed that the optimal input distribution in (26) satisfies m_i = 0 for all i > 1. Our scheme resembles the SK scheme [34], [35], and other posterior-matching schemes for memoryless channels [36]-[39] and channels with memory [21], [23], [24], [40]-[43], in its main idea of refining the decoder's knowledge of the message (or, equivalently, of the first noise instance z_0 in Gaussian channels). The main difference from the SK scheme is that, rather than encoding the scaled innovation of the first noise instance z_0, our encoding follows (26) to transmit the innovation of ŝ_i. This modification results in a more numerically stable encoding since our scaling factor ΓΣ̂^† is constant, while in the SK scheme the scaling of the message innovation increases with time.
A related scheme for a similar setting appears in [21]. Both schemes follow the encoding in (26), but only our paper provides a computable expression for the coefficient matrix ΓΣ̂^† (via Theorem 1). Additionally, we simplify the multi-dimensional encoding method in [21] by showing that, even when the hidden state is a vector, it is sufficient to encode the message in a single time instance. Finally, we provide an explicit smoother for the maximum-likelihood decoder proposed in [21], [44].
1) At the first time instance, the encoder maps the message m to a point Ū(m) of a PAM constellation and transmits it. 2) In subsequent times, the encoding simply uses the optimal input distribution in (26). The estimates ŝ_i and ŝ̂_i can be computed directly from (7) and (41), and the constant ΓΣ̂^† is obtained from the optimization in Theorem 1. 3) When transmission ends, the decoder constructs the maximum-likelihood estimate of the first noise instance z_0 using the measurements y^n. This is a smoothing problem that is formally presented in Lemma 1. We are now ready to present the coding scheme as Algorithm 1. The abbreviation KF stands for Kalman filter, while Smooth stands for the smoothing function in Lemma 1 (Eq. (29)).
We remark that the encoder estimate ŝ_i is computed with the time-invariant Kalman filter in (14), while the (generally time-varying) decoder Kalman filter for ŝ̂_i is used with the time-invariant parameters Ψ_i = Ψ and K_{p,i} = K_p and the initial conditions Σ̂_0 = 0 and M_0 = V(Ū). The following theorem shows the optimality of our coding scheme.
Theorem 4 (Capacity-achieving coding scheme). For any rate R < C_fb(P), the error probability of the coding scheme for scalar channels (with m_i = 0) in Algorithm 1 decays at a doubly-exponential rate for large n.
A simple proof of Theorem 4 appears at the end of this section. Its main building block is the analysis of a smoothing problem in Lemma 1. We now provide explicit formulas to compute the estimate and its error covariance. The formulas are presented for the general MIMO channel, and their particularization to scalar channels will be used in the proof of Theorem 4.
Lemma 1 (The smoothing problem). Consider the smoothing problem of estimating z_0 from y^n with the estimate ẑ_{0|n} := E[z_0 | y^n]. Subject to the optimal input distribution (26) with m_i = 0:
1) The optimal smoother can be recursively computed as in (29), with ẑ_{0|0} = 0.
2) The error covariance Ẑ_{0|i} can be updated recursively, with Ẑ_{0|0} = Ψ, and its determinant satisfies the geometric reduction in (31). Moreover, Ψ_{Y,i} converges to Ψ*_Y, the optimal value of Ψ_Y in Theorem 1. Therefore, for scalar channels, the error covariance satisfies (32), and Ψ_{Y,i} converges to Ψ*_Y.

Algorithm 1 Optimal scheme for scalar channels
Inputs: m, Γ, Σ̂, ŝ_0 = ŝ̂_0 = 0, ẑ_{0|0} = 0

The proof of Lemma 1 appears in Section VI-A. Recall from Theorem 1 that the capacity can be expressed as (1/2) log det(Ψ*_Y Ψ^{−1}). The relation between the capacity and the smoothing problem is then transparent: the volume (determinant) reduction of the error covariance in (31) is the argument of the logarithm in the capacity expression. For scalar channels, the volume reduction amounts to a single-dimension refinement of the noise instance z_0. However, for MIMO channels, the volume reduction alone is not sufficient to derive an explicit coding scheme. We provide details on a possible MIMO scheme construction.
A suggested scheme for MIMO channels is as follows: assume for simplicity Λ = I, and split the message into p independent sub-messages, where p is the dimension of the input, output and noise. At the first time instance, we normalize each sub-message and transmit their concatenation as the vector x_0. The encoding is identical to that in Algorithm 1, that is, it follows the policy in (26). The estimation of the vector z_0 is based on the smoother in (29), and the decoding is carried out using coordinate-wise successive cancellation of the vector ẑ_{0|n}. The analysis of such a scheme can possibly be done using Lemma 1, but a finer spectral analysis is needed. In particular, the geometric reduction in (31) needs to be shown for each coordinate and not only for the overall determinant. The geometric rate of the error covariance in (32) should also determine the rates allocated to each sub-message, and is the key to obtaining the double-exponential decay. We proceed to show the optimality of Algorithm 1 for scalar channels using the analysis of Lemma 1.
Proof of Theorem 4. The decoder estimates x_0 from y_0 − ẑ_{0|n} = Ū(m) + z_0 − ẑ_{0|n}. This is the problem of estimating an M-PAM signal from a Gaussian-corrupted measurement, and the probability of error can be bounded as
P_e ≤ 2 Q(γ_n),
where Q(·) is the standard Gaussian tail function. The inequality follows since, for the first and last messages, a large deviation toward the outer side of the constellation does not incur an error, and the Q-function argument is
γ_n ≜ (1/2) √( 12 V(Ū) / ((2^{2nR} − 1) Var(z_0 | y^n)) ),
that is, half the constellation spacing normalized by the standard deviation of the estimation error. We can further bound γ_n from below as γ_n ≥ √3 · 2^{−nR} √( V(Ū) / Var(z_0 | y^n) ).
By Lemma 1, γ_n has a positive exponent if R < (1/(2n)) log ∏_{i=1}^n (Ψ_{Y,i} Ψ^{−1}), which leads to the doubly-exponential decay of the error probability. Since Ψ_{Y,i} → Ψ*_Y, the rate R can be chosen arbitrarily close to (1/2) log(Ψ*_Y Ψ^{−1}), which is precisely the feedback capacity.

V. PROOF OF THE MAIN RESULT
In this section, we outline the proof of Theorem 1 by presenting the technical lemmas leading to tight lower and upper bounds. The proof is structured in three parts.
1. Sequential convex optimization problem (SCOP): The n-letter capacity expression for the MIMO channel is defined as
C_n(P) ≜ max (1/n) I(X^n → Y^n), (35)
where the maximum is over input distributions P(x^n || y^n) = ∏_{i=1}^n P(x_i | x^{i−1}, y^{i−1}) that satisfy the power constraint (1/n) ∑_{i=1}^n E[x_i^T x_i] ≤ P. The first three lemmas formulate the n-letter capacity as a SCOP. While it is easy to show that C_n(P) is concave in its decision variable P(x^n || y^n), the challenge is to formulate it as a convex optimization problem that enables one to explicitly compute the limit of C_n(P). To this end, we realize a SCOP with LMI constraints that have a sequential structure.

2. Upper bound via convexity:
The second part of the proof utilizes the SCOP structure to show that the capacity expression in Theorem 1 is an upper bound on the capacity. Since the optimization constraints contain decision variables at consecutive times, the standard time-sharing random-variable argument does not apply here, and we use a different technique to show that these constraints are asymptotically satisfied when evaluated at convex combinations of the decision variables.
3. Lower bound using time-invariant inputs: The last part constructs a time-invariant policy whose optimization leads to a lower bound that is expressed as the upper bound optimization problem with additional constraints. We show that the additional constraints are redundant, concluding the proof of the main result.

A. Sequential convex optimization problem
The first lemma identifies an optimal structure for the inputs distribution using ŝ_i and ŝ̂_i, defined in (6) and (27), respectively.
Lemma 2 (The optimal policy structure). For a fixed n, it is sufficient to optimize (35) with inputs of the form
x_i = Γ_i Σ̂_i† (ŝ_i − ŝ̂_i) + m_i,  m_i ∼ N(0, M_i), (36)
where Γ_i is a matrix that satisfies the orthogonality constraint Γ_i (I − Σ̂_i† Σ̂_i) = 0, and the power constraint is (1/n) ∑_{i=1}^n Tr(Γ_i Σ̂_i† Γ_i^T + M_i) ≤ P.
Lemma 2 simplifies the optimization (35) by showing that the optimization domain is the sequence of matrices (Γ_i, M_i ⪰ 0)_{i=1}^n. Note that Σ̂_i is a deterministic function of the policy up to time i − 1 and thus is not part of the optimization. Similar policies appeared in the literature, e.g., [23, Section IV] and [10], building on the ideas in [22]. Their policy reads x_i = Γ_i(ŝ_i − ŝ̂_i) + m_i, and our policy in Lemma 2 is a subset of theirs. Specifically, if Σ̂_i is invertible, the orthogonality constraint is redundant, and one can use the change of variable Γ′_i = Γ_i Σ̂_i^{−1} to show the equivalence of the policies. However, in general, Σ̂_i may be singular, and the orthogonality constraint is required for the convex optimization formulation in Lemma 4. In the next lemma, the channel output is formalized as the measurement of a controlled state-space model.
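As a small numerical aside (an illustrative example, not part of the paper's derivation), the role of the orthogonality constraint for a singular Σ̂_i can be seen explicitly: I − Σ̂_i†Σ̂_i is the orthogonal projection onto the kernel of Σ̂_i, so any Γ_i whose rows lie in the row space of Σ̂_i is annihilated by it. A 2×2 diagonal example with a hand-written pseudo-inverse:

```python
# Singular covariance and its Moore-Penrose pseudo-inverse (diagonal case).
Sigma = [[2.0, 0.0], [0.0, 0.0]]
Sigma_pinv = [[0.5, 0.0], [0.0, 0.0]]

def matmul(A, B):
    """Plain 2x2 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

P = matmul(Sigma_pinv, Sigma)           # projection onto range(Sigma)
I_minus_P = [[(1.0 if i == j else 0.0) - P[i][j] for j in range(2)]
             for i in range(2)]         # projection onto ker(Sigma)

# A Gamma whose rows lie in the row space of Sigma (first coordinate only).
Gamma = [[3.0, 0.0], [1.5, 0.0]]
print(matmul(Gamma, I_minus_P))         # Gamma * (I - pinv(Sigma)*Sigma)
```

The product is the zero matrix, i.e., this Γ satisfies the orthogonality constraint even though Σ̂ is not invertible.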

Lemma 3 (Channel outputs dynamics). For a fixed policy {(Γ_i, M_i)}_{i=1}^n, the channel outputs admit the state-space model
ŝ_{i+1} = F ŝ_i + K_{p,i} e_i,
y_i = (ΛΓ_i Σ̂_i† + H)(ŝ_i − ŝ̂_i) + Λm_i + e_i + Hŝ̂_i, (40)
where K_{p,i} and e_i ∼ N(0, Ψ_i) are defined in (8). The estimator in (27) can be written as a Kalman recursion, and its error covariance Σ̂_i = cov(ŝ_i − ŝ̂_i) satisfies the Riccati recursion (42) with the initial condition Σ̂_1 = 0 and the time-varying constants in (43).
Lemma 3 is a direct consequence of the policy derived in Lemma 2. As seen from (40), the policy in (36) translates into an additive measurement noise m_i and a modification of the observability matrix to ΛΓ_i Σ̂_i† + H. Similar state-space structures appeared in [21], [30], but it is interesting to note that (40) does not fall into the classical state-space setting, since the observability matrix depends on the error covariance Σ̂_i induced by our policy. Lemma 3 also reveals an objective structure that resembles the one in Theorem 1. In particular, we can use the covariance of the channel outputs innovation in (43), Ψ_{Y,i}, together with (11), to write the n-letter objective as (1/n) ∑_{i=1}^n (1/2) log det(Ψ_{Y,i} Ψ_i^{−1}). The next lemma summarizes the SCOP formulation.
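As a numerical aside, the Riccati recursion induced by a fixed policy can be iterated directly. The following scalar sketch uses assumed parameters (none taken from the paper); since the process noise K_p·e_i and the measurement noise Λm_i + e_i share the innovation e_i, the cross-covariance term is kept:

```python
# Assumed scalar state-space / policy parameters (illustrative only).
F, H, Lam = 0.9, 1.0, 1.0       # noise dynamics, observation, channel matrix
K_p, Psi = 0.5, 1.0             # steady-state gain and innovation variance
Gamma, M = 0.8, 0.2             # time-invariant policy parameters

C = Lam * Gamma + H             # policy-modified observation coefficient
Q = K_p ** 2 * Psi              # process-noise variance (K_p * e_i)
R = Lam ** 2 * M + Psi          # measurement-noise variance (Lam*m_i + e_i)
S = K_p * Psi                   # process/measurement noise cross-covariance

Sigma = 0.0                     # initial condition, Sigma_1 = 0
for _ in range(200):
    # Predictor Riccati step with correlated process/measurement noise.
    gain = (F * Sigma * C + S) / (C * Sigma * C + R)
    Sigma = F * Sigma * F + Q - gain * (F * Sigma * C + S)

Psi_Y = C * Sigma * C + R       # steady-state output innovation covariance
print(Sigma, Psi_Y)
```

The fixed point plays the role of the steady-state Σ̂, and C·Σ·C + R is the output innovation covariance Ψ_Y that appears in the objective.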

Lemma 4 (Sequential convex-optimization formulation). The n-letter capacity can be bounded by the convex optimization problem in (44), where the constraints hold for t = 1, . . . , n, and Σ̂_1 = 0.
To see that (44) is a convex optimization problem, note that each of the matrix constraints is a linear function of the decision variables. In the next section, we provide the single-letter upper bound on the capacity. The key to the upper bound is the concavity of the objective function and the linearity of the constraints, along with the crucial property that the Riccati LMI constraint in (44) involves decision variables at two consecutive times only.
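The Schur-complement fact underlying the LMI constraints can be checked numerically. The following sketch (scalar blocks, with arbitrary illustrative values) verifies that [Π, Γ; Γ, Σ̂] ⪰ 0 with Σ̂ > 0 holds exactly when the Schur complement Π − ΓΣ̂^{−1}Γ is nonnegative:

```python
def psd_2x2(a, b, d):
    """PSD test for the symmetric matrix [[a, b], [b, d]] via leading minors."""
    return a >= 0 and d >= 0 and a * d - b * b >= 0

# (Pi, Gamma, Sigma) triples: the first and third satisfy the LMI,
# the second violates it.
for Pi, Gam, Sig in [(2.0, 1.0, 1.0), (0.5, 1.0, 1.0), (1.0, 0.0, 0.3)]:
    lmi = psd_2x2(Pi, Gam, Sig)
    schur = Pi - Gam ** 2 / Sig >= 0      # Schur complement w.r.t. Sigma > 0
    print(Pi, Gam, Sig, lmi, schur)
```

This equivalence is what lets the nonconvex constraint M_i = Π_i − Γ_iΣ̂_i†Γ_i^T ⪰ 0 be replaced by a constraint that is linear in (Π_i, Γ_i, Σ̂_i).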

B. Single-letter upper bound
The next lemma concludes the upper bound in Theorem 1.

Lemma 5 (The upper bound). The feedback capacity is bounded by the convex optimization problem in (45).
The main idea behind the upper bound is to show that the objective function, evaluated at the convex combination of each of the decision variables in Lemma 4, achieves a larger objective value. At a high level, the idea is similar to the time-sharing random variable, but the challenge lies in the constraints. Specifically, one cannot show that the Riccati LMI constraint (45) is satisfied at all times when evaluated at the convex combination of the decision variables. To resolve this point, we show that the constraint is satisfied asymptotically.

C. Lower bound
In this section, we show that the upper bound in Lemma 5 is achievable. This is shown with two lemmas: the first formulates a lower bound as an optimization problem that resembles the upper bound but has two additional constraints. The second lemma shows that the additional constraints are satisfied in the upper bound optimization problem.

Lemma 6 (Lower bound). For time-invariant policies of the form (36) with Γ_i = ΓΣ̂_i and m_i ∼ N(0, M) (and a different coding rule for i = 1), the maximization of (35) over (Γ, M) achieves the lower bound in (48).
The optimization problem in (48) is the same as the upper bound in (45) except for an additional constraint and the Riccati equation (47), which appears as an inequality in the upper bound (16). Next, we show that these two conditions are redundant, concluding the proof of Theorem 1.
Lemma 7 (Redundant constraints). Consider the upper bound optimization problem (45):
1) There exists an optimal tuple such that the Schur complement of the Riccati LMI (16) is achieved with equality.
2) Any optimal tuple satisfies the detectability condition, i.e., all unstable modes of F are observable via ΛΓΣ̂† + H.

Consequently, the upper bound in Lemma 5 and the lower bound in Lemma 6 are equal to the feedback capacity.
For scalar channels with H = 0, it can be shown that the optimal tuple satisfies the first item. That is, the Schur complement of the Riccati equation evaluated at any optimal solution tuple is zero. This fact is utilized in Theorem 3.

VI. PROOF OF TECHNICAL LEMMAS
In this section, we provide detailed proofs of Lemmas 2–7, consecutively. We then prove Lemma 1, on the smoothing problem from Section IV.
Proof of Lemma 2. The policy in (36) forms a subset of the maximization domain P(x^n || y^n) = ∏_{i=1}^n P(x_i | x^{i−1}, y^{i−1}) in (35). Thus, our proof strategy is to construct a policy of the form (36), for any inputs distribution P(x^n || y^n), and show that it induces the same objective. The optimality of a Gaussian inputs distribution in (35) can be shown with a standard maximum-entropy argument, e.g., [21]. We start by computing the ith objective term as in (49). The covariance can be computed explicitly as in (50), where (a) follows from ẑ_i ≜ E[z_i | y^{i−1}] and E[x_i | y^{i−1}] = 0. The latter assumption is without loss of optimality, since any policy with E[x_i | y^{i−1}] ≠ 0 can be modified to x̃_i = x_i − E[x_i | y^{i−1}], which has zero mean without affecting the objective function in (49).
Step (b) follows from the channel outputs definition in (4) and (27), and (c) follows from the independence of the innovation z_i − Hŝ_i and the tuple (x^i, y^{i−1}, z^{i−1}). For any inputs distribution P(x^n || y^n), denoted by P, we construct a new policy of the form (36), denoted by Q, as
x_i = Γ_i Σ̂_i† (ŝ_i − ŝ̂_i) + m_i, (51)
where m_i is independent of (x^{i−1}, y^{i−1}) and is distributed according to m_i ∼ N(0, M_i), with M_i given in (52), and Σ̂_i† is the pseudo-inverse of Σ̂_i ≜ cov_P(ŝ_i − ŝ̂_i). The subscript P is made to emphasize the dependence on the distribution P. We show by induction that the new policy in (51) induces the same objective as the distribution P. Consider the Gaussian vector Ξ_i ≜ (ŝ_i, ŝ̂_i, x_i, y_i), where the superscript indicates its distribution. If we show that Ξ_i^P has the same distribution as Ξ_i^Q for i = 1, . . . , n, then their objectives are equal by (49). For the base case of the induction, we have Ξ_1^{P/Q} = (0, 0, x_1, y_1) for both policies, and our construction in (51) guarantees that x_1 has the same distribution under both policies. For the induction step, assume that the variables {Ξ_i^P}_{i≤t} have the same distribution as {Ξ_i^Q}_{i≤t}. We show that the tuple Ξ_{t+1}^P has the same distribution as Ξ_{t+1}^Q by comparing their components using the Bayes rule. First, the encoder's estimate ŝ_{t+1} is independent of the policy choice. The decoder's estimate ŝ̂_{t+1} = E[ŝ_{t+1} | y^t] is a function of the innovations {y_i − ŷ_i}_{i≤t}, and by the induction hypothesis these innovations have the same distribution. These first two steps conclude that cov_P(ŝ_{t+1} − ŝ̂_{t+1}) = cov_Q(ŝ_{t+1} − ŝ̂_{t+1}). For the channel input, it can be verified from (51) that x_{t+1} has the same conditional distribution under both policies, and below we also show that the orthogonality constraint holds for the constructed Γ_{t+1}. The last step of the induction concerns the innovation y_{t+1} − ŷ_{t+1}, and we note from (50) that its distribution, conditioned on (ŝ_{t+1} − ŝ̂_{t+1}, x_{t+1}), is determined by z_{t+1} − Hŝ_{t+1}.
The orthogonality constraint Γ_i (I − Σ̂_i† Σ̂_i) = 0 is a property of covariance matrices, since I − Σ̂_i† Σ̂_i is the orthogonal projection onto the kernel of Σ̂_i, but we prove it here for completeness. Consider the eigendecomposition of the covariance matrix
Σ̂_i = [U_0 U_1] [Ω 0; 0 0] [U_0 U_1]^T,
where [U_0 U_1] is an orthogonal matrix and Ω ≻ 0, which imply (ŝ_i − ŝ̂_i)^T U_1 = 0. The Moore–Penrose pseudo-inverse is Σ̂_i† = U_0 Ω^{−1} U_0^T, and the constraint can be written as
Γ_i (I − Σ̂_i† Σ̂_i) = Γ_i (I − U_0 U_0^T) = Γ_i U_1 U_1^T. (55)
To see that (55) is the zero matrix, note that (ŝ_i − ŝ̂_i)^T U_1 = 0 almost surely, so the construction of Γ_i in (51) implies Γ_i U_1 = 0. Finally, it can be verified that the power consumed by the new policy satisfies the power constraint.

Proof of Lemma 3. The recursion for the predicted state ŝ_{i+1} is given in Eq. (7), where e_i is the innovation process. For the channel output, we use Lemma 2 to write y_i as in (40). Note that the term ŝ̂_i is a deterministic function of y^{i−1} and thus has no effect on the estimation error. To show that (40) is a state-space model that admits standard Kalman filtering, note that the measurement noise Λm_i + e_i is independent of z^{i−1}. Thus, the measurement noise is independent of the previous measurements y^{i−1} and of the hidden states ŝ^{i−1} of the state-space model. To obtain the optimal estimator and the error covariance recursion in (42), we apply the standard Kalman filter recursions (7)–(9), which also hold with the time-varying constants in (43).

Proof of Lemma 4. The starting point is applying Lemma 2 and Lemma 3 to the optimization problem C_n(P) with the initial condition Σ̂_1 = 0. The maximum is over all the involved variables, that is, (Γ_i, M_i)_{i=1}^n. The first step is to introduce the auxiliary decision variable
Π_i ≜ Γ_i Σ̂_i† Γ_i^T + M_i.
Then, Ψ_{Y,i} can be written as
Ψ_{Y,i} = Λ Π_i Λ^T + Λ Γ_i H^T + H Γ_i^T Λ^T + H Σ̂_i H^T + Ψ_i,
where we used the orthogonality constraint Γ_i (I − Σ̂_i† Σ̂_i) = 0. As a result, the Riccati recursion can also be represented with Π_i only. The power constraint can be expressed as (1/n) ∑_{i=1}^n Tr(Π_i) ≤ P, so that the variable M_i only appears in the constraint M_i = Π_i − Γ_i Σ̂_i† Γ_i^T ⪰ 0, which, by the Schur complement for positive semidefinite matrices [45, p. 651], can be reduced to the LMI
[Π_i Γ_i; Γ_i^T Σ̂_i] ⪰ 0.
Finally, the Riccati equation is relaxed to a Riccati inequality, and using the Schur complement transformation, we can write it as an LMI in the decision variables.

Proof of Lemma 5. This is the converse proof for the capacity expression in Theorem 1. Recall that throughout the derivations we used the n-letter capacity C_n(P) in (35), but a standard converse argument relates this quantity to the feedback capacity by showing that, for any n, the feedback capacity is at most C_n(P) + δ_n, where δ_n → 0 results from Fano's inequality. The remaining step is to show that the SCOP formulation in Lemma 4, which serves as an upper bound on C_n(P), can be further upper bounded by its single-letter counterpart, the optimization problem in Theorem 1.
Define the convex combinations of the decision variables as in (66), and also let Σ̄_n ≜ (1/n) ∑_{i=1}^n Σ_i and Ψ̄_n ≜ (1/n) ∑_{i=1}^n Ψ_i denote the averaged constants of the Riccati recursion. The concavity of the log det(·) function and Jensen's inequality imply that the convex combinations attain a greater objective than the one in Lemma 4, where the argument on the right-hand side can be written as a linear function of the averaged variables. Next, the per-time constraints of the n-letter problem should be transformed into their single-letter counterparts, that is, the ones evaluated at the convex combinations in (66). It is straightforward to show that the power constraint and the first LMI constraint are satisfied at the convex combination, by linearity and convexity, respectively. We proceed to the last constraint in the optimization problem, the Riccati LMI, denoted by Ω(Π̄_n, Σ̄_n, Γ̄_n). The main challenge is that the Riccati LMI does not satisfy Ω(Π̄_n, Σ̄_n, Γ̄_n) ⪰ 0 for all n. In other words, the tuple of convex combinations in (66) does not lie in the constraint set of the convex optimization in Theorem 1. Our strategy is to show that the limiting tuple of convex combinations (as a function of n) lies in the required constraint set. This is achieved by showing that the tuple of convex combinations lies in a relaxed constraint set, parameterized by some ε > 0. We then show that ε can be made small as n grows large, and argue that there is a limit point that lies in the constraint set corresponding to ε = 0. Define the ε-domain of the constraint set as C_ε, and note that C_0 is the constraint set in Theorem 1. By summing both sides of the Riccati inequality in (64) over time, rearranging, and using the fact that Σ̂_{n+1} ⪰ 0, the residual term vanishes as n grows. By our assumptions on the state-space model of the noise, we can use [25, Ch. 14] to obtain Σ̄_n → Σ and Ψ̄_n → Ψ. Thus, the constraint on Ω(Π̄_n, Σ̄_n, Γ̄_n) is satisfied asymptotically. Specifically, for any ε > 0, there exists an n_ε such that for all n > n_ε,
0 ⪯ Ω(Π̄_n, Σ̄_n, Γ̄_n) + εI.
Since the set C_ε is closed and nested (in ε), the sequence {(Π̄_n, Σ̄_n, Γ̄_n)}_{n∈N} has a limit point in ∩_{ε>0} C_ε = C_0. That is, there exists a sequence of times T_1 ≤ T_2 ≤ T_3 ≤ · · · such that lim_{i→∞} (Π̄_{T_i}, Σ̄_{T_i}, Γ̄_{T_i}) ∈ C_0. It is important to note that this sequence of times depends on the noise characteristics and not on the underlying codebooks. The proof is completed by taking the limit over the sequence T_1, T_2, . . . in (65), to obtain precisely the optimization problem in (45).
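The key concavity step in this proof, namely that the convex combination attains a greater objective, is an instance of Jensen's inequality for the concave log det(·) function. A minimal scalar sketch (with hypothetical per-time values, not taken from the paper):

```python
import math

# Hypothetical per-time scalar innovation covariances Psi_{Y,i}.
psi_vals = [1.2, 0.8, 2.5, 1.0]

# Objective at the convex combination of the decision variables.
avg = sum(psi_vals) / len(psi_vals)
lhs = 0.5 * math.log(avg)

# Average of the per-time objectives.
rhs = sum(0.5 * math.log(p) for p in psi_vals) / len(psi_vals)

print(lhs, rhs)    # Jensen: lhs >= rhs since log is concave
```

In the matrix case, the same inequality holds with log det(·) in place of log(·), which is why averaging the decision variables can only increase the objective.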
Proof of Lemma 6. This is the achievability proof of the optimization problem in Lemma 6. The main idea is to fix a time-invariant policy and analyze the achievable rate, which is determined by the asymptotic behaviour of the channel outputs process.
Since the channel outputs process is described by a state-space model, the asymptotic behavior of the channel outputs statistics boils down to the analysis of the Riccati recursion convergence. To that end, we will use a result from [46] on conditions that guarantee the convergence of the Riccati recursion to a solution of the Riccati equation. Lastly, since one of the conditions is given on the initial condition of the Riccati recursion (over which we have no direct control), we modify the time-invariant policy at the first time only to guarantee the convergence. We use the policy in Lemma 2 with Γ_i = ΓΣ̂_i and M_i = M, such that the corresponding power satisfies the constraint in (71). By Lemma 3, the induced state-space model is given in (72), and the corresponding Riccati recursion is given in (73) with Σ̂_1 = 0. The next step is to show the convergence of the Riccati recursion in (73) to a fixed-point solution of the Riccati equation.
Since K_{p,i} and Ψ_i converge to their time-invariant counterparts in (13) exponentially fast, we replace K_{p,i} and Ψ_i with K_p and Ψ, respectively. This comes at the cost that the initial condition Σ̂_1 = 0 becomes arbitrary. Before presenting the convergence conditions, we need to modify the Riccati recursion in (73) into an equivalent form with the property that the disturbance and the measurement noise of the state are independent. This is a standard modification that can be found, for instance, in [25, Sec. 14.7]. The equivalent form of (73) is written in (75). We use [46, Th. 1] for the convergence of the Riccati recursion in (75) to the maximal solution of the Riccati equation, that is, the maximal solution Σ̂_s all of whose closed-loop modes are inside or on the unit circle, i.e., ρ(F_s − K_L(ΛΓ + H)) ≤ 1. The sufficient conditions from [46], translated to the Riccati equation in (75), are:
1) The initial state satisfies Σ̂_1 ⪰ Σ̂_s.
2) The pair (F_s, ΛΓ + H) is detectable.
The detectability condition guarantees the existence of the maximal solution. This condition will be carried over to the lower bound optimization problem as a restriction on the optimization parameters (Γ, M). Also note that (F_s, ΛΓ + H) is detectable iff (F, ΛΓ + H) is detectable, and thus the condition can be expressed as ∃K : ρ(F − K(ΛΓ + H)) < 1. The first condition is needed for the convergence to the maximal solution and is treated next. As mentioned, the initial condition Σ̂_1 is arbitrary. To this end, we modify the time-invariant policy by changing M_1 to an identity matrix scaled by a constant α.
We proceed to show that the null space of Σ̂_2 lies in the null space of any solution to the Riccati equation. For this proof, we use the closed-loop form of the Riccati equation in (75). Let x^T be a left eigenvector of F with eigenvalue λ such that xΣ̂_2 = 0. Then, pre- and post-multiplying the closed-loop Riccati equation in (75) by x and x^T, we have x K_p Q_s = 0 and x K_{L,1} = 0, which also imply x F_s Σ̂ = 0. By M_1 ≻ 0, we have Q_s ≻ 0, so that x K_p = 0. Now, consider any solution Σ̂ to the Riccati equation. Pre- and post-multiplying the Riccati equation by x and x^T gives x Σ̂ x^T (1 − |λ|²) ≤ 0. Finally, by the stability of F − K_p H, the equation x K_p = 0 implies |λ| < 1, and therefore x Σ̂ = 0. To conclude the proof of the first item, we can choose α large enough such that the error covariance satisfies Σ̂_2 ⪰ Σ̂_s. Note that the power constraint may be violated for small n, but this averages out when n is taken large enough.
To summarize, for any time-invariant policy (M, Γ) subject to the detectability condition, the channel outputs entropy rate converges to (1/2) log det(2πe Ψ_{Y,s}), where Ψ_{Y,s} is the innovation covariance of the Riccati equation in (75) evaluated at its (unique) maximal solution. As shown in [2], the asymptotic equipartition property (AEP) holds for arbitrary Gaussian processes, so that lim_{n→∞} (1/n)(h(Y^n) − h(Z^n)) is achievable for any policy of the form X^n = B_n Z^n + V^n, where V^n ∼ N(0, Σ_{V_n}) is independent of Z^n and B_n is a (block) lower-triangular matrix, i.e., a strictly causal operator. The policy considered here can be written in this form, since ŝ_i is a strictly causal function of {z_i}_{i≥1} and ŝ̂_i is a strictly causal function of {y_i}_{i≥1}. Thus, the resulting rate is achievable. We formulate an optimization problem which serves as a lower bound on the feedback capacity by taking a maximum over all valid policies. To complete the proof, change the variable Γ′ = ΓΣ̂_s, add the orthogonality constraint, and follow the steps in the proof of Lemma 4: define Π = ΓΣ̂_s†Γ^T + M, eliminate M, and apply the Schur complement to obtain the optimization problem (81). For consistency with the upper bound notation, we rename Γ′ and Σ̂_s as Γ and Σ̂, respectively.
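The convergence property exploited in this proof can be illustrated with a scalar sketch (assumed parameters, not from the paper): the Riccati recursion started at zero and at a large initial condition (mimicking the α-scaled modification at time 1) reaches the same maximal stabilizing fixed point, even for an unstable F:

```python
# Assumed scalar Riccati data; note F is unstable while (F, C) is detectable.
F, C, Q, R = 1.2, 1.0, 0.5, 1.0

def riccati_step(S):
    """One predictor Riccati iteration for the scalar model."""
    gain = F * S * C / (C * S * C + R)
    return F * S * F + Q - gain * C * S * F

lo, hi = 0.0, 100.0               # Sigma_1 = 0 vs. a large initialization
for _ in range(300):
    lo, hi = riccati_step(lo), riccati_step(hi)

print(lo, hi)                      # both approach the maximal solution
```

The common limit is the stabilizing solution of the scalar Riccati equation; this is the role played by Σ̂_s in Lemma 6.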
Proof of Lemma 7. Recall from the upper bound optimization problem that the tuple (Π, Σ̂, Γ) satisfies the Riccati LMI (82). We prove the two claims.
1) If the optimal tuple does not satisfy the Riccati inequality (82) with equality, there exists a nonzero matrix Q ⪰ 0 that accounts for the gap. We let Σ̂′ = Q + Σ̂, and observe that this modification satisfies the power constraint and the LMI
[Π Γ; Γ^T Σ̂′] ⪰ 0.
Then, using Σ̂′ ⪰ Σ̂ and the optimality of the tuple (Γ, Π, Σ̂), we conclude that the objective is still equal to its optimal value after the modification.
2) If there exists an unstable mode of F that cannot be observed via ΛΓΣ̂† + H, then, by our assumption that (F, H) is detectable, this mode can be observed via ΛΓΣ̂†. On the other hand, the instability of this mode implies that the error covariance Σ̂ has an infinite value in this direction, which contradicts the observability of this mode via the matrix ΛΓΣ̂†.

A. Proof for the coding scheme analysis
Proof of Lemma 1. The proof follows a sequential estimation argument. At each time instance, a new measurement (i.e., channel output) is made available to the decoder, which in turn improves its estimate of the first channel noise instance z_0. The derivation mostly focuses on writing the channel output as a simple linear function of z_0, after which we apply known recursive formulas for updating an estimate and its error covariance given a new measurement. We reiterate that the derivations hold for general MIMO channels.
Recall that the channel output can be written as
y_n = x_n + z_n
 (a)= ΛΓΣ̂†(ŝ_n − ŝ̂_n) + Hŝ_n + (z_n − Hŝ_n)
 = (ΛΓΣ̂† + H)(ŝ_n − ŝ̂_n) + e_n + Hŝ̂_n, (85)
where (a) follows from the channel input x_n = ΛΓΣ̂†(ŝ_n − ŝ̂_n). We now relate the channel output y_n and z_0. To this end, the estimation error can be written as the recursion
ŝ_{n+1} − ŝ̂_{n+1} = Fŝ_n + K_p e_n − (Fŝ̂_n + K_{Y,n}(y_n − Hŝ̂_n))
 (a)= F(ŝ_n − ŝ̂_n) + K_p(y_n − Hŝ̂_n − (ΛΓΣ̂† + H)(ŝ_n − ŝ̂_n)) − K_{Y,n}(y_n − Hŝ̂_n)
 (b)= F_p(ŝ_n − ŝ̂_n) + (K_p − K_{Y,n})ỹ_n
 (c)= F_p^n K_p z_0 + d_n, (86)
where (a) follows from (85), (b) follows from F_p ≜ F − K_p(ΛΓΣ̂† + H) and ỹ_n ≜ y_n − Hŝ̂_n, and (c) follows from ŝ_1 = K_p z_0, ŝ̂_1 = 0, and d_n ≜ ∑_{i=1}^n F_p^{n−i}(K_p − K_{Y,i})ỹ_i. We combine (85) and (86) to write the channel output as
y_n = (ΛΓΣ̂† + H)(F_p^{n−1} K_p z_0 + d_{n−1}) + e_n + Hŝ̂_n
 = κ_n z_0 + (ΛΓΣ̂† + H) d_{n−1} + e_n + Hŝ̂_n, (87)
with κ_n ≜ (ΛΓΣ̂† + H) F_p^{n−1} K_p. The estimation model in (87) is a sequential estimation problem, but the terms d_{n−1} and Hŝ̂_n on the right-hand side depend on the previous measurements. We proceed to show that these bias terms have no effect on the estimation problem and thus can be ignored. Define the transformed measurements (channel outputs)
o_n ≜ ỹ_n − (ΛΓΣ̂† + H) d_{n−1} = κ_n z_0 + e_n, (88)
in order to obtain a sequential estimation problem (without the bias terms) where the source is z_0, and at each time we observe o_n with the measurement noise e_n. The transformation {ỹ_i}_{i=1}^n → {o_i}_{i=1}^n is linear and causal (lower-triangular). Also note that this transformation is invertible, since d_{n−1} is a function of ỹ_1, . . . , ỹ_{n−1} only. Informally, the invertibility of this transformation shows that the information that can be extracted from the original channel outputs and from the transformed channel outputs is the same. Formally, the innovations of both processes are the same, i.e.,
o_n − E[o_n | o^{n−1}] = y_n − E[y_n | y^{n−1}]. (89)
We are now ready to present the recursions for the sequential estimation problem in (88).
Since the source is the same at all times (i.e., z_0), we only need a measurement-update formula (e.g., [25, Lemma 9.3.2]) to write it recursively as
ẑ_{0|n} = ẑ_{0|n−1} + Ẑ_{0|n−1} κ_n^T cov(y_n | y^{n−1})^{−1} (y_n − E[y_n | y^{n−1}]),
Ẑ_{0|n} = Ẑ_{0|n−1} − Ẑ_{0|n−1} κ_n^T cov(y_n | y^{n−1})^{−1} κ_n Ẑ_{0|n−1}, (90)
with the initial conditions ẑ_{0|0} = 0 and Ẑ_{0|0} = Ψ. By Lemma 3, we have Ψ_{Y,n} = cov(y_n | y^{n−1}) and E[y_n | y^{n−1}] = Hŝ̂_n, so that the recursions simplify to
ẑ_{0|n} = ẑ_{0|n−1} + Ẑ_{0|n−1} κ_n^T Ψ_{Y,n}^{−1} (y_n − Hŝ̂_n),
Ẑ_{0|n} = (I − Ẑ_{0|n−1} κ_n^T Ψ_{Y,n}^{−1} κ_n) Ẑ_{0|n−1}. (91)
Furthermore, due to the optimal inputs distribution, the innovation covariance Ψ_{Y,n} converges to its optimal value Ψ*_Y (for more details, see the proof of Lemma 6). Finally, taking the determinant of (91), applying Sylvester's determinant identity, and noting that Ψ_{Y,n} = κ_n Ẑ_{0|n−1} κ_n^T + Ψ by (89), gives (31).
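The scalar form of this determinant identity can be checked numerically. The following sketch iterates (91) in its scalar form with assumed values for Ψ, the closed-loop mode F_p, and the initial gain κ_1 (none taken from the paper), and verifies that the accumulated per-step log-ratios reproduce the error variance:

```python
import math

# Hypothetical scalar parameters (for illustration only).
Psi = 1.0          # innovation variance of the noise process
F_p = 0.6          # assumed stable closed-loop mode, |F_p| < 1
kappa1 = 2.0       # assumed initial measurement gain

Z = Psi            # Z_{0|0} = Psi
log_ratio_sum = 0.0
for n in range(1, 21):
    kappa = kappa1 * F_p ** (n - 1)        # kappa_n decays with the stable mode
    Psi_Y = kappa ** 2 * Z + Psi           # output innovation variance
    Z = Z * Psi / Psi_Y                    # one-step volume reduction, scalar (91)
    log_ratio_sum += 0.5 * math.log(Psi_Y / Psi)

# The accumulated log-ratio is exactly the refinement of z_0:
# Z_{0|n} = Psi * prod_i (Psi / Psi_{Y,i}).
print(Z, Psi * math.exp(-2 * log_ratio_sum))
```

The sum of the per-step terms (1/2) log(Ψ_{Y,i}/Ψ) is the total rate extracted about z_0, matching the capacity expression's logarithm argument.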

VII. CONCLUSIONS AND FUTURE WORK
In this paper, we solved the feedback capacity problem of the Gaussian MIMO channel when the noise is generated by a linear dynamical system. The derivation relies on a sequential convex optimization formulation of the finite-block capacity problem, using tools from control theory and convex optimization. Using the convexity of the optimization problem along with convergence properties of Riccati recursions, we provided tight lower and upper bounds that resulted in a single-letter, computable capacity expression. Additionally, we showed that the optimization problem induces a time-invariant capacity-achieving inputs distribution, which was used to construct an explicit coding scheme for scalar channels.
In a broader perspective, we derived a single-letter formula for the directed information, and its main steps can be summarized as follows. Note that the channel state Ŝ_i can be computed at the encoder, since it is a function of (X^{i−1}, Y^{i−1}). Also, the computation of the asymptotic behaviour at the last step was enabled by the description of the channel outputs process as a hidden Markov model (Lemma 3). The above steps are related to computations of the directed information for the discrete-alphabet counterpart of the Gaussian channel, the finite-state channel (FSC). More specifically, for FSCs whose state can be computed at the encoder, the directed information admits a multi-letter formula that could be reduced to a computable expression in a few instances only [40], [43], [47]–[52]. In [53], [54], it was shown that all these solutions can be unified with the single-letter expression I(X, S; Y | Q), where the channel output is a hidden Markov model and Q serves as its hidden state with a finite, graphical structure (called the Q-graph). As the conjectured formula structure resembles the one for the Gaussian channel in (92), it would be interesting to investigate whether the techniques developed here also apply to FSCs. In particular, the main step is the formulation of the directed information as a sequential convex optimization problem, in order to obtain an alternative feedback capacity formula that can be single-letterized. Two more research directions are as follows.
1) Explicit formulae: It may be possible to find simple capacity expressions for particular noise processes using the convex optimization in Theorem 1. For instance, the capacity of the first-order ARMA noise can be expressed as a function of the positive root of a quartic equation [21]. This implies that the two decision variables in Theorem 2 can be reduced to a single variable. Pursuing such simplifications for ARMA processes of higher order is natural [11]–[13], [17].
2) Scheme for MIMO channels: In Section IV, we presented an explicit scheme for scalar channels that trivially extends to MIMO channels that can be decomposed into parallel scalar channels. However, an explicit scheme for non-trivial MIMO channels remains open. A conjectured scheme was described in Section IV, and a refinement of the spectral analysis in Lemma 1 should establish its optimality.