Data‐Driven Inference of the Mechanics of Slip Along Glacier Beds Using Physics‐Informed Neural Networks: Case Study on Rutford Ice Stream, Antarctica

Reliable projections of sea‐level rise depend on accurate representations of how fast‐flowing glaciers slip along their beds. The mechanics of slip are often parameterized as a constitutive relation (or “sliding law”) whose proper form remains uncertain. Here, we present a novel deep learning‐based framework for learning the time evolution of drag at glacier beds from time‐dependent ice velocity and elevation observations. We use a feedforward neural network, informed by the governing equations of ice flow, to infer spatially and temporally varying basal drag and associated uncertainties from data. We test the framework on 1D and 2D ice flow simulation outputs and demonstrate the recovery of the underlying basal mechanics under various levels of observational and modeling uncertainties. We apply this framework to time‐dependent velocity data for Rutford Ice Stream, Antarctica, and present evidence that ocean‐tide‐driven changes in subglacial water pressure drive changes in ice flow over the tidal cycle.

2 of 27 processes meant to be represented by the sliding law are not directly observable outside of laboratory settings, inference of the form of the the sliding law and the value of its parameters requires inverse modeling using observations of ice surface velocity and elevation coupled with an accurate physical model of ice flow (Bondzio et al., 2017;Gillet-Chaulet et al., 2016;Joughin et al., 2012;Shapero et al., 2016).
Fortunately, the Earth science community has seen a sharp rise in remote sensing data availability over the past two decades. This rise is due to an ever-increasing number of spaceborne and airborne Earth-observing platforms in combination with increased computational capabilities and data-providing services that operationally produce analysis-ready data sets. The glaciology community has benefited enormously from continent-wide observations of ice surface velocities and elevation over much of the Greenland and Antarctic Ice Sheets (Howat et al., 2019;Joughin et al., 2011;Porter et al., 2018;Rignot et al., 2011). However, within the context of modeling of basal drag, use of these observations in modeling efforts have traditionally involved assimilating instantaneous or time-averaged velocity observations into ice flow models in order to estimate static distributions of basal drag (Larour et al., 2012;MacAyeal, 1993;Morlighem et al., 2010;Shapero et al., 2016). A few studies have expanded upon this approach by estimating basal drag at different time epochs, which can be used to attribute changes in drag to known changes in environmental factors like surface meltwater (e.g., Minchew et al., 2016) or to better constrain parameters in sliding laws (e.g., Gillet-Chaulet et al., 2016;Habermann et al., 2013). The increasing availability of time-dependent velocity fields, which in many places capture the evolution of glacier velocities on sub-monthly timescales in high-spatial resolution, could potentially provide much finer resolution on the time-evolution of basal drag in order to obtain the underlying space-and time-varying functional form of the sliding law.
Within the past decade, machine learning algorithms have exploded in popularity due to their ability to discover patterns and relationships in large volumes of data which are used to inform numerous predictive and analytical tasks (LeCun et al., 2015). In particular, the recent success of deep learning has been attributed to the ability to learn hierarchical, abstract features in unstructured data which can interact in highly non-linear ways (Bengio et al., 2013). The coincident increase in computing power, in large part from the increased utility of graphics processing units, has led to rapid development of specialized network architectures able to learn patterns from video streams, images, and word sequences. Recently, many studies have demonstrated the potential for deep learning algorithms to be integrated with scientific knowledge in order to bridge theoretic gaps, discover new and robust patterns in scientific observations, and predict the evolution of dynamical systems (Karpatne et al., 2017;Raissi, 2018;Reichstein et al., 2019). This type of "theory-guided" learning combines the robustness provided by decades of theoretical and experimental work with the pattern recognition and representation power of deep learning.
In this work, we develop a hybrid modeling framework that can exploit contemporary remote sensing data by incorporating well-known ice dynamics and constitutive laws with a deep neural network model representing the unknown sliding law. In developing this framework, our goal is to demonstrate a general approach for inferring various components of a glacier system from large volumes of data without the need to call a sophisticated numerical solver for a discretized ice flow model. We further discuss how we can pose the learning problem in a probabilistic manner that partially allows for the quantification of uncertainties due to both data errors and uncertainties in the governing equations of ice flow. Since the focus of this work is on learning a spatiotemporal representation for basal drag, we apply our method to several oneand two-dimensional flowline simulations that are representative of real-world basal sliding scenarios that would be challenging to analyze with traditional inverse modeling approaches. Finally, we apply our methods to real velocity data over Rutford Ice Stream in West Antarctica and present observational evidence for the role of subglacial hydrology in propagating tidally driven variations in ice flow roughly 100 km inland.

Ice Flow Governing Equations
The flow of ice is well-approximated by incompressible Stokes flow, which describes the motion of a viscous fluid where inertial forces are negligibly small relative to viscous forces. In Stokes flow, the momentum equations (stress balance) reduce to gravitational body forces resisted by stresses induced through ice deformation and shear stresses at the interfaces between ice and the bed and sidewalls. For many fast-flowing 3 of 27 outlet glaciers and ice streams, flow is dominated by basal sliding where sliding velocity is comparable to surface velocity, and forward motion due to vertical shearing is negligible. In this case, the full three-dimensional Stokes equations can be reduced by neglecting certain components of the stress divergence and averaging the resulting momentum balance over depth (see Appendix A). This approximation, commonly referred to as the Shallow Ice Shelf/Stream Approximation (SSA), leads to the following two-dimensional relation in a Cartesian coordinate system with E z defined parallel to the gravity vector: tion that generally includes the misfit between the neural network predictions and observed values at the input coordinates (see LeCun et al. (2015) for an in-depth review of neural networks and deep learning).
The utility of neural networks as universal function approximators (first formulated for a single network by Cybenko (1989) and extended to finite-width multi-layer neural networks (e.g., Delalleau & Bengio, 2011;Lu et al., 2017;Bölcskei et al., 2019)) make them well-suited to represent scattered, time-dependent surface observations with potentially complex spatiotemporal patterns. More importantly, we can evaluate derivatives of Ê u and Ê h at arbitrary space and time coordinates at machine precision using automatic differentiation (Baydin et al., 2017;Raissi, 2018). Similar to a spline or polynomial fit, the neural network learns a smooth hypersurface between scattered observations in data space and can return the hypersurface value and slope at any given point. The smoothness of this surface will depend on the network capacity (i.e., layer size and depth) and is only achieved when the activation function is twice differentiable (e.g., tanh or sigmoid functions; see Appendix B for additional network and training details). These smoothed predictions and their gradients can then be used to generate time-dependent predictions of basal drag from an appropriate momentum balance, such as Equation 6.
For surface observations with minimal noise levels and glacier geometries well-suited to the SSA model, the neural network weights E θ can be estimated by minimizing a standard mean square error (MSE) loss function over training data: Gaussians with finite mean and (independent) variances are known to maximize information entropy in the absence of covariance information (Cover & Thomas, 1999).
The final learning objective is to estimate Ê θ that minimizes the following joint loss function incorporating both data and physics-based loss functions ( Figure 1): A key point to reiterate is that for the joint loss function, observed data are only used for the loss function nll E  while evaluation points for the physics-based loss function ph E  can be evaluated anywhere within the training domain (see Appendix B for further details on training and neural network architecture).

Comparison With Control Methods for Ice Flow Models
The learning framework applied to time-dependent ice surface velocity and elevation data effectively forms a physics-aware space-time interpolator of the data. The interpolation kernel is provided by the hypersurface learned by the neural network, and the physical constraints are encoded in the loss functions specifying our prior assumptions on the characteristics of the underlying basal drag field. As such, while this approach can be viewed as an analog to a time-dependent inversion of basal drag using control methods applied to an ice flow model (e.g., Goldberg et al., 2015), there are several key differences. Forward runs of transient ice flow models generally require specification of key boundary conditions regarding surface mass balance, grounding line stresses and migration, ice velocities at inflow boundaries, and a functional form for the basal drag (e.g., power-law form in Equation 2). Each of these boundary conditions are time-varying and subject to varying degrees of uncertainties, which can require a significant number of spin-up runs and fine-tuning of model parameters in order to generate velocity and elevation fields that match the observations (Larour et al., 2014). By directly having access to the time-varying surface elevation and velocities from observations, we eliminate the need for evaluation of a forward model (and required boundary conditions) and simply rearrange the SSA momentum balance. We thus decouple the time-evolution of the basal drag from other processes that can influence surface elevations, for example, surface mass balance (Larour et al., 2014), yet we also enforce that the inferred basal drag is fully consistent with the predicted velocity  and elevations. Computationally, the neural network model is mesh-free and can be evaluated anywhere within the training domain, which avoids the potentially costly parameterization of basal drag at each time step for each finite element (Goldberg et al., 2015). While recent work has utilized Ensemble Kalman Filters to assimilate surface data in a sequential manner in order to more efficiently infer time-dependent basal drag (Gillet-Chaulet, 2020), the requirements for specification of boundary conditions and grounding line migration still persist. Overall, our more focused objective of reconstructing the time-dependent basal drag allows us to bypass several algorithmic requirements necessary for forward runs of transient ice flow models, which are still necessary for any prognostic evaluation of future ice states.

Validation on 1D Ice Flow Simulations
To evaluate inference of basal drag using the neural network model, we apply the learning framework to various sets of synthetic ice velocity and thickness data generated by 1D SSA simulations for both spatially and temporally varying frictional parameters. These simulations are designed to generate observations similar to those observed at real glaciers subject to time-varying stress conditions. The prescription of spatially and temporally varying frictional parameters in the simulator provides mathematically convenient scenarios for testing recovery of the underlying sliding law parameters by our learning framework.

Spatially Varying Drag
We first generate 1D SSA simulations for two different cases of frictional parameter spatial distributions ( Figure 2).In the first case, we prescribe a constant exponent of  3 E m and a spatially varying prefactor, b E c , with values that slowly increase with upstream distance to approximate increasing basal drag. In the second case, we prescribe periodic exponent values with values ranging from approximately 1 to 6, which spans the regimes from linear to approximately plastic (rate-independent) sliding. Additionally, we assign values of b E c such that the modeled basal drag is approximately equal to the drag from the first case. In this way, both cases will have similar values of basal drag throughout the simulation but different time-dependent sliding, providing a good test for the recovery of the true frictional parameters from time-dependent observations. For both simulations, we prescribe an ice temperature of   5 E C and compute the flow rate parameter using an Arrhenius relation (Cuffey & Paterson, 2010), and we use ice and seawater densities of 917 and 1,024 kg/ 3 m E , respectively. After a spin-up period of 200 years, we force both simulations with periodic variations of the longitudinal stress at the terminus to simulate periodic ocean tides over a five-year duration. We allow the terminus to migrate based on a flotation criterion (e.g., Nick et al., 2009). The resulting velocity  time series show strong periodicity in time while the ice thicknesses and elevations are roughly constant throughout the five-year duration ( Figure S1).
For training the network E f θ subject to the loss function in Equation 12, we select a spatial subset spanning the minimum terminus position and 50 km upstream of that position to use as training data. We restrict all subsequent analysis of the inference results to this spatial subset. To simulate measurement noise, we add white noise with a standard deviation of 10 m/year and spatially correlated noise generated from a squared exponential covariance function with a lengthscale of 5 km (equivalent to approximately 10 ice thicknesses) and an amplitude of 6 m/yr, resulting in noise levels that are approximately 10 % E of the mean velocity variation, similar to noise in remote sensing observations Mouginot et al., 2019). Generally, the correlated noise will have a much larger effect on the inferred basal drag since coherent velocity gradients will be mapped to spurious basal drag variations, particularly when the wavelength of the noise is on the order of several ice thicknesses. For the ice thickness data, we similarly add white noise with an amplitude of 10 meters and correlated noise with the same lengthscale and an amplitude of 6 meters, which again results in noise levels that are approximately 10 % E of the mean thickness. This noise level is similar to bed topography uncertainties estimated using mass conservation techniques for fast-flowing glaciers . Finally, we uniformly sample 50,000 data points within the space-time training volume (about 30% of the simulation outputs) for computing the data loss function, nll E  , and we uniformly sample an additional 50,000 space-time coordinates for computing the physics losses, ph E  .
After for any triplet of ( , , ) E u x t , we can compute the mean and standard deviation of the 1/m and b E c estimates. As a comparison, we also compute basal drag directly from the noisy surface observations using Equation 6 where we first spatially smoothe the E u and E h fields with a low-pass Butterworth filter with a cutoff period of approximately six ice thicknesses (commensurate with the expected stress coupling lengthscale for glaciers on colder ice sheets (Cuffey & Paterson, 2010)), and then we compute the momentum balance using finite differences for the spatial gradients. This combined approach of smoothing and finite differencing is commonly used when computing the force balance from spatially continuous surface observations.
For both simulation cases, we are able to accurately recover the true sliding law parameters for the region of the glacier with sufficiently large velocity variations (within  E 35 km of the terminus; Figure 3).As the upstream distance increases, the amplitudes of the velocity fluctuations caused by the stress perturbations applied at the terminus attenuate, which ultimately results in increasing uncertainties in both 1 / E m and b E c as the linear fits in log space become more ill-conditioned. For the case where the prescribed prefactor and exponent both vary spatially, we observe larger uncertainties on b E c where 1 / E m is lower (Figure 3), which implies that sliding regimes that are closer to plastic will generally lead to more uncertain prefactors using the log-domain line fit used here.
Another key result is that estimates of basal drag using direct application of Equation 6 lead to highly biased and noisy sliding law parameter estimates, even when significant spatial smoothing is applied to the data prior to application of the momentum balance (Figures 3c-3f). The ability of the neural network framework to accurately recover the true parameter values (to the extent where velocity variations are large enough) indicates similar levels of robustness to noise as traditional inverse modeling schemes that apply some form of regularization on the modeled basal drag (e.g., Habermann et al., 2012;Larour et al., 2012). The ability to quantify uncertainties in predictions of basal drag is an important additional benefit of the probabilistic loss functions used to train the network E f θ . The benefit of basal drag uncertainties has been previously demonstrated with Markov Chain Monte Carlo (MCMC) techniques used to sample the posterior distribution of time-invariant basal drag friction coefficients (Petra et al., 2014). There, it was found that certain statistical metrics of the posterior distribution (e.g., eigenvectors of the covariance matrix) could discriminate between areas where observations provided robust constraints on parameter values versus those where parameter uncertainties were primarily dictated by prior assumptions. While our approach to uncertainty quantification differs from these model-based inference methods (see Section 5.4 for a detailed discussion), we share the common goal of locating areas where basal drag and observations are well-constrained relative to other areas.

Time Varying Drag
In the previous subsection, the sliding law parameters were simulated to be time-invariant. However, for some glaciers and ice streams, basal drag has been hypothesized to evolve in time, for example, in response to changes in the subglacial hydrological system. As water flows into and out of the hydrological system, the basal water pressure compensates some of the overburden pressure and thus changes the effective pressure (the difference between overburden and water pressures) at the bed. The overall change in effective pressure will thus affect the magnitude of the basal drag and the corresponding flow of ice (Flowers, 2015;Hewitt, 2013;Iken & Bindschadler, 1986;Minchew et al., 2016;Rosier et al., 2015;Schoof, 2010;Stevens et al., 2018). Here, we implement a simplified model for temporally varying water pressure by representing the prefactor in the power-law sliding law as a Mohr-Coulomb yield criteria (e.g., Tulaczyk et al., 2000) such that levels of phase lag between the periodic velocity and basal drag signals ( Figure S2). The varying phase lag is again a consequence of competing basal drag perturbations from the pressure wave and balancing of longitudinal stress perturbations resulting from the initial speedup where the latter generally propagates upstream with a higher phase velocity.
variations (and thus, prefactor variations) or explicit modeling of subglacial hydrology would be needed to recover values of the exponent. Conversely, if a priori information about the exponent were available (e.g., the bed is well-approximated by plastic deformation), then it is possible to derive estimates of basal water pressure variations from the time-varying drag (Minchew et al., 2016).

Rutford Ice Stream, Antarctica
Rutford Ice Stream (RIS) in West Antarctica is a fast-flowing ice stream which flows into the Filchner-Ronne Ice Shelf (Figure 5a). RIS is laterally confined by the Ellsworth Mountains and Fletcher Promontory with an average width of approximately 23 km (Figure 5b), and most (  E 95%) of the forward velocity is due to basal sliding (De Rydt et al., 2013;Gudmundsson, 2007;Joughin et al., 2006;Smith et al., 2015). The high width-to-thickness and slip ratios support the use of the SSA approximations for examining basal drag variations at intermediate to long spatial wavelengths (De Rydt et al., 2013;Gudmundsson, 2003). Furthermore, RIS exhibits strong variations in flow velocity due to tidal forcing where non-zero variations are measured almost 100 km away from the grounding line. While variations in vertical velocity are mostly modulated by diurnal and semi-diurnal tides, along-flow variations are observed primarily at the fortnightly sf E M (14.77 days) period (Gudmundsson, 2006;Minchew et al., 2017;Murray et al., 2007), which indicates a non-linear response of RIS flow to tidal forcing (Gudmundsson, 2011;Rosier & Gudmundsson, 2016;Rosier et al., 2015). While several recent studies have compared different mechanisms for originating along-flow variations at the sf E M frequency on the ice shelf (e.g., Rosier & Gudmundsson, 2020;Robel et al., 2017), our focus in this study is on using the response of ice flow in the grounded ice stream to infer the mechanics of slip at the ice-bed interface. Thus, our analysis focuses on regions of the ice stream greater than 10 km upstream of the grounding line in order to avoid elastic effects due to bending stresses, which are not incorporated into the SSA approximations (Rosier & Gudmundsson, 2016). We emphasize that a rigorous exploration of the ice stream stress response (including the elastic response) to tidal forcing is outside the scope of this work. Rather, our aim is demonstrate the machine learning-based techniques for inferring time-dependent basal drag on high-quality surface observations.

Data and Learning Objectives
We use existing data sets to constrain the surface velocity fields and ice-stream geometry. The 3D surface velocity fields were derived from 9 months of synthetic aperture radar data collected from multiple viewing angles in order to constrain a parametric surface displacement model consisting of sinusoids corresponding to the primary tidal constituents and a steady-state velocity . Since our main focus is on the along-flow velocity variations (where variations at diurnal and semi-diurnal constituents are minimal (Murray et al., 2007)), we use only the steady-state velocity and sinusoid periods at the sf E M frequency. Geometric information (surface elevation and ice thickness) were obtained from BedMachine V1 (Morlighem et al., 2020). While our analysis is focused on the regions of the ice stream greater than 10 km upstream of the grounding line, our model domain spans from 150 km upstream of the grounding line to regions of the ice shelf within 45 km downstream of the grounding line. With this extended domain, we can confidently constrain the spatial gradients of the observation variables. Additionally, inclusion of the ice shelf also provides a means to validate the rheological parameters since basal drag is expected to be negligible on the shelf (seawater offers very little resistance to ice flow). Here, we use a characteristic temperature of - 10 E C to calculate an effective depth averaged value of E A from tabulated values (Cuffey & Paterson, 2010) and a stress exponent  3 E n to compute ice viscosity (Appendix A). A more rigorous treatment of E A would include the effect of flow rate enhancement due to shear weakening, crystallographic fabric, impurities, etc. (Cuffey & Paterson, 2010), which can affect the inferred distribution of basal drag (Zhao et al., 2018). Since the focus of this work is to demonstrate the recovery of time-dependent basal drag in the central trunk of RIS, where rheological variations are expected to be minimal, we proceed with a uniform value for E A .
We train the network E f θ to predict the time-varying 2D horizontal velocity components and time-invariant ice thickness and surface elevation. At the fortnightly timescales, ice thickness and driving stress are assumed to be constant in time. By adding the surface elevation variable to the outputs of E f θ (as opposed to adding a known bed elevation to the thickness predictions as was done with the simulated data), we implicitly account for errors in the bed topography by treating   E s h b as an additional noisy observation The neural-network-predicted secular velocity magnitude, which is in good agreement with the observed secular velocities from  (blue contours at levels of 0, 100, 295, 320, 350, and 370 m/year). The predicted amplitude (c) and phase (d) of the timedependent velocity variations, which are also in good agreement with  upstream of the grounding line (areas of high phase uncertainty, due to low amplitude variations, masked out). Dashed black circle indicates prominent bump in bed topography. The driving stress (e) is mostly balanced by the neural-network-predicted basal drag (g). By assuming a plastic bed with yield stress determined by the Mohr-Coulomb yield criteria (Equation 13), effective pressure equals basal drag divided by the internal friction coefficient,   0.5 E (pressure values shown in square brackets and italics in (G)). Secular water pressure (f) may then be derived from the effective pressure. Scalar uncertainties for predicted basal drag (h) are generally high in areas with relatively rapid changes in bed slope, such as the margins and near the grounding line. subject to smoothing imposed by our physics-based loss functions. For the velocity components, rather than outputting the velocity values at any given input   , E t x , we instead output the spatially varying coefficients of a periodic temporal model (independently for the E u and E v components), for example,: where   sf sf T  2 / is the fortnightly angular frequency for the sf E M tidal constituent ( sf E T = 14.77 days), and the coefficients     0 , , E a b u vary in space only. This approach reduces the dimensionality of the neural network inputs to two spatial coordinates while providing physical constraints on the temporal form of the predictions. While the temporal model is not necessary since the neural network can learn the underlying time variability, we found that it substantially improved the convergence rate during training. Thus, for study areas where the time variability can be confidently constrained (e.g., for areas where daily GPS observations are available), inclusion of a suitable temporal model is likely beneficial.
For formulating the physics-based loss functions in ph E  , we use the 2D SSA equations (Equations 1a and 1b) to predict the basal drag ˆb E in both spatial directions (east and north), which we then project to the alongflow direction using the predicted velocity vectors. This projection allows us to once again penalize the drag values with incorrect signs and to compute the Laplacian smoothness metric on a scalar field as opposed to a vector field. We assume an ice density  i E = 917 kg/ 3 m E . Values for the hyperparameter controlling spatial smoothness of basal drag were chosen using a standard L-curve ( Figure S5). After training E f θ , we Monte Carlo sample the time-dependent basal drag ˆb E in the along-flow direction (see Text S1 for validation on noise-free 2D ice flow simulations). As a post-processing step, we fit the predicted drag time series samples with the periodic model used for the velocity components (Equation 15) in order to reduce high-temporal-frequency drag variations. Additionally, uncertainties in the parameters of the final periodic model are formally computed by propagating uncertainties from the drag samples.

Secular Velocity and Basal Drag Predictions
The predicted along-flow secular (steady-state) velocity magnitudes for RIS are in good agreement with the observed secular velocities with 1  E errors of 5 m/year in the central trunk (Figure 5b), while the velocity amplitude and phase are also in good agreement with prior studies  with 1  E errors of 1.7 m/year and 0.7 days, respectively. The steady-state basal drag magnitudes show a region of very low basal drag from approximately 10-50 km upstream of the grounded line, transitioning to higher drag over short distances (Figure 5g). This transition from a weak to a stronger bed has been inferred in several prior studies (e.g., Joughin et al., 2006;Pralong & Gudmundsson, 2011) and has been associated with a transition from dilatant to stiff sediment .Since RIS is close to steady-state (Gudmundsson & Jenkins, 2009), drag variations are mostly in balance with the driving stress ( Figure 5e). The drag magnitudes in our training area peak at around 100 km upstream of the grounding line, which is colocated with a local high in the basal topography ( Figure S6). Previous numerical studies of RIS have shown that basal topography is the dominant control on surface undulations, which in turn implies that basal topography is the dominant control on secular drag variations at the spatial scale of tens of kilometers (De Rydt et al., 2013;Pralong & Gudmundsson, 2011). Under the functional form of Equation 13 where  E represents the internal friction coefficient for till, these results support the view that variations in the friction coefficient  E are at much longer wavelengths (  E 20 ice thicknesses), with the exception of the low basal drag region. By further assuming a plastic bed with a uniform value of   0.5 E (median of published values (Iverson, 2010)), the effective pressure is simply twice the basal drag (Figure 5e), and we can obtain an estimate of basal water pressure at RIS by subtracting the effective pressure from the overburden stress (Figure 5f). We explore the implications of bed plasticity on water pressure changes in a later discussion.
Uncertainties for the predicted drag are generally highest at the margins and areas with high bed slopes where data gradients are large and work against the Laplacian smoothing penalty on the basal drag (Figure 5h). Specifically, the high slope areas exist near the grounding line, as well as near a prominent bump in the bed topography about 30 km upstream of the grounding line (Figure 5d). By examining the uncertainties for the predicted velocity, ice thickness, and surface elevation, we can see that all three variables exhibit larger uncertainties in these areas and contribute to the total drag uncertainty ( Figure S7). Mathematically, the uncertainties here have been inflated due to larger data misfits during training; the neural network learns to increase the likelihood variance in these high misfit bias areas in order to increase the total log likelihood (see Section 5.4 for further discussion). Overall, quantification of drag and grounding line migration in these areas has proven challenging and will only improve once more high-quality bed topography data are acquired (Rosier & Gudmundsson, 2020).

Time-Dependent Velocity and Basal Drag Predictions
By quantifying the change in velocity and basal drag at different times within the sf E M tidal period, we observe significant basal drag variations propagating upstream with values spanning 4-6 kPa over the course of the tidal period ( Figure 6). Perhaps the most interesting observation is that the upstream propagation of positive velocity variations is associated with a propagating decrease in basal drag, which suggests some form of a pressure wave driven by subglacial hydrology (analogous to Figure 4). During the initial speedup of the ice stream, the associated basal drag decreases only slightly, which may signify destructive interference of basal drag reduction and longitudinal stress perturbations originating from loss of buttressing stresses downstream (Robel et al., 2017;Rosier & Gudmundsson, 2020). We reiterate that the inferred basal drag near the grounding line is likely inaccurate since we do not incorporate elasticity of the ice into our stress calculations and bed slopes there are subject to larger uncertainties. However, later in the tidal cycle when velocity speedups have propagated to about 70 km upstream of the grounding line, the basal drag decrease has also propagated upstream while becoming more widespread within the ice stream (Figures 6e  and 6f). We do observe a phase lag between the velocity and drag variations which can be confirmed by the elliptical relationship between  b E versus ǁ ǁ u (Figure 7), a characteristic we previously observed for the 1D simulated pressure waves. The exceptions to this behavior are near the grounding line and in the weaker bed where drag variations are minimal compared to the velocity variations. At greater upstream distances, we can observe a gradual transition in the ellipse orientation, signifying a transition to a stress regime where longitudinal stresses become the primary driver of the velocity variations.

Discussion
The availability of time-dependent observations of surface velocity and elevation permit direct estimates of time-varying basal drag that satisfies global stress balance. Coupled with a machine learning model for reconstructing the spatiotemporal function for basal drag, we can retrieve important sliding parameters under certain stress and loading conditions. We discuss the robustness and implications of these results below.

Inference of Sliding Law Parameters
Under the condition that ice surface velocity variations are driven by processes other than changes in drag at the bed-such as longitudinal stress perturbations at the terminus or grounding line, as may be expected in some cases for seasonal calving cycles, ocean tide effects via hydrostatic stress differences, or changes in buttressing stresses from ice shelves-then, for a general power-law formulation of the sliding law, it is possible to recover both the prefactor and exponent from a linear fit of time-dependent sliding velocity and basal drag predictions in the log domain. Parameter estimation for recently proposed augmented sliding laws that combine the power-law sliding relationship at lower velocities and a rate-independent (plastic) relationship at higher velocities (Joughin et al., 2019;Minchew & Joughin, 2020;Zoet & Iverson, 2020) can be accomplished in a similar manner through a nonlinear optimization in the log domain. Verification and refinement of such a law from remote sensing data sets would make significant progress towards unification of laboratory, observational, and theoretical approaches towards understanding glacier sliding dynamics.
For all proposed sliding law functional forms, if the sliding law parameters are known to be time-invariant (but may be spatially varying), then simultaneous parameter recovery is possible. In cases where the parameters may vary in time, such as when changes in the prefactor are driven by subglacial hydrological processes, then simultaneous recovery is not possible, and one would need additional information about the physical properties of the bed, such as water pressure variations or bed plasticity (which is equivalent to knowing the value of the exponent in the power-law form of the sliding law, as discussed in the following section).
Under applicable conditions, successful recovery of sliding law parameters is largely dependent on the availability and temporal sampling of time-dependent velocity fields, as well as the noise wavelength and amplitude in the velocity and topography data. Static velocity snapshots allow only for the estimation of the magnitude of basal drag, which is equivalent to estimation of the joint probability distribution for the sliding law parameters in a Bayesian inference framework. Unique inference of one of the parameters would (e), (f) 5-7.5 days; and (g), (h) 7.5-10 days. Markers in (a) indicate points extracted for Figure 7. Gray contours for right-column plots correspond to the secular basal drag uncertainties in Figure 5h in intervals of 2.0 kPa. In general, an upstream-propagating increase in velocity is associated with an upstreampropagating decrease in basal drag, which suggests that a pressure wave in the subglacial till is responsible for the observed variations in surface velocity.
require some assumption on the value/distribution of the others, as well as an assumption on the form of the sliding law. Time-dependent velocity fields allow for quantification of time-dependent basal drag, permitting joint estimation of all sliding law parameters by quantifying the relationship between drag and sliding velocity at different points within the spatial domain. Furthermore, the larger the amplitude of velocity variability at any given location (e.g., amplitude of periodic variations due to ocean tides or seasonal effects), the better constrained the parameters (Figure 3). Of course, the level of noise or errors in the surface data will directly impact the uncertainty of the drag predictions, and as discussed in Section 3.1, errors with correlation lengthscales of several ice thicknesses will generally lead to larger artifacts in the predicted drag. For study areas where velocity and elevation measurements are more sparse or exhibit higher noise levels, the methods presented here would greatly benefit from a time series preprocessing stage that can fit some smoothly varying time function to the available data to inject stronger a priori knowledge about the underlying flow variations Riel et al., 2021), as was done for the RIS velocity data.
In this work, surface observations are used to compute the global stress balance directly via momentum balance equations under the assumptions that the surface velocities are approximately equal to basal sliding velocities and the rheology of the ice is reasonably well constrained. For the former, we note that the viscous nature of ice flow acts as a low-pass filter on basal stress variations such that variations with spatial scales  E one ice thickness can result in similar surface velocities and elevations (Habermann et al., 2012). This non-uniqueness requires that any inversion technique (including ours) that assimilates noisy surface observations to estimate basal stress fields must use a regularization scheme to promote smoother basal stress or satisfy some other auxiliary constraint to obtain a unique solution (Habermann et al., 2012;Larour et al., 2012;Shapero et al., 2016). We do note that while it has been shown that the transfer function amplitude between variability in basal stress and surface velocities does decrease with decreasing spatial wavelengths, higher slip ratios (ratio of sliding to deformation velocity) can increase the transfer function amplitude (Gudmundsson, 2003). Thus, inversion for basal stress fields is generally more robust for fast-flowing glaciers which are dominated by basal sliding and where the spatial scale of basal stress variability is greater than the intrinsic resolution of the velocity fields (Stearns & Van der Veen, 2018).
When the ice rheology is subject to non-negligible uncertainties, any un-modeled variations in the rheology at the surface will lead to variations in the inferred basal drag via spatial gradients of the effective dynamic viscosity. When these gradients are small, as would be expected in the central trunk of the glacier, errors in the inferred drag should also be correspondingly small. In future work, we will explore joint estimation of rheological parameters and drag, perhaps by introducing a second auxiliary variable trained to predict a spatially varying flow rate parameter, ( , ) E A x y . Since joint estimation of rheology and drag is an ill-posed problem, we would likely need to incorporate additional conditioning/regularizing factors in the physics-based loss functions in order to constrain the solution space. As an example, we may encourage softer ice in high strain-rate areas (such as lateral shear margins) and anisotropic smoothing of the rheology and drag to enforce lower spatial gradients in the along-flow direction where strain-rates are orders of magnitude lower. As demonstrated by Ranganathan et al. (2020), such constraints can effectively reduce the inherent trade-offs between rheology and drag variations.

Rutford Ice Stream and Subglacial Hydrology
In general, speedups in ice flow respond to changes in driving, longitudinal, and basal stresses. A localized perturbation in longitudinal or driving stress (e.g., as a result of a calving event for a tidewater glacier) will result in a non-local redistribution of longitudinal stresses and velocity variations well away from the original perturbation. Similarly, a localized perturbation in basal drag will result in non-local redistribution of longitudinal stresses and velocity changes upstream (Joughin et al., 2019). On the other end of the spectrum, subglacial hydrological variations that result in traveling "pressure waves" are governed primarily by the properties of the hydraulic network, although indirect effects could arise from changes in surface slope (Hewitt & Fowler, 2008;. In this case, surface velocity will respond to a combination of local reductions in basal drag (corresponding to the pressure wave front) and non-local variations in longitudinal stress. Consequently, quantifying velocity variation magnitudes without examining spatial gradients (i.e., strain rate variations) will not distinguish between these different forcing mechanisms, and an analysis of the stress states of the glacier is required (Rosier & Gudmundsson, 2016).
While a comprehensive comparison of the stress response to the different forcing mechanisms is reserved for future work, a simplified analysis of the evolution of longitudinal normal stresses can be used to infer the sign of the corresponding change in basal drag (Text S2). For RIS, gradients of longitudinal normal stresses decrease (become more negative) in response to increases in surface velocity. Therefore, assuming that lateral shear stress also becomes more resistive for increases in velocity, it follows that a decrease in basal drag is driving the velocity increases upstream. This simplified analysis, which is not subject to any modeling assumptions other than bulk ice rheology, further supports the inferred pressure wave.
Several recent modeling studies have proposed sub-glacial hydrology as the primary driver of long-period along-flow velocity variations near the grounding zone for RIS (e.g., Rosier & Gudmundsson, 2020;Warburton et al., 2020), as well as the high velocity variation amplitudes further upstream (Rosier et al., 2015). From the perspective of the work presented here, we implicitly assume that the basal drag is varying only at the sf E M frequency when we enforce the periodic time representation. This assumption is likely to be valid for the upstream portions of RIS (greater than a few ice thicknesses from the grounding zone) where elastic stress variations have decayed, thus limiting the response of ice flow to viscous effects (Rosier et al., 2015;Thompson et al., 2014). Simultaneous tracking of the velocity and basal drag variations suggests that the possible pressure wave lags behind the traveling wave of surface velocity (Figures 6 and 7), which is consistent with a pressure wave speed below the speed of longitudinal stress transmission. This constraint is almost certainly valid for real-world glaciers since pressure-driven subglacial water flow will be resisted by drag in the hydraulic network, so even for highly connected distributed systems, longitudinal stresses will propagate faster than basal drag variations (Warburton et al., 2020).
As previously discussed, estimation of the underlying sliding law parameters from surface observations is not possible without additional information if the parameters vary in time. However, we may still consider different endmembers for the sliding law exponents to explore the implications of the basal water pressure variations. Let us again consider the case where the bed is perfectly plastic such that the secular basal drag is equal to the yield stress,  y E , of the bed. In this case, drag variations are entirely determined by the Mohr-Coulomb yield criteria in Equation 13. By assuming that ice thickness is approximately constant over (fortnightly) tidal timescales, then it follows that variations in drag take the form     Δ Δ b w E p , so that changes in drag are proportional to changes in basal water pressure. For an estimated basal drag variation amplitude of 4-5 kPa roughly 20 km upstream of the grounding line, the corresponding water pressure variation would then be 8-10 kPa for internal friction coefficient   0.5 E (Figure 6). Following the subglacial hydrological model of Rosier et al. (2015), which assumes subglacial hydrology at RIS can be described as a homogenous porous medium, changes in water pressure can be related to changes in hydrologic head.
At the grounding line where the hydrological system is in direct contact with the ocean, hydrologic head is equal to the ocean elevation, and basal water pressure variations can be computed as: where  w E is the density of seawater and E S is the height of the ocean surface. From tidal models and GPS records, tidal amplitudes are approximately 3 meters at RIS Padman et al., 2018;Rosier et al., 2015), which would lead to water pressure amplitudes of approximately 30 kPa at the grounding line. Thus, our estimate of 10 kPa basal water pressure change at a distance of 20 km upstream indicates an E e -folding distance of approximately 20 km. Note that doubling the basal water pressure change to 20 kPa is equivalent to an E e -folding distance of approximately 50 km, which is the same value for the velocity amplitudes . The amplitudes of basal drag variations estimated for RIS are likely on the lower end of plausible values due to our higher choice for the smoothing hyperparameter ( Figure S5), which was necessary to handle assumed errors in the surface and bed topographies. Thus, it is reasonable to expect that estimates of basal water pressure variations are as high as 15-20 kPa.
The upstream diffusion of hydrological head variations is a function of the conductivity of the hydrological system and the temporal frequency of the tidal forcing. While estimation of head variations over the grounded ice is beyond the scope of the work, the relative consistency between the estimated basal water pressure variations assuming a plastic bed and those predicted from a simple subglacial hydrological model provides some support for the possibility that the bed of RIS deforms plastically. If the sliding exponent was instead closer to  3 E m , the negative feedback defined by the increased basal drag resistance caused by the velocity speedup would necessitate a nearly factor of two larger water pressure variation (e.g., Figure 4), which would be on the higher end of plausible values. Therefore, independent measurements of time-dependent basal water pressures would likely provide substantial information for constraining the sliding law exponent for RIS.

Ice Dynamics and Physics-Informed Neural Networks
The use of the SSA momentum balance to compute basal drag as a target metric for enforcing physical consistency follows the overall strategy of physics-informed neural networks (PINNs), wherein known physical relationships and constraints are used as auxiliary "data" for neural network training (Raissi et al., 2019). PINNs themselves are similar in nature to PDE-constrained optimization problems (see Morlighem et al. (2017) and Brinkerhoff and Johnson (2015) for cryosphere applications). The two primary advantages of PINNs for our purposes are: (a) the ability to evaluate the physics constraints at arbitrary space-time coordinates within the training domain, which prevents overfitting of the surface observations; and (b) negating the need for running (and then backpropagating gradients through) a sophisticated ice flow model. As discussed in Section 2.2.1, we are able to use this approach due to our narrower focus on inference of the time-evolution of basal drag for a given glacier without having to consider the physics governing key boundary conditions, particularly the dynamics of the grounding line and the surface mass balance. We consider these boundary conditions as implicit in the surface observations, allowing us to focus directly on the stress distribution in the momentum balance. One important implication of this approach is a lack of generalizability to other glaciers since neural networks simply function as physics-aware interpolators of the surface data. Of course, these methods are not intrinsically limited to a single glacier and could be scaled to a much wider region if surface data and bed topography are available. In this case, a neural network architecture with more and possibly wider layers would likely be necessary to capture the wider spatiotemporal variability of the data. One promising method for improving the computational efficiency of the PINN framework for our case is to treat each time slice of data as a high-dimensional training example for the neural network as opposed to scattered point examples as used in this work. While this method would require spatially continuous observations at all time epochs, it would utilize the efficiency of convolutional neural network architectures for computing spatial gradients and would likely decrease training time from 1-2 h to a few minutes due to training examples consisting of full images rather than individual points (Zhu et al., 2019). For very large modeling domains, one could improve scalability by reconstructing image patches rather than full images, although care would need to be taken for patch boundaries.
An alternative learning approach to the PINNs discussed here is to learn the full ice dynamics for a given glacier. Essentially, a neural network could be trained to predict the time evolution of ice velocity and thickness completely from velocity and thickness time series without utilizing physical information from the momentum balance equations (Raissi, 2018). In this way, the representation of the glacier's dynamics would be purely generic and could be learned with minimal supervision, that is, "end-to-end" learning. However, the main challenge for this approach is also generalizability. In order for a pure neural network model to robustly predict the time evolution of a glacier or ice stream not seen during training, one would have to train the network with many different simulations spanning the expected parameter sets of all glaciers and ice streams over the globe. In other words, as the distribution of desired testing examples becomes wider, the distribution of training examples would also have to become wider to ensure that predictions are done in an interpolatory manner rather than an extrapolatory one. Considering the wide variety of bed topographies, sliding conditions, ice shelf conditions, climatic environment, and ice geometries, the training data would need to be prohibitively large in order to ensure generalizability without using any prior physics information. One potentially promising area of research utilizes flexible relational inductive biases encoded in graphs for improving generalizability of neural networks (Battaglia et al., 2018). This type of learning would relax the usage of a specific set of momentum balance equations while still utilizing additional information known from physical interactions between velocity and thickness.

Uncertainty Quantification
By prescribing the neural network E f θ to predict the distribution of the surface variables (via means and standard deviations of independent Gaussian distributions), we are able to obtain uncertainties on the predictions of those variables and on the derived basal drag. This uncertainty is governed by the misfit between the surface observations and the mean hypersurface learned by the neural network, that is, the neural network will inflate the prediction standard deviations in areas where the predictions deviate the most from the observations ( Figure S7). This deviation is itself driven by a combination of data noise and the physics constraints on the basal drag. Thus, we would expect larger surface variable uncertainty when data noise is large or when the basal drag field is enforced to be smoother. Consequently, we expect uncertainties to vary with the choice of the smoothing penalty  E in Equation 10. We note that this uncertainty behavior arises from the likelihood loss nll E  in Equation 11 and specification of the data distribution as a trainable distribution such that heteroscedastic data variances are estimated along with the data means. For a fixed error model where data covariances are known a priori, one would likely need to use probabilistic neural networks (e.g., Bayesian neural networks (MacKay, 1995)) in order to recover an uncertainty measure that incorporates both data noise and the strength of the physics-based loss functions. For all of these approaches, we emphasize that the main derived utility is in comparison of relative uncertainties between different regions within the study domain, which is useful for outlining areas where observation noise is high or where velocity/topographic variations lead to SSA-based drag predictions that are not well-aligned to the prescribed level of spatial smoothing. Obtaining a reliable estimate of absolute uncertainties requires some form of uncertainty calibration with validation data (e.g., comparison of uncertainties with misfits between neural network predictions and data unseen during training). We leave uncertainty calibration and comparison of uncertainties with more rigorous MCMC-based methods (e.g., Petra et al., 2014) for future work.
From a broader perspective, we believe that the probabilistic learning framework takes a significant step towards general quantification of both data and modeling uncertainties within a geophysical context while lowering the burden to run computationally expensive MCMC methods. This uncertainty quantification for hybrid physical and machine learning models has proven to be useful in related fields such as atmospheric dynamics (Stuart & Teckentrup, 2018). Other probabilistic machine learning models, such as Gaussian processes (Rasmussen, 2003), may also be a suitable surrogate for the basal drag, and recent advances in variational Gaussian processes that allow for training on large datasets make them a compelling machine learning model for future work (Hensman et al., 2015).

Conclusion
We have presented a hybrid machine learning framework for learning the time-evolution of basal mechanics for glaciers and ice streams. This approach integrates into the learning procedure well-known ice flow momentum balance equations at various levels of complexity. The a priori physical knowledge allows for the transformation of ice velocity, thickness, and surface elevation measurements into a domain where a neural network can directly predict basal drag. Furthermore, we demonstrated the utility of probabilistic loss functions for quantifying uncertainties for the basal drag predictions, which will prove to be invaluable for subsequent interpretation of the drag, inference of sliding law parameters, and development of future data acquisition plans. As a real-world example, application of these techniques to time-dependent velocity data over Rutford Ice Stream, Antarctica, resulted in observational evidence of subglacial hydrological effects during the tidal cycle.
From a broader perspective, this work demonstrates a new and rapidly advancing approach for combining the physical knowledge gained from decades of theoretical and experimental work with modern data-driven techniques in order to address an outstanding problem in glacier dynamics, mainly determination of the sliding mode via the form of the inferred sliding law. Under certain forcing environments, we demonstrated that estimation of the value and uncertainty of the exponent in the power-law form of the sliding law is possible with these methods. The exponentially increasing data volume over the fastest flowing areas in the cryosphere demands techniques that combine data efficiency, modeling flexibility, and robustness in the presence of noise, data gaps, and modeling uncertainties. The methods presented here take an important step towards those requirements and present a path forward for future data assimilation tasks for a multitude of disparate data sources.

A1. Governing Equations
In its most general form, glacier flow can be described as an incompressible Stokes flow: where   E σ is the divergence of the Cauchy stress tensor E σ ,  i E is the density of ice, E g is the gravitational acceleration,  E  is the strain rate tensor, and Tr is the trace operator (here, bold font indicates tensor and vector quantities while regular font represents scalars). Equation A1 describes the stress balance (also referred to as the momentum balance) while Equation A2 represents the incompressibility of ice. The strain rate tensor describes the rate of deformation of ice and is calculated as the symmetric component of the velocity gradient: is the velocity vector. To relate the stress tensor components in Equation A1 to velocity components, the constitutive law for incompressible viscous fluids is used: is the deviatoric stress tensor,  Tr( ) / 3 p σ is the isotropic pressure, and E I is the identity matrix. The non-Newtonian effective ice dynamic viscosity,  E , is given as: where E n is the stress exponent in Glen's flow law,  e E  is the effective strain rate (square root of the second invariant of the strain rate tensor), and E A is the flow rate factor which depends on properties of the ice (e.g., temperature, interstitial liquid water content, crystal size/orientation, and impurity content). In practice, many studies have found that various approximations to the computationally expensive full Stokes equations (Equations A1 and A2) are able to reconstruct observed velocity fields fairly well for certain glacier geometries and result in similar implied glacier mechanics. For the types of glaciers and ice streams we examine in this study, the Shallow Ice Shelf/Stream Approximation (SSA) (MacAyeal, 1989) is most commonly used and assumes: (a) ice thickness is much smaller than the horizontal span; (b) most of the forward motion of fast-flowing glaciers is due to sliding at the bed (i.e., vertical shearing is negligible); and (c) total vertical normal stress is equal to the ice overburden pressure. Under these assumptions, the 3D momentum balance can be depth averaged along the E z -dimension, and by using the constitutive law in Equation A4, the approximate 2D momentum balance is expressed as (identical to the main text): where E h is the ice thickness,  bx E and  by E represent the E x -and E y -components of basal shear stress, and E s is the ice surface elevation. The vertical velocity component E w can be recovered using the incompressibility condition. The above momentum balance states that the gravitational horizontal driving stresses of ice flow (terms on the right-hand side) are balanced by a combination of horizontal gradients of deviatoric stresses and drag at the base of the glacier,  b E .

A2. 1D Shallow Ice Stream Model Boundary Conditions
Following previous work on 1D flowline models, we enforce two Neumann boundary conditions at the edges of the spatial domain (Nick et al., 2009;Vieli & Payne, 2005). At the ice divide boundary condition (  0 E x ), a symmetric ice sheet is assumed such that    u x / 0 . At the grounding line (assuming no ice shelf), the boundary condition is derived from the difference between the hydrostatic pressure of water and ice: where  w E is the density of ocean water and s E f is a scalar factor used to apply a time-varying force on the calving face (Nick et al., 2009). While Equation A7 is not strictly applicable to marine-terminating ice streams with no ice shelf, it provides a convenient way to apply longitudinal stress perturbations originating at the terminus. Thus, we can generate time-dependent ice velocity and thickness fields representative of those observed at tidewater glaciers that respond to changes in regional oceanic and climate conditions. For all 1D simulations, we use a flow rate factor      24 1 3 1.2 10 s Pa E A , which corresponds to a temperature of - 5 E C and an exponent  3 E n (Cuffey & Paterson, 2010). We solve these equations in a staggered fashion by solving for E u in Equation 3 under the stated boundary conditions for a given thickness profile, E h , using Newton's method. Mass continuity gives the time evolution of ice thickness, E h : where E a is the surface mass balance (difference between snow accumulation and ablation) and  E q hu is the width-averaged ice flux. Thus, we implement a forward Euler step for Equation A8 to update the thickness profile.

Appendix B: Network Architecture
We use feedforward networks for all neural networks in this work. The hidden layers have the form  E Wx b followed by an activation with a hyperbolic tangent (tanh) function. During our experiments, we found that activation functions that were at least twice differentiable (e.g., tanh or exponential rectified linear units (ELU)) resulted in smoother spatial gradients of output variables than the rectified linear unit (ReLU). We found very little difference in training convergence speed between tanh and ELU activations, so we use tanh throughout this work. The outputs of all networks are linear (i.e., no activation is applied).
Neural networks tasked with reconstructing velocity and thickness observations were prescribed 4-6 hidden layers where each hidden layer consisted of 50 or 100 hidden units. The exact architectures varied for each problem and were qualitatively chosen based on a balance between reconstruction accuracy, spatial smoothness of the reconstruction, and computational efficiency. Regardless, the tradeoffs between the metrics were minor, and the data reconstructions for all network architectures were largely consistent. The neural networks for basal drag predictions were prescribed 4 hidden layers where each hidden layer consisted of 50 units. In this way, we effectively applied more regularization for these networks as compared with the data networks since our prior assumption for the spatial distribution of basal shear stress is one that is smooth.

B1. Training
Weight matrices for all networks are initialized from a normal distribution with variances specified by s a  1/ where E a is the number of input hidden units for each layer. Inputs to all networks are normalized to be zero-mean with unit variance. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.0002 and train for 500-1,000 epochs (each epoch is defined as a complete pass through the training data). We use the Python API for TensorFlow (Abadi et al., 2015) and TensorFlow Probability (Dillon et al., 2017) for neural network construction and training.