California Reservoir Drought Sensitivity and Exhaustion Risk using
Statistical Graphical Models
Armeen Taeb^{a,1}, J. T. Reager^{b}, Michael Turmon^{b}, and Venkat Chandrasekaran^{a}

^{a} California Institute of Technology, Pasadena, CA, 91125
^{b} Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, 91109
December 2, 2015
Abstract
The ongoing California drought has highlighted the potential vulnerability of state water
management infrastructure to multi-year dry intervals. Due to the high complexity of the
network, dynamic storage changes across the California reservoir system have been difficult
to model using either conventional statistical or physical approaches. Here, we analyze the
interactions of monthly volumes in a network of 55 large California reservoirs, over a period
of 136 months from 2004 to 2015, and we develop a latent-variable graphical model of their
joint fluctuations. We achieve reliable and tractable modeling of the system because the model
structure allows unique recovery of the best-in-class model via convex optimization with control
of the number of free parameters. We extract a statewide ‘latent’ influencing factor which turns
out to be highly correlated with both the Palmer Drought Severity Index (PDSI, $\rho \approx 0.86$) and
hydroelectric production ($\rho \approx 0.71$). Further, the model allows us to determine system health
measures such as exhaustion probability and per-reservoir drought sensitivity. We find that
as PDSI approaches -6, there is a probability greater than 50% of simultaneous exhaustion of
multiple large reservoirs.
1 Introduction
The state of California depends on a complex water management system to meet wide-ranging water
demands across a large, hydrologically diverse domain. As part of this infrastructure, California
has constructed 1530 reservoirs with a collective storage capacity equivalent to a year of mean
runoff from California rivers [9]. The purpose of this system is to create water storage capacity
and extend seasonal water availability to meet national agricultural, residential, industrial, power
generation, and recreational needs.
Major statewide California precipitation deficits during water years 2012–2015 rivaled the most
intense 4-year droughts in the past 1200 years [10]. The drought was punctuated by low snowpack
in the Sierra Nevada, declining groundwater storage, fallowed agricultural lands, and lower reser-
voir storage [1, 7, 12]. How long these conditions will persist is unknown, and recent research has
explored whether their origin is ascribable to natural climate variability alone or has been increas-
ingly exacerbated by anthropogenic influence [5, 17, 13]. A strong 2015–2016 El Niño was expected
to bring some relief for drought conditions [11], but the 2016 precipitation through Spring was
not enough to substantially decrease statewide drought severity. These events illustrate that there
exists some sensitivity of California reservoirs to drought conditions, with resulting implications for
national water and agricultural security. Yet this sensitivity has been difficult to quantify due to
the size and complexity of the reservoir network.
Recent research has explored models to determine the potential for future exhaustion of US
reservoirs such as Lake Mead (e.g., [2, 3]). Such approaches yield potentially valuable information
to water managers, but the underlying approach used to statistically model a single reservoir is
much simpler than an approach for modeling a large network of hundreds of reservoirs, whose complex
management is based on multiple economic and sectoral objectives [12]. The hard-to-quantify
influence of human operators and lack of system closure have made the modeling and prediction
of reservoir storage behavior using physical equations challenging, for example in hydrology or
climate models [14]. Thus, it is difficult to say with any certainty how the complex network of
California reservoirs responds hydrologically to drought severity, and at what point exhaustion
becomes feasible.
We present a statistical approach to the problem of modeling California's complex reservoir
network. We model the network variables with a statistical graphical model and use the historical
system behavior to estimate network interactions by solving a constrained maximum-likelihood
estimation problem. The robustness and predictive accuracy of the model are greatly improved
by imposing sparsity constraints; the form of these constraints allows application of new convex
optimization tools to uniquely solve the resulting estimation problem.
In its general form, the model allows automatic recovery of unobserved latent variables that
influence network behavior, as well as the introduction of key forcing variables, most significantly
the drought severity index. This approach allows for a mathematical simplification of a complex network
so that network behavior and response to forcing can be better understood than before. The learned
model allows us to make statistically significant statements about the sensitivity of a representative
subset of California's reservoirs to drought severity, and the resulting conditions leading to increased
probability of exhaustion in the future.
2 Data
Our primary dataset is monthly averages of reservoir volumes, derived from daily time series of
volumes downloaded from the California Data Exchange Center (CDEC). We also used secondary
data for some covariates.
2.1 Reservoir Time Series
We began with a list of all 60 California reservoirs having daily data during the period of study
(January 2004 – April 2015). From this list, we excluded the five reservoirs with more than half
of their values undefined or zero, leaving 55 reservoirs. This list of daily values was inspected
using a simple continuity criterion and approximately 50 specific values were removed or corrected.
Corrections were possible in six cases because values had misplaced decimal points, but all other
detected errors were set to missing values. The most common error modes were missing values that
were recorded as zero volume, and a burst of errors in the Lyons reservoir during late October 2014
that seems to be due to a change in recording method at that time.
The final set of 55 reservoir volume time series spans 4595 days over the 136 months in the
study period. It contains two full cycles of California drought (roughly, 2004–2010 and 2010–
present). Four California hydrological zones are represented, with 25, 20, 6, and 4 reservoirs in the
Sacramento, San Joaquin, Tulare, and North Coast zones, respectively.
The volumetric data was averaged from daily down to monthly values, and its strong annual
component was removed with a per-month, per-station average, leaving a volume anomaly with
[Figure 1: two panel diagrams, (a) and (b), over reservoirs in the Sacramento, San Joaquin, and Tulare zones, including Shasta, Antelope, Almanor, Don Pedro, New Melones, Buchanan, Isabella, and Pine Flat.]
Figure 1: Graphical structure over a collection of reservoirs: (a) without latent variables, (b) with latent
variables. Green nodes represent reservoirs (variables) and the clouded green node represents latent
variables. Solid blue lines represent edges between reservoirs, and dotted lines represent edges between
reservoirs and latent variables. The reservoirs have been grouped according to hydrological zones.
136 monthly observations to be modeled statistically. Before being used in the fitting algorithms
of Section 3, each time series is also rescaled by its standard deviation so that each series has unit
variance. The volume anomaly, being a de-trended average of small daily fluctuations, is plausibly
modeled by a Gaussian frequency law.
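
As a concrete illustration of this preprocessing, the sketch below computes monthly volume anomalies from a daily table. The use of pandas and the assumed column layout (a date-indexed table with one column per reservoir) are choices made for illustration, not a description of the authors' pipeline.

```python
# Sketch of the Section 2.1 preprocessing: monthly means, per-month
# climatology removal, and unit-variance scaling. The table layout is an
# assumption about the CDEC export, not a documented schema.
import pandas as pd

def monthly_anomalies(daily: pd.DataFrame) -> pd.DataFrame:
    """daily: DatetimeIndex x reservoirs, daily volumes (NaN where missing)."""
    monthly = daily.resample("MS").mean()                       # daily -> monthly averages
    climatology = monthly.groupby(monthly.index.month).transform("mean")
    anomaly = monthly - climatology                             # remove the annual cycle
    return anomaly / anomaly.std()                              # unit variance per series
```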
2.2 Covariate Time Series
Our model and analysis make use of ancillary data, i.e., covariates, which are observable variables,
exogenous to the model, that may typically affect a large fraction of reservoirs. The particular
covariates we used are temperature, snowpack, PDSI, number of agricultural workers, hydroelectric
power, and consumer price index (CPI). We defined metrics for statewide snowpack (in high-altitude
regions) as well as zone-specific snowpack. We apply a time lag of two months to the temperature,
snowpack, and PDSI covariates. As with the reservoir time series, we remove seasonal patterns
with a per-month average, and scale each time series to have unit variance.
3 Methods
We introduce three modeling techniques that are employed on the reservoir dataset. We denote
the 55 reservoir volumes by $y \in \mathbb{R}^{55}$, and the nine covariates described in Section 2 by $c \in \mathbb{R}^{9}$. We
denote the 136 monthly observations by $\mathcal{D}_n = \{y^{(i)}\}_{i=1}^{n} \subset \mathbb{R}^{55}$. In what follows, we assume zero-mean
variables, or equivalently, that any known mean has been subtracted from the observations — assured
in our case by preprocessing that subtracts a climatology from $y$ and $c$.
3.1 Graphical modeling
A natural framework for modeling a large network of reservoirs is a Gaussian graphical model
(defined with respect to a graph) where nodes index reservoirs and edges denote dependencies be-
tween reservoirs. The absence of an edge between a pair of reservoirs implies that the corresponding
reservoir volumes are independent conditioned on the volumes of the remaining reservoirs in the
network. Figure 1(a) is an example of a graphical model over a collection of reservoirs in three
hydrological zones: Sacramento, San Joaquin, and Tulare. In comparison to graphical modeling,
standard linear regression techniques suffer from over-fitting in this setting, where the number of
observations is moderate relative to the number of reservoirs. We demonstrate this point in
the supporting information (SI).
The structural properties of a graphical model are encoded in the sparsity pattern of the inverse
covariance matrix (the precision matrix) over reservoirs. Specifically, consider a Gaussian graphical
model for reservoir volumes $y$ with a covariance matrix $\Sigma$. (Both $\Sigma$ and the corresponding precision
matrix $\Theta = \Sigma^{-1}$ must be symmetric, written $\Theta \in \mathbb{S}^{55}$, and positive definite, written $\Theta \succ 0$.) Then,
an edge between reservoirs $r$ and $r'$ is present in the graph if and only if $\Theta_{r,r'} \neq 0$, with larger
magnitudes indicating stronger interactions. Equivalently, the volumes of reservoirs $r$ and $r'$ are
conditionally independent if and only if $\Theta_{r,r'} = 0$. Thus, learning a Gaussian graphical model is
equivalent to estimating a sparse precision matrix $\Theta$ [18, 8].
3.2 Latent variable graphical modeling
A drawback with graphical modeling is that we do not observe all relevant phenomena: some
globally-influential quantities may be hidden or latent variables. Without accounting for these
latent variables, there will be confounding dependencies between many reservoirs. For example,
water held by a collection of nearby reservoirs might be influenced by a common snowpack variable.
Without observing this common variable, all reservoirs in this set would appear to have mutual
links, whereas if snowpack is included in the analysis, the common behavior is explained by a link
to the snowpack variable. Accounting for latent structure removes these confounding dependencies
and leads to sparser interactions between reservoirs. Figure 1(b) illustrates this point. Recent
work [4] developed a principled approach, termed latent-variable graphical modeling, to account
for latent variables. Fitting a latent-variable graphical model corresponds to representing Θ (the
precision matrix of the reservoir volumes) as the sum of a sparse and a low-rank matrix, rather
than just a sparse matrix, as in the case of pure graphical modeling. Here the low-rank component
of Θ accounts for the effect of the latent variables, and its rank is the number of latent variables.
The sparse component of Θ specifies the conditional graphical model structure underlying the
distribution of reservoir volumes conditioned on the latent variables.
Based on this decomposition of $\Theta$, a natural model selection method for learning a latent-variable
graphical model is via a regularized maximum-likelihood estimator. Specifically, we fit the $n$
observations $\mathcal{D}_n = \{y^{(i)}\}_{i=1}^{n} \subset \mathbb{R}^{55}$ to a latent-variable graphical model using the estimator:

$$(\hat{S}, \hat{L}) \;=\; \operatorname*{arg\,min}_{S,\,L \,\in\, \mathbb{S}^{55}} \; -\ell(S - L;\, \mathcal{D}_n) \;+\; \lambda\left(\gamma \|S\|_1 + \operatorname{tr}(L)\right) \quad \text{s.t.} \;\; S - L \succ 0,\; L \succeq 0. \tag{1}$$
The term $\ell(\Theta; \mathcal{D}_n)$ is the Gaussian log-likelihood of $\Theta$, which, after discarding additive constants
and scaling, is the concave function

$$\ell(\Theta; \mathcal{D}_n) \;=\; \log\det(\Theta) \;-\; \operatorname{tr}\!\left[\Theta \cdot \frac{1}{n}\sum_{i=1}^{n} y^{(i)} {y^{(i)}}'\right]. \tag{2}$$
The constraints $S - L \succ 0$ and $L \succeq 0$ impose positive-definiteness and positive semi-definiteness,
respectively. Here $\hat{S}$ provides an estimate for the sparse component of $\Theta$ (corresponding to the
conditional precision matrix of reservoir volumes), and $\hat{L}$ provides an estimate for the low-rank
component of $\Theta$ (corresponding to the effect of latent variables on reservoir volumes). We note that
constraining $L$ to be the zero matrix in (1) provides an estimator for learning a sparse Gaussian
graphical model; see [8] for further details. The function $\|\cdot\|_1$ denotes the $L_1$ norm (element-wise
sum of absolute values) that promotes sparsity in the matrix $S$. This regularizer has successfully been
employed in many settings for fitting structured sparse models to high-dimensional data (see the
survey [16] and the references therein). The role of the trace penalty on $L$ is to promote low-rank
structure [6]. The regularization parameter $\gamma$ provides a trade-off between the graphical model
component and the latent component, and the regularization parameter $\lambda$ provides overall control of
the trade-off between the fidelity of the model to the data and the complexity of the model. For
$\lambda, \gamma \geq 0$, the regularized maximum likelihood estimator of (1) is a convex program with a unique
optimum. Further theoretical support for this estimator is presented in [4].
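
Because (1) is a convex program, it can be prototyped directly with a general-purpose modeling tool. The sketch below expresses the estimator in cvxpy under the paper's notation; the function name, the SCS solver choice, and the absence of any problem-specific scaling are assumptions of this illustration, not the authors' implementation.

```python
# Minimal sketch of the regularized maximum-likelihood estimator (1) in cvxpy.
# This is an illustrative formulation, not the solver used in the paper.
import cvxpy as cp
import numpy as np

def fit_latent_variable_gm(Y, lam, gamma):
    """Y: (n, p) array of zero-mean, unit-variance volume anomalies.
    lam, gamma: nonnegative regularization parameters (lambda, gamma in the text)."""
    n, p = Y.shape
    Sigma_hat = (Y.T @ Y) / n                       # sample covariance, as in Eq. (2)

    S = cp.Variable((p, p), symmetric=True)         # sparse component
    L = cp.Variable((p, p), symmetric=True)         # low-rank component

    # Negative Gaussian log-likelihood of Theta = S - L, from Eq. (2).
    neg_loglik = -cp.log_det(S - L) + cp.trace((S - L) @ Sigma_hat)
    # l1 penalty promotes sparsity in S; trace penalty promotes low rank in L.
    penalty = lam * (gamma * cp.sum(cp.abs(S)) + cp.trace(L))

    problem = cp.Problem(cp.Minimize(neg_loglik + penalty), [L >> 0])
    problem.solve(solver=cp.SCS)   # S - L > 0 is implicit in the domain of log_det
    return S.value, L.value
```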
3.3 Latent Variable Structure
Latent variable graphical modeling identifies the effect that underlying latent variables have on
the reservoir volumes. For applications, the following approach to associating semantics with these
latent variables can aid understanding of the estimated model. Suppose we learn a latent-variable
graphical model with $k$ latent variables using the methodology just described. Let $z \in \mathbb{R}^{k}$ denote the
latent variables and $y \in \mathbb{R}^{55}$ denote reservoir volumes; further, partition the joint precision matrix
of $(y, z)$ as

$$\tilde{\Theta} = \begin{pmatrix} \tilde{\Theta}_{y} & \tilde{\Theta}_{zy}' \\ \tilde{\Theta}_{zy} & \tilde{\Theta}_{z} \end{pmatrix}.$$

A natural approximation for the time series of $z$ given observations $\mathcal{D}_n$ is the conditional mean:

$$\tilde{z}^{(i)} \;=\; \mathbb{E}\big[z^{(i)} \,\big|\, y^{(i)}\big] \;=\; \tilde{\Theta}_{z}^{-1} \tilde{\Theta}_{zy}\, y^{(i)}. \tag{3}$$
If $\tilde{\Theta}_{z}$ and $\tilde{\Theta}_{zy}$ were explicitly known, the length-$n$ time series $\{\tilde{z}^{(i)}\}_{i=1}^{n} \subset \mathbb{R}^{k}$ would provide an
estimate of the latent variables given observations $\mathcal{D}_n$. As discussed in [4], the low-rank component
in the decomposition of the marginal precision matrix of $y$ is $L = \tilde{\Theta}_{zy}' \tilde{\Theta}_{z}^{-1} \tilde{\Theta}_{zy}$. However, having
an estimate for $L$ does not uniquely identify $\tilde{\Theta}_{z}^{-1} \tilde{\Theta}_{zy}$. Indeed, for any non-singular $A \in \mathbb{R}^{k \times k}$, one
can transform $\tilde{\Theta}_{z} \to A \tilde{\Theta}_{z} A'$ and $\tilde{\Theta}_{zy} \to A \tilde{\Theta}_{zy}$ without altering $L$. In terms of $z$, these observations
imply that for any non-singular $A$, $\{A^{-1} \tilde{z}^{(i)}\}_{i=1}^{n}$ is an equivalent realization of the latent variable
time series: $z$ is recoverable only up to a non-singular transformation.
Nevertheless, the structure of the low-rank matrix $L$ places a constraint on the effect of the latent
variables $z$ on $y$. Let $\tilde{Z} \in \mathbb{R}^{n \times k}$ denote a (non-unique) realization of latent variable observations.
As we have seen, $\tilde{Z} A'^{-1}$ is an equivalent realization. The key invariant is the column-space of $\tilde{Z}$,
a $k$-dimensional linear subspace of $\mathbb{R}^{n}$, which we term the latent space; it can be recovered as
follows.
Letting $Y \in \mathbb{R}^{n \times 55}$ denote the observations of reservoir volumes, (3) becomes $\tilde{Z} = Y \tilde{\Theta}_{zy}' \tilde{\Theta}_{z}^{-1}$. Since
the column-space of $Y \tilde{\Theta}_{zy}' \tilde{\Theta}_{z}^{-1}$ is equal to the column-space of $YL$, one set of basis elements for
the latent space is the $k$ left singular vectors of the matrix $YL$, which can be readily computed.
We interpret the underlying latent variables by linking observed covariates to this latent space.
Specifically, we evaluate the correlation of a covariate with the latent variables by computing the
energy of the projection of the time series of the covariate onto the latent space. Covariates
with large correlation represent interpretable components of the underlying latent variables: these
covariates have significant overlap with the latent space.
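
The two computations just described, recovering a basis for the latent space from $YL$ and scoring a covariate by the energy of its projection onto that space, reduce to an SVD and an orthogonal projection. The sketch below is a minimal numpy version, under the assumption that $Y$, $L$, and the covariate series have already been deseasonalized and standardized as in Section 2.

```python
# Sketch: latent-space basis from Y @ L, and projection energy of a covariate.
import numpy as np

def latent_space_basis(Y, L, k):
    """Basis for the latent space: the k leading left singular vectors of Y @ L."""
    U, _, _ = np.linalg.svd(Y @ L, full_matrices=False)
    return U[:, :k]

def projection_energy(c, U):
    """Fraction of the energy of a (zero-mean, length-n) covariate series c
    captured by its projection onto the columns of U (assumed orthonormal)."""
    proj = U @ (U.T @ c)
    return float(proj @ proj) / float(c @ c)
```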
3.4 Conditional latent-variable graphical modeling
We provided an approach for linking a latent variable space defined by $L$ to observed covariates.
We can extend our modeling framework to incorporate the covariates $x \in \mathbb{R}^{q}$ that are most correlated
with the latent space, where $q$ is the number of covariates included from the overall collection $c \in \mathbb{R}^{9}$
described in Section 2. Since the covariates $x$ can account for the effect of some of the latent
variables in the previous modeling framework, the distribution of $y$ given $x$ may still depend on a
few residual latent variables. Therefore, we fit a latent-variable graphical model to the conditional
distribution of $y \mid x$.
In the SI, we discuss how the convex program of (1) can be modified for fitting a conditional
latent-variable graphical model to the augmented data $\mathcal{D}_n^{+} = \{(y^{(i)}, x^{(i)})\}_{i=1}^{n} \subset \mathbb{R}^{55+q}$. This estimator
produces three components: $\hat{\Theta}$, $\hat{L}_{y}$, and $\hat{S}_{y}$. The matrix $\hat{\Theta} \in \mathbb{S}^{55+q}$ is an estimate for the joint
precision matrix of $(y, x)$, and in particular, its submatrix $\hat{\Theta}_{y}$ is an estimate for the conditional
precision matrix of the reservoirs given the covariates. The submatrix $\hat{\Theta}_{y}$ can be decomposed as
$\hat{\Theta}_{y} = \hat{S}_{y} - \hat{L}_{y}$. The matrix $\hat{L}_{y} \in \mathbb{S}^{55}$ is an estimate for the effect of residual latent variables on $y \mid x$,
with the rank of $\hat{L}_{y}$ equal to the number of residual latent variables. Finally, $\hat{S}_{y} \in \mathbb{S}^{55}$ is an estimate
of the precision matrix of $y \mid x$ conditioned on these residual latent variables.
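
The quantities produced by this conditional estimator can be unpacked with straightforward linear algebra. The following sketch, which assumes the reservoirs occupy the first 55 rows and columns of the estimated joint precision matrix, counts the residual latent variables via the rank of $\hat{L}_y$ and lists the conditional edges encoded in $\hat{S}_y$.

```python
# Sketch: unpack the output of the conditional latent-variable estimator.
# The index layout (reservoirs first) is an assumption of this illustration.
import numpy as np

def conditional_structure(Theta_hat, S_y, L_y, n_res=55, tol=1e-6):
    Theta_y = Theta_hat[:n_res, :n_res]            # conditional precision of y given x
    # Sanity check of the decomposition Theta_y = S_y - L_y (up to solver tolerance).
    decomposition_gap = np.max(np.abs(Theta_y - (S_y - L_y)))
    # Number of residual latent variables = rank of L_y.
    n_residual_latent = int(np.sum(np.linalg.eigvalsh(L_y) > tol))
    # Conditional graphical-model edges among reservoirs given the covariates.
    edges = [(r, rp) for r in range(n_res) for rp in range(r + 1, n_res)
             if abs(S_y[r, rp]) > tol]
    return n_residual_latent, edges, decomposition_gap
```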
4 Results
We now apply the methods described above to the reservoir system. For our experiments, we set
aside monthly observations of reservoir volumes and covariates from August 2005 – August 2013 as
a training set ($n_{\mathrm{train}} = 96$) and monthly observations from January 2004 – July 2005 and September
2013 – April 2015 as a (disjoint) testing set ($n_{\mathrm{test}} = 40$). Let $\mathcal{D}_{\mathrm{train}}$ and $\mathcal{D}_{\mathrm{train}}^{+}$, respectively, be
the training set of reservoir volumes and that of reservoir volumes augmented with covariate data;
$\mathcal{D}_{\mathrm{test}}$ and $\mathcal{D}_{\mathrm{test}}^{+}$ are the corresponding testing sets. The particular covariates that are included in
the augmented datasets $\mathcal{D}_{\mathrm{train}}^{+}$ and $\mathcal{D}_{\mathrm{test}}^{+}$ are discussed later.
Recall that, for each variable, we remove seasonal components by subtracting a per-month
average as computed on the training set, leaving an anomaly that we model as Gaussian. We
scale time series to have unit variance on the training observations, and then apply the same
transformation to the testing set. The convex programs associated with each modeling framework
are regularized max-det programs and are solved in polynomial time using general-purpose solvers
[15].
4.1 Graphical Modeling
We begin by learning a graphical model over the reservoirs in our network. Specifically, we use
$\mathcal{D}_{\mathrm{train}}$ as input to the appropriate convex program — this is the program (1) with $L$ constrained
to be the zero matrix. We select the single regularization parameter $\lambda$ ($\gamma$ is set to 1 without loss of
generality) by cross-validation with $\mathcal{D}_{\mathrm{test}}$ to learn a graphical model consisting of 385 edges. Recall
from Section 3 that the sparsity pattern of the precision matrix $\hat{\Theta}$ encodes the dependence structure
between reservoirs, so that a zero entry in the precision matrix implies that the corresponding
reservoir volumes are independent conditioned on the volumes of the remaining reservoirs in the
network. We define the strength of an edge as the normalized magnitude of the corresponding
precision matrix entry, that is,

$$s(r, r') \;=\; \frac{|\Theta_{r,r'}|}{(\Theta_{r,r}\,\Theta_{r',r'})^{1/2}} \;\geq\; 0.$$
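
As a small illustration, the edge strengths $s(r, r')$ can be computed in one step from any estimated precision matrix; the sketch below assumes a dense symmetric numpy array and returns the full matrix of strengths with zeros on the diagonal.

```python
# Sketch: normalized edge strengths s(r, r') = |Theta_rr'| / sqrt(Theta_rr * Theta_r'r').
import numpy as np

def edge_strengths(Theta):
    d = np.sqrt(np.diag(Theta))              # sqrt of diagonal (positive for Theta > 0)
    S = np.abs(Theta) / np.outer(d, d)       # normalize each entry by its diagonal pair
    np.fill_diagonal(S, 0.0)                 # the diagonal is not an edge
    return S
```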
The lower triangle of Figure 2 shows the dependence relationships between reservoirs in this graph-
ical model. The five strongest edges in this graphical structure are between reservoirs Relief–Main
Strawberry, Cherry–Hetch Hetchy, Little Grass Valley–Lake Valley, Almanor–Davis, and Coyote
Valley–Warm Spring. The presence of these strong edges is sensible as each such edge is between
reservoirs in the same hydrological zone, and 4 of these 5 edges are between pairs of reservoirs fed
by the same river basin.
We also observe that the majority of interactions in this graphical model are among reservoirs
that have similar volumetric capacities. Specifically, 80% of the edges are between reservoirs that
are within a factor of 5 in capacity. As a point of comparison, consider a complete graph with
all pairwise interactions. In this graphical structure, 55% of the edges are among reservoirs that
are within a factor of 5 in capacity. Thus, graphical modeling removes the dependencies between
reservoirs of vastly different capacity. This is expected since reservoirs with substantially different