of 19
RESEARCH ARTICLE
10.1002/2017WR020412
A Statistical Graphical Model of the California Reservoir System
A. Taeb
1
, J. T. Reager
2
, M. Turmon
2
, and V. Chandrasekaran
3
1
Department of Electrical Engineering, California Institute of Technology, Pasadena, CA, USA,
2
Jet Propulsion Laboratory,
California Institute of Technology, Pasadena, CA, USA,
3
Department of Computing and Mathematical Sciences and
Department of Electrical Engineering, California Institute of Technology, Pasadena, CA, USA
Abstract
The recent California drought has highlighted the potential vulnerability of the state’s water
management infrastructure to multiyear dry intervals. Due to the high complexity of the network, dynamic
storage changes in California reservoirs on a state-wide scale have previously been difficult to model using
either traditional statistical or physical approaches. Indeed, although there is a significant line of research
on exploring models for single (or a small number of) reservoirs, these approaches are not amenable to a
system-wide modeling of the California reservoir network due to the spatial and hydrological
heterogeneities of the system. In this work, we develop a state-wide statistical graphical model to
characterize the dependencies among a collection of 55 major California reservoirs across the state; this
model is defined with respect to a graph in which the nodes index reservoirs and the edges specify the
relationships or dependencies between reservoirs. We obtain and validate this model in a data-driven
manner based on reservoir volumes over the period 2003–2016. A key feature of our framework is a
quantification of the effects of external phenomena that influence the entire reservoir network. We further
characterize the degree to which physical factors (e.g., state-wide Palmer Drought Severity Index (PDSI),
average temperature, snow pack) and economic factors (e.g., consumer price index, number of agricultural
workers) explain these external influences. As a consequence of this analysis, we obtain a system-wide
health diagnosis of the reservoir network as a function of PDSI.
1. Introduction
1.1. Motivation
The state of California depends on a complex water management system to meet wide-ranging water
demands across a large, hydrologically diverse domain. As part of this infrastructure, California has con-
structed 1,530 reservoirs having a collective storage capacity equivalent to a year of mean run off from Cali-
fornia rivers (Graf, 1999). The purpose of this system is to create water storage capacity and extend seasonal
water availability to meet agricultural, residential, industrial, power generation, and recreational needs.
Major state-wide California precipitation deficits during the years 2012–2015 rivaled the most intense 4 year
droughts in the past 1,200 years (Griffin & Anchukaitis, 2014). The drought was punctuated by low snow
pack in the Sierra Nevada, declining groundwater storage, and fallowed agricultural lands, in addition to sig-
nificantly diminished reservoir levels (AghaKouchak et al., 2014; Famiglietti, 2014; Howitt et al., 2014). This
sensitivity of the California reservoir network to external conditions (e.g., temperature, precipitation) has
implications for state-wide water and agricultural security. In this paper, we seek a characterization of the
relationships among the major California reservoirs and their sensitivity to state-wide physical and eco-
nomic factors, with a view to investigating and quantifying the likelihood of systemic catastrophes such as
the simultaneous exhaustion of multiple large reservoirs.
Such an analysis has been difficult to carry out on a system-wide scale due to the size and complexity of the
reservoir network. In one direction, a body of work has focused on characterizing the behavior of a small
collection of reservoirs using
physical laws
(e.g., Christensen & Lettenmaier, 2004); Christensen et al., 2006;
Nazemi & Wheater, 2015). Such approaches quickly become intractable in settings with large numbers of
reservoirs whose complex management is based on multiple economic and sectoral objectives (Howitt
et al., 2014). The hard-to-quantify influence of human operators and the lack of system closure have made
the modeling and prediction of reservoir network behavior using physical equations challenging in
Key Points:

We develop a statistical graphical
model to characterize the statewide
California reservoir system

We quantify the influence of external
physical and economic factors (e.g.,
statewide PDSI and consumer price
index) on the reservoir network

Further analysis gives a system-wide
health diagnosis as a function of
PDSI, indicating when heavy
management practices may be
needed
Supporting Information:

Supporting Information S1

Supporting Information S2
Correspondence to:
A. Taeb,
ataeb@caltech.edu
Citation:
Taeb, A., Reager, J. T., Turmon, M., &
Chandrasekaran, V. (2017). A statistical
graphical model of the California
reservoir system.
Water Resources
Research
,
53
, 9721–9739. https://doi.
org/10.1002/2017WR020412
Received 13 JAN 2017
Accepted 14 OCT 2017
Accepted article online 20 OCT 2017
Published online 27 NOV 2017
V
C
2017. American Geophysical Union.
All Rights Reserved.
TAEB ET AL.
GRAPHICAL MODELING OF RESERVOIRS
9721
Water Resources Research
PUBLICATIONS
hydrology and climate models (Solander et al., 2016). In a different direction, numerous works have devel-
oped
empirical techniques
for modeling the behavior of a small number of reservoirs (e.g., Ashaary et al.,
2015; Barnett & Pierce, 2008; Bazartseren et al., 2003; Chen & Liu, 2015; Cheng et al., 2015; Hoerling & Eisc-
heid, 2007; Kuria & Vogel, 2015; Linares-Rodriguez, et al., 2015; Liu & Chung, 2014; Marton, et al., 2015; Nash
& Gleick, 1991, 1993; Phatafod, 1989; Revelle & Waggoner, 1983; Wisser et al., 2010; Yang et al., 2016, 2017;
Zhang et al., 2017). However, these methods are not directly applicable to modeling a large reservoir net-
work, as the water levels of major reservoirs in California exhibit complex interactions and are statistically
correlated with one another (as is demonstrated by our analysis). This necessitates a proper quantification
of the complex dependencies among reservoirs in determining the systemic characteristics of the reservoir
network.
The focus of this work is to develop a state-wide model over the California reservoir network that addresses
the following scientific questions:
1. What are the interactions or dependencies among reservoir volumes? In particular, how correlated are
major reservoirs in the system?
2. Are there common external factors influencing the network globally? Could these external factors cause
a system-wide catastrophe?
To the best of our knowledge, this work is the first that attempts such a state-wide characterization of the
California reservoir network. The state-wide external factors that we consider in our analysis include physical
factors such as state-wide PDSI and average temperature, and economic factors such as the consumer price
index and the number of agricultural workers. The focus on these state-wide external influences is driven
by the global nature of our analysis; indeed, an exciting direction for further research is to complement our
global model with local reservoir-specific factors to obtain an integrated picture of both systemic as well as
local risks to the reservoir network.
Answering these questions for the California reservoir system raises a number of challenges, and it is impor-
tant that any modeling framework that we consider addresses these challenges. First, reservoirs with similar
hydrological attributes (e.g., altitude, drainage area, spatial location) tend to behave similarly. As an exam-
ple, a pair of reservoirs that is approximately at the same altitude or in the same hydrological zone are more
likely to have a stronger correlation than those in different altitudes/zones. Therefore, we seek a framework
that ably models the complex heterogeneities in the reservoir system. A second challenge, which is in some
sense in competition with the first one, is that compactly specified models are much more preferable to
less succinct models, as concisely described models are often more interpretable and avoid problems asso-
ciated with
over-fitting
. Finally, it is crucial that models with both of the preceding attributes have the addi-
tional feature that they can be identified in a computationally efficient manner.
1.2. Approach and Results
Gaussian
graphical models
offer an appealing and conceptually powerful framework with all the attributes
just described. Graphical modeling is a prominent multivariate analysis technique that has been successfully
employed in domains as varied as gene regulatory network analysis, social networks, speech recognition,
and computer vision (see Jordan, 2004, for a survey on graphical modeling). These models are defined with
respect to graphs, with nodes of a graph indexing variables and the edges specifying statistical dependen-
cies among these variables. In a reservoir modeling context, the nodes of the graph correspond to reser-
voirs and an edge between two reservoirs would describe the strength of the interaction between the
levels of those reservoirs. Formally, the strength of an edge specifies the degree of conditional dependence
between the corresponding reservoirs; in other words, this is the dependence between two reservoirs con-
ditioned on all the other reservoirs in the network. Informally, an edge in a graphical model denotes the
extent to which two reservoirs remain correlated even after accounting for the influence of all the other res-
ervoirs in the network. We illustrate these points using a toy example of a graphical model over a collection
of eight reservoirs, shown in Figure 1a. (This figure is purely for explanatory purposes rather than a factual
representation of the complex dependencies among reservoirs, which we obtain in section 3). One can
imagine that the reservoir volume of Shasta (which is at a high elevation in northern California in the Sacra-
mento hydrological zone) is independent of the reservoir Pine Flat and the reservoir Isabella (which are in
southern California in the Tulare hydrological zone) after conditioning on volumes of reservoirs in the cen-
tral portion of the state (e.g., Black Butte, Lake Berrysa, New Melones, Buchanan, and Don Pedro). These
Water Resources Research
10.1002/2017WR020412
TAEB ET AL.
GRAPHICAL MODELING OF RESERVOIRS
9722
relationships are encoded in a graphical model of Figure 1a. In particular, note that Shasta has an edge link-
ing it to each of the reservoirs {Black Butte, Lake Berrysa, Don Pedro, New Melones, Buchanan}, but does
not have an edge connecting it to the reservoirs {Pine Flat, Isabella}. Figure 1a is, of course, a cartoon dem-
onstration of a graphical modeling framework. In practice, identifying conditional dependencies between
pairs of reservoirs in large networks such as the one considered in our work is a challenging problem, and
we describe tractable approaches to learning such a graphical structure underlying the complex California
reservoir system in a completely data-driven manner in section 3. To the best of our knowledge, this is the
first work that applies graphical modeling techniques to model reservoirs or other water resources.
The graphical modeling framework provides a common lens for viewing two frequently employed statistical
techniques. On the one hand, a classical approach for obtaining a multivariate Gaussian distribution over
reservoir volumes is via a maximum likelihood estimator. This estimator has been widely used in various
domains in the geophysical sciences for multivariate analysis of a collection of random variables (Wackerna-
gel, 2003). The model obtained by this maximum likelihood estimator is specified by a completely con-
nected graphical structure, where all reservoirs are conditionally correlated given all other reservoirs. On the
other hand, an independent reservoir model analyses the behavior of an individual reservoir independently
of the other reservoirs in the network. This model results in a fully disconnected graphical model. In this
paper, we learn a statistical graphical model over the reservoir network in a data-driven manner based on
historical reservoir data. This model yields a sparse (yet connected) graphical structure describing the net-
work interactions. We demonstrate that this model outperforms the model obtained via unregularized max-
imum likelihood estimator and an independent reservoir model. Thus, the reservoir behaviors are not
independent of one another but can be specified with a moderate number of interactions. We demonstrate
that a majority of these interactions are between reservoirs that are in the same basin or hydrological zone,
and among reservoirs that have similar altitude and drainage area.
A natural question is whether some dependencies specified by the graphical model are due to a small num-
ber of external phenomena (drought, agricultural usage, Colorado river discharge, precipitation, etc.). For
example, water held by a collection of nearby reservoirs might be influenced by a common snow pack vari-
able. Without observing this common variable, all reservoirs in this set would appear to have mutual links,
whereas if snow pack is included in the analysis, the common behavior is explained by a link to the snow
pack variable. Accounting for latent structure removes these
confounding
dependencies and leads to
sparser
and more localized
interactions between reservoirs. Figure 1b illustrates this point. Latent variable graphical
modeling offers a principled approach to quantify the effects of
external phenomena
that influence the
entire reservoir network. In particular, this modeling framework uses observational data to (1) identify the
number of global factors (e.g., latent variables) that summarize the effect of external phenomena on the res-
ervoir network, and (2) identify the residual reservoir dependencies after accounting for these global factors.
Our experimental results demonstrate that the reservoir network at a monthly resolution has two distinct
global factors, and residual dependencies persist after accounting for these global factors.
Figure 1.
Graphical structure between a collection of eight reservoirs (a) without latent variables and (b) with latent varia-
bles. Green nodes represent reservoirs (variables) and the clouded green node represents latent variables. Solid blue lines
represent edges between reservoirs and dotted edges between reservoirs and latent variables. The reservoirs have been
grouped according to hydrological zones.
Water Resources Research
10.1002/2017WR020412
TAEB ET AL.
GRAPHICAL MODELING OF RESERVOIRS
9723
Latent variable graphical modeling obtains a mathematical representation of the external phenomena
influencing the reservoir network. One is naturally interested in linking these mathematical objects to real-
world signals (e.g., state-wide Palmer Drought Severity Index, snow pack, consumer price index). We present
an approach for associating semantics to these latent variables. We find that the state-wide Palmer Drought
Severity Index (PDSI) is highly correlated (
q

0
:
88) with one of the latent variables. PDSI is then included as
a covariate in the
next
iteration of the graphical modeling procedure to learn a joint model over reservoirs
and PDSI. Using this model, we characterize the system-wide behavior of the network to hypothetical
drought conditions. In particular, we find that as PDSI approaches
2
5, there is a probability greater than
50% of simultaneous exhaustion of multiple large reservoirs. We further present an approach for identifying
specific reservoirs in the network that are at high risk of exhaustion during extreme drought conditions. We
find that the Buchanan and Hidden Dam reservoirs are at high risk and describe the stringent water man-
agement policies that were enforced to prevent exhaustion.
2. Data Set and Model Validation
Our primary data set consists of monthly averages of reservoir volumes, derived from daily time series of
volumes downloaded from the California Data Exchange Center (CDEC). We also used secondary data for
some covariates.
2.1. Reservoir Time Series
As described in section 1, there are 1,530 reservoirs in California. In this work, we perform statistical analysis
on the largest 60 reservoirs in California. We apply our analysis on a subset of the reservoirs as they have a
large amount of historical data available. Our technique can be extended to a larger collection of reservoirs
given sufficient data. For these 60 reservoirs, daily volume data are available during the period of study
(January 2003–November 2016). We excluded five reservoirs with more than half of their values undefined
or zero, leaving 55 reservoirs. This list of daily values was inspected using a simple continuity criterion and
approximately 50 specific values were removed or corrected. Corrections were possible in six cases because
values had misplaced decimal points, but all other detected errors were set to missing values. The most
common error modes were missing values that were recorded as zero volume, and a burst of errors in the
Lyons reservoir during late October 2014 that seems due to a change in recording method at that time.
The final set of 55 reservoir volume time series spans 5,083 days over the 167 months in the study period. It
contains two full cycles of California drought (roughly, 2007–2008 and 2012–2015) and three cycles of wet
period (2004–2006, 2009–2011, 2016). Four California hydrological zones are represented, with 25, 20, 6,
and 4 reservoirs in the Sacramento, San Joaquin, Tulare, and North Coast zones, respectively.
We are interested in long-term reservoir behavior and thus model reservoir volumes at a monthly time
scale. In particular, we average the data from daily down to 167 monthly observations. The reservoir data
exhibit strong seasonal components. Hence, a seasonal adjustment step is performed to remove these pre-
dictable patterns, so that we can model deviations from the underlying trend in the reservoir behavior.
With the exception of the Farmington reservoir (which has volume less than 10
8
m
3
), the joint volume
anomalies of the remaining 54 reservoirs are well-approximated by a multivariate Gaussian distribution.
This is demonstrated by a Q-Q plot in Figure S1 of the supporting information. Since a large amount of his-
torical data is available for the Farmington reservoir, we have included it in our analysis. These observed
properties suggest that the reservoir data is amenable to the multivariate Gaussian models we employ in
this paper. Before being used in the fitting algorithms, each time series is also rescaled by its standard devi-
ation so that each series has unit variance. We note that our statistical approach identifies correlations
between reservoir volumes. Since the correlation between two random variables is normalized by their
respective variances, this transformation is appropriate.
2.2. Covariate Time Series
Latent variable graphical modeling identifies a mathematical representation of the global factors influencing
the reservoir network. We link these global factors to real-world signals using ancillary data, i.e.,
covariates
,
which are observable variables, exogenous to the model, that may affect a large fraction of reservoirs. The par-
ticular covariates that we use are temperature (averaged values over California downloaded from NOAA),
Palmer Drought Severity Index (averaged values over California downloaded from NOAA), hydroelectric power
Water Resources Research
10.1002/2017WR020412
TAEB ET AL.
GRAPHICAL MODELING OF RESERVOIRS
9724
generation of California (downloaded from U.S. Energy Information and Administration), Colorado river dis-
charge (averaged values downloaded from United States Geological Survey), and Sierra Nevada snow pack
covariate (manually averaged in the Sierra Nevada region where the elevation is over 100 m, gridded observa-
tions downloaded from NOAA). Note that since we are interested in state-wide covariates that exert influence
over the entire network, these hydrological indicators were averaged over the state of California (or in the
case of snow pack and Colorado river discharge, averaged over a large region in the Sierra Nevada and Colo-
rado river, respectively). In addition to these hydrological indicators, we use the following economic factors:
state-wide number of agricultural workers (downloaded from State of California Employment Development
Department) and state-wide consumer price index (downloaded from Department of Industrial Relations).
For each of the seven covariates, we obtain averaged monthly observations from 2003 to 2016. We apply a
time lag of 2 months to the covariates temperature, snow pack, Colorado river discharge, and Palmer
Drought Severity Index (the reason for a 2 months lag is explained in section 4.4). As with the reservoir time
series, we remove seasonal patterns with a per-month average.
2.3. Model Validation
To ensure that the model of the reservoirs is representative of reservoir behavior, we perform model validation
using a technique known as
holdout validation
(Hastie et al., 2009). The objective of this technique is to produce
models that are not overly tuned to the idiosyncrasies of observational reservoir data, so that these models are
representative of future reservoir behavior. In a holdout validation framework, the available data are partitioned
into a training set, and a disjoint validation set. The training set is used as input to a fitting algorithm to identify
a model. The accuracy of this model is then validated by computing the average log-likelihood of the validation
set with respect to the distribution specified by the model. Here, larger values of log-likelihood are indicative of
better fit to data. For our experiments, we set aside monthly observations of reservoir volumes and covariates
from January 2004–December 2013 as a training set (
n
train
5
120) and monthly observations from January
2003–December 2003 and January 2014–November 2016 as a (disjoint) validation set (
n
test
5
47). Both the train-
ing and validation observations contain a significant amount of annual and interannual variability.
3. Dependencies Underlying the Reservoir Network
3.1. Method: Graphical Modeling
A common approach for fitting a graphical model to reservoirs is to choose the simplest model, that is, the
sparsest network that adequately explains the observational data. Easing this task, for Gaussian graphical mod-
els, the graphical structure is encoded in the sparsity pattern of the precision matrix (inverse covariance matrix)
over the variables. Specifically, zeros in the precision matrix of a multivariate Gaussian distribution indicate
absent edges in the corresponding graphical model. Thus, the number of edges in the graphical model equals
the number of nonzeros of the precision matrix
H
. As an example, consider the toy graphical model in Figure
1a. Suppose that the precision matrix
H
of size 8
3
8 is indexed according to the ordering {Shasta, Black Butte,
Lake Beryssa, Isabella, Pine Flat, Don Pedro, New Melones, and Buchanan}. Then,
H
has the following structure:
H
5
???
00
???
???
00
???
????
0
???
00
??????
000
???
0
?
????????
????
0
???
????????
0
B
B
B
B
B
B
B
B
B
B
B
B
B
B
B
B
B
@
1
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
A
;
where
?
denotes a nonzero value. The intimate connection between a graphical structure and the precision
matrix implies that fitting a sparse Gaussian graphical model to reservoir observational data is equivalent to
estimating a sparse precision matrix
H
. Hence, the reservoirs are modeled according to the distribution
y
Nð
0
;
H
2
1
Þ
, where
H
is sparse. Note that the preprocessing to remove climatology causes the mean to
be zero. A natural technique to fit such a model to observational data is to minimize the negative log-
Water Resources Research
10.1002/2017WR020412
TAEB ET AL.
GRAPHICAL MODELING OF RESERVOIRS
9725
likelihood (e.g., maximum likelihood estimation) of data while controlling the sparsity level of
H
. The log-
likelihood function of the training observations
D
train
5
f
y
ð
i
Þ
g
n
train
i
5
1

R
55
(after removing some additive con-
stants and scaling) is given by the concave function
ð
H
;
D
train
Þ
5
log det
ð
H
Þ
2
tr
H

R
n
½
;
(1)
where
R
n
5
1
n
train
P
n
train
i
5
1
y
ð
i
Þ
y
ð
i
Þ0
is the sample covariance matrix. Thus, fitting a graphical model to
D
train
trans-
lates to searching over the space of precision matrices to identify a matrix
H
that is sparse and also yields a
small value of
2
ð
H
;
D
train
Þ
. This formulation, however, is a computationally intractable combinatorial prob-
lem. Recent work (Friedman et al., 2008; Yuan & Lin, 2007) has identified a way around this road block by
using a convex relaxation:
^
H
5
arg min
H
2
S
55
2
ð
H
;
D
train
Þ
1
k
jj
H
jj
1
;
s
:
t
:
H

0
:
(2)
The notation
S
55
denotes the set of symmetric 55
3
55 matrices. The constraint

0 imposes positive defi-
niteness so that the joint distribution of reservoirs is nondegenerate. The regularization term
jjjj
1
denotes
the
L
1
norm (element-wise sum of absolute values) that promotes sparsity in the matrix
H
. The
L
1
penalty,
and more broadly, regularization techniques, are widely employed in inverse problems in data analysis to
overcome ill-posedness and avoid problems such as
over-fitting
to moderate sample size (see the text-
books/monographs B
uhlmann & van de Geer, 2011; Wainwright, 2014; and the references therein). These
regularization approaches have proved to be valuable in many applications, including cameras (Duarte
et al., 2008), magnetic resonance imaging (Lustig et al., 2008), gene regularity networks (Zhang & Kim,
2014), and radar (Herman & Strohmer, 2009).
The regularization parameter
k
in (2) provides overall control of the trade-off between the fidelity of the
model to the data and the complexity of the model. In particular, the program (2) with
k
5
0 yields the
familiar maximum likelihood covariance estimator. This estimator has a well-known closed form solution
^
H
5
R
2
1
n
. Generally,
R
2
1
n
will not contain any zeros. This implies that the estimated graphical structure is fully
connected with close fit to the training data
D
train
. However, as explored in section 3.2, this model may be
overtuned to the idiosyncrasies of the training observations
D
train
and will not generalize to future behavior
of reservoirs (a phenomenon known as
over-fitting
). Larger values of
k
yield a sparser graphical model with
very large
k
resulting in a completely disconnected graphical model where the reservoirs are independent
of one another. Importantly, for any choice of
k
>
0, equation (2) is a convex program with a unique opti-
mum, and can be solved efficiently using general purpose off-the-shelf solvers (Toh et al., 2006). Further
theoretical support of this estimator is presented in (Ravikumar et al., 2011).
We select the regularization parameter
k
by
holdout validation
. In particular, for any choice of
k
, we supply
the training observations
D
train
to (2) to learn a graphical model and compute the average log-likelihood of
this model on the validation set
D
test
5
f
y
ð
i
Þ
g
n
test
i
5
1

R
55
. We sweep over all values of
k
to choose the model
with the best validation performance. Let the selected model (after holdout validation) be specified by the
precision matrix
^
H
. As discussed earlier, the matrix
^
H
specifies the structural properties of the graphical
model of the network. An edge between reservoirs
r
and
r
0
is present in the graph if and only if
^
H
r
;
r
0
6
¼
0,
with larger magnitudes indicating stronger interactions. We denote the strength of an edge as the normal-
ized magnitude of the precision matrix entry, that is,
s
ð
r
;
r
0
Þ
5
j
^
H
r
;
r
0
j
=
ð
^
H
r
;
r
^
H
r
0
;
r
0
Þ
1
=
2
0
:
(3)
The quantity
s
ð
r
;
r
0
Þ
can be viewed as the partial correlation between reservoirs
r
and
r
0
, given all other res-
ervoirs. In particular, a large
s
ð
r
;
r
0
Þ
indicates that reservoirs
r
and
r
0
are highly correlated even after account-
ing for the influence of all the other reservoirs in the network. A small value of
s
ð
r
;
r
0
Þ
indicates that the
reservoirs
r
and
r
0
are weakly correlated conditioned on all the reservoirs. Finally,
s
ð
r
;
r
0
Þ
5
0 indicates that
reservoirs
r
and
r
0
are independent conditioned on all the remaining reservoirs.
3.2. Results: Graphical Model of Reservoir Network
In this section, we explore the properties of a graphical model over the reservoir network. As described in
section 3.1, we learn a graphical model by specifying a regularization parameter
k
and supplying
Water Resources Research
10.1002/2017WR020412
TAEB ET AL.
GRAPHICAL MODELING OF RESERVOIRS
9726
observations
D
train
to the convex program (2). We vary
k
from 0 to 1 to identify a collection of graphical
models. For
k
1, the graphical model is completely disconnected and not of interest. For each graphical
model, we measure the training performance as the log-likelihood of training observation
D
train
and the val-
idation performance as the log-likelihood of validation observations
D
test
. Figure S2 in the supporting infor-
mation shows the training and validation performance of graphical modeling across
k
. Recall that
k
5
0
yields the unregularized maximum likelihood (ML) estimate. This model has a training performance of
2
23.91 and validation performance of
2
1140.4. Large
k
(here,
k
5
1) yields an independent reservoir
model, where the graphical structure is disconnected. This model has a training performance of
2
82.23 and
validation performance of
2
101.95. To obtain a graphical model over the reservoir network, we choose
k
5
0
:
23 where the validation performance is maximized (i.e., the choice of
k
using holdout validation).
This model has a training performance of
2
62.38 and validation performance of
2
85
:
43
. Supporting
information Table S1 summarizes the training and validation performances of these three models.
Results of supporting information Table S1 and Figure S2 show that the training performance is a
decreasing function of
k
: smaller values of
k
lead to a closer fit to training observations. However, small
values of
k
yield a high complexity model that fits the idiosyncrasies of the training data and thus suffers
from overfitting. This is evident from the poor validation performance of the unregularized ML estimate
(when
k
5
0). The specified graphical model is the superior model since it has a better validation perfor-
mance than the unregularized ML estimate and an independent reservoir model. Thus, the reservoir
behaviors are not independent, but can be characterized by a moderate number of dependencies. In
the supporting information, we characterize the sensitivity of the graphical model to the choice of the
regularization parameter
k
.
We further explore the properties of the specified graphical model, consisting of 285 edges. Using relation (3),
we compute the strength of the connections in the graphical structure. The upper triangle of Figure 2
shows the dependence relationships between reservoirs in this graphical model. The five strongest edges
in this graphical structure are between reservoirs Relief—Main Strawberry, Cherry—Hetch Hetchy, Invisi-
ble Lake—Lake Berryessa, Almanor—Davis, and Coyote Valley—Warm Spring. We show the geographical
location of these pairs of reservoirs in Figure 3. The presence of these strong edges is sensible: each such
edge is between reservoirs in the same hydrological zone, and four of these five edges are between pairs
of reservoirs fed by the same river. The five most connected reservoirs in order Folsom Lake, Antelope
river, Black Butte River, New Exchequer, and French Meadows, all of which are large reservoirs (volume
10
8
m
3
). We show the five strongest connections to Folsom lake in Figure 3, all of which are either con-
nected or are in close proximity to the Sacramento River. As a point of comparison, the lower triangle of
Figure 2 shows the graphical structure of the unregularized maximum likelihood estimate. This model
yields a fully connected network.
Furthermore, we observe that a majority of interactions in this graphical model are among reservoirs that
have similar drainage area (e.g., land where water falls off into reservoirs) and elevation. Figure 4a shows
a plot of the ratios of drainage areas between pairs of reservoirs connected via an edge and the strength of
the connections. Figure 4b shows a plot of the ratios of altitudes between pairs of connected reservoirs and
the strength of the connections. As a point of comparison, Figures 4c and 4d show similar metrics for the
unregularized ML estimate. Examining Figure 4, we observe that graphical modeling removes (or weakens)
dependencies between reservoirs of vastly different drainage area or elevation. This is expected since reser-
voirs with substantially different drainage area or elevation are less likely to have similar variability.
We observe that a large portion of the strong interactions occur between reservoirs in the same hydrologi-
cal zone, here denoted
h
(
r
). To quantify this observation, we consider
j
5
P
r
;
r
0
and
h
ð
r
Þ
5
h
ð
r
0
Þ
s
ð
r
;
r
0
Þ
P
r
;
r
0
s
ð
r
;
r
0
Þ
;
(4)
the ratio of within-zone edge strength to total edge strength. The model we fit has
j
5
0
:
85, so 85% of the
total edge strength is between reservoirs in the same hydrological zones. In comparison,
j
5
0
:
46 for an
unregularized ML estimate. Nevertheless, we notice some surprising connections between reservoirs that
are geographically far apart. In the next section, we propose a framework to quantify the influence of
external phenomena on the reservoir network. We further explore the effect of these external phenom-
ena to remove the confounding relationships between geographically distant reservoirs.
Water Resources Research
10.1002/2017WR020412
TAEB ET AL.
GRAPHICAL MODELING OF RESERVOIRS
9727
4. Global Factors of the Reservoir Network
We identified a graphical model over California reservoirs. Could some of these dependencies specified by
the graphical model be due to external phenomena (e.g., global factors)? In this section, we describe an
approach, known as
latent variable graphical modeling
, that identifies the number and effect of global fac-
tors influencing the reservoir network. Since these global factors are not directly observed (although we
later discuss an approach to link global factors to real-world signals), we also denote them as
latent
variables
.
4.1. Method: Latent Variable Graphical Modeling
As shown by Chandrasekaran et al. (2012), fitting a latent variable graphical model corresponds to repre-
senting the precision matrix of the reservoir volumes
H
as the difference
H
5
S
2
L
, where
S
is sparse and
L
is
a low rank matrix. The matrix
L
accounts for the effect of external phenomena, and its rank is equal to the
number of global factors; these global factors summarize the effect of external phenomena on the reservoir
network. The matrix
S
specifies the residual conditional dependencies among the reservoirs after extracting
the influence of global factors. Moreover, the sparsity pattern of
S
encodes the residual graphical structure
among reservoirs. As an example, consider the toy model shown in Figure 1b. Suppose that the matrix
S
is
Figure 2.
Linkages between reservoir pairs in the graphical model (top triangle) compared with those of the unregular-
ized maximum likelihood estimate (bottom triangle). Connection strength
s
ð
r
;
r
0
Þ
is shown in the image map, with
unlinked reservoir pairs drawn in gray. The four hydrological zones are separated by red lines. Red boxes surround the
five strongest connections in each model.
Water Resources Research
10.1002/2017WR020412
TAEB ET AL.
GRAPHICAL MODELING OF RESERVOIRS
9728
indexed according to the ordering {Shasta, Black Butte, Lake Berrysa, Isabella, Pine Flat, Don Pedro, New
Melones, and Buchanan}. Then
S
has the structure:
S
5
???
0000
?
???
00
?
00
000
??
000
000
??
00
?
000
??
000
0
?
000
???
00000
???
?
00
?
0
???
0
B
B
B
B
B
B
B
B
B
B
B
B
B
B
B
B
B
@
1
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
A
;
where
?
denotes a nonzero entry. Fitting a latent variable graphical model to reservoir volumes is to identify
the simplest model, e.g., smallest number of global factors and sparsest residual network, that adequately
explains the data. In other words, we search over the space of precision matrices
H
that can be
Figure 3.
A schematic of California and its river network with some reservoir connections. Green nodes represent the five
pairs of reservoirs with strongest edge strength in the graphical model. The red nodes represent the five strongest edges
to Folsom Lake, which is the most connected reservoir in the network. The acronyms for the reservoirs are: WRS, Wishon;
COY, Coyote Valley; INV, Indian Valley; BER, Lake Berryessa; SHA, Shasta; BUL, Bullards Bar; FOL, Folsom Lake; CMN,
Camanche; DNP, Don Pedro; EXC, New Exchequer; ALM, Almanor Lake; DAV, Lake Davis; SWB, Main Strawberry; RLF, Relief;
CHV, Cherry Valley; HTH, Hetch-Hetchy.
Water Resources Research
10.1002/2017WR020412
TAEB ET AL.
GRAPHICAL MODELING OF RESERVOIRS
9729
decomposed as
H
5
S
2
L
to identify a matrix
S
that is sparse, a matrix
L
that has a small rank, and also yields
a small negative log-likelihood
2
ðD
train
;
S
2
L
Þ
. As with the case of graphical modeling, this formulation is a
computationally intractable combinatorial problem. Based on a recent work by Chandrasekaran et al.
(2012), a computationally tractable estimator is given by:
ð
^
S
;
^
L
Þ
5
arg min
S
;
L
2
S
55
2
ð
S
2
L
;
D
train
Þ
1
k
ðjj
S
jj
1
1
c
tr
ð
L
ÞÞ
;
s
:
t
:
S
2
L

0
;
L

0
:
(5)
The constraint

0 imposes positive definiteness on the precision matrix estimate
S
2
L
, so that the joint dis-
tribution of reservoirs is nondegenerate. The constraint

0 imposes positive semidefiniteness on the matrix
L
(see Chandrasekaran et al., 2012, for an explanation of this constraint). Here,
^
L
provides an estimate for
the low-rank component of the precision matrix (corresponding to the effect of latent variables on the res-
ervoir volumes), and
^
S
provides an estimate for the sparse component of the precision matrix (correspond-
ing to the residual dependencies between reservoirs after accounting for the latent variables).
The regularization parameter
c
provides a trade-off between the graphical model component and the latent
component. In particular, for very large values of
c
, the convex program (5) produces the same estimates as
the graphical model estimator (2) (that is,
^
L
5
0 so that no latent variables are used). As
c
decreases, the
number of latent variables increases and correspondingly the number of edges in the residual graphical
structure decreases; this is because latent variables account for a global signal common to all reservoirs.
Figure 4.
(a) Ratios of drainage areas between pairs of reservoirs connected with an edge and their corresponding edge
strengths in a graphical model. (b) Ratios of elevations of pairs of reservoirs connected with an edge and their corre-
sponding edge strengths in a graphical model. (c) Ratios of drainage areas between pairs of reservoirs connected with an
edge and their corresponding edge strengths in an unregularized maximum likelihood (ML) estimate. (d) Ratios of eleva-
tions of pairs of reservoirs connected with an edge and their corresponding edge strengths in an unregularized maximum
likelihood (ML) estimate.
Water Resources Research
10.1002/2017WR020412
TAEB ET AL.
GRAPHICAL MODELING OF RESERVOIRS
9730
The regularization parameter
k
provides overall control of the trade-off between the fidelity of the model to
the data and the complexity of the model.
As before, the function
jjjj
1
denotes the
L
1
norm that promotes sparsity in the matrix
S
. The role of the
trace penalty on
L
is to promote low-rank structure (Fazel, 2002). As before, for
k
;
c
0, equation (5) is a
convex program with a unique optimum that can be solved efficiently. Theoretical support for this estima-
tor is presented in Chandrasekaran et al. (2012).
Similar to the graphical model setting, we use the
holdout validation
technique to determine the number of
global latent variables and edges in the graphical structure between reservoirs. Concretely, for a particular
choice of
k
;
c
, we supply
D
train
as input to the program (5) to learn a latent variable graphical model and
compute the average log-likelihood of this model on the validation set
D
test
. We sweep over all possible
choices of
c
;
k
and choose a set of parameters that yield the best validation performance.
Let the selected model (after holdout validation) be specified by the parameters
ð
^
S
;
^
L
Þ
. The matrix
^
L
denotes
the effect of
k
5
rank
ð
^
L
Þ
latent variables on the reservoir network. The matrix
^
S
encodes the residual graphi-
cal structure between reservoirs after incorporating
k
latent variables. We can quantify the strength of the
edges of this graphical structure using the relation (3) with
^
H
replaced with
^
S
. Finally, we quantify the por-
tion of the variability of the network explained by the latent variables as follows: the model estimates the
covariance matrix of reservoirs as
ð
^
S
2
^
L
Þ
2
1
so that
y
Nð
0
;
ð
^
S
2
^
L
Þ
2
1
Þ
. Given that the variance of a reser-
voir
r
is
½ð
^
S
2
^
L
Þ
2
1

r
;
r
, we denote the overall variance of the network as
P
55
r
5
1
½ð
^
S
2
^
L
Þ
2
1

r
;
r
. The variance of res-
ervoir
r
, conditioned on
k
latent variables, is given by
ð
^
S
2
1
Þ
r
. We thus denote the variance of the network
conditioned on
k
latent variables by
P
55
r
5
1
½
^
S
2
1

r
;
r
. Furthermore, we define the ratio
d
ð
k
Þ
5
P
55
r
5
1
½ð
^
S
2
^
L
Þ
2
1
2
^
S
2
1

r
;
r
P
55
r
5
1
½ð
^
S
2
^
L
Þ
2
1

r
;
r
;
(6)
as the portion of the variability of the network explained by
k
latent variables.
4.2. Results: Accounting for Global Factors of the Reservoir Network
We first explore the effect of global factors on the connectivity of the reservoir network. Using observations
D
train
as input to the convex program (5), we vary the regularization parameters
ð
k
;
c
Þ
to learn a collection of
latent variables graphical models. Figure 5 shows the residual conditional graphical structure corresponding
to each model. We observe that an increase in the number of latent variables leads to sparser structures and
stronger inner-zone connections. Indeed, the ratios of inner zone edge strengths to total edge strength are
j
5
0
:
90
;
j
5
0
:
91
;
j
5
0
:
93
;
j
5
0
:
94
;
j
5
0
:
97, and
j
5
0
:
99 for models with 1, 2, 3, 4, 5, and 6 latent variables,
respectively. These results support the idea that latent variables extract global features that are common to all
reservoirs, and incorporating them results in more localized interactions. The residual dependencies that per-
sist (even after including several latent variables) can be attributed to unmodeled local variables.
Further, appealing to relation (6), the portion of the variability of the network explained by 1, 2, 3, 4, 5, and
6 latent variables is given by
d
ð
1
Þ
5
0
:
23
;
d
ð
2
Þ
5
0
:
25
;
d
ð
3
Þ
5
0
:
28
;
d
ð
4
Þ
5
0
:
31
;
d
ð
5
Þ
5
0
:
32
;
d
ð
6
Þ
5
0
:
40,
respectively. Thus, the effect of latent variables on the network increases as we incorporate more of them in
the model. Nonetheless, even six latent variables explain less than 50% of the reservoir variability, with the
other portion attributed to residual conditional dependencies between reservoirs. Furthermore, this experi-
ment suggests that both the influence of global latent variables and residual dependencies among reser-
voirs are important factors of the reservoir network variability.
We now focus on one of these latent variables. In particular, we choose the parameters
ð
c
;
k
Þ
via holdout
validation with the validation set
D
test
to learn a latent variable graphical model consisting of two latent var-
iables together with a residual graphical model (conditioned on the latent variables) having 171 edges. This
is the model corresponding to Figure 5b. Thus, the reservoir network consists of two global factors, and
some residual dependencies persist after accounting for their influence. The training and validation perfor-
mance of this model (in terms of log-likelihood) are given by
2
62.11 and
2
85.87, respectively.
The conditional dependency relationships between reservoir pairs in this residual graphical structure are shown
in the upper triangle of Figure 5b. Comparing this graphical structure with the graphical structure without any
latent variables (lower triangle of Figure 5b), accounting for the global factors weakens or removes many
Water Resources Research
10.1002/2017WR020412
TAEB ET AL.
GRAPHICAL MODELING OF RESERVOIRS
9731
Figure 5.
Linkages between reservoir pairs in the latent-variable sparse graphical model (top triangle) with varying num-
ber of latent variables compared with those of the ordinary sparse graphical model (bottom triangle). Connection
strength
s
ð
r
;
r
0
Þ
is shown in the image map, with unlinked reservoir pairs drawn in gray. The four hydrological zones are
separated by red lines. Red boxes surround the five strongest connections in each model.
Water Resources Research
10.1002/2017WR020412
TAEB ET AL.
GRAPHICAL MODELING OF RESERVOIRS
9732