A Statistical Graphical Model of the California Reservoir System - Taeb_et_al-2017-Water_Resources

RESEARCH ARTICLE

10.1002/2017WR020412

A Statistical Graphical Model of the California Reservoir System

A. Taeb

, J. T. Reager

, M. Turmon

, and V. Chandrasekaran

Department of Electrical Engineering, California Institute of Technology, Pasadena, CA, USA,

Jet Propulsion Laboratory,

California Institute of Technology, Pasadena, CA, USA,

Department of Computing and Mathematical Sciences and

Department of Electrical Engineering, California Institute of Technology, Pasadena, CA, USA

Abstract

The recent California drought has highlighted the potential vulnerability of the state’s water

management infrastructure to multiyear dry intervals. Due to the high complexity of the network, dynamic

storage changes in California reservoirs on a state-wide scale have previously been difficult to model using

either traditional statistical or physical approaches. Indeed, although there is a significant line of research

on exploring models for single (or a small number of) reservoirs, these approaches are not amenable to a

system-wide modeling of the California reservoir network due to the spatial and hydrological

heterogeneities of the system. In this work, we develop a state-wide statistical graphical model to

characterize the dependencies among a collection of 55 major California reservoirs across the state; this

model is defined with respect to a graph in which the nodes index reservoirs and the edges specify the

relationships or dependencies between reservoirs. We obtain and validate this model in a data-driven

manner based on reservoir volumes over the period 2003–2016. A key feature of our framework is a

quantification of the effects of external phenomena that influence the entire reservoir network. We further

characterize the degree to which physical factors (e.g., state-wide Palmer Drought Severity Index (PDSI),

average temperature, snow pack) and economic factors (e.g., consumer price index, number of agricultural

workers) explain these external influences. As a consequence of this analysis, we obtain a system-wide

health diagnosis of the reservoir network as a function of PDSI.

1. Introduction

1.1. Motivation

The state of California depends on a complex water management system to meet wide-ranging water

demands across a large, hydrologically diverse domain. As part of this infrastructure, California has con-

structed 1,530 reservoirs having a collective storage capacity equivalent to a year of mean run off from Cali-

fornia rivers (Graf, 1999). The purpose of this system is to create water storage capacity and extend seasonal

water availability to meet agricultural, residential, industrial, power generation, and recreational needs.

Major state-wide California precipitation deficits during the years 2012–2015 rivaled the most intense 4 year

droughts in the past 1,200 years (Griffin & Anchukaitis, 2014). The drought was punctuated by low snow

pack in the Sierra Nevada, declining groundwater storage, and fallowed agricultural lands, in addition to sig-

nificantly diminished reservoir levels (AghaKouchak et al., 2014; Famiglietti, 2014; Howitt et al., 2014). This

sensitivity of the California reservoir network to external conditions (e.g., temperature, precipitation) has

implications for state-wide water and agricultural security. In this paper, we seek a characterization of the

relationships among the major California reservoirs and their sensitivity to state-wide physical and eco-

nomic factors, with a view to investigating and quantifying the likelihood of systemic catastrophes such as

the simultaneous exhaustion of multiple large reservoirs.

Such an analysis has been difficult to carry out on a system-wide scale due to the size and complexity of the

reservoir network. In one direction, a body of work has focused on characterizing the behavior of a small

collection of reservoirs using

physical laws

(e.g., Christensen & Lettenmaier, 2004); Christensen et al., 2006;

Nazemi & Wheater, 2015). Such approaches quickly become intractable in settings with large numbers of

reservoirs whose complex management is based on multiple economic and sectoral objectives (Howitt

et al., 2014). The hard-to-quantify influence of human operators and the lack of system closure have made

the modeling and prediction of reservoir network behavior using physical equations challenging in

Key Points:

We develop a statistical graphical

model to characterize the statewide

California reservoir system

We quantify the influence of external

physical and economic factors (e.g.,

statewide PDSI and consumer price

index) on the reservoir network

Further analysis gives a system-wide

health diagnosis as a function of

PDSI, indicating when heavy

management practices may be

needed

Supporting Information:

Supporting Information S1

Supporting Information S2

Correspondence to:

A. Taeb,

ataeb@caltech.edu

Citation:

Taeb, A., Reager, J. T., Turmon, M., &

Chandrasekaran, V. (2017). A statistical

graphical model of the California

reservoir system.

Water Resources

Research

, 9721–9739. https://doi.

org/10.1002/2017WR020412

Received 13 JAN 2017

Accepted 14 OCT 2017

Accepted article online 20 OCT 2017

Published online 27 NOV 2017

V

2017. American Geophysical Union.

TAEB ET AL.

GRAPHICAL MODELING OF RESERVOIRS

9721

Water Resources Research

PUBLICATIONS

hydrology and climate models (Solander et al., 2016). In a different direction, numerous works have devel-

oped

empirical techniques

for modeling the behavior of a small number of reservoirs (e.g., Ashaary et al.,

2015; Barnett & Pierce, 2008; Bazartseren et al., 2003; Chen & Liu, 2015; Cheng et al., 2015; Hoerling & Eisc-

heid, 2007; Kuria & Vogel, 2015; Linares-Rodriguez, et al., 2015; Liu & Chung, 2014; Marton, et al., 2015; Nash

& Gleick, 1991, 1993; Phatafod, 1989; Revelle & Waggoner, 1983; Wisser et al., 2010; Yang et al., 2016, 2017;

Zhang et al., 2017). However, these methods are not directly applicable to modeling a large reservoir net-

work, as the water levels of major reservoirs in California exhibit complex interactions and are statistically

correlated with one another (as is demonstrated by our analysis). This necessitates a proper quantification

of the complex dependencies among reservoirs in determining the systemic characteristics of the reservoir

network.

The focus of this work is to develop a state-wide model over the California reservoir network that addresses

the following scientific questions:

1. What are the interactions or dependencies among reservoir volumes? In particular, how correlated are

major reservoirs in the system?

2. Are there common external factors influencing the network globally? Could these external factors cause

a system-wide catastrophe?

To the best of our knowledge, this work is the first that attempts such a state-wide characterization of the

California reservoir network. The state-wide external factors that we consider in our analysis include physical

factors such as state-wide PDSI and average temperature, and economic factors such as the consumer price

index and the number of agricultural workers. The focus on these state-wide external influences is driven

by the global nature of our analysis; indeed, an exciting direction for further research is to complement our

global model with local reservoir-specific factors to obtain an integrated picture of both systemic as well as

local risks to the reservoir network.

Answering these questions for the California reservoir system raises a number of challenges, and it is impor-

tant that any modeling framework that we consider addresses these challenges. First, reservoirs with similar

hydrological attributes (e.g., altitude, drainage area, spatial location) tend to behave similarly. As an exam-

ple, a pair of reservoirs that is approximately at the same altitude or in the same hydrological zone are more

likely to have a stronger correlation than those in different altitudes/zones. Therefore, we seek a framework

that ably models the complex heterogeneities in the reservoir system. A second challenge, which is in some

sense in competition with the first one, is that compactly specified models are much more preferable to

less succinct models, as concisely described models are often more interpretable and avoid problems asso-

ciated with

over-fitting

. Finally, it is crucial that models with both of the preceding attributes have the addi-

tional feature that they can be identified in a computationally efficient manner.

1.2. Approach and Results

Gaussian

graphical models

offer an appealing and conceptually powerful framework with all the attributes

just described. Graphical modeling is a prominent multivariate analysis technique that has been successfully

employed in domains as varied as gene regulatory network analysis, social networks, speech recognition,

and computer vision (see Jordan, 2004, for a survey on graphical modeling). These models are defined with

respect to graphs, with nodes of a graph indexing variables and the edges specifying statistical dependen-

cies among these variables. In a reservoir modeling context, the nodes of the graph correspond to reser-

voirs and an edge between two reservoirs would describe the strength of the interaction between the

levels of those reservoirs. Formally, the strength of an edge specifies the degree of conditional dependence

between the corresponding reservoirs; in other words, this is the dependence between two reservoirs con-

ditioned on all the other reservoirs in the network. Informally, an edge in a graphical model denotes the

extent to which two reservoirs remain correlated even after accounting for the influence of all the other res-

ervoirs in the network. We illustrate these points using a toy example of a graphical model over a collection

of eight reservoirs, shown in Figure 1a. (This figure is purely for explanatory purposes rather than a factual

representation of the complex dependencies among reservoirs, which we obtain in section 3). One can

imagine that the reservoir volume of Shasta (which is at a high elevation in northern California in the Sacra-

mento hydrological zone) is independent of the reservoir Pine Flat and the reservoir Isabella (which are in

southern California in the Tulare hydrological zone) after conditioning on volumes of reservoirs in the cen-

tral portion of the state (e.g., Black Butte, Lake Berrysa, New Melones, Buchanan, and Don Pedro). These

Water Resources Research

10.1002/2017WR020412

TAEB ET AL.

GRAPHICAL MODELING OF RESERVOIRS

9722

relationships are encoded in a graphical model of Figure 1a. In particular, note that Shasta has an edge link-

ing it to each of the reservoirs {Black Butte, Lake Berrysa, Don Pedro, New Melones, Buchanan}, but does

not have an edge connecting it to the reservoirs {Pine Flat, Isabella}. Figure 1a is, of course, a cartoon dem-

onstration of a graphical modeling framework. In practice, identifying conditional dependencies between

pairs of reservoirs in large networks such as the one considered in our work is a challenging problem, and

we describe tractable approaches to learning such a graphical structure underlying the complex California

reservoir system in a completely data-driven manner in section 3. To the best of our knowledge, this is the

first work that applies graphical modeling techniques to model reservoirs or other water resources.

The graphical modeling framework provides a common lens for viewing two frequently employed statistical

techniques. On the one hand, a classical approach for obtaining a multivariate Gaussian distribution over

reservoir volumes is via a maximum likelihood estimator. This estimator has been widely used in various

domains in the geophysical sciences for multivariate analysis of a collection of random variables (Wackerna-

gel, 2003). The model obtained by this maximum likelihood estimator is specified by a completely con-

nected graphical structure, where all reservoirs are conditionally correlated given all other reservoirs. On the

other hand, an independent reservoir model analyses the behavior of an individual reservoir independently

of the other reservoirs in the network. This model results in a fully disconnected graphical model. In this

paper, we learn a statistical graphical model over the reservoir network in a data-driven manner based on

historical reservoir data. This model yields a sparse (yet connected) graphical structure describing the net-

work interactions. We demonstrate that this model outperforms the model obtained via unregularized max-

imum likelihood estimator and an independent reservoir model. Thus, the reservoir behaviors are not

independent of one another but can be specified with a moderate number of interactions. We demonstrate

that a majority of these interactions are between reservoirs that are in the same basin or hydrological zone,

and among reservoirs that have similar altitude and drainage area.

A natural question is whether some dependencies specified by the graphical model are due to a small num-

ber of external phenomena (drought, agricultural usage, Colorado river discharge, precipitation, etc.). For

example, water held by a collection of nearby reservoirs might be influenced by a common snow pack vari-

able. Without observing this common variable, all reservoirs in this set would appear to have mutual links,

whereas if snow pack is included in the analysis, the common behavior is explained by a link to the snow

pack variable. Accounting for latent structure removes these

confounding

dependencies and leads to

sparser

and more localized

interactions between reservoirs. Figure 1b illustrates this point. Latent variable graphical

modeling offers a principled approach to quantify the effects of

external phenomena

that influence the

entire reservoir network. In particular, this modeling framework uses observational data to (1) identify the

number of global factors (e.g., latent variables) that summarize the effect of external phenomena on the res-

ervoir network, and (2) identify the residual reservoir dependencies after accounting for these global factors.

Our experimental results demonstrate that the reservoir network at a monthly resolution has two distinct

global factors, and residual dependencies persist after accounting for these global factors.

Figure 1.

Graphical structure between a collection of eight reservoirs (a) without latent variables and (b) with latent varia-

bles. Green nodes represent reservoirs (variables) and the clouded green node represents latent variables. Solid blue lines

represent edges between reservoirs and dotted edges between reservoirs and latent variables. The reservoirs have been

grouped according to hydrological zones.

Water Resources Research

10.1002/2017WR020412

TAEB ET AL.

GRAPHICAL MODELING OF RESERVOIRS

9723

Latent variable graphical modeling obtains a mathematical representation of the external phenomena

influencing the reservoir network. One is naturally interested in linking these mathematical objects to real-

world signals (e.g., state-wide Palmer Drought Severity Index, snow pack, consumer price index). We present

an approach for associating semantics to these latent variables. We find that the state-wide Palmer Drought

Severity Index (PDSI) is highly correlated (

88) with one of the latent variables. PDSI is then included as

a covariate in the

iteration of the graphical modeling procedure to learn a joint model over reservoirs

and PDSI. Using this model, we characterize the system-wide behavior of the network to hypothetical

drought conditions. In particular, we find that as PDSI approaches

2

5, there is a probability greater than

50% of simultaneous exhaustion of multiple large reservoirs. We further present an approach for identifying

specific reservoirs in the network that are at high risk of exhaustion during extreme drought conditions. We

find that the Buchanan and Hidden Dam reservoirs are at high risk and describe the stringent water man-

agement policies that were enforced to prevent exhaustion.

2. Data Set and Model Validation

Our primary data set consists of monthly averages of reservoir volumes, derived from daily time series of

volumes downloaded from the California Data Exchange Center (CDEC). We also used secondary data for

some covariates.

2.1. Reservoir Time Series

As described in section 1, there are 1,530 reservoirs in California. In this work, we perform statistical analysis

on the largest 60 reservoirs in California. We apply our analysis on a subset of the reservoirs as they have a

large amount of historical data available. Our technique can be extended to a larger collection of reservoirs

given sufficient data. For these 60 reservoirs, daily volume data are available during the period of study

(January 2003–November 2016). We excluded five reservoirs with more than half of their values undefined

or zero, leaving 55 reservoirs. This list of daily values was inspected using a simple continuity criterion and

approximately 50 specific values were removed or corrected. Corrections were possible in six cases because

values had misplaced decimal points, but all other detected errors were set to missing values. The most

common error modes were missing values that were recorded as zero volume, and a burst of errors in the

Lyons reservoir during late October 2014 that seems due to a change in recording method at that time.

The final set of 55 reservoir volume time series spans 5,083 days over the 167 months in the study period. It

contains two full cycles of California drought (roughly, 2007–2008 and 2012–2015) and three cycles of wet

period (2004–2006, 2009–2011, 2016). Four California hydrological zones are represented, with 25, 20, 6,

and 4 reservoirs in the Sacramento, San Joaquin, Tulare, and North Coast zones, respectively.

We are interested in long-term reservoir behavior and thus model reservoir volumes at a monthly time

scale. In particular, we average the data from daily down to 167 monthly observations. The reservoir data

exhibit strong seasonal components. Hence, a seasonal adjustment step is performed to remove these pre-

dictable patterns, so that we can model deviations from the underlying trend in the reservoir behavior.

With the exception of the Farmington reservoir (which has volume less than 10

), the joint volume

anomalies of the remaining 54 reservoirs are well-approximated by a multivariate Gaussian distribution.

This is demonstrated by a Q-Q plot in Figure S1 of the supporting information. Since a large amount of his-

torical data is available for the Farmington reservoir, we have included it in our analysis. These observed

properties suggest that the reservoir data is amenable to the multivariate Gaussian models we employ in

this paper. Before being used in the fitting algorithms, each time series is also rescaled by its standard devi-

ation so that each series has unit variance. We note that our statistical approach identifies correlations

between reservoir volumes. Since the correlation between two random variables is normalized by their

respective variances, this transformation is appropriate.

2.2. Covariate Time Series

Latent variable graphical modeling identifies a mathematical representation of the global factors influencing

the reservoir network. We link these global factors to real-world signals using ancillary data, i.e.,

covariates

which are observable variables, exogenous to the model, that may affect a large fraction of reservoirs. The par-

ticular covariates that we use are temperature (averaged values over California downloaded from NOAA),

Palmer Drought Severity Index (averaged values over California downloaded from NOAA), hydroelectric power

Water Resources Research

10.1002/2017WR020412

TAEB ET AL.

GRAPHICAL MODELING OF RESERVOIRS

9724

generation of California (downloaded from U.S. Energy Information and Administration), Colorado river dis-

charge (averaged values downloaded from United States Geological Survey), and Sierra Nevada snow pack

covariate (manually averaged in the Sierra Nevada region where the elevation is over 100 m, gridded observa-

tions downloaded from NOAA). Note that since we are interested in state-wide covariates that exert influence

over the entire network, these hydrological indicators were averaged over the state of California (or in the

case of snow pack and Colorado river discharge, averaged over a large region in the Sierra Nevada and Colo-

rado river, respectively). In addition to these hydrological indicators, we use the following economic factors:

state-wide number of agricultural workers (downloaded from State of California Employment Development

Department) and state-wide consumer price index (downloaded from Department of Industrial Relations).

For each of the seven covariates, we obtain averaged monthly observations from 2003 to 2016. We apply a

time lag of 2 months to the covariates temperature, snow pack, Colorado river discharge, and Palmer

Drought Severity Index (the reason for a 2 months lag is explained in section 4.4). As with the reservoir time

series, we remove seasonal patterns with a per-month average.

2.3. Model Validation

To ensure that the model of the reservoirs is representative of reservoir behavior, we perform model validation

using a technique known as

holdout validation

(Hastie et al., 2009). The objective of this technique is to produce

models that are not overly tuned to the idiosyncrasies of observational reservoir data, so that these models are

representative of future reservoir behavior. In a holdout validation framework, the available data are partitioned

into a training set, and a disjoint validation set. The training set is used as input to a fitting algorithm to identify

a model. The accuracy of this model is then validated by computing the average log-likelihood of the validation

set with respect to the distribution specified by the model. Here, larger values of log-likelihood are indicative of

better fit to data. For our experiments, we set aside monthly observations of reservoir volumes and covariates

from January 2004–December 2013 as a training set (

train

5

120) and monthly observations from January

2003–December 2003 and January 2014–November 2016 as a (disjoint) validation set (

test

5

47). Both the train-

ing and validation observations contain a significant amount of annual and interannual variability.

3. Dependencies Underlying the Reservoir Network

3.1. Method: Graphical Modeling

A common approach for fitting a graphical model to reservoirs is to choose the simplest model, that is, the

sparsest network that adequately explains the observational data. Easing this task, for Gaussian graphical mod-

els, the graphical structure is encoded in the sparsity pattern of the precision matrix (inverse covariance matrix)

over the variables. Specifically, zeros in the precision matrix of a multivariate Gaussian distribution indicate

absent edges in the corresponding graphical model. Thus, the number of edges in the graphical model equals

the number of nonzeros of the precision matrix

. As an example, consider the toy graphical model in Figure

1a. Suppose that the precision matrix

of size 8

3

8 is indexed according to the ordering {Shasta, Black Butte,

Lake Beryssa, Isabella, Pine Flat, Don Pedro, New Melones, and Buchanan}. Then,

has the following structure:

5

???

????

???

??????

000

???

????????

????

???

????????

;

where

denotes a nonzero value. The intimate connection between a graphical structure and the precision

matrix implies that fitting a sparse Gaussian graphical model to reservoir observational data is equivalent to

estimating a sparse precision matrix

. Hence, the reservoirs are modeled according to the distribution

Nð

;

2

, where

is sparse. Note that the preprocessing to remove climatology causes the mean to

be zero. A natural technique to fit such a model to observational data is to minimize the negative log-

Water Resources Research

10.1002/2017WR020412

TAEB ET AL.

GRAPHICAL MODELING OF RESERVOIRS

9725

likelihood (e.g., maximum likelihood estimation) of data while controlling the sparsity level of

. The log-

likelihood function of the training observations

train

5

train

5

(after removing some additive con-

stants and scaling) is given by the concave function

‘

;

train

5

log det

2

;

(1)

where

5

train

5

Þ0

is the sample covariance matrix. Thus, fitting a graphical model to

train

trans-

lates to searching over the space of precision matrices to identify a matrix

that is sparse and also yields a

small value of

2

‘

;

train

. This formulation, however, is a computationally intractable combinatorial prob-

lem. Recent work (Friedman et al., 2008; Yuan & Lin, 2007) has identified a way around this road block by

using a convex relaxation:

5

arg min

2

‘

;

train

1

;

(2)

The notation

denotes the set of symmetric 55

3

55 matrices. The constraint

0 imposes positive defi-

niteness so that the joint distribution of reservoirs is nondegenerate. The regularization term

jjjj

denotes

the

norm (element-wise sum of absolute values) that promotes sparsity in the matrix

. The

penalty,

and more broadly, regularization techniques, are widely employed in inverse problems in data analysis to

overcome ill-posedness and avoid problems such as

over-fitting

to moderate sample size (see the text-

books/monographs B

€

uhlmann & van de Geer, 2011; Wainwright, 2014; and the references therein). These

regularization approaches have proved to be valuable in many applications, including cameras (Duarte

et al., 2008), magnetic resonance imaging (Lustig et al., 2008), gene regularity networks (Zhang & Kim,

2014), and radar (Herman & Strohmer, 2009).

The regularization parameter

in (2) provides overall control of the trade-off between the fidelity of the

model to the data and the complexity of the model. In particular, the program (2) with

5

0 yields the

familiar maximum likelihood covariance estimator. This estimator has a well-known closed form solution

5

2

. Generally,

2

will not contain any zeros. This implies that the estimated graphical structure is fully

connected with close fit to the training data

train

. However, as explored in section 3.2, this model may be

overtuned to the idiosyncrasies of the training observations

train

and will not generalize to future behavior

of reservoirs (a phenomenon known as

over-fitting

). Larger values of

yield a sparser graphical model with

very large

resulting in a completely disconnected graphical model where the reservoirs are independent

of one another. Importantly, for any choice of

0, equation (2) is a convex program with a unique opti-

mum, and can be solved efficiently using general purpose off-the-shelf solvers (Toh et al., 2006). Further

theoretical support of this estimator is presented in (Ravikumar et al., 2011).

We select the regularization parameter

holdout validation

. In particular, for any choice of

, we supply

the training observations

train

to (2) to learn a graphical model and compute the average log-likelihood of

this model on the validation set

test

5

test

5

. We sweep over all values of

to choose the model

with the best validation performance. Let the selected model (after holdout validation) be specified by the

precision matrix

. As discussed earlier, the matrix

specifies the structural properties of the graphical

model of the network. An edge between reservoirs

and

is present in the graph if and only if

;

with larger magnitudes indicating stronger interactions. We denote the strength of an edge as the normal-

ized magnitude of the precision matrix entry, that is,

;

5

;

(3)

The quantity

;

can be viewed as the partial correlation between reservoirs

and

, given all other res-

ervoirs. In particular, a large

;

indicates that reservoirs

and

are highly correlated even after account-

ing for the influence of all the other reservoirs in the network. A small value of

;

indicates that the

reservoirs

and

are weakly correlated conditioned on all the reservoirs. Finally,

;

5

0 indicates that

reservoirs

and

are independent conditioned on all the remaining reservoirs.

3.2. Results: Graphical Model of Reservoir Network

In this section, we explore the properties of a graphical model over the reservoir network. As described in

section 3.1, we learn a graphical model by specifying a regularization parameter

and supplying

Water Resources Research

10.1002/2017WR020412

TAEB ET AL.

GRAPHICAL MODELING OF RESERVOIRS

9726

observations

train

to the convex program (2). We vary

from 0 to 1 to identify a collection of graphical

models. For

1, the graphical model is completely disconnected and not of interest. For each graphical

model, we measure the training performance as the log-likelihood of training observation

train

and the val-

idation performance as the log-likelihood of validation observations

test

. Figure S2 in the supporting infor-

mation shows the training and validation performance of graphical modeling across

. Recall that

5

yields the unregularized maximum likelihood (ML) estimate. This model has a training performance of

2

23.91 and validation performance of

2

1140.4. Large

(here,

5

1) yields an independent reservoir

model, where the graphical structure is disconnected. This model has a training performance of

2

82.23 and

validation performance of

2

101.95. To obtain a graphical model over the reservoir network, we choose

5

23 where the validation performance is maximized (i.e., the choice of

using holdout validation).

This model has a training performance of

2

62.38 and validation performance of

2

. Supporting

information Table S1 summarizes the training and validation performances of these three models.

Results of supporting information Table S1 and Figure S2 show that the training performance is a

decreasing function of

: smaller values of

lead to a closer fit to training observations. However, small

values of

yield a high complexity model that fits the idiosyncrasies of the training data and thus suffers

from overfitting. This is evident from the poor validation performance of the unregularized ML estimate

(when

5

0). The specified graphical model is the superior model since it has a better validation perfor-

mance than the unregularized ML estimate and an independent reservoir model. Thus, the reservoir

behaviors are not independent, but can be characterized by a moderate number of dependencies. In

the supporting information, we characterize the sensitivity of the graphical model to the choice of the

regularization parameter

We further explore the properties of the specified graphical model, consisting of 285 edges. Using relation (3),

we compute the strength of the connections in the graphical structure. The upper triangle of Figure 2

shows the dependence relationships between reservoirs in this graphical model. The five strongest edges

in this graphical structure are between reservoirs Relief—Main Strawberry, Cherry—Hetch Hetchy, Invisi-

ble Lake—Lake Berryessa, Almanor—Davis, and Coyote Valley—Warm Spring. We show the geographical

location of these pairs of reservoirs in Figure 3. The presence of these strong edges is sensible: each such

edge is between reservoirs in the same hydrological zone, and four of these five edges are between pairs

of reservoirs fed by the same river. The five most connected reservoirs in order Folsom Lake, Antelope

river, Black Butte River, New Exchequer, and French Meadows, all of which are large reservoirs (volume

). We show the five strongest connections to Folsom lake in Figure 3, all of which are either con-

nected or are in close proximity to the Sacramento River. As a point of comparison, the lower triangle of

Figure 2 shows the graphical structure of the unregularized maximum likelihood estimate. This model

yields a fully connected network.

Furthermore, we observe that a majority of interactions in this graphical model are among reservoirs that

have similar drainage area (e.g., land where water falls off into reservoirs) and elevation. Figure 4a shows

a plot of the ratios of drainage areas between pairs of reservoirs connected via an edge and the strength of

the connections. Figure 4b shows a plot of the ratios of altitudes between pairs of connected reservoirs and

the strength of the connections. As a point of comparison, Figures 4c and 4d show similar metrics for the

unregularized ML estimate. Examining Figure 4, we observe that graphical modeling removes (or weakens)

dependencies between reservoirs of vastly different drainage area or elevation. This is expected since reser-

voirs with substantially different drainage area or elevation are less likely to have similar variability.

We observe that a large portion of the strong interactions occur between reservoirs in the same hydrologi-

cal zone, here denoted

(

). To quantify this observation, we consider

5

;

and

5

;

(4)

the ratio of within-zone edge strength to total edge strength. The model we fit has

5

85, so 85% of the

total edge strength is between reservoirs in the same hydrological zones. In comparison,

5

46 for an

unregularized ML estimate. Nevertheless, we notice some surprising connections between reservoirs that

are geographically far apart. In the next section, we propose a framework to quantify the influence of

external phenomena on the reservoir network. We further explore the effect of these external phenom-

ena to remove the confounding relationships between geographically distant reservoirs.

Water Resources Research

10.1002/2017WR020412

TAEB ET AL.

GRAPHICAL MODELING OF RESERVOIRS

9727

4. Global Factors of the Reservoir Network

We identified a graphical model over California reservoirs. Could some of these dependencies specified by

the graphical model be due to external phenomena (e.g., global factors)? In this section, we describe an

approach, known as

latent variable graphical modeling

, that identifies the number and effect of global fac-

tors influencing the reservoir network. Since these global factors are not directly observed (although we

later discuss an approach to link global factors to real-world signals), we also denote them as

latent

variables

4.1. Method: Latent Variable Graphical Modeling

As shown by Chandrasekaran et al. (2012), fitting a latent variable graphical model corresponds to repre-

senting the precision matrix of the reservoir volumes

as the difference

5

2

, where

is sparse and

a low rank matrix. The matrix

accounts for the effect of external phenomena, and its rank is equal to the

number of global factors; these global factors summarize the effect of external phenomena on the reservoir

network. The matrix

specifies the residual conditional dependencies among the reservoirs after extracting

the influence of global factors. Moreover, the sparsity pattern of

encodes the residual graphical structure

among reservoirs. As an example, consider the toy model shown in Figure 1b. Suppose that the matrix

Figure 2.

Linkages between reservoir pairs in the graphical model (top triangle) compared with those of the unregular-

ized maximum likelihood estimate (bottom triangle). Connection strength

;

is shown in the image map, with

unlinked reservoir pairs drawn in gray. The four hydrological zones are separated by red lines. Red boxes surround the

five strongest connections in each model.

Water Resources Research

10.1002/2017WR020412

TAEB ET AL.

GRAPHICAL MODELING OF RESERVOIRS

9728

indexed according to the ordering {Shasta, Black Butte, Lake Berrysa, Isabella, Pine Flat, Don Pedro, New

Melones, and Buchanan}. Then

has the structure:

5

???

0000

???

000

???

00000

???

;

where

denotes a nonzero entry. Fitting a latent variable graphical model to reservoir volumes is to identify

the simplest model, e.g., smallest number of global factors and sparsest residual network, that adequately

explains the data. In other words, we search over the space of precision matrices

that can be

Figure 3.

A schematic of California and its river network with some reservoir connections. Green nodes represent the five

pairs of reservoirs with strongest edge strength in the graphical model. The red nodes represent the five strongest edges

to Folsom Lake, which is the most connected reservoir in the network. The acronyms for the reservoirs are: WRS, Wishon;

COY, Coyote Valley; INV, Indian Valley; BER, Lake Berryessa; SHA, Shasta; BUL, Bullards Bar; FOL, Folsom Lake; CMN,

Camanche; DNP, Don Pedro; EXC, New Exchequer; ALM, Almanor Lake; DAV, Lake Davis; SWB, Main Strawberry; RLF, Relief;

CHV, Cherry Valley; HTH, Hetch-Hetchy.

Water Resources Research

10.1002/2017WR020412

TAEB ET AL.

GRAPHICAL MODELING OF RESERVOIRS

9729

decomposed as

5

2

to identify a matrix

that is sparse, a matrix

that has a small rank, and also yields

a small negative log-likelihood

2

‘

ðD

train

;

2

. As with the case of graphical modeling, this formulation is a

computationally intractable combinatorial problem. Based on a recent work by Chandrasekaran et al.

(2012), a computationally tractable estimator is given by:

;

5

arg min

;

2

‘

2

;

train

1

ðjj

1

ÞÞ

;

2

;

(5)

The constraint

0 imposes positive definiteness on the precision matrix estimate

2

, so that the joint dis-

tribution of reservoirs is nondegenerate. The constraint

0 imposes positive semidefiniteness on the matrix

(see Chandrasekaran et al., 2012, for an explanation of this constraint). Here,

provides an estimate for

the low-rank component of the precision matrix (corresponding to the effect of latent variables on the res-

ervoir volumes), and

provides an estimate for the sparse component of the precision matrix (correspond-

ing to the residual dependencies between reservoirs after accounting for the latent variables).

The regularization parameter

provides a trade-off between the graphical model component and the latent

component. In particular, for very large values of

, the convex program (5) produces the same estimates as

the graphical model estimator (2) (that is,

5

0 so that no latent variables are used). As

decreases, the

number of latent variables increases and correspondingly the number of edges in the residual graphical

structure decreases; this is because latent variables account for a global signal common to all reservoirs.

Figure 4.

(a) Ratios of drainage areas between pairs of reservoirs connected with an edge and their corresponding edge

strengths in a graphical model. (b) Ratios of elevations of pairs of reservoirs connected with an edge and their corre-

sponding edge strengths in a graphical model. (c) Ratios of drainage areas between pairs of reservoirs connected with an

edge and their corresponding edge strengths in an unregularized maximum likelihood (ML) estimate. (d) Ratios of eleva-

tions of pairs of reservoirs connected with an edge and their corresponding edge strengths in an unregularized maximum

likelihood (ML) estimate.

Water Resources Research

10.1002/2017WR020412

TAEB ET AL.

GRAPHICAL MODELING OF RESERVOIRS

9730

The regularization parameter

provides overall control of the trade-off between the fidelity of the model to

the data and the complexity of the model.

As before, the function

jjjj

denotes the

norm that promotes sparsity in the matrix

. The role of the

trace penalty on

is to promote low-rank structure (Fazel, 2002). As before, for

;

0, equation (5) is a

convex program with a unique optimum that can be solved efficiently. Theoretical support for this estima-

tor is presented in Chandrasekaran et al. (2012).

Similar to the graphical model setting, we use the

holdout validation

technique to determine the number of

global latent variables and edges in the graphical structure between reservoirs. Concretely, for a particular

choice of

;

, we supply

train

as input to the program (5) to learn a latent variable graphical model and

compute the average log-likelihood of this model on the validation set

test

. We sweep over all possible

choices of

;

and choose a set of parameters that yield the best validation performance.

Let the selected model (after holdout validation) be specified by the parameters

;

. The matrix

denotes

the effect of

5

rank

latent variables on the reservoir network. The matrix

encodes the residual graphi-

cal structure between reservoirs after incorporating

latent variables. We can quantify the strength of the

edges of this graphical structure using the relation (3) with

replaced with

. Finally, we quantify the por-

tion of the variability of the network explained by the latent variables as follows: the model estimates the

covariance matrix of reservoirs as

2

2

so that

Nð

;

2

2

. Given that the variance of a reser-

voir

½ð

2

2

;

, we denote the overall variance of the network as

5

½ð

2

2

;

. The variance of res-

ervoir

, conditioned on

latent variables, is given by

2

. We thus denote the variance of the network

conditioned on

latent variables by

5

2

;

. Furthermore, we define the ratio

5

5

½ð

2

2

2

2

;

5

½ð

2

2

;

(6)

as the portion of the variability of the network explained by

latent variables.

4.2. Results: Accounting for Global Factors of the Reservoir Network

We first explore the effect of global factors on the connectivity of the reservoir network. Using observations

train

as input to the convex program (5), we vary the regularization parameters

;

to learn a collection of

latent variables graphical models. Figure 5 shows the residual conditional graphical structure corresponding

to each model. We observe that an increase in the number of latent variables leads to sparser structures and

stronger inner-zone connections. Indeed, the ratios of inner zone edge strengths to total edge strength are

5

;

5

;

5

;

5

;

5

97, and

5

99 for models with 1, 2, 3, 4, 5, and 6 latent variables,

respectively. These results support the idea that latent variables extract global features that are common to all

reservoirs, and incorporating them results in more localized interactions. The residual dependencies that per-

sist (even after including several latent variables) can be attributed to unmodeled local variables.

Further, appealing to relation (6), the portion of the variability of the network explained by 1, 2, 3, 4, 5, and

6 latent variables is given by

5

;

5

;

5

;

5

;

5

;

5

40,

respectively. Thus, the effect of latent variables on the network increases as we incorporate more of them in

the model. Nonetheless, even six latent variables explain less than 50% of the reservoir variability, with the

other portion attributed to residual conditional dependencies between reservoirs. Furthermore, this experi-

ment suggests that both the influence of global latent variables and residual dependencies among reser-

voirs are important factors of the reservoir network variability.

We now focus on one of these latent variables. In particular, we choose the parameters

;

via holdout

validation with the validation set

test

to learn a latent variable graphical model consisting of two latent var-

iables together with a residual graphical model (conditioned on the latent variables) having 171 edges. This

is the model corresponding to Figure 5b. Thus, the reservoir network consists of two global factors, and

some residual dependencies persist after accounting for their influence. The training and validation perfor-

mance of this model (in terms of log-likelihood) are given by

2

62.11 and

2

85.87, respectively.

The conditional dependency relationships between reservoir pairs in this residual graphical structure are shown

in the upper triangle of Figure 5b. Comparing this graphical structure with the graphical structure without any

latent variables (lower triangle of Figure 5b), accounting for the global factors weakens or removes many

Water Resources Research

10.1002/2017WR020412

TAEB ET AL.

GRAPHICAL MODELING OF RESERVOIRS

9731

Figure 5.

Linkages between reservoir pairs in the latent-variable sparse graphical model (top triangle) with varying num-

ber of latent variables compared with those of the ordinary sparse graphical model (bottom triangle). Connection

strength

;

is shown in the image map, with unlinked reservoir pairs drawn in gray. The four hydrological zones are

separated by red lines. Red boxes surround the five strongest connections in each model.

Water Resources Research

10.1002/2017WR020412

TAEB ET AL.

GRAPHICAL MODELING OF RESERVOIRS

9732