Automated construction of cognitive maps with visual predictive coding

James A. Gornet^{1,2} (jgornet@caltech.edu) and Matt Thomson^{1,2} (mthomson@caltech.edu)

^1 California Institute of Technology, Division of Biology and Biological Engineering, Pasadena, CA, USA
^2 California Institute of Technology, Computation and Neural Systems, Pasadena, CA, USA
Humans construct internal cognitive maps of their environment directly from sensory
inputs without access to a system of explicit coordinates or distance measurements. While
machine learning algorithms like SLAM utilize specialized inference procedures to identify
visual features and construct spatial maps from visual and odometry data, the general
nature of cognitive maps in the brain suggests a unified mapping algorithmic strategy
that can generalize to auditory, tactile, and linguistic inputs. Here, we demonstrate that
predictive coding provides a natural and versatile neural network algorithm for constructing
spatial maps using sensory data. We introduce a framework in which an agent navigates
a virtual environment while engaging in visual predictive coding using a self-attention-
equipped convolutional neural network. While learning a next image prediction task,
the agent automatically constructs an internal representation of the environment that
quantitatively reflects spatial distances. The internal map enables the agent to pinpoint
its location relative to landmarks using only visual information. The predictive coding
network generates a vectorized encoding of the environment that supports vector navigation
where individual latent space units delineate localized, overlapping neighborhoods in the
environment. Broadly, our work introduces predictive coding as a unified algorithmic
framework for constructing cognitive maps that can naturally extend to the mapping of
auditory, sensorimotor, and linguistic inputs.
Space and time are fundamental physical structures in the natural world, and all organisms have evolved strategies for navigating space to forage, mate, and escape predation [1-3]. In humans and other mammals, the concept of a spatial or cognitive map has been postulated to underlie spatial reasoning tasks [4-6]. A spatial map is an internal, neural representation of an animal's environment that marks the locations of landmarks, food, water, and shelter and that can be queried for navigation and planning. The neural algorithms underlying spatial mapping are thought to generalize to other sensory modes, providing cognitive representations of auditory and somatosensory data [7] as well as internal maps of more abstract information, including concepts [8,9], tasks [10], semantic information [11-13], and memories [14]. Empirical evidence suggests that the brain uses common cognitive mapping strategies for spatial and non-spatial sensory information, so common mapping algorithms might exist that can map and navigate over not only visual but also semantic information and logical rules inferred from experience [7,8,15]. In such a paradigm, reasoning itself could be implemented as a form of navigation within a cognitive map of concepts, facts, and ideas.
Since the notion of a spatial or cognitive map emerged, how environments are represented within the brain and how such maps can be learned from experience have been central questions in neuroscience [16]. Place cells in the hippocampus are neurons that are active when an animal transits a specific location in an environment [16]. Grid cells in the entorhinal cortex fire at regular spatial intervals and likely track an organism's displacement in the environment [17,18]. Yet even with the identification of a neural substrate for the representation of space, the question of how a spatial map can be learned from sensory data has remained open, and the neural algorithms that enable the construction of spatial and other cognitive maps remain poorly understood.
Empirical work in machine learning has demonstrated that deep neural networks can solve spatial navigation tasks as well as perform path prediction and grid cell formation [19,20]. Cueva & Wei [19] and Banino et al. [20] demonstrate that neural networks can learn to perform path prediction and that the networks generate firing patterns resembling those of grid cells in the entorhinal cortex. Crane et al. [21], Zhang et al. [22], and Banino et al. [20] demonstrate navigation algorithms that require the environment's map or that use firing patterns resembling place cells in the hippocampus. These studies allow an agent to access environmental coordinates explicitly [19] or initialize a model with place cells that represent specific locations in an arena [20]. In machine learning and autonomous navigation, a variety of algorithms have been developed to perform mapping tasks, including SLAM and monocular SLAM algorithms [23-26] as well as neural network implementations [27-29]. Yet SLAM algorithms contain many specific inference strategies, like visual feature and object detection, that are specifically engineered for map building, wayfinding, and pose estimation based on visual information. While extensive research in computer vision and machine learning uses video frames, these studies do not extract representations of the environment's map [30,31]. A unified theoretical and mathematical framework for understanding the mapping of spaces based on sensory information remains incomplete.
Figure 1. A predictive coding neural network explores a virtual environment. In predictive coding, a model predicts observations and updates its parameters using the prediction error. a, An agent traverses its environment by taking the most direct path to random positions. b, A self-attention-based encoder-decoder neural network architecture learns to perform predictive coding: a ResNet-18 convolutional neural network acts as the encoder, self-attention is performed with 8 heads, and a corresponding ResNet-18 convolutional neural network decodes to the predicted image. c, The neural network learns to perform predictive coding effectively, with a mean-squared error of 0.094 between the actual and predicted images.
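The paper does not include an implementation listing; the following is a minimal PyTorch sketch of the kind of architecture the caption describes (a ResNet-18 encoder, 8-head self-attention over the sequence of frame embeddings, and a decoder back to image space). The 64x64 frame size, the causal mask, and the lightweight transposed-convolution decoder standing in for the ResNet-18-style decoder are our illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a self-attention-equipped encoder-decoder predictive coder.
# Assumptions (not from the paper): 64x64 RGB frames, 512-d latents, torchvision
# available, and a transposed-convolution decoder in place of the ResNet-18-style
# decoder described in the caption.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class PredictiveCoder(nn.Module):
    def __init__(self, latent_dim: int = 512, num_heads: int = 8):
        super().__init__()
        backbone = resnet18(weights=None)
        # Drop the classification head; keep the convolutional trunk.
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        # Self-attention mixes information across the frame sequence.
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=num_heads, batch_first=True
        )
        self.attention = nn.TransformerEncoder(layer, num_layers=1)
        # Decoder maps the attended latent back to a predicted 64x64 image.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 256, 4), nn.ReLU(),                # 4x4
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 8x8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 16x16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 32x32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 64x64
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, 64, 64) -> predicted next frame (batch, 3, 64, 64)
        b, t, c, h, w = frames.shape
        z = self.encoder(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        # Causal mask so each step attends only to itself and the past.
        mask = nn.Transformer.generate_square_subsequent_mask(t)
        z = self.attention(z, mask=mask)
        return self.decoder(z[:, -1].reshape(b, -1, 1, 1))
```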
Predictive coding has been proposed as a unifying theory of neural function in which the fundamental goal of a neural system is to predict future observations given past data [32-34]. When an agent explores a physical environment, temporal correlations in sensory observations reflect the structure of the physical environment: landmarks nearby one another in space will also be observed in temporal sequence. In this way, predicting observations in a temporal series of sensory observations requires an agent to internalize implicit information about its spatial domain. Historically, Poincaré motivated the possibility of spatial mapping through a predictive coding strategy in which an agent assembles a global representation of an environment by gluing together information gathered through local exploration [35,36]. The exploratory paths together contain information that could, in principle, enable the assembly of a spatial map for both flat and curved manifolds. Indeed, extended Kalman filters [25,37] for SLAM perform a form of predictive coding by directly mapping visual changes and movement to spatial changes. However, extended Kalman filters, like other SLAM approaches, require intricate strategies for landmark size calibration, image feature extraction, and modeling of camera distortion, whereas biological systems solve flexible mapping and navigation problems that engineered systems cannot. Yet, while the concept of predictive coding for spatial mapping is intuitively
attractive, a major challenge is the development of algorithms that can glue together the local sensory information gathered by an agent into a global, internally consistent environmental map. Connections between mapping and predictive coding in the literature have primarily focused on situations where an agent has explicit access to its spatial location as a state variable [38-40]. The problem of building spatial maps de novo from sensory data remains poorly understood.
Here, we demonstrate that a neural network trained on a sensory predictive coding task can construct an implicit spatial map of an environment by assembling observations acquired along local exploratory paths into a global representation of a physical space within the network's latent space. We analyze sensory predictive coding theoretically and demonstrate that solutions to the predictive sensory inference problem have a mathematical structure that can naturally be implemented by a neural network with a 'path encoder,' an internal spatial map, and a 'sensory decoder,' trained using backpropagation. In such a paradigm, a network learns an internal map of its environment by inferring an internal geometric representation that supports predictive sensory inference. We implement sensory predictive coding within an agent that explores a virtual environment while performing visual predictive coding using a convolutional neural network with self-attention. Following network training during exploration, we find that the encoder network embeds images collected by the exploring agent into an internal representation of space. Within the embedding, the distances between images reflect their relative spatial positions, not object-level similarity between images. During exploratory training, the network implicitly assembles information from local paths into a global representation of space as it solves a next-image inference problem. Fundamentally, we connect predictive coding and mapping tasks, demonstrating a computational and mathematical strategy for integrating information from local measurements into a global, self-consistent environmental model.
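As a minimal sketch of how such a network could be trained on the next-image task by backpropagation, continuing the `PredictiveCoder` sketch above: the optimizer, learning rate, and the assumed `frame_sequences` loader are illustrative choices, not the authors' settings.

```python
# Illustrative training loop for the next-image prediction task.
# `PredictiveCoder` is the sketch above; `frame_sequences` is assumed to be a
# DataLoader yielding (batch, time, 3, 64, 64) tensors from exploration paths.
import torch
import torch.nn.functional as F

model = PredictiveCoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for frames in frame_sequences:
    context, target = frames[:, :-1], frames[:, -1]  # past frames, next frame
    prediction = model(context)
    loss = F.mse_loss(prediction, target)  # prediction error drives learning
    optimizer.zero_grad()
    loss.backward()   # backpropagate the prediction error
    optimizer.step()  # update the network's parameters
```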
Mathematical formulation of spatial mapping as sensory predictive coding
In this paper, we aim to understand how a spatial map can be assembled by an agent that makes sensory observations while exploring an environment. Prior studies of the connections between predictive coding and mapping have primarily focused on situations where an agent has access to its 'state' or location in the environment [38-40]. Here, we develop a theoretical model and a neural network implementation of sensory predictive coding that illustrate why and how an internal spatial map can emerge naturally as a solution to sensory inference problems. We first formulate a theoretical model of visual predictive coding and demonstrate that the predictive coding problem can be solved by an inference procedure that constructs an implicit representation of an agent's environment to predict future sensory observations. The theoretical analysis also suggests that the underlying inference problem can be solved by an encoder-decoder neural network that infers spatial position from observed image sequences.
We consider an agent exploring an environment $\Omega \subset \mathbb{R}^2$ while acquiring visual information in the form of pixel-valued image vectors $I \in \mathbb{R}^{n \times n}$ given a position $x \in \Omega$. The agent's environment $\Omega$ is a bounded subset of $\mathbb{R}^2$ that could contain obstructions and holes. In general, at any given time $t$, the agent's state can be characterized by a position $x(t)$ and an orientation $\theta(t)$, where $x(t)$ and $\theta(t)$ are coordinates within a global coordinate system unknown to the agent.
The agent’s environment comes equipped with a visual
scene, and the agent makes observations by acquiring
image vectors
R
×
as it moves along a sequence
of points
. At every position
and orientation
,
the agent acquires an image by effectively sampling
from an image the conditional probability distribution
(
|
,
)
which encodes the probability of observing
a specific image vector
when the agent is positioned
at position
and orientation
. The distribution
(
|
푥,
)
has a deterministic and stochastic component
where the deterministic component is set by landmarks
in the environment while stochastic effects can emerge
due to changes in lighting, background, and scene
dynamics. Mathematically, we can view
(
|
푥,
)
as
a function on a vector bundle with base space
Ω
and
total space
Ω
×
43
. The function assigns an observation
probability to every possible image vector for an agent
positioned at a point
(
푥,
)
. Intuitively, the agent’s
observations preserve the geometric structure of the
environment: the spatial structure influences temporal
correlations.
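To make the observation model concrete, the following toy construction of $P(I \mid x, \theta)$ combines a deterministic pose-dependent render (standing in for landmarks) with stochastic lighting noise. It is an illustrative example of the two components, not the paper's virtual environment.

```python
# Toy observation model P(I | x, theta): deterministic render plus noise.
import numpy as np

rng = np.random.default_rng(0)

def render(x: np.ndarray, theta: float, n: int = 64) -> np.ndarray:
    """Deterministic component: a fixed function of pose (stand-in for landmarks)."""
    rows, cols = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    phase = x[0] * rows / n + x[1] * cols / n + theta
    return 0.5 + 0.5 * np.sin(2 * np.pi * phase)

def sample_observation(x: np.ndarray, theta: float, sigma: float = 0.05) -> np.ndarray:
    """Draw I ~ P(I | x, theta): render the scene, then add lighting noise."""
    image = render(x, theta) + sigma * rng.normal(size=(64, 64))
    return np.clip(image, 0.0, 1.0)

# Example: the same pose yields correlated but non-identical observations.
observation = sample_observation(np.array([0.3, 0.7]), theta=1.0)
```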
In the predictive coding problem, the agent moves along a series of points $(x_0, \theta_0), (x_1, \theta_1), \ldots, (x_t, \theta_t)$ while acquiring images $I_0, I_1, \ldots, I_t$. The motion of the agent in $\Omega$ is generated by a Markov process with transition probabilities $P(x_{t+1}, \theta_{t+1} \mid x_t, \theta_t)$. Note that the agent has access to the image observations $I_t$ but not the spatial coordinates $(x_t, \theta_t)$. Given the set $\{I_0, \ldots, I_t\}$, the agent aims to predict $I_{t+1}$. Mathematically, the image prediction problem can be solved theoretically through statistical inference by (a) inferring the posterior probability distribution $P(I_{t+1} \mid I_0, I_1, \ldots, I_t)$ from observations. Then, (b) given a specific sequence of observed images $\{I_0, \ldots, I_t\}$, the agent can predict the next image by finding the image $I_{t+1}$ that maximizes the posterior probability distribution $P(I_{t+1} \mid I_0, I_1, \ldots, I_t)$.
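Written out explicitly, step (b) is a maximum a posteriori (MAP) prediction over images:

$$\hat{I}_{t+1} = \operatorname*{arg\,max}_{I_{t+1}} P(I_{t+1} \mid I_0, I_1, \ldots, I_t).$$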
The posterior probability distribution $P(I_{t+1} \mid I_0, I_1, \ldots, I_t)$ is, by definition,

$$P(I_{t+1} \mid I_0, I_1, \ldots, I_t) = \frac{P(I_0, I_1, \ldots, I_t, I_{t+1})}{P(I_0, I_1, \ldots, I_t)}.$$
(Footnote: the neural network is a feedforward deep neural network trained using backpropagation, i.e., gradient descent, rather than a Helmholtz machine [41,42], which is commonly used in predictive coding.)
Figure 2. The predictive coding neural network constructs an implicit spatial map. a-b, The predictive coder's latent space encodes accurate spatial positions: a neural network predicts the spatial location from the predictive coder's latent space. a, A heatmap of the errors between the actual positions and the positions predicted from the latent space shows low prediction error. b, A histogram of the position prediction errors likewise shows low error; as a baseline, a noise model ($\sigma = 1$ lattice unit) obtained by displacing the actual positions with small noise is shown. c, The predictive coder's latent distances recover the environment's spatial metric: sequential visual images are mapped to the network's latent space, and the latent-space distances ($\ell_2$) are plotted against physical distances in a joint density plot. A nonlinear regression model of the form $y = a \log x + b$ is shown as a baseline. d, A correlation plot and a quantile-quantile plot show the overlap between the empirical and model distributions.
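As a sketch of the kind of analysis behind panels a-c: a small regression model decodes position from the latent space, and latent $\ell_2$ distances are compared against physical distances. The scikit-learn estimator and the assumed `latents`/`positions` arrays are illustrative stand-ins, not the authors' pipeline.

```python
# Sketch of the Figure 2 analyses: position decoding and latent-vs-physical
# distances. `latents` (T, d) and `positions` (T, 2) are assumed to come from
# running the trained encoder along a recorded trajectory.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# a-b: predict spatial location from the predictive coder's latent space.
z_train, z_test, p_train, p_test = train_test_split(latents, positions)
decoder = MLPRegressor(hidden_layer_sizes=(256,), max_iter=2000)
decoder.fit(z_train, p_train)
errors = np.linalg.norm(decoder.predict(z_test) - p_test, axis=1)

# c: do latent-space distances track physical distances?
i, j = np.triu_indices(len(latents), k=1)   # all pairs of time points
latent_d = np.linalg.norm(latents[i] - latents[j], axis=1)
physical_d = np.linalg.norm(positions[i] - positions[j], axis=1)
correlation = np.corrcoef(latent_d, physical_d)[0, 1]
```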
If we consider $P(I_0, I_1, \ldots, I_t, I_{t+1})$ to be a function of an implicit set of spatial coordinates $(x, \theta)$, where the $(x, \theta)$ provide an internal representation of the spatial environment, then we can express the posterior probability $P(I_{t+1} \mid I_0, I_1, \ldots, I_t)$ in terms of the implicit spatial representation:

$$
\begin{aligned}
P(I_{t+1} \mid I_0, I_1, \ldots, I_t)
&= \int_\Omega dx\, d\theta\; \frac{P(x_0, \theta_0, x_1, \theta_1, \ldots, x_t, \theta_t)\, P(I_0, I_1, \ldots, I_t \mid x_0, \theta_0, \ldots, x_t, \theta_t)}{P(I_0, I_1, \ldots, I_t)}\, P(x_{t+1}, \theta_{t+1} \mid x_t, \theta_t)\, P(I_{t+1} \mid x_{t+1}, \theta_{t+1}) \\
&= \int_\Omega dx\, d\theta\; \underbrace{P(x_0, \theta_0, x_1, \theta_1, \ldots, x_t, \theta_t \mid I_0, I_1, \ldots, I_t)}_{\text{encoding}}\; \underbrace{P(x_{t+1}, \theta_{t+1} \mid x_t, \theta_t)}_{\text{spatial transition probability}}\; \underbrace{P(I_{t+1} \mid x_{t+1}, \theta_{t+1})}_{\text{decoding}}
\end{aligned}
\tag{1}
$$
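To make the correspondence concrete, the sketch below maps each factor of equation (1) onto a network module. The module names and choices (a GRU standing in for the encoding posterior, linear transition and decoding heads) are ours for illustration; a trained predictive coder realizes these factors only implicitly.

```python
# Schematic correspondence between Eq. (1) and a network's forward pass.
# Each module stands in for one factor of the integrand; the names are ours.
import torch
import torch.nn as nn

class FactorizedPredictor(nn.Module):
    def __init__(self, latent_dim: int = 512):
        super().__init__()
        # encoding: P(x_0, theta_0, ..., x_t, theta_t | I_0, ..., I_t)
        self.encode = nn.GRU(latent_dim, latent_dim, batch_first=True)
        # spatial transition: P(x_{t+1}, theta_{t+1} | x_t, theta_t)
        self.transition = nn.Linear(latent_dim, latent_dim)
        # decoding: P(I_{t+1} | x_{t+1}, theta_{t+1})
        self.decode = nn.Linear(latent_dim, 3 * 64 * 64)

    def forward(self, image_latents: torch.Tensor) -> torch.Tensor:
        # image_latents: (batch, time, latent_dim) embeddings of I_0..I_t
        states, _ = self.encode(image_latents)       # infer implicit poses
        next_state = self.transition(states[:, -1])  # advance the pose one step
        return self.decode(next_state)               # predict the next image
```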