Supplementary Information
Neural networks solve predictive coding by performing maximum likelihood estimation
We can express the model distribution $p(o_t \mid o_{<t})$ as
\[
p(o_t \mid o_{<t})
= \int p(o_t, x_t, x_{<t} \mid o_{<t}) \, dx_t \, dx_{<t}
= \int p(o_t \mid x_t) \, p(x_t \mid x_{<t}) \, p(x_{<t} \mid o_{<t}) \, dx_t \, dx_{<t}
= \mathbb{E}_{x_t \sim p(x_t \mid o_{<t})}\!\left[ p(o_t \mid x_t) \right].
\]
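As a quick sanity check on this identity, the sketch below evaluates both sides on a toy discrete latent space with randomly generated distributions; the dimensions and the distributions are illustrative placeholders, not quantities from the paper.

```python
# Toy numeric check of p(o_t | o_<t) = E_{x_t ~ p(x_t | o_<t)}[p(o_t | x_t)]
# for a discrete latent position x_t; all distributions here are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_x, n_o = 4, 3                                        # latent positions, observation symbols

p_x_given_hist = rng.dirichlet(np.ones(n_x))           # p(x_t | o_<t): posterior over position
p_o_given_x = rng.dirichlet(np.ones(n_o), size=n_x)    # p(o_t | x_t): one row per position

# Marginalization: p(o_t | o_<t) = sum_x p(o_t | x) p(x | o_<t)
p_o_marginal = p_x_given_hist @ p_o_given_x

# Expectation form: E_{x ~ p(x | o_<t)}[p(o_t | x)] is the same weighted sum
p_o_expectation = np.einsum("x,xo->o", p_x_given_hist, p_o_given_x)

assert np.allclose(p_o_marginal, p_o_expectation)
print(p_o_marginal)
```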
Performing maximum likelihood estimation,
\[
\arg\max \, \mathbb{E}_{o \sim p_{\text{data}}(o)} \, p(o_1, \ldots, o_T)
= \arg\max \sum_{t=1}^{T} \mathbb{E}_{o \sim p_{\text{data}}(o)} \, \mathbb{E}_{x_t \sim p(x_t \mid o_{<t})} \, p(o_t \mid x_t).
\]
As log is a monotonic, increasing function, we can instead perform maximum log-likelihood estimation,
\[
\arg\max \, \mathbb{E}_{o \sim p_{\text{data}}(o)} \, p(o_1, \ldots, o_T)
= \arg\max \, \mathbb{E}_{o \sim p_{\text{data}}(o)} \log p(o_1, \ldots, o_T)
= \arg\max \sum_{t=1}^{T} \mathbb{E}_{o \sim p_{\text{data}}(o)} \, \mathbb{E}_{x_t \sim p(x_t \mid o_{<t})} \log p(o_t \mid x_t).
\]
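The per-step sum follows from the chain rule of probability together with the expression for $p(o_t \mid o_{<t})$ derived above; writing the step out explicitly (a standard identity, stated here for completeness):
\[
\log p(o_1, \ldots, o_T)
= \sum_{t=1}^{T} \log p(o_t \mid o_{<t})
= \sum_{t=1}^{T} \log \mathbb{E}_{x_t \sim p(x_t \mid o_{<t})}\!\left[ p(o_t \mid x_t) \right]
\;\ge\; \sum_{t=1}^{T} \mathbb{E}_{x_t \sim p(x_t \mid o_{<t})}\!\left[ \log p(o_t \mid x_t) \right],
\]
where the last step is Jensen's inequality; the bound is tight when $p(x_t \mid o_{<t})$ collapses to a point, as in the deterministic setting considered below.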
Predictive coding, which is solving for $p(o_t \mid x_t)$ and $p(x_t \mid o_{<t})$, is equivalent to estimating the data-generating distribution $p(o_1, \ldots, o_T)$.
Suppose that the agent's path and observations are deterministic. First, the agent's next position is a deterministic function of its past positions, $x_t = h(x_{<t})$. Second, the agent's sequence of past positions given its sequence of observations is given by $x_{<t} = f(o_{<t})$. We can then parameterize the model distributions $p(x_t \mid x_{<t}) = \delta_{h(x_{<t})}(x_t)$ and $p(x_{<t} \mid o_{<t}) = \delta_{f(o_{<t})}(x_{<t})$ with neural networks $o_{<t} \overset{f}{\mapsto} x_{<t}$ and $x_{<t} \overset{h}{\mapsto} x_t$. If every position has a single observation, we can parameterize the distribution $p(o_t \mid x_t)$ as $p(o_t \mid x_t) = \mathcal{N}(o_t; g(x_t), \sigma^2 I)$,
then, since $\log \mathcal{N}(o_t; g(x_t), \sigma^2 I) = -\lVert o_t - g(x_t) \rVert_2^2 / (2\sigma^2) + \text{const}$, maximum likelihood estimation becomes
\[
\arg\max \, \mathbb{E}_{o \sim p_{\text{data}}(o)} \, p(o_1, \ldots, o_T)
= \arg\max \sum_{t=1}^{T} \mathbb{E}_{o \sim p_{\text{data}}(o)} \, \mathbb{E}_{x_t \sim p(x_t \mid o_{<t})} \log p(o_t \mid x_t)
= \arg\max \sum_{t=1}^{T} \mathbb{E}_{o \sim p_{\text{data}}(o)} \log p\!\left(o_t \mid x_t = f(o_{<t})\right)
= \arg\min \, \mathbb{E}_{o \sim p_{\text{data}}(o)} \sum_{t=1}^{T} \lVert o_t - g(f(o_{<t})) \rVert_2^2 .
\]
In a neural network, the errors are propagated by gradient descent on this loss, $\nabla \lVert o_t - g(f(o_{<t})) \rVert_2^2$.
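As a concrete illustration of this objective, the sketch below minimizes $\lVert o_t - g(f(o_{<t})) \rVert_2^2$ by gradient descent with stand-in encoder and decoder networks; the layer sizes, the random data, and the use of PyTorch are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the predictive coding objective: predict the next observation
# o_t from past observations o_<t and minimize ||o_t - g(f(o_<t))||^2 by gradient descent.
# The architecture below is an illustrative stand-in, not the paper's implementation.
import torch
import torch.nn as nn

obs_dim, latent_dim, context = 64, 16, 10               # assumed sizes

f = nn.Sequential(                                       # f: o_<t -> x_t (encoder over the context)
    nn.Flatten(), nn.Linear(context * obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
)
g = nn.Sequential(                                       # g: x_t -> o_t (decoder)
    nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim)
)

optimizer = torch.optim.SGD(list(f.parameters()) + list(g.parameters()), lr=1e-2)

past_obs = torch.randn(32, context, obs_dim)             # o_<t (batch of random stand-in data)
next_obs = torch.randn(32, obs_dim)                      # o_t

for step in range(100):
    pred = g(f(past_obs))                                # g(f(o_<t))
    loss = ((next_obs - pred) ** 2).sum(dim=1).mean()    # ||o_t - g(f(o_<t))||_2^2, batch-averaged
    optimizer.zero_grad()
    loss.backward()                                      # errors propagated by gradient descent
    optimizer.step()
```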
Figure S1. The place field overlap between two locations is linearly decodable to a vector heading. a, at two different locations $x_1$ and $x_2$, there exist place field codes $z_1$ and $z_2$, respectively. The bitwise difference $z_1 - z_2$ gives the overlap between place fields at locations $x_1$ and $x_2$. We perform linear regression inputting the overlap codes $z_1 - z_2$ and predicting the vector displacement $x_1 - x_2$ between the two locations, $x_1 - x_2 = W[z_1 - z_2] + b$. The linear model forms a linear decoder from the place field code $z_1 - z_2$ to the vector displacement $x_1 - x_2$. b, the errors in predicted distance $\lVert r - \hat{r} \rVert$ (left) and predicted direction $\hat{\theta}$ (right) are decomposed from the predicted displacement $x_1 - x_2$. The linear decoder has a low prediction error for distance (<80%, ￿￿.￿￿ lattice units; mean, 7.89 lattice units) and direction (<80%, ￿￿.￿￿°; mean, 30.6°). c, residual plots show both distance (left) and direction (right) are strongly correlated with the place field code. d, in addition, the place field code is strongly correlated with the vector heading, with Pearson correlation coefficients of 0.861, 0.718, and 0.924 for vector displacement ($x_1 - x_2$), distance ($r_1 - r_2$), and direction ($\theta_1 - \theta_2$), respectively.
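A minimal sketch of this decoding step, fitting $W$ and $b$ by ordinary least squares on synthetic stand-in arrays; the dimensions and data are placeholders, not the paper's recordings.

```python
# Fit a linear decoder from place-field code differences (z1 - z2) to displacements (x1 - x2):
# x1 - x2 ~ W [z1 - z2] + b, solved by ordinary least squares on synthetic stand-in data.
import numpy as np

rng = np.random.default_rng(0)
n_pairs, code_dim = 500, 32

dz = rng.standard_normal((n_pairs, code_dim))                # z1 - z2 for each location pair
true_W = rng.standard_normal((code_dim, 2))
dx = dz @ true_W + 0.1 * rng.standard_normal((n_pairs, 2))   # x1 - x2 (2-D displacement) with noise

# Append a column of ones so the intercept b is fit along with W.
A = np.hstack([dz, np.ones((n_pairs, 1))])
coef, *_ = np.linalg.lstsq(A, dx, rcond=None)
W, b = coef[:-1], coef[-1]

pred_dx = dz @ W + b
distance_error = np.abs(np.linalg.norm(pred_dx, axis=1) - np.linalg.norm(dx, axis=1))
print(distance_error.mean())
```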
Figure S2. Without self-attention, the predictive coder encodes less accurate spatial information. a-b, self-attention in the predictive coder captures sequential information. To determine whether the temporal information is crucial to build an accurate spatial map, a neural network predicts the spatial location from the predictive coding's latent space without self-attention. a, a heatmap of the predictive coder's prediction error shows a low accuracy in many regions. b, the histogram of prediction errors shows a similarly high prediction error to the auto-encoder's.
The predictive coding neural network requires self-attention for an accurate environmental
map
The predictive coder architecture has three modules: encoder, self-attention, and decoder. The encoder and
decoder act on single images—similar to how a tokenizer in language models transforms single words to vectors.
The self-attention operates on image sequences to capture sequential patterns. If the predictive coder uses
temporal structure to build a spatial map, then the self-attention should build the spatial map—not the encoder
or decoder. In this section, we show that the temporal structure is required to build an accurate map of the
environment.
Here we validate that self-attention is necessary to build an accurate map. First, we take the latent units encoded by the predictive coder's encoder (without the self-attention). We then train a separate neural network to predict the actual spatial position given the latent unit. The accuracy of the predicted positions provides a lower bound on the spatial information given by the latent space of the predictive coder's encoder. The heatmap (left) visualizes the errors given different positions in the environment. The histogram of the prediction errors (right) provides a comparison between the latent spaces of the predictive coder, the predictive coder without self-attention, and the auto-encoder. Without self-attention, the predictive coder predicts its position with a much higher error, similar to the auto-encoder.
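A sketch of this readout procedure, assuming the latent units and ground-truth positions have already been extracted into arrays; the network size, optimizer, and random stand-in data are illustrative assumptions.

```python
# Train a separate readout network to predict 2-D position from predictive-coder latents,
# then use its held-out error as a lower bound on the spatial information in the latent space.
import torch
import torch.nn as nn

latent_dim = 16                                      # assumed latent size
latents = torch.randn(2000, latent_dim)              # stand-in latents (encoder output, no self-attention)
positions = torch.randn(2000, 2)                     # stand-in ground-truth (x, y) positions

readout = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(readout.parameters(), lr=1e-3)

train, test = slice(0, 1600), slice(1600, 2000)
for epoch in range(200):
    pred = readout(latents[train])
    loss = ((pred - positions[train]) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    errors = (readout(latents[test]) - positions[test]).norm(dim=1)  # per-position prediction error
print(errors.mean().item())
```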
Figure S3. As the number of past observations increases, the predictive coder's positional prediction error decreases. a-c, the predictive coder trains with one, five, and ten past observations, respectively. To determine how much temporal information is crucial to build an accurate spatial map, a neural network predicts the spatial location from the predictive coding's latent space. a, with only one past observation, a heatmap of the predictive coder's prediction error shows a high error in many regions. b, with five past observations, the prediction error reduces in many regions. c, with ten past observations, the prediction error is reduced below 7.3 lattice units for the majority (>80%) of positions. d, as the number of past observations goes to zero, the histogram of prediction errors converges to the auto-encoder's prediction error.
The continuity and number of past observations determine the environmental map's accuracy
The predictive coder's architecture contains three modules: the encoder, the self-attention, and the decoder. As discussed in the previous section, the predictive coder requires self-attention to learn an accurate spatial map: the observations' temporal information is crucial to build a map of the environment. A question that arises is how much temporal information the predictive coder requires to build an accurate map. In this section, we show that the predictive coder's spatial prediction error decreases as the continuity and number of past observations increase.
First, we take the latent units encoded by the predictive coder's encoder trained on differing numbers of past observations ($T = 1, 5, 10$). We then train a separate neural network to predict the actual spatial position given the latent unit. The accuracy of the predicted positions provides a lower bound on the spatial information given by the latent space of the predictive coder's encoder. The heatmap (Figure S3(a, b, c)) visualizes the errors given different positions in the environment. With only one past observation, a heatmap of the predictive coder's prediction error shows a high error in many regions. With five past observations, the prediction error reduces in many regions. With ten past observations, the prediction error is reduced below 7.3 lattice units for the majority (>80%) of positions. As the number of past observations goes to zero, the histogram of prediction errors converges to the auto-encoder's prediction error (Figure S3(d)).
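For reference, a small sketch of how such per-position error summaries can be tabulated across training conditions; the error values below are synthetic stand-ins, not the paper's measurements.

```python
# Summarize per-position prediction errors for coders trained with T = 1, 5, and 10 past
# observations, plus an auto-encoder baseline; the error arrays below are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
errors = {                                   # per-position errors in lattice units (made up)
    "T=1": rng.gamma(4.0, 4.0, size=1000),
    "T=5": rng.gamma(3.0, 3.0, size=1000),
    "T=10": rng.gamma(2.0, 2.5, size=1000),
    "auto-encoder": rng.gamma(4.5, 4.0, size=1000),
}

for name, err in errors.items():
    frac = (err < 7.3).mean()                # fraction of positions with error below 7.3 lattice units
    print(f"{name}: mean error {err.mean():.1f} lattice units, {frac:.0%} below 7.3")
```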