Unsupervised deep learning identifies semantic disentanglement in single inferotemporal neurons

Irina Higgins 1,†,∗, Le Chang 2,3,†, Victoria Langston 1, Demis Hassabis 1,4, Christopher Summerfield 1,5,‡, Doris Tsao 2,6,‡, Matthew Botvinick 1,4,‡

1 DeepMind, London, UK; 2 Caltech, Pasadena, USA; 3 Chinese Academy of Sciences, Shanghai, China; 4 University College London, London, UK; 5 University of Oxford, Oxford, UK; 6 Howard Hughes Medical Institute, Pasadena, USA

∗ To whom correspondence should be addressed; e-mail: irinah@google.com
† Equal contribution; ‡ Equal contribution
Deep supervised neural networks trained to classify objects have emerged as popular models of computation in the primate ventral stream. These models represent information with a high-dimensional distributed population code, implying that inferotemporal (IT) responses are also too complex to interpret at the single-neuron level. We challenge this view by modelling neural responses to faces in the macaque IT with a deep unsupervised generative model, the β-VAE. Unlike deep classifiers, the β-VAE "disentangles" sensory data into interpretable latent factors, such as gender or hair length. We found a remarkable correspondence between the generative factors discovered by the model and those coded by single IT neurons. Moreover, we were able to reconstruct face images using the signals from just a handful of cells. This suggests that the ventral visual stream may be optimising the disentangling objective, producing a neural code that is low-dimensional and semantically interpretable at the single-unit level.
Introduction
In search of a basic unit of representation in the neocortex. What is the basic unit of representation in the neocortex, and what computational objective gives rise to it? The foundational "neuron doctrine" argues that single cells are the key building blocks of brain function [1], and decades of extracellular single-neuron recordings have defined canonical coding principles, such as the sensitivity of early visual neurons to oriented contours and of more anterior ventral stream neurons to complex objects and faces [2, 3]. More recently, however, advances in recording methods have permitted the simultaneous recording of large populations of neurons [4]. Hand-in-hand with this innovation has come the idea that meaningful variables (e.g. the gender of a face) are encoded not in single neurons but in neural populations [4–6], and that previously reported one-to-one mappings between the declarative aspects of the external world and single neurons may be spurious or misleading [6].
arXiv:2006.14304v1 [q-bio.NC] 25 Jun 2020

In parallel, visual neuroscience moved beyond handcrafted computational models towards theories that emphasise representation learning through end-to-end optimisation [7, 8]. When trained with high-density teaching signals, contemporary deep networks can outperform humans on multiway object recognition tasks [9], and in doing so form high-dimensional representations that are multiplexed over many simulated neurons. Examined at the population level, these tuning distributions closely resemble those in biological systems [10], especially in higher-performing networks [11], allowing deep learning networks to make accurate predictions about neural responses to synthesised images [12]. A natural synergy has thus arisen between new tools for multivariate population encoding and new computational theories that assume that the tuning properties of a single unit are all but uninterpretable [5–7].
Disentangled representation learning through self-supervision. An important challenge for theories that rely on deep supervised networks, however, is that external teaching signals are scarce in the natural world, and visual development relies heavily on untutored statistical learning [13–15]. Building on this intuition, one longstanding hypothesis [16, 17] is that the visual system uses self-supervision to recover the semantically interpretable latent structure of sensory signals, such as the shape or size of an object, or the gender or age of a face image. While appearing deceptively simple and intuitive to humans, such interpretable structure has proven hard to recover in practice, since it forms a highly complex non-linear transformation of pixel-level inputs. Recent advances in machine learning, however, have offered an implementational blueprint for this theory with the advent of deep self-supervised generative models that learn to "disentangle" high-dimensional sensory signals into meaningful factors of variation. One such model, known as the beta-variational autoencoder (β-VAE), learns to faithfully reconstruct sensory data from a low-dimensional embedding whilst being additionally regularised in a way that encourages individual network units to code for semantically meaningful variables, such as the colour of an object, the gender of a face, or the arrangement of a scene (Fig. 1a-c) [18–20]. These deep generative models thus continue the longstanding tradition in the neuroscience community of building self-supervised models of vision [21, 22], while moving in a new direction that allows strong generalisation, imagination, abstract reasoning, compositional inference and other hallmarks of biological visual cognition [19, 23–25].
Results
How well do single disentangled latent units explain the responses of single neurons? If the computations employed in biological sensory systems resemble those employed by this class of deep generative model to disentangle the visual world, then contrary to the "population doctrine" [4–6], the tuning properties of single neurons should map readily onto the meaningful latent units discovered by the β-VAE. Here, we tested this hypothesis, drawing on a previously published dataset [26] of neural recordings from 159 neurons in macaque face area AM, made whilst the animals viewed 2,100 natural face images (Fig. 2a, see Online Methods). Using face perception as the test domain for understanding whether IT neurons may employ disentangling learning mechanisms similar to those of the deep generative models has unique advantages. Specifically, both neural responses and image statistics in this domain have been particularly well studied compared to other visual stimulus classes. This allows for comparisons with strong hand-engineered baselines [3] using relatively densely sampled neural data [27]. Furthermore, although faces make up a small subset of all possible visual objects, and neurons that preferentially respond to faces tend to cluster in particular patches of the inferotemporal (IT) cortex [27], the computational mechanisms and basic units of representation employed for face processing may in fact generalise more broadly within the ventral visual stream [27, 28].

Preprint. Under review.
We first investigated whether the variation in average spike rates of any of the individual recorded neurons was strongly explained by the activity in single units of a trained β-VAE that learnt to "disentangle" the same face dataset that was presented to the primates. For illustration, in Fig. 1c we show faces that were generated (or "imagined") by such a β-VAE. Each row of faces is produced by gradually varying the output of a single network unit (we call these "latent units"), and it can be seen that the units learnt to encode interpretable variables, e.g. hairstyle, age, face shape, or emotional variables such as the presence of a smile. Individual disentangled units discovered by the β-VAE were also able to explain the response variance in single recorded neurons, as shown in Fig. 2b. For example, neuron 95 is shown to be sensitive to the thickness of the hair, and neuron 136 is shown to respond differentially to the presence of a smile.
To quantify this effect, we used a metric recently proposed in the machine learning literature, referred to here as neural "alignment" (see footnote 1), which measures the extent to which variance in each neuron's firing rate can be explained by a single latent unit [29], but is insensitive to the converse, i.e. whether a single unit predicts the response of many biological neurons (Fig. 3a, see Online Methods). High alignment scores thus indicate that a neural population is intrinsically low-dimensional, with the factors of variation mapping onto the variables discovered by the latent units of the neural network. We first compared alignment scores between the β-VAE and the monkey data to a theoretical ceiling, which was obtained by subsampling the neural data to match the intrinsic dimensionality of the β-VAE latent representation (see Online Methods) and computing its alignment with itself (Fig. 3b). Remarkably, alignment scores in the β-VAE met this ceiling, with no reliable difference between the two estimates obtained when the analysis was repeated on multiple subsamples and with multiple network instances (p = 0.43, Welch's t-test). Furthermore, when we repeated this analysis while computing alignment against fictitious neural responses obtained by linearly recombining the original neural data, we found a significant drop in scores for both the β-VAE and the neural subsets (Fig. 3c, p < 0.01, Welch's t-test), indicating that the individual disentangled units discovered by the β-VAE map significantly better onto the responses of single neurons recorded from macaque IT than onto their linear combinations.
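To make the alignment idea concrete, the following is a minimal illustrative sketch, not the paper's actual code: it scores each neuron by how concentrated its squared correlations with the latent units are, using one minus the normalised entropy of the per-neuron importance distribution. The function name and the entropy-based scoring are our assumptions, in the spirit of the completeness/compactness measures [29, 30]; the exact definition is in Online Methods.

```python
import numpy as np

def alignment_scores(latents, neurons, eps=1e-12):
    """Illustrative per-neuron alignment score.

    latents: (n_stimuli, n_latents) model unit activations
    neurons: (n_stimuli, n_neurons) firing rates
    Returns one score in [0, 1] per neuron: near 1 when the neuron's
    variance is captured by a single latent unit, near 0 when it is
    spread evenly over all units.
    """
    # importance of latent j for neuron i: squared Pearson correlation
    lz = (latents - latents.mean(0)) / (latents.std(0) + eps)
    nz = (neurons - neurons.mean(0)) / (neurons.std(0) + eps)
    r2 = (lz.T @ nz / len(lz)) ** 2             # (n_latents, n_neurons)
    p = r2 / (r2.sum(0, keepdims=True) + eps)   # normalise per neuron
    entropy = -(p * np.log(p + eps)).sum(0)
    return 1.0 - entropy / np.log(r2.shape[0])  # 1 - normalised entropy
```

A neuron driven by one latent unit scores near the ceiling, while a neuron driven by an even mixture of units scores near zero.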
The extent to which the β-VAE is effective in disentangling a dataset into its latent factors can vary substantially with the way it is regularised, as well as with randomness in its initialisation and training conditions [31]. The parameter after which the network class is named determines the weight of a regularisation term that aims to keep the latent factors independent. Networks with higher values of β thus typically give rise to more disentangled representations, as measured by a metric known as the unsupervised disentanglement ranking (UDR, see Online Methods) [32], a finding we replicate here. However, we also found that networks with higher UDR scores additionally had higher alignment scores with the neural data (Fig. 3d), and that this relationship held for networks with the same and different values of β (Fig. 3e). In other words, the better the network was able to disentangle the latent factors in the face dataset, the more those factors were expressed in single neurons recorded from macaque IT.

Footnote 1: Two versions of the same measure were simultaneously and independently proposed in the machine learning literature, referred to as "completeness" [29] or "compactness" [30]. We choose to refer to the same measure as "alignment" for more intuitive exposition.
No single aspect of the disentanglement objective is sufficient to achieve high alignment with neural responses. Next, we compared the β-VAE alignment scores with those of a number of rival models. These baseline models were carefully chosen to disambiguate the role played by the different aspects of the β-VAE design and training in explaining the coding of neurally aligned variables in its single latent units (see Online Methods). We included a state-of-the-art deep supervised network (VGG [33]) that has previously been proposed as a good model for comparison against neural data in face recognition tasks [34, 35], other generative models, such as a basic autoencoder (AE) [36] and a variational autoencoder (VAE) [37], as well as baselines provided by ICA, PCA, and a classifier which used only the encoder from the β-VAE. We defined "latent units" as those emerging in the deepest layers of these networks and, where appropriate, used PCA or feature subsampling (e.g. for raw VGG) to equate the dimensionality of the latent units (to ≤ 50) to provide a fair comparison with the β-VAE. We also compared the β-VAE to the "gold standard" provided by the previously published active appearance model (AAM) [3], which produced a low-dimensional code that explained the responses of single neurons to face images well [3, 26]. Unlike the β-VAE, which relied on a general learning mechanism to discover its latent units, the AAM relied on a manual process idiosyncratic to the face domain. Hence, the β-VAE provides a learning-based counterpart to the handcrafted AAM units that could generalise beyond the domain of faces. Although the baselines considered varied in their average alignment scores (Fig. 4a), none approached those of the β-VAE, for which alignment was statistically higher than for every other model (all p-values < 0.01, Welch's t-test). The alignment scores broken down by individual neurons are plotted in Fig. 4b for the β-VAE and its baselines.
We validated the findings above using a more direct metric for the coding of latent factors in single neurons, which compared the ratio between the maximum correlation between spike rates and activations in each latent unit, and the sum of such correlations over the model units (average correlation ratio in Fig. 5a, see Online Methods). This ratio was higher for the β-VAE than for other models, confirming the results obtained with alignment scores (Fig. 5b). Interestingly, different neurons did not tend to covary with the same β-VAE latent unit. In fact, there was more heterogeneity among β-VAE units that achieved maximum correlation with the neural responses than among the equivalent units for other models (Fig. 5c). Rich heterogeneity in the response properties of single neurons (or latent units) is exactly what would be desired to enable a population of computational units to encode the rich variation in the image dataset.
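The correlation-ratio idea above can be sketched directly; the function names below are ours and the sketch omits details given in Online Methods:

```python
import numpy as np

def correlation_ratios(latents, neurons, eps=1e-12):
    """Ratio of the maximum |correlation| between a neuron and any
    latent unit to the sum of its |correlations| over all units.
    Close to 1 when a single unit dominates the neuron's tuning."""
    lz = (latents - latents.mean(0)) / (latents.std(0) + eps)
    nz = (neurons - neurons.mean(0)) / (neurons.std(0) + eps)
    c = np.abs(lz.T @ nz) / len(lz)   # (n_latents, n_neurons)
    return c.max(0) / (c.sum(0) + eps)

def best_units(latents, neurons, eps=1e-12):
    """Index of the latent unit maximally correlated with each neuron,
    used to assess heterogeneity of unit-neuron matches."""
    lz = (latents - latents.mean(0)) / (latents.std(0) + eps)
    nz = (neurons - neurons.mean(0)) / (neurons.std(0) + eps)
    return np.abs(lz.T @ nz).argmax(0)
```

The heterogeneity observation corresponds to `best_units` returning many distinct indices across neurons rather than the same unit repeatedly.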
Taken together, these results suggest that no one feature of the β-VAE, whether its architecture (baselined by the AE, VAE and classifier), training data distribution (baselined by VGG) or isolated aspects of its learning objective (baselined by PCA and ICA), was sufficient to explain the coding of neurally aligned latent variables in single units. Rather, it was all of these design choices together that allowed the β-VAE to learn a set of disentangled latent units that explained the responses of single neurons so well.
Disentangled units carry sufficient information to decode previously unseen faces from as few as twelve neurons. Finally, we conducted an analysis that sought to link the virtues of the β-VAE as a tool in machine learning, namely its capacity to make strong inferences about held-out data, with the qualities emphasised here as a theory of visual cognition, namely strong one-to-one alignment between individual neurons and individual disentangled latent units. During training we omitted 62 faces that had been viewed by the monkeys from the training set of the β-VAE, allowing us to verify that these were reconstructed more faithfully by the β-VAE than by other networks. Critically, in order to reconstruct these faces, we applied the decoder of the β-VAE not to its latent units as inferred by its encoder, but rather to the latent unit responses predicted from the activity of a small subset of single neurons (as few as twelve) that best aligned with each model unit on a different subset of data (Fig. 6a, see Online Methods). We found that such one-to-one decoding of latent units from the corresponding single neurons was significantly more accurate for the disentangled latent units learnt by the β-VAE than for the latent units learnt by the other baseline models (all p-values < 0.01, Welch's t-test) (Fig. 6b). Furthermore, we visualised the β-VAE reconstructions decoded from just twelve matching neurons (Fig. 6c). Qualitatively, these appeared both more identifiable and of higher image quality than those produced by the latent units decoded from the nearest rival models, the AE and the basic VAE, which required twice as many neurons for decoding (Fig. 6c). It should be noted that the AE was explicitly optimised for reconstruction quality, while the β-VAE was optimised for disentangling. These results suggest that both the small subset of just twelve neurons and the corresponding twelve disentangled units carried sufficient information to decode previously unseen faces, a capacity required for effective vision in an unpredictable and ever-changing natural world.
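The one-to-one decoding step can be sketched as follows, under our own simplifying assumptions (single best-correlated neuron per unit, 1-D linear regression; the paper's matching procedure is described in Online Methods). The predicted latents would then be passed to the β-VAE decoder, which is not reproduced here.

```python
import numpy as np

def fit_one_to_one(latents, neurons):
    """For every latent unit, pick the single neuron that best predicts
    it on held-in data and fit a 1-D linear regression neuron -> unit.
    Returns (neuron index, slope, intercept) per latent unit."""
    maps = []
    for j in range(latents.shape[1]):
        # correlation of every neuron with latent unit j
        r = np.array([np.corrcoef(neurons[:, i], latents[:, j])[0, 1]
                      for i in range(neurons.shape[1])])
        i = int(np.nanargmax(np.abs(r)))
        slope, intercept = np.polyfit(neurons[:, i], latents[:, j], 1)
        maps.append((i, slope, intercept))
    return maps

def decode_latents(maps, neurons):
    """Predict each latent unit from its matched neuron alone; the
    result can be fed to a decoder network to reconstruct faces."""
    return np.stack([slope * neurons[:, i] + intercept
                     for i, slope, intercept in maps], axis=1)
```

With twelve latent units, only twelve neurons (one per unit) enter the reconstruction, which is the sense in which the decoding is one-to-one.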
Discussion
Disentangled representation learning as a predictive computational model of vision. The results we have presented here provide evidence that the code for facial identity in the primate IT may in fact be low-dimensional and interpretable at the single-neuron level. In particular, we showed that the axes of variation represented by single IT neurons align with single semantically meaningful "disentangled" latent units discovered by the β-VAE, a recent class of self-supervised deep neural networks proposed in the machine learning community.
Our work extends recent studies of the coding properties of single neurons in the primate face patch area, reporting one-to-one correspondences between model units and neurons, as opposed to the few-to-one correspondences previously reported [3]. Moreover, we show that disentangling may occur at the end of the ventral visual stream (IT), extending results recently reported for V1 [38]. Past studies have proposed that the ventral visual cortex may disentangle [17, 38] and represent visual information with a low-dimensional code [3, 39, 40]. However, this work did not ask how these representations emerge via learning. Here, we propose a theoretically grounded [41] computational model (the β-VAE) for how disentangled, low-dimensional codes may be learnt from the statistics of visual inputs [18].
An important aspect of our proposed learning mechanism is that it generalises beyond the domain of faces [18–20]. We believe that the past difficulty in identifying interpretable codes in the IT may have arisen because semantically meaningful axes of variation of complex visual objects are more challenging for humans to define (and hence use as visual probes) than simple features, such as visual edges [14]. A computational model like the β-VAE, on the other hand, is able to automatically discover disentangled latent units that align with such axes, as demonstrated for the domain of faces in this work. Hence, assuming that the computational mechanisms underlying face perception in the brain generalise to the broader set of visual domains [27, 28], the β-VAE may serve as a promising tool for understanding IT codes at the single-neuron level even for rich and complex visual stimuli in the future.
One contribution of this paper is the introduction of novel measures for comparing neural and model representations. Unlike other commonly used representation comparison methods, which are insensitive to invertible linear transformations [11, 42], our methods measure the alignment between individual neurons and model units. Hence, they do not abstract away the representational form, and they preserve the ability to discriminate between alternative computational models that may otherwise score similarly [15].
While the development of the β-VAE for learning disentangled representations was originally guided by high-level neuroscience principles [43–45], subsequent work demonstrating the utility of such representations for intelligent behaviour was primarily done in the machine learning community [23–25]. In line with the rich history of mutually beneficial interactions between neuroscience and machine learning [46], we hope that the latest insights from machine learning may now feed back to the neuroscience community to investigate the merit of disentangled representations for supporting intelligence in biological systems, in particular as the basis for abstract reasoning [47], or for generalisable and efficient task learning [48].
Acknowledgments
We would like to thank Raia Hadsell, Zeb Kurth-Nelson and Koray Kavukcuoglu for comments on the manuscript.
References
1. Barlow, H. B. Single units and sensation: A neuron doctrine for perceptual psychology? Perception 1, 371–394 (1972).
2. Hubel, D. H. & Wiesel, T. N. Receptive fields of single neurones in the cat's striate cortex. J. Physiol. 124, 574–591 (1959).
3. Chang, L. & Tsao, D. Y. The code for facial identity in the primate brain. Cell 169, 1013–1028 (2017).
4. Saxena, S. & Cunningham, J. Towards the neural population doctrine. Curr. Opin. Neurobiol. 55, 103–111 (2019).
5. Eichenbaum, H. Barlow versus Hebb: When is it time to abandon the notion of feature detectors and adopt the cell assembly as the unit of cognition? Neurosci. Lett. 680, 88–93 (2018).
6. Yuste, R. From the neuron doctrine to neural networks. Nat. Rev. Neurosci. 16, 487–497 (2015).
7. Richards, B. A. et al. A deep learning framework for neuroscience. Nat. Neurosci. 22, 1546–1726 (2019).
8. Yamins, D. L. K. & DiCarlo, J. J. Using goal-driven deep learning models to understand sensory cortex. Nat. Neurosci. 19, 356–365 (2016).
9. He, K., Zhang, X., Ren, S. & Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. ICCV (2015).
10. Khaligh-Razavi, S. & Kriegeskorte, N. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Comput. Biol. 10 (2014).
11. Yamins, D. L. K. et al. Performance-optimized hierarchical models predict neural responses in higher visual cortex. PNAS 111, 8619–8624 (2014).
12. Bashivan, P., Kar, K. & DiCarlo, J. J. Neural population control via deep image synthesis. Science 364 (2019).
13. Slone, L. & Johnson, S. Infants' statistical learning: 2- and 5-month-olds' segmentation of continuous visual sequences. J. Exp. Child Psychol. 133, 47–56 (2015).
14. Lindsay, G. Convolutional neural networks as a model of the visual system: Past, present, and future. J. Cogn. Neurosci. 1–15 (2020).
15. Thompson, J. A. F., Bengio, Y., Formisano, E. & Schönwiesner, M. How can deep learning advance computational modeling of sensory information processing? NeurIPS Workshop on Representation Learning in Artificial and Biological Neural Networks (2016).
16. Bengio, Y., Courville, A. & Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828 (2013).
17. DiCarlo, J., Zoccolan, D. & Rust, N. How does the brain solve visual object recognition? Neuron 73, 415–434 (2012).
18. Higgins, I. et al. β-VAE: Learning basic visual concepts with a constrained variational framework. ICLR (2017).
19. Burgess, C. P. et al. MONet: Unsupervised scene decomposition and representation. arXiv (2019). URL https://arxiv.org/abs/1901.11390.
20. Lee, W., Kim, D., Hong, S. & Lee, H. High-fidelity synthesis with disentangled representation. arXiv (2020). URL https://arxiv.org/abs/2001.04296.
21. Fukushima, K. A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 193–202 (1980).
22. Riesenhuber, M. & Poggio, T. Hierarchical models of object recognition in cortex. Nat. Neurosci. 2, 1019–1025 (1999).
23. Higgins, I. et al. DARLA: Improving zero-shot transfer in reinforcement learning. ICML 70, 1480–1490 (2017).
24. Higgins, I. et al. SCAN: Learning hierarchical compositional visual concepts. ICLR (2018).
25. Achille, A. et al. Life-long disentangled representation learning with cross-domain latent homologies. NeurIPS 9873–9883 (2018).
26. Chang, L., Egger, B., Vetter, T. & Tsao, D. Y. What computational model provides the best explanation of face representations in the primate brain? bioRxiv (2020).
27. Tsao, D. Y. & Livingstone, M. S. Mechanisms of face perception. Annu. Rev. Neurosci. 31, 411–437 (2008).
28. Tarr, M. J. & Gauthier, I. FFA: A flexible fusiform area for subordinate-level visual processing automatized by expertise. Nat. Neurosci. 3, 764–769 (2000).
29. Eastwood, C. & Williams, C. K. I. A framework for the quantitative evaluation of disentangled representations. ICLR (2018).
30. Ridgeway, K. & Mozer, M. C. Learning deep disentangled embeddings with the F-statistic loss. NeurIPS 31, 185–194 (2018).
31. Locatello, F. et al. Challenging common assumptions in the unsupervised learning of disentangled representations. ICML 97, 4114–4124 (2019).
32. Duan, S. et al. Unsupervised model selection for variational disentangled representation learning. ICLR (2020).
33. Parkhi, O., Vedaldi, A. & Zisserman, A. Deep face recognition. BMVC (2015).
34. Grossman, S. et al. Convergent evolution of face spaces across human face-selective neuronal groups and deep convolutional networks. Nat. Commun. 10, 4934 (2019).
35. Dobs, K., Isik, L., Pantazis, D. & Kanwisher, N. How face perception unfolds over time. Nat. Commun. 10, 1258 (2019).
36. Hinton, G. E. & Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
37. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. ICLR (2014).
38. Gáspár, M. E., Polack, P.-O., Golshani, P., Lengyel, M. & Orbán, G. Representational untangling by the firing rate nonlinearity in V1 simple cells. eLife 8 (2019).
39. de Beeck, H. O., Wagemans, J. & Vogels, R. Inferotemporal neurons represent low-dimensional configurations of parameterized shapes. Nat. Neurosci. 4, 1244–1252 (2001).
40. Kayaert, G., Biederman, I., de Beeck, H. P. O. & Vogels, R. Tuning for shape dimensions in macaque inferior temporal cortex. Eur. J. Neurosci. 22, 212–224 (2005).
41. Higgins, I. et al. Towards a definition of disentangled representations. Theoretical Physics for Deep Learning Workshop, ICML 2019 (2018).
42. Kriegeskorte, N., Mur, M. & Bandettini, P. Representational similarity analysis: connecting the branches of systems neuroscience. Front. Syst. Neurosci. 2, 1662–5137 (2008).
43. Wood, J. N. & Wood, S. M. W. The development of invariant object recognition requires visual experience with temporally smooth objects. J. Physiol. 1–16, 1391–1406 (2018).
44. Smith, L. B., Jayaraman, S., Clerkin, E. & Yu, C. The developing infant creates a curriculum for statistical learning. Trends Cogn. Sci. 22, 325–336 (2018).
45. Friston, K. The free-energy principle: a unified brain theory? Nat. Rev. Neurosci. 11, 127–138 (2010).
46. Hassabis, D., Kumaran, D., Summerfield, C. & Botvinick, M. Neuroscience-inspired artificial intelligence. Neuron 95, 245–258 (2017).
47. Bellmund, J. L. S., Gärdenfors, P., Moser, E. I. & Doeller, C. F. Navigating cognition: Spatial codes for human thinking. Science 362 (2018).
48. Niv, Y. Learning task-state representations. Nat. Neurosci. 22, 1544–1553 (2019).
49. Martinez, A. & Benavente, R. AR face database. CVC Technical Report 24 (1998).
50. Liu, Z., Luo, P., Wang, X. & Tang, X. Deep learning face attributes in the wild. ICCV (2015).
51. Ma, D. S., Correll, J. & Wittenbrink, B. The Chicago face database: A free stimulus set of faces and norming data. Behav. Res. Methods 47, 1122–1135 (2015).
52. Peer, P. CVL face database. Computer Vision Laboratory, University of Ljubljana, Slovenia (1999).
53. Phillips, P., Wechsler, H., Huang, J. & Rauss, P. The FERET database and evaluation procedure for face recognition algorithms. Image Vision Comput. 16, 295–306 (1998).
54. Strohminger, N. et al. The MR2: A multi-racial mega-resolution database of facial stimuli. Behav. Res. Methods 48, 1197–1204 (2016).
55. Gao, W. et al. The CAS-PEAL large-scale Chinese face database and baseline evaluations. IEEE Trans. Syst. Man Cybern. B Cybern. 38 (2008).
56. Rezende, D. J., Mohamed, S. & Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. ICML 32, 1278–1286 (2014).
57. Güçlütürk, Y. et al. Reconstructing perceived faces from brain activations with deep adversarial neural decoding. NeurIPS 4246–4257 (2017).
58. Hyvärinen, A. & Oja, E. Independent component analysis: Algorithms and applications. Neural Networks 13, 411–430 (2000).
59. Hopcroft, J. E. & Karp, R. M. An n^(5/2) algorithm for maximum matchings in bipartite graphs. SIAM J. Comput. 2, 225–231 (1973).
Online methods
Dataset
We used a dataset of 2,162 natural grayscale, centered and cropped images of frontal views of faces with neutral facial expressions, pasted on a gray 200x200 pixel background, as described in [26]. The face images were collated from multiple publicly available datasets [49–55]. 62 held-out face images were randomly chosen. These faces were among the 2,100 faces presented to the primates, but not among the 2,100 faces used to train the models. All models (apart from VGG) were trained on the same set of faces, which were mirror-flipped with respect to the images presented to the primates. This ensured that the train and test data distributions were similar, but not identical. To train the Classifier baseline, we augmented the data with 5x5 pixel translations of each face to ensure that multiple instances were present for each unique face identity. The data was split into 80%/10%/10% train/validation/test sets.
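The split and the translation augmentation can be sketched as below. This is our illustrative reading of the recipe (random index split; zero-padded pixel shifts), not the authors' released code:

```python
import numpy as np

def split_indices(n, seed=0):
    """80%/10%/10% train/validation/test split over shuffled indices."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

def translate(img, dx, dy):
    """Shift a 2-D image by (dx right, dy down) pixels with zero padding,
    as one plausible form of the small-translation augmentation."""
    out = np.zeros_like(img)
    h, w = img.shape
    ys0, ys1 = max(dy, 0), h + min(dy, 0)
    xs0, xs1 = max(dx, 0), w + min(dx, 0)
    out[ys0:ys1, xs0:xs1] = img[ys0 - dy:ys1 - dy, xs0 - dx:xs1 - dx]
    return out
```

For the 2,162-image dataset this yields 1,729 training, 216 validation and 217 test images.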
Neurophysiological data
All neurophysiological data was re-used from [26]. The data was collected from two male rhesus macaques (Macaca mulatta) of 7–10 years old. Face patches were determined by identifying regions responding significantly more to faces than to non-face stimuli while the animals passively viewed images on a screen in a 3T TIM (Siemens, Munich, Germany) magnet. Tungsten electrodes (18–20 MOhm at 1 kHz, FHC) were used for single-unit recording. Spikes were sampled at 40 kHz. All spike data were re-sorted with offline spike sorting clustering algorithms (Plexon). Only well-isolated units were considered for further analysis. Monkeys were head-fixed and passively viewed the screen in a dark room. Eye position was monitored using an infrared eye tracking system (ISCAN). Juice reward was delivered every 2–4 s if fixation was properly maintained. Images were presented in random order. All images were presented for 150 ms, interleaved by 180 ms of a gray screen. Each image was presented 3–5 times. The number of spikes in a time window of 50–350 ms after stimulus onset was counted for each stimulus. See [26] for further details.
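The per-stimulus response measure (spike count in the 50–350 ms post-onset window) amounts to a simple windowed count; a minimal sketch, with our own function name:

```python
import numpy as np

def count_spikes(spike_times_s, stim_onset_s, lo=0.050, hi=0.350):
    """Count spikes falling 50-350 ms after stimulus onset, the response
    window used for each stimulus presentation (half-open window assumed)."""
    t = np.asarray(spike_times_s) - stim_onset_s
    return int(np.sum((t >= lo) & (t < hi)))
```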
Artificial neurophysiological data

In order to investigate whether the responses of β-VAE units encoded linear combinations of neural responses, we created artificial neural data by linearly recombining the responses of the real neurons. We first standardised the responses of the 159 recorded neurons across the 2,100 face images. We then multiplied the original matrix of neural responses with a random projection matrix A. Each value A_ij of the projection matrix was sampled from the unit Gaussian distribution. The absolute value of the matrix was then taken, and each column was normalised to sum to 1.
Neuron subsets

For fairer comparison with the models, which learnt latent representations of size N ∈ [10, 50] as will be described below, we sampled neural subsets with fifty or fewer neurons. To do this, we first uniformly sampled five values from N ∈ [10, 50] without replacement to indicate the size of the subsets. Then, for each size value we sampled ten random neuron subsets without replacement, resulting in 50 neuron subsets in total.
Model details

β-VAE model

We used the standard architecture and optimisation parameters introduced in [18] for training the β-VAE (Fig. 7a). The encoder consisted of four convolutional layers (32×4×4 stride 2, 32×4×4 stride 2, 64×4×4 stride 2, and 64×4×4 stride 2), followed by a 256-d fully connected layer and a 50-d latent representation. The decoder architecture was the reverse of the encoder. We used ReLU activations throughout. The decoder parametrised a Bernoulli distribution. We used the Adam optimiser with a learning rate of 1e−4 and trained the models for 1 million iterations with a batch size of 16, which was enough to achieve convergence. The models were trained to optimise the following disentangling objective:
L_{β-VAE} = E_{p(x)} [ E_{q_φ(z|x)} [log p_θ(x|z)] − β KL(q_φ(z|x) || p(z)) ]    (1)
where p(x) is the probability of the image data, q_φ(z|x) is the learnt posterior over the latent units given the data, and p(z) is the unit Gaussian prior with a diagonal covariance matrix.
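For concreteness, Eq. (1) can be evaluated in closed form for a diagonal-Gaussian posterior and a Bernoulli decoder. The numpy sketch below (ours, not the training code, which used the architecture of [18]) computes a single-sample estimate of the objective:

```python
import numpy as np

def beta_vae_objective(x, recon_logits, mu, log_var, beta=4.0):
    """Single-sample estimate of the beta-VAE objective in Eq. (1).

    x:            (batch, dim) targets in [0, 1]
    recon_logits: (batch, dim) decoder logits parametrising p_theta(x|z)
    mu, log_var:  (batch, latents) parameters of the posterior q_phi(z|x)
    """
    # Bernoulli log-likelihood log p_theta(x|z), summed over pixels:
    # x*log(sigmoid(l)) + (1 - x)*log(1 - sigmoid(l)) = x*l - log(1 + e^l)
    log_px = np.sum(x * recon_logits - np.logaddexp(0.0, recon_logits), axis=1)
    # Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
    return float(np.mean(log_px - beta * kl))  # maximised during training
```

With β = 1 this reduces to the VAE lower bound (Eq. 3); larger β penalises the KL term more heavily, which is what encourages disentangling.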
Baseline models

We compared β-VAE to a number of baselines to test whether any individual aspect of β-VAE training could account for the quality of its learnt latent units. To disambiguate the role of the learning objective, we compared β-VAE to a traditional autoencoder (AE) [36] and a basic variational autoencoder (VAE) [37, 56]. These models had the same architecture, training data, and optimisation parameters as the β-VAE (Fig. 7a), but different learning objectives. The AE optimised the following objective, which targets the quality of its reconstructions:
L_AE = E_{p(x)} ||f(x; θ, φ) − x||²    (2)
where f(x; θ, φ) is the image reconstruction produced by passing the original image through the encoder and decoder networks parametrised by φ and θ respectively. The VAE optimised the variational lower bound on the data distribution p(x):
L_VAE = E_{p(x)} [ E_{q_φ(z|x)} [log p_θ(x|z)] − KL(q_φ(z|x) || p(z)) ]    (3)
where q_φ(z|x) is the learnt posterior over the latent units given the data, and p(z) is the isotropic unit Gaussian prior.
To test whether a supervised classification objective could be a good alternative to the self-supervised disentangling objective, we compared β-VAE to two classifier neural network baselines. One of these baselines, referred to as the Classifier in all figures and text, shared the encoder architecture, the data distribution and the optimisation parameters with the β-VAE (Fig. 5b), but instead of disentangling, it was trained to differentiate between the 2,100 faces using a supervised objective. In particular, the four convolutional layers and the fully connected layer of the encoder fed into an N-dimensional representation, which was followed by 2,100 logits trained to recognise the 2,100 unique face identities. To avoid overfitting, we used early stopping; the final models trained for between 300k and 1 million training steps.
The other classifier baseline was the VGG-Face model [33] (referred to as the VGG in all figures and text), a more powerful deep network developed for state-of-the-art face recognition performance and previously chosen as an appropriate computational model for comparison against neural data in face recognition tasks [34, 35, 57] (Fig. 5c). Similarly to other work [26, 34, 35, 57], we used a standard pre-trained VGG network, trained to differentiate between 2,622 unique individuals using a dataset of 982,803 images [33]. Note that the data used for VGG training was unrelated to the 2,100 face images presented to the primates. The VGG therefore had a different architecture, training data distribution and optimisation parameters compared to the β-VAE. The model consisted of 16 convolutional layers, followed by 3 fully connected layers (see [33] for more details). The last hidden layer before the classification logits contained 4,096 units. Following the precedent set by [26] and [57], we used PCA to reduce the dimensionality of the VGG representation by projecting the activations of its last hidden layer in response to the 2,100 test faces onto the top N principal components (PCs) (Fig. 5c; referred to as VGG (PCA) in figures). Alternatively, we also randomly subsampled the units in the last hidden layer of the VGG (without replacement) to control for any potential linear mixing of their responses which PCA could plausibly introduce (Fig. 7c; referred to as VGG (raw) in figures).
To rule out the possibility that the responses of single neurons could be modelled by simply explaining the variance in the data, we compared β-VAE to the N PCs produced by applying principal component analysis (PCA) to the 2,100 faces. To rule out the possibility that β-VAE training simply finds the independent components of the data, we compared β-VAE to the N independent components discovered by independent component analysis (ICA) applied to the 2,100 face images.
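The PCA baseline amounts to projecting centred, flattened images onto their top-N principal components; a minimal numpy sketch (ours, via the SVD) is:

```python
import numpy as np

def pca_components(images, n_components):
    """Project flattened images onto their top-N principal components,
    yielding an (n_images, n_components) matrix of baseline "latent units"."""
    X = images - images.mean(axis=0)            # centre the data
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T              # PC scores per image
```

The returned score columns are mutually uncorrelated and ordered by decreasing variance.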
Finally, we also compared β-VAE to the active appearance model (AAM). Linear combinations of small numbers of its latent units (six on average) were previously reported to explain the responses of single neurons in the primate AM area well [3, 26]. We re-used the AAM latent units from [26]. These were obtained by setting 80 landmarks on each of the 2,100 facial images presented to the primates. The positions of the landmarks were normalised to calculate the average shape template. Each face was warped to the average shape using spline interpolation. The warped image was normalised and reshaped to a 1-d vector. PCA was carried out on landmark positions and shape-free intensity independently. The first N/2 shape PCs and the first N/2 appearance PCs were concatenated to produce the N-dimensional AAM representations (Fig. 7d).
Training procedure and model selection

To ensure that all models had a fair chance of learning a useful representation, we trained multiple instances of each model class using different hyperparameter settings. The choice of hyperparameters and their values depended on the model class. However, all models went through the same model selection pipeline: 1) K model instances with different hyperparameter settings were obtained as appropriate; 2) S ⊆ K models with the best performance on the training objective were selected; 3) models that did not discover any latent units that shared information with the neural responses were excluded, resulting in M ⊆ S models retained for the final analyses. These steps are expanded below for each model class.
Hyperparameter sweep

For the β-VAE model, the main hyperparameter affecting the quality of the learnt latent units is the value of β. The β hyperparameter controls the degree of disentangling achieved during training, as well as the intrinsic dimensionality of the learnt latent representation [18]. Typically β > 1 is necessary to achieve good disentangling; however, the exact value differs between datasets. Hence, we uniformly sampled 40 values of β in the [0.5, 20] range. Another factor that affects the quality of disentangled representations is the random initialisation seed used to train the models [31]. Hence, for each β value, we trained 10 models from different random initialisation seeds, resulting in a total pool of 400 trained β-VAE model instances to choose from. All β-VAE models were initialised with N = 50 latent units; however, due to the variability in the values of β, the intrinsic dimensionality of the trained models varied between ten and fifty.
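The sweep just described is a grid over sampled β values and initialisation seeds; a stdlib-only sketch (ours) of how the 400 configurations are enumerated:

```python
import random

def beta_vae_sweep_configs(n_betas=40, beta_range=(0.5, 20.0),
                           seeds_per_beta=10, seed=0):
    """Enumerate (beta, seed) pairs for the sweep: 40 uniformly sampled
    beta values x 10 initialisation seeds = 400 model configurations."""
    rng = random.Random(seed)
    betas = [rng.uniform(*beta_range) for _ in range(n_betas)]
    return [(b, s) for b in betas for s in range(seeds_per_beta)]
```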
In order to isolate the role of disentangling within the β-VAE optimisation objective from the self-supervision aspect of training, we kept as many choices as possible unchanged between the β-VAE and the AE/VAE baselines: the model architecture, optimiser, learning rate, batch size and number of training steps. The remaining free hyperparameters that could affect the quality of the AE/VAE learnt latent units were the random initialisation seeds and the number of latent units N. The latter had to be swept over explicitly, since AE and VAE models do not have an equivalent of the β hyperparameter that affects the intrinsic dimensionality of the learnt representation. Hence, we trained 100 model instances for each of the AE and VAE model classes, with five values of N sampled uniformly without replacement from N ∈ [10, 50], each trained from twenty random initialisation seed values.
For the Classifier baseline we used the following hyperparameters for the initial selection: five values of N ∈ [10, 50] sampled uniformly without replacement, as well as learning rate values {1e−3, 1e−4, 1e−5, 1e−6, 1e−7} and batch sizes {16, 64, 128, 256}, resulting in 100 model instances. We trained the models with early stopping to avoid overfitting, and used the classification performance on the validation set to choose the settings for the learning rate and batch size. We found that the values used for training β-VAE, AE and VAE (learning rate 1e−4, batch size 16) were also reasonable for training the Classifier, achieving >95% classification accuracy. Hence, we trained the final set of 450 Classifier model instances with a fixed learning rate and batch size, five values of N ∈ [10, 50] sampled uniformly without replacement, and fifty random seeds.
We used the FastICA algorithm [58] to extract ICA units; the result depends on the random initialisation seed. Hence, we extracted N ∈ [10, 50] independent components with ten random initialisation seeds each, resulting in 41 ICA model instances.
The remaining baseline models relied on a single canonical model instance (VGG and AAM) and/or on a deterministic dimensionality reduction process (PCA, AAM, VGG); hence, the random seed hyperparameter did not apply to them. In order to make a fairer comparison with the other baselines, we therefore created different model instances by extracting different numbers of representation dimensions with N ∈ [10, 50], resulting in 41 PCA and VGG (PCA) model instances, and 21 AAM instances (since N needs to split evenly into shape- and appearance-related units). For the VGG (raw) variant, we first uniformly sampled five values from N ∈ [10, 50] without replacement to indicate the size of the hidden unit subsets. Then, for each size value we sampled ten random hidden unit subsets without replacement, resulting in 50 VGG (raw) model instances in total.
Model selection based on training performance

For each model class, apart from the deterministic baselines (PCA, AAM and VGG), we selected a subset of model instances based on their training performance. For the β-VAEs, we used the recently proposed Unsupervised Disentanglement Ranking (UDR) score [32] to select 51 model instances with the most disentangled representations (within the top 15% of UDR scores) for further analysis. For the AE baseline, we selected 50 model instances with the lowest reconstruction error per chosen value of N. For the VAE baseline we selected 50 model instances with the highest lower bound on the training data distribution per chosen value of N. Finally, for the Classifier baseline, we selected 81 models which achieved >95% classification accuracy on the test set.
Filtering out uninformative models

To ensure that all models used in the final analyses shared at least some information with the recorded neural population, we performed the following filtering procedure. First, we trained Lasso regressors, as per the Variance Explained section below, to predict the responses of each neuron across the 2,100 faces from the population of latent units extracted from each trained model. We then calculated the mean variance explained (VE), averaged across all neurons, for each of the models. We filtered out every model whose mean VE fell more than one standard deviation below the mean of these scores across all models, i.e. models with VE < mean(VE) − SD(VE).
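This filtering rule is a one-liner over the per-model mean VE scores; a numpy sketch (ours):

```python
import numpy as np

def filter_models(mean_ves):
    """Boolean mask keeping models whose mean variance explained is no more
    than one standard deviation below the across-model mean."""
    mean_ves = np.asarray(mean_ves, dtype=float)
    threshold = mean_ves.mean() - mean_ves.std()
    return mean_ves >= threshold
```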
The full model selection pipeline resulted in 51 β-VAE model instances, 50 AE, VAE and ICA model instances, 41 PCA and VGG (PCA) model instances, 22 VGG (raw) model instances, 21 AAM model instances and 64 Classifier model instances that were used for further analyses.
Analysis methods

Variance explained

We used Lasso regression to predict the response of each neuron n_j from model units. We used 10-fold cross-validation with standardised units and neural responses to find the sparsest weight matrix that produced a mean squared error (MSE) between the predicted neural responses n̂_j and the real neural responses n_j no more than one standard error away from the smallest MSE obtained using 100 lambda values. The learnt weight vectors were used to predict the neural responses from model units on the test set of images. Variance explained (VE) was calculated on the test set according to the following:
VE_j = 1 − Σ_i (n̂_ij − n_ij)² / Σ_i (n_ij − n̄_j)²
where j is the neuron index, i is the test image index, and n̄_j is the mean response magnitude for neuron j across all test images. In order to speed up the Lasso regression calculations, we manually zeroed out the responses of those model units that did not carry much information about the face images. We defined units as "uninformative" if their standardised responses had low variance (σ² < 0.01) across the dataset of 2,100 faces. We verified that this did not affect the sparsity of the resulting Lasso regression weights.
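The VE formula translates directly to numpy; the sketch below (ours) computes the per-neuron score given predicted and true test-set responses, with the Lasso fit itself omitted:

```python
import numpy as np

def variance_explained(pred, real):
    """Per-neuron variance explained VE_j on the test set.

    pred, real: (n_test_images, n_neurons) predicted and true responses.
    """
    resid = np.sum((pred - real) ** 2, axis=0)              # sum_i (n_hat - n)^2
    total = np.sum((real - real.mean(axis=0)) ** 2, axis=0)  # sum_i (n - n_bar)^2
    return 1.0 - resid / total
```

A perfect prediction gives VE = 1; predicting each neuron's mean response gives VE = 0.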
Alignment score

We used the completeness score from [29], referred to as the alignment score in the text for more intuitive exposition. First, we obtained the matrix R necessary for calculating the score by training Lasso regressors to predict the responses of each neuron from the population of model latent units. When calculating completeness against the original neural responses, we followed the same procedure as for the variance explained calculations. When calculating completeness against the artificial (linearly recombined) neural responses, we did not zero out the responses of the "uninformative" units, since in this case that procedure affected the sparsity of the resulting Lasso regression weights. Instead, in order to speed up calculations, we reduced the number of cross-validation splits from ten to three. The completeness score C_j for neuron j was calculated according to the following:
C_j = ρ_j (1 − H(p_j))    (4)

H(p_j) = −Σ_d p_dj log_D p_dj    (5)

p_dj = R_dj / Σ_d R_dj    (6)

ρ_j = Σ_d R_dj / Σ_{d,j} R_dj    (7)
where j indexes over neurons, d indexes over model units, and D is the total number of model units. The overall completeness score per model is equal to the sum of the per-neuron completeness scores, C = Σ_j C_j. See [29] for more details.
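Given a non-negative importance matrix R (e.g. absolute Lasso weights), Eqs. (4)–(7) can be computed as below. This is a numpy sketch of the published formulas, not the authors' code:

```python
import numpy as np

def completeness_scores(R):
    """Per-neuron completeness C_j from R of shape (n_model_units, n_neurons)."""
    D = R.shape[0]                                 # total number of model units
    p = R / R.sum(axis=0, keepdims=True)           # Eq. (6)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p) / np.log(D), 0.0)
    H = -plogp.sum(axis=0)                         # Eq. (5), entropy in base D
    rho = R.sum(axis=0) / R.sum()                  # Eq. (7)
    return rho * (1.0 - H)                         # Eq. (4)

# The per-model score is the sum over neurons: C = completeness_scores(R).sum()
```

When each neuron loads on exactly one unit the entropy term vanishes and the per-model score is maximal; a uniformly mixed R scores zero.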
Unsupervised Disentanglement Ranking (UDR) score

The UDR score [32] measures the quality of disentanglement achieved by trained β-VAE models by performing pairwise comparisons between the representations learnt by models trained using the same hyperparameter setting but with different seeds. This approach requires no access to labels or neural data. We used the Spearman version of the UDR score described in [32]. For each trained β-VAE model we performed 9 pairwise comparisons with all other models trained with the same β value and calculated the corresponding UDR_ij score, where i and j index the two β-VAE models. Each UDR_ij score is calculated by computing the similarity matrix R_ij, in which each entry is the Spearman correlation between the responses of individual latent units of the two models. The absolute value |R_ij| of the similarity matrix is then taken, and the final score for each pair of models is calculated according to:
UDR_ij = 1/(d_a + d_b) [ Σ_b r_a² · I_KL(b) / Σ_a R(a,b) + Σ_a r_b² · I_KL(a) / Σ_b R(a,b) ]    (8)
where a and b index into the latent units of models i and j respectively, r_a = max_a R(a,b) and r_b = max_b R(a,b). I_KL indicates the "informative" latent units within each model, and d is the number of such latent units. The final score for model i is calculated by taking the median of UDR_ij across all j.
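A pairwise UDR_ij score under Eq. (8) can be sketched with rank correlations in numpy. This is our simplified re-implementation (the paper used the implementation of [32]); informative-unit masks stand in for I_KL, and ties in the rankings are ignored:

```python
import numpy as np

def udr_pair(Z_i, Z_j, informative_i, informative_j):
    """UDR_ij for two models' latent responses (Eq. 8).

    Z_i, Z_j: (n_images, n_latents) latent responses of two models trained
    with the same beta but different seeds. informative_*: boolean masks
    playing the role of I_KL.
    """
    def z_ranks(Z):
        # Ranks per column, standardised so rank-Pearson = Spearman.
        r = np.argsort(np.argsort(Z, axis=0), axis=0).astype(float)
        return (r - r.mean(0)) / r.std(0)

    Ri, Rj = z_ranks(Z_i), z_ranks(Z_j)
    R = np.abs(Ri.T @ Rj) / Z_i.shape[0]       # |Spearman|, entry (a, b)
    r_a = R.max(axis=0)                        # r_a = max_a R(a, b), per b
    r_b = R.max(axis=1)                        # r_b = max_b R(a, b), per a
    term_b = np.sum(r_a**2 * informative_j / R.sum(axis=0))
    term_a = np.sum(r_b**2 * informative_i / R.sum(axis=1))
    return float((term_a + term_b) / (informative_i.sum() + informative_j.sum()))
```

Two models whose latent units match one-to-one (up to permutation) score close to 1; entangled pairs score lower.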
Average correlation ratio and average unit proportion

For each neuron we calculated the absolute magnitude of the Pearson correlation with each of the "informative" model units. We then calculated the ratio between the highest correlation and the sum of all correlations per neuron. The ratio scores were then averaged (mean) across the set of unique model units with the highest ratios, and this formed the average correlation ratio score per model. The number of unique model units with the highest ratios divided by the total number of informative model units formed the average unit proportion score.
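Both scores can be computed from an absolute-correlation matrix between neurons and informative units. The sketch below (ours) follows one plausible reading of the averaging step, taking the highest-ratio neuron for each unique best-matching unit:

```python
import numpy as np

def correlation_ratio_scores(neural, units):
    """Average correlation ratio and average unit proportion (one reading
    of the description above; names are ours).

    neural: (n_images, n_neurons); units: (n_images, n_informative_units).
    """
    n = neural.shape[0]
    zn = (neural - neural.mean(0)) / neural.std(0)
    zu = (units - units.mean(0)) / units.std(0)
    C = np.abs(zn.T @ zu) / n                  # |Pearson r|, (neurons, units)
    ratios = C.max(axis=1) / C.sum(axis=1)     # highest / sum, per neuron
    best = C.argmax(axis=1)                    # best-matching unit per neuron
    uniq = np.unique(best)
    avg_ratio = float(np.mean([ratios[best == u].max() for u in uniq]))
    avg_unit_proportion = uniq.size / units.shape[1]
    return avg_ratio, avg_unit_proportion
```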
Decoding novel faces from single neurons

We first found the best one-to-one match between single model units and corresponding single neurons. To do this, we calculated a correlation matrix D_ij = Corr(z_i, r_j) between the responses of each model unit z_i and the responses of each neuron r_j over the subset of 2,038 face images seen by both the models and the primates, where Corr stands for the Pearson correlation. We then used the Hopcroft-Karp algorithm [59] to find the best one-to-one assignment between each model unit and a unique neuron based on the lowest overall (1 − D_ij) cost across all matchings. We used the resulting one-to-one assignments to regress the responses of single latent units from the responses of their corresponding single neurons on the held-out 62 faces, using the same subset of 2,038 face images seen by both the models and the primates to estimate the regression parameters. We standardised both model units and neural responses for the regression. The resulting predicted latent unit responses were fed into the pre-trained model decoder to obtain reconstructions of the novel faces. We calculated the cosine distance between the standardised predicted and real latent unit responses for each face (after filtering out the "uninformative" units), and report the mean scores across the 62 held-out faces for each model.
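The matching step is an assignment problem over the cost matrix (1 − D_ij). The sketch below (ours) uses scipy's Hungarian-algorithm solver as a stand-in for the Hopcroft-Karp procedure cited above; it likewise minimises the total (1 − D_ij) cost of a one-to-one matching:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_units_to_neurons(Z, R):
    """Best one-to-one assignment of model units to neurons by correlation.

    Z: (n_images, n_units) latent responses; R: (n_images, n_neurons).
    Returns (unit, neuron) index pairs minimising the total (1 - D_ij) cost.
    """
    n = Z.shape[0]
    zz = (Z - Z.mean(0)) / Z.std(0)
    rr = (R - R.mean(0)) / R.std(0)
    D = (zz.T @ rr) / n                         # Pearson correlations D_ij
    unit_idx, neuron_idx = linear_sum_assignment(1.0 - D)
    return list(zip(unit_idx, neuron_idx))
```

If each neuron is a noiseless copy of one distinct unit, the assignment exactly recovers that permutation.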
Statistical tests

We used a two-tailed Welch's t-test for all pairwise model comparisons.
[Figure 1 image: panel b shows the encoder-decoder schematic (input image → CNN encoder → latent units → FC → CNN decoder → reconstructed image); panels a and c show latent traversals (−3 to +3) over units encoding horizontal and vertical position, size, rotation, chair and leg type, and face attributes such as age, ethnicity, skin shade, fringe, hair length, hair thickness, hair colour, face shape, smile, gender, nose prominence, eye distance and eyebrow thickness.]
Figure 1: Disentangled representation learning. a. Latent traversals used to visualise the semantic meaning encoded by single disentangled latent units of a trained model. In each row the value of a single latent unit is varied between -3 and 3, while the other units are fixed, and the resulting effect on the reconstruction is visualised. Each column represents a different model trained to disentangle a different dataset. Aspects of some sub-figures are reproduced with the permission of Burgess et al. [19] and Lee et al. [20]. b. Schematic representation of a self-supervised deep neural network. The encoder maps the input image into a low-dimensional latent representation, which is used by the decoder to reconstruct the original image. Blue indicates trainable neural network units that are free to represent anything. Pink indicates latent representation units that are compared to neurons. CNN, convolutional neural network; FC, fully connected neural network. c. Latent traversals of eight units of a β-VAE model trained to disentangle 2,100 natural face images. The initial values of all latent units were obtained by encoding the same input image.
[Figure 2 image: panel b insets pair single neurons with single latent units — neuron 95 / unit 2 (hair thickness), neuron 117 / unit 3 (hair length), neuron 140 / unit 4 (face shape), neuron 151 / unit 5 (ethnicity), neuron 136 / unit 6 (smile), neuron 68 / unit 10 (age) — with variance explained on the axes.]
Figure 2: Responses of single neurons are well explained by single disentangled latent units. a. Coronal section showing the location of fMRI-identified face patches in two primates, with patch AM circled in red. Dark black lines, electrodes. b. Explained variance of single neuron responses to 2,100 faces. Response variance in single neurons is explained primarily by single disentangled units encoding different semantically meaningful information (insets, latent traversals as in Fig. 1a,c).