Temporal multiplexing of perception and memory codes in IT cortex

Supplementary information
In the format provided by the authors and unedited
Nature | www.nature.com/nature
https://doi.org/10.1038/s41586-024-07349-5

Methods

Seven male rhesus macaques (Macaca mulatta) aged 5-13 years were used in this study. All procedures conformed to local and US National Institutes of Health guidelines, including the US National Institutes of Health Guide for Care and Use of Laboratory Animals. All experiments were performed with the approval of the Caltech and UC Berkeley Institutional Animal Care and Use Committees.

Visual stimuli
Face patch localizer. The fMRI localizer stimuli contained 5 types of blocks, consisting of images of faces, hands, technological objects, vegetables/fruits, and bodies. Face blocks were presented in alternation with non-face blocks. Each block lasted 24 s (each image lasted 500 ms). In each run, the face block was repeated four times and each of the non-face blocks was shown once. A block of grid-scrambled noise patterns was presented between each stimulus block and at the beginning and end of each run. Each scan lasted 408 seconds. Additional details can be found in ref. 48.

Monkey face model. To generate a large number of monkey faces, we built an active appearance model for monkey faces (ref. 49), similar to the method used for human faces in ref. 50. Images of frontal views of 165 monkey faces were obtained from the following sources: a private database kindly provided by Dr. Katalin Gothard (101 images), the PrimFace database (visiome.neuroinf.jp/primface) (ref. 51) (22 images), YouTube videos of macaques (https://www.youtube.com/@ArrozMarisco360) (16 images), a documentary movie (Love Is in the Wild Part 3 - A Monkeys Life, https://kwanza.fr/catalogue/love-is-in-the-wild) (3 images), and face images of macaques from our lab (23 images). The “shape” parameters were obtained by manually labelling 59 landmarks on each of the frontal face images. A 2D triangulated mesh was defined on these landmarks. The coordinates of the landmarks of each image were normalized by subtracting the mean and scaling to the same width, and a landmark template was obtained by averaging corresponding landmarks across faces. The “appearance” parameters were obtained by warping each face to the landmark template through affine transforms of the mesh. To reduce the dimensionality of the model, principal component analysis was performed independently on the coordinates of the landmarks (shape) and on the pixels of the warped images (appearance). The first 20 PCs of shape and the first 100 PCs of appearance were kept for the final model, capturing 96.1% of the variance in the shape distribution and 98.4% of the variance in the appearance distribution. We used this model not only to generate unfamiliar monkey faces, but also to compute shape-appearance features of familiar monkey faces (note: these familiar faces were included in the 165-face database). For the latter, we took the 59 landmarks of each face and projected them onto the shape PCs; we then morphed the landmarks to the standard landmark template and projected the resulting pixels of the warped images onto the 100 appearance PCs.
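The shape half of this pipeline (landmark normalization, template averaging, and PCA) can be sketched as below. This is a minimal illustration on synthetic landmark data: the array sizes are taken from the text, but all variable names and the random data are assumptions, and the triangulation and appearance-warping steps are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 165 faces x 59 landmarks x 2 (x, y) coordinates.
n_faces, n_landmarks = 165, 59
landmarks = rng.normal(size=(n_faces, n_landmarks, 2))

# Normalize: subtract each face's mean landmark and scale to a common width.
centered = landmarks - landmarks.mean(axis=1, keepdims=True)
width = centered[..., 0].max(axis=1) - centered[..., 0].min(axis=1)
normalized = centered / width[:, None, None]

# Landmark template: average of corresponding landmarks across faces.
template = normalized.mean(axis=0)

# PCA on flattened landmark coordinates gives the "shape" PCs and parameters.
X = normalized.reshape(n_faces, -1)
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
shape_pcs = Vt[:20]                    # first 20 shape PCs
shape_params = Xc @ shape_pcs.T        # 20 "shape" parameters per face
print(shape_params.shape)
```

With real data, the familiar faces would be projected onto these same PCs to obtain their shape features, as described above.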
Stimuli for electrophysiology experiments. Eight different sources of images were used to generate three different stimulus sets (Extended Data Fig. 2).
1) Personally familiar human faces: Frontal views of faces of 9 people in the lab/animal facility who interacted with the subject monkeys on a daily basis.
2) Personally familiar monkey faces: Frontal views of faces of 9 monkeys in our animal facility that were current or previous roommates or cagemates of the subject monkeys, reconstructed using the monkey face model.
3) Personally familiar objects: Images of 8 toys the subject monkeys interacted with extensively.
4) Pictorially familiar monkey faces: Frontal views of faces of 8 monkeys from the PrimFace database (visiome.neuroinf.jp/primface) (ref. 51), reconstructed using the monkey face model.
5) Cinematically familiar monkey faces: Frontal views of faces of 19 monkeys from 7 movies clipped from 7 videos from YouTube (https://www.youtube.com/@ArrozMarisco360) and a documentary movie (https://kwanza.fr/catalogue/love-is-in-the-wild), reconstructed using the monkey face model.
6) Unfamiliar human faces: 1840 frontal views of faces from various face databases: FERET (refs. 52, 53), CVL (ref. 54), MR2 (ref. 55), Chicago (ref. 56), CelebA (ref. 57), FEI (fei.edu.br/~cet/facedatabase.html), PICS (pics.stir.ac.uk), Caltech faces 1999, Essex (Face Recognition Data, University of Essex, UK; http://cswww.essex.ac.uk/mv/allfaces/faces95.html), and MUCT (www.milbo.org/muct). The background was removed, and all images were aligned, scaled, and cropped so that the two eyes were horizontally located at 45% height of the image and the width of the two eyes equaled 30% of the image width, using an open-source face aligner (github.com/jrosebr1/imutils).
7) Unfamiliar monkey faces: 1840 images were generated using the monkey face model described above by randomly drawing shape and appearance parameters from independent Gaussian distributions, with the same standard deviation as real monkey faces for each parameter. Faces with any parameter larger than 0.8 × the maximum value found in a real monkey face were excluded to avoid unrealistic faces.
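This sampling-with-rejection procedure can be sketched as follows. It is a minimal illustration: the real-face parameter matrix is synthetic, and only 200 faces are drawn rather than 1840 to keep it quick.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the shape-appearance parameters of real monkey faces
# (rows = faces, columns = 20 shape + 100 appearance dimensions).
real_params = rng.normal(size=(165, 120))

sd = real_params.std(axis=0)                    # per-dimension SD of real faces
limit = 0.8 * np.abs(real_params).max(axis=0)   # per-dimension exclusion threshold

faces = []
while len(faces) < 200:                         # 1840 in the actual stimulus set
    candidate = rng.normal(scale=sd)            # independent Gaussian per parameter
    if np.all(np.abs(candidate) <= limit):      # reject unrealistic faces
        faces.append(candidate)
synthetic = np.stack(faces)
print(synthetic.shape)
```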
8) Unfamiliar objects: Images of objects were randomly picked from a subset of categories in the COCO dataset (arXiv:1405.0312). The choice of categories was based on two criteria: 1) only categories that our macaque subjects had no experience with (e.g., vehicles) were included; 2) categories with highly similar objects were excluded (e.g., stop signs). The included super-categories were: 'accessory', 'appliance', 'electronic', 'food', 'furniture', 'indoor', 'outdoor', 'sports', and 'vehicle'. 1500 images of objects with area larger than 200² pixels were isolated, centered, and scaled to the same width or height, whichever was larger.

We emphasize that due to the difficulty of obtaining a large set of high-quality monkey face images, we used the monkey face model described above to synthesize unfamiliar monkey faces; for consistency, all familiar monkey faces used in this study were also reconstructed using the monkey face model. Thus any differences in responses to familiar versus unfamiliar faces cannot be attributed to use of synthetic stimuli. Values of each feature dimension were normalized by the standard deviation of that feature dimension for analysis purposes.

From these 8 stimulus sources, 3 different stimulus sets were generated:
1) Screening set, consisting of 8 or 9 images from 9 different categories (human faces, monkey faces, and objects, each personally familiar, pictorially familiar, or unfamiliar) (Extended Data Fig. 2a). There were 74 screening stimuli in all; responses to 24 pictorially familiar faces and objects were not shown in Figure 1. For unfamiliar stimuli, 8 novel images were used for each cell or simultaneously recorded group of cells. Each image was presented in random order, centered at the fixation spot, for 150 ms on, 150 ms off (gray screen), repeated 5-10 times. The size of each image was 7.2° x 7.2°. Data using the screening stimulus set are shown in Fig. 1, 4b, c, e, f, Extended Data Fig. 3c, 4, 7, 9d, e, 10a-c.
2) Thousand monkey face set, consisting of 1000 unfamiliar faces (examples shown in Extended Data Fig. 2b) and 36 familiar faces (personally familiar, pictorially familiar, and cinematically familiar faces, Extended Data Fig. 2a, c), presented using the same parameters as the screening set, except for the number of repetitions (3-5 times). In addition, the 8 novel unfamiliar faces shown in the screening set were shown again. Data using this stimulus set are shown in Fig. 2, 3, 4d, g, Extended Data Fig. 3e-h, 5 (except PR of monkey E and TP), 9, 10d-k.
3) Thousand monkey face set 2, to match both low-level and high-level features for familiar vs. unfamiliar faces. It consisted of 36 familiar faces (personally familiar, pictorially familiar, and cinematically familiar faces, Extended Data Fig. 2a, c) and 1044 unfamiliar faces (each set of 36 unfamiliar faces was generated by random permutation of the shape-appearance features of the 36 familiar faces; the distributions of pairwise distances of low-level features of AlexNet layer 1 for familiar vs. unfamiliar faces were checked by Kolmogorov-Smirnov test, and the 29 sets of unfamiliar faces that were not significantly different in either distribution were used). These stimuli were presented using the same parameters as the screening set, except that the number of repetitions was 3-5 times for 1008 unfamiliar faces; to control for response reliability differences due to familiarity, a higher number of repeats (15-25 times) was used for one set of 36 unfamiliar faces as well as for familiar faces. Data from PR of monkey E and TP of monkeys A and E were collected using this stimulus set.

Behavioral task
For electrophysiology and behavior experiments, monkeys were head-fixed and passively viewed a screen in a dark room. Stimuli were presented on an LCD monitor (Acer GD235HZ). Screen size covered 26.0° x 43.9°. Gaze position was monitored using an infrared camera eye-tracking system (ISCAN) sampled at 120 Hz.

34
Passive fixation task
. All monkeys performed this task for both fMRI scanning and
35
electrophysiological recording.
Juice reward was delivered every 2
-
4 s in exchange for
36
monkeys m
aintaining fixation on a small spot (0.2° diameter).
37
38
Preferential viewing task. Two monkeys were trained to perform this task. In each trial, a pair of face images (7.2° x 7.2°) was presented on the screen side by side with 14.4° center distance (Fig. 1b, Extended Data Fig. 3d). Juice reward was given every 2-4 s in exchange for monkeys viewing either one of the images. Each pair of images lasted 10 s. Face pairs were presented in random order. To avoid side bias, each pair was presented twice with sides swapped.

Face identification task. A delayed match-to-sample task was used to test performance on face identification in two monkeys. The task was performed using a touch screen in a cage without head fixation. The subject touched a dot at the center of the screen to initiate a trial. In each trial, a face image (the sample) was shown for 1000 ms, followed by a pair of images (the target and the distractor). The subject had 9 seconds to touch the face that matched the sample. A juice reward was given for a correct response. The sample was presented at four different blur levels (clear, Gaussian blur standard deviation 5, 10, and 20 pixels), while the target and distractor were presented as clear versions. The target was identical to the sample except for the difference in blur. The task included 30 pairs of familiar-unfamiliar faces and 30 pairs of unfamiliar-unfamiliar faces. For each face pair, the distractor was matched to the target in low-level features (mean luminance, mean contrast, hue distribution, and shape of the face outline). The Euclidean distance in shape-appearance feature space between the target and distractor face was the same across familiar and unfamiliar face targets.

MRI scanning and analysis
Subjects were scanned in a 3T TIM (Siemens, Munich, Germany) magnet equipped with an AC88 gradient insert. 1) Anatomical scans were performed using a single loop coil at isotropic 0.5 mm resolution. 2) Functional scans were performed using a custom eight-channel coil (MGH) at isotropic 1 mm resolution, while subjects performed a passive fixation task. Contrast agent (Molday ION) was injected to improve the signal/noise ratio. Further details about the scanning protocol can be found in ref. 58.
MRI Data Analysis. Analysis of functional volumes was performed using the FreeSurfer Functional Analysis Stream (ref. 59) and FSL (ref. 60). Volumes were corrected for motion and undistorted based on the acquired field map. Runs in which the norm of the residuals of a quadratic fit of displacement during the run exceeded 5 mm and the maximum displacement exceeded 0.55 mm were discarded. The resulting data were analyzed using a standard general linear model. The face contrast was computed as the average of all face blocks compared to the average of all non-face blocks.
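The run-exclusion rule can be expressed compactly. The sketch below is an assumption-laden illustration: the thresholds and the conjunction of the two criteria follow the text, but the sampling times and displacement traces are made up.

```python
import numpy as np

def discard_run(displacement, t, resid_norm_thresh=5.0, max_disp_thresh=0.55):
    """Exclusion rule from the text: discard a run if the norm of the residuals
    of a quadratic fit of displacement exceeds 5 mm and the maximum
    displacement exceeds 0.55 mm."""
    coeffs = np.polyfit(t, displacement, deg=2)          # quadratic fit
    residuals = displacement - np.polyval(coeffs, t)
    return bool(np.linalg.norm(residuals) > resid_norm_thresh
                and np.abs(displacement).max() > max_disp_thresh)

# Made-up example: a slow quadratic drift is fit well and the run is kept.
t = np.linspace(0, 408, 204)          # assumed 2 s sampling over a 408 s run
drift = 0.1 * (t / 408) ** 2          # displacement in mm
print(discard_run(drift, t))          # → False
```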
Single-unit recording
Multiple different types of electrodes were used in this study. Single electrodes (Tungsten, 1 MOhm at 1 kHz, FHC) were used to collect most of the data. A Neuropixel prototype probe (128 channels, HHMI) was used to record ML from subject A. A multi-channel stereotrode (64 channels, Plexon S-probe) was used to record AM during muscimol silencing of PR in subject E. A chronically implanted microwire brush array (64 channels, MicroProbes) (McMahon et al., 2014) was used to record from face patch AM in subject C. The electrode trajectories that could reach the desired targets were planned using custom software (ref. 61), and custom angled grids that guided the electrodes to the target were produced using a 3D printer (3D Systems). Extracellular neural signals were amplified and recorded using Plexon. Spikes were sampled at 40 kHz. For single-channel recorded data, spike sorting was performed manually by clustering waveforms above a threshold in PCA space using custom-made software (Kofiko) in Matlab. Multichannel recorded data were automatically sorted by Kilosort2 (github.com/MouseLand/Kilosort2) (ref. 62) and manually refined in Phy (github.com/cortex-lab/phy).

Muscimol experiment
To silence face patch PR, 1 μl (5 mg/ml) of muscimol (Sigma) was injected into PR at 0.5 μl/min using a G33 needle (Hamilton) connected to a 10 μl micro-syringe controlled by a micro-pump (WPI, UltraMicroPump 3). AM cells were recorded both before and 30 min after injection.

Data analysis
All visually responsive cells were included for analysis. To determine visual responsiveness, a two-sided t-test was performed comparing activity at [-50, 0] ms to that at [50, 300] ms after stimulus onset. Cells with p-value < 0.05 were included.
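As a sketch, the responsiveness criterion might be implemented as below. The firing rates are synthetic, and pairing baseline and response windows by trial is an assumption (the text does not specify a paired vs. unpaired test).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic stand-in: per-trial firing rates (spikes/s) in the baseline window
# [-50, 0] ms and the response window [50, 300] ms.
n_trials = 50
baseline = rng.poisson(lam=5 * 0.05, size=n_trials) / 0.05    # ~5 spk/s, 50 ms
response = rng.poisson(lam=20 * 0.25, size=n_trials) / 0.25   # ~20 spk/s, 250 ms

# Two-sided t-test between the two windows (paired by trial: an assumption).
t_stat, p_value = stats.ttest_rel(baseline, response)
visually_responsive = bool(p_value < 0.05)
print(visually_responsive)
```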
Face selectivity index
A face selectivity index (FSI) was defined for each cell as:

FSI = (R_face - R_nonface) / (R_face + R_nonface)

where R is the average neuronal response in a 50-300 ms window after stimulus onset (Extended Data Fig. 3c).
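A direct transcription of the index, with made-up example rates:

```python
def face_selectivity_index(r_face, r_nonface):
    """FSI = (R_face - R_nonface) / (R_face + R_nonface), where R is the mean
    response in the 50-300 ms window after stimulus onset."""
    return (r_face - r_nonface) / (r_face + r_nonface)

# Example with made-up mean rates (spikes/s): a face-selective cell.
print(face_selectivity_index(30.0, 10.0))   # → 0.5
```

The index ranges from -1 (responds only to non-faces) through 0 (no selectivity) to 1 (responds only to faces).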
Population average of response time course
For each cell, responses to the same stimulus category were first averaged in 10 ms time bins; the responses were then baseline-subtracted (using the average response in the time window 0 to 50 ms) and normalized by the maximum response across different stimulus categories after stimulus onset. The normalized responses were finally averaged across cells for each category after smoothing by a Gaussian function with 10 ms standard deviation (Fig. 1g right, Fig. 3b, Extended Data Fig. 4c, 10c right, 10g).
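The normalization pipeline can be sketched as follows on synthetic PSTHs. The cell count, category count, and time base are assumptions; smoothing uses a Gaussian filter with an SD of one 10 ms bin.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(3)

# Synthetic stand-in: trial-averaged responses of 40 cells to 3 stimulus
# categories in 10 ms bins spanning -50 to 390 ms around stimulus onset.
bin_ms = 10
t = np.arange(-50, 400, bin_ms)
rates = rng.gamma(2.0, 5.0, size=(40, 3, t.size))

# 1) Baseline-subtract using the average response in the 0-50 ms window.
base = rates[:, :, (t >= 0) & (t < 50)].mean(axis=2, keepdims=True)
rates = rates - base

# 2) Normalize each cell by its maximum response across categories after onset.
peak = rates[:, :, t >= 0].max(axis=(1, 2), keepdims=True)
rates = rates / peak

# 3) Smooth with a Gaussian of 10 ms SD (= 1 bin), then average across cells.
smoothed = gaussian_filter1d(rates, sigma=1.0, axis=2)
population = smoothed.mean(axis=0)     # one time course per category
print(population.shape)
```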
To determine the time point at which responses rose above baseline (e.g., Fig. 1e), we compared the response at each time point to the baseline response (average response over [-50, 0] ms) using a one-tailed t-test, and determined the first time point at which p < 0.01.
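A sketch of the latency estimate on synthetic trials (the response onset is placed at 100 ms by construction; the one-tailed paired test is implemented by halving the two-sided p value, an assumption about the exact test used):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Synthetic trials: 10 ms bins from -50 to 290 ms; response begins at 100 ms.
t = np.arange(-50, 300, 10)
n_trials = 30
resp = rng.poisson(3, size=(n_trials, t.size)).astype(float)
resp[:, t >= 100] += rng.poisson(8, size=(n_trials, int((t >= 100).sum())))

# Baseline: average response over [-50, 0] ms for each trial.
baseline = resp[:, (t >= -50) & (t < 0)].mean(axis=1)

# First post-onset time point where a one-tailed t-test gives p < 0.01.
latency = None
for i in np.flatnonzero(t >= 0):
    t_stat, p_two = stats.ttest_rel(resp[:, i], baseline)
    if t_stat > 0 and p_two / 2 < 0.01:    # one-tailed: response > baseline
        latency = int(t[i])
        break
print(latency)
```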
Preferred axis of cells
The preferred axis of cells was computed in two different ways (ref. 63):

Spike-triggered average (STA). The average firing rate of a neuron was computed for each stimulus, either in a full time window [50-300] ms or in a sliding 50 ms time window after stimulus onset. The STA was defined as:

sta = (r - r̄) F

where r is a 1 × n vector of firing rate responses to a set of n face stimuli, r̄ is the mean firing rate, and F is an n × d matrix in which each row consists of the d parameters representing each face stimulus in the feature space.
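A minimal numerical check of the STA formula on a simulated linear cell (the feature matrix, true axis, and noise level are all made up):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated cell: d = 20 face features for n = 1000 stimuli; the firing rate
# is a linear function of the features plus noise.
n, d = 1000, 20
F = rng.normal(size=(n, d))                 # each row: features of one face
true_axis = rng.normal(size=d)
r = F @ true_axis + rng.normal(scale=0.5, size=n)    # 1 x n response vector

# STA: sta = (r - r_mean) F, a 1 x d vector along the preferred axis.
sta = (r - r.mean()) @ F

# For white (uncorrelated) features the STA recovers the true axis direction.
cosine = sta @ true_axis / (np.linalg.norm(sta) * np.linalg.norm(true_axis))
print(round(float(cosine), 3))
```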
Linear regression/Whitened STA. For a small sample of stimuli, e.g., 36 familiar faces, the features are not necessarily white (i.e., uncorrelated). As a control, to ensure that the difference in STA observed in Fig. 2b, c was not due to mismatched feature distributions between familiar and unfamiliar faces, we repeated our main analysis using a whitened STA (Extended Data Fig. 9a, bottom), computed as follows:

lin = (r - r̄) F (F^T F)^(-1)

For all figures, we used 20 dimensions to compute the preferred axis (first 10 shape and first 10 appearance dimensions).
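The correction can be verified on simulated data with deliberately correlated features. All values below are made up; the features are mean-centered in this demo so that the whitened STA reduces exactly to the ordinary least squares solution.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic stand-in: 36 stimuli with deliberately correlated features.
n, d = 36, 20
mix = rng.normal(size=(d, d))
F = rng.normal(size=(n, d)) @ mix
F = F - F.mean(axis=0)            # mean-center features (demo simplification)
true_axis = rng.normal(size=d)
r = F @ true_axis                 # noiseless linear cell

centered = r - r.mean()
sta = centered @ F                               # plain STA: biased by correlations
lin = centered @ F @ np.linalg.inv(F.T @ F)      # whitened STA

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(float(cos(lin, true_axis)), 3), round(float(cos(sta, true_axis)), 3))
```

The whitened STA recovers the true axis exactly here, while the plain STA is pulled toward directions of high feature covariance.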
Principal orthogonal axis
The principal orthogonal axis of a cell was defined as the longest axis orthogonal to its preferred axis. First, for each of the 1000 unfamiliar face images, represented as a d-dimensional vector f in face feature space, its component along the preferred axis a of the cell was subtracted:

f1 = f - (f·a / |a|²) a

Then principal component analysis was performed on the set of 1000 vectors f1, and the principal orthogonal axis was the first principal component.
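A sketch of this computation (synthetic faces and a made-up preferred axis):

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-in: 1000 unfamiliar faces in a d = 20 feature space,
# plus a made-up preferred axis `a` for one cell.
n, d = 1000, 20
faces = rng.normal(size=(n, d))
a = rng.normal(size=d)

# Subtract each face's component along the preferred axis:
# f1 = f - (f.a / |a|^2) a
f1 = faces - (faces @ a / (a @ a))[:, None] * a

# PCA on the residuals; the principal orthogonal axis is the first PC.
f1c = f1 - f1.mean(axis=0)
_, _, Vt = np.linalg.svd(f1c, full_matrices=False)
orth_axis = Vt[0]
print(round(float(abs(orth_axis @ a)), 6))    # near zero: orthogonal to `a`
```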
Quantifying significance of axis tuning
For each cell, we compared the variance explained by the axis model to a distribution of explained variances computed for data in which stimulus identities were shuffled (1000 repeats). We considered axis tuning significant if the frequency of a higher explained variance in the shuffle distribution was less than 5% (Extended Data Fig. 3e, g).
Quantifying consistency of preferred axis
For each cell, the stimuli were randomly split into two halves, and a preferred axis was calculated using responses to each subset. Then the Pearson correlation (r) was calculated between the two axes. This process was repeated 100 times, and the consistency of the preferred axis for the cell was defined as the average r value across the 100 iterations (Extended Data Fig. 3f, h).
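The split-half procedure can be sketched as follows, using the STA as the preferred-axis estimator (simulated linear cell; all sizes and noise levels are made up):

```python
import numpy as np

rng = np.random.default_rng(8)

# Simulated linear cell: 1000 stimuli, d = 20 features.
n, d = 1000, 20
F = rng.normal(size=(n, d))
true_axis = rng.normal(size=d)
r = F @ true_axis + rng.normal(scale=2.0, size=n)

def sta(resp, feats):
    return (resp - resp.mean()) @ feats

# Random split-half, correlate the two preferred-axis estimates, repeat 100x.
corrs = []
for _ in range(100):
    perm = rng.permutation(n)
    a1 = sta(r[perm[: n // 2]], F[perm[: n // 2]])
    a2 = sta(r[perm[n // 2 :]], F[perm[n // 2 :]])
    corrs.append(np.corrcoef(a1, a2)[0, 1])
consistency = float(np.mean(corrs))
print(round(consistency, 2))
```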
Face feature decoding and reconstruction
To decode face features, firing rates after stimulus onset in a chosen time window (see Fig. 2d, e legend) were first averaged across multiple repeats of the same stimulus; then linear regression was performed on a training set of 999 unfamiliar faces to compute the linear mapping M from population response vector r to face feature vector f:

f = M r

The decoding was performed on the remaining one unfamiliar face and all familiar faces using this mapping M. Decoding accuracy was measured by (i) the correlation coefficient between decoded and actual face features (Fig. 2d), and (ii) the mean square error between decoded and actual face features (Fig. 4g). For both methods, the decoding accuracy for unfamiliar faces was computed 1000 times through leave-one-out cross-validation.
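One fold of the leave-one-out procedure might look like this (synthetic population; note the mapping is fit here in the transposed convention, features ≈ responses · M, which is equivalent to f = Mr):

```python
import numpy as np

rng = np.random.default_rng(9)

# Synthetic stand-in: responses of 100 cells to 1000 unfamiliar faces with
# d = 20 features; each cell is a noisy linear readout of the features.
n_faces, n_cells, d = 1000, 100, 20
features = rng.normal(size=(n_faces, d))
W = rng.normal(size=(d, n_cells))
responses = features @ W + rng.normal(scale=1.0, size=(n_faces, n_cells))

# Leave one face out: fit the linear mapping on the other 999 faces,
# then decode the held-out face from its population response.
held = 0
train = np.arange(n_faces) != held
M, *_ = np.linalg.lstsq(responses[train], features[train], rcond=None)
decoded = responses[held] @ M

# Accuracy measure (i): correlation between decoded and actual features.
acc = float(np.corrcoef(decoded, features[held])[0, 1])
print(round(acc, 2))
```

In the full analysis this fold would be repeated with each of the 1000 unfamiliar faces held out in turn.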
To reconstruct faces (Fig. 2e), we built a face feature decoder as above using responses to unfamiliar faces, computed either in a short ([120 170] ms) or long ([220 270] ms) latency window.

Face identity decoding
To decode face identity (Fig. 4b), firing rates after stimulus onset in a chosen time window of each trial were randomly split in half and averaged. Then a multi-class linear SVM decoder was trained to classify each face identity for 30 familiar or 30 unfamiliar feature-matched (Extended Data Fig. 8) faces separately, using one half for training and testing on the other half. This was repeated 20 times.

Familiarity decoding
Firing rates after stimulus onset in a chosen time window (stated for each particular case in the figure legends) were first averaged across multiple repeats of the same stimulus; then the decoding accuracy was obtained as the average of leave-one-out cross-validated linear SVM decoding. For the thousand face set, the training sample was balanced by randomly subsampling 36 unfamiliar faces, repeated 10 times (Fig. 3c).
Centroid shift analysis
To determine the time when the shift of the neural representation centroids for familiar and unfamiliar faces provided familiarity discriminability (Fig. 3d), population responses (50 ms sliding time window, step size 10 ms) to all 36 familiar faces and a randomly subsampled 36 unfamiliar faces were first projected onto the axis connecting the neural centroids of familiar and unfamiliar faces. Then d' was computed for the projected values:

d' = (μ_familiar - μ_unfamiliar) / sqrt((σ²_familiar + σ²_unfamiliar) / 2)

Here μ and σ² are the mean and variance of the projected values, respectively. The computation was repeated 10 times, once for each random subsampling of 36 unfamiliar faces. Chance-level d' was estimated by randomly shuffling the population responses to each face 10 times. The time when d' was significantly higher than chance was determined by one-tailed t-test (p < 0.01, n = 10).
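The d' computation can be sketched as follows (synthetic population responses with an added familiarity offset; the cell count and offset size are made up):

```python
import numpy as np

rng = np.random.default_rng(10)

def dprime(x_fam, x_unfam):
    # d' = (mu_fam - mu_unfam) / sqrt((var_fam + var_unfam) / 2)
    return (x_fam.mean() - x_unfam.mean()) / np.sqrt(
        0.5 * (x_fam.var() + x_unfam.var()))

# Synthetic population responses (80 cells): 36 familiar and 36 unfamiliar
# faces, with a small familiarity offset added to the familiar ones.
n_cells = 80
fam = rng.normal(size=(36, n_cells)) + 0.3
unfam = rng.normal(size=(36, n_cells))

# Project onto the axis connecting the two neural centroids, then compute d'.
axis = fam.mean(axis=0) - unfam.mean(axis=0)
axis = axis / np.linalg.norm(axis)
d = float(dprime(fam @ axis, unfam @ axis))
print(round(d, 2))
```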
Analysis of orthogonality between familiarity and face feature decoding axes
To determine the cosine similarity between familiarity and face feature decoding axes (Fig. 3f), we obtained the familiarity decoding axis as described above in the section on “Familiarity decoding.” For unfamiliar faces, we obtained the face feature decoding axis as described in the section above on “Face feature decoding and reconstruction.” For familiar faces, we obtained the face feature decoding axis by computing the pseudo-inverse of the face feature encoding axis (necessary due to the small number of familiar faces); we obtained the latter as described above in the section on “Preferred axis of cells” (using the STA).
Computing normalized firing rate changes
The normalized firing rate change in Fig. 4e was computed as follows:

(R_after - R_before) / (R_after + R_before)

where R_before is the mean firing rate within 50-300 ms after stimulus onset before muscimol injection, and R_after is the same for after muscimol injection.
Matching face feature distributions
We wanted to ensure that the difference in preferred axis (Fig. 2b, c) and the difference in pairwise distance in the neural state space (Extended Data Fig. 10j, k) were not due to mismatched feature distributions between familiar and unfamiliar faces. To this end, we identified a feature-matched subset of 30 familiar and 30 unfamiliar faces. For the top 20 face features, these two face sets were matched in feature variance (Extended Data Fig. 8a), distribution of pairwise face distances in feature space (Extended Data Fig. 8b), and distribution of each feature (Extended Data Fig. 8c). This was achieved by searching for a subset of faces that minimized the following cost function:

C = C_var + C_p

The first term evaluated the difference in variance:

C_var = (1/n) Σ_{i=1}^{n} (v_familiar(i) - v_unfamiliar(i))² + | (1/n) Σ_{i=1}^{n} v_familiar(i) - (1/n) Σ_{i=1}^{n} v_unfamiliar(i) |

where v(i) is the variance of the ith feature and n = 20 is the number of features. It is the sum of the mean square error and the absolute value of the mean difference between the variances of each feature.

The second term ensured that the distributions in consideration are not significantly different, as measured by the p values of K-S tests being larger than 0.05:

C_p = g(min(p_d, p_1, ..., p_n))

g(p) = 1/(p + 0.001) if p < 0.05; 0 if p ≥ 0.05

where p_d is the p value of the K-S test between the distributions of pairwise face distances for familiar vs. unfamiliar faces, and p_i is the p value of the K-S test between the distributions of the ith feature for familiar vs. unfamiliar faces.

The optimization was performed using a gradient-descent-like algorithm: in each iteration, dC was estimated by removing or adding each face, and the change that decreased C the most was applied, until C did not decrease anymore. To balance the number of familiar and unfamiliar faces in the result, we set a minimum number of familiar faces (23-36). When the number was chosen to be 30, the resulting number of unfamiliar faces also happened to be 30.
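A scaled-down sketch of the cost function and greedy search follows. It uses far fewer features and faces than the paper, and for brevity only swaps of unfamiliar faces are tried, rather than the full add/remove search over both sets with a minimum familiar count.

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import pdist

rng = np.random.default_rng(11)

# Scaled-down stand-in: n = 3 features (20 in the paper), 15 familiar faces,
# and a pool of 40 unfamiliar faces with deliberately mismatched variance.
fam = rng.normal(size=(15, 3))
pool = rng.normal(scale=1.4, size=(40, 3))

def g(p):
    # Penalty when a K-S test is significant (p < 0.05).
    return 1.0 / (p + 0.001) if p < 0.05 else 0.0

def cost(u_idx):
    u = pool[u_idx]
    vf, vu = fam.var(axis=0), u.var(axis=0)
    c_var = np.mean((vf - vu) ** 2) + abs(vf.mean() - vu.mean())
    ps = [ks_2samp(pdist(fam), pdist(u)).pvalue]                  # p_d
    ps += [ks_2samp(fam[:, i], u[:, i]).pvalue for i in range(fam.shape[1])]
    return c_var + g(min(ps))

# Greedy descent: swap selected unfamiliar faces for unused ones whenever
# the cost decreases.
u_idx = list(range(15))
c0 = c = cost(u_idx)
for _ in range(3):                       # capped number of sweeps
    improved = False
    for slot in range(len(u_idx)):
        for cand in range(len(pool)):
            if cand in u_idx:
                continue
            trial = u_idx.copy()
            trial[slot] = cand
            c_new = cost(trial)
            if c_new < c:
                u_idx, c, improved = trial, c_new, True
    if not improved:
        break
print(round(c0, 4), round(c, 4))
```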
Finally, we confirmed that for the resulting set of 30 familiar and 30 unfamiliar faces, the faces were indeed feature matched (Extended Data Fig. 8), and that the axis model explained similar amounts of variance for both familiar and unfamiliar face responses.

In Extended Data Fig. 2b, to demonstrate the diversity of faces in the 1000 face set and the capability to match each of our familiar face sets, we used the same matching method with different subsets of familiar faces.