A Number Sense as an Emergent Property
of the Manipulating Brain (Supplementary Material)
N. Kondapaneni and P. Perona – California Institute of Technology
March 13, 2024
A Additional Experiments
A.1 Controlling for spurious correlates of “number”
Do image properties, other than the abstraction of “object number”, drive the quantity
estimate of our model? Many potential confound variables, such as the count of pixels that
are not black, are correlated with object number and might play a role in the model’s ability
to estimate the number of objects in the scene. If that were the case, one might argue that
our model is not learning the abstraction of “number”, but rather learning to measure image
properties that are correlated with number.
We controlled for this hypothesis by exploiting the natural variability of our test set
images. We explored three image properties that correlate with the number of objects and
might thus be exploited to estimate the number of objects: (a) overall image brightness, (b)
the area of the envelope of the objects in the image, and (c) the total number of pixels that
differ from the background. Since objects in training set B vary both in size and in contrast,
these three variables are not deterministically related to object number and thus, we reason,
confound-variable fluctuations ought to affect error rates independently of the number of
objects.
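For concreteness, the three covariates are straightforward to compute from an image. Below is a minimal Python sketch; the function name covariates and the uniform-background assumption are ours, not part of the model's pipeline:

    # Minimal sketch of the three covariates studied here; assumes a uniform
    # background and at least three non-collinear foreground pixels.
    import numpy as np
    from scipy.spatial import ConvexHull

    def covariates(img, background=0.0, tol=1e-6):
        """img: 2-D array of pixel intensities on a uniform background."""
        fg = np.abs(img - background) > tol       # pixels differing from background
        brightness = float(img.mean())            # (a) overall image brightness
        ys, xs = np.nonzero(fg)                   # foreground pixel coordinates
        hull_area = ConvexHull(np.column_stack([xs, ys])).volume  # (b) envelope area
        fg_pixels = int(fg.sum())                 # (c) non-background pixel count
        return brightness, hull_area, fg_pixels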
We focused on close-call relative estimate tasks (e.g. 16 vs 18 objects), where errors
are frequent both for our model and for human subjects, and, while holding the number of
objects constant in each of the two scenes being compared, we studied the behavior of error
rates as a function of fluctuations in the confound variables. One would expect more errors
when comparing image pairs where quantities that typically correlate with the number of
objects are anticorrelated in the specific example (Fig. S1). Conversely, one would expect
lower error rates when the confound variables are positively correlated with number.
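The pairing logic can be sketched as follows; the array names (pred_*, cov_*) are illustrative stand-ins for the model's numerosity estimates and the covariate values, not code from the paper:

    # Hedged sketch of the pairing logic: compare reference images against test
    # images with a nearby (larger) object count and record whether the model
    # ranks them correctly.
    import numpy as np

    def pair_error_rate(pred_ref, cov_ref, pred_cmp, cov_cmp, anticorrelated=True):
        """pred_*: model numerosity estimates; cov_*: covariate values.
        The comparison images are assumed to contain MORE objects than the
        reference images."""
        # fractional difference of the covariate w.r.t. each reference image
        frac = (cov_cmp[None, :] - cov_ref[:, None]) / cov_ref[:, None]
        keep = frac < 0 if anticorrelated else frac > 0
        # a judgment is correct iff the larger count gets the larger estimate
        correct = pred_cmp[None, :] > pred_ref[:, None]
        return 1.0 - correct[keep].mean()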
In Fig. S2, error rates are plotted against each of the confound variables while the number of objects is held constant. We could not find large systematic biases even for extreme
variations in the confound variables. In conclusion, we do not find support for the argument
that any of the confound variables we studied is implicated significantly in the estimate of
quantity.
[Figure S1 panels: rows show the Intensity, Summed Object Areas, and Convex Hull covariates for scenes of numerosity 14, 16, and 18; the fractional differences from the reference (numerosity 16) are 0.408 / 0.000 / -0.309 (Intensity), 0.129 / 0.000 / -0.286 (Summed Object Areas), and 0.201 / 0.000 / -0.269 (Convex Hull).]
Figure S1: Sample images where covariates are anticorrelated with number.
We
sample images where the three covariates we study (one covariate per row) are anticorrelated
with the number of objects. The number below each plot shows the fractional difference from
the value of the covariate in the reference image (center column). For example, in the top
right, there is a 30.9% decrease in average image intensity when compared to the intensity
in the reference image (center column). Another example: in the last row, the scene with
18 objects has a 26.9% smaller convex hull than the corresponding scenes with 14 and 16
objects. For each row, from the lowest numerosity to the highest, the model predicts a
perceived numerosity of 12.82, 14.01, and 16.60 (Intensity); 13.21, 14.43, and 15.55 (Summed Object Areas); and 13.22, 15.28, and 16.44 (Convex Hull). Thus, our model correctly classifies the relative numerosity for each of the image pairs that may be formed from each row (our model slightly underestimates numerosity; see Figure 5B). Image pairs formed this way are
used in the experiments shown in Figure S2, where this manipulation was repeated multiple
times and confidence intervals were computed.
[Figure S2 panels: accuracy (y axis, 0.6-1.0) plotted against the median binned fractional difference in the covariate (x axis) for each column (Median Binned Intensity, Median Binned Summed Object Areas, Median Binned Convex Hull) and each row of reference numerosities: 3 (vs. 1, 2, 4, 5), 9 (vs. 7, 8, 10, 11), 16 (vs. 14, 15, 17, 18), and 24 (vs. 22, 23, 25, 26).]
Figure S2: Effects of covariates of numerosity.
Three covariates of the number of
objects in the scene are explored for possible influence on our model’s estimate of numerosity. These are the average image intensity (left column), the sum of the areas of the objects (middle column), and the area of the objects’ convex hull (right column). Each plot shows the error rates in a relative quantity discrimination task like the one in Figure 5A. We generated a test set of 4650 images, 150 per number of objects. For each plot we chose reference images containing, respectively, 3, 9, 16, and 24 objects (rows of the figure) and had our model judge relative numerosity with respect to test images containing a different but similar number of objects (indicated in the legend and associated with colors). Given the stochastic nature of the images, the covariates vary over a wide range for each number of objects (see examples in Fig. S1). For each number of objects, we plot the model’s error rates (y axis) as a function of the value of the covariate quantity (x axis), expressed as the fractional difference from the reference image (the values are binned). Shadows display 95% Bayesian confidence intervals (N > 100, where N is the bin size). Horizontal error-rate lines indicate no correlation of the numerosity estimate with the covariate quantity. A few lines have slopes that differ slightly from zero, indicating a possible correlation. However, some of the slopes indicate a negative correlation (i.e., the better the signal, the higher the error rate). From this evidence it is difficult to conclude that the model is exploiting anything but “number” to estimate numerosity.
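The binning and interval computation can be reproduced, under assumptions, with a short sketch. Here we assume the 95% Bayesian confidence intervals are equal-tailed credible intervals under a flat Beta(1, 1) prior on per-bin accuracy; the choice of prior is our assumption, not stated in the paper:

    # Bin trials by the covariate's fractional difference, then compute
    # per-bin accuracy with a Beta-posterior credible interval.
    import numpy as np
    from scipy.stats import beta

    def binned_accuracy(frac_diff, correct, n_bins=5):
        """frac_diff: covariate fractional differences; correct: 0/1 outcomes."""
        edges = np.quantile(frac_diff, np.linspace(0, 1, n_bins + 1))
        idx = np.clip(np.digitize(frac_diff, edges[1:-1]), 0, n_bins - 1)
        rows = []
        for b in range(n_bins):
            c = correct[idx == b]
            k, n = int(c.sum()), len(c)                # successes, bin size
            lo, hi = beta.ppf([0.025, 0.975], k + 1, n - k + 1)
            rows.append((np.median(frac_diff[idx == b]), k / n, lo, hi))
        return rows  # (median binned covariate, accuracy, CI low, CI high)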
[Figure S3: classification error (y axis, log scale, 10^-2 to 10^-1) plotted against embedding dimension (x axis, log scale, powers of two up to 2^8) for the Take, Shake, and Put actions.]
Figure S3: Action classification error as a function of embedding dimension.
Classification errors for Model B, averaged over the number of items in the scene (0-3), are plotted as a function of the dimension of the embedding (a free parameter in our model). Since the effect is minimal, we arbitrarily picked a dimension of two for ease of visualization (Figs. 4, S5). The shadows show 95% Bayesian confidence intervals (287 ≤ N ≤ 355).
A.2 Interpreting the Embedding Space
Does the dimension of the embedding space influence the action classification error? We wondered what effect this free parameter has on the model’s performance. We explored this question by training our model repeatedly on the same training images while varying the dimension of the embedding (Fig. 1). Figure S3 shows that the effect of the embedding dimension is negligible. This was initially surprising to us. An explanation may be found in the fact that learning produces an embedding that is organized as a line (see Fig. 4 and Sec. A.4).
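The sweep itself is simple; here is an illustrative sketch in which train_model and eval_error are hypothetical stand-ins for our training and evaluation code:

    # Train repeatedly on the same images, varying only the embedding
    # dimension, and record the per-action classification error.
    dims = [2 ** k for k in range(9)]   # 1 ... 256, matching the log-spaced axis
    errors = {}
    for d in dims:
        model = train_model(train_images, embedding_dim=d)  # same data each run
        errors[d] = {a: eval_error(model, test_images, action=a)
                     for a in ("take", "shake", "put")}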
Next, we explored the structure of the embedding space in the region where images
containing 0-3 objects (the training range) are represented. As discussed in the main text
we find that the embedding is organized into clusters (Fig. S5 (A,B)). Each cluster contains
embeddings of images with the same number of objects. For each pair of images that were
generated by a
put
action we drew a red arrow connecting the corresponding embeddings.
We used blue arrows for
take
pairs. It is clear from the figure that by following the red
arrows one may visit numbers in increasing order: 0-1-2-3 and vice-versa for blue arrows, i.e.
the embedding that is produced by our model supports counting up and down.
A.3 Varying Training Limit
In our main experiment we trained our model to classify actions with scenes containing
from zero to three objects. Does this choice influence qualitatively or quantitatively our