Scientific Reports | (2024) 14:6858 | https://doi.org/10.1038/s41598-024-56828-2 | www.nature.com/scientificreports
A number sense as an emergent property of the manipulating brain

Neehar Kondapaneni* & Pietro Perona
The ability to understand and manipulate numbers and quantities emerges during childhood, but
the mechanism through which humans acquire and develop this ability is still poorly understood. We
explore this question through a model, assuming that the learner is able to pick up and place small
objects from, and to, locations of its choosing, and will spontaneously engage in such undirected
manipulation. We further assume that the learner’s visual system will monitor the changing
arrangements of objects in the scene and will learn to predict the effects of each action by comparing
perception with a supervisory signal from the motor system. We model perception using standard
deep networks for feature extraction and classification. Our main finding is that, from learning the
task of action prediction, an unexpected image representation emerges exhibiting regularities that
foreshadow the perception and representation of numbers and quantity. These include distinct
categories for zero and the first few natural numbers, a strict ordering of the numbers, and a one-dimensional signal that correlates with numerical quantity. As a result, our model acquires the ability to estimate numerosity, i.e. the number of objects in the scene, as well as subitization, i.e. the ability to recognize at a glance the exact number of objects in small scenes. Remarkably, subitization and numerosity estimation extrapolate to scenes containing many objects, far beyond the three objects used during training. We conclude that important aspects of a facility with numbers and quantities may be learned with supervision from a simple pre-training task. Our observations suggest that cross-modal learning is a powerful learning mechanism that may be harnessed in artificial intelligence.
Background
Mathematics, one of the most distinctive expressions of human intelligence, is founded on the ability to reason
about abstract entities. We are interested in the question of how humans develop an intuitive facility with num-
bers and quantities, and how they come to recognize numbers as an abstract property of sets of objects. There is
wide agreement that innate mechanisms play a strong role in developing a number sense [1-3], that development and learning also play an important role [2], that naming numbers is not necessary for the perception of quantities [4,5], and that a number of brain areas are involved in processing numbers [6,7]. Quantity-tuned units have been described in physiology experiments [3,8-10] as well as in computational studies [11-14].
Related work
The role of learning in developing abilities that relate to the natural numbers and estimation has been recently
explored using computational models. Fang et al. [15] trained a recurrent neural network to count sequentially, and Sabathiel et al. [16] showed that a neural network can be trained to anticipate the actions of a teacher on three counting-related tasks; they find that specific patterns of activity in the network's units correlate with quantities. The ability to perceive numerosity, i.e. a rough estimate of the number of objects in a set, was explored by Stoianov, Zorzi and Testolin [11,12], who trained a deep network encoder to efficiently reconstruct patterns composed of dots and found that the network developed units or "neurons" that were coarsely tuned to quantity, and by Nasr et al. [13], who found the same effect in a deep neural network that was trained on visual object classification, an unrelated task. In these models quantity-sensitive units are an emergent property. In a recent study, Kim et al. [14] observed that a random network with no training will exhibit quantity-sensitive units. After identifying quantity-tuned units, these studies [11-14] train a supervised classifier on a two-set comparison task to assess the numerosity properties encoded by the deep networks. These works showed that training a classifier with supervision, in which the classifier is trained and evaluated on the same task and data distribution, is sufficient for recruiting quantity-tuned units for relative numerosity comparison. Our work focuses on this supervised second stage. Can more be learned with less supervision? We show that a representation for numerosity, one that generalizes to several tasks and extrapolates
to large quantities, may arise through a simple, supervised pre-training task. In contrast to prior work, our pre-training task only contains scenes with up to 3 objects, and our model generalizes to scenes with up to 30 objects.

California Institute of Technology, Pasadena, USA. *email: nkondapa@caltech.edu
Approach
We focus on the interplay of action and perception as a possible avenue for this to happen. More specifically,
we explore whether perception, as it is naturally trained during object manipulation, may develop representations that support a number sense. In order to test this hypothesis we propose a model where perception learns how specific actions modify the world. The model shows that perception develops a representation of the scene which, as an emergent property, can enable the ability to perceive numbers and estimate quantities at a glance [17,18].
In order to ground intuition, consider a child who has learned to pick up objects, one at a time, and let them
go at a chosen location. Imagine the child sitting comfortably and playing with small toys (acorns, Legos, sea
shells) which may be dropped into a bowl. We will assume that the child has already learned to perform at will,
and tell apart, three distinct operations (Fig. 1A). The put (P) operation consists of picking up an object from the surrounding space and dropping it into the bowl. The take (T) operation consists of doing the opposite: picking up an object from the bowl and discarding it. The shake (S) operation consists of agitating the bowl so that the objects inside change their position randomly without falling out. Objects in the bowl may be randomly moved during put and take as well.
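The three operations can be written down as a toy simulator. This is a minimal sketch under our own assumptions: the paper describes the manipulations only conceptually, so the function names, the 100-unit bowl, and the position statistics below are illustrative, not the authors' implementation.

```python
import random

def random_position():
    """A random location inside a 100 x 100 bowl (illustrative units)."""
    return (random.uniform(0, 100), random.uniform(0, 100))

def put(bowl):
    """P: pick an object from the surroundings and drop it into the bowl."""
    return bowl + [random_position()]

def take(bowl):
    """T: remove one object from the bowl and discard it."""
    i = random.randrange(len(bowl))
    return bowl[:i] + bowl[i + 1:]

def shake(bowl):
    """S: rearrange the objects without changing how many there are."""
    return [random_position() for _ in bowl]

# A random play sequence starting from an empty bowl, keeping the
# count inside the {0..3} range used during training in the paper.
bowl, trace = [], []
for _ in range(10):
    choices = [a for a in ("P", "T", "S")
               if not (a == "P" and len(bowl) == 3)
               and not (a == "T" and len(bowl) == 0)]
    action = random.choice(choices)
    before = len(bowl)
    bowl = {"P": put, "T": take, "S": shake}[action](bowl)
    trace.append((action, before, len(bowl)))
```

Note that shake changes every position while preserving the count, so the count is never directly observable from any single action label.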
We hypothesize that the visual system of the learner is engaged in observing the scene, and its goal is predicting the action that has taken place [19] as a result of manipulation. By comparing its prediction with a copy of the action signal from the motor system it may correct its perception, and improve the accuracy of its predictions over time. Thus, by performing P, T, and S actions in a random sequence, manipulation generates a sequence of labeled two-set comparisons to learn from.
We assume two trainable modules in the visual system: a "perception" module that produces a representation of the scene, and a "classification" module that compares representations and guesses the action (Fig. 1). During development, perceptual maps emerge, capable of processing various scene properties. These range from basic elements like orientation [20] and boundaries [21] to more complex features such as faces [22] and objects [23,24]. We propose that, while the child is playing, the visual system is being trained to use one or more such maps to build a representation that facilitates the comparison of the pair of images that are seen before and after a manipulation. These representations are often called embeddings in machine learning.
Figure 1. Schematics of our model. (A) (Left-to-right) A sequence of actions modifies the visual scene over time. (B) (Bottom-to-top) The scene changes as a result of manipulation. The images x_t and x_{t+1} of the scene before and after manipulation are mapped by perception into representations z_t and z_{t+1}. These are compared by a classifier to predict which action took place. Learning monitors the error between the predicted action and a signal from the motor system representing the actual action, and simultaneously updates the weights of both perception and the classifier to increase prediction accuracy. (C) (Bottom-to-top) Our model of perception is a hybrid neural network composed of the concatenation of a convolutional neural network (CNN) with a fully-connected network (FCN 1). The classifier is implemented by a fully connected network (FCN 2) which compares the two representations z_t and z_{t+1}. The two perception networks are actually the same network operating on distinct images, and therefore their parameters are identical and learned simultaneously in a Siamese network configuration [25]. Details of the models are given in Fig. S15.
A classifier network is simultaneously trained to predict the action (P, T, S) from the representation of the
pair of images (see Fig. 1). As a result, the visual system is progressively trained through spontaneous play to
predict (or, more accurately, post-dict) which operation took place that changed the appearance of the bowl.
We postulate that signals from the motor system are available to the visual system and are used as a supervisory signal (Fig. 1B). Such signals provide information regarding the three actions of put, take and shake and,
accordingly, perception may be trained to predict these three actions. Importantly, no explicit signal indicating
the number of objects in the scene is available to the visual system at any time.
Using a simple model of this putative mechanism, we find that the image representation that is being learned
for classifying actions, simultaneously learns to represent and perceive the first few natural numbers, to place
them in the correct order, from zero to one and beyond, as well as estimate the number of objects in the scene.
We use a standard deep learning model of perception [26-28]: a feature extraction stage is followed by a classifier (Fig. 1). The feature extraction stage maps the image x to an internal representation z, often called an embedding. It is implemented by a deep network [27] composed of convolutional layers (CNN) followed by fully connected layers (FCN 1). The classifier, implemented with a simple fully connected network (FCN 2), compares the representations z_t and z_{t+1} of the before and after images to predict which action took place. Feature extraction
and classification are trained jointly by minimizing the prediction error. We find that the embedding dimension
makes little difference to the performance of the network (Fig. S3). Thus, for ease of visualization, we settled on
two dimensions.
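As a concrete illustration, the perception and classifier modules described above can be sketched in PyTorch. Layer counts and widths below are our own guesses, not the authors' configuration; the actual architecture is specified in the paper's Fig. S15.

```python
import torch
import torch.nn as nn

class Perception(nn.Module):
    """CNN + FCN 1: maps an image x to an embedding z.
    Layer sizes are illustrative assumptions."""
    def __init__(self, embed_dim=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())
        self.fcn1 = nn.Sequential(
            nn.Linear(32 * 4 * 4, 64), nn.ReLU(),
            nn.Linear(64, embed_dim))
    def forward(self, x):
        return self.fcn1(self.cnn(x))

class ActionClassifier(nn.Module):
    """FCN 2: compares z_t and z_{t+1}, predicts put / take / shake."""
    def __init__(self, embed_dim=2):
        super().__init__()
        self.fcn2 = nn.Sequential(
            nn.Linear(2 * embed_dim, 32), nn.ReLU(),
            nn.Linear(32, 3))
    def forward(self, z_t, z_t1):
        return self.fcn2(torch.cat([z_t, z_t1], dim=-1))

# Siamese configuration: a single Perception network embeds both
# images, so its weights are shared between the before/after views.
perception = Perception()
classifier = ActionClassifier()
x_t, x_t1 = torch.randn(8, 1, 64, 64), torch.randn(8, 1, 64, 64)
logits = classifier(perception(x_t), perception(x_t1))  # shape (8, 3)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 3, (8,)))
loss.backward()  # jointly updates perception and classifier weights
```

Because the same `perception` object processes both images, one backward pass trains feature extraction and classification jointly, as the text describes.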
We carried out train-test experiments using sequences of synthetic images containing a small number of
randomly arranged objects (Fig. 2). When training we limited the top number of objects to three (an arbitrary choice), and each pair of subsequent images was consistent with one of the manipulations (put, take, shake). We ran our experiments twice with different object statistics. In the first dataset the objects were identical squares; in the second they had variable size and contrast. In the following we refer to the model trained on the first dataset as Model A and the model trained on the second dataset as Model B.
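A data generator in the spirit of the two datasets might look like the sketch below. Apart from the 15 x 15 squares mentioned in the Fig. 2 caption, the image size and the size/contrast ranges for the variable dataset are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

def render(positions, size=100, square=15, vary=False, rng=None):
    """Render a scene as bright squares on a dark background.
    Dataset A: identical 15x15 squares at random positions.
    Dataset B: variable size and contrast (ranges are our guesses)."""
    rng = rng if rng is not None else np.random.default_rng()
    img = np.zeros((size, size), dtype=np.float32)
    for (r, c) in positions:
        s = int(rng.integers(7, 22)) if vary else square
        v = float(rng.uniform(0.2, 1.0)) if vary else 1.0
        img[r:r + s, c:c + s] = v
    return img

rng = np.random.default_rng(0)
# A "put" pair: the after-image contains one extra square.
before = [tuple(rng.integers(0, 85, size=2)) for _ in range(2)]
after = before + [tuple(rng.integers(0, 85, size=2))]
x_t, x_t1 = render(before), render(after)
```

Pairs like `(x_t, x_t1, "P")` are then what the Siamese model trains on; no count label ever appears.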
Results
We found that the models learn to predict the three actions on a test set of novel image sequences (Fig. 3) with an error below 1% on scenes with up to three objects (the highest number during training). Performance degrades progressively for higher numbers beyond the training range. Model B's error rate is higher, consistent with the
task being harder. Thus, we find that our model learns to predict actions accurately as one would expect from
supervised learning. However, there is little ability to generalize the task to scenes containing previously unseen
numbers of objects. Inability to generalize is a well-known shortcoming of supervised machine learning and
will become relevant later.
When we examined the structure of the embedding we were intrigued to find a number of interesting regularities (Fig. 4). First, the images' representations do not spread across the embedding, filling the available dimensions, as is usually the case. Rather, they are arranged along a one-dimensional structure. This trait is very robust to extrapolation: after training (with up to three objects), we computed the embedding of novel images that contained up to thirty objects and found that the line-like structure persisted (Fig. 4A). This embedding line is also robust with respect to the dimensions of the embedding; we tested from two to 256 dimensions and observed it each time (Fig. S3).
Second, images are arranged almost monotonically along the embedding line according to the number of objects that are present (Fig. 4A). Thus, the representation that is developed by the model contains an order. We were curious as to whether the embedding coordinate, i.e. the position of an image along the embedding line, may be used to estimate the number of objects in the image. Any one of the features that make up the coordinates of the embedding provides a handy measure for this position, measured as the distance from the beginning of the line; the value of these coordinates may be thought of as the firing rate of specific neurons [29]. We tested this hypothesis both in a relative and in an absolute quantity estimation task. First, we used the embedding
Figure 2. Training image sequence samples. We trained our model using sequences of images that were generated by randomly concatenating take (T), put (P) and shake (S) manipulations, while limiting the number of objects to the {0...3} set (see "Methods"-Training Sets). We experimented with two different environment/scene statistics: (A) Identical objects (15 x 15 pixel squares) with random position. (B) Objects (squares) of variable position, size and contrast. The overall image intensity is a poor predictor of cardinality in this dataset (statistics in Fig. S14). Images have been inverted to better highlight objects with low contrast.
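The readout described above, estimating the number of objects from a scene's position along the embedding line, can be sketched as follows. Synthetic one-dimensional embeddings stand in for the trained model's output, and the linear calibration on small scenes is one of several reasonable choices, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the learned embedding: a scene with n objects
# lands (noisily) near position n along a one-dimensional embedding line.
counts = rng.integers(0, 31, size=500)
coords = counts + rng.normal(0.0, 0.3, size=counts.shape)

# Calibrate the readout on small scenes only (0..3 objects, matching the
# training range), then extrapolate linearly to estimate larger counts.
small = counts <= 3
slope, intercept = np.polyfit(coords[small], counts[small], deg=1)
estimates = np.rint(slope * coords + intercept).astype(int)
```

Under this toy model, a calibration seen only on 0-3 objects still yields usable estimates for much larger scenes, mirroring the extrapolation reported in the paper.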