SHORT REPORT

Smartphone-based gaze estimation for in-home autism research

Na Yeon Kim¹ | Junfeng He² | Qianying Wu¹ | Na Dai² | Kai Kohlhoff² | Jasmin Turner¹ | Lynn K. Paul¹ | Daniel P. Kennedy³ | Ralph Adolphs¹,⁴,⁵ | Vidhya Navalpakkam²

¹Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, California, USA
²Google Research, Mountain View, California, USA
³Department of Psychological and Brain Sciences, Indiana University, Bloomington, Indiana, USA
⁴Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
⁵Chen Neuroscience Institute, California Institute of Technology, Pasadena, California, USA

Correspondence
Na Yeon Kim, Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, CA, USA.
Email: nayeon@caltech.edu

Funding information
Simons Foundation Autism Research Initiative; Della Martin Foundation; Google

Received: 14 December 2023 | Accepted: 26 March 2024 | DOI: 10.1002/aur.3140

Na Yeon Kim and Junfeng He are joint first authors. Daniel P. Kennedy, Ralph Adolphs, and Vidhya Navalpakkam are joint senior authors.

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial, and no modifications or adaptations are made. © 2024 The Authors. Autism Research published by International Society for Autism Research and Wiley Periodicals LLC.
Abstract
Atypical gaze patterns are a promising biomarker of autism spectrum disorder. Measuring gaze accurately, however, typically requires highly controlled studies in the laboratory using specialized equipment that is often expensive, thereby limiting the scalability of these approaches. Here we test whether a recently developed smartphone-based gaze estimation method could overcome such limitations and take advantage of the ubiquity of smartphones. As a proof-of-principle, we measured gaze while a small sample of well-assessed autistic participants and controls watched videos on a smartphone, both in the laboratory (with lab personnel) and in remote home settings (alone). We demonstrate that gaze data can be efficiently collected, in-home and longitudinally by participants themselves, with sufficiently high accuracy (gaze estimation error below 1° visual angle on average) for quantitative, feature-based analysis. Using this approach, we show that autistic individuals have reduced gaze time on human faces and longer gaze time on non-social features in the background, thereby reproducing established findings in autism using just smartphones and no additional hardware. Our approach provides a foundation for scaling future research with larger and more representative participant groups at vastly reduced cost, also enabling better inclusion of underserved communities.
Lay Summary
Atypical eye gaze is a promising biomarker of autism but generally requires controlled laboratory assessments with expensive eye trackers. Here we leveraged a recently developed smartphone-based method that measures eye movements to overcome such challenges. While individuals watched videos on the phone screen, the smartphone method reliably characterized reduced gaze on human faces relative to non-human background, indicating promise for larger-scale clinical and scientific studies of atypical eye gaze in autism.
KEYWORDS
autism, eye tracking, remote assessment, smartphones, visual attention
INTRODUCTION
Where we look reflects what interests us and determines the visual information available for neural processing (Itti & Koch, 2001). Atypical eye gaze has been one of the most consistent findings in autism research, suggesting its potential as a particularly promising clinical biomarker for screening, diagnosis, and treatment assessment (Bacon et al., 2020; Jones et al., 2023; Perochon et al., 2023; Shic et al., 2022).
A fundamental limitation of eye movement research, however, has been the over-reliance on specialized, expensive hardware in tightly controlled laboratory settings that cannot easily scale. This makes it difficult to obtain large and diverse samples. Recent advances in camera-based methods have been suggested to provide solutions to overcome such challenges, but these have generally been limited in accuracy (e.g., only distinguishing gaze to the left or right side of the screen) (Chang et al., 2021; Erel et al., 2023; Valtakari et al., 2023; Werchan et al., 2022). The spatial resolution of gaze measures appears to be crucial to achieve better performance as a diagnostic tool (Jones et al., 2023) and may help expand the range of visual stimulus types for hypothesis-driven research. Here we leverage a recently developed smartphone-based gaze estimation method (Valliappan et al., 2020), which measures eye movements with higher accuracy than previous camera-based methods. In the current study, we validate and establish this smartphone-based method in a small group of autistic and control participants, by showing that this measure can be acquired with high accuracy in home settings. Our results provide a foundation for remote, longitudinal research in large samples, and increase research access to underserved communities.
Atypical gaze patterns in ASD can broadly be summarized as reduced gaze on socially relevant features (e.g., faces, body parts) and increased gaze on non-social objects and background (Chita-Tegmark, 2016; Constantino et al., 2017; Guillon et al., 2014; Keles et al., 2022; Klin et al., 2015; Wang et al., 2015). Such feature-based differences in gaze have been observed while individuals freely view static images or videos depicting social interactions. Naturalistic videos, such as movies, TV shows, and short video clips, have emerged as a powerful stimulus type because they most potently capture visual attention, facilitate participant engagement, can be selected to include a wide variety of features, and are more pervasive in the real world than static or artificial stimuli (Chevallier et al., 2015; Grall & Finn, 2022). Thus, in validating the smartphone approach, we characterized atypical social gaze in ASD while participants watched YouTube videos on a smartphone.
METHODS
Participants
A total of 17 ASD and 22 typically developing (TD) control participants completed the study. ASD participants of all sexes and races were recruited from the community and local clinics; TD participants were selected from an existing database to best match the ASD group (Supplementary Table 1). The diagnosis of ASD was confirmed by a combination of the ADOS-2 Module 4 with the revised scoring algorithm (Hus & Lord, 2014) and a DSM-5 diagnostic interview by a clinician (L.K.P.). Autistic individuals were not excluded due to comorbid diagnoses of depression, anxiety, or ADHD, but were excluded if the total score on the Beck Depression Inventory-2 was >24 (i.e., moderate-to-severe depression), to avoid potential confounds in gaze patterns that are not specific to ASD (e.g., Suslow et al., 2020).

All participants gave written informed consent after they were offered a full explanation of the study procedures and opportunities to ask questions. The study protocol was approved by the Caltech Institutional Review Board. All participants had normal or corrected-to-normal vision.
Stimuli and procedures
To test the validity of the smartphone approach for in-home settings, we compared the quality of smartphone-based gaze data collected at home by the participants themselves, to smartphone-based data collected with the same participants in controlled laboratory settings with trained personnel, as well as to data obtained from a standard desktop-based Tobii Pro Spectrum eye tracker (Figure 1a; respectively: Remote-Phone, Lab-Phone, and Lab-Tobii). Participants viewed video clips (30 to 155 s each) while we measured their gaze patterns: a total of 18 video clips (about 25 min of videos) in the Lab-Phone condition, and 9 clips in each of the 10 Remote-Phone sessions (180 min of videos from all 10 sessions). Participants watched the same videos in the Lab-Phone and Lab-Tobii conditions to compare gaze patterns across different methods and screen sizes, but the longitudinal Remote-Phone sessions had distinct sets of videos. All video clips were obtained from various shows in YouTube Originals, ranging from sitcoms and interviews to trailers (see Supplementary Material for more information about the video stimuli).
Video clips were presented in task blocks, with 3 video clips per block. The Lab-Phone and Lab-Tobii conditions consisted of 6 blocks (18 video clips), and the Remote-Phone condition included 3 blocks (9 video clips). Participants took breaks between blocks if needed. Each task block began by asking participants to position their head relative to the phone with a camera feed that indicated whether the participant's face was detected at the correct viewing distance (30 cm). When participants settled into a comfortable position, they underwent calibration procedures (see Gaze estimation (eye tracking) for details). After the calibration procedures, three video clips were presented in a randomized order (within each block). There was a 1-s central fixation period right before each video clip. Each task block lasted 5 to 6 min.
Participants from the local community were first invited to visit the laboratory for the Lab-Phone and Lab-Tobii conditions. This allowed them to become familiar with using the phone for the study before taking it to their home for the subsequent Remote-Phone sessions. One participant opted for remote participation and completed the Lab-Phone session via video chat, receiving detailed instructions from an experimenter. The Lab-Tobii condition was omitted for this remote participant. For other participants, the Lab-Phone and Lab-Tobii sessions were conducted in a quiet room with bright overhead lighting. The smartphone was placed on a stand on the desk, positioning the screen at eye level in a near-frontal head pose (i.e., no pan/tilt/roll). To avoid extreme distances to the phone, an oval shape appeared on the phone screen at the beginning and participants were asked to adjust their distance to the phone such that their face fit within the oval. Detailed instructions were provided to participants to ensure the proper setup of lighting conditions, smartphone placement, and the overall environment for the Remote-Phone sessions (see Supplementary Material for verbatim instructions). To minimize additional fatigue and eliminate potential effects of hand movements, participants were instructed not to hold the phone but rather to place it on a hard surface, using a stable object as a phone stand (e.g., an open laptop or a pile of books). Participants had the option to meet briefly with an experimenter over video chat to ensure their setup was correct, especially during the first and second Remote-Phone sessions.

FIGURE 1 High-accuracy gaze estimation with smartphones. (a) Experimental setup. We included three conditions (Lab-Phone, Remote-Phone, and Lab-Tobii) to validate the smartphone-based method in the lab space as well as in home environments, while comparing to the standard desktop-based eye tracker. The illustrations present screen sizes, viewing distances, and the comparison of an example feature size between smartphone and desktop displays. (b) Across the Lab-Phone and Remote-Phone conditions, we obtained good data quality that is comparable to previous studies using the same smartphone-based method (Valliappan et al., 2020). There was no group difference within the Lab-Phone and Remote-Phone conditions, but as expected, the Remote-Phone condition has slightly worse data quality in both groups (*p < 0.05). The ASD group showed greater gaze error in the Lab-Tobii condition compared to the TD group. Open circles are outlier subjects who met exclusion criteria (average gaze error >0.8 cm). Gray dotted lines indicate the size of the human face feature illustrated in panel (a) on the smartphone and Tobii displays, which helps with the interpretation of gaze error relative to the display size. (c) Visualization of estimated gaze for 13 locations during the testing phase of the smartphone-based gaze estimation model. The 13 locations consisted of 9 points (left, center, right × top, center, bottom) and 4 points (2 × 2 grid centers), with random noise on each point. These 13 validation locations were identical for all participants. During this model testing phase, a small emoji (approximately 0.5 cm on screen) appeared in each location. White crosses mark each of the 13 locations, and dotted circles around each white cross indicate 1° and 2° deviations. Fixation data were collected for each location from all participants within each group (ASD and TD) and each condition (Lab-Phone and Remote-Phone), yielding a fixation density map for each location, which was then normalized and smoothed with a two-dimensional Gaussian with a standard deviation of 0.5°. Density estimates for each location ranged from 0 (white) to 1 (dark).
Data exclusion
There were broadly three reasons for missing or excluded data at the level of participant or session. Data availability and reasons for exclusion are specified for every participant and condition in Figure S1. (1) Since the study involved multiple components, some participants performed only a subset of the components (Figure S1, purple). (2) Some of the Remote-Phone data were missing due to technical issues that resulted in transfer failure (Figure S1, dark orange). (3) For all available datasets, gaze estimation error was calculated for each session (see below for details), and we excluded data when session-wise gaze error was >0.8 cm (Figure S1, gray). After applying these three criteria, two ASD participants (ASD #13 and #15 in Figure S1) did not have valid data for any session and were excluded from the subsequent analyses. Apart from these two participants, there was no additional participant-wise exclusion for missing components of the study; participants with valid data from any of the conditions were included in the corresponding conditions.
Thus, the final sample across all three conditions (Lab-Phone, Remote-Phone, and Lab-Tobii) consisted of 15 ASD participants (2 females; age 21–42 years, mean = 32.1 ± 6.3 years; full-scale IQ 106.6 ± 10.7) and 22 TD participants (3 females; age 22–49 years, mean = 33.0 ± 6.3 years; full-scale IQ 105.7 ± 8.1); a detailed characterization of participants is included in Supplementary Table 1. From this sample, 13 ASD and 20 TD participants were included in the analyses for the Lab-Phone condition, 12 ASD and 17 TD for the Remote-Phone condition, and 12 ASD and 22 TD for the Lab-Tobii condition (Figure S1).
Any video clip with less than 50% valid data was excluded from the analyses. This criterion led to no exclusions for the Lab-Phone condition, exclusion of 3 clips (from one ASD participant) from the Lab-Tobii condition, and exclusion of 4 clips (2 from one TD participant, 1 from each of 2 ASD participants; 0.2% of the clips from included participants and sessions) from the Remote-Phone condition. After this exclusion, the average proportion of valid data was 96.6 ± 3.7% in the Lab-Phone condition, 93.8 ± 4.7% in the Lab-Tobii condition, and 97.3 ± 4.8% in the Remote-Phone condition, and there was no difference between the ASD and TD groups (p > 0.05 for all three conditions).
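As a minimal sketch of how such a clip-level criterion could be applied (assuming, hypothetically, that invalid samples are stored as NaN coordinates; this is an illustration, not the authors' actual pipeline):

```python
import numpy as np

MIN_VALID_FRACTION = 0.5  # clips with less than 50% valid gaze samples are excluded

def valid_fraction(gaze_xy: np.ndarray) -> float:
    """Fraction of valid samples in one clip; gaze_xy has shape (n_samples, 2),
    with invalid samples assumed to be marked as NaN."""
    return float(np.mean(~np.isnan(gaze_xy).any(axis=1)))

def usable_clips(clips: dict) -> dict:
    """Keep only clips that meet the validity criterion (clip name -> gaze array)."""
    return {name: xy for name, xy in clips.items()
            if valid_fraction(xy) >= MIN_VALID_FRACTION}
```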
Gaze estimation (eye tracking)
Throughout the study, gaze data were collected with a custom Android app installed on a Google Pixel 3a XL phone. As in Valliappan et al. (2020), the app displayed the stimuli along with task instructions while receiving user responses via click/touch on the screen. The app also captured and stored the front-facing camera feed, which was subsequently transferred to a secure data server at Caltech. The two eye regions of the video were then cropped to protect participant confidentiality, and the cropped video data were shared with Google for gaze estimation. All gaze estimation was performed offline using the algorithm described in Valliappan et al. (2020). The output of the algorithm consisted of the x- and y-coordinates of the estimated gaze point and a timestamp. The bounding box of the face region was recorded from the camera feed at each time point, which was used to calculate changes in head pose and viewing distance.
As illustrated in Figure 1a, the phone screen dimensions were 13.6 × 6.8 cm (full device dimensions: 16.0 × 7.6 × 0.8 cm) with a resolution of 2160 × 1080 pixels. All tasks on the phone were conducted in landscape mode. The viewing distance for the smartphone sessions was 28 to 33 cm (11 to 13 inches), which was adjusted at the beginning of every task block as described below. The average temporal resolution of the smartphone-based data was 20 Hz. A desktop-based Tobii eye tracker (Tobii Pro Spectrum; 600 Hz; 52.6 × 29.6 cm; 1920 × 1080 pixels) was used to obtain reference data for comparison. The viewing distance for the Tobii session was 60 to 65 cm.
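Gaze error on the phone is reported below both in centimeters on the screen and in degrees of visual angle; the conversion depends on viewing distance. A minimal sketch of this conversion is shown here (the feature sizes in centimeters are hypothetical values chosen only to reproduce the approximate 3° vs. 6° face sizes mentioned in the Results):

```python
import math

def cm_to_deg(size_cm: float, viewing_distance_cm: float) -> float:
    """Convert an on-screen extent (cm) into degrees of visual angle."""
    return math.degrees(2 * math.atan(size_cm / (2 * viewing_distance_cm)))

# The 0.8 cm exclusion threshold at the ~30 cm smartphone viewing distance:
print(round(cm_to_deg(0.8, 30), 2))   # ~1.53, i.e., the "1.5 degrees" quoted below

# Hypothetical feature sizes: a face rendered ~1.6 cm tall on the phone vs.
# ~6.5 cm tall on the desktop monitor viewed from ~62 cm
print(round(cm_to_deg(1.6, 30), 1))   # ~3 degrees on the smartphone display
print(round(cm_to_deg(6.5, 62), 1))   # ~6 degrees on the Tobii display
```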
Similar to Valliappan et al. (2020), calibration procedures consisted of a smooth-pursuit task with a small emoji (60 s) and a 13-location validation (20 s), both of which were conducted at the beginning of every task block. Gaze data during the smooth pursuit were used to fine-tune the gaze estimation model, and the 13-location validation was used to test the model accuracy. The 13 locations included 9 points (left, center, right × top, center, bottom) and 4 points (2 × 2 grid centers) with random noise on each point. The same 13 locations were used for all participants. Gaze estimation error was defined as the average Euclidean distance of estimated gaze points from each of the 13 locations. Gaze errors from all blocks of the same session were then averaged into one value per participant per session, which was used as a criterion for quality-based exclusion. The entire session (either the entire Lab-Phone condition or each of the 10 Remote-Phone sessions) was excluded when the session-average gaze error was >0.8 cm (1.5°). In addition to estimated gaze, two measures of head motion and positional changes were also captured from the smartphone camera feed: the change in viewing distance, which was measured by the variability of the size of the bounding box that identifies the participant's face, and the angular motion of the head, which was measured by the variability of the distance between the two eyes. We examined how motion might contribute to large gaze error (Figure S2b).
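A minimal sketch of these session-level quality measures follows; array shapes and variable names are hypothetical, and the code illustrates the stated definitions rather than reproducing the authors' implementation:

```python
import numpy as np

ERROR_THRESHOLD_CM = 0.8  # session-level exclusion criterion described above

def block_gaze_error(est_xy_cm: np.ndarray, target_xy_cm: np.ndarray) -> float:
    """Average Euclidean distance (cm) between estimated gaze points and their
    validation targets for one calibration block; both arrays are (n_samples, 2)."""
    return float(np.mean(np.linalg.norm(est_xy_cm - target_xy_cm, axis=1)))

def session_gaze_error(blocks) -> float:
    """Average the per-block errors into one value per participant per session.
    blocks: iterable of (estimated_xy, target_xy) array pairs, one per task block."""
    return float(np.mean([block_gaze_error(est, tgt) for est, tgt in blocks]))

def head_motion_proxies(face_boxes: np.ndarray, eye_xy: np.ndarray):
    """Variability of face bounding-box size (a proxy for viewing-distance changes,
    using box area here) and of the inter-eye distance (a proxy for angular head
    motion). face_boxes: (n_frames, 4) as [x1, y1, x2, y2]; eye_xy: (n_frames, 2, 2)
    with one (x, y) pair per eye."""
    box_area = (face_boxes[:, 2] - face_boxes[:, 0]) * (face_boxes[:, 3] - face_boxes[:, 1])
    inter_eye = np.linalg.norm(eye_xy[:, 0] - eye_xy[:, 1], axis=1)
    return float(np.std(box_area)), float(np.std(inter_eye))

# A session would be excluded when session_gaze_error(blocks) > ERROR_THRESHOLD_CM.
```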
For the Lab-Tobii condition, we used a 9-point calibration and validation procedure at the beginning of every task block.
Automatic segmentation of areas of interest (AOIs)
To characterize each participant's social visual attention while watching the video stimuli, we computed gaze duration on human facial features (Face) and non-human regions (Background). We first annotated human face and body areas within each frame of the video stimuli by employing pre-trained neural networks from publicly available packages, as illustrated in Figure 2a. Face areas were identified as bounding boxes that outline the inner area (i.e., eyes, nose, mouth) of each detected face, without the hair or neck area, using the RetinaFace model provided in the InsightFace deep face analysis toolbox (Deng et al., 2020). The contours of human body areas were detected by using the DensePose module in the Detectron2 software (Guler et al., 2018), and regions outside the human body contours were identified as the Background. While these AOIs were fixed for a given frame and consistent across all participants, they moved around dynamically on the screen between frames as the characters or objects in the video moved. We visualized the annotations in all video stimuli and ensured that these AOIs were identified accurately.
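The sketch below illustrates this per-frame annotation step under stated assumptions: it uses InsightFace's FaceAnalysis wrapper (the exact entry point and the "buffalo_l" model pack name may differ across package versions and from the authors' setup), and it assumes the DensePose body contours have already been converted into a per-frame boolean mask, since that Detectron2 step is not shown here.

```python
import numpy as np
from insightface.app import FaceAnalysis  # wraps a RetinaFace-style detector

# Assumed InsightFace entry point; model names and options may differ by version.
detector = FaceAnalysis(name="buffalo_l")
detector.prepare(ctx_id=0, det_size=(640, 640))

def annotate_frame(frame_bgr: np.ndarray, body_mask: np.ndarray):
    """Return Face bounding boxes and a Background mask for one video frame.

    frame_bgr: (H, W, 3) uint8 image; body_mask: (H, W) bool array that is True
    inside detected body contours (e.g., from Detectron2's DensePose, not shown).
    """
    faces = detector.get(frame_bgr)
    face_boxes = [f.bbox.astype(int) for f in faces]  # each box is [x1, y1, x2, y2]
    background_mask = ~body_mask                      # everything outside any person
    return face_boxes, background_mask
```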
Gaze duration analyses
Each gaze data point was coded as Face, Background, or neither, in order to compute the percentage of total gaze time on each content type. For the Lab-Phone and Remote-Phone conditions, we first upsampled gaze data to 50 Hz to overcome inconsistencies in sampling intervals and then downsampled to the frame rate (i.e., one gaze point per frame). Each gaze point of the downsampled data was represented as a 1°-diameter disc. The gaze point was coded as Face if any part of the gaze disc overlapped with the face areas. If the gaze disc did not overlap either with human face boxes or with areas inside the body contours, it was coded as Background. The proportion of Face- or Background-coded gaze out of all video frames that contained any human content was calculated and averaged across the video clips within each session.
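A minimal sketch of this coding step is given below. It simplifies in two respects: the Background check tests only the gaze point against the body mask rather than the full disc, and the disc radius is assumed to be pre-converted from 0.5° to pixels; all variable names are hypothetical.

```python
def disc_hits_box(gx: float, gy: float, box, radius_px: float) -> bool:
    """True if a disc of radius_px around the gaze point overlaps an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    nearest_x = min(max(gx, x1), x2)
    nearest_y = min(max(gy, y1), y2)
    return (gx - nearest_x) ** 2 + (gy - nearest_y) ** 2 <= radius_px ** 2

def code_gaze(per_frame_gaze, face_boxes, body_masks, radius_px):
    """Label the (already frame-rate-resampled) gaze point of each frame."""
    labels = []
    for (gx, gy), boxes, body_mask in zip(per_frame_gaze, face_boxes, body_masks):
        if any(disc_hits_box(gx, gy, b, radius_px) for b in boxes):
            labels.append("face")
        elif not body_mask[int(round(gy)), int(round(gx))]:
            labels.append("background")  # outside face boxes and outside body contours
        else:
            labels.append("neither")
    return labels

def gaze_proportions(labels, frame_has_person):
    """Proportions of face/background gaze among frames containing any human content."""
    idx = [i for i, has_person in enumerate(frame_has_person) if has_person]
    n = max(len(idx), 1)
    return (sum(labels[i] == "face" for i in idx) / n,
            sum(labels[i] == "background" for i in idx) / n)
```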
RESULTS
We found low gaze estimation error in both Lab-Phone and Remote-Phone settings (Figure 1b; outliers with a gaze error >0.8 cm on screen (approximately 1.5°) were excluded from analysis and are indicated by open circles; ASD: 4 out of 17; TD: 2 out of 22). After excluding outliers, we obtained similar gaze accuracy as reported in other studies using the same smartphone-based technology in the general adult population (Tseng et al., 2021; Valliappan et al., 2020). Importantly, gaze accuracy did not differ between the ASD (Lab-Phone: 0.84° ± 0.25; Remote-Phone: 0.95° ± 0.22) and TD (Lab-Phone: 0.79° ± 0.26; Remote-Phone: 0.84° ± 0.22) groups, F(1, 31) = 1.13, p = 0.30. As expected, gaze error increased slightly for both groups in the Remote-Phone condition, F(1, 25) = 7.04, p = 0.014, potentially due to greater variability in the environmental setup, such as lighting, pose, and distractions. Gaze error in the Lab-Tobii data was slightly higher in the ASD group (0.85° ± 0.54) than the TD group (0.44° ± 0.40), F(1, 33) = 6.89, p = 0.01. It is important to note that, due to the differences in screen size and viewing distance, a feature (e.g., the human face in Figure 1a) would appear half the size on a smartphone screen (3°) compared to a desktop monitor attached to the Tobii eye tracker (6°). In Figure 1b, gray dotted lines illustrate the size of this example feature on both display types as a guide for the interpretation of gaze error relative to the display sizes. Furthermore, as visualized in Figure 1c, the majority of estimated gaze remained within 1° across different locations on the screen, ensuring excellent quality, sufficient for us to proceed with gaze analyses based on specific features of videos. Additional analyses of head motion suggested that greater changes in head pose angle and viewing distance may have contributed to the larger gaze error in the excluded outliers (Figure S2).
Next, we examined atypical social gaze in ASD, which is characterized by reduced gaze on faces and increased gaze on nonsocial features compared to controls. Within each video frame, we identified areas in which human faces were displayed (Face) and areas in which no human face or body features were displayed (Background), as shown in Figure 2a, and compared the proportion of time that participants spent looking at each of those regions. As demonstrated in Figure 2b, the Lab-Phone condition indeed replicated this pattern of decreased gaze on faces (t = 3.66, p < 0.001, d = 1.30 [0.50, 2.10]; Wilcoxon–Mann–Whitney W = 46, p = 0.001) and increased gaze on the background (t = 3.0, p = 0.005, d = 1.07 [0.29, 1.85]; Wilcoxon–Mann–Whitney W = 194, p = 0.02), confirming that smartphone-based data can reproduce classic feature-based gaze differences in the ASD literature. The same pattern of results was obtained when we controlled for gaze error, angular changes of head pose, and distance changes (for both main effects: p < 0.01). Since the very same participants and stimuli were also included in the Lab-Tobii condition, we next compared each individual's gaze duration measures between the Lab-Phone and the Lab-Tobii conditions. Despite the considerable differences in screen size and spatiotemporal resolution, we found high within-individual reliability between the two conditions, both for the time spent looking at faces (Spearman's rho = 0.74, p < 0.001; Figure 2c) and for the time spent looking at the background (Spearman's rho = 0.67, p < 0.001). This further validates feature-based gaze measures from the smartphone method in comparison to the conventional eye-tracking method.
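A minimal sketch of these group and reliability comparisons using SciPy follows; variable names are hypothetical, and the covariate-controlled models reported above are not reproduced here:

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d computed with a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1))
                        / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled_sd

def compare_groups(td_face_prop, asd_face_prop):
    """Parametric and rank-based comparison of face-looking proportions between groups."""
    t, p_t = stats.ttest_ind(td_face_prop, asd_face_prop)
    w, p_w = stats.mannwhitneyu(td_face_prop, asd_face_prop, alternative="two-sided")
    return {"t": t, "p_t": p_t, "d": cohens_d(td_face_prop, asd_face_prop),
            "W": w, "p_W": p_w}

def cross_condition_reliability(face_prop_phone, face_prop_tobii):
    """Rank-order (Spearman) correlation of per-participant face-looking time."""
    rho, p = stats.spearmanr(face_prop_phone, face_prop_tobii)
    return rho, p
```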
How well does the smartphone-based method work in the much less controlled environment of the participants' own homes, without an experimenter directly overseeing the acquisition? Our Remote-Phone condition asked participants to collect and upload data themselves, without the presence of an experimenter.
FIGURE 2 Reduced attention to social features in autism. (a) Clips from YouTube Originals were used as video stimuli containing both social and nonsocial features. Locations of human faces (Face, orange; detected as bounding boxes that cover inner facial features) or locations without people (Background, areas outside the white contours) were annotated as shown in the example frame. Panels on the right present aggregated gaze heatmaps from each group that were obtained within 1 s of participants viewing the example frame. (b) Using the high-accuracy smartphone-based method, we reproduced the established finding in the literature of reduced face-looking time in autism, both for the Lab-Phone and Lab-Tobii conditions. (c) To compare the smartphone-based measures with the conventional desktop-based eye tracker, the same individuals watched the same video stimuli once on a smartphone (Lab-Phone) and another time on a big computer screen (Lab-Tobii). Individual participants' bias for looking at faces was reliable across Lab-Phone and Lab-Tobii, based on rank-order correlations between conditions. (d) We also found reduced face-looking time in autism for the Remote-Phone condition. This plot presents average gaze durations for each individual across all Remote-Phone sessions. (e) Individual participants' bias for looking at faces was significantly correlated between the Lab-Phone and Remote-Phone conditions, indicating within-individual reliability of face-looking time measures between lab and remote settings. (f) The stability of findings was further examined across 10 weeks of longitudinal data collection in the Remote-Phone condition. Rank-order correlations on the proportion of time looking at faces (similar to panels (c) and (e)) were examined for all possible pairs of the 10 Remote-Phone sessions, yielding significant positive rho values (bold, black fonts) in the majority of the pairs. A potential reason for lower reliability in sessions 1, 9, and 10 (non-significant Spearman's rho in gray) is that a relatively small number of participants were included in the analyses after applying the exclusion criteria (see Methods). *p < 0.05, **p < 0.01, ***p < 0.001.
In line with the high-quality calibration data (Figure 1b,c), we also replicated the finding of reduced social gaze in the Remote-Phone condition (Figure 2d). Once again, we observed reduced gaze on faces (t = 2.91, p = 0.007, d = 1.10 [0.27, 1.93]; Wilcoxon–Mann–Whitney W = 50, p = 0.02) and increased gaze on the background (t = 2.26, p = 0.03, d = 0.85 [0.043, 1.66]; Wilcoxon–Mann–Whitney W = 144, p = 0.07). Participants showed reliable gaze biases across the Remote-Phone and Lab-Phone conditions (Spearman's rho = 0.58, p = 0.002; Figure 2e), although different videos were used in the two conditions. A final advantage of the Remote-Phone condition is the opportunity to collect dense longitudinal data more easily. We obtained a rank-order correlation on the proportion of time looking at faces across all participants for every possible pair of the 10 Remote-Phone sessions. As shown in Figure 2f, we found positive correlations across the Remote-Phone sessions (rho = 0.48 ± 0.18), suggesting that individual differences in face-looking behavior were stable across time and across different videos. Lower reliability in sessions 1, 9, and 10 (non-significant Spearman's rho in gray in Figure 2f) was potentially due to the relatively small number of participants included in the analyses after applying the exclusion criteria. Future studies with a larger sample could further examine the reliability of individual differences across time and across different stimuli and explore any meaningful variability within each individual.
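A minimal sketch of these pairwise session correlations, assuming (hypothetically) a participants-by-sessions table of face-looking proportions in which missing sessions are stored as NaN:

```python
from itertools import combinations

import pandas as pd
from scipy import stats

def pairwise_session_reliability(face_time: pd.DataFrame) -> dict:
    """Spearman's rho for every pair of Remote-Phone sessions.

    face_time: rows are participants, columns are sessions 1-10; participants
    missing either session of a pair are dropped for that pair.
    """
    results = {}
    for s1, s2 in combinations(face_time.columns, 2):
        pair = face_time[[s1, s2]].dropna()
        rho, p = stats.spearmanr(pair[s1], pair[s2])
        results[(s1, s2)] = (rho, p, len(pair))
    return results
```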
DISCUSSION
The purpose of the current study was to test the application of a new smartphone-based gaze estimation method in clinical and scientific studies of eye movement patterns in ASD. From the smartphone-based gaze measures, we characterized reduced social visual engagement in autistic participants compared to controls, based on the proportions of looking time to faces and non-human background features presented in YouTube videos. These gaze-time measures were highly consistent with what we obtained from the desktop-based eye tracker. We further demonstrated that gaze data can also be collected in remote home settings without extensive instructions or monitoring by experimenters. Our study suggests that this in-home smartphone-based method could be used in future studies to track longitudinal changes across development, aging, or through the variability of everyday life.
With a rapidly growing effort to develop advanced algorithms that estimate gaze from camera inputs (Kaduk et al., 2023; Park et al., 2019; Saxena et al., 2023; Valliappan et al., 2020), smartphones are an excellent tool to achieve both broad (large sample sizes) and dense (i.e., extensive data per participant or longitudinal studies) datasets. The ubiquity of smartphones would facilitate more representative participant samples and the inclusion of participants from traditionally underrepresented communities (Werchan et al., 2022). Various camera-based gaze estimation methods have recently been explored not only in autism research (Chang et al., 2021; Perochon et al., 2023), but also in developmental and decision sciences (Erel et al., 2023; Werchan et al., 2022; Yang & Krajbich, 2021). Our work leveraged a smartphone-based gaze estimation method that achieves a level of accuracy suitable for analyzing gaze patterns based on specific features of naturalistic videos, and is thus well suited to characterizing gaze patterns in autism. This accuracy is also suitable for computational gaze models that incorporate a rich set of audiovisual features at multiple levels, from low-level visual to higher-level semantic descriptions, allowing researchers to explore a wide range of hypotheses (Wang et al., 2015). This approach can offer valuable insight into the variability both within and between individuals, for example, the possibility of discovering data-driven subtypes of autism. The ability to collect large, densely sampled data through accurate and scalable methods, like the one employed in our current study, holds great promise for future research.
Our study had a number of limitations that can be improved upon in future work. We identified head movement during calibration (both angular motion and changes in distance from the screen) as one factor reducing data quality (Figure S2b,c); this could be improved by instructing participants to rest their head against a headrest, or by mounting the smartphone on a movable arm over a reclining chair or a bed to maximize comfort. In addition, our sample consisted of high-IQ autistic and non-autistic adults. Further research would be necessary to examine how this method can be used with young children or individuals with lower IQ. In addition to requiring explicit consent and supervision by parents or legal guardians, it would be critical to assess whether comparable accuracy can be maintained, considering that these groups may exhibit greater head movement or a tendency to look away from the screen. Tailoring the setup, by providing detailed instructions for seating arrangements with caregivers, refining calibration procedures, and using stimuli that are more engaging and capable of capturing attention, would be essential for accommodating these groups. Nonetheless, even with a relatively small sample, we demonstrated that significant differences in gaze patterns while watching videos can be reliably reproduced using just smartphones and no other specialized hardware, providing an initial validation of this novel method for ASD research.
This in-home smartphone-based method has the potential to revolutionize the scope of eye-gaze studies in ASD. It offers the potential for scaling both the sample size (from a few tens to several thousand participants across the world, with the relevant consents) and the amount of data per participant by several orders of magnitude, through multiple short sessions in comfortable home environments. This approach would offer an opportunity to investigate important open questions regarding heterogeneity and subtypes in autism (Lombardo et al., 2019). Finally, it holds tremendous potential for clinical applications, in line with the growing interest in utilizing smartphones for remote diagnoses and interventions in psychiatry ("digital phenotyping") that could help improve the well-being of autistic individuals (Gillan & Rutledge, 2021; Insel, 2017).
ACKNOWLEDGMENTS
We would like to thank Dr. Umit Keles for helpful advice on automatic feature segmentation and data analysis, members of the Emotion and Social Cognition Laboratory at Caltech for helpful discussions, and all our participants and their families for their participation. This study was funded in part by grants from the Della Martin Foundation, the Simons Foundation Autism Research Initiative, and Google.
CONFLICT OF INTEREST STATEMENT
J.H., K.K., and V.N. are employees of Google. N.D. was
at Google when the study was designed and conducted.
All other authors declare no competing interests.
DATA AVAILABILITY STATEMENT
To protect participants' privacy, captured full-face image data will not be publicly available. The de-identified gaze estimates (x- and y-coordinates of estimated gaze on screen) and video stimuli are available upon reasonable request. Implementation details of the gaze estimation model are available in Valliappan et al. (2020). Preprocessing and analyses of gaze data and automatic feature segmentation were performed using custom MATLAB and Python scripts, which are available at https://github.com/nayeonckim/smartphoneET_autism_analysis.
ORCID
Na Yeon Kim https://orcid.org/0000-0002-6832-5528
Kai Kohlhoff https://orcid.org/0000-0003-2068-2531
REFERENCES
Bacon, E. C., Moore, A., Lee, Q., Carter Barnes, C., Courchesne, E., & Pierce, K. (2020). Identifying prognostic markers in autism spectrum disorder using eye tracking. Autism, 24(3), 658–669. https://doi.org/10.1177/1362361319878578
Chang, Z., Di Martino, J. M., Aiello, R., Baker, J., Carpenter, K., Compton, S., Davis, N., Eichner, B., Espinosa, S., Flowers, J., Franz, L., Harris, A., Howard, J., Perochon, S., Perrin, E. M., Krishnappa Babu, P. R., Spanos, M., Sullivan, C., Walter, B. K., & Sapiro, G. (2021). Computational methods to measure patterns of gaze in toddlers with autism spectrum disorder. JAMA Pediatrics, 175(8), 827–836. https://doi.org/10.1001/jamapediatrics.2021.0530
Chevallier, C., Parish-Morris, J., McVey, A., Rump, K. M., Sasson, N. J., Herrington, J. D., & Schultz, R. T. (2015). Measuring social attention and motivation in autism spectrum disorder using eye-tracking: Stimulus type matters. Autism Research, 8(5), 620–628. https://doi.org/10.1002/aur.1479
Chita-Tegmark, M. (2016). Attention allocation in ASD: A review and meta-analysis of eye-tracking studies. Review Journal of Autism and Developmental Disorders, 3(3), 209–223. https://doi.org/10.1007/s40489-016-0077-x
Constantino, J. N., Kennon-McGill, S., Weichselbaum, C., Marrus, N., Haider, A., Glowinski, A. L., Gillespie, S., Klaiman, C., Klin, A., & Jones, W. (2017). Infant viewing of social scenes is under genetic control and is atypical in autism. Nature, 547(7663), 340–344. https://doi.org/10.1038/nature22999
Deng, J., Guo, J., Ververas, E., Kotsia, I., & Zafeiriou, S. (2020). RetinaFace: Single-shot multi-level face localisation in the wild. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 5203–5212. https://doi.org/10.1109/CVPR42600.2020.00525
Erel, Y., Shannon, K. A., Chu, J., Scott, K., Struhl, M. K., Cao, P., Tan, X., Hart, P., Raz, G., Piccolo, S., Mei, C., Potter, C., Jaffe-Dax, S., Lew-Williams, C., Tenenbaum, J. B., Fairchild, K., Bermano, A., & Liu, S. (2023). iCatcher+: Robust and automated annotation of infants' and young children's gaze behavior from videos collected in laboratory, field, and online studies. Advances in Methods and Practices in Psychological Science, 6(2).
Gillan, C. M., & Rutledge, R. B. (2021). Smartphones and the neuroscience of mental health. Annual Review of Neuroscience, 44, 129–151. https://doi.org/10.1146/annurev-neuro-101220-014053
Grall, C., & Finn, E. S. (2022). Leveraging the power of media to drive cognition: A media-informed approach to naturalistic neuroscience. Social Cognitive and Affective Neuroscience, 17(6), 598–608. https://doi.org/10.1093/scan/nsac019
Guillon, Q., Hadjikhani, N., Baduel, S., & Rogé, B. (2014). Visual social attention in autism spectrum disorder: Insights from eye tracking studies. Neuroscience and Biobehavioral Reviews, 42, 279–297. https://doi.org/10.1016/j.neubiorev.2014.03.013
Guler, R. A., Neverova, N., & Kokkinos, I. (2018). DensePose: Dense human pose estimation in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7297–7306. http://arxiv.org/abs/1612.01202
Hus, V., & Lord, C. (2014). The autism diagnostic observation schedule, module 4: Revised algorithm and standardized severity scores. Journal of Autism and Developmental Disorders, 44(8), 1996–2012. https://doi.org/10.1007/s10803-014-2080-3
Insel, T. R. (2017). Digital phenotyping: Technology for a new science of behavior. JAMA, 318(13), 1215–1216. https://doi.org/10.1001/jama.2017.11295
Itti, L., & Koch, C. (2001). Computational modelling of visual attention. Nature Reviews Neuroscience, 2(3), 194–203. https://doi.org/10.1038/35058500
Jones, W., Klaiman, C., Richardson, S., Aoki, C., Smith, C., Minjarez, M., Bernier, R., Pedapati, E., Bishop, S., Ence, W., Wainer, A., Moriuchi, J., Tay, S. W., & Klin, A. (2023). Eye-tracking-based measurement of social visual engagement compared with expert clinical diagnosis of autism. JAMA, 330(9), 854–865. https://doi.org/10.1001/jama.2023.13295
Kaduk, T., Goeke, C., Finger, H., & Konig, P. (2023). Webcam eye tracking close to laboratory standards: Comparing a new webcam-based system and the EyeLink 1000. Behavior Research Methods, 1–21. https://doi.org/10.3758/s13428-023-02237-8
Keles, U., Kliemann, D., Byrge, L., Saarimäki, H., Paul, L. K., Kennedy, D. P., & Adolphs, R. (2022). Atypical gaze patterns in autistic adults are heterogeneous across but reliable within individuals. Molecular Autism, 13(1), 1–16. https://doi.org/10.1186/s13229-022-00517-2
Klin, A., Shultz, S., & Jones, W. (2015). Social visual engagement in infants and toddlers with autism: Early developmental transitions and a model of pathogenesis. Neuroscience and Biobehavioral Reviews, 50, 189–203. https://doi.org/10.1016/j.neubiorev.2014.10.006
Lombardo, M. V., Lai, M. C., & Baron-Cohen, S. (2019). Big data approaches to decomposing heterogeneity across the autism spectrum. Molecular Psychiatry, 24(10), 1435–1450. https://doi.org/10.1038/s41380-018-0321-0
Park, S., Mello, S. D., Molchanov, P., Iqbal, U., Hilliges, O., & Kautz, J. (2019). Few-shot adaptive gaze estimation. Proceedings of the IEEE/CVF International Conference on Computer Vision, 9368–9377.
Perochon, S., Di Martino, J. M., Carpenter, K. L. H., Compton, S., Davis, N., Eichner, B., Espinosa, S., Franz, L., Krishnappa Babu, P. R., Sapiro, G., & Dawson, G. (2023). Early detection of autism using digital behavioral phenotyping. Nature Medicine, 29, 2489–2497. https://doi.org/10.1038/s41591-023-02574-3
Saxena, S., Fink, L. K., & Lange, E. B. (2023). Deep learning models for webcam eye tracking in online experiments. Behavior Research Methods, 1–17. https://doi.org/10.3758/s13428-023-02190-6
Shic, F., Naples, A. J., Barney, E. C., Chang, S. A., Li, B., McAllister, T., Kim, M., Dommer, K. J., Hasselmo, S., Atyabi, A., Wang, Q., Helleman, G., Levin, A. R., Seow, H., Bernier, R., Charwaska, K., Dawson, G., Dziura, J., Faja, S., & McPartland, J. C. (2022). The autism biomarkers consortium for clinical trials: Evaluation of a battery of candidate eye-tracking biomarkers for use in autism clinical trials. Molecular Autism, 13(1), 15. https://doi.org/10.1186/s13229-021-00482-2
Suslow, T., Husslack, A., Kersting, A., & Bodenschatz, C. M. (2020). Attentional biases to emotional information in clinical depression: A systematic and meta-analytic review of eye tracking findings. Journal of Affective Disorders, 274, 632–642. https://doi.org/10.1016/j.jad.2020.05.140
Tseng, V. W., Valliappan, N., Ramachandran, V., Choudhury, T., & Navalpakkam, V. (2021). Digital biomarker of mental fatigue. npj Digital Medicine, 4(1), 47. https://doi.org/10.1038/s41746-021-00415-6
Valliappan, N., Dai, N., Steinberg, E., He, J., Rogers, K., Ramachandran, V., Xu, P., Shojaeizadeh, M., Guo, L., Kohlhoff, K., & Navalpakkam, V. (2020). Accelerating eye movement research via accurate and affordable smartphone eye tracking. Nature Communications, 11(1), 1–12. https://doi.org/10.1038/s41467-020-18360-5
Valtakari, N. V., Hessels, R. S., Niehorster, D. C., Viktorsson, C., Nystrom, P., Falck-Ytter, T., Kemner, C., & Hooge, I. T. C. (2023). A field test of computer-vision-based gaze estimation in psychology. Behavior Research Methods, 56, 1900–1915. https://doi.org/10.3758/s13428-023-02125-1
Wang, S., Jiang, M., Duchesne, X. M., Laugeson, E. A., Kennedy, D. P., Adolphs, R., & Zhao, Q. (2015). Atypical visual saliency in autism spectrum disorder quantified through model-based eye tracking. Neuron, 88(3), 604–616. https://doi.org/10.1016/j.neuron.2015.09.042
Werchan, D. M., Thomason, M. E., & Brito, N. H. (2022). OWLET: An automated, open-source method for infant gaze tracking using smartphone and webcam recordings. Behavior Research Methods, 3149–3163. https://doi.org/10.3758/s13428-022-01962-w
Yang, X., & Krajbich, I. (2021). Webcam-based online eye-tracking for behavioral research. Judgment and Decision Making, 16(6), 1485–1505.
SUPPORTING INFORMATION
Additional supporting information can be found online
in the Supporting Information section at the end of this
article.
How to cite this article: Kim, N. Y., He, J., Wu, Q., Dai, N., Kohlhoff, K., Turner, J., Paul, L. K., Kennedy, D. P., Adolphs, R., & Navalpakkam, V. (2024). Smartphone-based gaze estimation for in-home autism research. Autism Research, 1–9. https://doi.org/10.1002/aur.3140