Nature Methods | Volume 20 | July 2023 | 1010–1020
Analysis
https://doi.org/10.1038/s41592-023-01879-y
The Cell Tracking Challenge: 10 years of objective benchmarking

Martin Maška1, Vladimír Ulman1,2, Pablo Delgado-Rodriguez3,4, Estibaliz Gómez-de-Mariscal3,4,5, Tereza Nečasová1, Fidel A. Guerrero Peña6,7, Tsang Ing Ren6, Elliot M. Meyerowitz8, Tim Scherr9, Katharina Löffler9, Ralf Mikut9, Tianqi Guo10, Yin Wang10, Jan P. Allebach10, Rina Bao11,12, Noor M. Al-Shakarji12, Gani Rahmon12, Imad Eddine Toubal12, Kannappan Palaniappan12, Filip Lux1, Petr Matula1, Ko Sugawara13,14, Klas E. G. Magnusson15, Layton Aho16, Andrew R. Cohen16, Assaf Arbelle17, Tal Ben-Haim17, Tammy Riklin Raviv17, Fabian Isensee18,19, Paul F. Jäger19,20, Klaus H. Maier-Hein18,21, Yanming Zhu22,23, Cristina Ederra24, Ainhoa Urbiola24, Erik Meijering22, Alexandre Cunha7, Arrate Muñoz-Barrutia3,4, Michal Kozubek1,25 & Carlos Ortiz-de-Solórzano24,25
The Cell Tracking Challenge is an ongoing benchmarking initiative that
has become a reference in cell segmentation and tracking algorithm
development. Here, we present a significant number of improvements
introduced in the challenge since our 2017 report. These include the
creation of a new segmentation-only benchmark, the enrichment of
the dataset repository with new datasets that increase its diversity and
complexity, and the creation of a silver standard reference corpus based
on the most competitive results, which will be of particular interest for
data-hungry deep learning-based strategies. Furthermore, we present
the up-to-date cell segmentation and tracking leaderboards, an in-depth
analysis of the relationship between the performance of the state-of-the-art
methods and the properties of the datasets and annotations, and two
novel, insightful studies about the generalizability and the reusability
of top-performing methods. These studies provide critical practical
conclusions for both developers and users of traditional and machine
learning-based cell segmentation and tracking algorithms.
The field of automated cell tracking has contributed extremely valuable tools with which life scientists conduct their research1–3. However, the emergence of technical developments that improve the resolution4, dimensionality5, extent and throughput6 of optical microscopes demands new, improved tracking algorithms. Furthermore, the fast evolution of machine learning7 is changing the way cell tracking is performed, as deep neural networks rapidly replace classical image analysis methods. These models provide impressive results while posing their own share of challenges related to their training strategies, the quality and quantity of available training data, parametrization, and generalization.
The Cell Tracking Challenge (CTC) (http://celltrackingchallenge.net) is an ongoing initiative that promotes the development and objective evaluation of automated cell tracking algorithms. Launched in 2013 under the auspices of the 10th IEEE (Institute of Electrical and Electronics Engineers) International Symposium on Biomedical Imaging (ISBI), the CTC provides developers with a rich and diverse annotated dataset
Received: 5 August 2022; Accepted: 13 April 2023; Published online: 18 May 2023

A full list of affiliations appears at the end of the paper. e-mail: kozubek@fi.muni.cz; codesolorzano@unav.es
a selected part of the embryo is covered. The silver truth consists of computer-generated segmentation annotations obtained by fusing the results of high-performing benchmarked methods over the training sequences, using the detection gold truth to drive the fusion process (see 'Silver standard reference annotation' in Methods). This silver truth improves the cell instance coverage (99.1% on average), providing the participants with a larger set of annotated cell instances that can be used, for example, to train deep learning models. Supplementary Data Tabs 1 and 2 contain the coverage of both gold truth and silver truth annotations for all datasets. Note that 100% coverage was not attainable because even the best-performing benchmarked methods did not always detect and segment all cell instances in a particular video. All reference annotations are publicly available for the training datasets but are kept secret for the test datasets. This helps prevent overfitting by not enabling the methods to be tuned specifically for the test data. Thus, it ensures that the performance of the methods is evaluated based on their ability to generalize rather than their ability to memorize the training data.
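The idea of detection-driven fusion can be illustrated with a simple pixel-wise majority vote. This is not the organizers' actual fusion algorithm (which is described in 'Silver standard reference annotation' in Methods); it is only a minimal sketch of how detection gold-truth seed points can drive a vote among the masks of several benchmarked methods. The function name and the 0.5 vote threshold are illustrative assumptions:

```python
import numpy as np

def fuse_masks(masks, markers, threshold=0.5):
    """Majority-vote fusion of cell masks from several methods (sketch).

    masks: list of 2D integer label images, one per benchmarked method.
    markers: list of (row, col) detection gold-truth seeds, one per cell.
    A pixel joins a fused cell when more than `threshold` of the methods
    that detected that cell include the pixel in their mask.
    """
    fused = np.zeros(masks[0].shape, dtype=np.int32)
    for cell_id, (r, c) in enumerate(markers, start=1):
        votes = np.zeros(masks[0].shape, dtype=np.float64)
        n_detected = 0
        for lab in masks:
            label = lab[r, c]          # this method's label under the seed
            if label != 0:             # the method detected this cell
                votes += (lab == label)
                n_detected += 1
        if n_detected:
            fused[votes / n_detected > threshold] = cell_id
    return fused
```

With two methods and the default threshold, only pixels segmented by both end up in the fused cell; with more methods, the vote becomes a proper majority.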
Participants, algorithms and handling of submissions
The CTC has witnessed a remarkable increase in participation since the time of our last report in 2017 (ref. 9). The number of participating teams has increased from 16 to 50, representing 19 countries. The number of benchmarked algorithms has also increased from 21 to 89. All submissions, consisting of labeled segmentation masks and, in the case of tracking results, structured text files with cell-lineage graphs, followed standardized naming conventions and were verified by the CTC organizers using the provided executable versions of the algorithms. A complete list of the participants' segmentation and tracking algorithms can be found in Supplementary Data Tabs 3 and 4, and a global overview of the strategies and techniques used is presented in Fig. 3.
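In the CTC convention, the cell-lineage graph accompanying a tracking submission is a plain-text file with one line per track, giving four integers: the track label, its first frame, its last frame, and the label of its parent track (0 when the track has no parent, that is, it roots a lineage tree). A minimal reader, assuming this four-integer-per-line layout:

```python
def read_lineage(path):
    """Parse a CTC-style track file: one 'L B E P' line per track, where
    L is the track label, B/E are the first/last frame in which the track
    exists, and P is the parent track label (0 for a lineage-tree root)."""
    tracks = {}
    with open(path) as fh:
        for line in fh:
            if not line.strip():
                continue
            label, begin, end, parent = map(int, line.split())
            tracks[label] = {"begin": begin, "end": end, "parent": parent}
    return tracks

def daughters(tracks, label):
    """Labels of tracks whose parent is `label` (e.g. after a mitosis)."""
    return sorted(l for l, t in tracks.items() if t["parent"] == label)
```

For example, a file containing `1 0 5 0`, `2 6 9 1` and `3 6 9 1` describes one mother cell dividing into two daughters between frames 5 and 6.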
Approximately one-third of segmentation methods use a separate
detection step first (DetSeg) instead of segmenting the objects directly
(Seg). Regarding the tracking task, we confirm the overall dominance
of methods in which linking is based on a prior per-frame segmentation
(SegLnk or DetSegLnk) over those in which the linking part is based on
per-frame detection only (DetLnkSeg) or simultaneous segmentation
and linking (Seg&Lnk).
Technical performance of the submitted algorithms
The CTC has two benchmarks: the Cell Tracking Benchmark (CTB) and the Cell Segmentation Benchmark (CSB). The CTB, which has been active since the inception of the CTC, evaluates the segmentation (SEG) and tracking (TRA) accuracy of the submitted methods. The CSB, introduced in 2019, focuses on segmentation (SEG) and detection (DET) accuracy, without considering the linking of cells over time. All these evaluation measures are described in Methods (in the 'Quantitative performance criteria' section).
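For orientation, the SEG measure is Jaccard based: each reference cell is matched to the submitted cell that covers more than half of it (at most one such cell can exist), the intersection-over-union of the matched pair is computed (0 if there is no match), and SEG is the mean over all reference cells. A minimal 2D sketch of this idea follows; the official evaluation software described in Methods is the authoritative implementation and also handles 3D data:

```python
import numpy as np

def seg_score(ref, res):
    """Sketch of the Jaccard-based SEG measure for one frame.

    ref, res: 2D integer label images (0 = background) of the reference
    and submitted segmentations. A reference cell R matches the submitted
    cell S covering more than half of R; its score is |R∩S| / |R∪S|,
    or 0 without a match. SEG is the mean score over all reference cells.
    """
    scores = []
    for r_label in np.unique(ref):
        if r_label == 0:
            continue
        r = ref == r_label
        labels, counts = np.unique(res[r], return_counts=True)
        best = labels[np.argmax(counts)]   # dominant result label under R
        if best == 0 or counts.max() <= r.sum() / 2:
            scores.append(0.0)             # no cell covers >50% of R
            continue
        s = res == best
        scores.append(np.logical_and(r, s).sum() / np.logical_or(r, s).sum())
    return float(np.mean(scores))
```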
The scores of all CSB and CTB submissions received before 1 June 2022 can be found in Supplementary Data Tabs 5–9 (CSB) and Tabs 10–14 (CTB). Figure 4a shows the SEG and DET performance of the top-3 CSB methods, along with the overall performance measure (OPCSB), calculated as the arithmetic mean of both measures. Likewise, Fig. 4b shows the SEG and TRA performance of the top-3 CTB methods, along with the overall performance measure (OPCTB). To globally rank the methods, we computed the weighted number of occurrences of each method or its generalizable version (labeled with an asterisk, see 'Generalizability study') in the top-3 positions of the CSB and CTB leaderboards (Fig. 4). We assigned 3, 2 or 1 points for each top-1, top-2 and top-3 occurrence, respectively. Based on this calculation, the top-3 CSB methods are CALT-US (*) (ref. 13), KIT-GE (3) (ref. 14) and, sharing third place, DKFZ-GE (ref. 15), KIT-GE (4) (ref. 16) and KTH-SE (1) (ref. 17), and the top-3 CTB methods are KIT-GE (3), KIT-GE (4) and KTH-SE (1). A description of these methods is given in the 'Top-performing Algorithms' section in Methods (and on the challenge website).
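The overall performance and the weighted top-3 ranking described above are straightforward to reproduce; a minimal sketch (the method names and list layout in the example are illustrative, not taken from the leaderboards):

```python
def op_score(seg, other):
    """Overall performance: arithmetic mean of the two benchmark measures
    (SEG and DET for OPCSB, SEG and TRA for OPCTB)."""
    return 0.5 * (seg + other)

def rank_points(top3_lists):
    """Weighted top-3 occurrences used to rank methods globally:
    3 points per top-1, 2 per top-2 and 1 per top-3 placement.

    top3_lists: iterable of per-leaderboard [first, second, third] names.
    Returns (method, points) pairs sorted by descending points.
    """
    points = {}
    for first, second, third in top3_lists:
        for method, pts in ((first, 3), (second, 2), (third, 1)):
            points[method] = points.get(method, 0) + pts
    return sorted(points.items(), key=lambda kv: -kv[1])
```

For instance, a method placed first on one dataset and second on another accumulates 3 + 2 = 5 points.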
repository of multidimensional time-lapse microscopy videos along with objective measures and procedures to evaluate their algorithms. These highly valuable resources are freely available to the scientific community for use in their research.
In 2014 the first report was published8, describing the CTC submission and evaluation procedures and presenting the analysis of the results submitted by six participants for a repository containing eight datasets. In 2017 an in-depth analysis of 21 algorithms was published9, based on the segmentation and tracking results submitted for 13 datasets. From the results presented, we concluded that the methods that used contextual (that is, spatial and temporal) information, and the few at the time that followed learning strategies, outperformed the more conventional methods. Notably, the state-of-the-art U-Net10 architecture was among the top-performing approaches for cell segmentation in several contrast-enhanced datasets. It was also notable that completely unsupervised tracking methods were still a distant dream. The optimal solutions remained dataset specific due to the complexity and diversity of the datasets. Moreover, most proposed methods were still inadequate for low signal-to-noise ratio videos or for tracking cells with complex shapes or textures. Large three-dimensional (3D) datasets, such as those of developing embryos, were identified as extremely challenging due to the high number and density of cells, as well as the computational requirements of their processing.
Since 2017, the CTC has received a significant number of new submissions and has addressed many of the challenges previously identified in the field of automated cell tracking, as described in the following sections.
Results
Datasets
The CTC dataset repository has been extended from 13 datasets in 2017 to 20 datasets. The new datasets consist of two-dimensional (2D) epi-fluorescence time-lapse videos of human hepatocarcinoma-derived cells expressing a yellow fluorescent protein (YFP)-TIA-1 fusion protein (Fig. 1a); 2D bright-field time-lapse videos of mouse hematopoietic (Fig. 1b) or muscle (Fig. 1c) stem cells in hydrogel microwells; 3D time-lapse videos of green fluorescent protein (GFP)-actin A549 lung cancer cells (Fig. 1d) and their computer-generated counterparts (Fig. 1e) displaying prominent, highly dynamic filopodial protrusions; and mesoscopic videos (imaged across several millimeters at video frame rate) of developing Tribolium castaneum embryos, available as 3D cartographic projections (>10 GB per sequence) (Fig. 1f) or as complete 3D datasets (>100 GB per sequence) (Fig. 1g). Supplementary Table 1 provides a technical description of all of the datasets, and Fig. 2 contains a summary of the main quality properties of the datasets.
Reference annotations
The CTC provides two measure-specific reference annotations: segmentation annotations consisting of cell instance masks, which outline individual cell regions, and tracking annotations consisting of cell markers interlinked between frames to form lineage trees. The reference annotations can be classified into three types based on their source and how they were generated. For synthetic datasets, generated using in-house developed software11,12, the segmentation and detection and/or tracking reference annotations are the exact, simulated digital cell phantoms prior to the addition of distorting noise and blur. For real datasets, we distinguish between a gold standard reference corpus (in short, gold truth) and a novel silver standard reference corpus (in short, silver truth). The gold truth is obtained by taking a majority opinion among three experts. The segmentation gold truth offers limited cell instance coverage (17.8% on average; Supplementary Data Tabs 1 and 2) due to the labor-intensive nature of manual annotations. The detection and tracking gold truth offers complete cell instance coverage, except in large embryonic datasets, where only
Globally, the evolution of the CSB scores obtained by the best-performing methods from 2017 to June 2022 is given in Extended Data Fig. 1a, which shows clear improvement in both detection and segmentation on most datasets, with particularly impressive improvements on two of the most complex datasets (Fluo-C2DL-MSC and Fluo-N3DL-DRO). In summary, even if the cell detection task seems nearly solved for most datasets, the segmentation task still requires further attention for some of the old (Fluo-C2DL-MSC, Fluo-C3DL-MDA231, Fluo-N3DL-DRO and PhC-C2DL-PSC) and new datasets. Looking at the evolution of the CTB segmentation scores (Extended Data Fig. 1c), there is also significant improvement on most datasets, but more work needs to be done to improve the segmentation and tracking performance on the same datasets that have been mentioned for the CSB.
Image quality versus algorithm performance
We analyzed the relationship between the technical performance values obtained by the participants (Supplementary Data Tabs 5–9 and Tabs 10–14) and the quality of the datasets listed in Fig. 2. As described in the 'Statistical analysis' section of Methods, we calculated the Spearman's rank correlation between each numerical quality measure of the datasets and the performance of all competing algorithms. The analysis was conducted globally (that is, considering all datasets), per data modality, and individually per dataset. The results are presented in Supplementary Figs. 1–40.
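Spearman's rank correlation is simply the Pearson correlation of the rank-transformed values, which makes it robust to monotone but nonlinear relationships between a quality measure and a performance score. A minimal tie-free sketch (real analyses should use an implementation with proper tie handling, such as scipy.stats.spearmanr):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    No tie correction; tied values get their first-occurrence order here,
    whereas full implementations average the ranks of ties."""
    def ranks(v):
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)
        return r
    rx, ry = ranks(np.asarray(x, float)), ranks(np.asarray(y, float))
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx**2).sum() * (ry**2).sum()))
```

Feeding it, say, the per-dataset Ove values and the corresponding SEG scores of one method yields a single rho in [-1, 1], matching the kind of values reported above (e.g. rho = 0.4 for Ove versus SEG).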
Globally, we discovered that only the cell overlap (Ove) showed correlation (moderate, rho = 0.4) with the segmentation performance (SEG) of the algorithms (Supplementary Fig. 34). This result, besides pointing at other cofactors that may work along with Ove, indicates that cells that do not dramatically change their shape, or that show moderate motility, are easier to segment than those with high shape variability or high motility.
Looking at the correlations per modality, strong correlations were found between the performance of the methods on Fluo-2D datasets and the signal-to-noise ratio (SNR; positive for TRA, Supplementary Fig. 4), resolution (Res; negative for TRA, Supplementary Fig. 20), shape (Sha; positive for SEG, Supplementary Fig. 22) and mitotic division rate (Mit; positive for TRA, Supplementary Fig. 40). The positive effect of high SNR and regular cell shape (high Sha) could be expected. The counterintuitive benefit of a low Res for tracking can be explained by the negative effect of the low performance values obtained for two complex datasets, Fluo-C2DL-MSC and Fluo-C2DL-Huh7, which have relatively high Res levels but are plagued by irregular cell shape (low Sha), high photobleaching (high change in cell signal intensity over time, that is, high Cha), low SNR and low contrast ratio (CR) (Fluo-C2DL-MSC), high levels of signal heterogeneity (both inside and between cells, that is, Heti and Hetb, respectively) and low Mit (both Fluo-C2DL-MSC and Fluo-C2DL-Huh7). These negative factors clearly outweigh the benefits of their relatively high Res. Regarding the Fluo-3D datasets, the large differences between datasets are reflected in the moderate correlations between the performance of the methods and Res (positive for SEG and TRA, Supplementary Figs. 18 and 20), Sha (negative for SEG, Supplementary Fig. 22) and the spacing between cells (Spa; positive for SEG, Supplementary Fig. 26).
Bright-field performance values correlate with SNR (positive for SEG and TRA, Supplementary Figs. 2 and 4), CR (negative for SEG and TRA, Supplementary Figs. 6 and 8), Heti (negative for SEG and TRA, Supplementary Figs. 10 and 12), Hetb (negative for SEG and TRA, Supplementary Figs. 14 and 16), Res (negative for SEG, Supplementary Fig. 18), Sha (positive for SEG, Supplementary Fig. 22) and Spa (negative for SEG, Supplementary Fig. 26). Most of these correlations could be expected, except for the counterintuitive effect of Res and Spa. This could be explained by the fact that in one of the two bright-field datasets (BF-C2DL-HSC), the negative effect of its low Res and Spa, compared with the other dataset, BF-C2DL-MuSC, seems to be smaller than the benefits of its lower Hetb, uniform shape (high Sha) and high Ove. Regarding the phase contrast (PhC) datasets, the performance values correlate with CR (negative for SEG and TRA, Supplementary Figs. 6 and 8), Heti (positive for SEG and TRA, Supplementary Figs. 10 and 12), Hetb (positive for SEG and TRA, Supplementary Figs. 14 and 16), Res (positive for SEG and TRA, Supplementary Figs. 18 and 20), Sha (negative for TRA, Supplementary Fig. 24), Spa (positive for SEG and TRA, Supplementary Figs. 26 and 28) and Cha (negative for SEG and TRA, Supplementary Figs. 30 and 32). These results are heavily influenced by the fact that the two phase contrast datasets available have strikingly different characteristics. This explains, for instance, the unexpected negative correlation found with CR, given that the dataset with higher CR values (PhC-C2DL-PSC) is more complex to analyze than PhC-C2DH-U373, due to the negative impact of other factors, most notably a significantly lower Res and Spa and a higher Mit. Interestingly, the levels of heterogeneity (both Heti and Hetb) positively correlate with the performance of the methods for this modality, suggesting that the characteristic complex texture and halo-like artifacts of phase contrast images are beneficial for methods based on the recognition of patterns, as is the case for machine learning methods. Finally, no correlations were found for the only existing differential interference contrast (DIC) dataset, as could be expected due to the low number of elements (n = 2 videos) available for the analysis. Beyond these global and modality-specific results, other relevant observations (outside the scope and length of this paper) relating to the properties that affect the segmentation and tracking performance can be obtained from the per-dataset distributions shown in Supplementary Figs. 1–40.

Fig. 1 | CTC datasets added after 2017. a, Fluo-C2DL-Huh7. b, BF-C2DL-HSC. c, BF-C2DL-MuSC. d, Fluo-C3DH-A549. e, Fluo-C3DH-A549-SIM. f, Fluo-N3DL-TRIC (due to the cartographic post-production of this dataset, the spatial resolution varies with position in the image between 0.10 and 0.76 μm per pixel; thus, a fixed-size scale bar is inappropriate for this dataset). g, Fluo-N3DL-TRIF. For the definitions of the dataset names please see the Fig. 2 legend. Scale bars: a–c, g, 50 μm; d, e, 20 μm.
Annotation quality versus algorithm performance
We next analyzed the relationship between the performance of the algorithms and the quality of the available annotations. Figure 5 reports the quality of the gold truth segmentation (MSEGGT), detection (MDETGT) and tracking (MTRAGT) annotations, and of the silver truth segmentation (SEGST) and detection (DETST) annotations. These quality parameters were calculated as explained in the 'Quality of annotations and human-level performance' section in Methods. Note that these annotation quality measurements are not mutually comparable, because the former assess the difficulty of the manual annotation task itself (that is, how much the annotators agreed when manually annotating a particular video), whereas the latter assess the quality of the fused computer-generated results.
We next looked at the correlation between the quality of the reference annotation parameters listed in Fig. 5 and the performance of the competing submitted algorithms (Supplementary Data Tabs 5–9 and Tabs 10–14). The complete set of results can be found in Supplementary Figs. 41–50. Globally, all three gold truth quality annotation parameters moderately correlate with the performance of the algorithms (Supplementary Figs. 42, 44 and 46), conveying the arguable expectation that what is difficult for humans to do is also difficult for automated algorithms to solve. In the context of segmentation, there is room for more consistent annotation, as indicated by the MSEGGT quality scores (Fig. 5); increasing the consistency of the annotations should therefore improve algorithm performance. Our per-modality look at the correlations confirms the global trend with different levels of strength, except for DIC, which could be partly due to the low number of datasets of this modality. Regarding the quality of the silver truth annotations, a strong or moderate global correlation of the quality parameters with SEGST and DETST was found (Supplementary Figs. 48 and 50). This is also expected given that the silver truth annotations were obtained as a combination of the best-performing methods, resulting in almost fully annotated datasets. Finally, modality-based and individual deviations from this rule were also found, in most cases due to a low number of datasets of the modality.

[Fig. 2 table: per-dataset values of the numerical quality measures (SNR, CR, Heti, Hetb, Res, Sha, Spa, Cha, Ove, Mit) and the qualitative Y/N flags (Syn, Ent/Leav, Apo, Deb) for the 20 test datasets (BF-C2DL-HSC, BF-C2DL-MuSC, DIC-C2DH-HeLa, Fluo-C2DL-Huh7, Fluo-C2DL-MSC, Fluo-C3DH-A549, Fluo-C3DH-H157, Fluo-C3DL-MDA231, Fluo-N2DH-GOWT1, Fluo-N2DL-HeLa, Fluo-N3DH-CE, Fluo-N3DH-CHO, Fluo-N3DL-DRO, Fluo-N3DL-TRIC, Fluo-N3DL-TRIF, PhC-C2DH-U373, PhC-C2DL-PSC, Fluo-C3DH-A549-SIM, Fluo-N2DH-SIM+, Fluo-N3DH-SIM+), color-coded from easy to difficult.]

Fig. 2 | Quantitative and qualitative properties of the test datasets. For individual datasets, the columns show their numerical quality measures: signal-to-noise ratio (SNR), contrast ratio (CR), heterogeneity of the signal intensity inside the cells (Heti) and between the cells (Hetb), resolution (Res), shape (Sha), spacing between cells (Spa), change in cell signal intensity over time (Cha), overlap (Ove) and mitotic division rate (Mit). The remaining columns list qualitative observations of various features, such as the presence of synchronous cell divisions (Syn), cells entering or leaving the field of view (Ent/Leav), apoptotic cells (Apo) or debris (Deb). For each quantitative property, the computed values are first filtered for outliers, that is, values more than 1.5-fold the interquartile range below the first quartile or above the third quartile of the data. The remaining values are linearly mapped onto a green-yellow-red color scale to indicate the a priori level of complexity. (The outliers are shown with the darkest green and the darkest red backgrounds, located before and after the white vertical bars on the color key.) These values were computed using the methodology established in 2017 (see the 'Dataset properties' section in Methods). Dataset names: 2D, two dimensional; 3D, three dimensional; A549, human lung adenocarcinoma cells; BF, bright-field; C, cytoplasmic staining; CE, Caenorhabditis elegans; CHO, Chinese hamster ovarian cells; DIC, differential interference contrast; DRO, Drosophila melanogaster; Fluo, fluorescence; GOWT1, mouse embryonic stem cells; H, high resolution; H157, human oral squamous cell carcinoma cells; HeLa, Henrietta Lacks human uterine cervical carcinoma immortalized cells; HSC, mouse hematopoietic stem cells; Huh7, human hepatocarcinoma-derived cells; L, low resolution; MDA231, human breast metastatic adenocarcinoma cells; MSC, rat mesenchymal stem cells; MuSC, mouse muscle stem cells; N, nuclear staining; PhC, phase contrast; PSC, pancreatic stem cells; TRIC, Tribolium castaneum (cartographic projection); TRIF, Tribolium castaneum (full 3D volume); SIM, simulated cells; SIM+, second-generation simulated cells; U373, human glioblastoma–astrocytoma cells.
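The outlier filtering and linear color mapping described in the Fig. 2 legend can be sketched as follows. The function name is illustrative, and the orientation of the scale (which end of a given measure counts as "difficult") depends on the measure and is deliberately not modeled; the sketch assumes at least two distinct non-outlier values:

```python
import numpy as np

def complexity_colors(values):
    """Sketch of the Fig. 2 color mapping: flag outliers beyond 1.5-fold
    the interquartile range below Q1 or above Q3, then linearly rescale
    the remaining values to [0, 1]; outliers saturate at 0 or 1
    (the darkest ends of the green-yellow-red scale)."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = v[(v >= lo) & (v <= hi)]          # non-outlier values
    scaled = (v - kept.min()) / (kept.max() - kept.min())
    return np.clip(scaled, 0.0, 1.0)
```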
Evolution of the segmentation and tracking paradigms
We analyzed how different segmentation and tracking strategies (Fig. 3a), as well as individual detection, segmentation and linking techniques (Fig. 3b), relate to the technical performance of the benchmarked algorithms. Our analysis shows that the DetSeg strategies significantly outperform the Seg strategies for datasets with heavily clustered cells, such as DIC-C2DH-HeLa (Extended Data Fig. 2b). Indeed, the machine learning-based detection of individual cells turns out to be a crucial factor that reduces the number of under-segmentation and over-segmentation errors penalized by DET, as first demonstrated by MU-CZ (2) and MPI-GE (CBG) (1) in two dimensions and three dimensions, respectively, in 2019 (Extended Data Fig. 3 and Supplementary Data Tabs 3 and 4). Nowadays, this detection-driven strategy also dictates the state of the art when analyzing embryonic datasets, as shown by the fact that the top three places in terms of DET are mostly occupied by detection-driven strategies (IGFL-FR, JAN-US, MPI-GE (CBG) (2), MPI-GE (CBG) (3), OX-UK and RWTH-GE (3)) for the Fluo-N3DH-CE, Fluo-N3DL-DRO, Fluo-N3DL-TRIC and Fluo-N3DL-TRIF datasets (Fig. 4a).
In terms of segmentation performance, machine learning-based techniques globally outperform traditional thresholding-based and region-growing-based techniques. This holds for label-free microscopy datasets (Extended Data Fig. 4a–c), for which the establishment of appropriate handcrafted features and rules is generally more difficult than learning them autonomously using neural networks, and also for both the Fluo-2D (Extended Data Fig. 4d) and Fluo-3D (Extended Data Fig. 4e) datasets. Over time, one can observe a substantial improvement in segmentation performance thanks to the introduction of self-configured neural networks (Extended Data Fig. 5 and Supplementary Data Tabs 3 and 4), such as the nnU-Net ('no new U-Net') used in DKFZ-GE or the NAS (neural architecture search) used in UNSW-AU, as well as the multi-branch predictions used in KIT-GE (3) and KIT-GE (4). Finally, we have not found any statistically significant difference in the tracking performance (TRA) of machine learning-based and non-machine learning-based linking techniques across all datasets (Extended Data Fig. 6). Overall, over the 10-year existence of the CTC, one can observe a greater performance improvement of the rapidly evolving machine
[Fig. 3 diagram. a, Strategies: segmentation-only task [27]: Seg [16], DetSeg [11]; segmentation and tracking task [62]: SegLnk [38], Seg&Lnk [5], DetSegLnk [13], DetLnkSeg [6]. b, Techniques used for detection (D1–D3), segmentation (S1–S4) and linking (L1–L5), including thresholding (intensity, boundary, spatial and spatiotemporal statistics), peak localization, region growing, contour evolution, energy minimization, machine learning (U-Net, R-CNN and HRNet variants), nearest neighbor (distance, overlap, motion analysis), label propagation, graph-based optimization (shortest path, minimum cost flow, probability, multiple hypothesis, decision tree), Siamese trackers and graph neural networks.]

Fig. 3 | Taxonomy of the strategies and techniques used by the challenge participants. a, A taxonomy of the cell segmentation and tracking strategies followed. b, Stratification of the detection, segmentation and linking techniques used by the benchmarked methods. The numbers in brackets are the number of submissions received for individual tasks and the number of submissions that followed a particular strategy. The numbers in the table indicate the number of submissions that use each technique.