of 13
View
Online
Export
Citation
CrossMark
RESEARCH ARTICLE
|
FEBRUARY 06 2024
Nanomechanical mass measurements through feature-
based time series clustering
Adam P
. Neumann
;
Alfredo Gomez
;
Alexander R. Nunn
;
John E. Sader
;
Michael L. Roukes
Rev
. Sci. Instrum.
95, 025001 (2024)
https://doi.org/10.1063/5.0176303
07 February 2024 20:44:22
Review of
Scientific Instruments
ARTICLE
pubs.aip.org/aip/rsi
Nanomechanical mass measurements through
feature-based time series clustering
Cite as: Rev. Sci. Instrum.
95
, 025001 (2024); doi: 10.1063/5.0176303
Submitted: 12 September 2023
Accepted: 9 December 2023
Published Online: 6 February 2024
Adam P. Neumann,
1
Alfredo Gomez,
1
Alexander R. Nunn,
1
John E. Sader,
2, a)
and Michael L. Roukes
1, a)
AFFILIATIONS
1
Department of Physics, California Institute of Technology, Pasadena, California 91125, USA
2
Graduate Aerospace Laboratories and Department of Applied Physics, California Institute of Technology, Pasadena,
California 91125, USA
a)
Authors to whom correspondence should be addressed:
jsader@caltech.edu and roukes@caltech.edu
ABSTRACT
Recent years have seen explosive growth in miniaturized sensors that can continuously monitor a wide variety of processes, with applications
in healthcare, manufacturing, and environmental sensing. The time series generated by these sensors often involves abrupt jumps in the
detected signal. One such application uses nanoelectromechanical systems (NEMS) for mass spectrometry, where analyte adsorption produces
a quick but finite-time jump in the resonance frequencies of the sensor eigenmodes. This finite-time response can lead to ambiguity in the
detection of adsorption events, particularly in high event-rate mass adsorption. Here, we develop a computational algorithm that robustly
eliminates this often-encountered ambiguity. A moving-window statistical test together with a feature-based clustering algorithm is proposed
to automate the identification of single-event jumps. We validate the method using numerical simulations and demonstrate its application
in practice using time-series data that are experimentally generated by molecules adsorbing onto NEMS sensors at a high event rate. This
computational algorithm enables new applications, including high-throughput, single-molecule proteomics.
Published under an exclusive license by AIP Publishing.
https://doi.org/10.1063/5.0176303
I. INTRODUCTION
In nanoelectromechanical systems (NEMS)-based mass spec-
trometry (MS), individual molecules adsorbing to a mechanical
resonator induce a change in the resonance frequencies of the device.
The fractional eigenfrequency shifts,
Δ
f
n
, of mode
n
=
1, 2,
...
, are
related to the adsorbed mass,
m
, and its position,
x
, by
1–5
Δ
f
n
=
m
2
M
device
Φ
2
n
(
x
)
,
(1)
where
M
device
is the device mass and
Φ
n
(
x
)
is the corresponding
eigenmode. The masses of separate analytes can then be mea-
sured experimentally by detecting individual adsorption events and
quantifying the associated shifts in the resonant frequencies of the
device.
The accurate measurement of frequency shifts due to indi-
vidual adsorption events is complicated by several processes. In
practice, the resonance frequency of each eigenmode of the device is
tracked using a phase-locked loop (PLL). This is a control loop that
maintains stability with respect to high-frequency noise fluctuations
while simultaneously achieving a controlled response to large jumps
in frequency. A PLL optimized for these criteria creates finite-time
transients when an adsorption event occurs. When multiple analytes
adsorb within a narrow time interval, succeeding frequency shifts
interrupt the finite-time transients of prior events and create ambi-
guity in the frequency shift due to the adsorption of a single analyte.
Further complicating the problem, the multimodal time-series can
have a low signal to noise ratio, drift, and noise processes correlated
over long time intervals, which invalidates the assumptions of many
conventional statistical approaches.
Prior studies for detecting adsorption events in NEMS-MS
have fit the inter-sample difference of the multimode time-series
to a multivariate Gaussian distribution. Adsorption events are then
identified as outliers that deviate from the mean by a user-defined
threshold.
3,6
The disadvantage of this approach is that it fails to
differentiate between frequency shifts due to single or multiple
adsorption events.
7
As we shall demonstrate, this can significantly
distort the mass distribution predicted by NEMS-MS.
Rev. Sci. Instrum.
95
, 025001 (2024); doi: 10.1063/5.0176303
95
, 025001-1
Published under an exclusive license by AIP Publishing
07 February 2024 20:44:22
Review of
Scientific Instruments
ARTICLE
pubs.aip.org/aip/rsi
A robust detection methodology is required that can accu-
rately measure the change in resonance frequencies and distinguish
between single and multiple mass adsorption events occurring in
proximity. This is crucial since accurate frequency shift measure-
ments can only be taken from single-event jumps. Indeed, such
a scheme is a necessity for scaling NEMS-MS to high-throughput
applications where so called,
multi-event jumps
become frequent.
It is also highly desirable that this detection method performs
in near real time,
8
which would enable the analysis and selective
response to individual molecular adsorption events. This would
allow NEMS-MS to act as a pre-selector for further downstream
measurements.
Here, we present a modular, noise-driven approach for detect-
ing single adsorption events within noisy multimodal NEMS-MS
time-series data. The approach consists of three stages: jump detec-
tion, classification, and mean frequency shift measurement. In the
detection phase, the algorithm identifies abrupt shifts in the time
series. In the subsequent classification stage, time-domain features
of the events are extracted, and events are categorized as either
single-event jumps, caused by a single mass absorbent, or multi-
event jumps, caused by multiple mass absorbents with overlapping
transients. The final measurement phase retains only single-event
jumps and measures the change in mean resonance frequency before
and after the event. This multi-step process enables the robust mea-
surement of single-event frequency shifts and excludes ambiguous
multi-event jumps.
The study is organized as follows: In Sec. II, the method-
ology and theory of our method are presented, and in Sec. III,
adsorption events are numerically simulated on a NEMS device to
produce a multimodal time-series with known adsorption times.
This is used to evaluate the performance of the method as the level
of noise increases. In Sec. IV, we demonstrate the application of
the method to experimental data obtained from a hybrid Orbitrap-
NEMS system, reported elsewhere.
7,9
The frequency-shift response
due to single macromolecule absorption events is extracted from
noisy multimodal time-series data, and the mono-disperse mass dis-
tribution is recovered. A discussion of results is given in Sec. V, with
the current Python implementation of the method provided in the
Appendix.
II. METHODOLOGY
In this section, we outline the proposed methodology for pro-
cessing NEMS-MS time-series data. Consisting of jump identifica-
tion, single-event classification, and output phases, the approach is
modular, allowing practitioners to interchange methods at each step
as needed. A diagram of the approach is shown in Fig. 1.
To detect shifts in noisy time-series data during the detection
phase, we utilize methods derived from change point analysis.
10–12
These methods can be broadly categorized as either exact or approx-
imate, with the former using the entire time-series (offline) and the
latter providing an estimate based on the observed history (online).
To analyze the time-series data of our problem, which is embed-
ded in non-stationary and long-time correlated noise processes, we
employ a window-based approximate change point method. These
methods are robust regardless of the number or position of the fre-
quency shifts, making them well-suited for NEMS-MS analysis in
FIG. 1.
Event measurement scheme for multimode time series data. A sliding win-
dow approach with a built-in gap for the response time of the sensor is used to
calculate a two-sample statistic. Jumps are detected when the statistic exceeds
a statistical significance threshold. Features of the time series surrounding each
jump are then extracted, and jumps are filtered to single-event jumps using a
clustering algorithm based on similarity to the central tendency of the extracted
features. Finally, the jump magnitudes of the remaining jumps are measured and
reported.
real time. The statistical test used to identify shifts between rolling
sample windows is described in Sec. II A.
In the classification step, we employ techniques from
time-
series clustering
to group detected events based on similarity.
13,14
Time-series clustering methods can be broadly divided into raw
data-based, model-based, or feature-based methods. Raw data-based
methods include every sample of an event in the classification.
This approach has the advantage of making minimal assumptions
about the characteristics of events but performs poorly on datasets
with high noise and a small sample size. In contrast, model-based
approaches make strong assumptions about the characteristics of
events and fit each event to a model. However, since the charac-
teristics of events in NEMS-MS cannot be determined precisely, we
consider feature-based time-series clustering. By extracting several
key features related to the shape of the event peak in the moving
window statistic, such as the full-width at half maximum (FWHM),
this approach has the advantages of being robust to noise, signifi-
cantly reducing the dimensionality, and thereby complexity, of the
classification problem, and including minimal assumptions about
event behavior. We describe feature-based time-series clustering in
Sec. II B. By performing this clustering, we demonstrate that multi-
event jumps can be identified automatically by their deviations away
from the central single-event jump tendency of the dataset in feature
space.
We conclude by describing a consistent approach for mea-
suring the mean frequency shift due to a single-event jump in
Sec. II C.
Rev. Sci. Instrum.
95
, 025001 (2024); doi: 10.1063/5.0176303
95
, 025001-2
Published under an exclusive license by AIP Publishing
07 February 2024 20:44:22
Review of
Scientific Instruments
ARTICLE
pubs.aip.org/aip/rsi
A. Step 1—Identify all frequency jumps
Consider two samples,
X
and
Y
, representing the multimode
frequency time series before and after each event. Samples
X
and
Y
are taken using a simple moving window approach with a time gap,
t
jump
, between the samples to exclude the jump transient. For NEMS
devices,
t
jump
can be initially estimated from the PLL time chosen
for the experiment, but as will be shown later, it can be fine-tuned
after a relatively insensitive initial guess. Previous applications of
jump detection in NEMS-MS utilized the student’s
t
-test to automat-
ically find frequency jumps.
7
The embodiment in this article applies
Hotelling’s two-sample
t
2
statistic, a multivariate generalization of
the
t
statistic,
15
which is given as
t
2
=
n
x
n
y
n
x
+
n
y
(
X
Y
)
T
Σ
1
(
X
Y
)
,
(2)
where
n
x
and
n
y
represent the number of degrees of freedom in
X
and
Y
,
X
and
Y
represent the vector of the column mean for
X
and
Y
,
and
Σ
is the pooled covariance matrix.
16
In the case where
n
x
=
n
y
,
Σ
is the mean of
Σ
X
and
Σ
Y
(the covariance matrices for
X
and
Y
,
respectively). Equation (2) can be related to an
F
-distribution, which
has more widespread support in standard software packages, with a
simple multiplicative factor required for conversion,
17
n
x
+
n
y
N
1
N
(
n
x
+
n
y
2
)
t
2
F
(
N
,
n
x
+
n
y
N
1
)
,
(3)
where
N
represents the number of dimensions, in this case, modes.
If each data point is drawn independently from two indepen-
dent multivariate normal distributions,
n
x
and
n
y
also represent the
number of data points in
X
and
Y
. With NEMS experimental data,
however, there is a high degree of autocorrelation in the time series
due to the non-white nature of the noise as well as the dynamical
nature of the PLL, which incorporates an integrator, low-pass filter,
and feedback. For this reason, the value of the F-statistic calculated
in Eq. (3) with
n
x
and
n
y
set to the number of data points in
X
and
Y
is not comparable to that obtained from standard models and cannot
directly be converted to a confidence interval or
p
-value.
To relate the F-statistic calculated in Eq. (3) to a
p
-value,
we employed a non-parametric bootstrapping method known as
moving block bootstrapping
18–20
on an event-free (i.e., pure noise)
dataset. This is always possible to obtain with NEMS devices. If
such a dataset is not available, it might be possible to approximate
one by running through the jump detection framework with an ini-
tial pass, excluding data near all suspected jumps, and appending
the remaining time series. Briefly,
X
and
Y
are sampled randomly
with replacement, with the resulting F-statistics sorted and plot-
ted alongside the corresponding fraction of the collected samples.
The pure noise dataset acquired with the same device used to col-
lect the experimental data is provided along with the bootstrapping
results in Fig. 2. In addition, shown in Fig. 2 are contour plots in
relative frequency space corresponding to 1-, 2-, and 3-sigma sig-
nificance. Further details about the experimental data and device
are provided in Sec. IV. These contour plots offer an alternative
representation of the minimum frequency fluctuation detectable by
the device, which is more typically calculated using a single-mode
Allan deviation; this is a measure of frequency stability involving the
standard deviation of the differences of time-averaged fractional fre-
quencies.
21
Importantly, the gap between samples required for the
jump is automatically incorporated into this bootstrapping calcu-
lation, in contrast with the Allan deviation, which requires careful
modification of the standard formula. The window length (the size
of
X
and
Y
) can be fixed by choosing the value that minimizes the
volume of these frequency fluctuation contours in
N
-dimensional
space.
Once bootstrapping has been performed, jumps can be detected
when the F-statistic crosses above a threshold related to the
p
-value.
We chose
p
=
0.003, which is approximately equivalent to a 3-sigma
deviation. It should be emphasized that both the jump threshold and
window length are directly informed by our application and data,
which is not always the case and is typically cited as a weakness of
moving-window based change point detection.
Following event detection, an additional processing step called
filtering
is performed to eliminate some erroneous events. These fil-
tering steps are manually chosen to select events with clear outlier
behavior and include (1) eliminating jumps with too little detected
time spent above the threshold—these are jumps that are presumed
to be noise fluctuations, (2) removing identified events that are near
one another, and (3) jumps with a positive frequency shift. The latter
FIG. 2.
Noise characterization and significance bootstrapping. Fractional frequency noise is shown for experimental data acquired with no events. With these data, moving
block bootstrapping is used to empirically determine the significance of any given window sample based on its F-statistic, which is used to select a detection threshold.
Rev. Sci. Instrum.
95
, 025001 (2024); doi: 10.1063/5.0176303
95
, 025001-3
Published under an exclusive license by AIP Publishing
07 February 2024 20:44:22
Review of
Scientific Instruments
ARTICLE
pubs.aip.org/aip/rsi
is specific to our application of NEMS-based mass sensing, asso-
ciated with the constraint that we are only interested in detecting
physical events with positive added mass.
B. Step 2—Reduce to single-event jumps
With the observation that the F-statistic vs time exhibits dif-
ferent dynamics based on the presence (or not) of multiple jumps
in proximity, we seek to develop statistical features that reduce the
dimensionality in a manner that is robust to noise and not sen-
sitive to the knowledge of the exact jump time. In this way, we
aim to develop inputs for a feature-based clustering algorithm that
could automate the detection of single-event jumps. The general
approach considered here involves the expansion of the F-statistic vs
time,
F
(
t
)
, near each detected jump in terms of statistical moments,
treating it as a probability distribution. The first five moments,
normalized to give magnitude and width independence, are
F
0
=
t
1
t
0
F
(
t
)
dt
,
F
1
=
t
1
t
0
tF
(
t
)
dt
/
F
0
,
F
2
=
t
1
t
0
(
t
F
1
)
2
F
(
t
)
dt
/
F
0
,
F
3
=
t
1
t
0
(
t
F
1
)
3
F
(
t
)
dt
/(
F
0
F
2
3
/
2
)
,
F
4
=
t
1
t
0
(
t
F
1
)
4
F
(
t
)
dt
/
(
F
0
F
2
2
)
,
(4)
where
t
0
and
t
1
are the start and end of the detected jump (as
determined by the times it crosses above and below the detection
threshold), respectively, and the integrals are calculated numerically.
The first two features represent the time average of
F
(
t
)
relative
to the detected jump time and the standard deviation. The next
two features represent the skewness and kurtosis, i.e., the third
and fourth central moments normalized by the standard deviation
(to create new independent features), respectively. Finally, FWHM
represents the fifth feature. FWHM, as introduced in Sec. II, is
observed to be highly sensitive to the sharper peaks with nearly
instantaneous jumps—those with an event beginning before a prior
event ends. In the proposed method, we used the five quantities
F
1
t
jump
,
F
2
,
F
3
,
F
4
, and FWHM.
We now investigate the use of clustering algorithms on the
extracted statistical features to isolate single-event jumps in a pro-
cess. This takes advantage of the observed central tendency of the
single-event jump in a manner that minimizes human bias. To tar-
get this spatial relationship, we use DBSCAN (Density Based Spatial
Clustering of Applications with Noise), which employs a density-
based selection of clusters of arbitrary shapes with minimal domain
knowledge.
14
Moreover, DBSCAN is an unsupervised algorithm,
meaning that exact knowledge about the behavior of single- and
double-jump events is not a necessity, which is important when
analyzing experimental data. DBSCAN identifies both clusters and
noise (points not belonging to any cluster) for data using two para-
meters:
N
min
and
ε
, which denote the minimum number of points
required to form a cluster and neighborhood radius about each
point, respectively. The parameter
N
min
is typically set to twice the
number of dimensions—ten for five-dimensional data—while the
neighborhood radius,
ε
, is the freely adjustable parameter. To give
each feature equal weight when using Euclidean distance as the dis-
tance metric, features are normalized such that 50% of the data
would lie within the range
1 to
+
1.
The original formulation of DBSCAN is accompanied by a
heuristic for choosing a value for the neighborhood radius,
ε
, that
separates the noise from the clusters.
14
This involves estimating the
percent of noise and selecting a threshold near this estimate based
on the visual behavior of the data sorted according to the distance to
the
k
-th nearest point (where
k
is equal to
N
min
). In our application,
there is no clear opportunity for making a specific choice of
ε
based
on this heuristic; this might be because the noise can be arbitrarily
close to the data with no clear transition, and the low overall SNR for
some datasets. For this reason, we find that estimating the percent of
noise and using this choice to determine
ε
is the simplest and most
consistent method.
For our dataset, “noise” consists of (1) fluctuations misclassi-
fied as events (false positives) and (2) multiple-jump events. The
first type of noise depends on the choice of the F-statistic threshold
used to detect events and can be estimated by finding the number
of events detected on an event-free (pure noise) dataset. For NEMS
experiments, it is always possible to acquire these data separately.
Simulations in Sec. III A show this is a relatively small contribution.
The second type of noise is estimated for our application by assum-
ing the experimental events occur at a constant rate and follow a
Poisson process. The number of single-event jumps is estimated by
finding the event rate from the total number of detected events, then
calculating the probability that no events will occur within a jump
window
Δ
t
jump
and two measurement windows
t
meas
(to account
for the measurement before and after a given jump). This follows
from the Poisson process because events occur independently of
each other. Thus, estimating the probability of events within a time
period, 2
t
meas
+
Δ
t
jump
, is equivalent to estimating the probability of
just one event having occurred in that time period, conditional upon
a jump having already been detected with jump time,
t
jump
.
With observations on the simulated datasets, we adapt
DBSCAN to only consider the largest discovered cluster based
on the above noise estimations. Simulations show there is a
trade-off between accepting multiple-jump events and removing
low-magnitude events.
C. Step 3—Output frequency shifts of single-event
jumps
The optimal measurement of the jump magnitude,
Y
X
,
requires positioning the measurement windows as close as possi-
ble (to reduce the impact of noise and drift) while ensuring that
the jump transient is excluded. For datasets with no time-dependent
response, i.e., those with instantaneous jumps, the jump time is sim-
ply calculated as the peak in the moving-window statistic
10
described
in Sec. II A. For this application with a finite-time response, a cus-
tom peak-finding routine is developed to establish a precise and
consistent jump time, as follows: First, we calculate the peak of the
F-statistic,
F
peak
, near each jump, then locate the time in which the
F-statistic reaches half that value, working backward in time from
the end of the jump time (the time at which the F-statistic crosses
below the detection threshold). This algorithm consistently places
this jump time,
t
jump
, slightly to the right (i.e., ahead in time) of
Rev. Sci. Instrum.
95
, 025001 (2024); doi: 10.1063/5.0176303
95
, 025001-4
Published under an exclusive license by AIP Publishing
07 February 2024 20:44:22