Early warning of coalescing neutron-star and neutron-star-black-hole binaries from nonstationary noise background using neural networks

The success of multi-messenger astronomy relies on gravitational-wave observatories like LIGO and Virgo to provide prompt warning of merger events involving neutron stars (both binary neutron stars and neutron-star-black-hole binaries). This in turn depends critically on the low-frequency sensitivity of LIGO, as a typical binary neutron star stays in this band for minutes. However, the current sub-60 Hz sensitivity of LIGO has not yet reached its design target, and below 20 Hz the excess noise can exceed the design level by more than an order of magnitude. The sensitivity there is limited by nonlinearly coupled noises from auxiliary control loops, which are also nonstationary, posing challenges to realistic early-warning pipelines. Nevertheless, machine-learning-based neural networks provide ways to simultaneously improve the low-frequency sensitivity and mitigate its nonstationarity, and to detect the gravitational-wave signal in real time with a very short computational time. We propose to achieve this by inputting both the main gravitational-wave readout and key auxiliary witnesses to a compound neural network. Using simulated data with characteristics representative of the real LIGO detectors, our machine-learning-based neural networks can reduce nonlinearly coupled noise by about a factor of 5 and allow a typical binary neutron star (neutron-star-black-hole binary) to be detected 100 s (10 s) before the merger at a distance of 40 Mpc (160 Mpc). If one can further reduce the noise to the fundamental limit, our neural networks can achieve detection out to a distance of 80 Mpc and 240 Mpc for binary neutron stars and neutron-star-black-hole binaries, respectively. This demonstrates that utilizing machine-learning-based neural networks is a promising direction for the timely detection of the coalescence of electromagnetically bright LIGO/Virgo sources.


I. INTRODUCTION
The current generation of ground-based gravitational-wave interferometers [1-3] has firmly established a new way to observe our cosmos. Since the first detection of gravitational waves (GWs) from a binary black hole (BBH) merger [4], Advanced LIGO (aLIGO [1]) and Advanced Virgo (aVirgo [2]) have gone on to document dozens of gravitational-wave candidates [5,6] that have been confirmed and added to by the broader astrophysical community [7-14]. One of the most spectacular discoveries made by Advanced LIGO and Virgo is the first observed binary neutron star (BNS) coalescence, GW170817. GW170817 was jointly detected in low latency in gravitational waves [15] and by Fermi-GBM in gamma rays [16]. The subsequent discovery and follow-up of kilonova AT 2017gfo led to a concerted follow-up effort across the electromagnetic (EM) spectrum [17]. The resulting multi-messenger observations enabled an abundance of new science: constraints on the maximum NS mass [18], a better understanding of neutron star mode coupling and equation of state [19-21], as well as tests of general relativity [22].
* hangyu@caltech.edu
Despite the successes surrounding GW170817, there is still much to be learned about compact binary mergers containing at least one neutron star. In particular, there are various astrophysical processes that can generate precursor and/or early-stage signals that are yet to be detected. For example, tidal interactions might shatter the crusts of neutron stars and lead to short gamma-ray bursts [23]. The properties of the final merger product may be better revealed with prompt X-ray and optical observations [24]. In the radio band, precursor magnetosphere interactions might cause radio emissions [25,26] and could be a potential mechanism leading to fast radio bursts [27,28]. See, e.g., Ref. [29] for further discussions on potential early-warning signals as well as a summary of the follow-up capacity of various EM observatories.
To detect the prompt signatures of these processes, LIGO and Virgo would need to be able to identify the existence of a GW event and then determine its sky location in a timely manner. This is especially important for binaries where at least one component is a neutron star. The GW alert for GW170817 was not sent out until ∼ 40 minutes after the merger and the sky location was not released until another 4 hours later [17]; in principle, this information can be obtained minutes prior to the final merger as a typical BNS event will stay in the sensitivity band of LIGO and Virgo for minutes at their designed sensitivities.
There are presently four low-latency, matched-filter-based pipelines that produce near real-time gravitational-wave alerts for BNS and NSBH mergers: GstLAL [30,31], PyCBC [32], MBTA [33], and SPIIR [34,35]. Several of these pipelines have already developed analyses capable of early-warning detection [29,30,35]. See also Ref. [36] for a summary of the efforts carried out by the LIGO and Virgo collaborations during the second observing run to provide low-latency warnings. The prospect of pre-merger detection is ultimately limited by latencies surrounding data acquisition, handling, and analysis. Ref. [37] recently demonstrated that even at present latencies, the LIGO-Virgo collaboration is capable of identifying, localizing, and broadcasting GW candidates prior to merger.
Machine-learning (ML) based neural networks (NNs) offer yet another attractive alternative to achieve the early warning of BNSs/NSBHs. Instead of individually computing the overlap between a time series of GW readout and each waveform template from a large template bank, a trained NN would only need to do the computation once to predict the existence and the properties of the source. It can therefore serve as the first step for existing pipelines and further improve their computational efficiency.
Indeed, various authors have considered the possibility of detecting GW events using ML-based NNs. Refs. [38][39][40] showed that it is possible to input real-time GW readout and then use NNs to detect massive black hole binaries (BBHs) and later Refs. [41,42] considered the possibility of detecting BNSs with longer signal duration. Recently, Ref. [43] further considered detecting BNS events tens of seconds prior to the final merger.
However, almost all the analyses above assume a stationary Gaussian noise background, often at the designed sensitivity of aLIGO (one exception is Ref. [39], yet it focused on short BBH signals only 1 s long, for which the nonstationarity is less critical). While this is a decent approximation in the f > 100 Hz frequency band, at lower frequencies, which matter most for the early warning, the detector noise not only exceeds the designed level by orders of magnitude but also exhibits nonstationarity [10,44,45]. Therefore, it is crucial to take these features of realistic detector noise into account in order to design an NN that achieves early warning in practice.
Our work thus extends the field by considering the detection of GW events from a nonstationary noise background representative of realistic LIGO detectors. In addition to the main GW readout, we further show that in principle one can also input to the NN some key auxiliary channels witnessing the sources of contamination and thereby enhance the low-frequency sensitivity. As the contamination typically involves nonlinear and nonstationary coupling mechanisms, it cannot be mitigated by standard signal processing techniques that assume linear and stationary noise coupling. We demonstrate that with NNs involving nonlinear activations, one can nonetheless tackle the challenges of nonlinearity and nonstationarity and achieve noise mitigation and signal detection simultaneously, both in real time.
The rest of the paper is organized as follows. In Sec. II we briefly overview the LIGO sensitivity during its third observing run (O3) and discuss the main source of contamination in the low-frequency band of interest. In Sec. III we then describe the properties of the GW signal. This is followed by Sec. IV, in which we provide the details of the construction and training of our early-warning NN. Specifically, we describe the preparation of our training datasets in Sec. IV A and then, in Secs. IV B-IV D, the procedures we adopt for the network training. The performance of our NN is examined in Sec. V. Lastly, we conclude and discuss our results in Sec. VI.

II. OVERVIEW OF LIGO SENSITIVITY
While LIGO has achieved great success, its sensitivity can still be improved, as we demonstrate in Fig. 1. Here the orange trace is the representative sensitivity at the LIGO Hanford observatory during the third observing run (O3) [45] and the red trace is its fundamental limit set by quantum and thermal fluctuations at the O3 configuration (which closely matches the designed sensitivity of aLIGO [1]). While the two traces overlap at f ≳ 100 Hz, at lower frequencies the excess noise can be significant. At 30 Hz (20 Hz), the fundamental limit is a factor of 3 (10) below the current sensitivity, indicating substantial room for improvement. Opening up the sensitivity in the low-frequency band can be especially rewarding for multi-messenger astronomy and astrophysics, as it allows a coalescing BNS (whose strain we show in the purple trace) to be detected at a lower frequency and hence at a much earlier time prior to the merger; see the discussion in Sec. III.
A major source of contamination to the current low-frequency sensitivity is the control noises of auxiliary degrees of freedom [45] (see also, e.g., Refs. [44,46]). For instance, while it is necessary to engage an active angular control system to maintain the alignment of test masses at below a few Hz during LIGO's observation, the system also inevitably feeds back the sensing noise in the 10-30 Hz band and causes excess angular perturbation θ(t). The angular perturbation further couples with the off-pivot beam spot motion and leads to a longitudinal displacement that contaminates the GW readout as

$$\delta x(t) = x_{\rm spot}(t)\,\theta(t) = \left[x_{\rm spot}^{\rm (DC)} + x_{\rm spot}^{\rm (AC)}(t)\right]\theta(t). \tag{1}$$

Such a contamination can be mitigated by both online feed-forward cancellation and offline signal regressions (see, e.g., Ref. [47]). However, standard signal processing techniques [such as computing the Wiener filter from θ(t) to δx(t)] assume the coupling is linear and stationary and therefore can only remove the constant coupling part ∝ x_spot^(DC) but not the fluctuating piece ∝ x_spot^(AC)(t). In fact, it is exactly due to the temporal variability of couplings like x_spot^(AC)(t) that the current LIGO noise background at low frequencies is nonstationary [44,45], dramatically complicating the data analysis process [10].

FIG. 1. The orange trace is the representative sensitivity of LIGO Hanford during O3 [45] and the red trace its fundamental limit. We simulate noise according to the mechanism described in Eq. (1) and a typical realization is shown in the blue trace. Note that it is in fact nonstationary and its spectrum can vary within the blue-shaded band. Because of the nonlinear nature of the noise coupling, a linear, coherence-based subtraction cannot mitigate the noise, as shown in the dashed-brown trace. As a reference, we also show the strain of a typical coalescing BNS event in the purple trace. The event can be detected at a much lower frequency (hence a much earlier time) if the contamination at low frequencies can be mitigated.
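The limitation of a stationary linear fit can be illustrated with a toy calculation. The sketch below assumes the bilinear coupling δx(t) = x_spot(t) θ(t), a white Gaussian θ(t), and a single 0.25 Hz sinusoid for the AC spot motion; the sample rate, amplitudes, and variable names are all illustrative choices rather than values from the simulation described in this paper.

```python
import math
import random

random.seed(0)
fs = 64                      # toy sample rate [Hz]
n = fs * 64                  # 64 s of data

# Fast angular noise theta(t): white Gaussian stand-in for the sensing noise.
theta = [random.gauss(0.0, 1.0) for _ in range(n)]

# Slow spot motion: constant (DC) offset plus a 0.25 Hz microseism-like (AC) term.
x_dc, x_ac = 0.5, 1.0
x_spot = [x_dc + x_ac * math.sin(2.0 * math.pi * 0.25 * i / fs) for i in range(n)]

# Bilinear coupling into the GW readout.
dx = [x * th for x, th in zip(x_spot, theta)]

def power(v):
    """Mean square of a sequence."""
    return sum(a * a for a in v) / len(v)

# Best stationary linear gain from theta to dx (zero-lag least squares).
gain = sum(d * th for d, th in zip(dx, theta)) / sum(th * th for th in theta)
resid_linear = [d - gain * th for d, th in zip(dx, theta)]

# An estimator that also sees x_spot(t) can form the product directly.
resid_product = [d - x * th for d, x, th in zip(dx, x_spot, theta)]

frac_linear = power(resid_linear) / power(dx)
frac_product = power(resid_product) / power(dx)
print(f"fitted gain {gain:.2f} vs x_dc {x_dc}")
print(f"residual power fraction: linear {frac_linear:.2f}, product {frac_product:.2f}")
```

In this toy setup the fitted gain recovers only the DC offset, so the linear residual retains the power carried by the AC term, whereas forming the product removes it entirely. A real witness set is noisy and band-limited, so the second case is an idealized bound rather than an expected performance.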
Furthermore, there are no direct witnesses for the spot position on the test masses, x_spot^(AC)(t), over the entire frequency band of interest. Instead, it has to be reconstructed from multiple sensors through complicated geometrical conversions as well as signal filtering and blending, with each step subject to its own calibration uncertainties.
Nonetheless, neural networks (NNs) using machine learning (ML) offer an attractive way to tackle this problem. Given sufficient auxiliary witness channels as input, a deep convolutional neural network (CNN) [48] would be able to figure out the correct, frequency-dependent combinations of the witnesses that reconstruct the contamination. Moreover, as each layer typically involves a nonlinear activation function, it would be able to capture nonlinear couplings like Eq. (1) that classical, linear signal processing techniques fail to capture (see also Refs. [49-52] for some recent efforts to mitigate nonlinear noises in the LIGO detectors). Furthermore, as an NN is trained directly on time series, it is especially suitable for real-time implementation and has the potential to be integrated into a low-latency detection pipeline.
To demonstrate this point, we simulate excess noise according to the mechanism described in Eq. (1) and combine it with the fundamental limit to form the blue trace in Fig. 1. The x_spot(t) and θ(t) as well as their witness channels are simulated with characteristics similar to those in realistic LIGO detectors, with one exception: we reduce the roll-off of θ(t) in the 25-80 Hz band so that the entire O3 sensitivity can be approximated by this mechanism (see Sec. IV A for more details). In reality, the noise in the 25-80 Hz band is dominated by other noise sources [45], which we ignore here for simplicity. Note that we have assumed that the constant coupling piece is already removed (i.e., x_spot^(DC) = 0), so linear subtraction cannot further mitigate the contamination. This is illustrated by the brown-dashed curve in Fig. 1, where we compute the multi-input-single-output coherence between all the auxiliary witnesses and the main GW channel and then subtract out the coherent component in the frequency domain. To further simulate the nonstationarity on timescales longer than the length of each realization (256 s), we allow the overall root-mean-square (RMS) of x_spot(t) to be a random variable. Thus the blue trace in Fig. 1 is just the amplitude spectral density (ASD) of a typical realization; the noises we simulate in fact have spectra that vary within the shaded blue region (see also Fig. 5).

III. GW SIGNAL
Having described the noise and how we may use ML techniques to mitigate it, we now turn to the detection of astrophysical GW events. Specifically, our goal is to detect a GW event minutes before the final merger and further classify its type (NS vs BH) to assist the EM follow-up strategies.
For the early warning purpose, we can approximate the waveform using only the leading-order quadrupole formula and write (with G = c = 1; see, e.g., [53])

$$h(t) = A\,\frac{M_c}{d}\left(\frac{5 M_c}{t_m}\right)^{1/4}\cos\Phi(t), \tag{2}$$

$$\Phi(t) = \Phi_c - 2\left(\frac{t_m}{5 M_c}\right)^{5/8}, \tag{3}$$

where t_m = t_c − t is the time to merger and t_c and Φ_c are the time and phase of coalescence. The time t_m is further related to the GW frequency f according to

$$t_m = \frac{5}{256}\,M_c^{-5/3}\,(\pi f)^{-8/3}. \tag{4}$$

In this work, we do not include the detailed antenna responses (which are encoded in the quantity A) nor the joint detection by multiple detectors. Instead, we set A = 1 and simply replace the distance to the source d in Eq. (2) by d_eff/√N_det, where d_eff ≃ 2.3d is the averaged effective distance [54,55] and N_det is the number of detectors observing. For the rest of the work, we will use N_det = 3 as the default value.
From the above equations we see that the waveform depends only on one intrinsic parameter of the source, the chirp mass M_c, defined as

$$M_c = \frac{(M_1 M_2)^{3/5}}{(M_1 + M_2)^{1/5}}, \tag{5}$$

with M_{1,2} the component masses. Therefore, we put a GW event into one of three categories according to its chirp mass.
We define the first category as events with 1 M_⊙ ≤ M_c < 1.8 M_⊙ and label such an event as a "BNS" event.
Note that a BNS with M_1 = M_2 = 2 M_⊙ (which is the mass of the heaviest NS observed to date [56]) will have M_c = 1.74 M_⊙. Therefore, we would expect that most astrophysical BNS events will fall into this category.
We also define a "BBH" category as sources with 4.5 M_⊙ ≤ M_c < 10 M_⊙. The lower boundary is inspired by noticing that a BBH with M_1 = M_2 = 5 M_⊙ would have M_c = 4.35 M_⊙. In principle, the upper boundary of M_c < 10 M_⊙ for this category is not necessary (or it should be set to a much greater value). We nonetheless set it to 10 M_⊙ for training simplicity. Moreover, more massive systems merge in only a few seconds or less [Eq. (4) and Fig. 2], and therefore they are not the main target of our study here.
Lastly, we refer to sources with 1.8 M_⊙ ≤ M_c < 4.5 M_⊙ as the "NSBH" category, as it covers events [...]. We nonetheless point out that this category may also contain a binary of NSs both more massive than 2 M_⊙ or a pair of light BHs both in the lower "mass gap" with M_{1,2} < 5 M_⊙. While it is possible to refine our knowledge of the source if we include dynamics at high post-Newtonian orders and/or potential tidal interactions, these effects are encoded at higher GW frequencies and are therefore beyond the scope of our work, which targets the early warning using only the low-frequency portion of the signal. Indeed, at the frequency range we are interested in here, the corrections we drop are only on the [...]. Nonetheless, we may imagine our work here would serve as a first step of a future, integrated early-warning pipeline: once an event is detected here, it can then trigger further analysis of the signal to refine its properties.
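The three chirp-mass ranges above can be summarized in a short sketch. The function names are illustrative, and the chirp mass uses the standard definition M_c = (M_1 M_2)^{3/5}/(M_1 + M_2)^{1/5}; the boundaries are the ones quoted in the text.

```python
def chirp_mass(m1, m2):
    """Chirp mass in the same units as m1 and m2 (here solar masses)."""
    return (m1 * m2) ** 0.6 / (m1 + m2) ** 0.2

def classify_by_chirp_mass(mc):
    """Map a chirp mass (solar masses) to the loose class labels of the text."""
    if 1.0 <= mc < 1.8:
        return "BNS"
    if 1.8 <= mc < 4.5:
        return "NSBH"
    if 4.5 <= mc < 10.0:
        return "BBH"
    return None  # outside the ranges considered here

for m1, m2 in [(1.4, 1.4), (2.0, 2.0), (8.0, 1.4), (5.0, 5.0), (10.0, 10.0)]:
    mc = chirp_mass(m1, m2)
    print(f"({m1}, {m2}) Msun: Mc = {mc:.2f} Msun, label = {classify_by_chirp_mass(mc)}")
```

With these boundaries an equal-mass pair of 5 M_⊙ objects (M_c ≈ 4.35 M_⊙) lands just below the "BBH" threshold, which illustrates why the labels are only loose proxies for the physical source type.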
In Fig. 2 we show the merger time t_m for binaries with different chirp masses M_c at three different GW frequencies. The two vertical, dotted lines indicate the boundaries between the three categories defined in our study. From the plot we see that if the event can be detected by 30 Hz, then for the three categories ("BNS", "NSBH", "BBH"), we should in principle be able to detect the signal [O(100), O(10), O(1)] s prior to the merger.
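The scaling shown in Fig. 2 can be reproduced with a few lines. This is a minimal sketch assuming the standard leading-order relation t_m = (5/256) M_c^{-5/3} (π f)^{-8/3} in G = c = 1 units; the constant `MSUN_S` (G M_⊙/c³ in seconds) and the function names are not from the paper.

```python
import math

MSUN_S = 4.925491e-6  # G * Msun / c^3 in seconds

def time_to_merger(f_gw, mc_msun):
    """Leading-order time to merger [s] at GW frequency f_gw [Hz]."""
    mc = mc_msun * MSUN_S
    return (5.0 / 256.0) * mc ** (-5.0 / 3.0) * (math.pi * f_gw) ** (-8.0 / 3.0)

def frequency_at(t_m, mc_msun):
    """Inverse relation: GW frequency [Hz] at time t_m [s] before merger."""
    mc = mc_msun * MSUN_S
    return (5.0 / (256.0 * t_m)) ** 0.375 * mc ** (-0.625) / math.pi

# Chirp masses of a 1.4+1.4, an 8+1.4 and a 5+5 Msun binary, respectively.
for mc in (1.2188, 2.722, 4.353):
    times = ", ".join(f"{time_to_merger(f, mc):7.1f} s at {f} Hz" for f in (15, 30, 40))
    print(f"Mc = {mc:5.3f} Msun: {times}")
```

Because t_m ∝ f^{-8/3}, lowering the detection frequency by a factor of two buys roughly a factor of six in warning time, which is why the low-frequency band matters so much for early warning.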
In reality, the situation may be more challenging because the current LIGO low-frequency sensitivity is orders of magnitude above its fundamental limit, as we have already seen in Fig. 1. We illustrate this point further in Fig. 3, where we show the cumulative signal-to-noise ratio (SNR) ρ for a BNS with M_1 = M_2 = 1.4 M_⊙ as a function of t_m (bottom x-axis) and f (top x-axis). Specifically, we define the cumulative SNR through

$$\rho^2(f' < f) = 4\int_0^{f} \frac{|\tilde{h}(f')|^2}{S_n(f')}\,df', \tag{7}$$

where h̃(f) = ∫ h(t) exp(i2πft) dt and S_n is the detector's power spectral density. In the plot, we further normalize the curves by the total SNR assuming the fundamental O3 sensitivity (the red trace in Fig. 1). As can be seen from Fig. 3, with the current O3 sensitivity (blue trace), to accumulate a normalized SNR of 0.2, we have to integrate the signal up to around 40 Hz, or t_m ≃ 20 s. Such a time window might not be sufficient, especially if one wants to catch potential precursor signals of BNS mergers, given various realistic delays in information communication and decision making. In contrast, if LIGO can reach its fundamental limit, one would only need to integrate to 15 Hz, which is 300 s prior to the merger. This demonstrates the great scientific reward of enhancing the low-frequency sensitivity, which we propose to achieve via ML-based nonlinear noise regression.
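How excess low-frequency noise delays the accumulation of SNR can be sketched numerically. The example below assumes the standard stationary-phase scaling |h̃(f)|² ∝ f^{-7/3} for an inspiral, a 10-128 Hz band, and two made-up noise models: a flat PSD and a hypothetical excess tuned so that the ASD is roughly 3 and 10 times the flat floor at 30 Hz and 20 Hz. Neither PSD is the LIGO noise curve.

```python
def cumulative_snr_fraction(f_up, psd, f_min=10.0, f_max=128.0, n_steps=20000):
    """rho(f' < f_up) / rho(f' < f_max) for |h(f)|^2 proportional to f^(-7/3)."""
    def weighted_integral(f_hi):
        # Midpoint rule for the integral of f^(-7/3) / S_n(f) from f_min to f_hi.
        df = (f_hi - f_min) / n_steps
        total = 0.0
        for k in range(n_steps):
            f = f_min + (k + 0.5) * df
            total += f ** (-7.0 / 3.0) / psd(f) * df
        return total
    return (weighted_integral(f_up) / weighted_integral(f_max)) ** 0.5

def flat_psd(f):
    """Toy frequency-independent PSD."""
    return 1.0

def toy_excess_psd(f):
    """Hypothetical low-frequency excess on top of the flat floor."""
    return 1.0 + (42.0 / f) ** 6

for f_up in (15.0, 20.0, 30.0, 40.0):
    a = cumulative_snr_fraction(f_up, flat_psd)
    b = cumulative_snr_fraction(f_up, toy_excess_psd)
    print(f"f < {f_up:4.0f} Hz: flat {a:.2f}, with excess {b:.2f}")
```

Each curve here is normalized by its own total, unlike Fig. 3, which normalizes by the total SNR at the fundamental limit. The qualitative message is the same: with a low-frequency excess, a given SNR fraction is reached only at a higher frequency and hence closer to merger.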

IV. NEURAL NETWORK
A cartoon illustrating the proposed NN structure is shown in Fig. 4. Here we input both the main GW readout and a few key auxiliary witness channels to simultaneously achieve noise mitigation and signal classification.
To assist the convergence of the network, we adopt a compound structure. We first use the network "CNN-noise" to perform noise reconstruction and then subtract its output from the noisy GW readout to form a cleaned strain signal. This is then fed to the network "CNN-class" to achieve signal detection and classification. Both sub-networks can first be trained individually and then combined to perform a global optimization. Note that the label for each class is put in quotation marks because here we only loosely define each class by its chirp mass, the information that is best constrained at the early inspiral stage. The source's properties can be better refined by follow-up analysis utilizing data at higher GW frequencies.

FIG. 4. A compound CNN we propose to use to detect GW events. In the "CNN-noise" model we first reconstruct the noise that limits the low-frequency sensitivity using auxiliary channels. We then subtract its output from the main GW readout and pass the residuals to the "CNN-class" model, which outputs the probability of the input time series belonging to each of the classes we defined. The two CNNs can first be trained individually and then combined and optimized globally.
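The data flow of the compound structure can be written down schematically. In the sketch below both sub-networks are replaced by hypothetical stand-ins (a bilinear product for "CNN-noise" and an untrained, uniform output for "CNN-class"); only the subtract-then-classify wiring reflects the description above.

```python
def compound_forward(strain, witnesses, noise_model, class_model):
    """Subtract the reconstructed noise, then classify the cleaned strain."""
    noise_estimate = noise_model(witnesses)
    cleaned = [s - n for s, n in zip(strain, noise_estimate)]
    return class_model(cleaned), cleaned

def toy_noise_model(witnesses):
    """Hypothetical stand-in for CNN-noise: an exact bilinear product."""
    theta, x_spot = witnesses
    return [x * th for x, th in zip(x_spot, theta)]

def toy_class_model(cleaned):
    """Hypothetical stand-in for CNN-class: untrained, equal class probabilities."""
    return [0.25, 0.25, 0.25, 0.25]

theta = [0.1, -0.2, 0.3, 0.0]
x_spot = [1.0, 0.5, -0.5, 2.0]
signal = [0.01, 0.02, 0.03, 0.04]
strain = [s + x * th for s, x, th in zip(signal, x_spot, theta)]

probs, cleaned = compound_forward(strain, (theta, x_spot), toy_noise_model, toy_class_model)
print(probs, cleaned)
```

Keeping the two stages as separate callables mirrors the training strategy: each can be fitted on its own and then the composition can be fine-tuned end to end.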

A. Data preparation
We describe in this section how we generate the data we use for training the CNN.
We generate the GW signal from Eq. (2) with the distance d replaced by d_eff/√N_det, where d_eff ≃ 2.3d and N_det = 3. For training the NN, it is not necessary to sample the masses following a specific astrophysical distribution. For the "BNS" class, we sample M_c from a normal distribution with a mean of 1.22 M_⊙ and a standard deviation of 0.3 M_⊙ and truncate the distribution at [1, 1.8) M_⊙, the predefined range of the class. For "NSBH" and "BBH", the masses are simply sampled from uniform distributions.
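One simple way to draw from the truncated normal distribution described above is rejection sampling. The sketch below is an illustration with a hypothetical function name; the paper does not state which sampling method was used.

```python
import random

def sample_bns_chirp_mass(rng, mean=1.22, sigma=0.3, lo=1.0, hi=1.8):
    """Draw a chirp mass [Msun] from a normal distribution truncated to [lo, hi)."""
    while True:
        mc = rng.gauss(mean, sigma)
        if lo <= mc < hi:
            return mc

rng = random.Random(42)
samples = [sample_bns_chirp_mass(rng) for _ in range(2000)]
sample_mean = sum(samples) / len(samples)
print(f"min {min(samples):.3f}, max {max(samples):.3f}, mean {sample_mean:.3f}")
```

Because the lower cut at 1 M_⊙ removes more probability than the upper cut at 1.8 M_⊙, the mean of the accepted samples sits somewhat above the 1.22 M_⊙ mean of the parent distribution.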
To achieve early warning, we do not use the entire waveform up to the merger, but truncate the high-frequency end of the waveform at a cutoff frequency f_cut. For the "BNS" class, we randomly sample f_cut between 24 Hz and 25 Hz. For a typical BNS event with M_1 = M_2 = 1.4 M_⊙ (M_c = 1.2 M_⊙), this corresponds to t_m(f_cut) between 97 s and 87 s. The starting frequency is chosen such that the integration time t_int of the signal is 256 s. Because more massive systems can evolve to higher frequencies for a given amount of integration time, we set f_cut to slightly higher values for the "NSBH" and the "BBH" classes; we sample f_cut from [28,32) Hz and [35,40) Hz, respectively. For an NSBH event with (M_1, M_2) = (8 M_⊙, 1.4 M_⊙) this leads to 17 s to 12 s of pre-merger warning time, and for a BBH with M_1 = M_2 = 5 M_⊙, it is 4 s to 3 s prior to the merger. The integration time t_int is the minimum of 256 s and t_m(10 Hz) − t_m(f_cut). In all cases, the phase at coalescence Φ_c is sampled randomly from [0, 2π).
Because we consider f_cut < 40 Hz, we only need to sample each waveform at a rate of 256 Hz. Such a relatively low sampling rate is the key to allowing us to integrate the signal for a duration as long as 256 s. In Table I we summarize the key parameters of the three signal classes we consider.
Once a waveform is generated, we then inject it into a 256-s-long noise background containing both a stationary component (the fundamental limit; red trace in Fig. 1) and an additional low-frequency contamination represented by the blue stripe in Fig. 1 (see the description shortly after). As one may imagine continuously passing the strain data to the CNN we trained here (specifically, "CNN-class" in Fig. 4), we align the signal so that it reaches f_cut at the end of the time series.
Together with the three signal classes, we also consider a "null" class containing only the nonstationary detector noise. The goal of CNN-class is then to output the probability of a 256-second data series belonging to each of the four classes.
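The integration window implied by a given f_cut can be computed directly. This is a sketch assuming the leading-order relation t_m = (5/256) M_c^{-5/3} (π f)^{-8/3}; the helper names, the constant `MSUN_S`, and the chirp-mass values (1.2188 M_⊙ for 1.4+1.4 M_⊙, 4.353 M_⊙ for 5+5 M_⊙) are illustrative.

```python
import math

MSUN_S = 4.925491e-6  # G * Msun / c^3 in seconds

def time_to_merger(f_gw, mc_msun):
    """Leading-order time to merger [s] at GW frequency f_gw [Hz]."""
    mc = mc_msun * MSUN_S
    return (5.0 / 256.0) * mc ** (-5.0 / 3.0) * (math.pi * f_gw) ** (-8.0 / 3.0)

def frequency_at(t_m, mc_msun):
    """GW frequency [Hz] at time t_m [s] before merger."""
    mc = mc_msun * MSUN_S
    return (5.0 / (256.0 * t_m)) ** 0.375 * mc ** (-0.625) / math.pi

def integration_window(f_cut, mc_msun, t_max=256.0, f_floor=10.0):
    """Return (t_int, f_start, warning_time) for a waveform truncated at f_cut."""
    warning_time = time_to_merger(f_cut, mc_msun)
    t_int = min(t_max, time_to_merger(f_floor, mc_msun) - warning_time)
    f_start = frequency_at(warning_time + t_int, mc_msun)
    return t_int, f_start, warning_time

bns = integration_window(24.0, 1.2188)
bbh = integration_window(35.0, 4.353)
for name, (t_int, f_start, warn) in (("BNS", bns), ("BBH", bbh)):
    print(f"{name}: t_int = {t_int:.0f} s from {f_start:.1f} Hz, warning {warn:.1f} s")
```

For the light system the 256 s cap sets the window, so the waveform starts well above 10 Hz; for the heavier one the signal spends less than 256 s between 10 Hz and f_cut, so the 10 Hz floor sets the window instead.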
To simulate the excess low-frequency contamination [blue stripe in Fig. 1 and Eq. (1)], we generate noises with similar characteristics as in realistic LIGO detectors.
Specifically, we simulate four independent time series of the fast (≳ 10 Hz) angular motion θ(t), corresponding to sensing noises in the four high-bandwidth angular feedback loops (for controlling pitch and yaw motions of the two arm cavities) [60]. Instead of using realistic spectral shapes for θ(t) as in the LIGO system, we design them so that the contamination has a spectral shape similar to the full O3 sensitivity [45]. In other words, we give θ(t) extra power in the > 25 Hz band and ignore other sources of contamination in this band. This does not affect the main results of our study, though, because we choose f_cut between 24 Hz and 25 Hz for the "BNS" class.
Meanwhile, we also simulate eight independent spot-position motions x_spot(t) for the four test masses and two angular degrees of freedom (pitch and yaw). Their motions are mostly induced by the microseismic motion and peak in the 0.1-0.3 Hz band. This is the main source of nonstationarity on timescales of 10 s. At longer timescales, the overall RMS value of x(t) drifts and shows seasonal dependence: during winter the microseismic motion is typically higher than in the summer. To simulate this, on top of a typical value RMS[x(t)] ≃ 0.3 mm, we additionally sample an overall scale factor uniformly from [0.7, 1.4] and apply it to x(t) for each realization.
In order to sense the true spot motions, we assume the information is contained in two sets of witness sensors. The first set of sensors probes the spot motion by exciting each mirror in angle and looking for length fluctuations at the excitation frequency. The angle-to-length conversion factor directly gives us the spot motion at each test mass [see Eq. (1)]. However, these sensors have very limited SNR and can only trace the long-term (≲ 0.1 Hz) drift of the spot motion. The other set of sensors are optical levers placed locally at each test mass. They sense the angular motion of each test mass relative to its local ground, which can then be converted to the spot motion using the cavity's geometry. They provide information in the ≳ 0.1 Hz band but are polluted by seismic and thermal drifts at lower frequencies and therefore are not coherent with the true spot motion at < 0.1 Hz. Consequently, we need two sensors (one dithering-based sensor and one optical lever) for the spot motion per test mass per direction. In total, we thus need 20 auxiliary witness channels [4 for θ(t) and 16 for x_spot(t)] to reconstruct the low-frequency contamination [61]. As with the main GW readout, all of the auxiliary channels are sampled at a rate of 256 Hz.
To reduce the complexity of the problem, we first train the two sub-models, "CNN-noise" and "CNN-class", individually, as we describe in Sec. IV B and Sec. IV C. After each sub-model has converged, we load their weights into the compound model as the initial condition and perform a global optimization (Sec. IV D).
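The per-realization scaling of the spot motion can be sketched as follows. This toy version represents the microseismic motion by a single sinusoid with a random frequency in the 0.1-0.3 Hz band, which is a deliberate simplification; only the 0.3 mm typical RMS and the uniform [0.7, 1.4] scale factor come from the text.

```python
import math
import random

def simulate_spot_motion(rng, duration=256.0, fs=256.0, typical_rms=0.3):
    """Toy spot motion [mm]: one microseism-band sinusoid with a per-realization RMS."""
    scale = rng.uniform(0.7, 1.4)            # long-term drift of the overall RMS
    f0 = rng.uniform(0.1, 0.3)               # microseismic peak frequency [Hz]
    phase = rng.uniform(0.0, 2.0 * math.pi)
    n = int(duration * fs)
    raw = [math.sin(2.0 * math.pi * f0 * i / fs + phase) for i in range(n)]
    raw_rms = math.sqrt(sum(v * v for v in raw) / n)
    target_rms = typical_rms * scale
    return [v * target_rms / raw_rms for v in raw], scale

def rms(v):
    """Root mean square of a sequence."""
    return math.sqrt(sum(a * a for a in v) / len(v))

rng = random.Random(7)
realizations = [simulate_spot_motion(rng) for _ in range(3)]
for x_spot, scale in realizations:
    print(f"scale {scale:.2f}, RMS {rms(x_spot):.3f} mm")
```

Since the contamination is the product of the spot motion and θ(t), each realization's noise level in the GW readout scales with this factor, which is what spreads the simulated spectra over a band rather than a single curve.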

B. Noise subtraction
Our first step is to construct an NN that mitigates the excess low-frequency contamination to the GW readout in real time. We will refer to this NN specifically as "CNN-noise". It takes the 20 auxiliary witness channels we simulate as the input and estimates their nonlinear contamination to the main GW readout as the output (see also Fig. 4). To achieve supervised learning, we use time series from the noisy GW readout as training targets for this step. Because there will be no GW signal present in the data for most of the observation time, the training in this step is performed on signal-free data series only. We also do not need to use the full 256 seconds of data for each training segment for noise mitigation because the contamination relies only on the instantaneous spot and angle [Eq. (1)]. The time series only needs to be long enough to capture the microseismic motion (with a characteristic period of ∼ 10 s), which is the main cause of fluctuations in the spot motion. Consequently, we use 64 seconds of data from 21 channels (20 auxiliary witnesses as the input and 1 noisy GW readout as the target) for each segment (i.e., "batch" in the ML literature), and train "CNN-noise" over 128 segments for each training epoch.
Moreover, for the convenience of the subsequent signal classification, we would like the "cleaned" GW readout to have a nearly white spectral shape. Therefore, we precondition the noisy GW readout before it is passed on for training. Since the current O3 detector noise is orders of magnitude greater than the fundamental noise limit (the ideal output of noise cleaning) at f ≲ 20 Hz, the preconditioning is done in an iterative way.
In the first iteration, we whiten the GW readout according to the fundamental O3 noise limit. The spectrum of the residual after noise subtraction is then used to design the preconditioning filter for whitening the GW readout in the next iteration. Because of the nonlinearity involved in the noise coupling, a CNN trained to estimate δx(t) from [θ(t), x_spot(t)] according to Eq. (1) does not apply for approximating L{δx(t)} from [L{θ(t)}, x_spot(t)], with L denoting a generic linear operator. Therefore, the weights in "CNN-noise" need to be updated once the preconditioning filter changes. Nevertheless, we find the residuals are similar for the first and second iterations, and therefore we do not iterate further.
The same preconditioning filter is also applied to the witness channels for θ(t). While this does not preserve the exact coupling as we argued above, we nonetheless find it helps the CNN to converge faster numerically.
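The statement that a linear operator L does not commute with the bilinear coupling can be checked on a toy example. The sketch below uses a one-second running mean as a stand-in for L (it is not the whitening filter used in this work) and compares L{x_spot θ} with x_spot L{θ} for a constant and for a slowly varying spot motion; all names and parameter values are illustrative.

```python
import math
import random

def boxcar(v, width):
    """Linear time-invariant filter: running mean over `width` samples."""
    return [sum(v[i - width + 1:i + 1]) / width for i in range(width - 1, len(v))]

def rel_mismatch(x_spot, theta, width):
    """Power of L{x*theta} - x*L{theta}, relative to the power of L{x*theta}."""
    filtered_product = boxcar([x * th for x, th in zip(x_spot, theta)], width)
    product_of_filtered = [x * f for x, f in zip(x_spot[width - 1:], boxcar(theta, width))]
    num = sum((a - b) ** 2 for a, b in zip(filtered_product, product_of_filtered))
    den = sum(a * a for a in filtered_product)
    return num / den

random.seed(1)
fs = 64                      # toy sample rate [Hz]
n = fs * 32                  # 32 s of data
theta = [random.gauss(0.0, 1.0) for _ in range(n)]
x_const = [1.0] * n
x_slow = [math.sin(2.0 * math.pi * 0.25 * i / fs) for i in range(n)]

mismatch_const = rel_mismatch(x_const, theta, fs)
mismatch_slow = rel_mismatch(x_slow, theta, fs)
print(f"constant spot: {mismatch_const:.2e}, varying spot: {mismatch_slow:.2f}")
```

The two expressions agree when the spot position is constant and disagree once it varies over the memory of the filter, which is why the network weights have to be retrained whenever the preconditioning filter changes.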
As for the witness channels for the spot motion, we only apply an overall calibration factor so that each channel's numerical values are of order unity. Specifically, we calibrate the dithering-based sensors to output spot motion in millimeters and the optical levers to output the low-frequency (< 1 Hz) angular motion [62] in microradians. Note that the overall RMS of each channel carries physical meaning [the coupling strength from θ(t) to δx(t)] and should not be normalized out. Similarly, for each GW readout we apply a fixed normalization constant.
Once the data are generated and preconditioned, we then pass them to "CNN-noise" to learn the nonlinear noise coupling from the auxiliary channels to the main GW readout. The best performing network structure is summarized in Table II.
We construct a custom loss function for the training. Specifically, we compute the loss as where S (resi) n is the power spectral density of the residual (i.e., target − prediction), and w is a weighting function defined as where S (trgt) n is the power spectral density of the target and C is an overall constant so that the initial loss is of order unity. Empirically, we set (f low , f high ) = (7.5 Hz, 75 Hz) and α = −0.5. In addition, we also sum a small contribution of the standard mean squared error (about 0.1 to the total loss) to the custom loss defined in Eq. 8 to avoid artificial offsets at DC due to numerical over-fitting.
We note that the loss function defined above aims to achieve a broad-band noise mitigation so that the results of "CNN-noise" can be applied for various purpose (signal detection, sky localization, etc.). The optimization for the specific purpose of this work (detecting and classifying BNS ∼ 100 s prior to the merger) is left for the final step where we combine CNN-noise and CNN-class to preform global training.
The resultant ASDs of CNN-noise are shown in Fig. 5. In the figure, each blue trace is the amplitude spectral density of a realization of the simulated O3 sensitivity. Similar to the real detector noise, it has a nonstationary nature as the RMS of x (AC) spot various with time (and different from realization to realization). The residual after removing the contamination predicted by CNN-noise using the 20 auxiliary channels is shown in the grey trace.
Overall, the contamination can be mitigated by a factor of ∼ 10, which is sufficient to reach the fundamental limit in the > 30 Hz band. At lower frequencies, f < 20 Hz, even the residual is still an order of magnitude or more above the fundamental limit and the it fluctuates as the spot motion RMS varies, indicating rooms for further improvements.
Note that each curve in Fig. 5 is the averaged ASD estimated using Welch's method over 256 second of data in total and 8 second per estimation segment. Therefore the fluctuations in Fig. 5 is due to the long-term variation of the RMS of the spot motion. We also show in Fig. 6 directly the time series to compare the original (blue) and the noise-subtracted (grey) series. In the simulated O3 data, the band-limited RMS in the [20,60] Hz varies on the timescale of 10 s as indicated by the envelopes of the time series. This is because the spot position on the test masses x (AC) spot moves due to the microseismic motion in the 0.1 − 0.3 Hz band. Such a modulation prohibits the removal of the noise using standard signal processing techniques (such as Wiener filter) assuming a stationary coupling. The "CNN-noise", nevertheless, successfully mitigates the 10-second-timescale nonstationariety in the time series.
Furthermore, as we show in Fig. 3, with the cleaned sensitivity represented by the grey traces in Figs. 5 and 6, we can accumulate more than 10% of the total SNR 100 seconds prior to the merger. This is sufficient for us to detect nearby BNS events like GW170817 (Sec. V). For future convenience, we also show the 5- and 95-percentiles of the residual at each frequency bin as the two brown traces in Fig. 5. We further define ρ as the SNR computed assuming a stationary noise background whose values are fixed at the 5-percentiles [i.e., using the lower brown trace for $\sqrt{S_n}$ in Eq. (7)]. We will use this as an estimate of the SNR of the signal at a given distance, though one should keep in mind that ρ will in general be greater than the true SNR of each injection.
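To make the role of the percentile traces concrete, here is a toy sketch of a discretized matched-filter SNR, ρ² = 4 Σ |h(f)|² / S_n(f) Δf, evaluated against two noise levels. We assume this standard expression is what Eq. (7) denotes. The signal amplitudes and noise values are made-up illustrative numbers, not taken from the simulation, and the function name is hypothetical.

```python
import math

def matched_filter_snr(h_abs, asd, df_hz):
    """Discretized SNR: sqrt(4 * df * sum(|h|^2 / S_n)), with S_n = asd**2."""
    return math.sqrt(4.0 * df_hz * sum((h / a) ** 2 for h, a in zip(h_abs, asd)))

# Hypothetical values for three frequency bins (arbitrary units).
h_abs = [3.0, 2.0, 1.0]         # |h(f)| of the signal
asd_p5 = [1.0, 1.0, 1.0]        # 5-percentile noise ASD (quietest case)
asd_typical = [2.0, 2.0, 2.0]   # a noisier realization of the residual

rho_est = matched_filter_snr(h_abs, asd_p5, df_hz=0.125)
rho_true = matched_filter_snr(h_abs, asd_typical, df_hz=0.125)
print(rho_est, rho_true)  # rho_est is larger because the assumed noise is lower
```

Because the SNR scales inversely with the noise amplitude, evaluating it on the 5-percentile trace gives an optimistic estimate whenever the actual residual sits above that trace.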

C. Signal detection and classification
Once we have trained the CNN-noise sub-network, we then inject GW signals onto the cleaned noise background and train CNN-class for signal detection and classification.
Examples of the input time series to CNN-class are shown in Fig. 6. Each is the sum of a GW signal at most 256 seconds long (or zero for a null event) and a 256-second residual noise background produced by subtracting the prediction of CNN-noise from the simulated O3 detector noise (i.e., it corresponds to the grey trace in Fig. 6).
The training target is the label of each sample: we use (0, 1, 2, 3) for ("Null", "BNS", "NSBH", "BBH"), respectively. We further convert the label into the one-hot representation, so that when we use CNN-class for prediction, the numerical value at each digit gives the probability of the input time series belonging to the corresponding signal class.
[Fig. 6 caption, beginning truncated] ... [20, 60] Hz band varies on a timescale of 10 s due to modulations caused by the microseismic motion. The grey traces are the GW readout after noise mitigation by CNN-noise and they are the inputs to CNN-class. The whitened GW signal contained in each realization is highlighted in the purple trace. From top to bottom, they correspond respectively to a typical "BNS", "NSBH", and "BBH". In all the cases we set $\rho(f < f_{\rm cut}) = 16$.
Table III shows the structure of the best-performing CNN-class we found empirically. It consists of 5 CNN layers, each followed by a pooling layer (average pooling for the first one and maximum pooling for the rest). While ReLU activation was used in previous studies, we find that ELU activation gives better convergence and therefore use it for all the CNN layers. We include 3 Dense layers with ELU activation afterwards, and lastly, the output is produced by a Dense layer with the Softmax activation. The sparse categorical crossentropy loss is used together with an Adamax optimizer.
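The output stage and loss described above can be illustrated with a small plain-Python sketch: a softmax turns the last layer's raw outputs into class probabilities, and the sparse categorical crossentropy is minus the log of the probability assigned to the true integer label. The raw outputs below are hypothetical numbers, and the helper functions are illustrative rather than any library's API.

```python
import math

CLASSES = ("Null", "BNS", "NSBH", "BBH")  # labels 0, 1, 2, 3 as in the text

def one_hot(label, n_classes=len(CLASSES)):
    """Integer label -> one-hot list, e.g. 1 -> [0, 1, 0, 0]."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]

def softmax(logits):
    """Numerically stable softmax: outputs are positive and sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(label, probs):
    """Loss for an integer label: minus log of the true-class probability."""
    return -math.log(probs[label])

logits = [0.1, 2.0, 0.3, -1.0]  # hypothetical raw outputs of the last Dense layer
probs = softmax(logits)
predicted = CLASSES[probs.index(max(probs))]
loss = sparse_categorical_crossentropy(1, probs)  # true class is "BNS"
print(predicted, one_hot(1), round(sum(probs), 6))
```

The sparse form takes the integer label directly and is numerically equal to the cross-entropy computed against the one-hot vector, so the two descriptions of the training target are equivalent.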
To help the convergence of the network, we utilize the "curriculum learning" approach [38,39]. That is, we first train CNN-class on very loud GW events with high SNR to guide the NN to an initial convergence. Then we gradually reduce the SNR of injected GW events in the training set to cover the more realistic SNR space of potential astrophysical events.
Specifically, in the first step, we sample GW events from $\rho(f < f_{\rm cut}^{\rm up}) \in [16, 40)$ with a probability $\propto [\rho(f < f_{\rm cut}^{\rm up})]^{-2}$, where $\rho(f < f_{\rm cut}^{\rm up})$ is the SNR computed using the 5-percentile noise residual (the lower brown trace in Fig. 5) and integrated to $f_{\rm cut}^{\rm up} = (25, 32, 40)$ Hz for ("BNS", "NSBH", "BBH"). Note that ρ will in general be greater than the true SNR of an injected event, both because $f_{\rm cut} < f_{\rm cut}^{\rm up}$ for each realization of the GW event and because the background noise is typically greater than the 5-percentile value. The training set includes ∼ 2,000 samples for each signal class, plus ∼ 2,000 samples for null events. An additional 64 samples per class are used for validation.
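One simple way to draw SNRs with a probability density proportional to ρ⁻² on [16, 40) is inverse-transform sampling, sketched below. This is an illustrative implementation under that assumption, not necessarily the procedure used to build the training set; the function name and the fixed seed are hypothetical choices.

```python
import random

def sample_snr_power_law(u, lo, hi):
    """Map u in [0, 1) to an SNR in [lo, hi) with density proportional to snr**-2.

    The CDF is F(x) = (1/lo - 1/x) / (1/lo - 1/hi); this function is its inverse.
    """
    return 1.0 / (1.0 / lo - u * (1.0 / lo - 1.0 / hi))

rng = random.Random(0)  # fixed seed, for reproducibility only
samples = [sample_snr_power_law(rng.random(), 16.0, 40.0) for _ in range(2000)]
median_snr = sample_snr_power_law(0.5, 16.0, 40.0)
print(min(samples), max(samples), median_snr)
```

With this density the median SNR is 2·16·40/(16 + 40) ≈ 22.9, so half of the first-step training events lie in the lower third of the [16, 40) range even though the step is meant to contain loud events.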
Once the first step converges (both the training and validation losses plateau), we reduce the SNR range to $\rho(f < f_{\rm cut}^{\rm up}) \in [10, 40)$ and $\rho(f < f_{\rm cut}^{\rm up}) \in [8, 28)$ in the second and third training steps, respectively. In each step, we use ∼ 8,000 samples per class. There is a trade-off: training the network to identify low-SNR events typically degrades its ability to classify null events (i.e., it increases the false alarm rate, or FAR). Consequently, we instead sample events uniformly in SNR in the second and third steps, and do not further lower the SNR of the training data.
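The three-step curriculum can be summarized as a small schedule, sketched here as a data structure with a placeholder training callback. The SNR ranges, sampling laws, and approximate sample counts are those quoted in the text; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CurriculumStep:
    snr_lo: float           # lower edge of the SNR range (inclusive)
    snr_hi: float           # upper edge of the SNR range (exclusive)
    sampling: str           # "power_law_-2" or "uniform"
    samples_per_class: int  # approximate training-set size per class

# Ranges, sampling laws, and sizes as quoted in the text.
CURRICULUM = (
    CurriculumStep(16.0, 40.0, "power_law_-2", 2000),
    CurriculumStep(10.0, 40.0, "uniform", 8000),
    CurriculumStep(8.0, 28.0, "uniform", 8000),
)

def run_curriculum(train_step, steps=CURRICULUM):
    """Call train_step(step) on each stage in order, loudest events first."""
    return [train_step(step) for step in steps]

# Placeholder trainer: a real one would generate injections and fit CNN-class,
# reusing the weights from the previous stage as the starting point.
log = run_curriculum(lambda s: f"[{s.snr_lo:g}, {s.snr_hi:g}) {s.sampling}")
print(log)
```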
As a comparison, we also construct a network with the same structure as CNN-class but train it on GW time series with a stationary noise background generated according to the fundamental O3 sensitivity (which is similar to the aLIGO design sensitivity for the f ≲ 40 Hz band of interest; red trace in Fig. 1). This reference network is trained with the same curriculum training steps as CNN-class.
[Figure caption, beginning truncated] ... The results are obtained using our compound NN (Fig. 4) with simulated O3 sensitivity. The blue trace is for a typical BNS event at $d_{\rm eff} = 40$ Mpc (see Figs. 8 and 9) and the orange trace is for an NSBH event at $d_{\rm eff} = 160$ Mpc (see Fig. 10).

D. Combined network
While it is sufficient to train CNN-noise and CNN-class individually as in Secs. IV B and IV C, we may further optimize the performance by combining the two networks and training them globally. This is because CNN-noise is trained to achieve a broadband noise reduction so that the residual detector noise could potentially serve as the input for pipelines of various purposes. By combining it with CNN-class, the noise subtraction is then optimized specifically for the early detection and classification of GW events.
To do so, we utilize the structure shown in Fig. 4 and load the network weights obtained from individual training as the initial condition for the compound network. We generate ∼ 10,000 samples for each class with the SNR $\rho(f < f_{\rm cut}^{\rm up})$ uniformly sampled from [8, 28). Each time series of the main GW channel is input to the compound network together with the 20 auxiliary channels to internally mitigate the detector noise. The training target, loss function, and optimizer are the same as described in Sec. IV C.
We find that the compound network achieves an enhanced performance compared to CNN-class alone (see the discussion in the following section). We therefore use the compound network as our final NN and examine its performance for BNS early warning.

V. RESULTS
We assess the performance of our NN by examining the receiver operating characteristic (ROC) curves, which we construct using the Scikit-learn package [63]. These can be obtained by varying the detection threshold on the predicted probability and computing both the true alarm rate (TAR) and FAR at the given threshold, as demonstrated in Fig. 7. More conveniently, we can directly consider TAR as a function of FAR for a particular source, as shown in Fig. 8. For the GW event, we consider BNSs with $M_1 = M_2 = 1.4\,M_\odot$ and $f_{\rm cut} = 25$ Hz and vary the sources' averaged effective distance from 20 Mpc to 100 Mpc (corresponding to traces of different colors). At each distance, we inject the signal onto 2,000 realizations of the detector noise. The solid traces are results using simulated O3 sensitivity as the noise background, with noise mitigation performed by inputting the auxiliary channels to the compound CNN (Fig. 4). As a comparison, we also show the performance of the reference network in the dotted traces. It has the same structure as CNN-class but the noise background for training and prediction is generated according to the stationary fundamental O3 sensitivity. Here the FAR is constructed from ∼ 20,000 realizations of detector noise ("null" events; corresponding to 2 months of data). Note that the rate is measured per 256-second data segment, and as a result, FAR = 0.01 corresponds to approximately 1 false alarm every 7.1 hr of detector data.
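The threshold scan behind an ROC curve can be sketched in a few lines of plain Python. This is a stand-in written for illustration, not the Scikit-learn routine used for the figures; the scores below are hypothetical predicted probabilities and the function name is an assumption.

```python
def roc_points(signal_scores, null_scores, thresholds):
    """Return (threshold, TAR, FAR) tuples.

    TAR: fraction of signal segments with score >= threshold.
    FAR: fraction of null segments with score >= threshold.
    """
    points = []
    for thr in thresholds:
        tar = sum(s >= thr for s in signal_scores) / len(signal_scores)
        far = sum(s >= thr for s in null_scores) / len(null_scores)
        points.append((thr, tar, far))
    return points

# Hypothetical predicted "BNS" probabilities for injections and for null segments.
signal_scores = [0.95, 0.80, 0.60, 0.30, 0.10]
null_scores = [0.70, 0.20, 0.10, 0.05, 0.01]

for thr, tar, far in roc_points(signal_scores, null_scores, [0.25, 0.5, 0.75]):
    print(f"threshold={thr:.2f}  TAR={tar:.1f}  FAR={far:.1f}")
```

Raising the threshold lowers both rates, which is the trade-off traced out by each curve. With only N null segments, the smallest nonzero FAR that can be resolved is 1/N.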
Alternatively, we can fix the FAR and examine how the TAR varies as a function of the averaged effective distance $d_{\rm eff}$. The result is shown in Fig. 9. The astrophysical source is still fixed to be BNSs with $M_1 = M_2 = 1.4\,M_\odot$ and $f_{\rm cut} = 25$ Hz, and the line styles have the same meaning as in Fig. 8. Different colors now represent different FAR thresholds [FAR = (0.1, 0.03, 0.01) corresponds to 1 false alarm every (0.7, 2.4, 7.1) hr]. We see that if we could mitigate the noise to a level comparable to the grey stripe in Fig. 5, then a GW170817-like event at $d \simeq 40$ Mpc can be detected 1.5 minutes prior to the merger with a decent chance. Because of the nonstationarity in the background noise, the matched-filter SNR is not a constant even for a fixed effective distance. If we nonetheless treat the noise PSD as being stationary and use the 5- and 95-percentiles of the cleaned spectra (i.e., the two brown traces in Fig. 5), we estimate the SNR to be between 12 (5-percentile) and 7.3 (95-percentile). On the other hand, if the noise background becomes truly stationary and reaches the aLIGO design sensitivity, then the early detection can be achieved out to $d_{\rm eff} \simeq 80$ Mpc. The corresponding matched-filter SNR is 12. The SNR required for detection in a stationary noise background is thus similar to the SNR calculated using the 5-percentile of the nonstationary background. This suggests that our final, global training (Sec. IV D) mitigates the nonstationarity further and improves the NN's performance relative to treating the noise subtraction and signal detection as two separate, independent problems.
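The conversion between a per-segment FAR and a false-alarm interval is simple arithmetic, reproduced in the short sketch below for the three thresholds quoted above. The 256 s segment length and the ∼ 20,000 null segments are from the text; the function and variable names are hypothetical.

```python
SEGMENT_S = 256.0  # length of each data segment in seconds (from the text)

def hours_per_false_alarm(far_per_segment, segment_s=SEGMENT_S):
    """Mean time between false alarms, in hours, for a per-segment FAR."""
    return segment_s / far_per_segment / 3600.0

for far in (0.1, 0.03, 0.01):
    print(f"FAR={far:g} -> one false alarm every {hours_per_false_alarm(far):.1f} hr")

# Total duration of ~20,000 null segments, in days (about 2 months).
null_days = 20000 * SEGMENT_S / 86400.0
```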
In addition to BNS mergers, NSBH mergers are another type of source for multi-messenger astronomy, and we assess the performance of our CNN for detecting them in Fig. 10. If we still use (TAR, FAR) = (0.4, 0.01) as the threshold for detection, we find that an NSBH can be detected 12 s before the merger at an averaged effective distance of $d_{\rm eff} \simeq 160$ Mpc using simulated O3 sensitivity with noise subtraction. The matched-filter SNR is estimated to be between 10.5 (5-percentile) and 7.0 (95-percentile).
Using the stationary, fundamental O3 sensitivity, we find the detection range to be around $d_{\rm eff} \sim 240$ Mpc. The corresponding SNR is 10.5, again similar to the 5-percentile value when the nonstationary noise is used, suggesting that the nonstationarity is largely removed by the internal noise cleaning. Interestingly, we note that at a given value of ρ, our NN typically performs better for detecting NSBHs than BNSs. Whereas the CNN's sensitivity starts to drop sharply at $\rho \simeq 15$ and essentially vanishes at $\rho \simeq 10$ for the "BNS" signal, for "NSBH" we still have a decent sensitivity at $\rho \simeq 10$. In part, this is because an NSBH has its signal "concentrated" in a shorter duration with a louder time-domain amplitude than a BNS and is therefore more easily recognized by an NN (similarly, Ref. [42] also found that an NN typically performs better for BBHs than BNSs with the same matched-filter SNR). In addition, we have chosen a higher upper cutoff frequency for NSBHs (32 Hz) than for BNSs (25 Hz), and after the cleaning by CNN-noise the PSD of the background noise fluctuates less between realizations at higher frequencies.
Another quantity of interest is the false classification rate (FCR). Specifically, if there is a BBH event (which typically does not have an EM counterpart) present in the GW readout, we want to know the probability of classifying it as a "BNS" or "NSBH" and falsely triggering subsequent EM followup observations. The result is shown in Fig. 11. The FCR is constructed from 5,000 "BBH" injections. The "BBH" events are sampled from a distribution $\propto [\rho(f < 40\,{\rm Hz})]^{-2}$ with $\rho(f < 40\,{\rm Hz}) \in [8, 40)$. By comparing the top panel of Fig. 11 with Fig. 8, we see that a false "BNS" trigger is much less likely to be caused by a true "BBH" event than by the detector noise. The "NSBH" class has slightly more false classifications from the "BBH" class, yet at FCR = 0.01, we still have a true classification rate (TCR) > 0.1 for $\rho(f < 32\,{\rm Hz}) > 7$.
Lastly, we point out that our compound CNN not only provides a potential way to achieve real-time noise mitigation and signal detection, but could also serve as an efficient first step for existing matched-filter-based pipelines. This is because the computationally expensive part is the training. Once the network is trained, the prediction time is typically only 100 ms for doing both noise mitigation and signal classification, or 30 ms for just performing signal classification, as shown in Fig. 12. Indeed, once a signal is detected and classified by the network, subsequent matched-filter analysis would only need to search over a small sub-bank selected by the CNN classification, potentially enhancing the efficiency of the existing pipelines.

VI. CONCLUSION AND DISCUSSION
We showed that it is possible to detect BNS (NSBH) signals in the real-time LIGO data series using an ML NN.
Achieving this requires improving the LIGO sensitivity in the < 60 Hz band, which is currently dominated by nonlinear cross-couplings from the auxiliary control loops and/or environmental perturbations. We demonstrated that one potential way to enhance the low-frequency sensitivity is to input the auxiliary channels together with the main GW readout to an NN and use it to simultaneously perform noise cleaning and signal detection.
[Fig. 12 caption, beginning truncated] ... On the left is the time "CNN-class" takes to classify a 256-second time series, and on the right is the time for the compound problem where we input both the GW readout and 20 auxiliary channels. Once the training is complete, the prediction takes only ∼ 30 ms for the NN to classify a time series; even including real-time noise subtraction, the computation time is still less than 100 ms in most cases.
With noise mitigation reaching the level shown in Figs. 5 and 6, we can detect BNS (NSBH) ∼ 100 s (10 s) prior to merger out to $d_{\rm eff} \simeq 40$ Mpc (160 Mpc) with TAR $\simeq 0.4$ and FAR = 0.01 (i.e., 1 false alarm every 7.1 hours). If we have a stationary, Gaussian noise background reaching the design sensitivity, the early warning can be achieved out to $d_{\rm eff} \simeq 80$ Mpc and 240 Mpc for BNS and NSBH, respectively. The corresponding matched-filter SNRs are 12 and 10 for typical BNSs and NSBHs, respectively. Moreover, we find that the threshold SNRs for the Gaussian noise background are similar to the SNRs estimated using the 5-percentile of the nonstationary noise (the bottom brown trace in Fig. 5). This indicates that our compound network structure (Fig. 4) largely mitigates complications due to a nonstationary background, and that the global training (Sec. IV D) enhances the NN's performance relative to treating the noise cleaning and signal detection as two separate problems.
We note that our current NN has not yet reached a sensitivity comparable to the existing low-latency pipelines. For example, Ref. [29] considered a similar early-warning problem using GstLAL and the aLIGO design sensitivity. According to the associated data release [64], the authors of Ref. [29] performed 1,446 BNS injections in total with distances from 80 Mpc to 100 Mpc, and they were able to detect 446 (or 31%) of them at an upper cutoff frequency of 29 Hz and a FAR = (30 days)$^{-1}$. From the dotted traces in Fig. 9, our NN can achieve a similar TAR = 0.3 only at FAR = (7.1 hours)$^{-1}$, a FAR that is about 100 times higher than the GstLAL result. While the difference in performance is partly due to the fact that we considered a lower cutoff frequency of 25 Hz, so that the integration time of the signal is 30 s shorter, it nonetheless indicates that the ML NN still has considerable room for future improvement.
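The two numbers in this comparison follow from simple arithmetic, reproduced in the sketch below using only the values quoted above. The variable names are hypothetical.

```python
detected, injected = 446, 1446           # GstLAL injections quoted from Ref. [29]
gstlal_efficiency = detected / injected  # fraction of injections recovered

nn_far_interval_hr = 7.1                 # NN: one false alarm every 7.1 hours
gstlal_far_interval_hr = 30 * 24.0       # GstLAL: one false alarm every 30 days
far_ratio = gstlal_far_interval_hr / nn_far_interval_hr

print(f"efficiency = {gstlal_efficiency:.0%}, FAR ratio = {far_ratio:.0f}")
```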
Nevertheless, an ML-based NN has a few advantages over the existing pipelines that warrant further study. First of all, as multiple authors have pointed out (see, e.g., Refs. [38,39,42]), an NN is highly efficient in prediction. Indeed, as we showed in Fig. 12, it takes CNN-class only 30 ms to detect and classify a GW signal from a 256-s data segment. In comparison, the typical latency is about 6 s for GstLAL, indicating the possibility of accelerating the existing pipelines even further.
More importantly, we can input not only the strain readout but also auxiliary channels to the NN to enhance the detection of GW signals. Here we focused specifically on removing the excess and nonstationary contamination in the low-frequency band. In addition to helping the early warning of BNSs and NSBHs, mitigating the nonstationarity could also help to reduce the false triggers of heavy BBHs due to the drift of the background PSD [10]. An NN that also takes auxiliary witnesses as inputs could further help with vetoing and/or mitigating glitches [65,66]. In principle, one can combine multiple noise-mitigation feedforwards and data-quality checks with a signal-detection routine into a single NN (potentially with a compound structure) that efficiently enhances LIGO's performance.
As a proof of concept, we used simulated data to mimic the O3 LIGO sensitivity, and our auxiliary witnesses are designed to emulate realistic channels in LIGO. There is also a public data release containing 3 hours of output from all the major LIGO auxiliary channels, available at [67]. We encourage interested readers to use the NN structures we proposed in this work, or original NN structures, to help further improve the LIGO sensitivity.