of 28
Tests of General Relativity with the Binary Black Hole Signals from the LIGO-Virgo Catalog
GWTC-1
The LIGO Scientific Collaboration and the Virgo Collaboration
The detection of gravitational waves by Advanced LIGO and Advanced Virgo provides an opportunity to test
general relativity in a regime that is inaccessible to traditional astronomical observations and laboratory tests.
We present four tests of the consistency of the data with binary black hole gravitational waveforms predicted by
general relativity. One test subtracts the best-fit waveform from the data and checks the consistency of the residual
with detector noise. The second test checks the consistency of the low- and high-frequency parts of the observed
signals. The third test checks that phenomenological deviations introduced in the waveform model (including
in the post-Newtonian coe
ffi
cients) are consistent with zero. The fourth test constrains modifications to the
propagation of gravitational waves due to a modified dispersion relation, including that from a massive graviton.
We present results both for individual events and also results obtained by combining together particularly strong
events from the first and second observing runs of Advanced LIGO and Advanced Virgo, as collected in the
catalog GWTC-1. We do not find any inconsistency of the data with the predictions of general relativity and
improve our previously presented combined constraints by factors of
1
.
1
to
2
.
4
. In particular, we bound the mass
of the graviton to be
m
g
5
.
0
×
10
23
eV
/
c
2
(90% credible level), an improvement of a factor of 1
.
5 over our
previously presented results. Additionally, we check that the four gravitational-wave events published for the first
time in GWTC-1 do not lead to stronger constraints on alternative polarizations than those published previously.
I. INTRODUCTION
Einstein’s theory of gravity, general relativity (GR), has
withstood a large number of experimental tests [
1
]. With the
advent of gravitational-wave (GW) astronomy and the obser-
vations by the Advanced LIGO [
2
] and Advanced Virgo [
3
]
detectors, a range of new tests of GR have become possible.
These include both weak field tests of the propagation of GWs,
as well as tests of the strong field regime of compact binary
sources. See [
4
8
] for previous applications of such tests to
GW data.
We report results from tests of GR on all the confident bi-
nary black hole GW events in the catalog GWTC-1 [
9
], i.e.,
those from the first and second observing runs of the advanced
generation of detectors. Besides all of the events previously
announced (GW150914, GW151012, GW151226, GW170104,
GW170608, and GW170814) [
5
7
,
10
13
], this includes the
four new GW events reported in [
14
] (GW170729, GW170809,
GW170818, and GW170823). We do not investigate any of
the marginal triggers in GWTC-1, which have a false-alarm
rate (FAR) greater than one per year. Table I displays a com-
plete list of the events we consider. Tests of GR on the binary
neutron star event GW170817 are described in [8].
The search results in [
14
] originate from two modeled
searches and one weakly modeled search [
5
,
11
,
14
,
15
]. The
modeled searches use templates based on GR to find candidate
events and to assess their significance. However, detection
by such searches does not in itself imply full compatibility
of the signal with GR [
16
,
17
]. The weakly modeled search
relies on coherence of signals between multiple detectors, as
expected for an astrophysical source. While it assumes that the
morphology of the signal resembles a chirp (whose frequency
increases with time), as expected for a compact binary coales-
cence, it does not assume that the detailed waveform shape
agrees with GR. A transient signal strongly deviating from GR
would likely be found by the weakly modeled search, even if
missed by the modeled searches. So far, however, all signif-
icant [
FAR
<
(1
yr
)
1
] transient signals found by the weakly
modeled search were also found by at least one of the modeled
searches [14].
At present, there are no complete theories of gravity other
than GR that are mathematically and physically viable and
provide well defined alternative predictions for the waveforms
arising from the coalescence of two black holes (if, indeed,
these theories even admit black holes). Thus, we cannot test
GR by direct comparison with other specific theories. Instead,
we can (i) check the consistency of the GR predictions with the
data and (ii) introduce
ad hoc
modifications in GR waveforms
to determine the degree to which the values of the deviation
parameters agree with GR. These methods are agnostic to any
particular choice of alternative theory. For the most part, our
results should therefore be interpreted as observational con-
straints on possible GW phenomenologies, independent of the
overall suitability or well-posedness of any specific alternative
to GR. These limits are useful in providing a quantitative indi-
cation of the degree to which the data is described by GR; they
may also be interpreted more specifically in the context of any
given alternative to produce constraints, if applicable.
In particular, with regard to the consistency of the GR pre-
dictions (i), we (a) look for residual power after subtracting
the best-fitting GR waveform from the data, and (b) evaluate
the consistency of the high and low frequency components of
the observed signal. With regard to deviations from GR (ii),
we separately introduce parametrizations for (a) the emitted
waveform, and (b) its propagation. The former could be viewed
as representing possible GR modifications in the strong-field
region close to the binary, while the latter would correspond
to weak-field modifications away from the source. Although
we consider these independently, modifications to GW prop-
agation would most likely be accompanied by modifications
to GW generation in any given extension of GR. We have also
checked that none of the events discussed here provide stronger
constraints on models with purely vector and purely scalar GW
polarizations than those previously published in [
7
,
8
]. Our
analyses do not reveal any inconsistency of the data with the
arXiv:1903.04467v2 [gr-qc] 28 Mar 2019
2
predictions of GR.
Limits on deviations from GR for individual events are dom-
inated by statistical errors due to detector noise. These errors
can be reduced by appropriately combining results from mul-
tiple events. Sources of systematic errors, on the other hand,
include uncertainties in the detector calibration and power spec-
tral density (PSD) estimation and errors in the modeling of
waveforms in GR. Detector calibration uncertainties are mod-
eled as corrections to the measured detector response function
and are marginalized over. Studies on the e
ff
ect of PSD un-
certainties are currently ongoing. A full characterization of
the systematic errors due to the GR waveform models that we
employ is beyond the scope of this study; some investigations
can be found in [18–22].
This paper is organized as follows. Section II provides an
overview of the data sets employed here, while Sec. III details
which GW events are used to produce the individual and com-
bined results presented in this paper. In Sec. IV we explain the
gravitational waveforms and data analysis formalisms which
our tests of GR are based on, before we present the results in
the following sections. Section V contains two signal consis-
tency tests: the residuals test in V A and the inspiral-merger-
ringdown consistency test in V B. Results from parameterized
tests are given in Sec. VI for GW generation, and in Sec. VII
for GW propagation. We briefly discuss the study of GW polar-
izations in Sec. VIII. Finally, we conclude in Sec. IX. We give
results for individual events and some checks on waveform
systematics in the Appendix.
The results of each test and associated data products can be
found in Ref. [
23
]. The GW strain data for all the events con-
sidered are available at the Gravitational Wave Open Science
Center [24].
II. DATA, CALIBRATION AND CLEANING
The first observing run of Advanced LIGO (O1) lasted from
September 12
th
, 2015 to January 19
th
, 2016. The second ob-
serving run (O2) lasted from November 30
th
, 2016 to August
25
th
, 2017, with the Advanced Virgo observatory joining on
August 1
st
, 2017. This paper includes all GW events originat-
ing from the coalescence of two black holes found in these two
data sets and published in [5, 14].
The GW detector’s response to changes in the di
ff
erential
arm length (the interferometer’s degree of freedom most sensi-
tive to GWs) must be calibrated using independent, accurate,
absolute references. The LIGO detectors use photon recoil (ra-
diation pressure) from auxiliary laser systems to induce mirror
motions that change the arm cavity lengths, allowing a direct
measure of the detector response [
25
27
]. Calibration of Virgo
relies on measurements of Michelson interference fringes as the
main optics swing freely, using the primary laser wavelength
as a fiducial length. Subsequent measurements propagate the
calibration to arrive at the final detector response [
28
,
29
].
These complex-valued, frequency-dependent measurements
of the LIGO and Virgo detectors’ response yield the uncer-
tainty in their respective estimated amplitude and phase of
the GW strain output. The amplitude and phase correction
factors are modeled as cubic splines and marginalized over in
the estimation of astrophysical source parameters [
14
,
30
32
].
Additionally, the uncertainty in the time stamping of Virgo
data (much larger than the LIGO timing uncertainty, which is
included in the phase correction factor) is also accounted for
in the analysis.
Post-processing techniques to subtract noise contributions
and frequency lines from the data around gravitational-wave
events were developed in O2 and introduced in [
7
,
13
,
33
],
for the astrophysical parameter estimation of GW170608,
GW170814, and GW170817. This noise subtraction was
achieved using optimal Wiener filters to calculate coupling
transfer functions from auxiliary sensors [
34
]. A new, opti-
mized parallelizable method in the frequency domain [
35
] al-
lows large scale noise subtraction on LIGO data. All of the O2
analyses presented in this manuscript use the noise-subtracted
data set with the latest calibration available. The O1 data set is
the same used in previous publications, as the e
ff
ect of noise
subtraction is expected to be negligible. Reanalysis of the O1
events is motivated by improvements in the parameter estima-
tion pipeline, an improved frequency-dependent calibration,
and the availability of new waveform models.
III. EVENTS AND SIGNIFICANCE
We present results for all confident detections of binary
black hole events in GTWC-1 [
9
], i.e., all such events detected
during O1 and O2 with a FAR lower than one per year, as
published in [
14
]. The central columns of Table I list the FARs
of each event as evaluated by the three search pipelines used
in [
14
]. Two of these pipelines (
P
y
CBC
and
G
st
LAL
) rely on
waveform templates computed from binary black hole coales-
cences in GR. Making use of a measure of significance that
assumes the validity of GR could potentially lead to biases in
the selection of events to be tested, systematically disfavoring
signals in which a GR violation would be most evident. There-
fore, it is important to consider the possibilities that (1) there
were GW signals with such large deviations from GR that they
were missed entirely by the modeled searches, and (2) there
were events that were picked up by the modeled searches but
classified as marginal (and thus excluded from our analysis)
because of their significant deviations from GR.
These worries can largely be dispelled by considering the
third GW search pipeline, the coherent WaveBurst (
c
WB
)
weakly modeled search presented in [
14
]. This
c
WB
search [
15
,
36
,
37
] was tuned to detect chirping signals—like those that
would be expected from compact binary coalescences—but
was not tuned to any specific GR predictions.
1
c
WB
is most
sensitive to short signals from high-mass binary black holes.
It is still able to detect signals from lower-mass binaries (e.g.,
GW151226), though with reduced significance compared to
the modeled searches. Thus, a signal from a low-mass binary,
1
Chirping signals from compact binary coalescences are a feature of many
theories of gravity. All that is required is that the orbital frequency increases
as the binary radiates energy and angular momentum in GWs.
3
or a marginal event, with a significant departure from the GR
predictions (hence not detected by the GR modeled searches)
would not necessarily be detected by the
c
WB
search with a
FAR
<
(1
yr
)
1
. However, if there is a population of such sig-
nals, they will not all be weak and
/
or from low-mass binaries.
Thus, one would expect some of the signals in the population
to be detected by
c
WB
, even if they evade detection by the
modeled searches.
All signals detected by the
c
WB
search with
FAR
<
(1
yr
)
1
were also found by at least one modeled search with
FAR
<
(1
yr
)
1
. Given the above considerations, this is evidence that
our analysis does not exclude chirping GW signals that were
missed in the modeled searches because of drastic departures
from GR. Similarly, this is also evidence against the possi-
bility of marginal events representing a population of GR-
deviating signals, as none of them show high significance
[
FAR
<
(1
yr
)
1
] in the
c
WB
search only. Thus, we believe
that we have not biased our analysis by considering only the
ten events with FAR
<
(1 yr)
1
, as published in [14].
We consider each of the GW events individually, carrying
out di
ff
erent analyses on a case-by-case basis. Some of the
tests presented here, such as the inspiral-merger-ringdown
(IMR) consistency test in Sec. V B and the parameterized
tests in Sec. VI, distinguish between the inspiral and the post-
inspiral regimes of the signal. The separation between these
two regimes is performed in the frequency-domain, choosing a
particular cuto
ff
frequency determined by the parameters of the
event. Larger-mass systems merge at lower frequencies, pre-
senting a short inspiral signal in band; lower mass systems have
longer observable inspiral signals, but the detector’s sensitivity
decreases at higher frequencies and hence the post-inspiral
signal becomes less informative. Therefore, depending on the
total mass of the system, a particular signal might not provide
enough information within the sensitive frequency band of the
GW detectors for all analyses.
As a proxy for the amount of information that can be ex-
tracted from each part of the signal, we calculate the signal-to-
noise ratio (SNR) of the inspiral and the post-inspiral parts of
the signals separately. We only apply inspiral (post-inspiral)
tests if the inspiral (post-inspiral) SNR is greater than 6. Each
test uses a di
ff
erent inspiral-cuto
ff
frequency, and hence they
assign di
ff
erent SNRs to the two regimes (details provided in
the relevant section for each test). In Table I we indicate which
analyses have been performed on which event, based on this
frequency and the corresponding SNR.
2
In addition to the individual analysis of each event, we derive
combined constraints on departures from GR using multiple
signals simultaneously. Constraints from individual events are
largely dominated by statistical uncertainties due to detector
noise. Combining events together can reduce such statistical er-
rors on parameters that take consistent values across all events.
2
While we perform these tests on all events with SNR
>
6 in the appropriate
regime, in a few cases the results appear uninformative and the posterior
distribution extends across the entire prior considered. Since the results
are prior dependent, upper limits should not be set from these individual
analyses. See Sec. 3 of the Appendix for details.
However, it is impossible to make joint probabilistic statements
from multiple events without prior assumptions about the na-
ture of each observation and how it relates to others in the
set. This means that, although there are well-defined statistical
procedures for producing joint results, there is no unique way
of doing so.
In light of this, we adopt what we take to be the most straight-
forward strategy, although future studies may follow di
ff
erent
criteria. First, in combining events we assume that deviations
from GR are manifested equally across events, independent
of source properties. This is justified for studies of modified
GW propagation, since those e
ff
ects should not depend on the
source.
3
For other analyses, it is quite a strong assumption
to take all deviations from GR to be independent of source
properties. Such combined tests should not be expected to nec-
essarily reveal generic source-dependent deviations, although
they might if the departures from GR are large enough (see,
e.g., [
38
]). Future work may circumvent this issue by combin-
ing marginalized likelihood ratios (Bayes factors), instead of
posterior probability distributions [39].
Second, we choose to produce combined constraints only
from events that were found in both modeled searches
(P
y
CBC
[
40
42
] and
G
st
LAL
[
43
,
44
]) with a FAR of at
most one per one-thousand years. This ensures that there is
a very small probability of inclusion of a non-astrophysical
event. The events used for the combined results are indicated
with bold names in Table I. The events thus excluded from the
combined analysis have low SNR and would therefore con-
tribute only marginally to tightened constraints. Excluding
marginal events from our analyses amounts to assigning a null
a priori
probability to the possibility that those data contain
any information about the tests in question. This is, in a sense,
the most conservative choice.
In summary, we enforce two significance thresholds:
FAR
<
(1
yr
)
1
, for single-event analyses, and
FAR
<
(1000
yr
)
1
, for
combined results. This two-tiered setup allows us to produce
conservative joint results by including only the most significant
events, while also providing information about a broader (less
significant) set of triggers. This is intended to enable the inter-
ested reader to combine individual results with less stringent
criteria and under di
ff
erent statistical assumptions, according
to their specific needs and tolerance for false positives. In the
future, we may adapt our thresholds depending on the rate of
detections.
IV. PARAMETER INFERENCE
The starting point for all the analyses presented here are
waveform models that describe the GWs emitted by coalescing
black-hole binaries. The GW signature depends on the intrinsic
parameters describing the binary as well as the extrinsic param-
eters specifying the location and orientation of the source with
3
Propagation e
ff
ects do depend critically on source distance. However, this
dependence is factored out explicitly, in a way that allows for combining
events as we do here (see Sec. VII).
4
TABLE I. The GW events considered in this paper, separated by observing run. The first block of columns gives the names of the events and
lists some of their relevant properties obtained using GR waveforms (luminosity distance
D
L
, source frame total mass
M
tot
and final mass
M
f
,
and dimensionless final spin
a
f
). The next block of columns gives the significance, measured by the false-alarm-rate (FAR), with which each
event was detected by each of the three searches employed, as well as the matched filter signal-to-noise ratio from the stochastic sampling
analyses with GR waveforms. A dash indicates that an event was not identified by a search. The parameters and SNR values give the medians
and 90% credible intervals. All the events except for GW151226 and GW170729 are consistent with a binary of nonspinning black holes (when
analyzed assuming GR). See [
14
] for more details about all the events.
a
The last block of columns indicates which GR tests are performed on a
given event: RT
=
residuals test (Sec. V A); IMR
=
inspiral-merger-ringdown consistency test (Sec. V B); PI & PPI
=
parameterized tests of GW
generation for inspiral and post-inspiral phases (Sec. VI); MDR
=
modified GW dispersion relation (Sec. VII). The events with bold names are
used to obtain the combined results for each test.
Event
Properties
FAR
SNR
GR tests performed
D
L
M
tot
M
f
a
f
P
y
CBC
G
st
LAL
c
WB
RT IMR PI PPI MDR
[Mpc]
[
M
]
[
M
]
[yr
1
]
[yr
1
]
[yr
1
]
GW150914
b
430
+
150
170
66
.
2
+
3
.
7
3
.
3
63
.
1
+
3
.
3
3
.
0
0
.
69
+
0
.
05
0
.
04
<
1
.
5
×
10
5
<
1
.
0
×
10
7
<
1
.
6
×
10
4
25
.
3
+
0
.
1
0
.
2
33333
GW151012
b
1060
+
550
480
37
.
3
+
10
.
6
3
.
9
35
.
7
+
10
.
7
3
.
8
0
.
67
+
0
.
13
0
.
11
0
.
17
7
.
9
×
10
3
9
.
2
+
0
.
3
0
.
4
3
33
GW151226
b, c
440
+
180
190
21
.
5
+
6
.
2
1
.
5
20
.
5
+
6
.
4
1
.
5
0
.
74
+
0
.
07
0
.
05
<
1
.
7
×
10
5
<
1
.
0
×
10
7
0
.
02
12
.
4
+
0
.
2
0
.
3
3
3
3
GW170104
960
+
440
420
51
.
3
+
5
.
3
4
.
2
49
.
1
+
5
.
2
4
.
0
0
.
66
+
0
.
08
0
.
11
<
1
.
4
×
10
5
<
1
.
0
×
10
7
2
.
9
×
10
4
14
.
0
+
0
.
2
0
.
3
33333
GW170608
320
+
120
110
18
.
6
+
3
.
1
0
.
7
17
.
8
+
3
.
2
0
.
7
0
.
69
+
0
.
04
0
.
04
<
3
.
1
×
10
4
<
1
.
0
×
10
7
1
.
4
×
10
4
15
.
6
+
0
.
2
0
.
3
3
333
GW170729
d
2760
+
1380
1340
85
.
2
+
15
.
6
11
.
1
80
.
3
+
14
.
6
10
.
2
0
.
81
+
0
.
07
0
.
13
1
.
4
0
.
18
0
.
02
10
.
8
+
0
.
4
0
.
5
33
33
GW170809
990
+
320
380
59
.
2
+
5
.
4
3
.
9
56
.
4
+
5
.
2
3
.
7
0
.
70
+
0
.
08
0
.
09
1
.
4
×
10
4
<
1
.
0
×
10
7
12
.
7
+
0
.
2
0
.
3
33
33
GW170814
580
+
160
210
56
.
1
+
3
.
4
2
.
7
53
.
4
+
3
.
2
2
.
4
0
.
72
+
0
.
07
0
.
05
<
1
.
2
×
10
5
<
1
.
0
×
10
7
<
2
.
1
×
10
4
17
.
8
+
0
.
3
0
.
3
33333
GW170818
1020
+
430
360
62
.
5
+
5
.
1
4
.
0
59
.
8
+
4
.
8
3
.
8
0
.
67
+
0
.
07
0
.
08
4
.
2
×
10
5
11
.
9
+
0
.
3
0
.
4
33
33
GW170823
1850
+
840
840
68
.
9
+
9
.
9
7
.
1
65
.
6
+
9
.
4
6
.
6
0
.
71
+
0
.
08
0
.
10
<
3
.
3
×
10
5
<
1
.
0
×
10
7
2
.
1
×
10
3
12
.
1
+
0
.
2
0
.
3
33
33
a
The parameters given in this table di
ff
er slightly from those in v2 of the arXiv version of [
14
] since they correct for a small issue in the application of priors for
spin parameters.
b
The FARs for these events di
ff
er from those in [5] because the data were re-analyzed with the new pipeline statistics used in O2 (see [14] for more details).
c
At least one black hole has dimensionless spin
>
0
.
28 (99% credible level).
d
This event has a higher significance in the unmodeled search than in the modeled searches. Additionally, at least one black hole has dimensionless spin
>
0
.
27
(99% credible level).
respect to the detector network. The intrinsic parameters for
circularized black-hole binaries in GR are the two masses
m
i
of
the black holes and the two spin vectors
~
S
i
defining the rotation
of each black hole, where
i
∈{
1
,
2
}
labels the two black holes.
We assume that the binary has negligible orbital eccentricity,
as is expected to be the case when the binary enters the band of
ground-based detectors [
45
,
46
] (except in some more extreme
formation scenarios,
4
e.g., [
54
57
]). The extrinsic parameters
comprise four parameters that specify the space-time location
of the binary black-hole, namely the sky location (right ascen-
sion and declination), the luminosity distance, and the time of
coalescence. In addition, there are three extrinsic parameters
that determine the orientation of the binary with respect to
Earth, namely the inclination angle of the orbit with respect
to the observer, the polarization angle, and the orbital phase at
coalescence.
We employ two waveform families that model binary black
holes in GR: the e
ff
ective-one-body based
SEOBNR
v
4 [
18
]
waveform family that assumes non-precessing spins for the
black holes (we use the frequency domain reduced order
4
These scenarios could occur often enough, compared to the expected rate
of detections, that the inclusion of eccentricity in waveform models is a
necessity for tests of GR in future observing runs; see, e.g., [
47
53
] for
recent work on developing such waveform models.
model
SEOBNR
v
4
ROM
for reasons of computational e
ffi
-
ciency), and the phenomenological waveform family
IMRP
he
-
nom
P
v
2 [
19
,
58
,
59
] that models the e
ff
ects of precessing spins
using two e
ff
ective parameters by twisting up the underlying
aligned-spin model. We use
IMRP
henom
P
v
2 to obtain all the
main results given in this paper, and use
SEOBNR
v
4 to check
the robustness of these results, whenever possible. When we
use
IMRP
henom
P
v
2, we impose a prior
m
1
/
m
2
18 on the
mass ratio, as the waveform family is not calibrated against
numerical relativity simulations for
m
1
/
m
2
>
18. We do not
impose a similar prior when using
SEOBNR
v
4, since it in-
cludes information about the extreme mass ratio limit. Neither
of these waveform models includes the full spin dynamics
(which requires 6 spin parameters). Fully precessing waveform
models have been recently developed [
21
,
60
62
] and will be
used in future applications of these tests.
The waveform models used in this paper do not include the
e
ff
ects of subdominant (non-quadrupole) modes, which are
expected to be small for comparable-mass binaries [
63
,
64
].
The first generation of binary black hole waveform models
including spin and higher order modes has recently been de-
veloped [
62
,
65
67
]. Preliminary results in [
14
], using NR
simulations supplemented by NR surrogate waveforms, indi-
cate that the higher mode content of the GW signals detected
by Advanced LIGO and Virgo is weak enough that models
without the e
ff
ect of subdominant modes do not introduce sub-
5
stantial biases in the intrinsic parameters of the binary. For
unequal-mass binaries, the e
ff
ect of the non-quadrupole modes
is more pronounced [
68
], particularly when the binary’s ori-
entation is close to edge-on. In these cases, the presence of
non-quadrupole modes can show up as a deviation from GR
when using waveforms that only include the quadrupole modes,
as was shown in [
69
]. Applications of tests of GR with the new
waveform models that include non-quadrupole modes will be
carried out in the future.
We believe that our simplifying assumptions on the wave-
form models (zero eccentricity, simplified treatment of spins,
and neglect of subdominant modes) are justified by astrophysi-
cal considerations and previous studies. Indeed, as we show
in the remainder of the paper, the observed signals are consis-
tent with the waveform models. Of course, had our analyses
resulted in evidence for violations of GR, we would have had
to revisit these simplifications very carefully.
The tests described in this paper are performed within the
framework of Bayesian inference, by means of the
LALI
nfer
-
ence
code [
31
] in the LIGO Scientific Collaboration Algorithm
Library Suite (LALSuite) [
70
]. We estimate the PSD using the
B
ayes
W
ave
code [
71
,
72
], as described in Appendix B of [
14
].
Except for the residuals test described in Sec. V A, we use the
waveform models described in this section to estimate from the
data the posterior distributions of the parameters of the binary.
These include not only the intrinsic and extrinsic parameters
mentioned above, but also other parameters that describe pos-
sible departures from the GR predictions. Specifically, for the
parameterized tests in Secs. VI and VII, we modify the phase
Φ
(
f
) of the frequency-domain waveform
̃
h
(
f
)
=
A
(
f
)
e
i
Φ
(
f
)
.
(1)
For the GR parameters, we use the same prior distributions
as the main parameter estimation analysis described in [
14
],
though for a number of the tests we need to extend the ranges
of these priors to account for correlations with the non-GR
parameters, or for the fact that only a portion of the signal is
analyzed (as in Sec. V B). We also use the same low-frequency
cuto
ff
s for the likelihood integral as in [
14
], i.e., 20 Hz for all
events except for GW170608, where 30 Hz is used for LIGO
Hanford, as discussed in [
13
], and GW170818, where 16 Hz is
used for all three detectors. For the model agnostic residual test
described in Sec. V A, we use the
B
ayes
W
ave
code [
71
] which
describes the GW signals in terms of a number of Morlet-Gabor
wavelets.
V. CONSISTENCY TESTS
A. Residuals test
One way to evaluate the ability of GR to describe GW signals
is to subtract the best-fit template from the data and make
sure the residuals have the statistical properties expected of
instrumental noise. This largely model-independent test is
sensitive to a wide range of possible disagreements between
the data and our waveform models, including those caused
by deviations from GR and by modeling systematics. This
analysis can look for GR violations without relying on specific
parametrizations of the deviations, making it a versatile tool.
Results from a similar study on our first detection were already
presented in [4].
In order to establish whether the residuals agree with noise
(Gaussian or otherwise), we proceed as follows. For each event
in our set, we use
LALI
nference
and the
IMRP
henom
P
v
2 wave-
form family to obtain an estimate of the best-fit (i.e., maximum
likelihood) binary black-hole waveform based on GR.
5
This
waveform incorporates factors that account for uncertainty in
the detector calibration, as described in Sec. II. This best-fit
waveform is then subtracted from the data to obtain the resid-
uals. If the GR-based model provides a good description of
the signal, we expect the resulting residuals at each detector
to lack any significant coherent SNR beyond what is expected
from noise fluctuations. We compute the coherent SNR using
B
ayes
W
ave
, which models the multi-detector residuals as a
superposition of incoherent Gaussian noise and an elliptically-
polarized coherent signal. The residual network SNR reported
by
B
ayes
W
ave
is the SNR that would correspond to such a
coherent signal.
In particular, for each event,
B
ayes
W
ave
produces a distri-
bution of possible residual signals consistent with the data,
together with corresponding
a posteriori
probabilities. This
is trivially translated into a probability distribution over the
coherent residual SNR. We summarize each of these distri-
butions by computing the corresponding 90%-credible upper
limit (
SNR
90
). This produces one number per event that rep-
resents an upper bound on the coherent power that could be
present in its residuals.
We may translate the
SNR
90
into a measure of how well
the best-fit templates describe the signals in our data. We do
this through the fitting factor [
73
],
FF
B
SNR
GR
/
(
SNR
2
res
+
SNR
2
GR
)
1
/
2
, where
SNR
res
is the coherent residual SNR and
SNR
GR
is the network SNR of the best-fit template (see Table I
for network SNRs). By setting
SNR
res
=
SNR
90
, we produce a
90%-credible lower limit on the fitting factor (
FF
90
). Because
FF is itself a lower limit on the overlap between the true and
best-fit templates, so is
FF
90
. As in [
4
], we may then assert
that the disagreement between the true waveform and our GR-
based template is at most (1
FF
90
)
×
100%. This is interesting
as a measure of the sensitivity of our test, but does not tell us
about the consistency of the residuals with instrumental noise.
To assess whether the obtained residual
SNR
90
values are
consistent with detector noise, we run an identical
B
ayes
W
ave
analysis on 200 di
ff
erent sets of noise-only detector data near
each event. This allows us to estimate the
p
-value for the null
hypothesis that the residuals are consistent with noise. The
p
-
value gives the probability of noise producing coherent power
5
The best-fit waveform was obtained from preliminary parameter inference
results, so the residuals discussed here are slightly di
ff
erent than those that
would be obtained using the posterior samples in GWTC-1 [
9
]. However,
we have confirmed that the di
ff
erence in the best-fit waveforms is smaller
than the uncertainty due to statistical noise, with waveform overlaps always
greater than the fitting factor.
6
TABLE II. Results of the residuals analysis. For each event, this table
presents the 90%-credible upper limit on the reconstructed network
SNR after subtraction of the best-fit GR waveform (
SNR
90
), a cor-
responding lower limit on the fitting factor (
FF
90
in the text), and
the
SNR
90
p
-value.
SNR
90
is a measure of the maximum possible
coherent signal power not captured by the best-fit GR template, while
the
p
-value is an estimate of the probability that instrumental noise
produced such
SNR
90
or higher. We also indicate which interferome-
ters (IFOs) were used in the analysis of a given event, either the two
Advanced LIGO detectors (HL) or the two Advanced LIGO detectors
plus Advanced Virgo (HLV). See Sec. V A in the main text for details.
Event
IFOs Residual SNR
90
Fitting factor
p
-value
GW150914
HL
6.4
0
.
97
0.34
GW151012
HL
6.9
0
.
81
0.18
GW151226
HL
5.7
0
.
91
0.76
GW170104
HL
5.2
0
.
94
0.97
GW170608
HL
7.8
0
.
90
0.07
GW170729
HLV
6.5
0
.
87
0.72
GW170809
HLV
6.7
0
.
91
0.73
GW170814
HLV
8.6
0
.
90
0.19
GW170818
HLV
10.1
0
.
78
0.13
GW170823
HL
5.4
0
.
92
0.89
5
6
7
8
9
10
11
12
Residual SNR
90
0.1
1
Survival function
(
p
=
1
CDF
)
GW150914
GW151012
GW151226
GW170104
GW170608
GW170729
GW170809
GW170814
GW170818
GW170823
FIG. 1. Survival function (
p
=
1
CDF
) of the 90%-credible upper
limit on the network SNR (
SNR
90
) for each event (solid or dashed
curves), compared to the measured residual values (vertical dotted
lines). For each event, the value of the survival function at the mea-
sured
SNR
90
gives the
p
-value reported in Table II (markers). The
colored bands correspond to uncertainty regions for a Poisson pro-
cess and have half width
±
p
/
N
, with
N
the number of noise-only
instantiations that yielded SNR
n
90
greater than the abscissa value.
with
SNR
n
90
greater than or equal to the residual value
SNR
90
,
i.e.,
p
B
P
(
SNR
n
90
SNR
90
|
noise
). In that sense, a smaller
p
-value indicates a smaller chance that the residual power arose
from instrumental noise only. For each event, our estimate of
p
is produced from the fraction of noise instantiations that
yielded
SNR
n
90
SNR
90
(that is, from the empirical survival
function).
6
Our results are summarized in Table II. For each event,
we present the values of the residual
SNR
90
, the lower limit
on the fitting factor (
FF
90
), and the
SNR
90
p
-value. The
background distributions that resulted in those
p
-values are
shown in Fig. 1. In Fig. 1 we represent these distributions
through the empirical estimate of their survival functions, i.e.,
p
(
SNR
90
)
=
1
CDF
(
SNR
90
), with “CDF” the cumulative
distribution function. Fig. 1 also displays the actual value of
SNR
90
measured from the residuals of each event (dotted ver-
tical line). In each case, the height of the curve evaluated at
the
SNR
90
measured for the corresponding detection yields the
p
-value reported in Table II (markers in Fig. 1).
The values of residual
SNR
90
vary widely among events
because they depend on the specific state of the instruments
at the time of detection: segments of data with elevated noise
levels are more likely to result in spurious coherent residual
power, even if the signal agrees with GR. In particular, the
background distributions for events seen by three detectors are
qualitatively di
ff
erent from those seen by only two. This is both
due to (i) the fact that
B
ayes
W
ave
is configured to expect the
SNR to increase with the number of detectors and (ii) the fact
that Virgo data present a higher rate of non-Gaussianities than
LIGO. We have confirmed both these factors play a role by
studying the background
SNR
90
distributions for real data from
each possible pair of detectors, as well as distributions over
fabricated Gaussian noise. Specifically, removing Virgo from
the analysis results in a reduction in the coherent residual power
for GW170729 (
SNR
HL
90
=
6
.
0), GW170809 (
SNR
HL
90
=
6
.
3),
GW170814 (SNR
HL
90
=
5
.
9), and GW170818 (SNR
HL
90
=
6
.
6).
The event-by-event variation of
SNR
90
is also reflected in
the values of
FF
90
. GW150914 provides the strongest result
with
FF
90
=
0
.
97
, which corresponds to an upper limit of 3%
on the magnitude of potential deviations from our GR-based
template,
7
in the specific sense defined in [
4
] and discussed
above. On the other hand, GW170818 yields the weakest result
with
FF
90
=
0
.
78 and a corresponding upper limit on waveform
mismatch of 22%. The average FF
90
over all events is 0.89.
The set of
p
-values shown in Table II is consistent with all
coherent residual power being due to instrumental noise. As-
suming that this is indeed the case, we expect the
p
-values to
be uniformly distributed over [0
,
1], which explains the vari-
ation in Table II. With only ten events, however, it is di
ffi
cult
to obtain strong quantitative evidence of the uniformity of this
distribution. Nevertheless, we follow Fisher’s method [
74
] to
compute a meta
p
-value for the null hypothesis that the individ-
ual
p
-values in Table II are uniformly distributed. We obtain a
meta
p
-value of 0.4, implying that there is no evidence for co-
herent residual power that cannot be explained by noise alone.
6
Computing
p
-values would not be necessary if the noise was perfectly
Gaussian, in which case we could predict the noise-only distribution of
SNR
n
90
from first principles.
7
This value is better than the one quoted in [
4
] by 1 percentage point. The
small di
ff
erence is explained by several factors, including that paper’s use
of the maximum
a posteriori
waveform (instead of maximum likelihood)
and 95% (instead of 90%) credible intervals, as well as improvements in
data calibration.
7
TABLE III. Results from the inspiral-merger-ringdown consistency
test for selected binary black hole events.
f
c
denotes the cuto
ff
fre-
quency used to demarcate the division between the inspiral and post-
inspiral regimes;
ρ
IMR
,
ρ
insp
, and
ρ
post
insp
are the median values of
the SNR in the full signal, the inspiral part, and the post-inspiral part,
respectively; and the GR quantile denotes the fraction of the posterior
enclosed by the isoprobability contour that passes through the GR
value, with smaller values indicating better consistency with GR.
Event
f
c
[Hz]
ρ
IMR
ρ
insp
ρ
post
insp
GR quantile [%]
GW150914
132
25.3
19.4
16.1
55.5
GW170104
143
13.7
10.9
8.5
24.4
GW170729
91
10.7
8.6
6.9
10.4
GW170809
136
12.7
10.6
7.1
14.7
GW170814
161
16.8
15.3
7.2
7.8
GW170818
128
12.0
9.3
7.2
25.5
GW170823
102
11.9
7.9
8.5
80.4
All in all, this means that there is no statistically significant
evidence for deviations from GR.
1
.
5
1
.
0
0
.
5
0
.
0
0
.
5
1
.
0
1
.
5
1
2
3
4
5
6
P
(
M
f
/
̄
M
f
)
GW150914
GW170104
GW170729
GW170809
GW170814
GW170818
GW170823
1
2
3
4
5
P
(
a
f
/
̄
a
f
)
1
.
0
0
.
5
0
.
0
0
.
5
1
.
0
1
.
5
1
.
0
0
.
5
0
.
0
0
.
5
1
.
0
1
.
5
M
f
/
̄
M
f
1
.
0
0
.
5
0
.
0
0
.
5
1
.
0
a
f
/
̄
a
f
FIG. 2. Results of the inspiral-merger-ringdown consistency test for
the selected BBH events (see Table I). The main panel shows 90%
credible regions of the posterior distributions of (
M
f
/
̄
M
f
,
a
f
/
̄
a
f
),
with the cross marking the expected value for GR. The side panels
show the marginalized posteriors for
M
f
/
̄
M
f
and
a
f
/
̄
a
f
. The thin
black dashed curve represents the prior distribution, and the grey
shaded areas correspond to the combined posteriors from the five
most significant events (as outlined in Sec. III and Table I).
B. Inspiral-merger-ringdown consistency test
The inspiral-merger-ringdown consistency test for binary
black holes [
38
,
75
] checks the consistency of the low-
frequency part of the observed signal (roughly correspond-
ing to the inspiral of the black holes) with the high-frequency
part (to a good approximation, produced by the post-inspiral
stages). The cuto
ff
frequency
f
c
between the two regimes is
chosen as the frequency of the innermost stable circular or-
bit of a Kerr black hole [
76
], with mass and dimensionless
spin equal to the median values of the posterior distribution
of the remnant’s mass and spin. This determination of
f
c
is
performed separately for each event and based on parameter
inference of the full signal (see Table III for values of
f
c
).
8
The
binary’s parameters are then estimated independently from the
low (high) frequency parts of the data by restricting the noise-
weighted integral in the likelihood calculation to frequencies
below (above) this frequency cuto
ff
f
c
. For each of these inde-
pendent estimates of the source parameters, we make use of fits
to numerical-relativity simulations given in [
77
79
] to infer the
mass
M
f
and dimensionless spin magnitude
a
f
=
c
|
~
S
f
|
/
(
G M
2
f
)
of the remnant black hole.
9
If the data are consistent with GR,
these two independent estimates have to be consistent with
each other [
38
,
75
]. Because this consistency test ultimately
compares between the inspiral and the post-inspiral results,
posteriors of both parts must be informative. In the case of
low-mass binaries, the SNR in the part
f
>
f
c
is insu
ffi
cient
to perform this test, so that we only analyze seven events as
indicated in Tables I and III.
In order to quantify the consistency of the two di
ff
erent
estimates of the final black hole’s mass and spin we define
two dimensionless quantities that quantify the fractional di
ff
er-
ence between them:
M
f
/
̄
M
f
B
2 (
M
insp
f
M
post-insp
f
)
/
(
M
insp
f
+
M
post-insp
f
) and
a
f
/
̄
a
f
B
2 (
a
insp
f
a
post-insp
f
)
/
(
a
insp
f
+
a
post-insp
f
),
where the superscripts indicate the estimates of the mass
and spin from the inspiral and post-inspiral parts of the sig-
nal.
10
The posteriors of these dimensionless parameters, es-
timated from di
ff
erent events, are shown in Fig. 2. For
all events, the posteriors are consistent with the GR value
(
M
f
/
̄
M
f
=
0
,
a
f
/
̄
a
f
=
0). The fraction of the posterior
enclosed by the isoprobability contour that passes through
the GR value (i.e., the GR quantile) for each event is shown
in Table III. Figure 2 also shows the posteriors obtained by
combining all the events that pass the stronger significance
8
The frequency
f
c
was determined using preliminary parameter inference
results, so the values in Table III are slightly di
ff
erent than those that would
be obtained using the posterior samples in GWTC-1 [9]. However, the test
is robust against small changes in the cuto
ff
frequency [38].
9
As in [
6
], we average the
M
f
,
a
f
posteriors obtained by di
ff
erent fits [
77
79
]
after augmenting the fitting formulae for aligned-spin binaries by adding
the contribution from in-plane spins [
80
]. However, unlike in [
6
,
80
], we do
not evolve the spins before applying the fits, due to technical reasons.
10
For black hole binaries with comparable masses and moderate spins, as we
consider here, the remnant black hole is expected to have
a
f
&
0
.
5; see, e.g.,
[
77
79
] for fitting formulae derived from numerical simulations, or Table I
for values of the remnant’s spins obtained from GW events. Hence,
a
f
/
̄
a
f
is expected to yield finite values.
8
threshold
FAR
<
(1000
yr
)
1
, as outlined in Sec. III (see the
same section for a discussion of caveats).
The parameter estimation is performed employing uniform
priors in component masses and spin magnitudes and isotropic
priors in spin directions [
14
]. This will introduce a non-flat
prior in the deviation parameters
M
f
/
̄
M
f
and
a
f
/
̄
a
f
, which
is shown as a thin, dashed contour in Fig. 2. Posteriors are
estimated employing the precessing spin phenomenological
waveform family
IMRP
henom
P
v
2. To assess the systematic
errors due to imperfect waveform modeling, we also estimate
the posteriors using the e
ff
ective-one-body based waveform
family
SEOBNR
v
4 that models binary black holes with non-
precessing spins. There is no qualitative di
ff
erence between
the results derived using the two di
ff
erent waveforms families
(see Sec. 2 of the Appendix).
We see additional peaks in the posteriors estimated from
GW170814 and GW170823. Detailed follow-up investigations
did not show any evidence of the presence of a coherent signal
in multiple detectors that di
ff
ers from the GR prediction. The
second peak in GW170814 is introduced by the posterior of
M
post-insp
f
, while the extra peak in GW170823 is introduced by
the posterior of
M
insp
f
. Injection studies in real data around
the time of these events, using simulated GR waveforms with
parameters consistent with GW170814 and GW170823, sug-
gest that such secondary peaks occur for
10% of injections.
Features in the posteriors of GW170814 and GW170823 are
thus consistent with expected noise fluctuations.
VI. PARAMETERIZED TESTS OF GRAVITATIONAL
WAVE GENERATION
A deviation from GR could manifest itself as a modification
of the dynamics of two orbiting compact objects, and in par-
ticular, the evolution of the orbital (and hence, GW) phase. In
an analytical waveform model like
IMRP
henom
P
v
2, the de-
tails of the GW phase evolution are controlled by coe
ffi
cients
that are either analytically calculated or determined by fits to
numerical-relativity (NR) simulations, always under the as-
sumption that GR is the underlying theory. These coe
ffi
cients
have a functional dependence on the intrinsic parameters of the
binary that is specific to GR (masses and spins for binary black
holes). In this section we investigate deviations from the GR
binary dynamics by introducing shifts in each of the individ-
ual GW phase coe
ffi
cients of
IMRP
henom
P
v
2. We then treat
those shifts as additional unconstrained GR-violating parame-
ters, which we measure in addition to the standard parameters
describing the binary.
The early inspiral of compact binaries is well modeled by
the post-Newtonian (PN) approximation [
81
84
] to GR, which
is based on the expansion of the orbital quantities in terms of
a small velocity parameter
v/
c
. For a given set of intrinsic
parameters, coe
ffi
cients for the di
ff
erent orders in
v/
c
in the PN
series are uniquely determined. A consistency test of GR using
measurements of the inspiral PN phase coe
ffi
cients was first
proposed in [
85
87
], while a generalized parametrization was
motivated in [
88
]. Bayesian implementations based on such
parametrized methods were presented and tested in [
39
,
89
91
] and were also extended to the post-inspiral part of the
gravitational-wave signal [
92
,
93
]. These ideas were applied
to the first GW observation, GW150914 [
10
], yielding the first
bounds on higher-order PN coe
ffi
cients [
4
]. Since then, the
constraints have been revised with the binary black hole events
that followed, GW151226 in O1 [
5
] and GW170104 in O2 [
6
].
More recently, the first such constraints from a binary neutron
star merger were placed with the detection of GW170817 [
8
].
Bounds on parametrized violations of GR from GW detections
have been mapped, to leading order, to constraints on specific
alternative theories of gravity (see, e.g., [
94
]). In this paper, we
present individual constraints on parametrized deviations from
GR for each of the GW sources in O1 and O2 listed in Table I,
as well as the tightest combined constraints obtained to date
by combining information from all the significant binary black
holes events observed so far, as described in Sec. III.
The frequency-domain GW phase evolution
Φ
(
f
) in the
early-inspiral stage of
IMRP
henom
P
v
2 is described by a PN
expansion, augmented with higher-order phenomenological co-
e
ffi
cients. The PN phase evolution is analytically expressed in
closed form by employing the stationary phase approximation.
The late-inspiral and post-inspiral (intermediate and merger-
ringdown) stages are described by phenomenological analytical
expressions. In all cases, the phenomenological coe
ffi
cients
are calibrated with data from NR simulations of mass-ratios as
asymmetric as 1 : 18 and of dimensionless spin-magnitudes up
to 0.99, as well as the inspiral portion of (uncalibrated) EOB
waveforms. The transition frequency
11
from inspiral to inter-
mediate regime is given by the condition
G M f
/
c
3
=
0
.
018,
with
M
the total mass of the binary in the detector frame, since
this is the lowest frequency above which this model was cal-
ibrated with NR data [
19
]. Deviations from GR in all three
stages are expressed by means of relative shifts
δ
ˆ
p
i
in the corre-
sponding waveform coe
ffi
cients:
p
i
(1
+
δ
ˆ
p
i
)
p
i
, which are
used as additional free parameters in our extended waveform
models.
We denote the testing parameters corresponding to PN phase
coe
ffi
cients by
δ
ˆ
φ
i
, where
i
indicates the power of
v/
c
beyond
leading (Newtonian or 0PN) order in
Φ
(
f
). The frequency
dependence of the corresponding phase term is
f
(
i
5)
/
3
. In the
parametrized model,
i
varies from 0 to 7, including the terms
with logarithmic dependence at 2
.
5PN and 3PN. The non-
logarithmic term at 2
.
5PN (i.e.,
i
=
5) cannot be constrained,
because of its degeneracy with a constant reference phase (e.g.,
the phase at coalescence). These coe
ffi
cients were introduced
in their current form in Eq. (19) of [
89
]. In addition, we also
test for
i
=
2, representing an e
ff
ective
1PN term, which is
motivated below. The full set of inspiral parameters are thus
{
δ
ˆ
φ
2
ˆ
φ
0
ˆ
φ
1
ˆ
φ
2
ˆ
φ
3
ˆ
φ
4
ˆ
φ
5
l
ˆ
φ
6
ˆ
φ
6
l
ˆ
φ
7
}
.
Since the
1PN term and the 0
.
5PN term are absent in the GR
phasing, we parametrize
δ
ˆ
φ
2
and
δ
ˆ
φ
1
as
absolute
deviations,
with a pre-factor equal to the 0PN coe
ffi
cient.
11
This frequency is di
ff
erent than the cuto
ff
frequency used in the inspiral-
merger-ringdown consistency test, as was briefly mentioned in Sec. III.