Comparing Bayes factors and hierarchical inference for testing general relativity with gravitational waves

In the context of testing general relativity with gravitational waves, constraints obtained with multiple events are typically combined either through a hierarchical formalism or though a combined multiplicative Bayes factor. We show that the well-known dependence of Bayes factors on the analysis priors in regions of the parameter space without likelihood support can lead to strong confidence in favor of incorrect conclusions when one employs the multiplicative Bayes factor. Bayes factors $\mathcal{O}(1)$ are ambivalent as they depend sensitively on the analysis priors, which are rarely set in a principled way; additionally, combined Bayes factors $>\mathcal{O}(10^3)$ can be obtained in favor of the incorrect conclusion depending on the analysis priors when many $\mathcal{O}(1)$ Bayes factors are multiplied, and specifically when the priors are much wider than the underlying population. The hierarchical analysis that instead infers the ensemble distribution of the individual beyond-general-relativity constraints does not suffer from this problem, and generically converges to favor the correct conclusion. Rather than a naive multiplication, a more reliable Bayes factor can be computed from the hierarchical analysis. We present a number of toy models showing that the practice of multiplying Bayes Factors can lead to incorrect conclusions.

In the context of testing general relativity with gravitational waves, constraints obtained with multiple events are typically combined either through a hierarchical formalism or though a combined multiplicative Bayes factor. We show that the well-known dependence of Bayes factors on the analysis priors in regions of the parameter space without likelihood support can lead to strong confidence in favor of incorrect conclusions when one employs the multiplicative Bayes factor. Bayes factors O(1) are ambivalent as they depend sensitively on the analysis priors, which are rarely set in a principled way; additionally, combined Bayes factors > O(10 3 ) can be obtained in favor of the incorrect conclusion depending on the analysis priors when many O(1) Bayes factors are multiplied, and specifically when the priors are much wider than the underlying population. The hierarchical analysis that instead infers the ensemble distribution of the individual beyond-general-relativity constraints does not suffer from this problem, and generically converges to favor the correct conclusion. Rather than a naive multiplication, a more reliable Bayes factor can be computed from the hierarchical analysis. We present a number of toy models showing that the practice of multiplying Bayes Factors can lead to incorrect conclusions.

I. INTRODUCTION
With an increasing number of LIGO-Virgo [1,2] gravitational wave (GW) observations, we can leverage the collective set of measurements to study the properties of the astrophysical objects that generate GWs [3,4] and the validity of general relativity (GR) [5][6][7]. Two broad and complementary approaches exist for drawing inferences from sets of detections. The first relies on the posterior distribution for some model and its continuous parameters whose range of possible values encapsulates the different physics we would like to study. The second phrases a question of interest in the language of model selection between two discrete hypotheses to compute Bayes factors (BFs). The latter approach is common in the context of testing GR where one introduces a parametrized deviation of the signal as predicted by GR [8][9][10][11][12][13], but further examples relate to higher-order modes of the radiation [14], GW memory [15], the neutron star equation of state [16][17][18][19][20][21], gravitational lensing [22,23], the association between GWs and potential electromagnetic counterparts [24], GW ringdowns [25], and the signal detection problem in general [26][27][28][29][30][31][32].
Although posteriors and BFs are mathematically related, in practice approaches focusing on one or the other can come to seemingly different, and even contradictory, conclusions. For example, Ref. [33] considers the case of testing the no-hair theorem with GW ringdowns and shows that BFs can favor the incorrect conclusion even * misi@flatironinstitute.org † will.farr@stonybrook.edu ‡ kchatziioannou@caltech.edu in cases where the posterior has minimal support for it. The alternative of working directly with the posterior and hierachical inference has been introduced in the context of tests of GR [5][6][7] after the limitations of using BFs to combine information from multiple events were pointed out in [34]. Specifically, multipyling BFs corresponds to assuming that a GR deviation will manifest independently and distributed according to the underlying prior in each observation [34]. In this paper, we further this line of argument to highlight issues with the use of BFs in the context of nested models and show that the hierarchical modeling of population distributions offers a more flexible and reliable alternative. BFs, or marginalized-likelihood ratios, provide a succinct way to compare the likelihood of two models in light of some data. The BF comparing a hypothesis H 0 to another H 1 for some data d is given by where the likelihoods are marginalized over some (potentially different) sets of parameters θ 0/1 for the H 0/1 hypotheses respectively. The definition of the hypotheses encompasses the choice of parameter priors p(θ | H), as well as any other assumptions built into the functional form of the likelihoods p(d | θ, H). A larger value of B 0 1 (or, equivalently, its natural logarithm, ln B 0 1 ) indicates a preference for H 0 over H 1 . When enhanced with prior weights for each hypothesis, P (H 0/1 ), this returns the betting odds in favor of one model over the other, conditional on the observed data-namely, To avoid expressing an a priori preference for either model, it is common to set P (H 0 ) = P (H 1 ), and so O 0 1 = B 0 1 . BFs reduce the complicated problem of selecting between two models of reality to a single number-a feature which lies at the core of its appeal, but also of its shortcomings. Like other scalar statistics, interpreting this one number is usually far from straightforward, leading to somewhat ad hoc scales such as [35]. This difficulty is worsened by the fact that BFs necessarily are affected by all aspects of a model, including those corresponding to less interesting regions of the parameter space like regions of the prior space for which the likelihood offers no support. Consequently, BFs can vary wildly with different choices of prior bounds, which are often set arbitrarily. Since priors can rarely be set from first principles, calibrating BFs tends to require large scale injection campaigns [29]-although this is only possible when the injections themselves can be designed in a principled way (i.e., when we actually know how to simulate expected astrophysical distributions). Even when the model is specified correctly and we can take BFs at face value, the result offers no insight as to why exactly one model is to be preferred.
All these drawbacks compound when one attempts to combine multiple observations by multiplying BFs computed with a fixed prior, which enhances the sensitivity to prior choices. Moreover, naive BF computations from collections of events impose strong, generally unrealistic assumptions [34]. Multiplying BFs obtained from individual events results in a collective BF that assumes the targeted effect (say, a deviation from GR) manifests independently for each observation. On the other hand, generating a collective BF from the product of likelihoods from multiple measurements presumes that the effect manifests identically for all observations. Neither of these are valid assumptions in general [34]. In general the degree to which a targeted effect appears independently versus identically in each observation is something that should be leared from the data; this insight leads directly to hierarchical modeling [36][37][38][39].
Rather than assuming a fixed and known distribution (e.g., all events are the same, or all the events are different), a hierarchical analysis works by inferring the underlying distribution of the parameter whose values encode the targeted hypotheses. For example, in the context of tests of GR, a parameter x may represent the magnitude of a GR violation, so that the GR (non-GR) hypothesis implies x = 0 (x = 0); likewise, when searching for GW memory, this could be the amplitude of the effect in question. Once this parameter is identified, we can use hierarchical inference to characterize its values across the observed events-using the collection of measurements holistically to infer the distribution of x. The challenge here lies in choosing a suitable parametrization for this underlying distribution.
We can circumvent this through a moment expansion: as a first approximation, we will only be interested in recovering the mean and standard deviation of the un-known distribution, so we can parametrize it as a Gaussian, whose mean and standard deviation (µ, σ pop ) we are to measure from the data [5]; the null hypothesis will often be constructed so that it is recovered for µ = σ pop = 0. This approach allows us to study the population of measurements without assuming the targeted effect manifests either identically or distinctly for all events, learning more about the population distribution, which now plays the role of a prior, with each new observation.
In this paper, we compare the BF and hierarchical approaches directly in the context of multiple GW observations, and argue that BFs are unreliable in any context in which the prior does not adapt to the observations at hand. We show that in such context, BFs do not have the right scaling with the number of events, even in simple situations, due to their inherently strong dependence on the (fixed) priors in regions of no likelihood support. We show that hierarchical posteriors do not suffer from such limitations as they rely on "priors" that are inferred from the data, and have a weaker dependence on hyperparameter assumptions.
We begin in Sec. II by reviewing some basic properties of BFs obtained from single observations and their interplay with Occam penalties. Then, in Sec. III, we study the scaling of BFs when combining multiple observations under two example distributions for the true parameters under consideration; we show that this approach can often lead to the incorrect conclusion. We also consider the same scenarios under the hierarchical approach, showing that it does not suffer from the same drawbacks. In Sec. IV, we consider a final model that cannot be obtained as a special case of the distribution assumed by the hierarchical analysis, showing that the hierarchical approach yields the correct result even then. We conclude in Sec. V.

II. SINGLE EVENT: OCCAM PENALTY VS GOODNESS-OF-FIT
Before tackling the problem of multiple observations, we first review the behavior of BFs computed from a single event. A typical situation that is simple to understand analytically is that of nested models, i.e., two models constructed so that one can be recovered as a special case of the other. Parametrized tests of GR, for example, typically involve nested hypotheses wherein the non-GR model is characterized by all the usual GR signal parameters plus one or more additional variables x nGR that quantify the deviation from GR [40][41][42][43][44][45][46][47][48]. These parametrizations are usually constructed so that GR is recovered when the deviation parameters vanish. Then, to determine whether GR is favored, the data are analyzed with some broad (typically flat) prior on the deviations in order to compute BFs comparing x nGR = 0 to x nGR = 0.
In that spirit, consider some real-valued parameter x and two related hypotheses H x=0 and H x =0 , respectively defined to imply x = 0 and x = 0, with some prior over x and any other relevant parameters. Since x = 0 is a special case of the model in which x is allowed to vary over a broad range including the origin, we say the hypotheses are nested. 1 In that case, the BF comparing the two is given exactly by the Savage-Dickey ratio [49,50], where p(x = 0 | d) is the marginal posterior and p(x = 0) is the prior, both evaluated at x = 0. In other words, the BF in favor of x = 0 is simply the ratio of the posterior to the prior evaluated at the origin. We can elucidate the role of the prior in Eq. (3) by considering a specific functional form. For simplicity, assume the marginal posterior is given by a standard normal distribution, truncated symmetrically around the origin by some uniform prior of half-width ∆ (i.e., flat in −∆ < x < ∆). Then, the BF can be computed analytically to yield in terms of the standard cumulative distribution function Φ(x) ≡ 1 + erf(x/ √ 2) /2 and the error function 1 Here we are identifying H x =0 with the model in which x is allowed to vary freely; this is legitimate because the point x = 0 is a set of measure zero, so it does not need to be explicitly excised from the arbitrary-x model.
x 0 exp(−y 2 ) dy. Since erf(∆/ √ 2) → 1 for large ∆, the BF can be made to favor x = 0 with arbitrarily-high confidence by sufficiently broadening the prior-in fact, B x=0 x =0 ∝ ∆ in the large ∆ limit. We illustrate this in Fig. 1.
The dependence on the prior range is a general feature not specific to our example: the same data can produce arbitrary odds in favor of a specific value of a parameter (x = 0 here) relative to a model with increasing prior volume (proxied by ∆). This is related to the concept of the Occam penalty in Bayesian inference: BFs do not only favor the model that fits the data best, but also the one that is simplest-where simplicity is defined as a model's ability to fit the data without having to significantly constrain its parameters relative to their a priori allowed values. The interplay between goodness of fit and Occam penalty creates the possibility of a BF that strongly favors the incorrect conclusion. In the context of testing GR, if the theory is indeed violated, then some observation can give p(x = 0 | d) 1 in Eq. (3); yet, this can always be countered with a wide enough prior that makes the Occam penalty 1/p(x = 0) so large that goodness of fit cannot overcome it. This is the expected manifestation of the Occam penalty in BFs; the failure is symptomatic of a mismatch between the implemented prior and the observer's expectation.
Since we generally have little a priori information about the nature of possible beyond-GR effects, and these effects are not strongly constrained by the likelihood of an individual measurement, a natural inclination is to make the prior much wider than the likelihood. However, the argument above shows that a broad prior prevents us from detecting small deviations from GR, which is an important regime for tests of GR [51]-either because GR is close to correct or because of selection effects that disfavor the detection of signals with morphology far from GR. The same tension arises in other contexts where BFs are used without a principled prior. In the next section, we show that this behavior is not unique to single-event analyses but carries over to combined constraints.

III. COMBINING EVENTS UNDER A BROAD PRIOR
We now turn to collections of measurements and show that the "combined" multiplicative BF does not have the correct scaling in the regime of interest, with support accumulating in favor of the null hypothesis even when this conclusion is incorrect. Combining BFs from multiple events will only lead to the right conclusion when the deviation (e.g., the deviation from GR) is large enough to be apparent in individual posteriors, negating the need for combining observations in the first place. We then show that the hierarchical approach is not susceptible to this issue.
Again, consider a single parameter, x, that stands in for the magnitude of the GR deviation or any other effect of interest, and let x = 0 be our null hypothesis (e.g., GR is correct). We conduct experiments to measure x, and assume additive Gaussian noise so that the observed value x obs is normally distributed about the true value x with standard deviation (measurement noise) σ obs : In the previous section, we considered a simplified case of this model, in which we had a single measurement with x obs = 0 and σ obs = 1. We consider two "populations" for the true values of x under repeated measurements, i.e., the true magnitude of the GR deviation for each observed GW event. In the first the true value of x is fixed to some value x 0 for each measurement so that the population distribution is a Dirac delta, In the second, the value of x is randomly distributed with mean zero and standard deviation σ 0 for each measurement, so that the population distribution is In the language of tests of gravity, the two models recover GR (our null hypothesis) when x 0 = 0 or σ 0 = 0, respectively. In the second model, the mean deviation from GR vanishes, but the actual variation in a given measurement fluctuates. In both cases the deviation parameter x is assumed independent and identically distributed (iid) for each measurement, so there is some prior choice such that multiplying BFs for repeated measurements is optimal [34]. 2 However, that prior is unknown in realistic situations because the true distribution of GR deviations is not known a priori. Rather, in all cases we assume a measurement is analyzed with a flat prior on −∆ < x < ∆, following common practice.

A. Bayes factors
With that prior, the BF between the null and generalized hypotheses (for concreteness, "GR" and "non-GR") for a given observation x obs is a generalization of Eq. (4) The optimal prior is the actual population from which the true parameters controlling the measurements are iid draws.
When the prior is very broad, ∆ x obs , and the Gaussian posterior on x is not meaningfully truncated by the prior, then the BF simplifies to corresponding to the B x=0 x =0 ∝ ∆ limit discussed in the previous section. Ensuring ∆ x obs is a common analysis choice because this prior permits the true value of x to correspond to the observed value x obs which is the value of x that maximizes the likelihood in each observation. A broad prior is also desired when combining multiple observations in order to accommodate the expected scatter in the individual likelihoods.
If we choose to combine observations by adding log BFs 3 (equivalently, multiplying BFs), it is sufficient to compute the expected value of an individual log BF under the true deviations; the expected total log BF will then be the expected individual log BF times the number of events. The expected value of the log of Eq. (8) is not expressible in closed form, but can be straightforwardly computed numerically. In the limit that ∆ x obs , the expected log BFs under our two populations of GR deviations become and From the above, we can see that, whenever the GR deviation is nonzero but small enough to be undetectable in a single observation (0 < x 0 , σ 0 σ obs ), then for ∆ 2σ obs the expected log BF is positive, and evidence accumulates in favor of GR even though there is a deviation. For deviations that are marginally detectable in a single observation x 0 , σ 0 σ obs , choosing a wide, uninformative prior ∆ 3.5σ obs (which, recall, is necessary to ensure that all observations x obs in a modest-sized catalog of observations are within the prior range) will result in evidence that accumulates against modifications to GR.
An exact calculation of the expected log BF without the assumption that ∆ x obs appears in Fig. 2. Interestingly, any value of ∆ will accumulate evidence for GR when x 0 = 0 or σ 0 = 0; but if x 0 > 0 or σ 0 > 0, so that the GR model is incorrect, prior choices that encompass deviation parameter values comparable to those observed, i.e., ∆ few × σ obs , accumulate evidence for the incorrect GR model unless the deviation parameter is comparable to or larger than the observational uncertainty. In this regime, either the BF fails to select the correct theory (non-GR) on average (with more and more certainty as the number of observations grows); or the deviation is so large that it is marginally detectable (i.e., "∼1σ") with a single observation. Multipliying BFs in this situation is counter-productive, and the best constraint is achieved with a single measurement.
Even when the log BF is expected to take the correct sign on average, the result will vary for any given set of detections. In the regime where ∆ x obs , we can quantify this through the variance associated with the means in Eqs. (10) and (11). For the former, this is just while for the latter we have var ln B GR nGR σ0 These scalings are illustrated in Fig. 3 for ∆ = 5σ obs . On the left hand side, the variance asymptotes to 1/2 for decreasing σ obs and fixed intrinsic scatter σ 0 ; on the right hand side, the BF variance keeps increasing as we make the individual measurements more precise (decreasing σ obs ) because for narrower posteriors the value at x = 0 varies more drastically from event to event. In the case where GR is correct (x 0 = σ 0 = 0), the scatter in the single-event log BF is large enough that it is not extremely uncommon for moderately large catalogs to yield evidence for the wrong hypothesis, even when the null hypothesis is correct (Fig. 4). The qualitative behavior here can be understood in the context of [34]. The flat prior on x implicitly assumes that each observation has a true x parameter that is independent of other observations and uniformly distributed on (−∆, ∆). If one chooses ∆ large enough to include all event likelihoods and not truncate them, then the flat prior for the true parameter implicitly demands that most of the true deviation parameters are comparable to ∆. If, instead, the deviation parameters are considerably smaller than the observational uncertainty σ obs -which is the regime where stacking multiple events should be the most beneficial-then the assumption is so badly violated that pooling observations prefers the incorrect model where the deviation parameters are zero to the even-less-correct model where the deviation parameters are iid uniform on (−∆, ∆).
The solution to this problem is to allow the assumed population to adapt its properties to the stacked set of observations, for example by allowing its location and scale to fit the set, as suggested in [5]. This effectively constructs a model that better represents our beliefs about the combined data set.

B. Hierarchical treatment
We now revisit the two experiments above using a hierarchical model for the distribution of GR deviations, instead of computing BFs with a uniform prior. Assuming we are only interested in the first two moments of the distribution, we parametrize the true deviations as drawn from a Gaussian such that x ∼ N (µ, σ pop ) [5]. The goal will be to infer the µ and σ pop hyperparameters from the collection of measurements, and to quantify agreement with the null hypothesis µ = σ pop = 0 based on the corresponding 2D posterior. The two non-GR models we considered above are encompassed within this parametrization when (µ = x 0 , σ pop = 0) and (µ = 0, σ pop = σ 0 ). 4 As before, we assume that the observed value x obs in an individual event is normally distributed around the true value per Eq. (5), and we take the true values themselves to be distributed normally given µ and σ pop . In that case, the likelihood for a given observation becomes (see Appendix A) where σ 2 tot ≡ σ 2 obs + σ 2 pop is the total variance arising from the combination of statistical uncertainty and the intrinsic population scatter. Unlike in Eq. (5), the true value x for the individual measurement does not appear in this likelihood because we have marginalized over it and replaced it with the hyperparameters µ and σ pop .
Consider the same two true distributions for x above, which depart from GR as given by Eqs. (6) and (7). We simulate catalogs of detections following these distributions and analyze them hierarchically with the likelihood of Eq. (14), and Gaussian priors on µ and σ pop (with zero mean and standard deviation equal to σ obs , restricting to positive values for σ pop ). Although other choices are possible, it is natural to tie the scale of the hyperprior to the measurement uncertainty because this is the only scale built into the problem a priori. Furthermore, this choice is guaranteed to be sufficiently broad in the small-deviation regime (x 0 , σ 0 < σ obs ) in which we are interested, and smooth enough to accommodate larger deviations if needed.
We quantify agreement with the null hypothesis through the marginal posteriors for µ and σ pop , as well as the credible level at which GR is recovered in the 2D posterior for those two quantities (the 2D quantile, Q GR in [6]), defined as . Both quantities are shown relative to the measurement uncertainty on the deviation parameter σ obs . For any non-zero value of the deviation parameter or its scatter, there is a prior width that results in incorrectly accumulating evidence for GR (above dashed line); moreover, for reasonable prior widths of a few times the measurement uncertainty (so that observed values of the deviation parameter lie within the prior range and the prior is therefore broad and uninformative) the deviation parameter must be large enough to be marginally detectable in a single measurement (x0, σ0 ∼ σ obs ) before evidence against the incorrect GR model accumulates on average over many stacked measurements. Fig. 3 demonstrates this effect for ∆ = 5σ obs .  where the shorthand "p < p(0, 0)" stands in for values of µ and σ pop such that p(µ, σ pop | x obs , σ obs ) < p(µ = 0, σ pop = 0 | x obs , σ obs ). Given this definition, a value of Q GR = 1 means that the posterior peaks at the origin µ = σ pop = 0, while Q GR = 0 means that the posterior offers no support for that point (i.e., a higher value of Q GR implies better agreement with GR). In all cases, we simulate each catalog of N observations 50 times and report medians over the ensemble. Unlike with BFs, there are no values of x 0 or σ 0 for which the hierarchical analysis converges to the wrong answer given enough observations. This is apparent from Fig. 5, which shows the value of Q GR as a function of the deviation magnitude (x 0 /σ obs or σ 0 /σ obs ) and the number of observations: even for deviations small relative to the individual-measurement uncertainty, Q GR approaches zero for large N -indicating that the posterior offers little support for µ = σ pop = 0, in tension with GR. For small catalogs (N ≤ 10), the value of Q GR is more strongly influenced by the prior, to a greater or lesser extent depending on the magnitude of the deviation.
Besides indicating that the data are inconsistent with the null hypothesis, the hierarchical analysis provides descriptive information about the nature of the deviation. With enough observations, the measurements of µ and σ pop converge to the x 0 and σ 0 values respectively for the two models with increasing precision for larger N . In Fig. 6, we show this behavior explicitly for two example magnitudes of the GR deviation. As expected, we can detect larger deviations with fewer detections, and need more observations to notice a nonzero scatter than a nonzero mean.
Prior choices play a lesser role in the hierarchical approach than in BF computations. The hierarchical anal-ysis takes as input the likelihoods for individual events, so the prior used to initially analyze the data is largely irrelevant as long as it offers the likelihood ample support (e.g., ∆ σ obs in the notation of the previous section). The choice of prior is, instead, transferred to the µ and σ pop hyperparameters; however, any reasonably smooth choice will work assuming we have a enough events for the hierarchical measurement to be informative. If observations are not sufficiently numerous, the result will be influenced by the µ and σ pop prior.
In the case of our idealized examples, we can analytically predict the number of detections needed for the hierarchical measurement to be informative. As we show in Appendix A, the variance of the marginalized hierarchical likelihood is expected to scale as for the inferred population mean and for the inferred population variance, assuming the true population variance is σ 2 0 and irrespective of the true mean. As a rule of thumb, the hierarchical measurement will become informative once the characteristic width of the likelihood of Eq. (14) becomes smaller than the scale imposed by the prior. With hyperpriors of scale σ prior , this implies that the µ measurement should start becoming informative (in the sense that we obtain a hyperposterior narrower than the prior) once we accumulate as we show in Appendix A. We demonstrate these scalings in Fig. 7, where we show the variance in the inferred population mean and variance from simulated populations of measurements with σ 0 = σ obs /2 and no mean, and setting σ prior = σ obs for concreteness. For increasing number of measurements N , the posterior converges to the true values (µ = 0, σ pop = σ obs /2). For small catalogs, i.e., for N comparable or smaller than the thresholds above, the average uncertainty in these measurements is broad but smaller than expected simply from Eqs. (16) and (17) because it is dominated by the prior. As the number of detections increases, the posterior variance becomes well described by Eqs. (16) and (17), meaning that the data become informative and the likelihood dominates over the prior.
FIG. 5. The GR quantile QGR (color) obtained in a hierarchical analysis of data in which there is a fixed deviation for all events (x0, ordinate left) or a scatter across events (σ0, ordinate right), as a function of the number of events (N , abscissa). The reported value of QGR is the median over 50 simulated catalogs with N -events. Low values of QGR indicate that the null hypothesis (µ = σpop = 0) is disfavored; for reference, the dashed line marks the point at which GR is disfavored at 1σ, i.e., QGR = exp(−1/2) ≈ 0.61. Unlike in Fig. 2, given enough N we always detect the deviation, even when x0, σ0 < σ obs .
FIG. 6. Recovered population mean (µ, left) and standard deviation (σpop, right) as a function of catalog size (N , abscissa), for our two toy models: a fixed deviation x0 for all events (left) and a deviation scatter σ0 across events (right). In each case, we show two values for the true deviation: x0, σ0 = 0.1 σ obs (blue) and x0, σ0 = 0.8 σ obs (red). The measurement is represented by posterior median (solid lines) surrounded by 68% and 90% highest-density credible bands (shading); we also show the true value (dotted lines) and the null expectation (dashed gray line). Smaller deviations require more observations to be detected-for example, we only need N 3 events to notice µ > 0 at 1σ (68% credibility) if x0 = 0.8 σ obs , but N 100 if x0 = 0.1 σ obs .
Alternately, we may ask how many measurements would be required to establish a non-vanishing population mean or variance. This requires var (µ) x 2 0 or var σ 2 pop σ 4 0 . The former would imply and the latter

IV. FENCEPOST MODEL
In the examples above, the simulated populations could be perfectly reproduced as special cases of the hierarchical population model-that is, there existed a choice of µ and σ pop for which the hierarchical model reduced exactly to the true distribution we simulated. Of course, we do not necessarily expect this to be the case in reality: a deviation from GR could manifest as a nontrivial function of the source parameters and the coupling constants in the theory; similarly, for other effects such as memory, it is not realistic to expect the true population of parameters to be fully described by a simple Gaussian if the null hypothesis is incorrect. In light of that, one might worry that the hierarchical method only outperformed BFs in  14) to become narrower than the prior, as dictated by Eqs. (18) and (19) with σprior = σ obs , but rounded up to the closest integer. Fig. 6, except the true population follows the fencepost model of Sec. IV, by which the true deviation is x = ±σ obs /2 with equal probability for either sign. Even though this distribution cannot be expressed as the limit of a Gaussian, the hierarchical analysis infers the correct values of µ = 0 and σpop = σ obs /2.

FIG. 8. As in
the above examples because the hierarchical model was able to match the true population exactly, and that this gain would fail to materialize in realistic situations. Yet, as we show in this section, this is not the case: the hierarchical method is more robust than products of BFs even when the underlying population cannot be fit exactly by a Gaussian.
Consider a situation in which the true deviation parameter is either x = ±x 0 for some x 0 and with equal probability for both signs. The true distribution in this "fencepost" model is simply the sum of two delta functions, and, therefore, has zero mean and standard deviation σ 0 = |x 0 |. This population cannot be described as a limiting case of a Gaussian distribution; nevertheless, we can always analyze it hierarchically with the Gaussian likelihood of Eq. (14), and expect to recover the correct values for the population mean and spread (namely, µ = 0 and σ pop = |x 0 |) given enough detections [5]. In Fig. 8, we show this explicitly for simulated measurements in which x 0 = σ obs /2.
On the other hand, when presented with the fencepost model, BF computations suffer from the same problems already identified above: with a broad prior relative to the measurement uncertainty (∆ > σ obs ), the combined BF for multiple observations will necessarily converge to the wrong answer (i.e., favor of the null hypothesis) unless x 0 σ obs , in which case the deviation is detectable in individual observations. Similarly to Fig. 2, in Fig. 9 we show the scaling of the expected BF accumulated from fencepost-model observations, as a function of x 0 and ∆.
In Fig. 10, we compare the hierarchical and Bayes-factor results for a progressively-higher number of simulated measurements from a fencepost population with x 0 = σ obs /2. As the number of measurements increases, the hierarchical model disfavors the null hypothesis more strongly, with Q GR vanishing rapidly for increasing N . On the other hand, a BF computed with ∆ = 5σ obs grows exponentially in favor of the null hypothesis, yielding a spectacularly incorrect result.

V. CONCLUSIONS
Although conceptually appealing in idealized situations, the use of BFs to aggregate information from multiple observations presents difficulties in practice. Their apparent simplicity in reducing a complex model selection problem to a single number hides an opaque dependence strict and unrealistic population assumptions. Unless priors (aka, the "population") adapt to the observations at hand, BFs are difficult to interpret-a problem that is compounded when multiplying such BFs from a catalog of observations. Even when priors are adequate, the result on its own provides no insight as to why a model is to be preferred over another. This and related problems have been widely discussed in the statistics literature [e.g., 52, and references therein], but not extensively in the context of GWs and testing GR.
In this paper, we have examined BFs and hierarchical posteriors as two commonly-used alternatives to derive information from collections of GW detections in order to decide between two models, e.g., the presence or absence of beyond-GR effects in the detected waveforms. We furthered arguments in [5,34] to show that, without a principled way to set priors, BFs are an unreliable tool for this task. We demonstrated this with three examples in which the value of some parameter x encoding the effect in question (e.g., a deviation from GR) follows different distributions deviating from the null hypothesis.
We have found that, when the truth does not conform to the null model, the usual approach of multiplying single-event BF converges to the incorrect answer for an increasing number of observations, except in a regime where the targeted effect is discernible in individual observations (thus negating the need for combining events in the first place). On the other hand, hierarchical modeling of the underlying population leads to identification of the appropriate priors (aka, the "population model") and converges to the correct answer. We established this in the context of nested models for which GR can be recovered as a special case of the beyond-GR model (i.e., x 0 = 0); however, the issue of sensitivity to the prior width will still be present in non-nested models [e.g., 53] where it will require a different solution.
In principle, BFs could be computed after the hierarchical population inference [52] or between different population models [3], but we here show that they are unreliable without this step. Even then, it is not possible to evade the core problem of prior dependence when computing BFs, no matter how many levels of inference are applied: the BF computation based on the highest level of inference in a hierarchical model will still depend on the choice of priors on that level, reducing the problem once again to the choice of a prior distribution that is difficult to establish in a principled way. This issue is devistatingly acute for the approach that multiplies Bayes factors with a simple, fixed prior because each observation contributes an additional prior factor.  Fig. 9. The solid line gives the median over 100 catalog realizations at each catalog size N while the band shows the range of the 16th to 84th percentile values. The credibility of GR (left) is the fraction of posterior mass for µ and σ in our hierarchical model that lies at a lower posterior density than the GR values µ = σ = 0 when the model is fit to a catalog of observations whose true and observed deviations are drawn from the "fencepost" model described in Sec. IV with x = ±σ obs /2. The log Bayes factor (right) is the sum of the log BFs for each observation in the catalog using a flat prior on the true deviation −∆ < x < ∆ with ∆ = 5σ obs . The hierarchical model correctly finds that there is little credibility for GR once the catalog size is a few hundred; the accumulated BF, on the other hand, becomes very confident in the incorrect GR model even at small catalog sizes.
Gaussian distribution such that x i ∼ N (µ i , σ i ), where σ i is the measurement uncertainty. 5 The joint likelihood for µ and σ pop conditional on the N uncertainties σ i can be obtained by marginalizing over the true values µ i : where σ 2 tot,i = σ 2 i + σ 2 pop is the total variance for the ith measurement. For the full set of N measurements {x i }, then, the hierarchical likelihood is The maximum-likelihood estimators for µ and σ pop , denotedμ andσ pop respectively, can be found by enforcing 5 In the main text, we used slightly different notation: instead of (x i , µ i , σ i ) we had (x obs , x, σ obs ); the former is slightly more succinct, which will be helpful here given the increased number of mathematical expressions.
We cannot solve this most general (heteroskedastic) case forμ andσ pop in closed form, so we specialize to the (homoskedastic) case where all measurements have similar error, σ i = σ obs for all i. With this simplification, the above relations reduce tô As might be expected, the maximum-likelihood estimate of the population mean is simply the sample mean, and the inferred population variance corresponds to the variance in the data that cannot be accounted for by the statistical uncertainty in each individual measurement. We can go one step further and compute the uncertainty in these estimators, which we can take as a proxy for the width of the marginal likelihoods. The uncertainty inμ is straightforward to compute, since this is just a linear combination of independent random variables x i with known variance σ 2 tot ≡ σ 2 obs + σ 2 pop , hence var (μ) = σ 2 obs + σ 2 pop N . (A8) Obtaining var σ 2 pop is less straightforward, but we can do so by writing var σ 2 pop = F −1 , in terms of the corresponding Fisher element 6 F ≡ − ∂ 2 In the main text, we quoted these results for var (μ) and var σ 2 pop in Eqs. (16) and (17) respectively. We can compare the width of the hierarchical likelihood, as proxied by the estimator variances above, to some typical scale of interest in the problem. Below we consider the scale imposed by the µ and σ pop hyperpriors to estimate the number of events before the likelihood become informative with respect to the prior. We chose the hyperpriors to be Gaussians with scale σ prior , restricting to positive σ pop values, i.e., for the standard deviation, where the difference in normalization arises from the σ pop ≥ 0 truncation. The prior variances in our example are, thus, var (µ) = σ 2 prior for the mean, and var σ 2 pop = 2σ 4 prior for the variance (obtained through direct computation). For concreteness, in the main text we set σ prior = σ obs , since that is the only scale intrinsic to the measurement.
We can now directly compute the number of observations required for the likelihood to achieve comparable widths by equating these variances to the likelihood variances from Eqs. (A8) and (A10) above. The result is for σ pop . We quoted these results in Eqs. (18) and (19) in the main text. We require at least this many measurements before the uncertainty in the population variance can be smaller than the measurement uncertainty. Until that point, the σ pop posterior will be dominated by the prior. The results for the N thresholds quoted above hinge on the specific choice of prior for µ and σ pop . In the main text, we justified our decision to set the scale of those priors based on σ obs by noting that this is the only intrinsic scale to the problem, and should always be sufficiently broad as long as the deviation from GR is not visible in a single detection-the regime in which we are interested in the first place. Had we chosen to increase the prior variance by some factor, then the N thresholds would decrease by the same factor, i.e., we need fewer detections to gain information relative to a broad (less informative) prior than a narrower prior. Either way, the result converges to the right answer as we accumulate more observations.