A Unified Analysis of Four Cosmic Shear Surveys

In the past few years, several independent collaborations have presented cosmological constraints from tomographic cosmic shear analyses. These analyses differ in many aspects: the datasets, the shear and photometric redshift estimation algorithms, the theory model assumptions, and the inference pipelines. To assess the robustness of the existing cosmic shear results, we present in this paper a unified analysis of four of the recent cosmic shear surveys: the Deep Lens Survey (DLS), the Canada-France-Hawaii Telescope Lensing Survey (CFHTLenS), the Science Verification data from the Dark Energy Survey (DES-SV), and the 450 deg$^{2}$ release of the Kilo-Degree Survey (KiDS-450). By using a unified pipeline, we show how the cosmological constraints are sensitive to the various details of the pipeline. We identify several analysis choices that can shift the cosmological constraints by a significant fraction of the uncertainties. For our fiducial analysis choice, considering a Gaussian covariance, conservative scale cuts, assuming no baryonic feedback contamination, identical cosmological parameter priors and intrinsic alignment treatments, we find the constraints (mean, 16% and 84% confidence intervals) on the parameter $S_{8}\equiv \sigma_{8}(\Omega_{\rm m}/0.3)^{0.5}$ to be $S_{8}=0.94_{-0.045}^{+0.046}$ (DLS), $0.66_{-0.071}^{+0.070}$ (CFHTLenS), $0.84_{-0.061}^{+0.062}$ (DES-SV) and $0.76_{-0.049}^{+0.048}$ (KiDS-450). From the goodness-of-fit and the Bayesian evidence ratio, we determine that amongst the four surveys, the two more recent surveys, DES-SV and KiDS-450, have acceptable goodness-of-fit and are consistent with each other. The combined constraints are $S_{8}=0.79^{+0.042}_{-0.041}$, which is in good agreement with the first year of DES cosmic shear results and recent CMB constraints from the Planck satellite.


INTRODUCTION
The large-scale structure of the Universe bends the light rays emitted from distant galaxies according to General Relativity (Einstein 1936). This effect, known as weak (gravitational) lensing, introduces coherent distortions in galaxy shapes, which carry information about the cosmic composition and history.
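One of the most common statistics used to extract this information is the shear two-point correlation function $\xi_{\pm}(\theta)$, which is related to the lensing power spectrum $C^{ij}(\ell)$ via
$$\xi^{ij}_{\pm}(\theta) = \frac{1}{2\pi}\int d\ell\, \ell\, J_{0/4}(\theta\ell)\, C^{ij}(\ell), \qquad (1)$$
where $J_{0/4}$ is the 0th/4th-order Bessel function of the first kind. The $i$ and $j$ indices specify the two samples of galaxies (or in the case of $i = j$, the galaxy sample) from which the correlation function is calculated. Usually these samples are defined by a certain redshift selection. Under the Limber approximation (Limber 1953; Loverde & Afshordi 2008) and in a spatially flat universe$^{1}$, the lensing power spectrum encodes cosmological information through
$$C^{ij}(\ell) = \int_{0}^{\chi_H} d\chi\, \frac{q^{i}(\chi)\, q^{j}(\chi)}{\chi^{2}}\, P_{\rm NL}\!\left(\frac{\ell + 1/2}{\chi}, \chi\right), \qquad (2)$$
where $\chi$ is the radial comoving distance, $\chi_H$ is the distance to the horizon, $P_{\rm NL}$ is the nonlinear matter power spectrum, and $q(\chi)$ is the lensing efficiency defined via
$$q^{i}(\chi) = \frac{3}{2}\,\Omega_{\rm m}\left(\frac{H_0}{c}\right)^{2} \frac{\chi}{a(\chi)} \int_{\chi}^{\chi_H} d\chi'\, n^{i}(\chi')\, \frac{\chi' - \chi}{\chi'}, \qquad (3)$$
where $\Omega_{\rm m}$ is the matter density today, $H_0$ is the Hubble parameter today, $c$ is the speed of light, $a$ is the scale factor, and $n^{i}(\chi)$ is the redshift distribution of the galaxy sample $i$.
arXiv:1808.07335v2 [astro-ph.CO] 28 Aug 2018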
Since the first detection of cosmic shear in Bacon et al. (2000); Kaiser et al. (2000); Wittman et al. (2000); Schneider et al. (2002), the field has seen rapid growth. In particular, a number of large surveys have delivered cosmic shear results with competitive cosmological constraints in the past few years (Heymans et al. 2013; Becker et al. 2016; Jee et al. 2016; Joudaki et al. 2017a; Troxel et al. 2017; Hildebrandt et al. 2017; DES Collaboration et al. 2017), while ongoing and future surveys will deliver data in much larger volumes and of better quality [e.g. the Dark Energy Survey (DES, Flaugher 2005), the Hyper Suprime-Cam Survey (HSC, Aihara et al. 2017), the Kilo-Degree Survey (KiDS, de Jong et al. 2015) and the Large Synoptic Survey Telescope (LSST, Ivezić et al. 2008; Abell et al. 2009)].
One of the surprises that has emerged in the past couple of years is that there seems to be a modest level of discordance between different cosmological probes (MacCrann et al. 2015;Freedman 2017;Raveri & Hu 2018). Even though in many of these cases, the level of tension between the different probes still needs to be quantified more rigorously, one consequence has been that the cosmology community has started to more carefully scrutinize how the datasets are analyzed. This is especially important as we expect the statistical power of the datasets to be orders of magnitude better in the near future. If there is indeed a tension between the different probes, it could point to an exciting new direction where the simple ΛCDM cosmology cannot explain all the observables and new physics is needed.
A variety of studies have been carried out to understand systematic effects in weak lensing measurements. These include systematics from the instrument and the environment, from modeling the point-spread function (PSF) and measuring galaxy shapes, from estimating the redshift of each galaxy, from the theoretical modeling, and many more (see Mandelbaum 2017, and references therein for a comprehensive list of studies). In this work, we focus on understanding the steps between the shear catalog and cosmological constraints: measuring the shear two-point correlation function [Eq. (1)], estimating the covariance, modeling the signal, and inferring cosmological parameters. We build a modular and robust pipeline using the PEGASUS workflow engine (Deelman et al. 2015) to analyze the datasets in a streamlined and transparent fashion. This pipeline will serve as the first step towards building up cosmological analysis pipelines for the LSST Dark Energy Science Collaboration (DESC).
In this paper we apply the pipeline to four publicly available datasets that are precursors to ongoing and future cosmic shear surveys: the Deep Lens Survey (DLS, Jee et al. 2016), the Canada-France-Hawaii Telescope Lensing Survey (CFHTLenS, Joudaki et al. 2017a), the Science Verification data from the DES (DES-SV, DES Collaboration et al. 2016), and the 450 deg$^{2}$ release of the KiDS (KiDS-450, Hildebrandt et al. 2017). All four surveys were carried out fairly recently and have comparable statistical power, so a uniform pipeline is a powerful way to identify any discrepancies and to understand their origin. A detailed look at the consistency between the four datasets can also inform us about potential systematic issues in the processing that produces the catalogs from which our pipeline begins. It is, however, beyond the scope of this paper to investigate these issues upstream of our pipeline, where a thorough pixel-level study for each survey may be required.
The paper is organized as follows. In Sec. 2 we describe the details of the four datasets used in this work. In Sec. 3 we describe the pipeline that is used to process the data. We then outline in Sec. 4 the framework in which we compare the datasets and the elements in the pipeline that are allowed to vary. Our results are shown and discussed in Sec. 5 and we conclude in Sec. 6.

PRECURSOR SURVEYS
We describe briefly the four datasets used in this work. In Fig. 1 we show the estimated redshift distribution for each dataset. The number of tomographic bins in each case was chosen by the collaboration (and we keep that number fixed throughout), but the range does convey information about the depth of the surveys. For example, the DLS is much deeper and therefore is sensitive to shear at higher redshift. In Fig. 2 we show the footprint of the four datasets on the sky. Since the footprints of these surveys are largely nonoverlapping, they can be treated as independent. In Table 1 we list the main parameters used in each of the cosmic shear analyses.

DLS: the Deep Lens Survey
The DLS (Wittman et al. 2000) consists of five ∼ 2 × 2 deg$^{2}$ fields that add up to ∼ 18 deg$^{2}$. Two fields were observed by the Kitt Peak Mayall 4m telescope/Mosaic Prime-Focus Imager (Muller et al. 1998), and the other three by the Cerro Tololo Blanco 4m telescope/Mosaic Prime-Focus Imager. The total DLS dataset was taken over 140 nights of B, V, R and z imaging. The approximate limiting magnitudes for each band (at 5σ) are 26, 26, 27, 26 in B, V, R and z, respectively. The average seeing is ∼ 0.9$''$ in R.
The cosmic shear cosmology analysis from DLS was first presented in Jee et al. (2013), and later updated in Jee et al. (2016), which is the analysis we focus on in this paper. The shear measurement method is described in Jee et al. (2013), where an elliptical Gaussian galaxy model is used and image simulations (Jee & Tyson 2011) were employed for calibration of the shear estimate. The photometric redshift (or photo-z) estimation uses the BPZ code (Benítez 2000) and is validated against the PRIsm MUlti-object Survey (PRIMUS, Coil et al. 2011) in Jee et al. (2013).
DLS is the deepest survey, with the smallest area, of the four datasets used in this work. As we do not have access to the shape catalogs for DLS, we start from the pre-measured two-point correlation functions provided by the collaboration.

CFHTLenS: the Canada-France-Hawaii Telescope Lensing Survey
The CFHTLenS data (Erben et al. 2013; Heymans et al. 2012) span four distinct contiguous fields of approximately 63.8, 22.6, 44.2 and 23.3 deg$^{2}$. Images are taken via the Canada-France-Hawaii 3.6m Telescope/MegaCam Imager in six filter bands: $u^{*}$, $g'$, $r'$, $i'$, $y'$, $z'$. The limiting magnitudes for each band (at 5σ in a 2$''$ aperture) are 25.24, 25.58, 24.88, 24.54, 24.71, 23.46 in the six bands, respectively, while the average seeing is 0.68$''$ in $i'$, where the shapes are measured.
The cosmic shear cosmology analysis from CFHTLenS was presented first in Fu et al. (2008) and later updated in Heymans et al. (2013); Kilbinger et al. (2013) and then Joudaki et al. (2017a), which is the focus of this paper. The shear measurement was based on the LENSFIT package (Miller et al. 2007), which is a likelihood-based model-fitting approach that allows for joint-fitting over multiple observations of the same galaxy. A two-component (disk plus bulge) model is used to fit the galaxy shape and to extract the galaxy ellipticity. The method marginalizes over nuisance parameters such as galaxy position, size, brightness and bulge fraction. Miller et al. (2013) describes the simulation-based calibrations that are applied to the shear catalog. The photo-z estimation was based on the BPZ code (Benítez 2000; Hildebrandt et al. 2012). The catalogs are publicly available$^{2}$.
As seen in Fig. 1, the CFHTLenS analysis uses the largest number of tomographic bins. In addition, in Joudaki et al. (2017a) extensive explorations of the impact of different intrinsic alignment (IA) models, baryonic feedback models, and photo-z uncertainties were performed. When considered independently, only the IA amplitude was found to be substantially favored by the CFHTLenS data. However, with a 2σ negative amplitude, this could be a sign of either simplistic modeling or unaccounted systematics. The CFHTLenS analysis further considered joint accounts of the systematic uncertainties, where the "MIN", "MID" and "MAX" cases included successively more conservative treatments of the systematics modeling and scale cuts (along with a "fiducial" case that included no systematics). Joudaki et al. (2017a) found that the $S_{8}$ constraints were sensitive to the specific treatment of the systematic uncertainties, where the level of concordance with Planck ranged from decisive discordance (MIN) to substantial concordance (MAX). As a result, when quoting the nominal constraints from the collaboration, we show all three cases for CFHTLenS.
Table 1 notes. $A$ is the area of the survey while $w_{i}$ is the weight for source galaxy $i$. The summation runs over all source galaxies. † Only for $\xi^{23}_{+}$ and $\xi^{33}_{+}$, the small-scale cutoff is 2 arcmin. †† Only for $\xi^{11}_{-}$ and $\xi^{12}_{-}$, the small-scale cutoff is 60 arcmin.

DES-SV: the Dark Energy Survey Science Verification Data
The DES-SV dataset was taken before the official DES run began and was designed to cover a smaller area (∼ 250 deg$^{2}$) to the full depth expected for DES. The area used in the cosmology analysis is a contiguous area of 139 deg$^{2}$. Images were taken with the Dark Energy Camera (Flaugher et al. 2015) on the Cerro Tololo Blanco 4m telescope. Five filter bands (g, r, i, z, Y) were used, reaching a median depth of g ∼ 24.0, r ∼ 23.9, i ∼ 23.0 and z ∼ 22.3. The average seeing is 1.11$''$ in r, 1.08$''$ in i and 1.03$''$ in z; the DES-SV galaxy shapes used information from all three bands.
The cosmology analysis from weak lensing was presented in DES Collaboration et al. (2016), while the details and testing of the measurements were recorded in Becker et al. (2016). Two independent shear catalogs were produced from the DES-SV data and have been extensively tested in Jarvis et al. (2016). In this work we use the catalog produced by the shear measurement algorithm NGMIX (Sheldon 2014), which is a fast Bayesian fitting algorithm that models galaxies as a mixture of Gaussian profiles. The Gaussian profiles are chosen to approximate an exponential disk. Several photo-z algorithms were tested in Becker et al. (2016) and Bonnett et al. (2016), including SKYNET (Bonnett 2013) and BPZ (Benítez 2000). In DES Collaboration et al. (2016), results from all shear and photo-z catalogs were presented and shown to be consistent. In this work we use only the NGMIX catalog and the SKYNET photo-z, as these were recommended by DES as the fiducial catalogs with the best performance. All catalogs are publicly available$^{3}$.
The analysis pipeline used in DES Collaboration et al. (2016) is based on COSMOSIS, which is the same cosmology inference framework we use in this paper, so we expect very good agreement between our analysis and DES Collaboration et al. (2016).
Figure 3. Flow chart of the steps used in the pipeline that goes from survey data to cosmology. The arrow pointing towards the left from the "n(z) & Metadata" box bypasses the covariance calculation; it refers to the route taken when the survey-provided covariances are used in the inference.

KiDS-450: the 450 deg$^{2}$ Kilo-Degree Survey
The KiDS-450 dataset consists of five separate patches covering a total effective area of ∼ 360 deg$^{2}$. Data were taken using the OmegaCAM CCD Mosaic camera mounted at the Cassegrain focus of the VLT Survey Telescope (VST). There are four SDSS-like filter bands, u, g, r, i, and the image depth is approximately 24.3, 25.1, 24.9, 23.8 in each band, respectively (5σ limit in a 2$''$ aperture). The median seeing is 0.66$''$ in r, and no r-band images have seeing greater than 0.96$''$.
The cosmology analysis from cosmic shear using KiDS-450 data was presented in Hildebrandt et al. (2017). The cosmological inference pipeline was largely based on that used in CFHTLenS (Joudaki et al. 2017a), while several updates were made to the measurement pipeline. First, the shear calibration to the LENSFIT shear catalog was based on more sophisticated image simulations (Fenech Conti et al. 2017). Second, a new approach for estimating photo-z and propagating photo-z uncertainties into cosmological inferences was implemented, which we briefly describe below.
The n(z) estimation in KiDS-450 is based on ideas presented in Lima et al. (2008) and implemented in Bonnett et al. (2016). This approach is referred to in Hildebrandt et al. (2017) as the "weighted direct calibration (DIR)" method. The n(z) is taken directly from the redshift distribution of a spectroscopic sample, with appropriate re-weighting in color-magnitude space to correct for the incompleteness and selection effects in both the shear catalog and the spectroscopic sample. Since the n(z)'s are derived from a small number of spectroscopic galaxies, they appear noisier than those of the other surveys in Fig. 1, where more traditional photo-z methods (stacked redshift probability distribution functions, or PDFs) are used.

PIPELINE
A directed acyclic graph (DAG) representing the modular pipeline developed for this analysis is shown in Fig. 3. The pipeline is implemented using the PEGASUS (Deelman et al. 2015) workflow management system. The individual components in the DAG are explained in more detail in Sec. 4, but we outline below the basic structure of the pipeline. Starting from the top, catalogs from each survey are fed into the first two branches of the pipeline, which are run in parallel. The first branch (the left half of the DAG) starts with sample selection and tomographic binning: catalog data are sorted into $N_{t}$ redshift bins and appropriate quality cuts are applied, producing one intermediate catalog file per bin. Next, $N_{c} = N_{t}(N_{t} + 1)/2$ jobs are launched in parallel to calculate the two-point shear correlation functions using the TREECORR$^{4}$ code. The outputs of all the parallel jobs are collected to form the data vector for the analysis. The second branch (the right half of the DAG) starts with estimating the full redshift distribution n(z) by summing the redshift PDFs for each individual galaxy$^{5}$. This approach of stacking the redshift PDFs for cosmological inference is not mathematically correct, but is consistent with the implementation of the four surveys under study. The n(z), together with other metadata from each survey (the effective number densities for each tomographic bin, the total shape noise, the survey area), is fed into the calculation of the analytic covariance corresponding to the data vector using the code COSMOLIKE (Krause & Eifler 2016). A total of $N_{c}(2N_{c} + 1)$ COSMOLIKE jobs are launched to calculate each submatrix of the full covariance matrix in parallel. The results for all submatrices are then combined to form the full covariance.
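The job counts quoted above follow from simple combinatorics. The snippet below is only an illustrative sketch and is not part of the pipeline: the function names and the example value of four tomographic bins are assumptions made here. It enumerates the unique tomographic bin pairs and the unique sub-blocks of the symmetric covariance of the combined $\xi_{+}$ and $\xi_{-}$ data vector.

```python
from itertools import combinations_with_replacement


def tomographic_pairs(n_tomo):
    """Return the unique (i, j) bin pairs with i <= j, one per correlation job."""
    return list(combinations_with_replacement(range(1, n_tomo + 1), 2))


def covariance_blocks(n_tomo):
    """Return the unique sub-blocks of the symmetric covariance matrix.

    The data vector contains xi_+ and xi_- for every bin pair, i.e. 2 * N_c
    block-level entries, so the upper triangle including the diagonal holds
    N_c * (2 * N_c + 1) blocks.
    """
    pairs = tomographic_pairs(n_tomo)
    entries = [(sign, pair) for sign in ("+", "-") for pair in pairs]
    return list(combinations_with_replacement(entries, 2))


n_tomo = 4  # hypothetical number of tomographic bins, for illustration only
n_c = len(tomographic_pairs(n_tomo))
n_blocks = len(covariance_blocks(n_tomo))
print(n_c, n_blocks)
```

The explicit enumeration is convenient in a workflow setting because each tuple can be mapped directly onto one independent job.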
Finally, the outputs from the two branches -the data vector and the covariance matrix -are fed into COSMOSIS  for inference of the cosmological model. The last step also involves choosing the appropriate theory models, priors and scale cuts within COSMOSIS.
This pipeline is written in a modular and generic fashion that strings together the three main codes that are used: TREECORR, COSMOLIKE and COSMOSIS, so that it is easy to substitute different input catalogs, covariances and theory models. Building on this pipeline, it is easy to incorporate other cosmological probes, though that is beyond the scope of this paper. We note also that the pipeline, WLPIPE, serves as a test ground for experimenting with different pipeline architectures for future DESC cosmology analyses. For example, we have tested WLPIPE using other workflow engines such as PARSL$^{6}$ (Babuji et al. 2018). A similar pipeline was previously constructed for the recent DES Year 1 weak lensing and large-scale structure analyses (DES Collaboration et al. 2017), the cosmic shear part (Troxel et al. 2017) of which was made available to this project. The DES pipeline, however, did not employ any formal workflow management engine. The two pipelines have since been validated against one another to ensure they produce consistent results.
All plots of the cosmological constraints from COSMOSIS chains are produced using the software package CHAINCONSUMER$^{7}$ with the setting kde=1.5.

COMPARISON FRAMEWORK
The focus of this paper is to compare the cosmic shear analyses of the four precursor surveys in multiple aspects, both within the same dataset and across the four datasets. We describe below the different elements that we consider in this work. We note that, as our goal was to investigate and compare the various existing (published) datasets, there was no attempt at blinding throughout the analysis.

Two-Point Correlation Functions
An important intermediate output of our pipelines is a set of two-point correlation functions for different redshift bins: $\xi^{ij}_{\pm}(\theta)$ [Eq. (1)]. These together form the data vector for the cosmological parameter fitting. Except for DLS, whose shape catalogs are not in the public domain yet, we can compare the two-point functions output from WLPIPE with those obtained by the different survey collaborations. For this work, we use the code TREECORR to measure the two-point shear correlation function. TREECORR is a fast tree-based method that allows one to estimate a variety of two- and three-point correlation functions. To measure the two-point shear correlation function, we calculate
$$\xi^{ij}_{\pm}(\bar{\theta}_{\alpha}) = \frac{\sum_{a}\sum_{b} W_{a} W_{b}\left[e^{t}_{a} e^{t}_{b} \pm e^{\times}_{a} e^{\times}_{b}\right]\Theta_{\alpha}(|\vec{\theta}_{a} - \vec{\theta}_{b}|)}{\sum_{a}\sum_{b} W_{a} W_{b} S_{a} S_{b}\,\Theta_{\alpha}(|\vec{\theta}_{a} - \vec{\theta}_{b}|)},$$
where the sums run over galaxies $a$ in sample $i$ and $b$ in sample $j$; $e^{t}_{a}$ is the tangential component of the ellipticity of galaxy $a$ with respect to the vector $(\vec{\theta}_{a} - \vec{\theta}_{b})$, and $e^{\times}_{a}$ is the cross component; $\bar{\theta}_{\alpha}$ is the mean angular separation between all galaxy pairs in bin $\alpha$; $W$ is the weight associated with each galaxy; and $S$ is an (algorithm-dependent) calibration factor defined by each of the different shear catalogs. The last factor, $\Theta_{\alpha}(|\vec{\theta}_{a} - \vec{\theta}_{b}|)$, is 1 when $|\vec{\theta}_{a} - \vec{\theta}_{b}|$ is inside angular bin $\alpha$ and 0 elsewhere. When using TREECORR, we set the parameter bin_slop=0, which means there are no approximations in calculating the angular separation between two galaxies. We note also that DES-SV uses TREECORR to calculate its two-point correlation functions, while for DLS, CFHTLenS and KiDS-450, the measurements are obtained via ATHENA$^{8}$.
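The pair-based estimator described above can be illustrated with a brute-force sketch. This is not TREECORR's tree algorithm and is far too slow for real catalogs; it assumes a flat sky and a single galaxy sample (the $i = j$ case), and the function name and argument layout are hypothetical choices made for this illustration.

```python
import bisect
import cmath
import math


def xi_pm_brute_force(pos, ell, weights, calib, bin_edges):
    """Brute-force flat-sky estimate of xi_+ and xi_- in angular bins.

    pos       : list of (x, y) positions in a common angular unit
    ell       : list of complex ellipticities e1 + 1j * e2
    weights   : list of per-galaxy weights W
    calib     : list of per-galaxy calibration factors S
    bin_edges : increasing list of bin edges, same unit as pos
    Returns two lists (xi_plus, xi_minus); empty bins are set to None.
    """
    nbins = len(bin_edges) - 1
    num_p = [0.0] * nbins
    num_m = [0.0] * nbins
    den = [0.0] * nbins
    for a in range(len(pos)):
        for b in range(a + 1, len(pos)):
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            sep = math.hypot(dx, dy)
            k = bisect.bisect_right(bin_edges, sep) - 1
            if k < 0 or k >= nbins:
                continue  # pair falls outside all angular bins
            # Rotate both ellipticities into the frame of the separation vector:
            # real part = tangential component, imaginary part = cross component.
            rot = cmath.exp(-2j * math.atan2(dy, dx))
            ea = -ell[a] * rot
            eb = -ell[b] * rot
            ww = weights[a] * weights[b]
            num_p[k] += ww * (ea.real * eb.real + ea.imag * eb.imag)
            num_m[k] += ww * (ea.real * eb.real - ea.imag * eb.imag)
            den[k] += ww * calib[a] * calib[b]
    xi_p = [n / d if d > 0 else None for n, d in zip(num_p, den)]
    xi_m = [n / d if d > 0 else None for n, d in zip(num_m, den)]
    return xi_p, xi_m
```

A quick sanity check of such a sketch is a field with identical ellipticity for every galaxy, for which $\xi_{+}$ equals the squared ellipticity modulus in every populated bin, divided by the product of the calibration factors.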
One final subtle point to note is that $\bar{\theta}$ is the weighted mean of the logarithmic angular separation between all pairs of galaxies in a given angular bin$^{9}$, or
$$\ln\bar{\theta}_{\alpha} = \frac{\sum_{a}\sum_{b} W_{a} W_{b}\,\ln(|\vec{\theta}_{a} - \vec{\theta}_{b}|)\,\Theta_{\alpha}(|\vec{\theta}_{a} - \vec{\theta}_{b}|)}{\sum_{a}\sum_{b} W_{a} W_{b}\,\Theta_{\alpha}(|\vec{\theta}_{a} - \vec{\theta}_{b}|)}.$$
The choice of $\bar{\theta}$ is important because it sets the positions at which the model is evaluated and compared to the $\xi^{ij}_{\pm}(\theta)$ measurements during the parameter inference process. For DLS, CFHTLenS and KiDS-450, this was not taken into account and the geometric mean of the logarithmic angular bins was used$^{10}$. This will result in a small shift in the parameter inference, as we discuss later in Sec. 5.1 and Appendix A. This effect has also been pointed out previously in Joudaki et al. (2018) and Troxel et al. (2018).
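To give a feel for the size of this effect, the sketch below compares the two choices of bin position in an idealized setting. It assumes an unmasked, uniformly populated field with equal weights, so that the number of pairs at separation $\theta$ scales as $\theta\,d\theta$; the function names and the example bin edges are hypothetical.

```python
import math


def pair_weighted_log_mean(theta_lo, theta_hi):
    """exp(<ln theta>) over a bin when the pair density scales as theta d(theta)."""

    def antiderivative(t):
        # Integral of t * ln(t) dt = t^2 / 2 * (ln(t) - 1/2)
        return 0.5 * t * t * (math.log(t) - 0.5)

    norm = 0.5 * (theta_hi ** 2 - theta_lo ** 2)  # integral of t dt
    mean_log = (antiderivative(theta_hi) - antiderivative(theta_lo)) / norm
    return math.exp(mean_log)


def geometric_bin_centre(theta_lo, theta_hi):
    """Geometric mean of the two edges of a logarithmic bin."""
    return math.sqrt(theta_lo * theta_hi)


lo, hi = 10.0, 20.0  # hypothetical bin edges in arcmin
print(pair_weighted_log_mean(lo, hi), geometric_bin_centre(lo, hi))
```

Because larger separations contain more pairs, the pair-weighted position sits above the geometric bin centre in this idealized case; real catalogs with masks and non-uniform weights will differ in detail.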

Covariance Matrices
The covariance matrix is an essential element in the pipeline. The full covariance matrix receives contributions from two terms (Cooray & Hu 2001; Sato et al. 2009; Takada & Hu 2013): the Gaussian covariance and the non-Gaussian covariance. The non-Gaussian covariance includes the super-sample covariance (Takada & Hu 2013), which describes the uncertainty induced by large-scale density modes outside the survey window. In this work we use two sets of covariance matrices for each analysis. First, we use the covariance matrices used in the four papers (Jee et al. 2016; DES Collaboration et al. 2016; Joudaki et al. 2017a; Hildebrandt et al. 2017), which were provided by the collaborations. Next, we use a theoretical Gaussian covariance matrix produced by the COSMOLIKE (Krause & Eifler 2016) code. We note that the Gaussian covariance may not be sufficient, especially for DLS, given the smaller area and lower shape noise in this dataset. For further details of the covariance calculation, see Krause et al. (2017). The COSMOLIKE covariance calculation requires the following information from each survey:
• n(z): estimate of the redshift distribution for each tomographic bin (see Fig. 1)
• $n_{\rm eff}$: the effective number of source galaxies used in each bin as defined in Heymans et al. (2012) (see Table 1)
• $\sigma_{e}$: standard deviation of the galaxy shape (or shape noise) for the whole catalog (see Table 1)
• $A_{\rm sky}$: area of the footprint (see Table 1).
Recently, Krause et al. (2017) and Troxel et al. (2018) also pointed out the importance of accounting for the geometry of the footprint, not just its area $A_{\rm sky}$, by using the survey window function when calculating the analytic covariance. Briefly, one can estimate the effect of the survey geometry by actually counting the number of source galaxy pairs as a function of separation or via an analytic integration of the survey mask. One then uses this information to calculate the shape noise contribution to the covariance instead of the simple geometric calculation based only on the area and mean source number density. We have incorporated this correction into our analytic covariances. Note that this correction does not include the survey geometry correction to the cosmic variance piece of the covariance, which may be important for surveys with low shape noise, such as DLS. The cosmological parameters used to generate all COSMOLIKE Gaussian covariances in this work are: $\Omega_{\rm m} = 0.286$, [...]
We first use the survey-provided covariance to check whether we can reproduce the results from the papers. Next we compare the cosmological constraints derived using the survey-provided and the theoretical Gaussian covariance. The four surveys have different approaches to estimating the covariance: for DLS, DES-SV and CFHTLenS, the covariance was estimated via simulations (which are also different between the three cases). For KiDS-450, both simulation and analytic covariances were used and shown to be broadly consistent (though the choice can cause a 1σ shift in the $S_{8}$ constraints, see Hildebrandt et al. 2017). The final results were based on the analytic covariance.
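For orientation, the "simple geometric calculation" of the pair count can be sketched as follows. This is an illustrative estimate only: it ignores survey boundaries and masks, which is precisely what the window-function correction addresses, and the function name and the example number density are hypothetical (only the 139 deg$^{2}$ area is taken from the text).

```python
import math

ARCMIN2_PER_DEG2 = 3600.0


def geometric_pair_count(area_deg2, n_eff_arcmin2, theta_lo_arcmin, theta_hi_arcmin):
    """Expected number of unordered galaxy pairs in an angular bin.

    Assumes a uniform, unmasked field: every galaxy sees a full annulus of
    neighbours, which overestimates the pair count near survey edges.
    """
    n_gal = area_deg2 * ARCMIN2_PER_DEG2 * n_eff_arcmin2
    annulus_area = math.pi * (theta_hi_arcmin ** 2 - theta_lo_arcmin ** 2)
    return 0.5 * n_gal * n_eff_arcmin2 * annulus_area


# 139 deg^2 is the DES-SV area quoted in the text; the density is a placeholder.
print(geometric_pair_count(139.0, 6.0, 10.0, 20.0))
```

Since the shape-noise variance of $\xi_{\pm}$ in a bin scales inversely with the number of pairs, overestimating the pair count in this way underestimates the shape-noise term, most noticeably at separations comparable to the footprint size.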
For covariances estimated via simulations, we need to apply the Hartlap correction factor $H$ (Hartlap et al. 2007) when inverting the covariance to approximately correct for the bias in the inverse covariance estimate coming from the noise in the simulated covariance $C_{\rm sim}$. That is,
$$C^{-1} = H\, C_{\rm sim}^{-1} = \frac{N - p - 2}{N - 1}\, C_{\rm sim}^{-1},$$
where $N$ is the number of independent simulations and $p$ is the length of the data vector. We note that this gives an unbiased but still noisy estimate of the inverse covariance. In all our analyses, $N$ is large enough so that the noise associated with the resulting inverse covariance is reasonable (Sellentin & Heavens 2016).
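As a small illustration of the size of this correction, the standard Hartlap et al. (2007) factor can be evaluated as below. The function name and the example values of $N$ and $p$ are hypothetical and do not correspond to any of the four surveys.

```python
def hartlap_factor(n_sims, n_data):
    """Hartlap et al. (2007) debiasing factor for an inverse sample covariance.

    n_sims : number of independent simulations N
    n_data : length of the data vector p
    """
    if n_sims <= n_data + 2:
        raise ValueError("need N > p + 2 for a meaningful correction")
    return (n_sims - n_data - 2) / (n_sims - 1)


# Hypothetical example: 500 simulations and a data vector of length 100.
print(hartlap_factor(500, 100))
```

The corrected inverse covariance is then the factor returned here multiplied by the inverse of the simulated covariance; the factor approaches unity only when $N$ is much larger than $p$.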

Cosmological/Nuisance Parameters
In all four survey analyses, of order ten cosmological parameters and parameters modeling systematic effects are varied. The parameters each survey chooses to vary are slightly different and the corresponding priors are also different. In Table 2 we summarize the free cosmological/nuisance parameters and priors for the four analyses under the ΛCDM framework. We note that for CFHTLenS we have chosen to study the "fiducial" setting in Joudaki et al. (2017a), which does not consider any systematic effects. In later analyses when we unify the analysis choices across surveys, the shear calibration bias, photo-z bias and IA amplitude will be allowed to vary. We also note that in Table 2 there are two classes of parametrization of the free cosmological parameters. For DLS and DES-SV, $[\Omega_{\rm m}, \Omega_{b}, h, \sigma_{8}, n_{s}]$ was used, whereas for CFHTLenS and KiDS-450, $[\Omega_{c} h^{2}, \Omega_{b} h^{2}, h, \ln(10^{10} A_{s}), n_{s}]$ was used. Here, $\Omega_{b}$ is the baryon density today, $h$ is the unitless Hubble constant ($H_{0} = 100h$ km/s/Mpc), $\sigma_{8}$ is the amplitude of the (linear) power spectrum on the scale of 8 $h^{-1}$ Mpc, $\Omega_{c}$ is the cold dark matter density today, $A_{s}$ is the amplitude of the matter power spectrum, and $n_{s}$ is the spectral index. Since the priors on the varied parameters are taken to be flat, choosing $\Omega_{b} h^{2}$ for example instead of $\Omega_{b}$ translates to choosing a differently shaped prior on the $\Omega_{b} - h$ parameter space. Furthermore, for CFHTLenS and KiDS-450, the $h$ prior is an indirect one that depends on $\theta_{\rm MC}$, defined as 100 times the ratio of the sound horizon to the angular diameter distance, and is imposed at an intermediate stage. The effective prior on $h$ is therefore not flat. As discussed in Hildebrandt et al. (2017) and Troxel et al. (2018), however, this only has a small effect on the tails of the parameter constraints.
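The statement about differently shaped priors can be made explicit with a change of variables. As a worked illustration (the shorthand $\omega_{b} \equiv \Omega_{b} h^{2}$ is introduced here and is not notation from the analyses), a prior that is flat in $(\omega_{b}, h)$ transforms to the $(\Omega_{b}, h)$ plane through the Jacobian determinant, whose only nontrivial diagonal entry is $\partial\omega_{b}/\partial\Omega_{b} = h^{2}$:

```latex
p(\Omega_{b}, h)
  = p(\omega_{b}, h)\,
    \left|\frac{\partial(\omega_{b}, h)}{\partial(\Omega_{b}, h)}\right|
  \propto h^{2},
\qquad \omega_{b} \equiv \Omega_{b} h^{2}.
```

Within the prior boundaries, a flat prior on $\Omega_{b} h^{2}$ therefore weights the $\Omega_{b} - h$ plane by $h^{2}$ rather than uniformly, in addition to reshaping the boundaries of the allowed region.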
The three classes of nuisance parameters considered here are defined as follows.
• Intrinsic Alignment: Most current lensing surveys use the nonlinear alignment model (NLA) proposed by Hirata & Seljak (2004); Bridle & King (2007); Joachimi et al. (2011). The model assumes that the IA power spectra $P_{\rm II}$ and $P_{\rm GI}$ scale with the nonlinear power spectrum $P_{\delta}$ and can be redshift- and luminosity-dependent:
$$P_{\rm II}(k, z) = F^{2}(z)\, P_{\delta}(k, z), \qquad P_{\rm GI}(k, z) = F(z)\, P_{\delta}(k, z),$$
where
$$F(z) = -A_{\rm IA}\, C_{1}\, \rho_{\rm crit}\, \frac{\Omega_{\rm m}}{D_{+}(z)} \left(\frac{1 + z}{1 + z_{0}}\right)^{\eta} \left(\frac{\bar{L}}{L_{0}}\right)^{\beta}.$$
Here $A_{\rm IA}$ is a free parameter that dictates the amplitude of the effect, $C_{1} = 5 \times 10^{-14}\, h^{-2} M_{\odot}^{-1}\, {\rm Mpc}^{3}$ is a constant, $\rho_{\rm crit}$ is the critical density at redshift zero, and $D_{+}(z)$ is the linear growth factor that is normalized to 1 today. The power laws $\eta$ and $\beta$ determine the redshift and luminosity evolution of the IA effect, with $z_{0}$ and $L_{0}$ chosen as the anchoring redshift and luminosity. $\bar{L}$ is the mean luminosity of the sample. In the four surveys considered in this work, DLS varied $A_{\rm IA}$, $\eta$ and $\beta$, the "MID" case of CFHTLenS varied $A_{\rm IA}$ and $\eta$, while the "MIN" case of CFHTLenS, DES-SV and KiDS-450 only varied $A_{\rm IA}$$^{11}$.
• Photo-z Uncertainty: The n(z) estimation can be uncertain and one should marginalize over this uncertainty. We parametrize this uncertainty following the approach used in DLS, CFHTLenS and DES-SV (also see Huterer et al. 2006). That is, we assume the true redshift distribution $n(z)$ has the same shape as the measured redshift distribution $n_{\rm obs}(z)$, but has an uncertain shift in the mean of the distribution, $b_{z,i}$, for each redshift bin $i$ so that
$$n^{i}(z) = n^{i}_{\rm obs}(z - b_{z,i}).$$
The approach used in KiDS-450 is slightly different, where the variation in the n(z) itself and the correlation between the errors is accounted for directly. This is done by running a large number (750 is used in Hildebrandt et al. 2017) of chains for each cosmological inference, where each chain uses a different bootstrap sample of the n(z), and combining all the chains at the very end. As the current WLPIPE is not able to implement this operation, we calculate the standard deviation of the mean redshift for each of the 1000 bootstrap n(z)'s provided by the collaboration to be [0.036, 0.015, 0.010, 0.006] for each of the redshift bins, and use these values as the priors on the photo-z uncertainty in the same way as for the other surveys. We find that this approximation gives results consistent with the KiDS-450 approach. The one other subtle point is that in the DLS analysis, the photo-z biases are assumed to be 100% correlated across redshift bins.
Table 2. Free parameters in the cosmology inference used in Sec. 5.1, i.e. matching certain cases of the published results as closely as possible. The brackets indicate flat priors with [min, max] and the parentheses indicate Gaussian priors with (mean, standard deviation). We note that for CFHTLenS we choose to use the "fiducial" setting in Joudaki et al. (2017a) as the Baseline, which does not consider any systematic effects. In later analyses when we unify the analysis choices across surveys, the shear calibration bias, photo-z bias and IA amplitude will be allowed to vary.

• Shear Calibration Uncertainty: The shear measurements in each catalog can be uncertain due to imperfect calibration (Mandelbaum et al. 2015). A common way of parametrizing this uncertainty is to assume that the measured shear $\gamma_{\rm obs}$ scales linearly with the true shear $\gamma$ by a factor $(1 + m_{i})$ for each redshift bin $i$, plus an additive term $c_{i}$ (Heymans et al. 2006). That is,
$$\gamma_{\rm obs} = (1 + m_{i})\,\gamma + c_{i}.$$
As we will discuss in Sec. 5.1.4, the uncertainty in $m_{i}$ can either be incorporated at the parameter level or directly in the covariance matrix. We choose the former approach but show that the resulting cosmological constraints are identical (see Fig. B1 in Appendix B). The one other subtle point is that in the DLS analysis, the shear calibration uncertainties are assumed to be 100% correlated across redshift bins. Finally, all surveys we analyzed assume that any residual additive shear biases, $c_{i}$, are negligible for the scales used.
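The three classes of nuisance parameters listed above can be summarized in a compact sketch. All function names are hypothetical, and several conventions are assumptions made for this illustration rather than statements about any survey's code: the standard NLA amplitude, the value $\rho_{\rm crit} = 2.775 \times 10^{11}\, h^{2} M_{\odot}\, {\rm Mpc}^{-3}$, a placeholder pivot redshift, the sign convention $n(z) = n_{\rm obs}(z - b_{z})$ for the photo-z shift, and negligible additive shear bias.

```python
import math

# C_1 * rho_crit (dimensionless); rho_crit = 2.775e11 h^2 M_sun Mpc^-3 is assumed.
C1_RHO_CRIT = 5e-14 * 2.775e11


def nla_amplitude(z, a_ia, omega_m, growth, eta=0.0, beta=0.0, z0=0.3, lum_ratio=1.0):
    """NLA factor F(z), with P_GI = F * P_delta and P_II = F**2 * P_delta.

    growth is D_+(z) normalised to 1 today and must be supplied by the caller;
    z0 = 0.3 is a placeholder pivot redshift, lum_ratio stands for L_mean / L_0.
    """
    evolution = ((1.0 + z) / (1.0 + z0)) ** eta * lum_ratio ** beta
    return -a_ia * C1_RHO_CRIT * omega_m / growth * evolution


def shift_nz(n_obs, bias):
    """Return the shifted distribution n(z) = n_obs(z - bias) as a callable."""
    return lambda z: n_obs(z - bias)


def mean_redshift(n_of_z, z_min=0.0, z_max=4.0, steps=4000):
    """Mean redshift of a (not necessarily normalised) n(z), midpoint rule."""
    dz = (z_max - z_min) / steps
    zs = [z_min + (k + 0.5) * dz for k in range(steps)]
    norm = sum(n_of_z(z) for z in zs)
    return sum(z * n_of_z(z) for z in zs) / norm


def apply_shear_calibration(xi_true, m_i, m_j):
    """Scale a correlation function by (1 + m_i)(1 + m_j), additive bias neglected."""
    return (1.0 + m_i) * (1.0 + m_j) * xi_true


def toy_nz(z):
    """Hypothetical Gaussian n(z) used only to exercise the functions above."""
    return math.exp(-0.5 * ((z - 1.0) / 0.2) ** 2)


shifted = shift_nz(toy_nz, 0.05)
print(mean_redshift(shifted) - mean_redshift(toy_nz))
```

With this sign convention a positive $b_{z}$ moves the distribution to higher redshift, and the mean shifts by the same amount as long as the distribution is well contained in the integration range.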
In Sec. 5.4, we compare the cosmological constraints from the four datasets using the same priors on cosmological parameters and IA parameters. To see the effect of varying different combinations of the cosmological parameters discussed above, we run analyses for both the DES-SV priors and the KiDS-450 priors. For the photo-z and shear calibration parameters, however, we do not attempt to match between the surveys, as these are parameters that are characterized using the specific datasets. It would be incorrect to assume they have identical priors.
One final subtlety on the modeling side concerns the nonlinear matter power spectrum. Amongst the surveys considered here, DLS uses the Smith et al. (2003) HALOFIT power spectrum, CFHTLenS and KiDS-450 use the HMCODE power spectrum, which is based on Mead et al. (2015), and DES-SV uses the Takahashi et al. (2012) HALOFIT power spectrum. The difference in these power spectrum models can result in slightly shifted cosmological constraints, as discussed in Jee et al. (2016); Joudaki et al. (2017a,b); MacCrann et al. (2015). In this work we use the Takahashi et al. (2012) HALOFIT power spectrum.

Scale Cuts
In the four cosmic shear analyses, choices were made about which scales to use for the cosmological inference. The choices were often based on considerations of systematic effects and model uncertainties. In general, the minimum scale is determined by model uncertainties such as baryonic physics and the accuracy of the nonlinear power spectrum. The maximum scale cuts are usually related to survey-specific considerations such as the size of the footprint, additive shear bias, and super-sample covariance. The four surveys considered used different scale cuts, which are listed in Table 1. A few things to point out: For DLS, the same scale cuts were chosen for $\xi_+$ and $\xi_-$, though a discussion of how the scale cuts would change the cosmological constraints was presented in Jee et al. (2016). For CFHTLenS and KiDS-450, in addition to the motivations described above, scales with low signal-to-noise were also removed. Also, the use of smaller scales was justified since baryonic effects were modeled and marginalized over. For DES-SV, the scale cuts are redshift-dependent and the most conservative.
In our final joint analysis we aim for a uniform scale cut across all four datasets, to remove the differences between the four analyses that come from this decision. Since the different surveys have different redshift binning strategies, defining a unified set of scale cuts is not straightforward. We take the approach of choosing a set of scale cuts in physical units and propagating it into the corresponding angular scale cuts for all of the shear correlation functions. This choice is motivated by the fact that the main consideration that goes into the scale cuts is the uncertainty in the model on small scales (nonlinear power spectrum, baryonic effects, etc.). The scales on which these effects are important are usually related to the physical size of, for example, dark matter halos. In addition, for cosmic shear measurements, one is not measuring the matter distribution at the redshift of the source galaxies. Instead, it is the matter distribution in the foreground of the source galaxies that we are probing, specifically matter at the redshift where the lensing efficiency is high [Eq. (3)]. As a result, we choose the scale cuts by calculating the corresponding angular scale cut $\theta_{\min,\pm}$ for some given physical scale $R_{\min,\pm}$ at the redshift of the peak of the lensing kernel $z_p$. That is, for $\xi_\pm$, we use only angular scales
$$\theta > \theta_{\min,\pm} = \frac{R_{\min,\pm}}{D_A(z_p)}, \tag{12}$$
where $D_A(z_p)$ is the angular diameter distance to redshift $z_p$. The physical scale cuts $R_{\min,\pm}$ chosen in our common analysis are $R_{\min,+} = 1.3$ Mpc for $\xi_+$ and $R_{\min,-} = 11.4$ Mpc for $\xi_-$. These choices are equal to the most conservative scale cuts amongst the four datasets and very similar to the DES-SV scale cuts. We note that we use $R_{\min,\pm}$ to translate the angular scale cuts between different redshift ranges. The larger $R_{\min,-}$ mainly reflects the difference between the $J_0$ and $J_4$ Bessel functions in Eq. (1). We also note that for a more rigorous approach of using truly "linear scale" cuts, see Sec. 3.5 of Joudaki et al. (2017a).
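The conversion from a physical scale to an angular scale cut described above can be sketched as follows. This is a minimal illustration, not the WLPIPE implementation: it assumes a flat ΛCDM cosmology with illustrative parameter values, uses the physical angular diameter distance (the text does not spell out the distance convention), and the function names and the peak redshift $z_p = 0.5$ are hypothetical.

```python
import numpy as np

C_KM_S = 299792.458  # speed of light in km/s


def angular_diameter_distance(z, omega_m=0.3, h=0.7, n_steps=2048):
    """D_A(z) in Mpc for flat LCDM (illustrative parameter values, an assumption)."""
    zs = np.linspace(0.0, z, n_steps)
    integrand = 1.0 / np.sqrt(omega_m * (1.0 + zs) ** 3 + (1.0 - omega_m))
    dz = zs[1] - zs[0]
    # Trapezoidal rule for the integral of dz / E(z)
    integral = dz * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    comoving = C_KM_S / (100.0 * h) * integral  # comoving distance in Mpc
    return comoving / (1.0 + z)


def theta_min_arcmin(r_min_mpc, z_peak, **cosmo):
    """Angular cut theta_min = R_min / D_A(z_peak), converted to arcminutes."""
    theta_rad = r_min_mpc / angular_diameter_distance(z_peak, **cosmo)
    return np.degrees(theta_rad) * 60.0


# Hypothetical peak redshift of the lensing kernel for one tomographic bin
z_p = 0.5
theta_plus = theta_min_arcmin(1.3, z_p)    # cut for xi_+
theta_minus = theta_min_arcmin(11.4, z_p)  # cut for xi_-
print(f"theta_min,+ = {theta_plus:.2f} arcmin, theta_min,- = {theta_minus:.2f} arcmin")
```

Because both cuts share the same $D_A(z_p)$, their ratio is fixed at $R_{\min,-}/R_{\min,+}$ for every bin pair; only the overall angular scale changes with the peak redshift of the lensing kernel.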

Cosmological Constraints and Comparison Metrics
To obtain cosmological constraints, we vary the full set of cosmological and nuisance parameters $p$ using a Monte Carlo approach. We assume a Gaussian likelihood, so that the posterior is the prior multiplied by $e^{-\chi^2/2}$, with
$$\chi^2 = \sum_{i,j} \left[d_i - t_i(p)\right] \left(C^{-1}\right)_{ij} \left[d_j - t_j(p)\right],$$
where $d_i$ ranges over all data points, $t_i(p)$ is the theoretical prediction given the set of parameters, and $C$ is the covariance matrix. We use the MULTINEST Monte Carlo sampler (Feroz et al. 2009) implemented in COSMOSIS, which has been shown in DES Collaboration et al. (2017) and Krause et al. (2017) to agree very well with other sampling methods such as EMCEE as well as the COSMOLIKE inference code (Krause & Eifler 2017). The cosmic shear experiments studied in this paper effectively constrain one or at most two cosmological parameters, depending on choices to be discussed below. The parameter that is most tightly constrained is (Jain & Seljak 1997)
$$S_8 \equiv \sigma_8 \left(\frac{\Omega_{\rm m}}{0.3}\right)^{\alpha},$$
where $\alpha \sim 0.5$ denotes the degeneracy direction in the $\Omega_{\rm m}$-$\sigma_8$ plane, so that $S_8$ gives the most constraining direction of the dataset. The particular value of $\alpha$ depends somewhat on the details of the data and modeling choices. In most existing cosmological analyses, a customary choice is to set $\alpha = 0.5$. However, this could lead to slightly misleading results when comparing different datasets, as not all of them would yield the most constraining $S_8$ with this choice of $\alpha$. In the following analysis, we will use $\alpha = 0.5$ as our fiducial value but discuss in Sec. 5.6 the effect of changing $\alpha$.
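As a rough sketch of the quantities defined above, the snippet below evaluates a Gaussian $\chi^2$, the corresponding unnormalized log-posterior, and $S_8$ for a given $\alpha$. The function names and the toy data vector are hypothetical and are not taken from COSMOSIS or WLPIPE; the log-posterior omits the normalization constant, which does not affect sampling when the covariance is fixed.

```python
import numpy as np


def chi_squared(data, theory, cov):
    """chi^2 = (d - t)^T C^{-1} (d - t) for a data vector d and model t."""
    resid = np.asarray(data, dtype=float) - np.asarray(theory, dtype=float)
    # Solve C x = r rather than forming the explicit inverse of C
    return float(resid @ np.linalg.solve(np.asarray(cov, dtype=float), resid))


def log_posterior(data, theory, cov, log_prior=0.0):
    """Unnormalized log-posterior: log(prior) - chi^2 / 2."""
    return log_prior - 0.5 * chi_squared(data, theory, cov)


def s8(sigma8, omega_m, alpha=0.5):
    """S_8 = sigma_8 * (Omega_m / 0.3)^alpha."""
    return sigma8 * (omega_m / 0.3) ** alpha


# Toy two-point data vector with a diagonal covariance (made-up numbers)
d = [1.0, 2.0]
t = [0.5, 2.5]
C = np.diag([0.25, 0.25])
print(chi_squared(d, t, C), log_posterior(d, t, C), s8(0.8, 0.27))
```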
In the next section, we will focus our comparison on four quantities:
• Signal-to-noise (S/N): This is simply … and it quantifies the statistical significance of the observables.
• Goodness of fit ($\chi^2/\nu$, p.t.e.): For the best-fit data vector $D$, we can calculate the $\chi^2$ per effective number of degrees of freedom $\nu$, and the corresponding probability-to-exceed (p.t.e.). It is important to evaluate the goodness of fit for each of the chains in parallel to check for consistency. One disadvantage of using the goodness-of-fit is that the determination of the number of degrees of freedom in a high-dimensional space is not straightforward. However, for this work the length of the data vector usually dominates over the number of model parameters.
• 1D distance in $S_8$ ($\Delta S_8$): We calculate the ratio of the absolute difference between the mean parameter values in the two experiments to the uncertainty in the difference. For two experiments $a$ and $b$, we thus have
$$\Delta S_8 = \frac{\left|S_8^{a} - S_8^{b}\right|}{\sqrt{\sigma^2(S_8^{a}) + \sigma^2(S_8^{b})}},$$
where $\sigma(S_8^{a,b})$ is the uncertainty on $S_8$ in each experiment. $\Delta S_8$ can roughly be interpreted as an $n$-$\sigma$ difference in $S_8$ for the two experiments. This metric inherently assumes Gaussianity in the $S_8$ posterior and ignores possible tensions in other parameter projections. It can also overestimate the inferred disagreement when there are strong degeneracies in other parameter dimensions.
• Logarithmic Bayes Factor (BF): Based on Marshall et al. (2006), we consider the logarithmic ratio of the evidence for the two hypotheses: first that the two experiments are measuring the same cosmological parameters and second that they are measuring different cosmological parameters. That is, we calculate
$$\mathrm{BF} = \log \frac{\int dp\, P_a(p)\, P_b(p)}{\int dp\, P_a(p) \int dp\, P_b(p)}. \tag{17}$$
Here the posteriors (including the priors) for each experiment, $P_{a,b}$, are integrated over all parameters $p$. To properly interpret the BF values, one should evaluate them for cases where the two datasets share the same priors. As a result, we only calculate this at the end of the paper when all analysis choices are unified. We use the criterion BF > −1 to determine whether two surveys are consistent and can be combined. When BF < −1, the Jeffreys scale (Jeffreys 1961) suggests that there is effectively no evidence that the two datasets can be described by the same model. We note, however, that the BF metric is sensitive to the priors on the constrained parameters, and is usually biased towards consistency (Raveri & Hu 2018).
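Two of the metrics above are simple enough to sketch directly. The example below is illustrative only: it assumes that the uncertainty in the difference is the quadrature sum of the two (symmetrized) uncertainties, that the evidences are available as natural-log values from a nested sampler, and that the BF is the joint log-evidence minus the sum of the individual ones. The function names and the log-evidence numbers are hypothetical; the $S_8$ inputs are the fiducial DLS and CFHTLenS values quoted in the abstract.

```python
import math


def symmetric_sigma(upper, lower):
    """Average the upper and lower 1-sigma errors (a simplifying assumption)."""
    return 0.5 * (upper + lower)


def delta_s8(mean_a, sigma_a, mean_b, sigma_b):
    """|S8_a - S8_b| divided by the quadrature sum of the uncertainties."""
    return abs(mean_a - mean_b) / math.hypot(sigma_a, sigma_b)


def log_bayes_factor(ln_z_joint, ln_z_a, ln_z_b):
    """Log evidence ratio: joint fit versus two independent fits."""
    return ln_z_joint - (ln_z_a + ln_z_b)


def is_consistent(bf, threshold=-1.0):
    """Apply the BF > -1 criterion used in the text."""
    return bf > threshold


# S8 constraints quoted in the abstract for the fiducial analysis
dls = (0.94, symmetric_sigma(0.046, 0.045))
cfhtlens = (0.66, symmetric_sigma(0.070, 0.071))
tension = delta_s8(*dls, *cfhtlens)
print(f"Delta S8 (DLS vs CFHTLenS) = {tension:.2f} sigma")

# Made-up log-evidences, purely to show the call pattern
bf = log_bayes_factor(ln_z_joint=-10.0, ln_z_a=-4.0, ln_z_b=-5.5)
print(bf, is_consistent(bf))
```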

RESULTS
In this section we present the main results of this paper. In Sec. 5.1 we present results from the Baseline case: we set out to reproduce the results from the four published papers and discuss in detail the remaining differences between our reproduction and the published results, which we refer to as the Published Baseline. We also calculate several comparison metrics in order to understand the internal (external) consistency within (between) the four datasets. In Sec. 5.2, Sec. 5.3 and Sec. 5.4, we investigate individually the effect of changing the covariance estimation, the scale cuts, and the priors on cosmological parameters together with the intrinsic alignment treatment. In Sec. 5.5 we unify the analysis choices and reexamine the comparison metrics. In Sec. 5.6 we discuss how the definition of $S_8$ may affect the comparison between the surveys.
Figure 4. The 2-point functions $\theta\xi_\pm(\theta)$ as measured by WLPIPE from the catalogs provided by the collaborations, compared with the results obtained by the collaborations themselves. For visualization purposes we only show the auto-correlation functions for the lowest and the highest redshift bins, and the colored data points are slightly displaced from the black points. From left to right in each panel are $\xi_+$ for the lowest redshift bin, $\xi_+$ for the highest redshift bin, $\xi_-$ for the lowest redshift bin, and $\xi_-$ for the highest redshift bin. From top to bottom are the four surveys: DLS, CFHTLenS, DES-SV, and KiDS-450. Since the catalogs from DLS are not public, only the collaboration 2-point functions are shown in the top panel. We also note that the difference in the angular binning discussed in Sec. 5.1 is not shown in this plot, but is explained more clearly in Appendix A.
Throughout, we will also use the term Nominal Baseline to refer to the nominal analysis results that each collaboration uses as their most representative cosmological constraints, which for the case of CFHTLenS and KiDS-450 can be slightly different from the Published Baseline in terms of the treatment of systematic effects.

Baseline: Reproducing Literature Results
The most basic test is a comparison of the literature results with the WLPIPE measurements using the same catalogs under the same assumptions.
First, we examine the intermediate output of the measured $\xi_\pm$ functions. Fig. 4 shows $\xi_+(\theta)$ and $\xi_-(\theta)$ produced by WLPIPE using the same binning and angular scales chosen by the collaborations, overlaid on top of results obtained by the collaborations for comparison. We find excellent agreement in all cases for the values of $\xi_\pm$.$^{12}$ Note that for DLS, CFHTLenS and KiDS-450, the angular values for each data point assigned by WLPIPE differ from the paper-provided data vectors. This, as we discussed in Sec. 4.1, is due to the fact that those paper-provided data vectors used the center of each angular bin instead of the area-weighted center. We show how this propagates into a bias in the cosmological constraints in Appendix A.
Fig. 5 shows the constraints obtained from WLPIPE for each experiment compared with those obtained by the collaborations themselves, using the same binning, parameters, priors, and covariance matrices used to obtain the published results. In doing this we aim to reproduce the published results. However, Fig. 5 shows that there are differences between the published results and the WLPIPE results, which we discuss in detail in the following subsections. The COSMOSIS configuration files and data files for these Baseline results are publicly available (Chang 2018).
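To give a feel for the size of the bin-center difference mentioned above, the sketch below compares three ways of assigning an angle to a bin. It assumes that the "area-weighted center" is the mean of $\theta$ weighted by the annulus area element $2\pi\theta\,d\theta$; the exact definition used in the paper is given by its Eq. (5), which may differ in detail. The function names and the example bin edges are hypothetical.

```python
import math


def area_weighted_center(theta_lo, theta_hi):
    """Mean of theta over an annulus, weighted by the area element 2*pi*theta*dtheta."""
    return (2.0 / 3.0) * (theta_hi**3 - theta_lo**3) / (theta_hi**2 - theta_lo**2)


def arithmetic_center(theta_lo, theta_hi):
    """Midpoint of the bin in linear theta."""
    return 0.5 * (theta_lo + theta_hi)


def geometric_center(theta_lo, theta_hi):
    """Midpoint of the bin in log theta."""
    return math.sqrt(theta_lo * theta_hi)


# Hypothetical angular bin spanning 10 to 20 arcmin
lo, hi = 10.0, 20.0
print(geometric_center(lo, hi), arithmetic_center(lo, hi), area_weighted_center(lo, hi))
```

For any bin the area-weighted value is the largest of the three, because the outer part of an annulus contains more area (and hence more galaxy pairs) than the inner part.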

DLS
From the upper left panel of Fig. 5, we see that the Published Baseline constraints from DLS are about 0.7σ higher in S 8 and 0.5σ higher in Ω m than the Baseline constraints obtained via WLPIPE.
$^{12}$ … per-object per-patch multiplicative bias correction instead of a constant for each tomographic bin used in Hildebrandt et al. (2017). We have checked that this does not affect the rest of the analysis.
Differences in angular binning cannot be an issue here, since we are using the collaboration-computed $\xi_\pm$. Two differences in the analysis explain the offset: First, the nonlinear power spectrum used in the original DLS analysis of Jee et al. (2016) comes from an older version of HALOFIT (Smith et al. 2003), while in COSMOSIS we use the nonlinear power spectrum of Takahashi et al. (2012). … Jee et al. (2016) in WLPIPE, we are assuming no IA in the WLPIPE case. According to Figure 12 of Jee et al. (2016), this results in a $\sim 0.02$ lower $\Omega_{\rm m}$ (with approximately the same $S_8$). Accounting for these two factors brings the two contours into better agreement: the WLPIPE reproduction gives a slightly lower $\Omega_{\rm m}$, but almost exactly the same $S_8$ compared to the published results.

CFHTLenS
From the upper right panel of Fig. 5, we see that the published constraints from CFHTLenS are consistent with WLPIPE in both the $\Omega_{\rm m}$ and $S_8$ directions. We note that we have chosen to compare with the "fiducial" chain in Joudaki et al. (2017a), which does not include IA, baryons, photo-z uncertainties or shear calibration uncertainties. Three factors need to be accounted for here: First, the angular values used in the paper-provided chains (the center of the bin) are different from those in the WLPIPE chain [Eq. (5)]. As we show in Appendix A, using the area-weighted angular values would shift the contours up by about $0.4\sigma$. Second, whereas COSMOSIS uses the Takahashi et al. (2012) model in HALOFIT, Joudaki et al. (2017a) used the slightly more accurate HMCODE (Mead et al. 2016) for the nonlinear power spectrum. As shown in Figure 10 of Joudaki et al. (2017a), the HMCODE version used at that time moves the contour higher in $S_8$ by about $0.4\sigma$ compared to HALOFIT. These first two effects cancel, bringing the paper-provided chains and the WLPIPE reproduction to perfect agreement. The final difference in our approaches is more subtle. As we noted in Sec. 4.3, CFHTLenS and KiDS-450 use COSMOMC, which does not sample $h$ directly. Instead, it samples a wide flat prior in $\theta_{\rm MC}$ (which is connected to $h$) and imposes the $h$ priors after the fact. This means that the real $h$ prior in the paper-provided chains is not exactly flat. This difference has been found to be small (Troxel et al. 2018).
In Sec. 5.1-5.4, we compare with the "fiducial" case in Joudaki et al. (2017a) for simplicity. This assumes no systematic uncertainties, which according to …

DES-SV
We expect the WLPIPE reproduction of the DES-SV Published Baseline results to be perfect up to noise in the sampling, since the analysis pipeline is almost identical in the two analyses (WLPIPE uses slightly updated versions of TreeCorr, COSMOLIKE and COSMOSIS compared to those used in DES Collaboration et al. 2016). As shown in the lower left panel of Fig. 5, this is indeed the case: the two contours agree very well.

KiDS-450
From the lower right panel of Fig. 5, we see that the Published Baseline constraints from KiDS-450 agree with the Baseline constraints from WLPIPE in the $\Omega_{\rm m}$ direction and are about $0.9\sigma$ higher in the $S_8$ direction. Several factors contribute to this discrepancy at different levels. First, the angular values used in the paper-provided chains (the center of the bin) are different from those in the WLPIPE chain [Eq. (5)]. Changing the bin values shifts the paper-provided chains up by about $0.4\sigma$, as shown in Fig. A1. Second, similar to CFHTLenS, the paper-provided chain uses HMCODE for the nonlinear power spectrum while WLPIPE uses HALOFIT. However, while Joudaki et al. (2017a) used the original version of HMCODE (Mead 2015), a newer version of HMCODE (Mead et al. 2016) was used in Hildebrandt et al. (2017). In this newer version, the fitting parameters were updated to allow for better fits when considering massive neutrino cosmologies, at the expense of slightly worse fits in standard ΛCDM. This newer version of HMCODE agrees more closely with HALOFIT, and the resulting parameter constraints from KiDS-450 when using either prescription are almost identical (when excluding baryonic feedback). Third, similar to CFHTLenS, $\theta_{\rm MC}$ is varied in the analysis while $h$ is a derived parameter. Fourth, the covariance used in Hildebrandt et al. (2017) is designed to include the marginalization … Hildebrandt et al. (2017). We have checked, however, that this does not generate any noticeable effect in the cosmological constraints. In summary, we are able to reproduce the Hildebrandt et al. (2017) results in both $\Omega_{\rm m}$ and $S_8$ when considering these factors. We note here that the fiducial analysis of Hildebrandt et al. (2017) includes modeling of the baryonic effects on small scales whereas we do not here. As a result we compare with their DIR chain, which, as shown in Fig. 8 of Hildebrandt et al. (2017) and Fig. 13, gives an $S_8$ value $0.3\sigma$ lower than the Nominal Baseline case. Later when we unify the analysis choices, since we make much more conservative scale cuts, we do not expect the effect of baryons to be important.
Table 3. Comparison metrics for all pairs of surveys in the Baseline analysis case: WLPIPE chains that are designed to match the published analyses, or the Published Baseline case. For the $S_8$ values, we list the mean and the 16% and 84% confidence intervals. We note that here we have used the different analysis choices based on each of the collaborations, so these metrics are not on equal footing. Later in Table 6 we show similar metrics that can be compared directly.

Comparison of all four surveys
The right panel of Fig. 6 shows the Baseline results from the four experiments using WLPIPE in one plot, i.e., we overlay the colored contours in Fig. 5 together. We note that here we have used the different analysis choices based on each of the collaborations, therefore the four contours cannot be compared on an equal footing. In this picture, we find good agreement between the four surveys in the $\Omega_{\rm m}$-$S_8$ plane, with CFHTLenS slightly lower than the other three surveys. DES-SV has the largest contour (weakest constraining power), whereas the other three surveys have contours of similar sizes. The degeneracy directions of the four surveys are somewhat different, as expected from the different redshift ranges they probe. For comparison, we also show in the left panel of Fig. 6 the Published Baseline results from the corresponding survey-provided chains, i.e., the four grey dashed contours in Fig. 5. The main differences from Fig. 5 are (1) the shift of the KiDS-450 contours in the $S_8$ direction, which comes from the change in the angular bin values and the covariance, as we discussed in Sec. 5.1.4 above, and (2) the shift of the DLS contours to lower $S_8$ due to the change in the nonlinear power spectrum and the IA model, as we discussed in Sec. 5.1.1 above. This can also be seen more clearly by comparing the Published Baseline and Baseline cases in Fig. 13.
We list the comparison metrics (as described in Sec. 4.5) for all the surveys as well as combinations of survey pairs for the chains in the Baseline case in Table 3. First, looking at the S/N, we notice that in the data configuration used in the individual surveys, the raw statistical power of the measurement is similar for DLS and CFHTLenS, while DES-SV has about half the S/N and KiDS-450 is in between. One interesting observation is that DLS achieves the high S/N even with a significantly smaller area, which highlights the power of having high-redshift data. A slightly worrying point is that the goodness-of-fit values for DLS and CFHTLenS are quite low. For the pair-wise $\Delta S_8$, we find trends reflecting what is seen from the figures: all four surveys are broadly consistent, with Table 3 showing some low-level discrepancies ($1.5\sigma$) in $S_8$ between CFHTLenS and DLS. For the Published Baseline chains, we list the $S_8$ constraints and $\Delta S_8$ values in Table 4. We do not list the goodness-of-fit here since the values are not all available in the papers, and are not directly comparable with the values in Table 3. We just quote two numbers that are available: in Joudaki et al. (2017a), the reduced $\chi^2$ for the fiducial CFHTLenS analysis best-fit is 1.5, whereas in Hildebrandt et al. (2017), the reduced $\chi^2$ for the fiducial KiDS-450 analysis …
Table 5. $S_8$ constraints, S/N and goodness of fit when we change one analysis choice at a time in the analysis pipeline from the Baseline case (see Table 3). For the $S_8$ values, we list the mean and the 16% and 84% confidence intervals. The sections of this table correspond to discussions in Sec. 5.2, Sec. 5.3 and Sec. 5.4.

Effect of the Covariance Matrix
Now we investigate the effect of the covariance matrix estimation. As discussed in Sec. 4.2, the four surveys have different approaches to covariance estimation. We eliminate these differences by generating a Gaussian analytical COSMOLIKE covariance matrix for each survey. Fig. 7 shows the changes in the contours in the four experiments when analytic covariance matrices are used in place of those provided by the collaborations. The corresponding comparison metrics are listed in Table 5. We notice shifts of the contours in the $S_8$ constraints for some of the surveys. Overall, the Gaussian analytic covariance leads to slightly tighter constraints compared to covariance matrices estimated from simulations. This could be partially due to the fact that we have not accounted for the non-Gaussian piece of the analytic covariance.
For DLS, we see a significant shift in the mean of the constraints towards higher $S_8$ values; DES-SV and CFHTLenS also show some shifts in $S_8$, but less significant. We note that, since the data vector is noisy, we do not expect the contours to agree exactly. However, we believe the shift for DLS is more than what is expected from statistical fluctuation. The DLS field is much smaller and contains a lower level of shape noise compared to the other surveys. In addition, one of the fields contains a galaxy cluster. These factors mean that the covariance is challenging to model and the simple Gaussian covariance used here may not be a good approximation for the dataset. It is possible that neither the survey-provided covariance nor the Gaussian COSMOLIKE covariance from WLPIPE captures these complications. We also note that for the three cases where a simulation covariance is used, DES-SV has the smallest Hartlap factor ($H_{\rm DLS} = 0.88$, $H_{\rm CFHTLenS} = 0.86$, $H_{\rm DES-SV} = 0.7$). This means that the inverse of the simulation covariance in DES-SV is expected to be noisier (but unbiased) compared to the other two simulation covariances (Dodelson & Schneider 2013).
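For reference, the Hartlap factor quoted above is conventionally computed from the number of simulations and the length of the data vector. The sketch below uses the standard expression $H = (N_{\rm sim} - N_{\rm data} - 2)/(N_{\rm sim} - 1)$, which we assume is the definition intended here; the function names and the example numbers are hypothetical and are not the values used by any of the surveys.

```python
import numpy as np


def hartlap_factor(n_sims, n_data):
    """Standard debiasing factor for the inverse of a simulation-based covariance."""
    if n_sims <= n_data + 2:
        raise ValueError("need n_sims > n_data + 2 for a usable inverse covariance")
    return (n_sims - n_data - 2) / (n_sims - 1)


def debiased_inverse_covariance(cov_sim, n_sims):
    """Invert a sample covariance and rescale the result by the Hartlap factor."""
    cov_sim = np.asarray(cov_sim, dtype=float)
    return hartlap_factor(n_sims, cov_sim.shape[0]) * np.linalg.inv(cov_sim)


# Hypothetical case: 100 simulations and a 20-element data vector
print(hartlap_factor(100, 20))
```

The closer $H$ is to 1, the more simulations there are relative to the number of data points, and the less noisy the estimated inverse covariance.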
Finally, it is also worth noting that since the survey-provided covariance from KiDS-450 is also an analytic covariance matrix, the agreement between the dashed and the solid contours in the bottom right of Fig. 7 is a good check on the analytic calculation for the covariance. We have checked that the slightly smaller contours from WLPIPE partially reflect the difference between the Gaussian and non-Gaussian covariance.

Effect of Scale Cuts
In this section we investigate the effect of scale cuts. Following Sec. 4.4, we choose to match all scale cuts to the most conservative scale cuts in the four datasets (R min,+ >1.3 Mpc and R min,− > 11.4 Mpc, see Eq. (12)). The results are shown in Fig. 8, with the corresponding metrics listed in Table 5. The exact cuts used in each bin are tabulated in Appendix C, Table C1. In all these tests, everything else in the analysis stays the same as the Baseline case in Sec. 5.1.
In Fig. 8, the first thing that draws the eye is the DLS contours, which shift to very large Ω m values, as well as a higher S 8 . All the other surveys appear consistent with the original case in Fig. 6, but with looser constraints due to the fact that we have removed information.
We note that the goodness-of-fit for DLS improved significantly when applying the conservative scale cuts compared to the Baseline case. After a more careful look at the DLS measurements, it appears that the small-scale data points for $\xi_-$ are the source of the contour shift: those data points prefer a lower amplitude compared to the rest of the data points. Therefore, when applying the conservative scale cuts, the model amplitude increases (as does $S_8$), and the goodness-of-fit improves. This could also be a hint that the small-scale covariance is underestimated; as already discussed in Sec. 5.2, the characteristics of the DLS data make it difficult to model the covariance. We note that some of these issues were discussed in Jee et al. (2013) and Jee et al. (2016), and a similar trend in $S_8$ was seen in Fig. 13 of Jee et al. (2016). Here we caution that since the DLS contours are far from $\Omega_{\rm m} = 0.3$ and clipped by the priors ($\Omega_{\rm m} < 1$), the $S_8$ values quoted are not very meaningful.

Impact of Cosmological Priors and IA Treatment
Next, we consider the impact of different cosmological priors and IA treatments. To address this, we impose identical priors on all surveys, first using those from DES-SV and then those from KiDS-450 (see Table 2), since they roughly represent the two approaches to handling the parameters: DES-SV has relatively conservative priors and uses the parametrization $[\Omega_{\rm m}, \Omega_{\rm b}, h, \sigma_8, n_s]$, whereas KiDS-450 has more restrictive priors and uses the parametrization $[\Omega_{\rm c}h^2, \Omega_{\rm b}h^2, h, \ln(10^{10}A_s), n_s]$. We moreover allow for intrinsic alignments in the case of CFHTLenS and DLS. For all surveys, we consider either the IA amplitude prior $-5 < A_{\rm IA} < 5$ used by DES-SV or the prior $-6 < A_{\rm IA} < 6$ used by KiDS-450. Note that aside from these changes to the cosmological priors and IA treatments, we keep all other analysis choices the same as in the Baseline case of Sec. 5.1. The two panels of Fig. 9 show the effect of unifying the cosmological priors and IA treatments relative to the Baseline choices, with the corresponding metrics listed in Table 5. Looking at CFHTLenS, DES-SV and KiDS-450, it is apparent that the constraints in the $\Omega_{\rm m}$ direction are largely dominated by cosmological priors. Specifically, the prior on $h$, which is wider for DES-SV compared to KiDS-450, leads to large changes in the $\Omega_{\rm m}$ posterior. The constraints on $S_8$, on the other hand, are relatively robust to cosmological priors, consistent with previous findings (Kilbinger et al. 2013; Joudaki et al. 2017a). This again shows that cosmic shear measurements for these four datasets mainly constrain the amplitude of the power spectrum and not its detailed shape. The uncertainty on $S_8$ decreases for CFHTLenS when moving to tighter cosmological priors; however, this is largely due to the fact that the $S_8$ definition here is not optimal for the CFHTLenS dataset. We will discuss this point in Sec. 5.6.
A few other effects of the cosmological priors and IA treatment are visible in Fig. 9. First, for DLS, when imposing the DES-SV priors, $\Omega_{\rm m}$ moves to high values while $S_8$ remains roughly the same. When imposing KiDS-450 priors, the $\Omega_{\rm m}$ constraints appear similar to the Baseline case. This behavior, together with what is shown in Sec. 5.3, suggests that the DLS constraints on $\Omega_{\rm m}$ are sensitive to the scales used and the priors. For CFHTLenS, the $S_8$ constraints move to lower values using both DES-SV and KiDS-450 priors. This comes from the fact that compared to the Baseline case, here there is additional freedom in the IA amplitude. We examine the IA amplitude when using the KiDS-450 priors, as shown in Fig. 10, and find that CFHTLenS favors a negative IA amplitude at the $2\sigma$ level. This, based on previous work in measurements of IA, suggests that we may be fitting to some systematic effects that appear to behave like IA (Kilbinger et al. 2017; Choi et al. 2016; van Uitert et al. 2018). This is consistent with Fig. 8 and Fig. 9 of Joudaki et al. (2017a), where they show that this negative IA shifts the $S_8$ constraints to lower values. There is also a similar (but less severe) trend in the DLS data.
Figure 9. Impact of cosmological priors and IA treatments compared to Baseline (right panel of Fig. 6): we show the marginalized constraints for $\Omega_{\rm m}$ and $S_8 \equiv \sigma_8(\Omega_{\rm m}/0.3)^{0.5}$ when unifying the priors on the cosmological and IA parameters. We first unify to the KiDS-450 priors (top), then to the DES-SV priors (bottom). The constraints in the $\Omega_{\rm m}$ direction are heavily affected by the priors, while in the $S_8$ direction, there is a larger effect for surveys with a strong degeneracy in the $\Omega_{\rm m}$-$S_8$ plane. We note that the four contours should not be compared directly here, as the analysis choices are not unified.
Figure 10. Constraints on IA amplitude: we show the marginalized constraints for $\Omega_{\rm m}$ and $A_{\rm IA}$ when unifying the priors to the KiDS-450 priors (keeping all other analysis choices as in the Baseline case). We find a degeneracy between $A_{\rm IA}$ and $\Omega_{\rm m}$ for DLS and CFHTLenS. We also find that both these surveys prefer negative IA amplitudes in this set-up.

Common Covariances, Angular Scale Cuts, Cosmological Priors, and IA Treatments
After investigating the individual effects in Sec. 5.2, Sec. 5.3 and Sec. 5.4, we now combine all of them and perform a uniform analysis on all four surveys. We study two cases, both using COSMOLIKE Gaussian covariances, conservative scale cuts, and the same IA treatments, with two sets of priors: (i) KiDS-450 priors and (ii) DES-SV priors.
As discussed in Sec. 5.1.2, in this subsection we incorporate the photo-z and shear calibration bias uncertainties for CFHTLenS. As summarized in Sec. 4.3 of Kilbinger et al. (2017), a number of improvements to CFHTLenS have been identified since the public release of the catalogues in 2013. Of importance to this study are the analysis by Choi et al. (2016), who showed that significant biases existed in the reported photo-z distributions; the result from Kuijken et al. (2015) that the CFHTLenS shear calibration corrections were in general underestimated; and the finding by Fenech Conti et al. (2017) that the previously unexplored area of galaxy selection bias results in a few percent overestimation of the shear calibration correction. The conclusion of all these works was that any future analyses of CFHTLenS should include conservative systematic error terms to account for these effects. In this section, we therefore marginalize over an uncertainty in the mean redshift of each bin with a zero-mean top-hat prior of full-width 0.2, and over an uncertainty in the shear calibration correction with a zero-mean top-hat prior of full-width 0.1.
In Fig. 11 we show the comparison between the Published Nominal contours and case (i) listed above. The Published Nominal contours present the view one would have on the four cosmic shear surveys after reading the individual papers (Jee et al. 2016;Joudaki et al. 2017a;DES Collaboration et al. 2016;Hildebrandt et al. 2017), while the Matched contours present what the cosmological constraints are when analysed through a unified analysis framework. One can also compare with Fig. 6 to understand the nature of the different changes in the contours. The contours for (ii) are shown in Fig. 12. We choose to focus on (i) here as it is not as affected by the S 8 definition (see Sec. 5.6) as (ii). The four surveys in the right panel of Fig. 11 can now be compared on equal footing. The comparison metrics are shown in Table 6.
In general, we observe the same effects seen individually in Sec. 5.2, Sec. 5.3 and Sec. 5.4. But when put together, the discrepancies between the different surveys coming from the different effects accumulate and become larger. Looking at Fig. 11 and the $\Delta S_8$ statistics in Table 6, we find that essentially none of the surveys have $S_8$ constraints that agree within $1\sigma$, and the extreme cases differ by more than $3\sigma$. If we look at the change in the $S_8$ constraints for the individual surveys from the left panel to the right panel, it is clear that the main effect is that DLS moves to larger $S_8$. CFHTLenS is consistent with the MIN and MAX cases but not the MID case, which is expected given that the choice of IA models we use is the same as in the MIN case, and that the MAX case has little constraining power. DES-SV and KiDS-450 stay roughly the same.
Next we turn to the other statistics in Table 6. We note that the signal-to-noise for the four datasets changes slightly, but the relative power stays roughly the same, with DLS being the highest and DES-SV being the lowest. We note that the goodness-of-fit for DLS and CFHTLenS improved from the Baseline case but is still quite low. The largest $\Delta S_8$ is about $3.4\sigma$ between DLS and CFHTLenS, which is also apparent from Fig. 11. Next we look at the BF statistic [Eq. (17)] between pairs of surveys. Here, when evaluating the numerator in Eq. (17) for BF, we only require the cosmological parameters to be shared amongst the two experiments being compared and keep the IA amplitude, shear calibration parameter and photo-z uncertainty separate. We find that the message from the BF statistics is similar to that captured by the $\Delta S_8$ metric in this case, though the message of consistency/inconsistency is somewhat weaker: the … −0.013, we find reasonably consistent results with roughly $1\sigma$ lower $S_8$. These results are in good agreement with those found in Troxel et al. (2018).
Figure 12. Matched (ii): same as the upper right panel of Fig. 11, but now using DES-SV priors.

A side note on the $S_{8}$ definition
As discussed briefly in Sec. 4.5, $S_{8}$ is defined as $\sigma_{8}(\Omega_{\rm m}/0.3)^{\alpha}$, where α is chosen to remove the degeneracy between $\sigma_{8}$ and $\Omega_{\rm m}$. That is, if α is chosen optimally, $S_{8}$ characterizes the direction orthogonal to the $\Omega_{\rm m}$-$\sigma_{8}$ contours. For datasets with different redshift distributions, the optimal α is different.
Throughout our analysis, we have fixed α to be 0.5, which may not be optimal for all datasets. This implies that for datasets where the optimal α is further from 0.5, the projected uncertainties on $S_{8}=\sigma_{8}(\Omega_{\rm m}/0.3)^{0.5}$ are going to be slightly larger than if the optimal α were used, and that when comparing the different surveys they will tend towards being consistent. This can be seen clearly in Fig. 12, where the contours for CFHTLenS and KiDS-450 are tilted, leading to larger uncertainties in the $S_{8}$ direction. The effect is much reduced when a tighter prior is imposed, as in the right-hand panels of Fig. 11. Roughly, we find that with the DES-SV priors (corresponding to Fig. 12), the optimal α values are 0.56 (DLS), 0.71 (CFHTLenS), 0.51 (DES-SV) and 0.67 (KiDS-450). With the KiDS-450 priors (corresponding to the right panels of Fig. 11), the optimal α values are 0.52 (DLS), 0.52 (CFHTLenS), 0.52 (DES-SV) and 0.58 (KiDS-450). That is, we expect the discrepancies between the surveys in the single parameter that quantifies the amplitude to be sensitive to the priors, and likely larger if an optimal α is used. On the other hand, the BF metric is insensitive to the choice of α and so is a more robust measure of consistency.
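To make the role of α concrete, the following sketch estimates an optimal α from posterior samples and compares the resulting scatter in $S_{8}$ with that obtained for the fixed value α = 0.5. It is a minimal illustration under stated assumptions, not the procedure used in this work: the samples are synthetic (a toy degeneracy with slope 0.7 and 3% scatter), and the optimal α is taken to be the value that minimises the variance of $\ln[\sigma_{8}(\Omega_{\rm m}/0.3)^{\alpha}]$, which has the closed form $\alpha=-{\rm Cov}(x,y)/{\rm Var}(x)$ with $x=\ln(\Omega_{\rm m}/0.3)$ and $y=\ln\sigma_{8}$. All function names are hypothetical.

```python
import math
import random


def optimal_alpha(omega_m, sigma_8, pivot=0.3):
    """Alpha minimising the variance of ln[sigma_8 * (Omega_m / pivot)**alpha]."""
    x = [math.log(om / pivot) for om in omega_m]
    y = [math.log(s) for s in sigma_8]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    return -cov_xy / var_x


def s8_scatter(omega_m, sigma_8, alpha, pivot=0.3):
    """Standard deviation of S_8 = sigma_8 * (Omega_m / pivot)**alpha over the samples."""
    s8 = [s * (om / pivot) ** alpha for om, s in zip(omega_m, sigma_8)]
    mean = sum(s8) / len(s8)
    return math.sqrt(sum((v - mean) ** 2 for v in s8) / len(s8))


# Synthetic samples (an assumption for illustration, not survey chains):
# sigma_8 follows Omega_m**(-alpha_true) with 3% log-normal scatter.
rng = random.Random(0)
alpha_true = 0.7
omega_m = [rng.uniform(0.15, 0.6) for _ in range(20000)]
sigma_8 = [
    0.8 * math.exp(rng.gauss(0.0, 0.03)) * (om / 0.3) ** (-alpha_true)
    for om in omega_m
]

alpha_hat = optimal_alpha(omega_m, sigma_8)
width_fixed = s8_scatter(omega_m, sigma_8, 0.5)       # alpha fixed to 0.5
width_opt = s8_scatter(omega_m, sigma_8, alpha_hat)   # alpha matched to the degeneracy

print(f"estimated alpha = {alpha_hat:.2f}")
print(f"S_8 scatter: alpha=0.5 -> {width_fixed:.3f}, optimal alpha -> {width_opt:.3f}")
```

In this toy case the scatter in $S_{8}$ is visibly larger for α = 0.5 than for the fitted α, which mirrors the argument above: a mismatched α inflates the projected uncertainty and makes surveys look more consistent in $S_{8}$ than they are along the direction their data actually constrain.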

SUMMARY AND DISCUSSION
In this paper we use a generic cosmic shear pipeline, WLPIPE, that takes in galaxy shear catalogs, calculates the two-point shear-shear correlation function via the software package TREECORR and the associated covariance matrix via the software package COSMOLIKE, and finally carries out cosmological parameter inference via the software package COSMOSIS. The WLPIPE framework is constructed using the PEGASUS workflow engine, which takes care of data and code transfer between different computing resources seamlessly. This pipeline also serves as a prototype for future analysis pipelines in DESC. We apply this pipeline to four existing cosmic shear surveys: the Deep Lens Survey (DLS), the Canada-France-Hawaii Telescope Lensing Survey (CFHTLenS), the Science Verification data from the Dark Energy Survey (DES-SV), and the 450 deg$^{2}$ release of the Kilo-Degree Survey (KiDS-450). The goal is to first reproduce the literature results, then investigate the effect of the different analysis choices adopted in each of the surveys, and finally unify these analysis choices in order to perform an apples-to-apples comparison of the survey results. In Fig. 13 we summarize the constraints on $S_{8}\equiv\sigma_{8}(\Omega_{\rm m}/0.3)^{0.5}$ from all the cases studied in this work. We summarize our main findings below:
• We are able to reproduce a specific set of the published results from the four collaborations to well within the uncertainties when following the same analysis choices. In this Baseline case, the four surveys appear to be broadly consistent in terms of their constraints on $S_{8}$: $S_{8}=0.80^{+0.032}_{-0.032}$ (DLS), $0.73^{+0.028}_{-0.028}$ (CFHTLenS), $0.80^{+0.059}_{-0.058}$ (DES-SV), and $0.77^{+0.033}_{-0.034}$ (KiDS-450). However, we note that not all the model fits are good descriptions of the data: for DLS and CFHTLenS, the p-values for the fits are low, while for KiDS-450, the p-value is acceptable, but only after incorporating recent improvements for the covariance.
• In reproducing the published results, we investigate several issues: the angular bin values used in the data vector, the incorporation of nuisance parameters in the covariance, the nonlinear power spectrum model, and others. We find these details can shift the cosmological constraints by ∼0.5σ. Most of these issues are known, but analyzing all four experiments systematically in this work gives a big-picture view of how the four analyses agree and differ.
• Effect of the covariance matrix: constraints based on simulation-based covariances can be shifted from analytic covariances due to noise. In addition, the DLS covariance may not be well approximated by a Gaussian covariance due to the complexity of the data, the small area and the low shape noise.
• Effect of scale cuts: sensitivity of the cosmological constraints to scale cuts could indicate internal inconsistency of datasets or further issues with the covariance. It could also point to potential failures in the models at small scales (e.g. IA, nonlinear matter power spectrum, baryonic physics).
• Effect of priors: for parameters that are not constrained (e.g. $\Omega_{\rm m}$), the priors have an effect on the constraints, but for parameters that are constrained (e.g. $S_{8}$), the effect of priors is smaller, but not negligible. A wide prior on the IA amplitude can absorb other sources of systematic issues, which could explain the slightly negative IA amplitude constrained by CFHTLenS.
• When unifying all analysis choices discussed above, we find $S_{8}\equiv\sigma_{8}(\Omega_{\rm m}/0.3)^{0.5}$ to be $0.94^{+0.046}_{-0.045}$ (DLS), $0.66^{+0.070}_{-0.071}$ (CFHTLenS), $0.84^{+0.062}_{-0.061}$ (DES-SV) and $0.76^{+0.048}_{-0.049}$ (KiDS-450). The change in the DLS constraints is primarily due to the scale cuts and covariance, while the change in CFHTLenS is due to the change in the IA treatment, and could be an indication of residual issues in the photo-z estimation. The goodness-of-fit values for DLS and CFHTLenS improve but are still low.
• We calculate the $\Delta S_{8}$ statistics and the Bayesian evidence ratio (BF) between each pair of surveys (when analysis choices are unified). The $S_{8}$ constraints from the two most discrepant cases (DLS and CFHTLenS) differ by 3.4σ. The $S_{8}$ constraints for DES-SV and KiDS-450 in the final matched analysis appear consistent with the Baseline analysis as well as with each other. They also seem to be robust to the various analysis choices tested. Together with the more reasonable goodness-of-fit values and IA constraints, this is an encouraging indication for the field, given that DES-SV and KiDS-450 are the most recent of the four surveys.
• Based on all the above information (the goodness-of-fit, IA constraints and consistency), we decide to combine the DES-SV and KiDS-450 datasets. The combined constraint is $S_{8}=0.79^{+0.042}_{-0.041}$, which is in agreement with both the cosmic shear constraints from the first year of DES data in Troxel et al. (2017), and the CMB constraints from Planck Collaboration et al. (2018).
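The sensitivity to the angular bin values noted in the findings above can be illustrated with a short calculation. The sketch below compares two ways of assigning an angle to a logarithmic bin: the geometric bin center, and the mean separation weighted by the number of galaxy pairs. It assumes that pair counts grow in proportion to θ dθ (a uniform galaxy density, with no survey mask and no shape weights), so it only approximates the weighted centers measured from real catalogs. The bin range and the function names are hypothetical.

```python
import math


def log_bin_edges(theta_min, theta_max, nbins):
    """Edges of nbins logarithmically spaced bins between theta_min and theta_max."""
    step = math.log(theta_max / theta_min) / nbins
    return [theta_min * math.exp(i * step) for i in range(nbins + 1)]


def geometric_center(lo, hi):
    """Center of the bin in log(theta)."""
    return math.sqrt(lo * hi)


def pair_weighted_mean(lo, hi):
    """Mean theta in [lo, hi] when the number of pairs scales as theta * dtheta.

    <theta> = int(theta * theta dtheta) / int(theta dtheta)
            = (2/3) * (hi**3 - lo**3) / (hi**2 - lo**2)
    """
    return (2.0 / 3.0) * (hi ** 3 - lo ** 3) / (hi ** 2 - lo ** 2)


# Illustrative binning (an assumption): 5 log bins from 1 to 100 arcmin.
edges = log_bin_edges(1.0, 100.0, 5)
bins = list(zip(edges[:-1], edges[1:]))
centers = [geometric_center(lo, hi) for lo, hi in bins]
weighted = [pair_weighted_mean(lo, hi) for lo, hi in bins]

for (lo, hi), c, w in zip(bins, centers, weighted):
    print(f"[{lo:7.2f}, {hi:7.2f}] arcmin: center {c:7.2f}, "
          f"pair-weighted {w:7.2f}, ratio {w / c:.3f}")
```

Under this assumption the pair-weighted angle is larger than the geometric center by the same fractional amount in every bin, so evaluating the theory prediction at one or the other shifts the whole model data vector coherently, which is why the choice can move the inferred amplitude.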
Cosmic shear measurements hold great promise in terms of the constraining power in cosmology. In order to fully exploit this power in upcoming and future cosmic shear surveys (DES, KiDS, HSC, Euclid, LSST, WFIRST), it is important to learn from the experiences accumulated over the past years in the community across the different collaborations and datasets. We have demonstrated that a number of analysis choices can result in significant changes in the cosmological constraints and should therefore be treated with care in future analyses.

This plot illustrates the change in the cosmological constraints when we change the angular values used in the data vector: the grey dashed contour shows the constraints using the bin centers, while the colored contour shows the constraints using the weighted centers.

The DESC acknowledges ongoing support from the Institut National de Physique Nucléaire et de Physique des Particules in France; the Science & Technology Facilities Council in the United Kingdom; and the Department of Energy, the National Science Foundation, and the LSST Corporation in the United States. DESC uses resources of the IN2P3 Computing Center (CC-IN2P3-Lyon/Villeurbanne - France) funded by the Centre National de la Recherche Scientifique; the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231; STFC DiRAC HPC Facilities, funded by UK BIS National E-infrastructure capital grants; and the UK particle physics grid, supported by the GridPP Collaboration. This work was performed in part under DOE Contract DE-AC02-76SF00515.
Below we acknowledge the data sources:

CFHTLenS: This work is based on observations obtained with MegaPrime/MegaCam, a joint project of CFHT and CEA/IRFU, at the Canada-France-Hawaii Telescope (CFHT) which is operated by the National Research Council (NRC) of Canada, the Institut National des Sciences de l'Univers of the Centre National de la Recherche Scientifique (CNRS) of France, and the University of Hawaii. This research used the facilities of the Canadian Astronomy Data Centre operated by the National Research Council of Canada with the support of the Canadian Space Agency. CFHTLenS data processing was made possible thanks to significant computing support from the NSERC Research Tools and Instruments grant program.

Table C1. Top: The redshift of the peak of the lensing efficiency for each redshift bin in each survey. z p is used to calculate the scale cuts in the bottom table. Bottom: Scale cuts applied in each survey. Each row is for a combination of bins. The bins are ordered as (bin1, bin1), (bin1, bin2) ... (bin1, binN), (bin2, bin2) ... (bin2, binN) ... (binN, binN), where N is the maximum number of bins and binN is the highest redshift bin. The maximum scale cuts are fixed to the survey-specified scale cuts. "-" indicates no data points are used after the scale cut. The last row lists the remaining number of data points after the scale cut.

DES-SV: This project used public archival data from the Dark Energy Survey (DES). Funding for the DES Projects has been provided by the U.S. Department of Energy, the U.S. National Science Foundation, the Ministry of Science and Education of Spain, the Science and Technology Facilities Council of the United Kingdom, the Higher Education Funding Council for England, the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, the Kavli Institute of Cosmological Physics at the University of Chicago, the Center for Cosmology and Astro-Particle Physics at the Ohio State University, the Mitchell Institute for Fundamental Physics and Astronomy at Texas A&M University, Financiadora de Estudos e Projetos, Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro, Conselho Nacional de Desenvolvimento Científico e Tecnológico and the Ministério da Ciência, Tecnologia e Inovação, the Deutsche Forschungsgemeinschaft, and the Collaborating Institutions in the Dark Energy Survey. The Collaborating Institutions are Argonne National Laboratory, the University of California at Santa Cruz, the University of Cambridge, Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas-Madrid, the University of Chicago, University College London, the DES-Brazil Consortium, the University of Edinburgh, the Eidgenössische Technische Hochschule (ETH) Zürich, Fermi National Accelerator Laboratory, the University of Illinois at Urbana-Champaign, the Institut de Ciències de l'Espai (IEEC/CSIC), the Institut de Física d'Altes Energies, Lawrence Berkeley National Laboratory, the Ludwig-Maximilians Universität München and the associated Excellence Cluster Universe, the University of Michigan, the National Optical Astronomy Observatory, the University of Nottingham, The Ohio State University, the OzDES Membership Consortium, the University of Pennsylvania, the University of Portsmouth, SLAC National Accelerator Laboratory, Stanford University, the University of Sussex, and Texas A&M University.
Based in part on observations at Cerro Tololo Inter-American Observatory, National Optical Astronomy Observatory, which is operated by the Association of Universities for Research in Astronomy (AURA) under a cooperative agreement with the National Science Foundation.

KiDS-450: This work is based on data products from observations made with ESO Telescopes at the La Silla Paranal Observatory under programme IDs 177.A-3016, 177.A-3017 and 177.A-3018. We use cosmic shear measurements from the Kilo-Degree Survey (Kuijken et al. 2015, Hildebrandt & Viola et al. 2017, Fenech Conti et al. 2016), hereafter referred to as KiDS. The KiDS data are processed by THELI (Erben et al. 2013) and Astro-WISE (Begeman et al. 2013, de Jong et al. 2015). Shears are measured using lensfit (Miller et al. 2013), and photometric redshifts are obtained from PSF-matched photometry and calibrated using external overlapping spectroscopic surveys (see Hildebrandt et al. 2016).
The contributions from the primary authors are listed below. C.C. led the main analysis and writing of this paper. M.W. wrote the main software package WLPipe which integrates the python scripts using the Pegasus workflow engine. S.D. helped with the pipeline development, covariance assessment, comparison metrics, pipeline testing, and editing.