
SVFX: a machine learning framework to quantify the pathogenicity of structural variants

Kumar, Sushant and Harmanci, Arif and Vytheeswaran, Jagath and Gerstein, Mark B. (2020) SVFX: a machine learning framework to quantify the pathogenicity of structural variants. Genome Biology, 21. Art. No. 274. ISSN 1465-6906. PMCID PMC7650198.

PDF - Published Version
Creative Commons Attribution.

PDF - Submitted Version
Creative Commons Attribution Non-commercial.

MS Excel (Supplementary datasets) - Supplemental Material
Creative Commons Attribution.

PDF (Supplementary Figures) - Supplemental Material
Creative Commons Attribution.

MS Word (Review history) - Supplemental Material
Creative Commons Attribution.


There is a lack of approaches for identifying pathogenic genomic structural variants (SVs) although they play a crucial role in many diseases. We present a mechanism-agnostic machine learning-based workflow, called SVFX, to assign pathogenicity scores to somatic and germline SVs. In particular, we generate somatic and germline training models, which include genomic, epigenomic, and conservation-based features, for SV call sets in diseased and healthy individuals. We then apply SVFX to SVs in cancer and other diseases; SVFX achieves high accuracy in identifying pathogenic SVs. Predicted pathogenic SVs in cancer cohorts are enriched among known cancer genes and many cancer-related pathways.

Item Type:Article
Kumar, Sushant (ORCID: 0000-0002-2294-3988)
Harmanci, Arif (ORCID: 0000-0002-9696-1118)
Vytheeswaran, Jagath (ORCID: 0000-0002-5250-7714)
Gerstein, Mark B. (ORCID: 0000-0002-9746-3719)
Additional Information: © The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Received 04 September 2019; Accepted 12 October 2020; Published 09 November 2020.
We are thankful to the members of the PCAWG SV working group for generating the variant calls. We are also grateful to the Center for Common Disease and the Genome Sequencing Program consortium members for creating SV calls for the CVD and IBD cohorts used in this study. In particular, the Mount Sinai BioMe Biobank has been supported by The Andrea and Charles Bronfman Philanthropies and in part by Federal funds from the NHLBI and NHGRI (U01HG00638001; U01HG007417; X01HL134588). We thank all participants in the Mount Sinai Biobank. We also thank all our recruiters who have assisted and continue to assist in data collection and management and are grateful for the computational resources and staff expertise provided by Scientific Computing at the Icahn School of Medicine at Mount Sinai.
Similarly, IBD cohort data was generated as part of the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) IBD Genetics Consortium (IBDGC) and International IBD Genetics Consortium (IIBDGC), supported by The Helmsley Charitable Trust and the Centers for Common Disease Genomes Program (NHGRI). DNA samples were obtained from the following collections: The Lunenfeld-Tanenbaum Research Institute Mount Sinai Hospital (PI: Mark Silverberg), The University of Pittsburgh School of Medicine (PI: Richard Duerr), The Emory University School of Medicine (PI: Subra Kugathasan), The Johns Hopkins Hospital (PI: Steven Brant), The Icahn School of Medicine at Mount Sinai (PI: Judy Cho), The Washington University School of Medicine (PI: Rodney Newberry), The University of Miami Miller School of Medicine (PI: Maria Abreu, Jake McCauley), and Cedars Sinai (PI: Dermot McGovern, Stephan Targan).
Peer review information: Andrew Cosgrove was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The review history is available as Additional file 3.
This work was supported by National Institutes of Health grant U24HG007497 and the AL Williams Professorship funds.
Author information: Sushant Kumar and Arif Harmanci contributed equally to this work.
Author Contributions: Conceptualization: MG, SK, and AH; methodology: SK and AH; investigation: SK, AH, and JV; writing (original draft): SK and MG; writing (review and editing): SK, AH, JV, and MG; supervision: SK and MG. All authors have read and approved the final manuscript.
Ethics approval and consent to participate: Not applicable.
Competing interests: The authors declare that they have no competing interests.
Funding Agency | Grant Number
Andrea and Charles Bronfman Philanthropies | UNSPECIFIED
Helmsley Charitable Trust | UNSPECIFIED
Centers for Common Disease Genomes Program (NHGRI) | UNSPECIFIED
A. L. Williams Professorship | UNSPECIFIED
PubMed Central ID:PMC7650198
Record Number:CaltechAUTHORS:20190819-105323235
Persistent URL:
Official Citation:Kumar, S., Harmanci, A., Vytheeswaran, J. et al. SVFX: a machine learning framework to quantify the pathogenicity of structural variants. Genome Biol 21, 274 (2020).
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:97998
Deposited By: Tony Diaz
Deposited On:19 Aug 2019 20:58
Last Modified:18 Nov 2020 22:58
