Machine-learning-aided precise prediction of deletions with next-generation sequencing

Rui Guan, Jing-yang Gao

Journal of Central South University, 2017, Vol. 23, Issue (12): 3239-3247.

DOI: 10.1007/s11771-016-3389-1
Mechanical Engineering, Control Science and Information Engineering

Abstract

When detecting deletions in complex human genomes, split-read approaches that use short reads from next-generation sequencing still face a trade-off: either the false discovery rate is high or the sensitivity is low. To address this problem, an integrated strategy is proposed. It combines the underlying principles of the three mainstream methods (read-pair approaches, split-read approaches and read-depth analysis) with modern machine learning algorithms, using feature extraction as the bridge between them. Compared with state-of-the-art split-read methods for deletion detection at both low and high sequence coverage, the machine-learning-aided strategy balances sensitivity against false discovery rate and yields a call set that is both more sensitive and more precise, at single-base-pair resolution. Users therefore no longer need to rely on prior experience to fix a trade-off in advance or to adjust parameters repeatedly. The results suggest that modern machine learning models can play an important role in structural variation prediction.

Keywords

next-generation sequencing / deletion prediction / sensitivity / false discovery rate / feature extraction / machine learning
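
The two quantities the abstract balances can be made concrete with a minimal sketch. It uses the standard definitions, sensitivity = TP / (TP + FN) and false discovery rate = FP / (TP + FP), and assumes that a predicted deletion counts as correct only when both breakpoints match a true deletion exactly (one reading of "single-base-pair resolution"). The `Deletion` tuple layout and the function name `evaluate_calls` are hypothetical and not taken from the article.

```python
from typing import Iterable, Tuple

# Hypothetical representation of a deletion: (chromosome, start, end).
Deletion = Tuple[str, int, int]


def evaluate_calls(predicted: Iterable[Deletion],
                   truth: Iterable[Deletion]) -> Tuple[float, float]:
    """Return (sensitivity, false_discovery_rate) for a deletion call set.

    Assumption: a call is a true positive only if chromosome, start and
    end all match a true deletion exactly.
    """
    predicted_set = set(predicted)
    truth_set = set(truth)

    true_positives = len(predicted_set & truth_set)
    false_positives = len(predicted_set - truth_set)
    false_negatives = len(truth_set - predicted_set)

    # Guard against empty truth or empty call sets.
    sensitivity = (true_positives / (true_positives + false_negatives)
                   if truth_set else 0.0)
    fdr = (false_positives / (true_positives + false_positives)
           if predicted_set else 0.0)
    return sensitivity, fdr


if __name__ == "__main__":
    # Toy data for illustration only; not results from the article.
    truth = [("chr1", 100, 200), ("chr1", 500, 900),
             ("chr2", 50, 80), ("chr2", 300, 400)]
    calls = [("chr1", 100, 200), ("chr1", 500, 900),
             ("chr2", 51, 80)]  # last call is off by one base pair
    sens, fdr = evaluate_calls(calls, truth)
    print(f"sensitivity={sens:.3f}, FDR={fdr:.3f}")
```

Under the exact-match assumption, a call that is off by a single base pair counts both as a false positive and as a missed true deletion, which is why a caller tuned for one metric can easily degrade the other.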

Cite this article

Rui Guan, Jing-yang Gao. Machine-learning-aided precise prediction of deletions with next-generation sequencing. Journal of Central South University, 2017, 23(12): 3239-3247. DOI: 10.1007/s11771-016-3389-1


