Integrative modeling of transmitted and de novo variants identifies novel risk genes for congenital heart disease

Mo Li , Xue Zeng , Chentian Jin , Sheng Chih Jin , Weilai Dong , Martina Brueckner , Richard Lifton , Qiongshi Lu , Hongyu Zhao

Quant. Biol. ›› 2021, Vol. 9 ›› Issue (2) : 216 -227.

PDF (329KB)
Quant. Biol. ›› 2021, Vol. 9 ›› Issue (2) : 216 -227. DOI: 10.15302/J-QB-021-0248
RESEARCH ARTICLE

Integrative modeling of transmitted and de novo variants identifies novel risk genes for congenital heart disease

Author information +
History +
PDF (329KB)

Abstract

Background: Whole-exome sequencing (WES) studies have identified multiple genes enriched for de novo mutations (DNMs) in congenital heart disease (CHD) probands. However, risk gene identification based on DNMs alone remains statistically challenging due to heterogenous etiology of CHD and low mutation rate in each gene.

Methods: In this manuscript, we introduce a hierarchical Bayesian framework for gene-level association test which jointly analyzes de novo and rare transmitted variants. Through integrative modeling of multiple types of genetic variants, gene-level annotations, and reference data from large population cohorts, our method accurately characterizes the expected frequencies of both de novo and transmitted variants and shows improved statistical power compared to analyses based on DNMs only.

Results: Applied to WES data of 2,645 CHD proband-parent trios, our method identified 15 significant genes, half of which are novel, leading to new insights into the genetic bases of CHD.

Conclusion: These results showcase the power of integrative analysis of transmitted and de novo variants for disease gene discovery.

Graphical abstract

Keywords

rare variants / gene-level association test / congenital heart disease / de novo mutation

Cite this article

Download citation ▾
Mo Li, Xue Zeng, Chentian Jin, Sheng Chih Jin, Weilai Dong, Martina Brueckner, Richard Lifton, Qiongshi Lu, Hongyu Zhao. Integrative modeling of transmitted and de novo variants identifies novel risk genes for congenital heart disease. Quant. Biol., 2021, 9(2): 216-227 DOI:10.15302/J-QB-021-0248

登录浏览全文

4963

注册一个新账户 忘记密码

1 INTRODUCTION

Congenital heart disease (CHD) is a common birth defect affecting 0.8% of live births. It accounts for one-third of all major congenital abnormalities [1,2]. CHD patients have structural abnormalities in the heart, including septal defects, valve defects, and lesions affecting the outflow tract. CHD is the most important cause of death in neonates and infants before the advent of cardiac surgery, with fewer than 15% reaching adulthood. With the introduction of corrective heart surgery in the late 1950s and improvements in long term care of CHD patients, over 90% of children who survive the first year of life now live into adulthood [3]. Both genetic and environmental factors are known to play important roles in CHD, with one study performed on Danish twins estimating the genetic heritability in this population to be close to 0.5 [4]. However, the lack of comprehensive understanding of the genetic underpinnings of CHD presents a major obstacle to the reproductive counseling of CHD patients [1]. The difficulties in elucidating the genetic underpinnings of CHD stem from the fact that CHD is a genetically heterogeneous disease. Classic genetic methodologies have linked certain forms of CHD to chromosomal abnormalities, such as 22q11 deletion and trisomy 21 [5]. Targeted gene-deletion studies in mice have also identified over 300 genes which, when disrupted, lead to heart defects [6]. However, in the vast majority of CHD patients, no chromosomal abnormality or causal mutation have been identified [1].

Whole-exome sequencing (WES) has been deployed as an important tool to study the genetic contributions to disease in the past decade, partly due to substantial decrease in cost and increase in the accuracy and throughput of the technology [7]. WES studies have successfully identified novel causal genes, not just in Mendelian disorders but also for heterogeneous monogenic disorders like hearing loss and for complex disorders like cardiovascular disease [8]. Since WES often identifies tens of thousands of genetic variants in each exome, most of which are irrelevant to the disease of interest, researchers need to narrow the pool of variants being considered. For example, with sequenced exomes of healthy parents and their affected offspring, only variants that have extremely low frequencies in the general population or de novo mutations (DNMs) in the children are further studied, thereby greatly reducing the number of variants under consideration [1]. WES studies have achieved some success in CHD. In one notable study, Zaidi et al. found an excess of protein-altering DNMs in genes expressed in the developing heart, and identified disruptions to genes in the H3K4me-H3K27 pathway as contributors to CHD [1]. In a more recent study, Jin et al. studied rare transmitted variants in addition to DNMs, and found that DNMs accounted for 8% of cases, whereas rare transmitted mutations were implicated in 1.8% of the CHD cases [9].

Studies on DNMs often lack statistical power because of the low number of mutations. There are an estimated 1.2 DNMs per exome [10], therefore the difference between the number of DNMs in the cases and controls is generally small even in cases where particular genes contribute to a disease phenotype [11]. Statistical power is further limited since many of these studies do not incorporate the large number of inherited variants inferred through WES. The transmission and de novo association (TADA) framework is a hierarchical Bayesian method to identify disease genes [12] by drawing information from both inherited and de novo variants in the exome. Despite some success [13,14], TADA has several methodological limitations. It does not incorporate the recessive mode of inheritance nor account for factors affecting de novo mutability of each gene (e.g., local sequence context [15]), both of which have been demonstrated to be critical for inferring risk genes for CHD [9].

In this work, we introduce TADA-R, a generalized model built upon TADA to include the recessive disease model, i.e., the offspring has the recessive genotype, including homozygotes, where the affected child inherited two identical mutations from each of the two parents, and compound heterozygotes, where the affected child inherited two different mutations of the same gene, one from each parent. It is important to incorporate the recessive model when studying CHD. For example, two risk genes for human CHD (i.e.,GDF1 and MYH6) were identified purely by recessive effect [9]. Besides, recessive inheritance for CHD was observed in mouse models [16]. Beyond CHD, recessive inheritance has been implicated in a few diseases, including autism and early-onset Parkinson’s disease [1719]. By taking both dominant effects and recessive effects into consideration, our model is adaptive to diverse genetic architecture. We also incorporate gene-level annotations (e.g., gene length and sequence context) and data from a population reference panel (e.g., gnomAD [20]) to more accurately characterize the expected frequencies of both de novo and transmitted rare variants, which further improved the statistical power of our method. Through simulations, we demonstrate that TADA-R consistently outperforms TADA under diverse settings. We applied our model to WES data of 2,645 CHD proband-parent trios. In total, we identified 15 significant genes, many of which are novel and were not implicated in published studies based on simple meta-analysis of de novo and inherited variants on the same dataset. These findings shed important light on the genetic etiology of CHD.

2 RESULTS

2.1 Model overview

As detailed in the Methods section, the key for the TADA-R model is the probability of rare deleterious mutations for a gene in proband-parent trios. Following TADA, we collapse all rare mutations in a gene as a binary indicator variable for carrier status, and consider three possible genotypes for each gene: without deleterious allele (AA), with one deleterious allele (Aa), and with both alleles being deleterious (aa). TADA-R models four possible genotype configurations in a proband-parent trio where proband carries deleterious allele: (1) both parents are homozygous AAand child has a DNM (de novo trio); (2) one parent is heterozygous and did not transmit mutation a to child (non-transmitted trio); (3) one parent is heterozygous and transmitted the mutation a to child (transmitted trio); (4) both parent are heterozygous and child is homozygous aa (recessive trio). Let μ denote the de novo mutability of a gene, q/2 denote the frequency of allele a in the population, and γ=(γd,γt,γr) denote the relative risks for de novo, transmitted, and non-transmitted variants. The probabilities for the four cases are Pois (2μγd),Pois(q),Pois (q γt) and Poi s( q 2 γr), respectively. All the statistical details are discussed in the Methods section.

If a gene is not associated with a trait ( H0), then γd=γt=γr=1. Otherwise, if gene is associated with a trait ( H1), then any of ( γd, γ t,γr) 1. To quantify the magnitude of gene-disease associations, we use the following Bayes factor as the test statistic:

B= P( X|H 1)P(X | H0)= (X |γ,q) Pr(γ)Pr (q)dγ dq (X |γ=1,q)Pr(q)dγdq .
where X= {Xd ,Xn,Xt ,Xr}are the counts for de novo trios, non-transmitted trios, transmitted trios, and recessive trios, respectively. B=1= under H0 and B>1 under H1.

2.2 Variant calling

Running our variant calling pipeline on the sequencing data and using a minor allele frequency (MAF) filter of 0.001, we found an enrichment of de novo deleterious missense (D-Mis) and loss-of-function (LoF) mutations in the CHD case trios as compared to the control trios. Among the 2,645 case trios, we called 369 LoF DNMs and 448 D-Mis DNMs, which was a rate of 0.14 and 0.17 mutations per trio, respectively. This is compared to 0.08 and 0.12 mutations per trio for the control trios, translating into an enrichment factor of 1.66 for de novo LoF mutations and 1.38 for de novo D-Mis mutations (Table 1).

In contrast, we observed no enrichment in tolerated missense (T-Mis) and synonymous DNMs in the cases as compared to the controls. The overall rate of DNMs per proband exome was 1.11 for CHD cases and 1.01 for controls, which is consistent with prior work [11]. The DNM enrichment in deleterious annotation categories but not in benign categories confirms that CHD probands carry more damaging, protein-altering DNMs compared to healthy controls, while the burden of non-deleterious mutations is similar between two groups. Therefore, we only consider LoF and D-Mis mutations in our following analysis.

We also observed an enrichment of transmitted LoF variants in CHD trios with one parental carrier. There were 24,311 transmitted variants in the CHD case trios, corresponding to 9.19 variants per proband, as compared to 7.85 variants per control sample. This represents an enrichment factor of 1.17. By comparison, the enrichment factor for D-Mis variants was 1.06.

We also investigated the enrichment for recessive trios, where the proband had mutations in both copies of the gene. We observed a comparable enrichment for recessive trios with de novo trios. An enrichment of 1.39 and 1.45 was seen for LoF and D-Mis mutations, respectively. The higher enrichment for D-Mis mutation may be attributed to the relatively low counts.

2.3 Simulation study

We first evaluated the type-I error of our method. We compared the statistical power of four methods, including (1) TADA-Denovo, (2) TADA, (3) TADA-R, and (4) TADA-R with gene-specific prior (see Methods). No method showed type-I error inflation, though they were likely to be more conservative than expected (Fig.1). This is because mutations with very low frequencies had few observations. Statistical power is exactly zero when no mutation is observed. Next, we evaluated the statistical power of our method. We compared the statistical power of the four methods in the three simulation settings:

Setting 1: γd=20,γt=4, γ r=1. This is the case where there is only dominant disease mode.

Setting 2: γd=1,γt=1, γ r=16. This is the case where there is only recessive disease mode.

Setting 3: γd=20,γt=4, γ r=16. This is the case where there is both dominant and recessive disease modes.

TADA-R with gene-specific prior had the best performance in all simulation settings (Fig. 1). In particular, for a dominant trait, the latter three methods had similar statistical power. However, for a recessive trait, TADA could not effectively identify risk genes, while TADA-R was consistently more powerful. Incorporating gene-specific prior further improved statistical power. We observed a similar pattern for traits with both dominant and recessive genes, with 15.6% and 25.4% improvement in statistical power for TADA-R without/with gene-specific prior in comparison to TADA.

2.4 Real data analysis

Finally, we performed the TADA-R analysis on 2,645 CHD proband-parent trios. 15 genes reached the genome-wide level of significance (Table 2). We decomposed the Bayes factors of significant genes into contributions from dominant trios (i.e.,de novo, non-transmitted, and transmitted trios) and recessive trios (Fig. 2). 11 out of 15 significant genes only showed dominant inheritance, including CHD7, KMT2D, PTPN11, RBFOX2, POGZ, ACTB, CYP21A2, RPL5, AKAP12, NOTCH1, and SMAD2. Eight have been previously reported as human CHD genes in association analysis or gene expression analysis [9]. We identified four genes showing both dominant and recessive associations, including GDF1, SULF1, NSD1, and ADIPOQ. Homozygosity of a mutation in GDF1 was identified as a major cause of severe CHD among Ashkenazim [9]. We annotated these genes with the percentile rank of gene expression in developing mouse heart [1], and their intolerance to LoF mutation (pLI) in the gnomAD database [20]. Most of the genes are extremely intolerant to LoF mutations (pLI>0.85) and highly expressed in mouse developmental hearts (expression quantile>75%).

Majority (n = 5/7) of the genes identified by Jin et al. are also significant in our study, including CHD7, KMT2D, PTPN11, RBFOX2, and NOTCH1. We observed fewer mutations in the other two genes due to differences in the variant filtering criteria, therefore they were not significant in our analysis. All five significantly associated genes recapitulated in our study are highly intolerant to LoF and D-Mis mutations, with pLI≥0.99 and Z≥2.5. CHD7 (chromodomain helicase DNA binding protein 7) is essential for the formation of multipotent migratory neural crest, which is required for the development of cardiac structures [21]. Damaging mutations in KMT2D (lysine methyltransferase 2D) have been strongly associated with Kabuki syndrome; CHDs are present in 70% of KMT2D positive Kabuki syndrome patients [22]. Mutations in PTPN11 (protein tyrosine phosphatase non-receptor Type 11) are identified in over 50% of Noonan syndrome patients as well as patients with LEOPARD Syndrome 1; CHDs are featured in both syndromes. NOTCH1 (notch receptor 1) is a central cardiac development factor that controls fetal cardiac development and cardiomyocyte proliferation [23]. Mutations in NOTCH1 have been associated with a spectrum of aortic valve anomalies in human [1619,2426]. RBFOX2 (RNA binding fox-1 homolog 2) is highly expressed in heart, liver, and pancreas. LoF mutations in RBFOX2 have been associated with hypoplastic left heart syndrome (HLHS) [27]. A study in HLHS patient positive for RBFOX2 mutations showed that Rbfox2 regulates mRNA levels of targets with 3′UTR binding sites contributing to aberrant gene expression in CHD patients [28].

Among the seven novel CHD risk genes identified in our study, POGZ, ACTB, and SMAD2 are extremely intolerant to LoF and D-Mis mutations and are highly expressed in developing human heart. Additionally, all damaging variants identified in these three genes in our cohort have not been observed in ExAC [29] and gnomAD [20]. All identified D-Mis mutations alter highly conserved amino acid loci, supporting the pathogenicity of these variants. POGZ (pLI= 1.00, Z score= 3.58) encodes Pogo transposable element derived with ZNF domain. De novo protein-truncating mutations in POGZ have been reported in multiple cases of White-Sutton syndrome [30,31], among which CHD has been reported in one patient. ACTB (pLI= 0.99, Z score= 5.1) encodes actin beta. Heterozygous mutation in ACTB is associated with Baraitser-Winter syndrome 1 [32], where CHDs have been reported in multiple unrelated ACTB mutation positive patients [33]. This suggests ACTB plays a role in cardiac development. SMAD2 (pLI= 1.00, Z score= 3.8) encodes SMAD family member 2. Mouse knockout of SMAD2 has abnormal dorsal aorta and heart tube morphology. Additionally, a transmitted LoF mutation in SMAD2 has been reported as a disease-causing mutation in a familial case of CHD [1].

Additionally, we identified two de novo and one transmitted frameshift mutations in AKAP12 (pLI= 0.05, Z score= 0.74), a gene encoding A-kinase anchoring protein 12. All identified AKAP12 mutations are novel in gnomAD and ExAC. AKAP12 is highly expressed in developing human heart and mouse knockout of AKAP12 exhibits abnormal heart left ventricle and heart septum morphology, suggesting a role of AKAP12 in CHDs.

Significant results were also observed in CYP21A2 (pLI= 0, Z score= 1.62; cytochrome P450 family 21 subfamily A member 2), ADIPOQ (pLI= 0, Z score= ‒0.08; Adiponectin), and SULF1 (pLI= 0, Z score= 1.58; Sulfatase 1). A large number of risk alleles (n = 76) that are relatively more common were observed in our cohort in these three genes, with 38% and 63% of the alleles having MAF≥1E‒4 and 1E‒5, respectively. SULF1 is highly expressed in developing human heart. Additionally, a microdeletion encompassing two genes, SULF1 and SLCO5A1, on chromosome 8q13 has been associated with the mesomelia-synostosis syndrome (MSS); CHDs are one of the complications of MSS. These results provide potential target genes for functional validation in the future.

We tested for type I errors in the WES pipeline and TADA analysis by evaluating data from 1,789 control trios, which were the healthy siblings of autism probands in the simons simplex collection. No gene was found to be statistically significant. Besides, there is no known CHD gene among the top gene list (Table 3). This further suggests that our method does not have type I error inflation.

3 DISCUSSION

In this paper, we have introduced TADA-R, a novel method to perform gene-level association analysis in WES studies. This method integrates signals from DNMs and transmitted variants, and models both dominant and recessive inheritance to further improve statistical power. This is helpful for the study of CHD, where recessive genes play an important role [9]. Besides, TADA did not consider the difference of mutability among genes. We address this limitation by integrating gene-specific prior on mutation frequency learned from a large external database. Through simulations, we demonstrated that TADA-R achieves better performance under various genetic architectures.

To further demonstrate the performance of TADA-R, we conducted a case study on CHD. In total, we identified 15 significant genes for CHD and successfully replicated five significant genes as discussed in the previous paper [9]. More importantly, our model identified seven novel CHD risk genes that were not implicated in previous analysis using the same dataset. One example is POGZ, a gene that is intolerant to loss-of-function variants (pLI=1.00), highly expressed in developing human heart, and has been reported in CHD cases. Our analysis provides novel insight into the disease mechanism of CHD.

Despite the success, our method has several limitations. Although including transmitted variants in analysis may improve statistical power, it may bring difficulty in interpreting the results. It may be difficult to pinpoint the driving source of genetic risk given multiple mutations scattered around the coding regions, especially for those that are unlikely to be pathogenic. In our analysis, we decomposed the contribution of mutations into dominant effect and recessive effect, which makes it easier to interpret the disease mechanism. However, methods for further decomposition are needed. Another limitation is that our method does not account for inbreeding when estimating the prior of parental genotype frequencies in recessive trios. Modeling the elevated rate of inbreeding in the case cohort has the potential to further improve the performance of our model.

In summary, TADA-R is a powerful and flexible framework to perform gene-level association analysis for de novo and transmitted variants. By integrating both dominant and recessive effects of genes, TADA-R can achieve better performance in all simulation scenarios. We have successfully applied TADA-R to analyze a CHD dataset, and shown that TADA-R is able to replicate previous findings, as well as identify novel significant genes. Further explorations suggest that these genes likely play important roles in CHD mechanism. Besides CHD, TADA-R can be applied to analyze WES data for other disease trios. As WES data continue to be generated for more traits and more individuals, we hope that TADA-R can lead to more gene identifications and biological insights.

4 METHODS

4.1 Whole-exome sequencing data

WES data for CHD case trios were downloaded from the database of Genotypes and Phenotypes (dbGaP) website (phs000571.v1.p1) [9]. The study included a total of 2,645 families that were recruited to the Congenital Heart Disease Network Study of the Pediatric Cardiac Genomics Consortium. Each family includes unaffected parents and one CHD offspring. Details on the inclusion criteria, sequencing protocol, and variant calling pipelines have been previously reported [9]. WES data for control trios were downloaded from (https://ndar.nih.gov/study.html?id=353) [34]. The data contain 1,789 families with two unaffected parents, one offspring with autism, and one unaffected sibling. Only the unaffected sibling and parents were analyzed in this study [9,34,35].

We considered rare homozygous and compound heterozygous variants supported by high quality sequence reads. We defined rare variants as those that had an allele frequency less than 0.1% across all the samples in the 1000 Genomes, EVS, and ExAC datasets [29,36,37], while high quality sequence reads are those that passed GATK Variant Score Quality Recalibration, had a minimum 8 total reads for both proband and parents, and had a genotype quality (GQ)≥20. Rare variants were annotated by ANNOVAR [38]. Only LoF mutations (nonsense, canonical splice-site, frameshift indels, and start loss) and D-Mis variants were considered potentially damaging to the disease.

4.2 Probabilistic model

TADA-R models four possible genotype configurations in a proband-parent trio where proband carries deleterious allele: (1) both parents are homozygous AA and child has a de novo mutation (de novo trio); (2) one parent is heterozygous and did not transmit mutation a to child (non-transmitted trio); (3) one parent is heterozygous and transmitted the mutation a to child (transmitted trio); (4) both parent are heterozygous and child is homozygous aa (recessive trio). The probability of observing each trio conditional on unaffected parents is the multiplication of the probability of parent genotype, child genotype, and child phenotype.

If we denote the frequency of allele a in the population as q/2, the frequencies of parent genotype being AA, Aa, and aa genotypes are 1−q+q2/4, q−(1−q/2), and q2/4, respectively. As q is very small in the population, we ignore the possibility of aa genotype for unaffected parents and approximate the frequencies of homozygous (AA) and heterozygous (Aa) genotypes as 1−q and q, respectively. Let μ denote the de novo mutability of a gene which quantifies the expected DNM counts in the gene per individual and per chromosome. Conditioning on parental genotypes, the probabilities for observing the affected child’s genotype for these four cases are 1−2μ, 2μ, (1−2μ)/2, and (1+2μ)/2, respectively. We use the tri-nucleotide sequence context approach to estimate de novo mutability μ [15]. For phenotype probability, let f denote the penetrance of AA genotype, and let
γ =( γd,γ t,γ r)
denote the relative risk values for de novo, transmitted, and non-transmitted variants. The penetrance for the proband in de novo, non-transmitted, transmitted, and recessive trio is γd f,f, γ tf, and γrf, respectively (Fig. 3).

As q and μ are small for rare variant, we ignore terms 1q,12μ, and 1+2μ. The probabilities of the four types of trios can be approximated by 2μ γd, q, qγt and q2 γr, and for a cohort of N trios, their counts can be approximated by the following distribution:

Xd Pois(2μγdN) ,XnPois(qN),Xt Pois(q γ tN), XrPois(q2 γrN),

where X d,Xn ,Xt,Xr are the counts for de novo trios, non-transmitted trios, transmitted trios, and recessive trios, respectively.

The full likelihood of our model is then

L(Θ )=i=1M (X i| γ,q i )Pr (γ)Pr (qi) dγd qi,
where M is the number of genes, Xi =( Xd_i, X n_i,X t_i ,Xr_i) are the counts of different family genotype configurations for gene i, qi is frequency of having rare variants in gene i in the population, and γ=(γd,γt,γr) are the relative risk values for de novo, transmitted, and non-transmitted variants across all genes. Prior selection of parameters γd,γt,γr and qi will be discussed in the next session.

4.3 Prior estimation

The model involves four parameters, including cross-gene parameters γd,γt, γr, and gene-specific parameter qi . We use the conjugate prior for these parameters:

γdGamma( ρd,νd), γt Gamma (ρt ,νt), γr Gamma (ρr ,νr),qi Gamma( ω i,τ),
where ρd ,νd,νt ,ρr,νr are the hyperparameters for the relative risks. We use an empirical Bayes approach to estimate these hyperparameters. The details about hyperparameter estimation for γ were discussed in the TADA model [12]. ω i and τ are the hyperparameters for the rare allele frequency qi, where ωi controls the expectation of the mutation frequency for gene i, and τ controls the variance of frequency distributions across all genes. We estimated ω i and τ from the gnomAD database. Since we want to focus on extremely rare variants, there is high variance for the observed counts of these rare variants, and it is possible that a gene may lack any rare variant. The point estimates of allele frequencies in gnomAD cannot be plugged in directly. Instead, we ‘smoothen’ the point estimates via an empirical Bayes prior that integrates information across all the genes in the genome.

Specifically, we consider an additional hierarchical layer on the distribution of qi . We propose the following prior distribution for qi

f( qi )=G amma( αiβ,β),
where αi is the expectation of this distribution and is calculated as the re-weighted frequency based on the total counts of rare variants in gnomAD. We use mutability scores mi estimated based on sequence context by Samocha et al. as the weights for mutations [15]. We have
αi= mi imi× izi 2 NgnomAD,
where zi is the observed counts of rare variants for gene i in gnomAD, NgnomAD is the sample size for gnomAD, and zi follows a Poisson distribution given the allele frequency qi
ziqi~ Poisson (2 NgnomAD qi).

The other parameter β controls the variance of the prior distribution (variance= αi/β) and is shared between all genes. We estimate the unknown parameters β by maximizing the likelihood of the observed variant counts in gnomAD.

β^ =argmax β L(z)=argmaxβi0 1P( ziq i)f( qi)d qi
Then, the posterior distribution of qi given the variant counts in gnomAD is

P(q i zi)= P (zi qi)f (qi) 01P (zi qi)f (qi)dq i=Gamma( αi β ^+zi, β ^+2Ngnomad )
We use this distribution as the prior distribution of qi in our model, where ωi=αiβ^+zi, and τ= β ^+2Ngnomad.

4.4 Simulation settings

We conducted simulations to evaluate the performance of our model. Simulation data were generated under both the null and alternative hypotheses to evaluate the type-I error rate and statistical power. We fixed the sample size of our simulation study as 5,000, and randomly selected 200 genes for simulation. For each gene, we further replicated 5 times. In each repeat, we sampled the allele frequency of each gene based on the prior distribution P(qi | Zi)estimated from gnomAD. Then, we sampled the number of de novo trios, non-transmitted trios, transmitted trios, and recessive trios for each gene from the following Poisson distribution. We adjusted mutation rates by considering inbreeding, with inbreeding factor F=0.002. This is the same as that reported in the CHD dataset [9]. After the adjustment, we have:
Xd ~ Pois (2μγdN), Xn~ Pois ((qqF )N),X t~ Pois ((qqF)γtN),X r~ Pois (qF γrN/2).

To generate mutation counts under the alternative hypothesis, we considered the following three simulation settings:

Setting 1: γd= 20,γ t=4 ,γr=1. This is the case where there is only dominant disease mode.

Setting 2: γd= 1,γ t=1 ,γr=16. This is the case where there is only recessive disease mode.

Setting 3: γd= 20,γ t=4 ,γr=16. This is the case where there is both dominant and recessive disease modes.

We note that the values of the relative risks chosen here are consistent with previous estimations [12].

Using the same mutation counts as input, we compared the performance of four methods: (1) TADA-Denovo only takes DNMs as input; (2) TADA takes both de novo and transmitted dominant mutations as input; (3) TADA-R takes de novo, transmitted dominant, and transmitted recessive mutations as input; and (4) TADA-R with gene-specific prior further considers gene-specific prior in the TADA-R model. As shown in Eq. (1), we used a Bayes factor as test statistic to quantify the magnitude of gene-disease association. The p-value of this Bayes factor was then calculated by a permutation test, where samples from the null distribution were generated by setting the relative risk of variants as zero, and p-value is the proportion of permuted samples with values greater than the observed Bayes factor. Finally, the false discovery rate (FDR) was calculated as Benjamini-Hochberg correction of multiple testing p-values. The FDR is estimated for all the above methods under different simulation scenarios. Statistical power was calculated as the proportion of genes with FDRs smaller than 0.05. Under the null hypothesis, we generated mutation counts under γd=1,γ t=1 ,γr=1 and repeated the whole procedure. Similarly, type-I error was calculated as the proportion of genes with FDRs smaller than 0.05 under the null.

References

[1]

Zaidi, S., Choi, M., Wakimoto, H., Ma, L., Jiang, J., Overton, J. D., Romano-Adesman, A., Bjornson, R. D., Breitbart, R. E., Brown, K. K., (2013) De novo mutations in histone-modifying genes in congenital heart disease. Nature, 498, 220–223

[2]

Postma, A. V., Bezzina, C. R. and Christoffels, V. M. (2016) Genetics of congenital heart disease: the contribution of the noncoding regulatory genome. J. Hum. Genet., 61, 13–19

[3]

Gilboa, S. M., Salemi, J. L., Nembhard, W. N., Fixler, D. E. and Correa, A. (2010) Mortality resulting from congenital heart disease among children and adults in the United States, 1999 to 2006. Circulation, 122, 2254–2263

[4]

Wienke, A., Herskind, A. M., Christensen, K., Skytthe, A. and Yashin, A. I. (2005) The heritability of CHD mortality in danish twins after controlling for smoking and BMI. Twin Res. Hum. Genet., 8, 53–59

[5]

Lalani, S. R. and Belmont, J. W. (2014) Genetic basis of congenital cardiovascular malformations. Eur. J. Med. Genet., 57, 402–413

[6]

Bentham, J. and Bhattacharya, S. (2008) Genetic mechanisms controlling cardiovascular development. Ann. N. Y. Acad. Sci., 1123, 10–19

[7]

Teer, J. K. and Mullikin, J. C. (2010) Exome sequencing: the sweet spot before whole genomes. Hum. Mol. Genet., 19, R145–R151

[8]

Rabbani, B., Tekin, M. and Mahdieh, N. (2014) The promise of whole-exome sequencing in medical genetics. J. Hum. Genet., 59, 5–15

[9]

Jin, S. C., Homsy, J., Zaidi, S., Lu, Q., Morton, S., DePalma, S. R., Zeng, X., Qi, H., Chang, W., Sierant, M. C., (2017) Contribution of rare inherited and de novo variants in 2,871 congenital heart disease probands. Nat. Genet., 49, 1593–1601

[10]

Kong, A., Frigge, M. L., Masson, G., Besenbacher, S., Sulem, P., Magnusson, G., Gudjonsson, S. A., Sigurdsson, A., Jonasdottir, A., Jonasdottir, A., (2012) Rate of de novo mutations and the importance of father’s age to disease risk. Nature, 488, 471–475

[11]

Sanders, S. J., Murtha, M. T., Gupta, A. R., Murdoch, J. D., Raubeson, M. J., Willsey, A. J., Ercan-Sencicek, A. G., DiLullo, N. M., Parikshak, N. N., Stein, J. L., (2012) De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature, 485, 237–241

[12]

He, X., Sanders, S. J., Liu, L., De Rubeis, S., Lim, E. T., Sutcliffe, J. S., Schellenberg, G. D., Gibbs, R. A., Daly, M. J., Buxbaum, J. D., (2013) Integrated model of de novo and inherited genetic variants yields greater power to identify risk genes. PLoS Genet., 9, e1003671

[13]

De Rubeis , S., He, X., Goldberg, A. P., Poultney, C. S., Samocha, K., Cicek, A. E., Kou, Y., Liu, L., Fromer, M., Walker, S., (2014) Synaptic, transcriptional and chromatin genes disrupted in autism. Nature, 515, 209–215

[14]

Singh, T., Kurki, M. I., Curtis, D., Purcell, S. M., Crooks, L., McRae, J., Suvisaari, J., Chheda, H., Blackwood, D., Breen, G., (2016) Rare loss-of-function variants in SETD1A are associated with schizophrenia and developmental disorders. Nat. Neurosci., 19, 571–577

[15]

Samocha, K. E., Robinson, E. B., Sanders, S. J., Stevens, C., Sabo, A., McGrath, L. M., Kosmicki, J. A., Rehnström, K., Mallick, S., Kirby, A., (2014) A framework for the interpretation of de novo mutation in human disease. Nat. Genet., 46, 944–950

[16]

Bruneau, B. G. (2008) The developmental genetics of congenital heart disease. Nature, 451, 943–948

[17]

Shanks, M. E.,Downes, S. M., Copley, R. R., Lise, S., Broxholme, J., Hudspith, K. A., Kwasniewska, A., Davies, W. I., Hankins, M. W., Packham, E. R., (2013) Next-generation sequencing (NGS) as a diagnostic tool for retinal degeneration reveals a much higher detection rate in early-onset disease. Eur. J. Hum. Genet., 21, 274–280

[18]

Chahrour, M. H., Yu, T. W., Lim, E. T., Ataman, B., Coulter, M. E., Hill, R. S., Stevens, C. R., Schubert, C. R., Greenberg, M. E., Gabriel, S. B., (2012) Whole-exome sequencing and homozygosity analysis implicate depolarization-regulated neuronal genes in autism. PLoS Genet., 8, e1002635

[19]

Schormair, B., Kemlink, D., Mollenhauer, B., Fiala, O., Machetanz, G., Roth, J., Berutti, R., Strom, T. M., Haslinger, B., Trenkwalder, C., (2018) Diagnostic exome sequencing in early-onset Parkinson’s disease confirms VPS13C as a rare cause of autosomal-recessive Parkinson’s disease. Clin. Genet., 93, 603–612

[20]

Karczewski, K. J., Francioli, L. C., Tiao, G., Cummings, B. B., Alföldi, J., Wang, Q., Collins, R. L., Laricchia, K. M., Ganna, A., Birnbaum, D. P., (2020) The mutational constraint spectrum quantified from variation in 141,456 humans. Nature, 581, 434–443

[21]

Bajpai, R., Chen, D. A., Rada-Iglesias, A., Zhang, J., Xiong, Y., Helms, J., Chang, C. P., Zhao, Y., Swigut, T. and Wysocka, J. (2010) CHD7 cooperates with PBAF to control multipotent neural crest formation. Nature, 463, 958–962

[22]

Digilio, M. C., Gnazzo, M., Lepri, F., Dentici, M. L., Pisaneschi, E., Baban, A., Passarelli, C., Capolino, R., Angioni, A., Novelli, A., (2017) Congenital heart defects in molecularly proven Kabuki syndrome patients. Am. J. Med. Genet. A., 173, 2912–2922

[23]

Kasahara, A., Cipolat, S., Chen, Y., Dorn, G. W. 2nd and Scorrano, L. (2013) Mitochondrial fusion directs cardiomyocyte differentiation via calcineurin and Notch signaling. Science, 342, 734–737

[24]

Garg, V., Muth, A. N., Ransom, J. F., Schluterman, M. K., Barnes, R., King, I. N., Grossfeld, P. D. and Srivastava, D. (2005) Mutations in NOTCH1 cause aortic valve disease. Nature, 437, 270–274

[25]

Mohamed, S. A., Aherrahrou, Z., Liptau, H., Erasmi, A. W., Hagemann, C., Wrobel, S., Borzym, K., Schunkert, H., Sievers, H. H. and Erdmann, J. (2006) Novel missense mutations (p.T596M and p.P1797H) in NOTCH1 in patients with bicuspid aortic valve. Biochem. Biophys. Res. Commun., 345, 1460–1465

[26]

McBride, K. L., Riley, M. F., Zender, G. A., Fitzgerald-Butt, S. M., Towbin, J. A., Belmont, J . W. and Cole, S. E. (2008) NOTCH1 mutations in individuals with left ventricular outflow tract malformations reduce ligand-induced signaling. Hum. Mol. Genet., 17, 2886–2893

[27]

Homsy, J., Zaidi, S., Shen, Y., Ware, J. S., Samocha, K. E., Karczewski, K. J., DePalma, S. R., McKean, D., Wakimoto, H., Gorham, J., (2015) De novo mutations in congenital heart disease with neurodevelopmental and other congenital anomalies. Science, 350, 1262–1266

[28]

Verma, S. K., Deshmukh, V., Nutter, C. A., Jaworski, E., Jin, W., Wadhwa, L., Abata, J., Ricci, M., Lincoln, J., Martin, J. F., (2016) Rbfox2 function in RNA metabolism is impaired in hypoplastic left heart syndrome patient hearts. Sci. Rep., 6, 30896

[29]

Lek, M., Karczewski, K. J., Minikel, E. V.,Samocha, K. E., Banks, E., Fennell, T., O’Donnell-Luria, A. H., Ware, J. S., Hill, A. J., Cummings, B. B., (2016) Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536, 285–291

[30]

Stessman, H. A. F., Willemsen, M. H., Fenckova, M., Penn, O., Hoischen, A., Xiong, B., Wang, T., Hoekzema, K., Vives, L., Vogel, I., (2016) Disruption of POGZ Is Associated with Intellectual Disability and Autism Spectrum Disorders. Am. J. Hum. Genet., 98, 541–552

[31]

The Deciphering Developmental Disorders Study (2015) Large-scale discovery of novel genetic causes of developmental disorders. Nature, 519, 223–228

[32]

Di Donato, N., Rump, A., Koenig, R., Der Kaloustian, V. M., Halal, F., Sonntag, K., Krause, C., Hackmann, K., Hahn, G., Schrock, E., (2014) Severe forms of Baraitser-Winter syndrome are caused by ACTB mutations rather than ACTG1 mutations. Eur. J. Hum. Genet., 22, 179–183

[33]

Cuvertino, S., Stuart, H. M., Chandler, K. E., Roberts, N. A., Armstrong, R., Bernardini, L., Bhaskar, S., Callewaert, B., Clayton-Smith, J., Davalillo, C. H., (2017) ACTB loss-of-function mutations result in a pleiotropic developmental disorder. Am. J. Hum. Genet., 101, 1021–1033

[34]

Krumm, N., Turner, T. N., Baker, C., Vives, L., Mohajeri, K., Witherspoon, K., Raja, A., Coe, B. P., Stessman, H. A., He, Z. X., (2015) Excess of rare, inherited truncating mutations in autism. Nat. Genet., 47, 582–588

[35]

Fischbach, G. D. and Lord, C. (2010) The Simons Simplex Collection: a resource for identification of autism genetic risk factors. Neuron, 68, 192–195

[36]

1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature, 526, 68–74

[37]

NHLBI Exome Sequencing Project (ESP) website. http://evs.gs.washington.edu/EVS/. Accessed: Sep 1, 2020

[38]

Wang, K., Li, M. and Hakonarson, H. (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res., 38, e164

RIGHTS & PERMISSIONS

The Author(s) 2021. Published by Higher Education Press

AI Summary AI Mindmap
PDF (329KB)

3054

Accesses

0

Citation

Detail

Sections
Recommended

AI思维导图

/