Cover illustration
The detection of protein complexes is a fundamental problem in proteomics and bioinformatics, and is equivalent to finding interesting sub-networks in protein-protein interaction networks. One measure of interest is the statistical significance of protein complexes in terms of P-values. However, how to evaluate the statistical significance of each detected protein complex has not received much attention in the literature. As a result, the statistical assessment of ...
Background: In systems biology, the dynamics of biological networks are often modeled with ordinary differential equations (ODEs) that encode interacting components in the systems, resulting in highly complex models. In contrast, the amount of experimentally available data is almost always limited, and insufficient to constrain the parameters. In this situation, parameter estimation is a very challenging problem. To address this challenge, two intuitive approaches are to perform experimental design to generate more data, and to perform model reduction to simplify the model. Experimental design and model reduction have been traditionally viewed as two distinct areas, and an extensive literature and excellent reviews exist on each of the two areas. Intriguingly, however, the intrinsic connections between the two areas have not been recognized.
Results: Experimental design and model reduction are deeply related and can be considered within one unified framework. Two recent methods can tackle both areas, one based on the model manifold and the other on the profile likelihood. We use a simple sum-of-two-exponentials example to discuss the concepts and algorithmic details of both methods, and provide a MATLAB-based implementation as a resource for the dissemination and adoption of experimental design and model reduction in the biology community.
Conclusions: From a geometric perspective, we consider the experimental data as a point in a high-dimensional data space and the mathematical model as a manifold living in this space. Parameter estimation can be viewed as a projection of the data point onto the manifold. By examining the singularity around the projected point on the manifold, we can perform both experimental design and model reduction. Experimental design identifies new experiments that expand the manifold and remove the singularity, whereas model reduction identifies the nearest boundary, which is the nearest singularity that suggests an appropriate form of a reduced model. This geometric interpretation represents one step toward the convergence of experimental design and model reduction as a unified framework.
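The singularity mentioned in the conclusions can be illustrated numerically. The sketch below is not the authors' MATLAB code. It assumes the sum-of-two-exponentials model has the form y(t) = exp(-a t) + exp(-b t) with unit measurement noise, and the sampling times are made up for illustration. It computes the Fisher information matrix J^T J and shows that its smallest eigenvalue drops to zero as the two decay rates coincide, which is the boundary where the model reduces to a single exponential.

```python
import math

def jacobian(theta, times):
    """Sensitivities dy/da and dy/db of y(t) = exp(-a*t) + exp(-b*t)."""
    a, b = theta
    return [(-t * math.exp(-a * t), -t * math.exp(-b * t)) for t in times]

def fisher_information(theta, times):
    """Entries (f11, f12, f22) of the 2x2 matrix J^T J (unit noise assumed)."""
    rows = jacobian(theta, times)
    f11 = sum(r[0] * r[0] for r in rows)
    f12 = sum(r[0] * r[1] for r in rows)
    f22 = sum(r[1] * r[1] for r in rows)
    return f11, f12, f22

def eigenvalues(f11, f12, f22):
    """Eigenvalues of a symmetric 2x2 matrix, smallest first."""
    mean = 0.5 * (f11 + f22)
    radius = math.sqrt((0.5 * (f11 - f22)) ** 2 + f12 ** 2)
    return mean - radius, mean + radius

# Hypothetical sampling times, chosen only for illustration.
times = [0.5, 1.0, 2.0, 4.0]
for theta in [(1.0, 3.0), (1.0, 1.1), (1.0, 1.0)]:
    smallest, largest = eigenvalues(*fisher_information(theta, times))
    print(theta, smallest, largest)
```

Under the same assumptions, adding a measurement time appends a row to the Jacobian, which can only increase (never decrease) the eigenvalues of J^T J. This is one way to read the statement that new experiments help remove the singularity.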
Background: Ab initio protein structure prediction aims to predict the tertiary structure of a protein from its amino acid sequence alone. As an important topic in bioinformatics, it has attracted considerable effort in method design. Unfortunately, in the absence of a perfect energy function, selecting a good near-native structure from the predicted decoy structures in the last step remains a difficult task.
Methods: Here we propose an ensemble clustering method based on k-medoids to deal with this problem. The k-medoids method is run many times to generate clustering ensembles, and then a voting method is used to combine the clustering results. A confidence score is defined to select the final near-native model, considering both the cluster size and the cluster similarity.
Results: We have applied the method to 54 single-domain targets in CASP-11. For about 70.4% of these targets, the proposed method selects better near-native structures than the SPICKER method used by the I-TASSER server.
Conclusions: The experiments show that the proposed method is effective in selecting the near-native structure from decoy sets for different targets, as measured by the similarity between the selected structure and the native structure.
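The ensemble step described in the Methods can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes a precomputed pairwise distance matrix between decoys (for example RMSD), and it uses a co-association matrix, the fraction of runs in which two decoys fall in the same cluster, as one possible voting scheme. The confidence score from the abstract is not reproduced, and all function names are hypothetical.

```python
import random

def assign(dist, medoids):
    """Label each item with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def k_medoids(dist, k, rng, max_iter=50):
    """One k-medoids run from a random start; returns a cluster label per item."""
    medoids = rng.sample(range(len(dist)), k)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster is empty
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the others.
            updated.append(min(members,
                               key=lambda m: sum(dist[m][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(dist, medoids)

def co_association(dist, k, runs, seed=0):
    """Fraction of runs in which each pair of items shares a cluster."""
    rng = random.Random(seed)
    n = len(dist)
    votes = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = k_medoids(dist, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    votes[i][j] += 1
    return [[v / runs for v in row] for row in votes]

# Toy "decoys": positions on a line stand in for structures.
positions = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
dist = [[abs(a - b) for b in positions] for a in positions]
votes = co_association(dist, k=2, runs=20)
print(votes[0][1], votes[0][3])
```

Pairs with a high co-association value are grouped together consistently across runs, so thresholding or re-clustering this matrix yields a consensus partition that is less sensitive to any single random initialization.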
Background: Statistical validation of predicted complexes is a fundamental issue in proteomics and bioinformatics. The goal is to measure the statistical significance of each predicted complex in terms of p-values. Surprisingly, this issue has not received much attention in the literature. To our knowledge, only a few research efforts have been made in this direction.
Methods: In this article, we propose a novel method for calculating the p-value of a predicted complex. The null hypothesis is that there is no difference between the number of edges in the target protein complex and that in the random null model. In addition, we assume that a true protein complex must be a connected subgraph. Based on this null hypothesis, we present an algorithm to compute the p-value of a given predicted complex.
Results: We test our method on five benchmark data sets to evaluate its effectiveness.
Conclusions: The experimental results show that our method is superior to the state-of-the-art algorithms in assessing the statistical significance of candidate protein complexes.
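The idea of an edge-count p-value can be illustrated with a deliberately simplified null model. The sketch below is not the algorithm proposed in the article. It assumes an Erdős–Rényi random graph in which every pair of proteins interacts independently with probability `edge_prob`, and it ignores the connectivity constraint that the authors impose. The function name and parameters are hypothetical.

```python
from math import comb

def edge_count_p_value(n_nodes, n_edges, edge_prob):
    """P(X >= n_edges) for X ~ Binomial(n_pairs, edge_prob).

    n_pairs is the number of protein pairs inside the candidate complex.
    Assumption: independent edges (Erdos-Renyi null), no connectivity constraint.
    """
    n_pairs = n_nodes * (n_nodes - 1) // 2
    return sum(
        comb(n_pairs, k) * edge_prob ** k * (1 - edge_prob) ** (n_pairs - k)
        for k in range(n_edges, n_pairs + 1)
    )

# A 6-protein candidate with 12 of 15 possible interactions,
# in a network whose overall edge density is assumed to be 0.05.
print(edge_count_p_value(6, 12, 0.05))
```

A small p-value indicates that the candidate is denser than expected by chance under this simplified null. Conditioning on connected subgraphs, as the abstract describes, changes the null distribution and therefore the resulting p-values.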
Background: The induction of neural regeneration is vital to the repair of spinal cord injury (SCI). Compared with the peripheral nervous system (PNS), however, the regenerative capacity of the central nervous system (CNS) is extremely limited. This suggests that modulating the molecular pathways underlying PNS repair may lead to the discovery of potential treatments for CNS injury.
Methods: Based on the gene expression profiles of dorsal root ganglion (DRG) after a sciatic nerve injury, we utilized network-guided forest (NGF) to rank genes by their capacity to distinguish injured DRG from sham-operated controls. Gene importance scores derived from NGF were used as initial heat in a heat diffusion model (HotNet2) to infer the subnetworks underlying neural regeneration in the DRG. After potential regulators of the subnetworks were found through Connectivity Map (cMap), candidate compounds were experimentally evaluated for their capacity to regenerate the damaged neurons.
Results: Gene ontology analysis of the subnetworks revealed that the ubiquinone biosynthetic process is crucial for neural regeneration. Moreover, almost half of the genes in these subnetworks were found to be related to neural regeneration via text mining. After screening compounds that are likely to modulate gene expression in the subnetworks, three compounds were selected for the experiment. Of them, trichostatin A, a histone deacetylase inhibitor, was validated to enhance neurite outgrowth in vivo via an optic nerve crush mouse model.
Conclusions: Our study identified subnetworks underlying neural regeneration, and validated that a compound can promote neurite outgrowth by modulating these subnetworks. This work also suggests an alternative approach for drug repositioning that can be easily extended to other disease phenotypes.
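The heat diffusion step in the Methods can be sketched in a few lines. This is an illustration of insulated heat diffusion, the kind of process HotNet2 is built on, and not the HotNet2 software itself: it only spreads the initial heat over the network and omits the later steps that extract subnetworks. The toy network, the heat values, and the restart fraction `beta` are all made up for illustration.

```python
def diffuse_heat(adjacency, heat, beta=0.4, iterations=200):
    """Iterate x <- (1 - beta) * W x + beta * h until (near) steady state.

    W is the column-normalized adjacency matrix: each node passes a fraction
    (1 - beta) of its heat to its neighbors in equal shares, and a fraction
    beta of the initial heat h is re-injected at every step.
    Assumption: every node has at least one neighbor.
    """
    current = dict(heat)
    for _ in range(iterations):
        updated = {node: beta * heat[node] for node in adjacency}
        for node, neighbors in adjacency.items():
            share = (1 - beta) * current[node] / len(neighbors)
            for neighbor in neighbors:
                updated[neighbor] += share
        current = updated
    return current

# Hypothetical path network a - b - c - d, with all initial heat on gene "a".
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
initial_heat = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}
final_heat = diffuse_heat(adjacency, initial_heat)
print(final_heat)
```

In this toy case the steady-state heat decreases with distance from the source gene while the total heat is conserved. This matches the intuition that high-scoring genes share their score mainly with nearby genes in the network.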
Background: Computational tools have been widely used in the drug discovery process because they reduce time and cost. Predicting whether a protein is druggable is fundamental and crucial for the drug research pipeline. Sequence-based protein function prediction plays a vital role in many research areas. Training data, protein feature selection, and machine learning algorithms are three indispensable elements that drive the success of the models.
Methods: In this study, we tested the performance of different combinations of protein features and machine learning algorithms, based on the targets of FDA-approved small molecules, in druggable protein prediction. We also enlarged the dataset to include the targets of small molecules that were in experimental or clinical investigation.
Results: We found that although the 146-d vector used by Li et al. with a neural network achieved the best training accuracy of 91.10%, overlapped 3-gram word2vec with logistic regression achieved the best prediction accuracy on the independent test set (89.55%) and on newly approved targets. Models were also trained on the enlarged dataset containing targets of small molecules in experimental and clinical investigation. Unfortunately, the best training accuracy was only 75.48%. In addition, we applied our models to predict potential targets as a reference for future studies.
Conclusions: Our study indicates the potential of word2vec in the prediction of druggable proteins. It also suggests that the training dataset of druggable proteins should not be extended to targets that lack verification. The target prediction package is available at https://github.com/pkumdl/target_prediction.
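The "overlapped 3-gram" representation mentioned in the Results can be sketched as a tokenization step. The example below assumes that "overlapped" means a sliding window of length 3 with stride 1 over the amino acid sequence. The function name is hypothetical and the code is not taken from the authors' package. The resulting tokens are the kind of "words" that a word2vec model would be trained on.

```python
def overlapping_kmers(sequence, k=3):
    """Split a protein sequence into overlapping k-mers (sliding window, stride 1)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A short made-up peptide, used only to show the tokenization.
tokens = overlapping_kmers("MKTAYIAK")
print(tokens)
```

Each protein then becomes a "sentence" of 3-gram tokens. One common way to obtain a fixed-length feature vector for a classifier such as logistic regression is to average or sum the learned token embeddings, although the abstract does not say which aggregation the authors used.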
Background: Quantitative analysis of mitochondrial morphology plays an important role in studies of mitochondrial biology. The analysis depends critically on segmentation of mitochondria, the image analysis process of extracting mitochondrial morphology from images. The main goal of this study is to characterize the performance of convolutional neural networks (CNNs) in segmentation of mitochondria from fluorescence microscopy images. Recently, CNNs have achieved remarkable success in challenging image segmentation tasks in several disciplines. So far, however, our knowledge of their performance in segmenting biological images remains limited. In particular, we know little about their robustness, which defines their capability of segmenting biological images under different conditions, and their sensitivity, which defines their capability of detecting subtle morphological changes of biological objects.
Methods: We have developed a method that uses realistic synthetic images under different conditions to characterize the robustness and sensitivity of CNNs in segmentation of mitochondria. Using this method, we compared the performance of two widely adopted CNNs: the fully convolutional network (FCN) and the U-Net. We further compared the two networks against the adaptive active-mask (AAM) algorithm, a representative of high-performance conventional segmentation algorithms.
Results: The FCN and the U-Net consistently outperformed the AAM in accuracy, robustness, and sensitivity, often by a significant margin. The U-Net provided overall the best performance.
Conclusions: Our study demonstrates the superior performance of the U-Net and the FCN in segmentation of mitochondria. It also provides quantitative measurements of the robustness and sensitivity of these networks, which are essential to their application in quantitative analysis of mitochondrial morphology.
Background: The Oxford MinION nanopore sequencer is an appealing recent third-generation genome sequencing device that is portable and no larger than a cellphone. Despite the ability of MinION to sequence ultra-long reads in real time, the high error rate of the existing base-calling methods, especially for indels (insertions and deletions), prevents its use in a variety of applications.
Methods: In this paper, we show that such indel errors are largely due to the segmentation process on the input electrical current signal from MinION. All existing methods conduct segmentation and nucleotide label prediction sequentially, so that errors accumulated in the first step irreversibly influence the final base-calling. We further show that the indel issue can be significantly reduced by accurately labeling nucleotides and moves directly on the raw signal. Both label types can then be learned efficiently and simultaneously by a bi-directional WaveNet model through feature sharing. Our bi-directional WaveNet model with residual blocks and skip connections is able to capture extremely long-range dependencies in the raw signal. Taking the predicted move as the segmentation guidance, we employ Viterbi decoding to obtain the final base-calling results from the smoothed nucleotide probability matrix.
Results: Our proposed base-caller, WaveNano, achieves good performance on real MinION sequencing data from Lambda phage.
Conclusions: The signal-level nanopore base-caller WaveNano can obtain higher base-calling accuracy, and generate fewer insertions/deletions in the base-called sequences.