Background: Quantitative analysis of mitochondrial morphology plays an important role in studies of mitochondrial biology. This analysis depends critically on segmentation of mitochondria, the image analysis process of extracting mitochondrial morphology from images. The main goal of this study is to characterize the performance of convolutional neural networks (CNNs) in segmenting mitochondria from fluorescence microscopy images. Recently, CNNs have achieved remarkable success in challenging image segmentation tasks in several disciplines. So far, however, our knowledge of their performance in segmenting biological images remains limited. In particular, we know little about their robustness, defined as their capability to segment biological images acquired under different conditions, and their sensitivity, defined as their capability to detect subtle morphological changes of biological objects.
Methods: We have developed a method that uses realistic synthetic images of different conditions to characterize the robustness and sensitivity of CNNs in segmentation of mitochondria. Using this method, we compared performance of two widely adopted CNNs: the fully convolutional network (FCN) and the U-Net. We further compared the two networks against the adaptive active-mask (AAM) algorithm, a representative of high-performance conventional segmentation algorithms.
Results: The FCN and the U-Net consistently outperformed the AAM in accuracy, robustness, and sensitivity, often by a significant margin. The U-Net provided overall the best performance.
Conclusions: Our study demonstrates superior performance of the U-Net and the FCN in segmentation of mitochondria. It also provides quantitative measurements of the robustness and sensitivity of these networks that are essential to their applications in quantitative analysis of mitochondrial morphology.
Background: The Oxford MinION nanopore sequencer is a recently introduced third-generation genome sequencing device that is portable and no larger than a cellphone. Despite MinION's ability to sequence ultra-long reads in real time, the high error rate of existing base-calling methods, especially for indels (insertions and deletions), prevents its use in a variety of applications.
Methods: In this paper, we show that such indel errors are largely due to the segmentation of the input electrical current signal from MinION. All existing methods conduct segmentation and nucleotide label prediction sequentially, so errors accumulated in the first step irreversibly influence the final base-calling. We further show that the indel issue can be significantly reduced by accurately predicting nucleotide and move labels directly from the raw signal, which can be learned efficiently and simultaneously by a bi-directional WaveNet model through feature sharing. Our bi-directional WaveNet model with residual blocks and skip connections is able to capture the extremely long-range dependencies in the raw signal. Taking the predicted moves as segmentation guidance, we employ Viterbi decoding to obtain the final base-calling results from the smoothed nucleotide probability matrix.
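The final decoding step described above can be illustrated with standard Viterbi decoding over a per-position nucleotide probability matrix. This is a generic sketch, not WaveNano's actual implementation: the transition model, matrix shapes, and function names here are assumptions for illustration.

```python
import numpy as np

def viterbi(emission_logp, transition_logp, init_logp):
    """Standard Viterbi decoding: find the most likely state path given
    per-position emission log-probabilities (T x S), a state transition
    log-probability matrix (S x S), and initial log-probabilities (S,)."""
    T, S = emission_logp.shape
    dp = np.full((T, S), -np.inf)      # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers for path recovery
    dp[0] = init_logp + emission_logp[0]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + transition_logp  # (prev_state, state)
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + emission_logp[t]
    # Backtrack from the best final state.
    path = np.zeros(T, dtype=int)
    path[-1] = dp[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```

In the base-calling setting, the emission matrix would come from the network's smoothed nucleotide probabilities and the predicted moves would constrain which transitions are allowed.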
Results: Our proposed base-caller, WaveNano, achieves good performance on real MinION sequencing data from Lambda phage.
Conclusions: The signal-level nanopore base-caller WaveNano can obtain higher base-calling accuracy and generate fewer insertions/deletions in the base-called sequences.
Background: Most intronic lariats are rapidly turned over after splicing. However, new research suggests that some introns may have additional post-splicing functions. Current bioinformatics methods used to identify lariats require a sequencing read that traverses the lariat branchpoint. This approach provides precise branchpoint sequence and position information, but is limited in its ability to quantify the abundance of stabilized lariat species in a given RNAseq sample. Bioinformatic tools are needed to better address these emerging biological questions.
Methods: We used an unsupervised machine learning approach on sequencing reads from publicly available ENCODE data to learn to identify and quantify lariats based on RNAseq read coverage shape.
Results: We developed ShapeShifter, a novel approach for identifying and quantifying stable lariat species in RNAseq datasets. We learned a characteristic “lariat” curve from ENCODE RNAseq data and were able to estimate intron abundances based on read coverage. Using this method, we discovered new stable introns in these samples that were not detected by the older, branchpoint-traversing read method.
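The abundance-estimation idea can be illustrated with a minimal sketch: fit a learned template coverage curve to the observed per-base read coverage and take the best non-negative scale factor as the abundance estimate. This single-scale least-squares model and the function names are assumptions for illustration; ShapeShifter's actual procedure is more involved.

```python
import numpy as np

def estimate_abundance(coverage, template):
    """Illustrative sketch: estimate a lariat's abundance as the
    non-negative least-squares scale factor that best fits a learned
    template coverage curve to the observed read coverage. Both
    vectors are assumed to be resampled to a common length."""
    template = np.asarray(template, dtype=float)
    coverage = np.asarray(coverage, dtype=float)
    denom = template @ template
    if denom == 0.0:
        return 0.0  # degenerate template: no signal to fit
    # argmin_s ||coverage - s * template||^2, clipped to s >= 0
    return max(0.0, (template @ coverage) / denom)
```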
Conclusions: ShapeShifter provides a robust approach towards detecting and quantifying stable lariat species.
Background: Multiplexed milliliter-scale chemostats are useful for measuring cell physiology under various degrees of nutrient limitation and for carrying out evolution experiments. In each chemostat, fresh medium containing a growth rate-limiting metabolite is pumped into the culturing chamber at a constant rate, while culture effluent exits at an equal rate. Although such devices have been developed by various labs, key parameters — the accuracy, precision, and operational range of flow rate — are not explicitly characterized.
Methods: Here we re-purpose a published multiplexed culturing device to develop a multiplexed milliliter-scale chemostat. Flow rates for eight chambers can be independently controlled over a wide range, corresponding to population doubling times of 3–13 h, without the use of expensive feedback systems.
Results: Flow rates are precise, with the maximal coefficient of variation among the eight chambers being less than 3%. Flow rates are accurate, with average flow rates falling only slightly below targets, i.e., by 3%–6% for 13-h and 0.6%–1.0% for 3-h doubling times. This deficit is largely due to evaporation and should be correctable. We experimentally demonstrate that our device allows accurate and precise quantification of population phenotypes.
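The mapping between a target doubling time and the required pump flow rate follows from chemostat steady state, where the dilution rate D = F/V equals the growth rate μ = ln(2) / t_d. A small sketch of this textbook relation (the chamber volume used in the example is hypothetical, not taken from the device described here):

```python
import math

def flow_rate_ml_per_h(volume_ml, doubling_time_h):
    """At chemostat steady state the dilution rate D = F/V equals the
    population growth rate, so a target doubling time t_d fixes the
    feed flow rate: F = V * ln(2) / t_d."""
    return volume_ml * math.log(2) / doubling_time_h

# Hypothetical 20 mL chamber: a shorter doubling time needs a faster feed.
fast = flow_rate_ml_per_h(20.0, 3.0)   # 3-h doubling time
slow = flow_rate_ml_per_h(20.0, 13.0)  # 13-h doubling time
```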
Conclusions: We achieve precise control of cellular growth in a low-cost milliliter-scale chemostat array, and show that the achieved precision reduces the error when measuring biological processes.
Background: Module detection is widely used to analyze and visualize biological networks, and a number of methods and tools have been developed for it. Bipartite module detection is likewise very useful for mining and analyzing bipartite biological networks, and a few methods have been developed for this purpose. However, there are few user-friendly toolkits for this task.
Methods: To this end, we develop BMTK, an online web toolkit that implements seven existing methods.
Results: BMTK provides a uniform operation platform and visualization functions, standardizes input and output formats, and improves the algorithmic structure to enhance computing speed. We also apply the toolkit to a drug-target bipartite network to demonstrate its effectiveness.
Conclusions: BMTK will be a powerful tool for detecting bipartite modules in diverse bipartite biological networks.
Availability: The web application is freely accessible at http://www.zhanglabtools.net/BMTK.
Background: Gene co-expression and differential co-expression analyses have been increasingly used to study co-functional and co-regulatory biological mechanisms from large-scale transcriptomics data sets.
Methods: In this study, we develop MRHCA, a nonparametric approach to identify hub genes and modules in a large co-expression network with low computational and memory cost.
Results: We have applied the method to simulated transcriptomics data sets and demonstrated that MRHCA can accurately identify hub genes and estimate the size of co-expression modules. By applying MRHCA and differential co-expression analysis to E. coli and TCGA cancer data, we identified significant condition-specific activated genes in E. coli and distinct gene expression regulatory mechanisms between the cancer types with high copy number variation and those with few somatic mutations.
Conclusions: Our analysis demonstrates that, compared with existing methods, MRHCA can (i) deal with large association networks, (ii) rigorously assess statistical significance for hubs and module sizes, (iii) identify co-expression modules with low associations, (iv) detect small and significant modules, and (v) allow genes to be present in more than one module.
Background: Sequence-specific binding by transcription factors (TFs) plays a significant role in the selection and regulation of target genes. At the protein:DNA interface, amino acid side-chains construct a diverse physicochemical network of specific and non-specific interactions, and seemingly subtle changes in amino acid identity at certain positions may dramatically impact TF:DNA binding. Variation of these specificity-determining residues (SDRs) is a major mechanism of functional divergence between TFs with strong structural or sequence homology.
Methods: In this study, we employed a combination of high-throughput specificity profiling by SELEX and Spec-seq, structural modeling, and evolutionary analysis to probe the binding preferences of winged helix-turn-helix TFs belonging to the OmpR sub-family in Escherichia coli.
Results: We found that E. coli OmpR paralogs recognize tandem, variably spaced repeats composed of “GT-A” or “GCT”-containing half-sites. Some divergent sequence preferences observed within the “GT-A” mode correlate with amino acid similarity; conversely, “GCT”-based motifs were observed for a subset of paralogs with low sequence homology. Direct specificity profiling of a subset of OmpR homologues (CpxR, RstA, and OmpR) as well as predicted “SDR-swap” variants revealed that individual SDRs may impact sequence preferences locally through direct contact with DNA bases or distally via the DNA backbone.
Conclusions: Overall, our work provides evidence for a common structural “code” for sequence-specific wHTH-DNA interactions, and demonstrates that surprisingly modest residue changes can enable recognition of highly divergent sequence motifs. Further examination of SDR predictions will likely reveal additional mechanisms controlling the evolutionary divergence of this important class of transcriptional regulators.
Background: Random Forests is a popular classification and regression method that has proven powerful for various prediction problems in biological studies. However, its performance often deteriorates when the number of features increases. To address this limitation, feature elimination Random Forests was proposed, which uses only the features with the largest variable importance scores. Yet the performance of this method is not satisfying, possibly due to its rigid feature selection and the increased correlation between the trees of the forest.
Methods: We propose variable importance-weighted Random Forests which, instead of sampling features with equal probability at each node to build up trees, samples features according to their variable importance scores and then selects the best split from the randomly selected features.
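The core sampling step can be sketched as follows. This is a minimal illustration of importance-weighted candidate-feature sampling, not the viRandomForests implementation; the function name and clipping rule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_candidate_features(importance, mtry):
    """At each node, draw `mtry` distinct candidate features with
    probability proportional to their variable importance scores
    (instead of uniformly); the best split is then chosen among the
    candidates as in standard Random Forests. Negative importance
    scores are clipped to zero before normalization."""
    w = np.clip(np.asarray(importance, dtype=float), 0.0, None)
    p = w / w.sum()
    return rng.choice(len(p), size=mtry, replace=False, p=p)
```

Because the weights only bias the draw rather than discard features outright, less informative features can still enter some trees, which is the stated advantage over rigid feature elimination.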
Results: We evaluate the performance of our method through comprehensive simulation and real data analyses, for both regression and classification. Compared to the standard Random Forests and the feature elimination Random Forests methods, our proposed method has improved performance in most cases.
Conclusions: By incorporating the variable importance scores into the random feature selection step, our method can better utilize more informative features without completely ignoring less informative ones, hence has improved prediction accuracy in the presence of weak signals and large noises. We have implemented an R package “viRandomForests” based on the original R package “randomForest” and it can be freely downloaded from http://zhaocenter.org/software.
Background: Precision medicine attempts to tailor the right therapy to the right patient. Recent progress in the large-scale collection of patients’ tumor molecular profiles in The Cancer Genome Atlas (TCGA) provides a foundation for the systematic discovery of potential drug targets specific to different types of cancer. However, we still lack powerful computational methods to effectively integrate multiple omics data and protein-protein interaction networks for optimum target and drug recommendation for an individual patient.
Methods: In this study, a computational method, Precision Medicine Target-Drug Selection (PMTDS), based on genetic interaction networks, is developed to select the optimum targets and associated drugs for precision-medicine-style treatment of cancer. The PMTDS system includes three parts: a personalized medicine knowledgebase for each cancer type, a genetic interaction network-based algorithm, and single-patient molecular profiles. The knowledgebase integrates cancer drugs, drug-target databases, and gene biological pathway networks. The molecular profile of each tumor consists of DNA copy number alterations, gene mutations, and tumor gene expression variation compared to its adjacent normal tissue.
Results: The integrated PMTDS system is applied to select candidate target-drug pairs for 178 TCGA pancreatic adenocarcinoma (PDAC) tumors. The experimental results show that known drug targets of PDAC treatment (EGFR, IGF1R, ERBB2, NR1I2, and AKR1B1) are identified, providing important evidence for the accuracy of the PMTDS algorithm. Other potential targets, PTK6, ATF, and SYK, are also recommended for PDAC. Further validation is provided by comparing the selected targets with both cell line molecular profiles from the Cancer Cell Line Encyclopedia (CCLE) and drug response data from the Cancer Therapeutics Response Portal (CTRP). Results from experimental analysis of forty-six individual pancreatic cancer samples show that drugs selected by PMTDS have greater sample-specific efficacy than the current clinical PDAC therapies.
Conclusions: A novel target and drug prioritization algorithm, PMTDS, is developed to identify optimum target-drug pairs by integrating the knowledgebase with a single patient’s genomics. The PMTDS system provides an accurate and reliable source for target and off-label drug selection in precision cancer medicine.
Background: In eukaryotic genomes, chromatin is not randomly distributed in cell nuclei, but instead is organized into higher-order structures. Emerging evidence indicates that these higher-order chromatin structures play important roles in regulating genome functions such as transcription and DNA replication. With the advancement of 3C (chromosome conformation capture)-based technologies, Hi-C has been widely used to investigate genome-wide long-range chromatin interactions during cellular differentiation and oncogenesis. Since the first publication of the Hi-C assay in 2009, many bioinformatic tools have been implemented for processing Hi-C data, from mapping raw reads to normalizing contact matrices and downstream interpretation, either providing a whole workflow pipeline or focusing on a particular step.
Results: This article reviews the general Hi-C data processing workflow and the currently popular Hi-C data processing tools. We highlight how these tools are used for a full interpretation of Hi-C results.
Conclusions: Hi-C assay is a powerful tool to investigate the higher-order chromatin structure. Continued development of novel methods for Hi-C data analysis will be necessary for better understanding the regulatory function of genome organization.