Towards precise reconstruction of gene regulatory networks by data integration

Zhi-Ping Liu

doi:10.1007/s40484-018-0139-4

Quant. Biol. 2018, 6(2): 113-128. DOI: 10.1007/s40484-018-0139-4

REVIEW


Abstract

Background: More and more high-throughput datasets are available that measure gene regulation at multiple levels. Reverse engineering gene regulatory networks from these data offers a valuable research paradigm for deciphering regulatory mechanisms. So far, numerous methods have been developed for reconstructing gene regulatory networks.

Results: In this paper, we review bioinformatics methods for inferring gene regulatory networks from omics data. To achieve precise reconstruction of gene regulatory networks, an intuitive approach is to integrate the available resources in a rational framework. We also provide computational perspectives on inferring gene regulatory networks from heterogeneous data. We highlight the importance of integrating multi-omics data with prior knowledge in gene regulatory network inference.

Conclusions: We provide computational perspectives on inferring gene regulatory networks from multiple omics data and present theoretical analyses of existing challenges and possible solutions. We emphasize prior knowledge and data integration in network inference owing to their ability to identify regulatory causality.


Keywords

gene regulatory network / computational inference / data integration / bioinformatics



INTRODUCTION

Owing to the availability of biomedical big data, the research paradigm of biomedical science is undergoing unprecedented changes and challenges [¹]. Gene regulation plays a central role in transforming genotypic information into phenotypic performance [²]. Many biomolecules are involved in the biological processes of gene regulation, and their relationships are often modeled as gene regulatory networks [³], where the nodes are these molecular players and the edges are their regulatory interactions. In healthy states, the dynamics underlying gene regulatory networks orchestrate the various physiological processes in a cell like a well-performed concerto; when the harmony of the network is broken, disorders arise that lead to dysfunction and disease [⁴]. The temporal and spatial rewiring of regulations indicates the causality of phenotypic transitions and differences [⁵].

High-throughput techniques such as ChIP-Seq directly identify protein-DNA interactions by mapping specific binding sites [⁶]. Prior knowledge about gene regulation, such as the binding sites of specific transcription factors (TFs) on genes, can be used to design ChIP-Seq experiments and analyze the resulting data [⁷]. However, the production of antibodies for specific TFs and the sensitivity of these experiments are still limited. Meanwhile, more and more gene expression profiling datasets are available for measuring the global transcriptomic status [⁸,⁹]. Epigenetic data such as DNA methylation and RNA modification provide further resources for deciphering gene regulation [¹⁰,¹¹]. How to reconstruct gene regulatory networks from these heterogeneous data has attracted intense attention in the bioinformatics research community ever since transcriptomic data became available [³].

In this work, we offer a brief review, with perspectives, of computational methods for reconstructing gene regulatory networks from transcriptomic data. We first introduce the problem and the computational complexity of inferring gene regulatory networks. Then, we review some available data resources and existing methods for network inference. We summarize the two types of errors made by these methods. To achieve precise network reconstruction, methods are formulated to remove these two types of errors, especially false positives. Finally, we highlight the integration of prior knowledge and of multiple-level omics data for accurately reconstructing a comprehensive gene regulatory network.

INFERRING GENE REGULATORY NETWORKS

Gene regulatory network reconstruction is a reverse engineering problem: inferring gene regulations from gene expression data measured by high-throughput techniques such as microarray [¹²] or RNA-seq [¹³]. These techniques measure the transcript abundances of many genes in parallel. The relationships between genes are expected to be reconstructed from their expression profiles across the samples, and are assembled into a network representing their causal interactions.

Usually, gene expression data are represented by a matrix $G$:

$$
G = \begin{pmatrix} g_1 \\ \vdots \\ g_j \\ \vdots \\ g_n \end{pmatrix}
= \begin{pmatrix}
a_{11} & \cdots & a_{i1} & \cdots & a_{p1} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
a_{1j} & \cdots & a_{ij} & \cdots & a_{pj} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
a_{1n} & \cdots & a_{in} & \cdots & a_{pn}
\end{pmatrix},
$$

where $a_{ij}$ represents the expression value of the j-th gene ($1 \le j \le n$) in the i-th experiment ($1 \le i \le p$). Note that $i$ refers to the sample and $j$ refers to the gene. The framework of gene regulatory network inference is shown in Figure 1. The task is to infer, in reverse, the regulatory relationships shown in Figure 1B from the gene expression profiles (Figure 1A). Because of the paramount importance of gene regulation, numerous methods have been proposed to tackle gene regulatory network inference from the data matrix $G$.

Biologically, experiments are often designed to probe specific physiological conditions with a limited number of samples and the necessary biological replicates. For instance, one gene expression profiling experiment was designed to study Huh7 cells after hepatitis C virus (HCV) infection. It contains three replicates at each of 6, 12, 18, 24, and 48 hours post-infection [¹⁴]. In total, there are 18 microarray samples, including 3 controls taken before HCV infection. The gene chip platform contains about 25,000 genes. In this case, the aim of network inference is to reconstruct the gene regulatory network of these human cells in response to HCV infection. The gene expression profiles are only snapshots of gene expression abundance at a few time points after viral infection, whereas our task is to work out a global map of regulatory relationships among thousands of genes. This is much like reconstructing the plot of a 48-hour movie from only 5 blurred frames, and deducing from them the relationships among its 25,000 performers (genes). This difficulty is intrinsic to reverse engineering gene regulatory networks from expression data, not to mention the noise of these high-throughput data-generating techniques, which are still under development [¹⁵].

Methodologically, gene regulatory network inference is inherently difficult because of the curse of dimensionality. It is a typical large n, small p problem: the dimension of the inference problem (the number of genes) is very large (large n, often ~30,000), while only a few samples are available from the high-throughput experiments (small p, often ~10), i.e., $p \ll n$. The constraints generated from these samples are so few that many combinations of regulatory interactions are consistent with the data. The feasible solution space is therefore huge, which makes the search for the optimal solution, the genuine regulations, very difficult.

In computational detail, suppose there are n genes in a regulatory network. We model their regulatory dynamics by ordinary differential equations (ODEs) of their expression levels as follows:

$$
\begin{cases}
g_1' = \dfrac{dg_1}{dt} = c_{11} g_1 + c_{12} g_2 + \cdots + c_{1n} g_n \\
g_2' = \dfrac{dg_2}{dt} = c_{21} g_1 + c_{22} g_2 + \cdots + c_{2n} g_n \\
\quad \cdots \\
g_n' = \dfrac{dg_n}{dt} = c_{n1} g_1 + c_{n2} g_2 + \cdots + c_{nn} g_n
\end{cases}
\tag{1}
$$

where $g_j'$ denotes the first-order derivative of the expression level of gene $g_j$. Under the assumption of linear gene regulatory relationships, the change in expression of a gene is caused by the combinatorial regulation of the genes in the system, as reflected by their expression levels. Thus, gene regulatory network inference is formulated as a parameter identification problem for these equations. When the coefficients $c_{ij}$ are determined from gene expression data, the regulatory relationships are specified and the regulatory network is reconstructed.

Compared with standard linear equations in algebra, the roles of the quantities in Equation (1) are reversed: the coefficients are the unknowns, and the expression levels are the known quantities. Suppose the number of experiments is p. Let $(c_{11}, c_{12}, \cdots, c_{1n})^T = (x_1, x_2, \cdots, x_n)^T$. The determination of these coefficients for $g_1'$ is then formulated as a system of linear equations in standard form:

$$
\begin{cases}
a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = b_{11} \\
a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n = b_{21} \\
\quad \cdots \\
a_{p1} x_1 + a_{p2} x_2 + \cdots + a_{pn} x_n = b_{p1}
\end{cases}
\tag{2}
$$

where the matrix $A = (a_{ij})_{p \times n}$ contains the gene expression values, $a_{ij}$ is the expression of $g_j$ in the $i$-th experiment, and $b_{i1}$ ($1 \le i \le p$) is the value of the derivative $g_1'$ in the i-th experiment, which is often evaluated by approximation [¹⁶]. In typical experimental designs, the number of experiments p is very limited because of limited resources. Thus, $\mathrm{Rank}(A) \le p \ll n$, and the equations, when consistent, have infinitely many solutions. It is very hard to single out an optimal solution for $g_1$ with few additional constraints. Note that Equation (2) concerns only the coefficients of $g_1'$. The other genes face the same situation, because the gene expression values on the left-hand sides are the same and only the response derivatives on the right-hand sides differ. The solution space of the coefficients in Equation (1) is so large that it is difficult to obtain the optimal solution for each gene, i.e., the inferred regulatory coefficients between genes.
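To make the non-uniqueness concrete, the following minimal sketch builds a toy instance of Equation (2) with p = 2 experiments and n = 3 genes. All numbers, and the names `matvec`, `x_true` and `x_alt`, are assumptions chosen for illustration only. It shows that two different coefficient vectors reproduce exactly the same derivative values, so the data alone cannot distinguish them.

```python
# Toy instance of Equation (2): p = 2 experiments, n = 3 genes.
# Rows of A are experiments, columns are genes (hypothetical values).
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# An assumed "true" coefficient vector (c_11, c_12, c_13).
x_true = [1.0, 0.0, -1.0]
b = matvec(A, x_true)  # the derivative values b_11 and b_21

# (1, -2, 1) lies in the null space of A, so adding any multiple of it
# to x_true gives another vector that satisfies the same equations.
null_vec = [1.0, -2.0, 1.0]
x_alt = [x + 2.5 * v for x, v in zip(x_true, null_vec)]

print(b)                 # right-hand side produced by x_true
print(matvec(A, x_alt))  # identical right-hand side from a different x
```

With ~30,000 genes and ~10 samples, the null space has dimension of at least n − p, which is why additional constraints such as sparsity or prior knowledge are needed to select one solution.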

Theoretically, the parameters of the ODE system in Equation (1) are identifiable, i.e., they have unique solutions, when enough experimental gene expression data are available. Under practical experimental constraints, the limited observational data render the gene regulatory model only partially identifiable [¹⁷]. Moreover, this model is very simple compared with the true gene regulatory processes in cells. More advanced models that include time-varying parameters, dynamic gene interactions and higher-order derivatives of gene expression make the computational task of gene regulatory network inference much more complicated [¹⁶].

AVAILABLE RESOURCES AND METHODS

Because gene regulation is central to biological processes, numerous resources have become available for characterizing gene regulatory systems at multiple molecular levels. As described in the previous section, gene regulatory networks were originally reconstructed from gene expression profiling data, which are deposited in databases such as GEO [⁸], ArrayExpress [⁹] and SRA [¹⁸]. Table 1 [¹⁹-⁴⁴] lists some currently available resources for inferring gene regulatory networks. For instance, the main components of gene regulation, such as genes, TFs, miRNAs and proteins, can be accessed from GenBank [²¹], UniProt [⁴³], miRBase [³⁸] and PDB [⁴²], respectively. Other elements and effects, such as ncRNA [³³], RNA methylation [²⁶] and chromatin accessibility [⁴¹], are gradually being recognized as important for understanding gene regulation in greater detail. Some prior information on gene regulation is also deposited in databases such as RegNetwork [³²], which collects known and predicted transcriptional regulations (TF-gene) and post-transcriptional regulations (miRNA-mRNA) from various databases and the literature. These existing resources provide the material for building computational methods that reconstruct regulatory networks.

The fact that regulatory components and information are documented in many different databases also poses great challenges for integrating the available data. The elements located in different databases are often curated by different research groups and institutes. Thus, uniform identifiers (IDs) for regulators and targets are very important in practical network inference. ID mapping is time-consuming; some ID mapping tools, such as that of UniProt [⁴³], provide online services for converting component IDs between databases. When integrating data from different resources to infer a gene regulatory network, the IDs should be kept consistent.

So far, numerous methods have been proposed for reconstructing gene regulatory networks. Table 2 lists some representative methods. We group them into six categories: association networks, Bayesian networks, Boolean networks, differential-equation-based methods, knowledge-based methods, and machine-learning-based methods. Here, we briefly introduce the main idea of each category; please see our paper [³] and the references therein for more details on these methods.

The first category comprises association-based methods. These methods quantify the expression relationships between genes by defining association measures. Gene pairs are identified as regulatory interactions when their associations are significant, as in WGCNA [⁵⁹]. Let the expression profiles of gene X and gene Y be $X = (X_1, X_2, \ldots, X_p)$ and $Y = (Y_1, Y_2, \ldots, Y_p)$, respectively. From these two sample vectors, Pearson’s correlation coefficient (PCC) can be easily calculated. WGCNA defines

$$
s_{XY}^{\mathrm{unsigned}} = |\mathrm{cor}(X, Y)|, \qquad
s_{XY}^{\mathrm{signed}} = \frac{1}{2} + \frac{1}{2}\,\mathrm{cor}(X, Y)
$$

as the association scores for unsigned or signed regulation, respectively. Based on these pairwise correlation values, it uses a topology parameter to set the threshold for linking genes with edges. Thus, a gene coexpression network is built. Similarly, other association measures such as mutual information can be employed. Specifically, the mutual information between X and Y in terms of entropy is $I(X, Y) = H(X) + H(Y) - H(X, Y)$, where $H(X)$, $H(Y)$ and $H(X, Y)$ are the marginal entropies of X and Y and their joint entropy, respectively. Many other association-based methods have been proposed because of their simplicity. An association network is often the first attempt at investigating the relationships between genes. Given their popularity, we provided a comprehensive comparison of their performance in reconstructing gene regulatory networks in [⁶⁰].
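The association scores above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the WGCNA implementation: the function names, the toy expression profiles and the median-based `binarize` discretization used before estimating entropies are all assumptions made for the example.

```python
import math
import statistics
from collections import Counter


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def unsigned_similarity(x, y):
    """Unsigned score |cor(X, Y)|."""
    return abs(pearson(x, y))


def signed_similarity(x, y):
    """Signed score 1/2 + 1/2 cor(X, Y)."""
    return 0.5 + 0.5 * pearson(x, y)


def binarize(values):
    """Assumed discretization: True if above the median, else False."""
    cut = statistics.median(values)
    return [v > cut for v in values]


def entropy(symbols):
    """Empirical Shannon entropy (natural log) of a discrete sequence."""
    n = len(symbols)
    return -sum(c / n * math.log(c / n) for c in Counter(symbols).values())


def mutual_information(x, y):
    """I(X, Y) = H(X) + H(Y) - H(X, Y) for discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


# Toy profiles over p = 6 samples: gene_y falls as gene_x rises.
gene_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_y = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]

print(unsigned_similarity(gene_x, gene_y))
print(signed_similarity(gene_x, gene_y))
print(mutual_information(binarize(gene_x), binarize(gene_y)))
```

For this perfectly anti-correlated pair, the unsigned score is maximal while the signed score is minimal, and the mutual information is high regardless of sign. This illustrates why the choice of association measure matters when the sign of a regulation is of interest.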

Because pairwise associations also capture indirect relationships, some association-based methods improve network inference through conditional measures such as the partial correlation coefficient and conditional mutual information [⁶⁰]. By introducing another gene or gene set when evaluating the association values, the false positives arising from indirect regulations can be eliminated and the inference accuracy improved. In this spirit, an effective method named ARACNE [⁶¹] employs the data processing inequality to remove indirect mutual information between gene pairs. We proposed a method named PCA-CMI [⁴⁵] based on conditional mutual information. The strategy has proven effective in eliminating false positive inferences. We describe it in detail in the following sections.

The second category is based on Bayesian network models. These are typical graphical models that represent the interdependence of genes via a directed graph. Suppose the regulatory network is a directed acyclic graph (DAG) $G$, and let $P(G \mid D)$ be the probability of the graph given the data $D$, which measures the consistency between graph and data. According to Bayes' theorem,

$$
P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)},
$$

and this posterior can be evaluated from the formula [⁶²]. To extend DAGs to general network structures, dynamic Bayesian networks are employed to handle time series data [⁶³]. Dynamic Bayesian network models are powerful and flexible for inferring gene regulatory networks from time-course gene expression data. These methods have been successfully applied to identify various regulatory circuits [⁶⁴].
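As a small worked instance of Bayes' theorem for graph scoring, the sketch below normalizes the scores of a handful of candidate graphs into posterior probabilities. The candidate graphs, their log-likelihood values and the uniform prior are invented for illustration; in practice the likelihood $P(D \mid G)$ would come from a scoring function fitted to expression data, and the space of candidate graphs is far too large to enumerate.

```python
import math


def graph_posteriors(log_likelihoods, priors):
    """Posterior P(G | D) for each candidate graph via Bayes' theorem."""
    # log P(D | G) + log P(G) for every candidate graph G.
    log_joint = {g: log_likelihoods[g] + math.log(priors[g]) for g in priors}
    # The normalizer plays the role of P(D); the maximum is subtracted
    # before exponentiating to avoid numerical underflow.
    top = max(log_joint.values())
    scaled_evidence = sum(math.exp(v - top) for v in log_joint.values())
    return {g: math.exp(v - top) / scaled_evidence for g, v in log_joint.items()}


# Hypothetical scores for three candidate structures over genes X and Y.
log_likelihoods = {"X->Y": -10.0, "Y->X": -12.0, "no edge": -15.0}
priors = {"X->Y": 1 / 3, "Y->X": 1 / 3, "no edge": 1 / 3}

posteriors = graph_posteriors(log_likelihoods, priors)
for graph, prob in posteriors.items():
    print(graph, round(prob, 4))
```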

The third category is based on Boolean networks. The logic operators “AND ($\wedge$)”, “OR ($\vee$)”, and “NOT ($\neg$)” are employed to describe the decision-making regulatory processes of gene expression [⁶⁵]. Suppose genes X, Y and Z are in a Boolean network $G(V, F)$, where $V$ is its node set and $F$ is its function set. The expression of Z is determined by a Boolean operation on the expressions of X and Y, such as $Z = f(X, Y) = X \wedge Y$, $f \in F$. The Boolean assumption corresponds to the “ON” and “OFF” states of gene regulation [⁶⁶]. Because gene regulation is stochastic, Boolean networks have been extended to include probabilities in the logic operations, i.e., probabilistic Boolean networks [⁵¹]. Modeling the state transitions of gene regulation as combinatorial logic computations has been shown to be feasible. Because logic computation places only loose requirements on the number of experimental samples, these methods are suitable for inferring gene regulatory networks from few gene expression samples [single cell].
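The rule $Z = X \wedge Y$ can be written as a tiny synchronous Boolean network. In this sketch the dictionary `rules` stands in for the function set $F$ and its keys for the node set $V$; treating X and Y as fixed inputs is an assumption made to keep the example short.

```python
from itertools import product


def step(state, rules):
    """Synchronously update every gene from the current network state."""
    return {gene: rule(state) for gene, rule in rules.items()}


# Function set F of a three-gene network (V = {X, Y, Z}).
rules = {
    "X": lambda s: s["X"],             # X is held constant as an input
    "Y": lambda s: s["Y"],             # Y is held constant as an input
    "Z": lambda s: s["X"] and s["Y"],  # Z = X AND Y
}

# Enumerate all ON/OFF input combinations and report the next state of Z.
for x, y in product([False, True], repeat=2):
    next_state = step({"X": x, "Y": y, "Z": False}, rules)
    print(f"X={x!s:5} Y={y!s:5} -> Z={next_state['Z']}")
```

Inference then amounts to searching for the rules in $F$ that are consistent with the observed ON/OFF state transitions.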

The fourth category is based on differential equations. As described in the previous section, the derivative of the expression of a gene is modeled as the response variable, and the expression levels of the genes regulating it as explanatory variables, i.e., $\frac{dx_j}{dt} = \sum_{i=1}^{n} \beta_{ji} x_i + \beta_{j0}$, $j = 1, \cdots, n$, where $\beta_{ji}$ is the regulatory coefficient of gene i on gene j and $\beta_{j0}$ is an intercept. After the parameters of these differential equations are solved, the regulatory system describing the gene regulatory interactions is established [¹⁶]. Equations with higher-order terms and interactions between variables yield more complicated models [⁵³].
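The following sketch illustrates parameter identification for the linear model in the favourable case where there are more time points than genes. It simulates a hypothetical two-gene system, approximates the derivatives by forward differences, and recovers the coefficients by least squares. The coefficient matrix `C_true`, the initial state, the step size and the omission of the intercept are all assumptions made for illustration.

```python
def simulate(C, x0, dt, steps):
    """Forward-Euler trajectory of dx/dt = C x (intercept omitted)."""
    traj = [list(x0)]
    for _ in range(steps):
        x = traj[-1]
        dx = [sum(c * v for c, v in zip(row, x)) for row in C]
        traj.append([v + dt * d for v, d in zip(x, dx)])
    return traj


def fit_row(states, derivs):
    """Least squares for one target gene with two regulators,
    solved through the 2x2 normal equations by Cramer's rule."""
    s11 = sum(s[0] * s[0] for s in states)
    s12 = sum(s[0] * s[1] for s in states)
    s22 = sum(s[1] * s[1] for s in states)
    t1 = sum(s[0] * d for s, d in zip(states, derivs))
    t2 = sum(s[1] * d for s, d in zip(states, derivs))
    det = s11 * s22 - s12 * s12
    return [(s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det]


def infer_coefficients(traj, dt):
    """Estimate each row of C from forward-difference derivatives."""
    states = traj[:-1]
    C_hat = []
    for j in range(2):
        derivs = [(nxt[j] - cur[j]) / dt for cur, nxt in zip(traj, traj[1:])]
        C_hat.append(fit_row(states, derivs))
    return C_hat


# Hypothetical regulatory coefficients: gene 2 activates gene 1, and so on.
C_true = [[-0.5, 0.3],
          [0.2, -0.4]]
trajectory = simulate(C_true, x0=[1.0, 0.5], dt=0.1, steps=20)
C_hat = infer_coefficients(trajectory, dt=0.1)
print(C_hat)
```

The recovery works here only because the data are noise-free and the number of time points (20) exceeds the number of genes (2); with $p \ll n$ the normal equations become singular and regularization or prior knowledge is required.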

The fifth category comprises knowledge-based methods, which build on prior domain knowledge about gene regulation, for instance a regulatory relationship between TF X and gene Y that has been validated by experiments and documented in the literature. Such priors can be accessed freely from RegNetwork [³²]. With these priors, the inference is constrained to remain within the feasible region. For example, if we know that X always activates Y, the model enforces $X \to Y$ during inference, which helps avoid misidentification and randomness for this regulation. With this in mind, and to determine whether a regulation exists in specific cases and conditions, this type of method evaluates the probability that a regulation between X and Y exists under particular circumstances [⁶⁷]. The consistency between known regulations and phenotypic data is assessed to recognize gene regulations. For instance, we proposed a network-based screening method that uses graphical models to evaluate prior regulatory networks against gene expression profiles [⁵⁵,⁵⁶]. The maximum likelihood of the consistency between them can be evaluated by mathematical programming. Using documented regulatory pathways, we identified the activated regulations in the cell cycle [⁵⁵] and during viral infection [⁵⁶], respectively.

Last but not least, machine-learning-based methods transform the inference of a regulation between genes X and Y into the prediction of their regulatory interaction by a trained classifier [³]. The basic idea is first to learn the features underlying known regulatory relationships from the gene expression and/or sequences of X and Y. For instance, the regulatory relationship is modeled as $R = f(X, Y)$ by a machine learning method, where $f$ is an abstract function that often has no explicit form [⁶⁸]. R is binary: label 1 indicates that there is a regulation between X and Y, and label 0 otherwise. Given new gene expression profiles, the trained classifier predicts new interactions between two genes. As listed in Table 2, TIGRESS combines LARS (least angle regression) with resampling of the samples and variables [⁵⁷]. It formulates gene regulatory network inference as a feature selection problem. Some other machine-learning-based methods rely on sequence and/or structure features of TFs and the binding sites of operons [⁶⁹]. The same process of training on known gene regulations and predicting novel regulatory pairs is used to reconstruct gene regulatory networks.

TYPE I AND II ERRORS IN INFERENCE

Figure 2 shows the two types of errors in gene regulatory network inference, i.e., type I and type II errors. These two errors are the main barriers to precisely reconstructing gene regulatory networks from data. False positive and false negative inferences arise for at least three reasons. The first is the complicated regulatory relationships between transcriptional regulators and targets. The second is the immaturity of the high-throughput technologies that generate the measured data. The third lies in the reverse engineering methods themselves.

Gene expression is controlled by regulators at many levels. TFs and cofactors initiate gene transcription. The mRNA products are silenced by small noncoding RNAs (ncRNAs), e.g., miRNAs [⁷⁰]. Long non-coding RNAs (lncRNAs) have been found to be crucial for coordinating gene expression through combinatorial regulation of mRNA expression levels [⁷¹]. Recently, circular RNAs (circRNAs) have also been found to act as sponges that influence gene expression [⁷²]. These findings indicate the complexity of gene regulatory networks. Some important regulators with crucial roles in controlling gene expression may not yet have been discovered.

Gene expression profiling techniques, such as microarray [¹²] and RNA-Seq [¹³], generate high-dimensional data describing expression abundances. These techniques are still developing and maturing [⁷³]. For instance, a microarray represents each gene by several oligonucleotides, carefully designed as probes on the array, and measures the expression of the corresponding gene by fluorescence [⁷⁴]. The subsequent data preprocessing pipelines still need to be improved. Thus, the quality of the measured datasets strongly affects the accuracy of gene regulatory network inference.

Reverse engineering methods implement various strategies for inferring gene regulatory relationships from the measured datasets [³]. The noise introduced into measured expression during RNA extraction from samples, as well as the aforementioned complexity of gene expression, restricts precise inference [¹⁵,⁷⁵]. The assumptions and limitations of these computational methods also constrain the accuracy of inferring gene regulatory networks from data.

Precise reconstruction of gene regulatory networks from data essentially amounts to reducing the two types of errors in inference. Specifically, type I errors are false positives: inferred regulations between genes that do not exist in truth. Type II errors are false negatives: regulations that exist but are not inferred (as shown in Figure 2B). Moreover, false positive inferences come in several forms. The first concerns regulatory direction. As shown in Figure 2A, a gene regulatory network is a directed graph. The upstream regulators are located at the arc tails and their downstream targets at the arc heads. If an arc is reversed during inference, the result is a wrong inference. The second concerns the sign of the regulation, positive or negative. Positive regulation enhances gene expression by promoting transcription, while negative regulation inhibits gene expression by repressing the transcriptional functions of TFs and RNA polymerase and their effective interactions with gene promoters and enhancers [⁷⁶]. If up- and down-regulation are confused in some context-specific physiological process, the inferred regulation is wrong. The third is the inference of non-existent regulations between genes, which makes up the major part of the false positives.

This major part of the false positives has many causes, such as gene cooperation in regulation. Existing reverse engineering methods usually produce a quantitative score for the regulation between a regulator and a target, and the existence or non-existence of a regulatory relationship between two genes is decided from this score. As shown in Figure 2C, the regulatory information transferred from X to Y is assessed by a probability (score) $P(X; Y)$. Suppose that the high score between X and Y is in fact caused by Z, i.e., X and Z cooperate in regulating Y, and it is gene Z that directly regulates gene Y. The combinatorial regulation by X and Z then produces an indirect regulation between X and Y, which is a false positive. We can use this idea to remove indirect false positive regulations by introducing other genes into the calculation of the regulatory score. If the score conditioned on Z, $P(X; Y \mid Z)$, drops markedly compared with $P(X; Y)$, this implies that the regulation between X and Y is a false positive: once Z is known, X carries little additional information about Y, because the indirect regulation between X and Y is caused by the direct regulation between X and Z and that between Z and Y. Other causes lie in the inference methods themselves, such as kernel canonical correlation analysis (CCA) for measuring the regulatory associations between genes. Kernel CCA extracts the partial correlations between genes, which easily yields high coefficient values and hence false positive inferences [⁶⁰]. Because gene regulatory networks are sparse, we can add a regularization term to control these false positives.

The false negatives, i.e., type II errors, are true regulations between genes that are not recognized by the inference method. There are many causes of false negative inferences. One is the rigidity of a single threshold or cut-off. That is, we often use a single threshold to evaluate all candidate regulations, regardless of their temporal and spatial features. For instance, suppose the inferred coefficient between gene X and gene Y is 0.4 and there is a true regulatory relationship between them during cell proliferation. If the threshold for deciding the existence of a regulation is 0.7, a type II error results. The unsuitable threshold causes the regulatory relationship to be missed in this specific condition.
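The effect of a single cut-off can be checked with a few lines of code. In this sketch the score of 0.4 for the pair (X, Y) and the threshold of 0.7 follow the example above; the other gene pairs, their scores, and the set of true edges are hypothetical values added for illustration.

```python
def count_errors(scores, true_edges, threshold):
    """Return the false positive and false negative pairs at a cut-off."""
    predicted = {pair for pair, s in scores.items() if abs(s) >= threshold}
    false_pos = predicted - true_edges  # type I errors
    false_neg = true_edges - predicted  # type II errors
    return false_pos, false_neg


# Inferred coefficients for three candidate regulations (toy values).
scores = {("X", "Y"): 0.4, ("Z", "Y"): 0.9, ("X", "W"): 0.75}
# Assumed ground truth: X-Y and Z-Y are real, X-W is not.
true_edges = {("X", "Y"), ("Z", "Y")}

for threshold in (0.7, 0.3):
    fp, fn = count_errors(scores, true_edges, threshold)
    print(threshold, "false positives:", sorted(fp), "false negatives:", sorted(fn))
```

Lowering the threshold recovers the weak true regulation between X and Y but cannot remove the false positive, which shows why a single global cut-off trades one error type against the other.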

Whereas type I errors can be analyzed from the systematic perspective of gene regulatory cooperation, type II errors are mainly caused by gene competition. For one target gene, competition between several TFs results in some TFs having dominant regulatory relationships with the target. As shown in Figure 2C, the regulatory score $P(X; Y)$ between X and Y is smaller than $P(Z; Y)$ between Z and Y. Moreover, if $P(X; Y)$ is smaller than the scores of all other pairs, and a regulation between X and Y really exists, a false negative is generated once a threshold is set to select the top-scored pairs. The weak regulatory signal between them causes the false negative inference.

It is very hard to remove false negatives because the same threshold is usually applied to all gene pairs. The low signal-to-noise ratio of the true regulations causes them to be missed [⁷⁷]. False negatives are thus caused by competing scores between gene pairs. If we could devise a dynamic threshold strategy that intelligently uses different thresholds for different gene pairs according to their context-specific regulatory pathways, physiological processes and phenotypic conditions, the number of false negatives would likely be greatly reduced.

REMOVING FALSE POSITIVE REGULATIONS

To achieve accurate reconstruction of gene regulatory networks from transcriptomic data, several methods have been proposed to remove false positive predictions. So far, these methods are mostly based on conditional probability and information theory. As discussed in the previous sections, another gene or gene set is introduced into the evaluation of regulations from a systems biology perspective. Figure 3A illustrates the main idea of employing additional information from a third-party gene or gene set. To evaluate the regulatory score $P(X; Y)$ between X and Y, other related genes or gene sets are gradually introduced into the calculation to remove possible biases and obtain a genuine regulatory score.

Suppose there is another gene or gene set Z. Conditional probability theory takes Z as the condition for assessing the genuine regulation between gene X and gene Y through the conditional mutual information

$$
I(X; Y \mid Z) = \sum_{X_i \in X,\, Y_j \in Y,\, Z_k \in Z} P(X_i, Y_j, Z_k) \log \frac{P(X_i, Y_j \mid Z_k)}{P(X_i \mid Z_k)\, P(Y_j \mid Z_k)}.
$$

When $I(X; Y \mid Z)$ is much smaller than $I(X; Y)$, ideally close to zero, the apparent regulation between X and Y is a false positive caused by the presence of gene Z or gene set Z. The indirect regulation between X and Y can then be removed and the inference accuracy improved. Similarly, the partial correlation coefficient can be used to remove false positives. It is defined as

$$
r_{XY \cdot Z} = \frac{r_{XY} - r_{XZ}\, r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}},
$$

where $r_{\bullet\bullet}$ denotes the PCC between two genes, and $r_{XY \cdot Z}$ is the correlation between X and Y after removing the effects of Z [⁷⁸].
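The partial correlation formula can be verified on a constructed example in which Z drives both X and Y. The profiles below are toy values chosen so that the two noise terms are exactly orthogonal to Z and to each other; the function names are assumptions for this sketch.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """First-order partial correlation r_{XY.Z} from pairwise PCCs."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Z regulates both X and Y; e1 and e2 are gene-specific noise terms.
z = [1.0, 1.0, -1.0, -1.0]
e1 = [1.0, -1.0, 1.0, -1.0]
e2 = [1.0, -1.0, -1.0, 1.0]
x = [zi + 0.5 * ei for zi, ei in zip(z, e1)]
y = [zi + 0.5 * ei for zi, ei in zip(z, e2)]

print(pearson(x, y))                 # strong marginal correlation
print(partial_correlation(x, y, z))  # essentially zero once Z is removed
```

The marginal correlation between X and Y is high although neither regulates the other, while the partial correlation given Z vanishes up to rounding, which is the signature of an indirect association.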

Based on conditional mutual information (CMI) and the path consistency algorithm (PCA), we proposed a network inference method named PCA-CMI for removing false positive inferences [⁴⁵]. As shown in Figure 3B, we first build a complete association network via mutual information. Then we employ higher-order CMI to iteratively eliminate false positive regulations between genes. Under the PCA scheme, the iteration terminates when the network obtained with the k-th order CMI is the same as that obtained with the (k+1)-th order CMI. Because the inference starts from a complete graph, there are no false negative regulations, i.e., no type II errors, at the initialization step of the algorithm. The reference or background network has $\binom{n}{2}$ possible regulations to choose from. With the conditioning gene or gene set, the false positives arising from indirect regulations are removed gradually. Computationally, CMI is practical only for small or medium-sized networks. We have since extended the method to whole-genome reconstruction by parallel computing [⁷⁹].
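The pruning scheme can be sketched as follows. This is a simplified illustration rather than the published PCA-CMI implementation: it stops after the zeroth and first orders instead of iterating until the network is stable, it assumes jointly Gaussian expression so that (conditional) mutual information can be computed from a (partial) correlation $r$ as $-\frac{1}{2}\ln(1 - r^2)$, and the threshold of 0.05 and the toy data are arbitrary choices.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """First-order partial correlation r_{XY.Z}."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def gaussian_mi(r):
    """(Conditional) mutual information implied by a (partial) correlation r
    under a Gaussian assumption; r^2 is clamped to keep the log finite."""
    return -0.5 * math.log(1.0 - min(r * r, 1.0 - 1e-12))


def neighbours(edges, gene):
    """Genes currently linked to the given gene."""
    return {b if a == gene else a for a, b in edges if gene in (a, b)}


def prune_network(data, threshold):
    """Order-0 and order-1 pruning starting from the complete graph."""
    # Complete graph: every pair is a candidate, so nothing is missed yet.
    edges = set(combinations(sorted(data), 2))
    # Order 0: drop pairs with low mutual information.
    edges = {(x, y) for x, y in edges
             if gaussian_mi(pearson(data[x], data[y])) >= threshold}
    # Order 1: drop pairs that become nearly independent given a common neighbour.
    for x, y in sorted(edges):
        for z in sorted(neighbours(edges, x) & neighbours(edges, y)):
            if gaussian_mi(partial_correlation(data[x], data[y], data[z])) < threshold:
                edges.discard((x, y))
                break
    return edges


# Toy data: Z drives both X and Y, so the X-Y association is indirect.
data = {
    "X": [1.5, 0.5, -0.5, -1.5],
    "Y": [1.5, 0.5, -1.5, -0.5],
    "Z": [1.0, 1.0, -1.0, -1.0],
}
print(sorted(prune_network(data, threshold=0.05)))
```

On this toy input the indirect edge between X and Y is removed while the two direct edges involving Z are kept.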

In fact, this provides a very general strategy for inferring gene regulatory networks by controlling type I errors. Because type II errors are extremely difficult to control, we start from a complete network without any false negatives, although the finally reconstructed network may still contain type II errors. It is easy to replace CMI with the partial correlation coefficient in this strategy. We can also calculate the conditional association when a prior knowledge network is available for specific conditions. By introducing related genes or gene sets, the indirect regulatory relationships are removed from the background network to achieve accurate inference of responsive gene regulations.

IMPROVING INFERENCE BY PRIOR KNOWLEDGE

Pure data-driven methods cannot always guarantee accurate inference of gene transcriptional regulations in several respects, such as the upstream and downstream relationship between regulators and targets, and the positive or negative sign of a regulation. For instance, the association network inference methods based on mutual information reviewed above cannot distinguish regulatory directions. Without prior knowledge about regulators such as TFs, the improved conditional association-based methods still cannot obtain such information. Our method PCA-CMI [⁴⁵], for example, reconstructs an undirected association network without the regulatory directions that indicate information transmission, i.e., which genes are the regulators and which are the corresponding targets. The causality between genes therefore cannot be modeled or revealed [⁶⁰]. If we specify the known TFs in advance, the association-based networks can readily be oriented and signed. From this perspective, we need to combine prior knowledge about gene regulations with high-throughput data to achieve more accurate network reconstruction.

Although pure data-driven methods of inferring gene regulatory networks are very flexible in scope, they cannot recover all the negative regulations between miRNAs and their targets without incorrect identifications. To our knowledge, miRNAs almost always exert negative regulatory functions by silencing gene transcripts. If we ignore this information and infer posttranscriptional regulations only from data, the inference will contain many false positives. For instance, we employed Pearson’s correlation coefficient as the quantitative association measure to assess the regulatory relationships between miRNAs and their targets using gene expression profiles of human liver tissues of HCC [⁸⁰]. The values lie between ‒1 and 1, and negative and positive regulations are defined according to their sign. For the documented miRNA-mRNA interactions downloaded from RegNetwork [³²], Figure 4A illustrates the frequency distribution of the pairwise PCC values of these miRNA-target pairs. Almost half (47.8%) of the correlations are positive, which indicates that these pairs would be reported as positive regulations, i.e., false positives, if the network were inferred purely from data. Among the pairs with absolute PCC values above 0.7, most have positive PCCs, and PCC-based methods would again report them as false positive inferences. This example clearly motivates the integration of such prior knowledge with pure data-driven methods. If we constrain the corresponding coefficients to be negative in the inference, we can achieve a more precise reconstruction of gene regulations [⁸¹].

To meet the urgent need of incorporating prior knowledge of transcriptional and posttranscriptional regulations into network inference, we built RegNetwork to document the available regulatory interactions [³²]. We integrated experimentally derived regulations from numerous databases as well as regulations predicted from TF-binding sequence motifs. Currently, RegNetwork contains two genome-wide regulatory networks of 300,000+ edges and 20,000+ nodes for human and mouse, respectively. Starting from the prior regulatory network and the profiling data of gene expression, we can identify the gene regulatory networks activated in response to specific biological processes and phenotypic conditions. To this end, we inferred a regulatory network during influenza A virus infection in human cells by integrating prior genome-wide networks and condition-specific gene expressions. Known regulations such as the downregulation by miRNAs are introduced as constraints of non-positive coefficients, i.e., $C_{XY} \le 0$ between miRNA X and its target Y [⁸¹]. Under the constraints of prior regulations, the reconstructed gene regulatory network reveals, with high accuracy, the synergy between transcriptional and posttranscriptional regulations through the cooperation of TFs and miRNAs [⁸¹].
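To make the sign constraint concrete, the following Python sketch fits a target gene on one TF and one miRNA by least squares while forcing the miRNA coefficient to be non-positive. It is an illustration under stated assumptions rather than the procedure of [⁸¹]: the solver is a generic cyclic coordinate descent with projection, and the names `fit_sign_constrained`, `nonpositive`, and the toy profiles are hypothetical.

```python
def center(values):
    """Subtract the mean so that no intercept term is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def fit_sign_constrained(columns, target, nonpositive=(), n_sweeps=200):
    """Least-squares fit of target ~ sum_j b[j] * columns[j] by cyclic
    coordinate descent, keeping b[j] <= 0 for each index in `nonpositive`."""
    cols = [center(c) for c in columns]
    resid = center(target)  # residual for the initial all-zero coefficients
    b = [0.0] * len(cols)
    for _ in range(n_sweeps):
        for j, col in enumerate(cols):
            # Unconstrained optimum for b[j] with the other coefficients fixed.
            step = sum(c * r for c, r in zip(col, resid)) / sum(c * c for c in col)
            new = b[j] + step
            if j in nonpositive:
                new = min(new, 0.0)  # project onto the prior constraint C_XY <= 0
            delta = new - b[j]
            resid = [r - delta * c for r, c in zip(resid, col)]
            b[j] = new
    return b

# Toy profiles (made-up values): the target follows the TF, and the miRNA is
# co-expressed with the TF, so its correlation with the target is positive.
tf = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mirna = [2.0, 1.0, 3.0, 4.0, 4.0, 7.0]
target = [2.5, 3.5, 6.0, 8.0, 9.5, 12.5]

b_free = fit_sign_constrained([tf, mirna], target)                    # data only
b_prior = fit_sign_constrained([tf, mirna], target, nonpositive={1})  # miRNA <= 0
print(b_free, b_prior)
```

Without the prior, the miRNA receives a positive coefficient, which contradicts its known repressive role. With the constraint, that coefficient is clipped to zero and the association is attributed to the TF instead.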

To assess the transcriptional activity of a prior regulatory network in specific conditions, we proposed a network screening method that identifies responsive gene regulations by measuring the consistency between network structure and expression profiles [⁵⁵]. The consistency is measured by a maximum likelihood value of the consistency between the graph and the data, and a permutation test is then performed to evaluate its statistical significance. In other words, we transform the problem of reconstructing a network from data into that of selecting a responsive network, possibly with minor modifications, by assessing the match between network structure and measured data. Different conditions and differential gene expression status result in different activated regulatory network structures. The obtained network structure reflects the specific gene regulations under the conditions in which the gene expressions were measured by transcriptomic techniques. Philosophically, we change the reverse engineering of inferring a gene regulatory network from expression data into the forward engineering of designing a network structure based on prior knowledge so that it matches the data [³].
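The permutation step of such a screening can be sketched as follows. The sketch does not reproduce the likelihood-based consistency of [⁵⁵]: as a loudly simplified stand-in it scores a candidate network by the mean absolute PCC over its edges, and it enumerates all gene-label permutations, which is feasible only for toy networks. The names `consistency` and `permutation_pvalue` and the expression values are hypothetical.

```python
import math
from itertools import permutations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def consistency(edges, expr):
    """Stand-in score: mean absolute PCC over the edges of the network."""
    return sum(abs(pearson(expr[a], expr[b])) for a, b in edges) / len(edges)

def permutation_pvalue(edges, expr):
    """Fraction of gene-label permutations scoring at least as well as the
    prior network (the identity permutation is included in the count)."""
    genes = sorted(expr)
    observed = consistency(edges, expr)
    hits = total = 0
    for perm in permutations(genes):
        relabel = dict(zip(genes, perm))
        permuted = [(relabel[a], relabel[b]) for a, b in edges]
        hits += consistency(permuted, expr) >= observed
        total += 1
    return observed, hits / total

# Toy profiles (made-up values): Z drives X and Y; W is unrelated to all.
expr = {
    "W": [4.0, 1.0, 4.0, 2.0, 5.0, 2.0],
    "X": [2.0, 1.0, 3.0, 4.0, 4.0, 7.0],
    "Y": [1.6, 2.6, 1.8, 2.8, 5.6, 6.6],
    "Z": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
}
prior_edges = [("X", "Z"), ("Y", "Z")]  # prior network to be screened
observed, p_value = permutation_pvalue(prior_edges, expr)
print(round(observed, 3), round(p_value, 4))
```

With only four genes the smallest attainable p-value is 2/24, because swapping the labels of X and Y leaves the network unchanged. Realistic applications involve many more genes and sample random permutations instead of enumerating them.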

Harnessing prior knowledge of gene regulations can avoid many potential pitfalls of inaccurate network inference. If we know which genes are the upstream regulators and which are the downstream targets, we can reach a suitable inference by integrating these priors into the model. Moreover, prior knowledge provides the ground truth and a reference standard. Once regulators and targets are specified, the inference models become purposeful. In addition, we cannot compute the conditional probabilities for all possible combinations of conditioning genes, whereas with the guidance of prior knowledge we can formulate the regulatory processes as a rational model describing causal regulations [¹⁶,⁵⁶]. Prior knowledge narrows down the search space of solutions, so the identification of regulations becomes much easier. It greatly benefits the precise inference of gene regulatory networks from data.

IMPROVING INFERENCE BY MULTI-OMICS DATA

In the current big data era, it is promising to accurately reconstruct gene regulatory networks by integrating omics data. There are two aspects of data integration. The first is to combine all the related data of one type, such as transcriptomics, in an integrated manner. For instance, there are three datasets on hepatocellular carcinoma caused by HCV infection. To achieve an accurate inference of the responsive regulation between gene X and gene Y, we need to examine these different data collections to obtain a robust inference of the regulatory relationship between X and Y. Gene regulations are temporal and spatial. Integrating data from various laboratories, groups, platforms and institutes will build consistent and cross-data-validated gene regulatory networks [⁸²]. False positives are controlled by cross-checking across multiple transcriptomic datasets.

The second aspect of data integration is to integrate multi-level omics data for characterizing gene regulations in cells. To achieve precise reconstruction of gene regulatory networks, we should integrate multi-level omics data and infer hierarchical gene regulations. Compared to the former ‘deep’ data integration, the hierarchical regulatory network places special emphasis on ‘wide’ or ‘broad’ data integration. The determinants of gene expression are coordinated along the central dogma that DNA makes RNA and RNA makes protein. The transfer of genetic information from DNA to RNA to protein constitutes a hierarchical information system, with many events related to the transcriptional level where gene regulation takes place. The regulatory components and elements often function only in specific conditions, and the measured gene expressions likewise refer to very specific conditions. To accurately infer a causal regulatory network controlling gene expression, it is necessary to integrate data from various levels to build a comprehensive regulatory map.

Figure 5 illustrates the hierarchical levels of gene regulation. Transcription is initiated by the binding of RNA polymerases and the recognition of a gene promoter region by TFs. With current high-throughput techniques, the expression-level data are mainly measured by microarray [¹²] and RNA-Seq [¹³]. The expressions of tens of thousands of genes are measured from the extracted RNA concentrations. At the transcriptional level, DNA modifications such as methylation and demethylation, and the chromatin openness governing protein-DNA contacts, strongly affect gene expression [⁸³]. At the posttranscriptional level, non-coding RNAs such as miRNA, lncRNA and circRNA strongly affect RNA abundances [⁸⁴]. Moreover, translation and posttranslational modifications also affect TF activity and protein abundance [⁸⁵]. In practice, gene regulations function as an integrated system. Genetic and epigenetic elements play critical roles in gene regulation under different temporal and spatial conditions [⁷⁶]. If multi-level omics data about them are available, we should integrate these data in a hierarchical manner to build an integrative network of these regulatory components.

So far, some methods have been proposed to integrate multi-omics datasets for inferring gene regulatory networks [⁸⁶]. They can roughly be categorized into three pipelines. The first is based on regression. Suppose there are M types of multi-omics data, i.e., $D_1, D_2, \cdots, D_M$. The expression $g_i$ of gene i is modeled as the response variable of its regulators (explanatory variables), such as TF $f_j$, according to these multi-omics data. At time point t, the regression model is formulated as

$$g_i^t = \alpha_i + \sum_{j}^{M} \beta_j D_{ij} f_j^t + \varepsilon_i^t, \quad \varepsilon_i^t \sim \mathrm{Normal}(0, \sigma^2),$$

where $\beta_j$ is the ordinary regression coefficient and $D_{ij}$ is the coefficient of an individual factor on the TF activity $f_j$. The multiple effects come from the multiple TF-influence datasets, and they are integrated into the independent variables (regulators). The problem of inferring gene regulations from multi-omics data is thus converted into identifying the parameters of the linear regression equations [⁸⁷]. More sophisticated regression techniques, combined with the latter two categories, are expected to achieve more precise reconstruction [⁵⁷,⁸⁸–⁹⁰].
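As a minimal sketch of the regression pipeline above, the following Python code fits the intercept and the coefficients $\beta_j$ for a single gene by ordinary least squares, using the products $D_{ij} f_j^t$ as explanatory variables. It is plain least squares on noise-free toy data, not the Bayesian variable selection of [⁸⁷]; the helper names `solve` and `fit_gene`, the weights, and the profiles are hypothetical.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_gene(g, tf_activity, weights):
    """OLS fit of g^t = alpha + sum_j beta_j * D_j * f_j^t for one gene.

    g           : expression of the gene over time points
    tf_activity : list of TF activity profiles f_j
    weights     : evidence weights D_j of each TF on this gene (non-zero)
    """
    ones = [1.0] * len(g)
    cols = [ones] + [[d * v for v in f] for d, f in zip(weights, tf_activity)]
    # Normal equations (X^T X) theta = X^T g.
    A = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(a * b for a, b in zip(ci, g)) for ci in cols]
    alpha, *beta = solve(A, rhs)
    return alpha, beta

# Toy data (made-up values), generated without noise from
# alpha = 1, beta = (2, -1) and weights D = (1.0, 0.5).
f1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
f2 = [4.0, 1.0, 4.0, 2.0, 5.0, 2.0]
g = [1.0, 4.5, 5.0, 8.0, 8.5, 12.0]

alpha, beta = fit_gene(g, [f1, f2], weights=[1.0, 0.5])
print(round(alpha, 6), [round(b, 6) for b in beta])
```

A TF whose weight is zero for this gene contributes an all-zero column and makes the normal equations singular, so such regulators would have to be dropped before fitting.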

The second is based on Bayesian models. Similar to the first pipeline, the second category models the gene expression as a joint probability of TFs and the influence factors of TFs, i.e., $g_i^t = P(f_i^t \mid D_1^t, D_2^t, \ldots, D_M^t)$. According to Bayes' theorem, the intra-relationship structure of these factors in the multi-omics data can be represented as

$$P(D_1^t, D_2^t, \ldots, D_M^t) = \sum_{i,j,k}^{M} P(D_i^t \mid D_j^t, \ldots, D_k^t).$$

Together with the combinatorial control exerted by regulators, the combinatorial factors acting on these regulators, as reflected in the multi-omics data, make the network inference model very complicated [⁹¹–⁹³]. Prior knowledge, such as the information flow in the central dogma, gives the above probability expansion a biological meaning, for example TFs in transcriptional regulation, miRNAs in posttranscriptional regulation, and phosphorylation in posttranslational modification. In this sense, we proposed a method named SITPR to integrate multi-omics data for inferring the TF-miRNA cooperative regulatory network using prior knowledge of gene regulations, TF-binding sequence motifs, protein-protein interactions, as well as mRNA and miRNA expressions [⁵⁶]. We reconstructed a comprehensive transcriptional and posttranscriptional regulatory map during influenza A virus infection in human epithelial cells.

The third is based on machine learning. Methods in this category treat all the related multi-omics data as regulatory features and use them to train a machine learning classifier [⁹⁴,⁹⁵]. They often achieve good inference performance in tests and validations [⁵⁸,⁹⁵]. However, the inference results are hard to interpret due to the ‘black-box’ nature of machine learning classifiers [⁶⁸]. They focus on the binary decision of whether a regulation exists, and many inter- and intra-relationships among variables (genes, regulators, cofactors, and effectors) are often missing. In the former two pipelines, by contrast, the regulatory principles and mechanisms can be revealed more easily owing to the causal information flows underlying the modeling variables.

Currently, high-throughput microarray and next-generation sequencing techniques are typically applied to cell populations [⁹⁶]. Cell types and their states are mixed together, with little information about the specific transcriptional status. In fact, gene expression is related to the aforementioned factors, such as DNA modifications, ncRNA and RNA epigenomics, and protein translation and modification [⁹⁷]. Their mixed effects on gene expression are still not intensively considered in network inference. The biology of life activities depends on the contexts of DNA, RNA and protein and on their inter- and intra-regulations in single cells [⁹⁸]. Studying specific cell types at different times and locations in a specific organism, through their expressions and relationships, is an alternative way to clearly understand gene regulatory mechanisms [⁹⁹]. With the advance of single-cell genomics and sequencing, more specific contexts of these microenvironments of gene expression will help build more accurate gene regulatory networks when such datasets become available [¹⁰⁰].

From a computational perspective, multi-level omics data provide a promising way to tackle the difficulty of inferring gene regulatory networks. If multi-level omics data are available, we can trace the trajectory of the dynamics of gene expression, and the measured gene expression profiles can be modeled as the integrated result of these regulators by matching their cooperation. Generally, it is difficult to build a rational model describing the hierarchical regulatory structures. In special cases, such as under Gaussian distribution assumptions, a mixed-effects model of these regulators is expected to give an accurate approximation of the gene expressions measured from multiple experiments, i.e.,

$$Y = \sum_{i}^{m} \alpha_i X_i + \sum_{j}^{n} \beta_j Z_j + \varepsilon,$$

where the gene expression Y is a response variable of the fixed effects X of the multi-omics information and the random effects Z of the hierarchical multi-omics information of X, that of Y, and their combinations, and $\varepsilon$ is an error term. The information hierarchy of multiple omics can also be modeled by introducing causal dependences, i.e., $P(X_{i_{k_1}} \to X_{i_{k_2}}),\ k_1, k_2 \in \{i\}$, in X [³]. In the near future, with the development of single-cell techniques, the purity of gene expression in individual cells will provide cleaner data with less noise and thereby more accurate inference of gene regulatory networks. With more and more advanced techniques, we will achieve higher resolutions of regulatory maps by data integration. The dynamics describing regulatory complexity will finally be reconstructed with the assistance of more advanced technology such as real-time single-molecule observation [¹⁰¹].

CONCLUSION AND OUTLOOK

The difficulty of reverse engineering gene regulatory networks from data is essentially caused by the signal-to-noise ratio of the data. If our computational methods can distinguish noise from signal, the inference accuracy will be improved. To reduce the two types of errors in the inference, i.e., false positives and false negatives, we can integrate the available gene expression profiles in depth and the multi-level omics datasets in breadth to build a comprehensive regulatory network. In this work, we provided a review of gene regulatory network inference from the perspectives of computational complexity, available resources, existing methods, and the endeavors to reduce the two types of errors. We highlighted the importance of data integration, especially of prior knowledge and multiple omics, for achieving accurate inferences. Systems biology approaches appear to be rational alternatives for deciphering gene regulations from data. Owing to the temporal and spatial features of gene regulation, we also need to focus on specific individual conditions and single cells, and then assemble them into the whole process of gene transcriptional regulation. As the resolution of experimental measurements improves, a global dynamic movie of gene regulations can be reconstructed by integrating the available high-throughput datasets.

References

[1]	Marx, V. (2013) Biology: the big challenges of big data. Nature, 498, 255–260

[2]	Babu, M. M., Luscombe, N. M., Aravind, L., Gerstein, M. and Teichmann, S. A. (2004) Structure and evolution of transcriptional regulatory networks. Curr. Opin. Struct. Biol., 14, 283–291

[3]	Liu, Z. P. (2015) Reverse engineering of genome-wide gene regulatory networks from gene expression data. Curr. Genomics, 16, 3–22

[4]	Lee, T. I. and Young, R. A. (2013) Transcriptional regulation and its misregulation in disease. Cell, 152, 1237–1251

[5]	Bandyopadhyay, S., Mehta, M., Kuo, D., Sung, M. K., Chuang, R., Jaehnig, E. J., Bodenmiller, B., Licon, K., Copeland, W., Shales, M., et al. (2010) Rewiring of genetic networks in response to DNA damage. Science, 330, 1385–1389

[6]	Johnson, D. S., Mortazavi, A., Myers, R. M. and Wold, B. (2007) Genome-wide mapping of in vivo protein-DNA interactions. Science, 316, 1497–1502

[7]	Park, P. J. (2009) ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet., 10, 669–680

[8]	Edgar, R., Domrachev, M. and Lash, A. E. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30, 207–210

[9]	Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., Holloway, E., Kapushesky, M., Kemmeren, P., Lara, G. G., et al. (2003) ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Res., 31, 68–71

[10]	Jaenisch, R. and Bird, A. (2003) Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nat. Genet., 33, 245–254

[11]	Song, C. X., Yi, C. and He, C. (2012) Mapping recently identified nucleotide variants in the genome and transcriptome. Nat. Biotechnol., 30, 1107–1116

[12]	Schena, M., Shalon, D., Davis, R. W. and Brown, P. O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467–470

[13]	Wang, Z., Gerstein, M. and Snyder, M. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet., 10, 57–63

[14]	Blackham, S., Baillie, A., Al-Hababi, F., Remlinger, K., You, S., Hamatake, R. and McGarvey, M. J. (2010) Gene expression profiling indicates the roles of host oxidative stress, apoptosis, lipid metabolism, and intracellular transport genes in the replication of hepatitis C virus. J. Virol., 84, 5404–5414

[15]	Shi, L., Reid, L. H., Jones, W. D., Shippy, R., Warrington, J. A., Baker, S. C., Collins, P. J., de Longueville, F., Kawasaki, E. S., Lee, K. Y., et al. (2006) The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol., 24, 1151–1161

[16]	Wu, S., Liu, Z. P., Qiu, X. and Wu, H. (2014) Modeling genome-wide dynamic regulatory network in mouse lungs with influenza infection using high-dimensional ordinary differential equations. PLoS One, 9, e95276

[17]	Raue, A., Kreutz, C., Maiwald, T., Bachmann, J., Schilling, M., Klingmüller, U. and Timmer, J. (2009) Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics, 25, 1923–1929

[18]	Leinonen, R., Sugawara, H. and Shumway, M., and the International Nucleotide Sequence Database Collaboration. (2011) The sequence read archive. Nucleic Acids Res., 39, D19–D21

[19]	The Cancer Genome Atlas Research Network. (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455, 1061–1068

[20]	Hudson, T. J., Anderson, W., Artez, A., Barker, A. D., Bell, C., Bernabé, R. R., Bhan, M. K., Calvo, F., Eerola, I., Gerhard, D. S., et al. (2010) International network of cancer genome projects. Nature, 464, 993–998

[21]	Benson, D. A., Cavanaugh, M., Clark, K., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J. and Sayers, E. W. (2013) GenBank. Nucleic Acids Res., 41, D36–D42

[22]	Maher, B. (2012) ENCODE: the human encyclopaedia. Nature, 489, 46–48

[23]	Muers, M. (2011) Functional genomics: the modENCODE guide to the genome. Nat. Rev. Genet., 12, 80

[24]	Bernstein, B. E., Stamatoyannopoulos, J. A., Costello, J. F., Ren, B., Milosavljevic, A., Meissner, A., Kellis, M., Marra, M. A., Beaudet, A. L., Ecker, J. R., et al. (2010) The NIH Roadmap Epigenomics Mapping Consortium. Nat. Biotechnol., 28, 1045–1048

[25]	Fingerman, I. M., McDaniel, L., Zhang, X., Ratzat, W., Hassan, T., Jiang, Z., Cohen, R. F. and Schuler, G. D. (2011) NCBI Epigenomics: a new public resource for exploring epigenomic data sets. Nucleic Acids Res., 39, D908–D912

[26]	Cantara, W. A., Crain, P. F., Rozenski, J., McCloskey, J. A., Harris, K. A., Zhang, X., Vendeix, F. A., Fabris, D. and Agris, P. F. (2011) The RNA Modification Database, RNAMDB: 2011 update. Nucleic Acids Res., 39, D195–D201

[27]	Machnicka, M. A., Milanowska, K., Osman Oglou, O., Purta, E., Kurkowska, M., Olchowik, A., Januszewski, W., Kalinowski, S., Dunin-Horkawicz, S., Rother, K. M., et al. (2013) MODOMICS: a database of RNA modification pathways—2013 update. Nucleic Acids Res., 41, D262–D267

[28]	Bujold, D., de Lima Morais, D. A., Gauthier, C., Côté, C., Caron, M., Kwan, T., Chen, K. T., Laperle, J., Markovits, A. N., Pastinen, T., et al. (2016) The International Human Epigenome Consortium Data Portal. Cell Syst., 3, 496–499

[29]	Ardlie, K. G., Deluca, D. S., Segre, A. V., Sullivan, T. J., Young, T. R., Gelfand, E. T., Trowbridge, C. A., Maller, J. B., Tukiainen, T., Lek, M., et al. (2015) The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science, 348, 648–660

[30]	Matys, V., Fricke, E., Geffers, R., Gössling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A. E., Kel-Margoulis, O. V., et al. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res., 31, 374–378

[31]	Bryne, J. C., Valen, E., Tang, M. H., Marstrand, T., Winther, O., da Piedade, I., Krogh, A., Lenhard, B. and Sandelin, A. (2008) JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res., 36, D102–D106

[32]	Liu, Z. P., Wu, C., Miao, H. and Wu, H. (2015) RegNetwork: an integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse. Database (Oxford), 2015, bav095

[33]	Xie, C., Yuan, J., Li, H., Li, M., Zhao, G., Bu, D., Zhu, W., Wu, W., Chen, R. and Zhao, Y. (2014) NONCODEv4: exploring the world of long non-coding RNA genes. Nucleic Acids Res., 42, D98–D103

[34]	The RNAcentral Consortium. (2015) RNAcentral: an international database of ncRNA sequences. Nucleic Acids Res., 43, D123–D129

[35]	Sethupathy, P., Corda, B. and Hatzigeorgiou, A. G. (2006) TarBase: a comprehensive database of experimentally supported animal microRNA targets. RNA, 12, 192–197

[36]	Volders, P. J., Helsens, K., Wang, X., Menten, B., Martens, L., Gevaert, K., Vandesompele, J. and Mestdagh, P. (2013) LNCipedia: a database for annotated human lncRNA transcript sequences and structures. Nucleic Acids Res., 41, D246–D251

[37]	Amaral, P. P., Clark, M. B., Gascoigne, D. K., Dinger, M. E. and Mattick, J. S. (2011) lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Res., 39, D146–D151

[38]	Griffiths-Jones, S., Saini, H. K., van Dongen, S. and Enright, A. J. (2008) miRBase: tools for microRNA genomics. Nucleic Acids Res., 36, D154–D158

[39]	Glažar, P., Papavasileiou, P. and Rajewsky, N. (2014) circBase: a database for circular RNAs. RNA, 20, 1666–1670

[40]	Yang, J. H., Li, J. H., Jiang, S., Zhou, H. and Qu, L. H. (2013) ChIPBase: a database for decoding the transcriptional regulation of long non-coding RNA and microRNA genes from ChIP-Seq data. Nucleic Acids Res., 41, D177–D187

[41]	Wang, Q., Huang, J., Sun, H., Liu, J., Wang, J., Wang, Q., Qin, Q., Mei, S., Zhao, C., Yang, X., et al. (2014) CR Cistrome: a ChIP-Seq database for chromatin regulators and histone modification linkages in human and mouse. Nucleic Acids Res., 42, D450–D458

[42]	Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. and Bourne, P. E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242

[43]	The UniProt Consortium. (2008) The universal protein resource (UniProt). Nucleic Acids Res., 36, D190–D195

[44]	von Mering, C., Jensen, L. J., Kuhn, M., Chaffron, S., Doerks, T., Krüger, B., Snel, B. and Bork, P. (2007) STRING 7—recent developments in the integration and prediction of protein interactions. Nucleic Acids Res., 35, D358–D362

[45]	Zhang, X., Zhao, X. M., He, K., Lu, L., Cao, Y., Liu, J., Hao, J. K., Liu, Z. P. and Chen, L. (2012) Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information. Bioinformatics, 28, 98–104

[46]	Zhang, B. and Horvath, S. (2005) A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol., 4, Article17

[47]	Meyer, P. E., Lafitte, F. and Bontempi, G. (2008) minet: a R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics, 9, 461

[48]	Margolin, A. A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Dalla Favera, R. and Califano, A. (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7, S7

[49]	Wilczyński, B. and Dojer, N. (2009) BNFinder: exact and efficient method for learning Bayesian networks. Bioinformatics, 25, 286–287

[50]	Scutari, M. (2010) Learning Bayesian Networks with the bnlearn R Package. J. Stat. Softw., 35, 1–22

[51]	Shmulevich, I., Dougherty, E. R., Kim, S. and Zhang, W. (2002) Probabilistic Boolean Networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics, 18, 261–274

[52]	Müssel, C., Hopfensitz, M. and Kestler, H. A. (2010) BoolNet—an R package for generation, reconstruction and analysis of Boolean networks. Bioinformatics, 26, 1378–1380

[53]	Schaffter, T., Marbach, D. and Floreano, D. (2011) GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27, 2263–2270

[54]	Bonneau, R., Reiss, D. J., Shannon, P., Facciotti, M., Hood, L., Baliga, N. S. and Thorsson, V. (2006) The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biol., 7, R36

[55]	Liu, Z. P., Zhang, W., Horimoto, K. and Chen, L. (2013) Gaussian graphical model for identifying significantly responsive regulatory networks from time course high-throughput data. IET Syst. Biol., 7, 143–152

[56]	Liu, Z. P., Wu, H., Zhu, J. and Miao, H. (2014) Systematic identification of transcriptional and post-transcriptional regulations in human respiratory epithelial cells during influenza A virus infection. BMC Bioinformatics, 15, 336

[57]	Haury, A. C., Mordelet, F., Vera-Licona, P. and Vert, J. P. (2012) TIGRESS: trustful inference of gene regulation using stability selection. BMC Syst. Biol., 6, 145

[58]	Huynh-Thu, V. A., Irrthum, A., Wehenkel, L. and Geurts, P. (2010) Inferring regulatory networks from expression data using tree-based methods. PLoS One, 5, e12776

[59]	Langfelder, P. and Horvath, S. (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics, 9, 559

[60]	Liu, Z. P. (2017) Quantifying gene regulatory relationships with association measures: a comparative study. Front. Genet., 8, 96

[61]	Basso, K., Margolin, A. A., Stolovitzky, G., Klein, U., Dalla-Favera, R. and Califano, A. (2005) Reverse engineering of regulatory networks in human B cells. Nat. Genet., 37, 382–390

[62]	Friedman, N. (2004) Inferring cellular networks using probabilistic graphical models. Science, 303, 799–805

[63]	Zou, M. and Conzen, S. D. (2005) A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics, 21, 71–79

[64]	Amit, I., Garber, M., Chevrier, N., Leite, A. P., Donner, Y., Eisenhaure, T., Guttman, M., Grenier, J. K., Li, W., Zuk, O., et al. (2009) Unbiased reconstruction of a mammalian transcriptional network mediating pathogen responses. Science, 326, 257–263

[65]	Thomas, R. (1973) Boolean formalization of genetic control circuits. J. Theor. Biol., 42, 563–585

[66]	Akutsu, T., Miyano, S. and Kuhara, S. (1999) Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pac. Symp. Biocomput, 99, 17–28

[67]	Saito, S., Aburatani, S. and Horimoto, K. (2008) Network evaluation from the consistency of the graph structure with the measured data. BMC Syst. Biol., 2, 84

[68]	Jordan, M. I. and Mitchell, T. M. (2015) Machine learning: trends, perspectives, and prospects. Science, 349, 255–260

[69]	Marbach, D., Roy, S., Ay, F., Meyer, P. E., Candeias, R., Kahveci, T., Bristow, C. A. and Kellis, M. (2012) Predictive regulatory models in Drosophila melanogaster by integrative inference of transcriptional networks. Genome Res., 22, 1334–1349

[70]	Bartel, D. P. (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116, 281–297

[71]	Pefanis, E., Wang, J., Rothschild, G., Lim, J., Kazadi, D., Sun, J., Federation, A., Chao, J., Elliott, O., Liu, Z. P., et al. (2015) RNA exosome-regulated long non-coding RNA transcription controls super-enhancer activity. Cell, 161, 774–789

[72]	Memczak, S., Jens, M., Elefsinioti, A., Torti, F., Krueger, J., Rybak, A., Maier, L., Mackowiak, S. D., Gregersen, L. H., Munschauer, M., et al. (2013) Circular RNAs are a large class of animal RNAs with regulatory potency. Nature, 495, 333–338

[73]	Garber, M., Grabherr, M. G., Guttman, M. and Trapnell, C. (2011) Computational methods for transcriptome annotation and quantification using RNA-seq. Nat. Methods, 8, 469–477

[74]	Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U. and Speed, T. P. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249–264

[75]	Elowitz, M. B., Levine, A. J., Siggia, E. D. and Swain, P. S. (2002) Stochastic gene expression in a single cell. Science, 297, 1183–1186

[76]	Gibcus, J. H. and Dekker, J. (2012) The context of gene expression regulation. F1000 Biol. Rep., 4, 8

[77]	Ideker, T., Dutkowski, J. and Hood, L. (2011) Boosting signal-to-noise in complex biology: prior knowledge is power. Cell, 144, 860–863

[78]	de la Fuente, A., Bing, N., Hoeschele, I. and Mendes, P. (2004) Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics, 20, 3565–3574

[79]	Zheng, G., Xu, Y., Zhang, X., Liu, Z. P., Wang, Z., Chen, L. and Zhu, X. G. (2016) CMIP: a software package capable of reconstructing genome-wide regulatory networks using gene expression data. BMC Bioinformatics, 17, 535

[80]	Burchard, J., Zhang, C., Liu, A. M., Poon, R. T., Lee, N. P., Wong, K. F., Sham, P. C., Lam, B. Y., Ferguson, M. D., Tokiwa, G., et al. (2010) microRNA-122 as a regulator of mitochondrial metabolic gene network in hepatocellular carcinoma. Mol. Syst. Biol., 6, 402

[81]	Liu, Z. P. (2014) Systematic identification of local structure binding motifs in protein-RNA recognition. In: Proceedings of 8th International Conference on Systems Biology, pp. 74–80

[82]	Cheng, C., Yan, K. K., Hwang, W., Qian, J., Bhardwaj, N., Rozowsky, J., Lu, Z. J., Niu, W., Alves, P., Kato, M., et al. (2011) Construction and analysis of an integrated regulatory network derived from high-throughput sequencing data. PLoS Comput. Biol., 7, e1002190

[83]	The ENCODE Project Consortium. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57–74

[84]	Amaral, P. P., Dinger, M. E., Mercer, T. R. and Mattick, J. S. (2008) The eukaryotic genome as an RNA machine. Science, 319, 1787–1789

[85]	Spitz, F. and Furlong, E. E. (2012) Transcription factors: from enhancer binding to developmental control. Nat. Rev. Genet., 13, 613–626

[86]	Hecker, M., Lambeck, S., Toepfer, S., van Someren, E. and Guthke, R. (2009) Gene regulatory network inference: data integration in dynamic models-a review. Biosystems, 96, 86–103

[87]	Jensen, S. T., Chen, G. and Stoeckert, C. J. Jr (2007) Bayesian variable selection and data integration for biological regulatory networks. Ann. Appl. Stat., 1, 612–633

[88]	Yeung, M. K., Tegnér, J. and Collins, J. J. (2002) Reverse engineering gene networks using singular value decomposition and robust regression. Proc. Natl. Acad. Sci. USA, 99, 6163–6168

[89]	Tegner, J., Yeung, M. K., Hasty, J. and Collins, J. J. (2003) Reverse engineering gene networks: integrating genetic perturbations with dynamical modeling. Proc. Natl. Acad. Sci. USA, 100, 5944–5949

[90]	Lam, K. Y., Westrick, Z. M., Müller, C. L., Christiaen, L. and Bonneau, R. (2016) Fused regression for multi-source gene regulatory network inference. PLoS Comput. Biol., 12, e1005157

[91]	Werhli, A. V. and Husmeier, D. (2007) Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge. Stat. Appl. Genet. Mol. Biol., 6, Article 15

[92]	Zhu, J., Zhang, B., Smith, E. N., Drees, B., Brem, R. B., Kruglyak, L., Bumgarner, R. E. and Schadt, E. E. (2008) Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nat. Genet., 40, 854–861

[93]	Santra, T. (2014) A Bayesian framework that integrates heterogeneous data for inferring gene regulatory networks. Front. Bioeng. Biotechnol., 2, 13

[94]	De Smet, R. and Marchal, K. (2010) Advantages and limitations of current network inference methods. Nat. Rev. Microbiol., 8, 717–729

[95]	Mordelet, F. and Vert, J. P. (2008) SIRENE: supervised inference of regulatory networks. Bioinformatics, 24, i76–i82

[96]	Patel, A. P., Tirosh, I., Trombetta, J. J., Shalek, A. K., Gillespie, S. M., Wakimoto, H., Cahill, D. P., Nahed, B. V., Curry, W. T., Martuza, R. L., et al. (2014) Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science, 344, 1396–1401

[97]	Djebali, S., Davis, C. A., Merkel, A., Dobin, A., Lassmann, T., Mortazavi, A., Tanzer, A., Lagarde, J., Lin, W., Schlesinger, F., et al. (2012) Landscape of transcription in human cells. Nature, 489, 101–108

[98]	Rosenfeld, N., Young, J. W., Alon, U., Swain, P. S. and Elowitz, M. B. (2005) Gene regulation at the single-cell level. Science, 307, 1962–1965

[99]	Marbach, D., Costello, J. C., Küffner, R., Vega, N. M., Prill, R. J., Camacho, D. M., Allison, K. R., Kellis, M., Collins, J. J. and Stolovitzky, G. (2012) Wisdom of crowds for robust gene network inference. Nat. Methods, 9, 796–804

[100]	Moignard, V., Woodhouse, S., Haghverdi, L., Lilly, A. J., Tanaka, Y., Wilkinson, A. C., Buettner, F., Macaulay, I. C., Jawaid, W., Diamanti, E., et al. (2015) Decoding the regulatory network of early blood development from single-cell gene expression measurements. Nat. Biotechnol., 33, 269–276

[101]	Graham, J. E., Marians, K. J. and Kowalczykowski, S. C. (2017) Independent and stochastic action of DNA polymerases in the replisome. Cell, 169, 1201–1213

RIGHTS & PERMISSIONS

Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature
