INTRODUCTION
The problem of counting the number of occurrences of a subgraph within a large graph is commonly termed subgraph counting. Related problems, such as network motif finding, relative graphlet frequency distance and graphlet degree distribution agreements, are all fundamental graph analysis methods to identify latent structure in complex systems. They have applications in bioinformatics [
1,
2], chemoinformatics [
3], online social network analysis [
4], and many other areas. Recently, Liang
et al. [
5] presented a novel algorithm CoMoFinder to detect composite motifs in co-regulatory networks with two types of molecules including transcription factors and microRNAs. Consequently, we can exploit CoMoFinder to identify composite motifs in networks with different types of nodes, such as drug-target interaction network [
6], disease-ncRNA association network [
7] and disease-microbe association network [
8–
10].
As any connected subgraph that is non-treelike can be generated by expanding a corresponding subtree through adding edges, this paper focuses on the problems of counting tree-structured subgraphs, which have been found to be very useful for revealing significant systemic differences among organisms [
2,
11]. Subtree mappings can be categorized into two, namely, induced and non-induced (see the Section of Preliminaries in Methods for definitions). The definition of non-induced subtree considers all possible occurrences of the subtree of interest within a network. Counting non-induced occurrences of subtrees is also desirable as real-world networks may include missing and spurious edges.
Counting the number of all possible non-induced subtrees is computationally demanding and a practical bottleneck for large networks. Omidi
et al. [
12] introduced the MODA algorithm, which determines the frequency of a given subtree by searching for all possible mappings from the subtree into the input network. However, MODA is unable to scale to large networks with thousands of nodes, and the computation of very large subtrees remains exceedingly expensive. An alternative approach for estimating the number of non-induced subtrees and bounded treewidth subgraphs is based on the powerful color coding technique [
13], which forms the basis for several recent serial and parallel implementations. Alon
et al. [
2] implemented color-coding subtree counting to demonstrate its applicability for finding large tree-structured motifs in biological networks. Zhao
et al. [
14,
15] implemented distributed color-coding subtree counting for large graphs via both MPI and MapReduce, with applications in social network analysis. Recently, Slota and Madduri [
16–
18] proposed an algorithm called FASCIA, an efficient and scalable distributed implementation of subtree counting via color coding.
The aforementioned algorithms for counting non-induced occurrences of subtrees belong to the subtree-centric approaches, which search for one specific single subtree type at a time. Hence, subtree-centric methods have to be run once for each possible
k-size subtrees, and all the different non-isomorphic
k-size subtrees should be generated first. Although there are many motif-finding algorithms which exploit network-centric approaches, such as ESU [
19] or Kavosh [
20], there is no such study, to the best of our knowledge, that attempt to apply network-centric approaches to subtree counting problem. Network-centric approaches enumerate all subgraphs of a certain size, and then classify each enumerated subgraph into isomorphic classes. This means that isomorphism test will be required for each subgraph occurrence, while the number of actual non-isomorphic classes is much smaller, particularly for tree topologies. For this reason, researchers recently devised several novel algorithms, such as QuateXelero [
21] and FaSE [
22], which incorporate a rooted tree data structure in the enumeration process to reduce the number of isomorphism tests required.
In 2011, Ferreira
et al. [
23] presented the first output-sensitive algorithm that enumerates all
k-size subtrees in a graph
G of size
n in
O(s
k) total time, where
s is the number of these subtrees in
G and scales as
nk. To our knowledge, the research is theoretical, with no experimental results for the purpose of comparison, and currently there is no improved algorithm for this enumeration problem [
24].
In this paper, we derive a network-centric algorithm to count the number of all non-induce subtrees of a given size that occur in the input network. Our algorithm for counting non-induced occurrences of subtrees, called MTMO, consists of two subtasks, namely, enumeration and classification.
We provide significant improvements for the above two subtasks of counting non-induced subtrees. Based on the composition operation of an integer, we present a new algorithm for enumeration of the subtrees. Our proposed algorithm is much easier and faster to implement compared with the existing algorithms.
To avoid a large number of isomorphism tests, a rooted tree data structure, which represents the topology of the subtrees being enumerated, is searched during the actual enumeration. Each time a new vertex is added to the current enumerated subtree, a new ramification of the rooted tree is created, or we follow an already existing path. A path from the root node to any leaf in the created rooted tree corresponds to a different vertex connection pattern of a particular tree type. We have to compute a canonical labeling for each leaf to determine its subtree type, but we eliminate the need on all other occurrences that corresponds to the same path in the tree. Furthermore, we use an array-based indexing scheme to simplify the subtree counting method.
We evaluate our algorithm on three representative undirected complex networks. Experimental results show that we can obtain significant speedups, being roughly an order of magnitude faster than existing subtree-centric approaches and base network-centric algorithm which does not use rooted tree. Also, we show major differences between unicellular and multicellular organisms. In addition, our proposed algorithm is applied to identify network motif with yeast protein-protein interaction (PPI) data.
RESULTS
We use three representative networks to test the performance of our algorithm. One is a biological network comprising protein-protein interaction network of the budding Yeast [
25,
26]. The other two are non-biological networks, namely, an electronic [
27] and a dolphins social network [
28,
29]. For simplicity, all occurring self-loops are removed. The features of these networks are summarized in Table 1.
Unless otherwise specified, the platform used in these experiments is: Intel Xeon X5670 2.93 GHz 12 MB Cache 1,333 MHz with 48 GB main memory running under the Ubuntu 12.04 (amd64) operating system. The source codes were compiled with the GNU gcc/g++ 4.6.3 compiler using the option “–O3”.
Effect of the labeled rooted tree
We compare MTMO with our base network-centric algorithm called NTMO, which does not use rooted tree and requires an isomorphism test for each subtree occurrence. We capture the runtime necessary to conduct a complete k-subtree counting on the original networks as we varied k from 3 to 12. The results are listed in Table 2.
In all cases, MTMO outperforms NTMO, with the speedup being roughly an order of magnitude faster. Furthermore, as the subtree size k increases, the speedup becomes greater as well. More precisely, the speedup is related to the ratio of number of subtree occurrences to the number of classes. For example, MTMO is up to 70 times faster when counting subtrees of size 7 in the Yeast network, but only 34 times faster in the electronic network. This finding is primarily because the number of subtree occurrences in Yeast is greater than that in the electronic network.
Comparison to other algorithms
MODA [
12] is a recent open-source motif discovery tool that reports counts for non-induced subgraph occurrences. This tool requires Microsoft Visual Studio. To achieve a direct comparison, both MTMO and MODA are executed on the same computer with Intel Xeon X5670 2.93 GHz 12 MB Cache 1,333 MHz with 48 GB main memory running under the Windows 7 operating system. Considering that MODA is unable to scale to large networks with thousands of nodes, we use two other non-biological networks. As shown in the results listed in Table 3, MTMO can attain substantial speedups compared with MODA. In addition, MTMO has less memory cost compared with MODA.
We further compare MTMO with FASCIA [
16]. Slota and Madduri [
16] reported the execution time for all possible subtrees of size 7 on the electronic network was approximately 22 s. According to Table 3, the execution time of MTMO for all possible subtrees of size 7 within the same network is under 1 s.
Comparison of PPI networks
In the following we download the PPI networks of four species, which consist of three unicellular organisms (
Saccharomyces cerevisiae,
Escherichia coli, and
Helicobacter pyloris) and a multicellular organism (
Caenorhabditis elegans) from the DIP database (2014.10.1 updated) [
30].
We calculate many topological statistics in these four networks, which are summarized in Table 4. All networks display small-world and scale-free properties. However, the clustering coefficient of all networks is exceptionally small. This is because the PPI networks of these species are far from complete.
We use the proposed algorithm to obtain normalized treelet distributions, that is, the number of non-induced occurrences of different tree topologies of size
k=8 normalized by the total of all non-induced trees of size 8 for each PPI network. The corresponding treelet distributions are displayed in Figure 1. As can be seen, there are obvious differences between these three unicellular organisms and a multicellular organism, which is consistent with the one made by Alon
et al. [
2].
Network motif discovery based on pattern growth approach
The MODA algorithm introduced the concept of expansion trees based on pattern growth approach, which exploits the hierarchical structure of expansion trees by starting from
k-subtrees and then extending these subtrees through adding edges to compute the frequency of each non-tree subgraph without using subgraph isomorphism. Therefore, to reduce the time of subtree search, we utilize MTMO to list all mappings of
k-subtrees and then expanding these subtrees step by step until a complete graph. Here we apply our proposed algorithm to detect network motifs with yeast PPI data. The PPI data is downloaded from the Munich information center for protein sequences (MIPS) database [
31] (marked as MIPS network), which contains 4,546 proteins and 12,319 interactions after filtering out the self-interactions and repeated edges. There are 2 non-isomorphic types for undirected 3-node subgraphs and 6 non-isomorphic types for undirected 4-node subgraphs. Figure 2 shows each undirected 3- and 4-node subgraph type, which is labeled as
Mij, where the subscript
i represents the size of the subgraph,
j denotes the code of the subgraph, which is a decimal number that transformed from the adjacency matrix of the subgraph.
Table 5 shows the statistical properties of each 3- and 4-node subgraph type in MIPS network. Since subgraphs with Z-score larger than 2.0 are identified as motifs, the five types of M3238、M427030、M44958、M413278 and M431710 are detected as network motifs in MIPS network. Since MTMO is based on non-induced subgraphs as network motifs, we also extract induced subgrpahs in the same network using ESU algorithm for comparison. As shown in Table 6, the five types of M3238、M44382、M44958、M413278 and M431710 are network motifs.
As we can see the results from Table 5 and Table 6, four types of motifs are identical, but there is still one difference that the type of M427030 (a rectangle) is motif in our algorithm, whereas the type of M44382 (a star with three vertices) is motif in ESU. From the results of the existing algorithms, we found that motifs have high connectivity and treelike subgraphs are generally non-motifs. Therefore, non-induced subgraphs as network motifs may be more applicable, especially in incomplete PPI network.
DISCUSSION
On the basis of the detailed analyses of previous approaches, we have presented a network-centric algorithm which allows for a faster counting of non-induced subtrees. This enables to count larger subtrees in larger networks than previously possible, thereby facilitating future research on the extraction of combinatorial patterns. We also use the proposed algorithm to obtain treelet distributions for k=8 of the PPI networks and confirm previously reported differences between unicellular and multicellular organisms. Additionally, we apply our proposed algorithm to find network motifs based on pattern growth approach, which indicate that non-induced subgraphs may be more suitable for high quality network motif detection.
In this study, MTMO is applied to undirected networks. We find that applying MTMO to handle directed networks is also possible, as long as the labeled rooted tree data structure can represent the topological information of directed subtrees. Meanwhile, we will try to exploit multi-core parallel architectures, Graphic Processing Units (GPUs), HAlign and Spark distributed computing system [
34], and hadoop platform [
35] to accelerate the MTMO algorithm.
METHODS
Preliminaries
In this paper, G=(V, E) indicates an undirected graph, where V is a finite set of vertices, |V|=n is the graph size, and E is the edge set of the graph that satisfies EVV. A k-graph has size k. All vertices are assigned consecutive integers starting from 1.
We define a tree T that is a connected graph containing no cycle. Tree Ts=(Vs, Es) is a subtree of graph G if VsV and EsE( VsVs). This subtree is non-induced when u, vVs: if (u, v)Es then (u, v)E (in contrast to induced subtree, in which (u, v)E if and only if (u, v)Es). Therefore, non-induced subtree is decided by the edges that are selected, whereas induced subtree is decided by the vertices that are selected. For instance, in a triangle, there are 3 non-induced occurrences of pattern “” and one occurrence of pattern “∇”, whereas there is only one induced occurrence of pattern “∇” and no occurrence of pattern “”.
Two subtrees T1=(V1, E1) and T2=(V2, E2) are isomorphic if there exists a bijection between V1 and V2 such that two vertices are adjacent in T1 if and only if their corresponding vertices in T2 are adjacent.
For a particular subtree Ts, all subtrees isomorphic to Ts in the graph G are considered as occurrences of Ts. The frequency of a particular subtree in an input graph is the number of its occurrences in the graph.
We will now define the problem we are attempting to solve in a more precise manner. Definition 1 (Subtree Census Problem): given an input graph G and an integer k, determine the frequency of all non-induced k-subtrees of G. For example, the frequency is 9 for the particular 3-tree in the graph shown in Figure 3. Note that different occurrences can have overlaps in vertices or edges, that is, two occurrences of a subtree T, namely, T1 and T2 may share vertices. Actually, the vertex sets of T1 and T2 may be same. T1 and T2 are regarded as different occurrences of T, if they have at least one edge that they do not share.
Enumeration
Among the existing network-centric algorithms, Kavosh [
20] is one of the best one. However, the algorithm focuses on induced occurrences. Note that an induced occurrence is determined by the selected vertices, while a non-induced occurrence is determined by the selected edges. In the present study, we propose an algorithm for enumerating non-induced subtrees that is similar to Kavosh in some aspects. In Kavosh, the subgraph is extended by neighboring vertices, whereas in our algorithm, the subtree is extended by neighboring edges. The process of the enumeration is explained in detail in the following discussion.
For enumerating all non-induced k-subtrees of a given graph G, all subtrees which include a particular vertex are first found. After that, this vertex is removed and the process is repeated for the remaining vertices.
For finding the k-subtrees in which a particular vertex participated, trees with maximum depth of k, rooted at this vertex and based on neighborhood relationship are implicitly built. The children of each vertex include neighboring edges. A neighboring edge can be chosen as a child only if the corresponding neighboring vertex has not been included in the current enumerated subtree. Furthermore, all children in a particular tree must have numerical labels that are larger than that of the root of the tree.
The enumeration procedure uses the composition operation of an integer. A subtree of size k consists of k–1 edges. To extract k-subtrees, all possible compositions of integer k–1 must be taken into consideration. In mathematics, a composition of an integer k–1 is a sequence of positive integers with total sum k–1. Two sequences that differ in the order of their terms define different compositions of their summation. A composition can be expressed as k1, k2, ..., km, where k1 + k2 + ⋯+ km =k–1.To find subtrees based on the composition, ki edges are selected from the i-th level of the implicit tree to be edges of the subtrees (i = 1, 2, ..., m). Meanwhile, any neighboring vertex cannot be selected repeatedly at each level of the implicit tree, to select an edge that will expand the subtree size by one. Finally, k−1 selected edges, which consist of k vertices, define a non-induced k-subtree within the graph G.
For a particular level
i,
ki<
ni is possible, where
ni is the number of edges present at level
i. At level
i,
C(
ni,
ki) different selections of edges should be considered. In this study, all combinations of edges are selected by using the “revolving door ordering” algorithm [
34], which is the fastest algorithm used to generate combinations, and is a constant amortized time algorithm [
35].
This step is described by an example on a given graph shown in Figure 4. For this graph, all 4-subtrees containing the vertex 1 are found, as illustrated in Figure 5.
Classification
While performing the enumeration, we want a data structure to represent the topological information of the subtrees being enumerated. Hence, we construct a labeled rooted tree to perform a partial classification for enumerated subtrees. This data structure is similar to the quaternary tree in QuateXelero [
21], but the set-up and usage of this data structure is slightly altered. Specifically, the labeled rooted tree in MTMO is built according to the parent information of newly added vertex, while the quaternary tree in QuateXelero is built according to the pattern of connections of newly added vertex. Compared to the quaternary tree, the labeled rooted tree in MTMO has less memory cost. In the remainder of this paper, we use the term “vertex” for a vertex in the input network and the term “node” for the nodes of the rooted tree to avoid confusion.
The enumeration process builds an implicit tree. Thus, there is one unique parent for each vertex that is added to the current enumerated subtree, except the first vertex. Hence, we use the labeled rooted tree to represent the parent information for each newly added vertex, as exemplified in Figure 6. For k-subtrees enumeration, the labeled rooted tree has the following properties:
(i) The total length of the path from the root node to leaf is k–1.
(ii) The level of the root node is assumed to be 0.
(iii) At i-th (i = 0, 1, …, k–2) level, each internal node has at most i+1 children.
(iv) At
i-th (
i = 0, 1, …,
k–1) level, the number of nodes is at most
C(2
i,
i)/(
i+1), which is the
i-th Catalan number [
36].
(v) Each edge is labeled with a number, indicating the parent information of the newly added vertex.
The labeled rooted tree can be searched along with the extension of subtree. That is, when the current enumerated subtree is extended by one vertex, the labeled rooted tree is likewise searched one level further using the parent information of the newly added vertex. An example of such a search is illustrated in Figure 7.
When the size of the current subtree reaches
k, the current pointer of the labeled rooted tree will be moved to the appropriate leaf. All subtrees that reach the same leaf are isomorphic. However, two different leaves may correspond to the same subtree isomorphic class (Table 7). Consequently, the canonical labeling for each leaf must be computed to determine its subtree type only when the leaf is first created. Much work has already been done concerning tree isomorphism and we rely on AHU algorithm [
37] for performing it in practice. For each rooted tree of size
k, AHU algorithm produces a binary canonical labeling of length 2
k, which contains
k ones and
k zeros. An ordinary
k-tree has one or two centers. Therefore, considering the center vertex as the root vertex, an ordinary tree can be converted into one or two rooted trees. So there are one or two canonical labelings for each class of tree isomorphism. All tree topologies for
k=5 and the corresponding canonical labelings are listed in Figure 8. In order to simplify the subtree counting method, instead of searching a binary tree, we convert the canonical labeling into a decimal number, which corresponds to an array index. Thus, the frequency of each
k-subtree is stored in a corresponding array element. However, the vast majority of the array elements are not used. For all different non-isomorphic class of a given size, the canonical labeling of star tree is minimal, while the canonical labeling of path tree is maximal; not much difference exists between the maximum and minimum. To conserve memory, we generate and store an array whose length is determined by the difference between the two values plus 1 (Table 8). Among these array elements, the first element stores the counts of star tree, while the last element stores the counts of path tree, and so on.
The pseudo code for our algorithm, which integrates the isomorphism tests into the enumeration process, is shown in Algorithm 1 (see Supplemental Materials). In this algorithm, the Explore procedure searches the labeled rooted tree while the subtree is extended. After being fully expanded, the subtree is classified. Finally, the actual frequencies are stored in an array.
It should be noted that in this algorithm (lines 10 to 12) the need to call the AHU algorithm is eliminated in many cases by using the labeled rooted tree. The time complexity of AHU algorithm is O(k2) in the worst case, while the maximal complexity of Search procedure is O(k). As seen in Table 7, the number of leaves is less, which means that a great number of subtrees will reach the same leaf of the labeled rooted tree, and so calling the AHU algorithm will not be needed for them except for the first one. Consequently, this will remarkably reduce the computational time of subtree counting. In addition, the memory overhead that is used to construct the labeled rooted tree is negligible because the total number of nodes in the labeled rooted tree is less.
Complexity Analysis of subtree enumeration
The process of the enumeration takes full advantage of the composition operation of an integer. Each positive integer
m has 2
m−1 distinct compositions [
38]. For enumerating all
k-subtrees in which a particular vertex participated, all distinct compositions of integer
k−1 must be taken into consideration. According to the result of [
38], there are 2
k−2 compositions of
k−1. Each composition need to select
k−1 edges based on adjacent relationship. Meanwhile, considering that the degree of each vertex of
G is not more than the hub degree (
D), where
D is the highest degree of
G and
D scales with
n. Consequently, the worst complexity of subtree enumeration is
O(2
k−2nDk−1)
O(2
k−2nk). While the worst complexity of Ferreira’s method is
O(
knk) [
23], which is slightly less than our method when
k5. However, our proposed algorithm is much easier to implement compared with Ferreira’s algorithm, which is only theoretical research.
Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature