1 INTRODUCTION
Metabolic pathways are among the most important components of cellular systems. A metabolic pathway is, by definition, a linked series of enzyme-catalyzed chemical reactions that occur in a cell [1]. In this series of reactions, metabolites (reactants and intermediates) are converted into products that can be used in various cellular activities, including cellular growth and maintenance. Recently, the number of internet-accessible biological knowledge bases/databases storing pathway information has been growing rapidly [2]. These pathway resources aim to systematically bring together metabolic pathway and network information, expedite research and development in systems biology, and make this information more accessible to the scientific community [3]. The databases can be categorized based on the type of information/data they store, the analytical tools they contain and their objectives [4]. Table 1 presents a list of available metabolic pathway resources with their descriptions and website links. A comprehensive list of pathway resources from different categories of life sciences is available at http://www.expasy.org/links.html and http://www.pathguide.org.
Information from pathway resources has been extensively applied in in-silico reconstructions and analysis of biological models representing vital cellular systems such as signaling, regulatory and metabolic networks [5]. Accordingly, hundreds of metabolic pathway models have been reconstructed and represented in the form of predictive mathematical models for organisms whose genomes have been sequenced [6–8]. These reconstructions aim to integrate metabolic entities (genes, enzymes, reactions, and compounds) for a better understanding of cellular structure and function [9]. Well-curated and experimentally tested reconstructions have been applied in a wide variety of areas, such as identifying essential genes/reactions and drug targets [5], studying protein-protein interactions, and producing and increasing the yields of important industrial chemicals [5,10,11]. Therefore, it is of great value to assess the existing databases and pathway resources and provide insight into them, so as to help researchers in the field choose the most suitable one according to the information and data pertinent to a particular research project.
Unlike previous reviews such as [12,13], which present an in-depth review of a few pathway databases, here we discuss a relatively larger number and more diverse types of pathway resources and model repositories that have been commonly used in metabolic network reconstruction and analysis, from two perspectives. Firstly, our review compares the pathway resources with respect to their content, scope and data representation, and presents the most up-to-date information and statistics on pathways, reactions, compounds and enzymes for each database and model repository. Secondly, the review presents a brief comparison of four of the databases based on additional functions. For that, we summarized the available tools and their functions. The databases reviewed in this article include Reactome [14], MetaCyc [15], Kyoto Encyclopedia of Genes and Genomes (KEGG) [16], and Plant Metabolic Network (PMN) [17], and the model repositories and web servers include the Biochemical, Genetic and Genomic knowledge base (BiGG) [18], BioModels [19] and MetaNetX [20] (Table 1).
2 METABOLIC PATHWAY DATABASES
Although they are diverse in scope, metabolic databases should be able to provide reliable information on the four basic constituents of metabolic networks: biochemical reactions, the enzymes catalyzing those reactions, pathways, and metabolites. Metabolites should be represented in their appropriate charged or neutral states. Similarly, the databases should incorporate reaction directionality, elemental and charge balance, and the compartment to which each reaction and compound belongs. Some comprehensive databases also include genomic information and software tools for analyzing and visualizing the pathways and reactions. The availability and accuracy of these tools can also determine the extent to which a database is used. The databases also vary in data sources and export formats. Some databases, such as PMN and Reactome, are organism- or species-specific, whereas others, such as MetaCyc and KEGG, are reference databases that describe multiple organisms. The following sections describe features specific to each database. Table 2 summarizes the contents of the databases assessed in this review.
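The elemental and charge balance requirement can be made concrete with a short check. The following Python code is a minimal sketch, not a routine taken from any of the reviewed databases: the function name, the record layout and the compound labels are illustrative assumptions, and the formulas and charges for ATP hydrolysis follow one common convention for metabolites in their charged states.

```python
from collections import Counter


def balance_check(reaction, compounds):
    """Return (elements_ok, charge_ok) for a single reaction.

    reaction:  compound label -> stoichiometric coefficient
               (negative for substrates, positive for products).
    compounds: compound label -> {"formula": {element: count}, "charge": int}.
    """
    element_totals = Counter()
    net_charge = 0
    for label, coeff in reaction.items():
        record = compounds[label]
        for element, count in record["formula"].items():
            element_totals[element] += coeff * count
        net_charge += coeff * record["charge"]
    elements_ok = all(total == 0 for total in element_totals.values())
    charge_ok = net_charge == 0
    return elements_ok, charge_ok


# Illustrative records (assumed for this sketch) with charged metabolite states.
COMPOUNDS = {
    "atp": {"formula": {"C": 10, "H": 12, "N": 5, "O": 13, "P": 3}, "charge": -4},
    "adp": {"formula": {"C": 10, "H": 12, "N": 5, "O": 10, "P": 2}, "charge": -3},
    "h2o": {"formula": {"H": 2, "O": 1}, "charge": 0},
    "pi": {"formula": {"H": 1, "O": 4, "P": 1}, "charge": -2},
    "h": {"formula": {"H": 1}, "charge": 1},
}

# ATP + H2O -> ADP + phosphate + H+
ATP_HYDROLYSIS = {"atp": -1, "h2o": -1, "adp": 1, "pi": 1, "h": 1}
# The same reaction with the proton omitted, a typical curation error.
MISSING_PROTON = {"atp": -1, "h2o": -1, "adp": 1, "pi": 1}
```

With these records, `balance_check(ATP_HYDROLYSIS, COMPOUNDS)` reports both balances as satisfied, whereas the variant without the proton fails both, which is the kind of error that charged-state representation makes detectable.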
2.1 Reactome
Reactome is a publicly accessible, open-source, manually curated and peer-reviewed database of human pathways [21]. The primary goal of Reactome is to provide molecular details of signal transduction, transport, metabolism, DNA replication and other cellular processes as an ordered network of molecular transformations in Homo sapiens [14,22]. The latest version includes organisms from other domains of life. Users can exploit three important features of Reactome: they can browse, visualize and analyze reactions, metabolites, and enzymes in each pathway, and download the information in various formats.
However, there are some drawbacks that need attention. For instance, the function for comparing pathways across organisms returns only the number of pathways, reactions or compounds, with no description of which database entities are shared among the organisms compared. Furthermore, there is no way to download a subset of reactions or compounds from the pathway browsing interface or the detailed description pages, other than programmatically accessing the whole database. Most database visitors seek information related to a particular project, such as searching for gap-filling reactions during metabolic network reconstruction. In such cases users need fragments of information rather than the whole, so offering such an option would be invaluable. In addition, Reactome lacks Gene-Protein-Reaction (GPR) association information, so one has to refer to other linked databases for it, which costs the user additional time.
In general, the Reactome knowledgebase has been used to develop other species-specific and general reactomes, such as Fly Reactome, Gallus Reactome, Microme and Plant Reactome [23].
2.2 MetaCyc
MetaCyc is a highly curated database of experimentally validated metabolic pathways from all domains of life [15,24–26]. It contains pathways derived from a very large body of primary literature [15]. MetaCyc aims to serve as a general reference database on metabolism. An important feature of MetaCyc is its pathway tool, which can be used to computationally predict the metabolic network model of any organism from its sequenced genome.
However, the data retrieval and download system requires programming expertise that most users may not have, which may hinder access to useful information in the database. Most reactions in MetaCyc are elementally balanced and accompanied by thermodynamic information (standard Gibbs free energy of reaction, ΔrG°), although the substrates and products are given as full names rather than as simplified molecular formulae or abbreviations. This makes browsing somewhat cumbersome.
Regardless of these minor setbacks, MetaCyc has been used to create more than 5,700 pathway/genome databases (PGDBs) for a large number of organisms, including Saccharomyces cerevisiae [27], Arabidopsis thaliana [28], Oryza sativa [29], Mus musculus [30], Bos taurus [31], Medicago truncatula [32], Populus trichocarpa [33], Dictyostelium discoideum [34], Leishmania major [35], Chlamydomonas reinhardtii [36], several Solanaceae species [37], bioenergy-related organisms (BeoCyc) and many pathogenic organisms [38].
2.3 PMN
The Plant Metabolic Network (PMN) is a metabolic pathway database that hosts one reference database (PlantCyc) and 22 species-specific databases, all of which focus on plant metabolic pathways. At the center of PMN is PlantCyc, a metabolic pathway reference database containing more than 900 pathways together with their catalytic enzymes and genes. Furthermore, PlantCyc contains compounds from over 350 plant species. The pathways in PlantCyc are drawn from experimentally validated literature and curated by PMN and its collaborators [17].
PlantCyc can be considered a derivative of MetaCyc, and accordingly many of its features resemble those of MetaCyc. The database schema and the classification of pathways, reactions, and compounds follow those of MetaCyc. Hence, the drawbacks of MetaCyc are also reflected in PlantCyc. In addition, PlantCyc contains some computationally predicted pathways and some hypothetical pathways (which are not part of plant metabolism but are used in the metabolic pathways of other organisms), yet no special markers or other identifiers distinguish these classes of database entities on the pathway diagrams.
The datasets and pathways from PMN have been applied to in-depth investigations of plant metabolism, including pathway prediction, metabolic network reconstruction and the development of bioinformatics tools [33,39–44]. For instance, in [43] PMN was used to investigate genomic signatures of specialized metabolism in plants. The MORPH algorithm [44], which ranks candidate genes for membership in Arabidopsis and tomato pathways, was also developed based on pathway information from PMN.
2.4 KEGG
KEGG is a database resource developed to understand high-level functions and utilities of biological systems from molecular-level information. The objective of KEGG was to create a reference knowledge base of metabolism and other cellular processes [45]. The most recent version of KEGG contains 16 databases, grouped into four categories: systems, genomic, chemical and health information [16].
KEGG is well known for the detailed description, representation, and visualization of its contents. However, some discrepancies arise in reaction representations. For instance, R00472 is represented as the sum of two reactions, R00473 + R10612, but there is no description of how these reactions combine to give R00472. KEGG can be accessed by users with varying levels of programming skill, but the search function at both the KEGG and DBGET search interfaces appears complicated for those with little or no programming skill. Although DBGET offers a keyword search option, it still requires KEGG prefixes as search criteria. Providing one example for each method in the help pages could improve usability in this regard. In addition, browsing some pathways, such as glycolysis (Embden-Meyerhof pathway, D-glucose to pyruvate), yields long maps that cannot be zoomed, forcing the user to scroll up and down while browsing parts of the pathway.
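To illustrate what "a sum of two reactions" means in stoichiometric terms, the sketch below adds the coefficient vectors of two steps so that a shared intermediate cancels. It deliberately uses placeholder compounds (A, B, C, D, X) rather than the actual participants of R00473 and R10612, which are not given here; the function name is likewise an assumption made for illustration.

```python
from collections import Counter


def sum_reactions(*reactions):
    """Add stoichiometric coefficients and drop compounds that cancel to zero."""
    total = Counter()
    for reaction in reactions:
        for compound, coeff in reaction.items():
            total[compound] += coeff
    return {compound: coeff for compound, coeff in total.items() if coeff != 0}


# Placeholder steps (hypothetical): negative = consumed, positive = produced.
STEP_1 = {"A": -1, "B": -1, "X": 1}  # A + B -> X
STEP_2 = {"X": -1, "C": 1, "D": 1}   # X -> C + D

# The intermediate X is produced once and consumed once, so it cancels.
OVERALL = sum_reactions(STEP_1, STEP_2)  # A + B -> C + D
```

A database entry that stated the component reactions, their directions and the cancelled intermediate in this explicit form would remove the ambiguity noted above.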
The datasets and tools of KEGG have been used in a remarkable number of scientific studies, including genome annotation, pathway prediction, metabolic network reconstruction, and the development of bioinformatics tools [46–52]. For instance, the KEGG omics data visualization tool was integrated with a desktop application, KegArray, and many web-based tools to visualize and interpret the large amounts of data obtained from high-throughput measurement techniques such as microarray, metagenome, and metabolome analyses [49]. Pathway Inspector [50], a pathway-based web application for RNA-seq analysis of model and non-model organisms, is another tool built on the KEGG database. KEGG GO terms have also been used in developing CGDB [51], a database of circadian genes in eukaryotes, among many others.
2.5 Comparison of the pathway databases
The pathway databases can also be compared in terms of the software or tools that facilitate analysis and interpretation of their contents. Table 3 summarizes the tools specific to each database.
5 MODEL REPOSITORIES
Since the models come from diverse research groups or individuals, one of the current challenges with metabolic network models is inconsistency in how their components are represented, such as reaction or compound names, formulae, and the symbols used to designate cellular compartments. Therefore, online model repositories should go beyond simply storing models and become platforms for their systematic analysis and standardization; the efforts of BiGG and MetaNetX.org can be mentioned in this regard. Furthermore, model repositories are expected to provide concise descriptions of the models, such as simulation conditions, which may include the in-silico growth medium, energy requirements, objective function, and file compatibility information. Bearing this in mind, the following sections assess some of the most common metabolic network model databases.
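One way to picture the concise model description called for here is as a small structured record that a repository could attach to each model. The following Python sketch is purely illustrative: the class, its field names and the example values are assumptions and do not correspond to the schema of any repository discussed in this review.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelDescription:
    """Hypothetical per-model record of simulation conditions."""

    model_id: str
    objective_function: str = ""
    # Exchange reaction label -> maximum uptake rate in the in-silico medium.
    growth_medium: Dict[str, float] = field(default_factory=dict)
    # Energy requirement label -> value (e.g., maintenance costs).
    energy_requirements: Dict[str, float] = field(default_factory=dict)
    # File formats the model is known to be compatible with.
    file_formats: List[str] = field(default_factory=list)
    # None means the validation status was never stated.
    experimentally_validated: Optional[bool] = None

    def missing_fields(self) -> List[str]:
        """List the descriptive fields that were left empty."""
        missing = []
        if not self.objective_function:
            missing.append("objective_function")
        if not self.growth_medium:
            missing.append("growth_medium")
        if not self.energy_requirements:
            missing.append("energy_requirements")
        if not self.file_formats:
            missing.append("file_formats")
        if self.experimentally_validated is None:
            missing.append("experimentally_validated")
        return missing


# Example record with made-up values; two fields are intentionally left out.
EXAMPLE = ModelDescription(
    model_id="toy_model",
    objective_function="biomass",
    growth_medium={"glucose_exchange": 10.0},
    file_formats=["SBML Level 3"],
)
```

Calling `EXAMPLE.missing_fields()` lists the energy requirements and the validation status as undocumented, which is the kind of gap a repository could flag before a model is reused.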
5.1 BiGG
Biochemical, Genetic and Genomic (BiGG) is a knowledge base of genome-scale metabolic network (GEM) models. The latest version of BiGG comprises 80 high-quality, manually curated genome-scale metabolic models (including the E. coli core model), 15,288 reactions and 5,175 metabolites (as of November 28, 2016). Additional features include pathway visualization with Escher maps (Escher is a web-based tool for building, viewing, and sharing visualizations of biological pathways), various model export formats (SBML Level 3, MAT and JSON) and a model validation function.
Despite its many good qualities, BiGG has some minor drawbacks that should be addressed. First, each model should carry information stating whether or not it has been experimentally validated. In addition, the model validation tool is limited to SBML Level 3 Version 1 files with the flux balance constraints (FBC) version 2 package; since a significant number of metabolic models still exist in SBML Level 2, an option for this file format would be invaluable.
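Whether a given file falls inside or outside this restriction can be checked before upload by reading the `level` and `version` attributes of the root `sbml` element and looking for the FBC package namespace. The sketch below uses only the Python standard library; the function names and the two tiny example documents are assumptions made for illustration, and the check encodes the restriction as described in the text, not the validator's actual implementation.

```python
import xml.etree.ElementTree as ET


def sbml_summary(sbml_text):
    """Read SBML level, version and (if declared) the FBC package version."""
    root = ET.fromstring(sbml_text)
    summary = {
        "level": int(root.attrib["level"]),
        "version": int(root.attrib["version"]),
        "fbc_version": None,
    }
    for key in root.attrib:
        # ElementTree stores namespaced attributes as "{namespace-uri}name",
        # e.g. the fbc:required flag on the root element.
        if key.startswith("{") and "/fbc/version" in key:
            uri = key[1:key.index("}")]
            summary["fbc_version"] = int(uri.rsplit("version", 1)[1])
    return summary


def matches_l3v1_fbc2(summary):
    """True if the file is SBML Level 3 Version 1 with FBC version 2."""
    return (summary["level"], summary["version"], summary["fbc_version"]) == (3, 1, 2)


# Minimal example documents (model content omitted for brevity).
L3_FBC2 = (
    '<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" '
    'xmlns:fbc="http://www.sbml.org/sbml/level3/version1/fbc/version2" '
    'level="3" version="1" fbc:required="false"><model id="toy"/></sbml>'
)
L2_V4 = (
    '<sbml xmlns="http://www.sbml.org/sbml/level2/version4" '
    'level="2" version="4"><model id="old"/></sbml>'
)
```

Here `matches_l3v1_fbc2(sbml_summary(L3_FBC2))` is true while the Level 2 document is rejected, so a user can tell in advance that a conversion step would be needed.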
The GEMs from BiGG have been applied in the development of bioinformatics tools and resources such as Fast-SNP (fast matrix pre-processing algorithm for efficient loopless flux optimization of metabolic models) [53] and the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB) [54], which includes an integrative view of protein, gene and 3D structural information.
5.2 BioModels
BioModels is an online reference repository that hosts peer-reviewed quantitative, dynamic models of biological networks from numerous organisms [19]. Currently, BioModels contains over 1,200 models reported in the literature and 140,000 models automatically generated from pathway resources by the Path2Models project of Büchel et al. [55].
However, the automatically generated models still need some degree of manual intervention, as they cannot always accurately predict the production of the defined biomass precursors. For example, the automatically generated Homo sapiens metabolic network model was incapable of synthesizing nine amino acids, namely cysteine, histidine, isoleucine, leucine, lysine, methionine, threonine, tryptophan, and valine, in a defined minimal medium [55]. In another example, flux balance analysis (FBA) of the model BMID000000142305, used as obtained from the BioModels database, gave a growth rate of zero for the wild-type (WT) strain in simulations carried out both at MetaNetX.org and on a local machine using COBRA Toolbox version 2.0. Therefore, it is critical to carry out an initial curation of these models before any research or investigation that depends on them.
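A quick first step in such an initial curation is to ask which biomass precursors are reachable at all from the medium. The sketch below is not FBA: it is a simple topological reachability check (sometimes called network expansion) that ignores stoichiometry, reversibility and cofactor cycling, so it can flag obviously disconnected precursors but cannot confirm growth. The toy reactions and all names are deliberately simplified assumptions.

```python
def producible_metabolites(reactions, medium):
    """Iteratively collect every metabolite reachable from the medium.

    reactions: iterable of (substrates, products) tuples of metabolite names.
    medium:    iterable of metabolite names available at the start.
    """
    available = set(medium)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            # A reaction can fire once all of its substrates are available.
            if set(substrates) <= available and not set(products) <= available:
                available.update(products)
                changed = True
    return available


def unreachable_precursors(reactions, medium, precursors):
    """Return the precursors that cannot be reached from the medium."""
    reachable = producible_metabolites(reactions, medium)
    return sorted(set(precursors) - reachable)


# Toy network (lumped, illustrative steps; not real stoichiometry).
TOY_REACTIONS = [
    (("glucose",), ("pyruvate",)),
    (("pyruvate", "ammonia"), ("alanine",)),
    (("homoserine",), ("threonine",)),  # homoserine has no producing reaction
]
TOY_MEDIUM = ["glucose", "ammonia"]
TOY_PRECURSORS = ["alanine", "threonine"]
```

In this toy case `unreachable_precursors(TOY_REACTIONS, TOY_MEDIUM, TOY_PRECURSORS)` reports threonine, pointing the curator to a missing upstream reaction before any flux simulation is attempted.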
Models from BioModels have been used in a large number of scientific studies involving metabolic network analysis. The manually curated models have been widely used in gene and reaction essentiality analysis, the study of protein-protein interactions, the identification of drug targets, and the development of bioinformatics tools [56–58].
5.3 MetaNetX/MetaNetX.org
MetaNetX serves as a repository for genome-scale metabolic network models, and MetaNetX.org [20] is an online platform for accessing, analyzing and manipulating genome-scale metabolic networks (GSMNs). The website and tools of MetaNetX.org provide multiple functions: i) free access to the MNXref reconciliation data and a collection of published GSMNs; ii) the ability for users to upload, manipulate, analyze or modify their own genome-scale metabolic network models and export them in SBML or MXN tab-delimited format; and iii) tools for metabolic network analyses based on network structure, FBA or nested pattern methods [59].
MetaNetX.org has tremendous advantages users can exploit. However, there are still issues that need to be considered. For instance, there is no option for programmatic access to the wealth of metabolic knowledge in this repository. In addition, some models that can be picked up from the repository for analysis are automatically generated by computational methods. Therefore, providing evidence or references for experimental validation of this group of models is critical.
The resources and tools of MetaNetX.org have been widely applied to metabolic network reconstruction and analysis. FBA and other interactive tools have been used to develop new reconstructions, compare existing reconstructions, and extract components of reconstructions, including compounds, reactions, gene-protein associations, media components and the components of the biomass equation with their coefficients [60–62].
5.4 Comparison of the model repositories
The model repositories can also be compared based on the additional functions they provide to users. Table 4 presents some of the repository-specific tools available.
6 CONCLUSION AND PERSPECTIVE
The present review discusses pathway knowledge bases/databases and model repositories that have been commonly used in metabolic network reconstruction and analysis, from two perspectives. Firstly, although each database and knowledge base has its own aim and scope, the review presents a brief comparison of pathway resources based on their scope, contents, and applicability. Some inconsistencies were observed in the nomenclature and representation of database entities. For example, KEGG uses an R number as a reaction identifier, such as R00259 for the acetylation of L-glutamate, while for the same reaction BiGG uses the identifier ACGS and MetaCyc uses the EC number 2.3.1.1 or the enzyme name. MetaNetX, on the other hand, uses the prefix MNX followed by a single letter indicating the type of object (R for reaction, M for metabolite, C for cellular compartment) and then an integer (e.g., MNXR1234). These inconsistencies hamper maximal use of the knowledge accumulated in these databases and in systems biology at large. Hence, it is strongly recommended that database creators and developers of metabolic network models follow international standards for the nomenclature of reactions and metabolites, such as the rules of the IUBMB (International Union of Biochemistry and Molecular Biology), to facilitate the integration of the databases and related pathway resources.
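The practical consequence of these differing conventions is that any integration script must first decide which namespace an identifier belongs to. The following Python sketch classifies the example identifiers quoted above by pattern; the regular expressions are simplified assumptions based only on the formats described in this paragraph, and free-form abbreviations such as the BiGG identifier ACGS cannot be recognized by pattern alone.

```python
import re

# Simplified patterns (assumptions) for the identifier styles described above.
PATTERNS = [
    ("KEGG reaction", re.compile(r"^R\d{5}$")),
    ("EC number", re.compile(r"^\d+\.\d+\.\d+\.\d+$")),
    ("MetaNetX", re.compile(r"^MNX([RMC])(\d+)$")),
]

# Object type encoded by the letter after the MNX prefix.
MNX_TYPES = {"R": "reaction", "M": "metabolite", "C": "compartment"}


def classify_identifier(identifier):
    """Return a label for the namespace an identifier appears to belong to."""
    for label, pattern in PATTERNS:
        match = pattern.match(identifier)
        if match is None:
            continue
        if label == "MetaNetX":
            return "MetaNetX " + MNX_TYPES[match.group(1)]
        return label
    return "unrecognized"
```

For instance, `classify_identifier("R00259")` gives "KEGG reaction" and `classify_identifier("MNXR1234")` gives "MetaNetX reaction", whereas "ACGS" comes back as "unrecognized", which shows why a shared nomenclature or an explicit cross-reference table is needed.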
Secondly, we observed that three of the databases reviewed, PMN, KEGG, and Reactome, provide well-organized summaries and statistics about their content on their websites. Other pathway resources also contain such information, but not in a well-structured form. Hence, it would be of great value if pathway resources offered such information in a clear and precise way, so that users can easily form a clear picture of the resources in a database or repository and choose an appropriate one for a particular research interest.
Finally, none of the reviewed model repositories provides a brief description of the models’ characteristics, such as simulation conditions. Providing such information, including the in-silico growth medium, objective function, energy requirements, and evidence for experimental validation of each model, is a critical point to be considered by existing model repositories and those built in the future.
Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature