In 2011, we proposed the need to develop ePlant [1], a highly mechanistic model of plant growth and developmental processes throughout the whole plant growth cycle, which will differ from all previous crop models by having a detailed mechanistic basis for all processes, spanning from molecular reactions up through plant-environment interactions. Rapid progress has been made in recent years in the development of the component modules (or sub-models), theoretical tools and applications around ePlant. In this perspective paper, we review the original rationale, concept and components of ePlant and a method for its development. We then propose a theoretical framework to develop and apply ePlant in the big data era. Finally, we discuss recent efforts in developing an international consortium to promote quantitative and predictive plant science research, with the realization of ePlant being one of the central goals.
WHAT IS ePLANT AND WHY DO WE NEED TO DEVELOP IT?
ePlant will be a mathematical model that aims to simulate dynamic plant growth and development throughout the growth cycle. It differs from earlier crop models, such as APSIM [2] and the DSSAT crop models [3], by explicitly simulating the detailed mechanisms underlying different processes. It spans scales from the organelle, cell, tissue, organ and whole plant to the ecosystem level; it includes processes spanning gene regulation, metabolism, metabolite transport at the tissue and organ levels, organ morphogenesis, and plant-environment interactions (Figure 1). We envisage that ePlant will become a pivotal tool for predictive and quantitative plant science research in the modern big data era.
Firstly, ePlant, or the sub-models used in ePlant, can be used as a basic tool for the quantitative study of diverse plant systems, such as the regulatory circuits controlling the stability of plant metabolic systems under different conditions [4], the mechanistic basis of biophysical signals such as the chlorophyll fluorescence induction curve [5,6] and mesophyll conductance [7], and the identification of optimal agronomic practices for improved biomass production [8]. Similar to the earlier crop growth models, ePlant can be used to guide crop management [3], to select physiological traits for crop breeding [9] and to predict the response of crops to changing climates [10–12].
Secondly, ePlant can be used as a critical component in current general circulation models (GCMs) [13]. GCMs are models of the circulation of the planetary atmosphere or ocean, which can be used for weather forecasting and for studying climate and climate change. Due to the large magnitude of CO₂ fluxes from terrestrial photosynthesis and respiration [14], terrestrial processes greatly influence the global carbon cycle. In current GCMs, models representing plant growth and development are much less accurate than models of atmosphere- and soil-related physical processes. As a reflection of this, even for some of the best studied plant species, such as rice, no contemporary model can accurately predict productivity under elevated CO₂ and temperature at different sites [12]. Furthermore, variations in predicted rice productivity are higher between individual crop models than variations resulting from 16 global climate model-based scenarios [12]. One possible reason for this low predictive power is the lack of molecular detail on plant growth and development in current crop models. ePlant, with a mechanistic description of plant growth and developmental processes and of the interaction between plants and their environments, will drastically improve predictions of plant behavior under different climates, and hence improve the capacity of current GCMs to predict climate and to identify strategies for coping with the changing climate.
Thirdly, ePlant can be used as a basic tool to support the molecular design of crops, i.e., to develop new strategies to improve crops for desirable traits, such as improved yield potential, improved grain quality, higher stress tolerance or higher resource use efficiency [15]. This is especially relevant now that it has become relatively easy to manipulate a gene or a combination of genes in plants, especially in agriculturally important crops. The main challenge that remains is to identify the targets to be manipulated to gain the desired traits. Previously, a number of options to improve photosynthesis have been identified through a systems modeling approach. For example, canopy photosynthesis models were used to identify the optimal Rubisco kinetic properties, photoprotective properties and canopy architectural parameters [16–19]; dynamic systems models were used to identify genes controlling photosynthetic efficiencies in both natural and designed metabolisms [20–23]; a reaction diffusion model of a mesophyll cell was used to identify major limiting factors controlling mesophyll conductance [7]; and a leaf internal light prediction model was used to demonstrate the importance of different anatomical features for leaf photosynthetic rates [24]. Many of the identified options have been shown to be effective in enhancing photosynthesis and biomass production [25,26], demonstrating the effectiveness of this approach. We envisage that once ePlant is developed, it can be used to systematically evaluate different aspects of plants that hold potential to be improved for desirable features.
Fourthly, ePlant and the sub-models included in ePlant (as discussed in detail later) can be used to quantitatively represent the contemporary plant biology knowledge. Compared to a textual representation, either in the form of papers, textbooks or Wikipedia, the quantitative representation of plant biological knowledge encapsulated in ePlant or its sub-models can effectively facilitate communication among researchers specializing in different aspects of plant growth and development and hence promote cross-fertilization of ideas. Such quantitative representation will also help identify knowledge gaps in the current understanding of plant growth and development. Finally, ePlant and its modules can also be used as effective and visual teaching tools.
THE ESSENTIAL FUNCTIONAL MODULES OF ePLANT
To achieve the ePlant described above, at least four categories of functions are required. Different mathematical models therefore have been developed or need to be developed to realize the simulation of these functions.
Firstly, ePlant needs to explicitly incorporate the biophysical and biochemical mechanisms controlling photosynthesis and all the closely related metabolic processes, such as respiration and nitrogen assimilation [27,28]. In this respect, mechanistic models of photosynthetic metabolism have now been established for C₃, C₄ and crassulacean acid metabolism [21,29,30]. In contrast, a mechanistic model of respiration is yet to be developed. Along this line, it is worth noting that a mechanistic model of mitochondrial energy generation in a human heart cell [31] and a simplified model of plant respiratory processes [32] have been built earlier. A fully mechanistic model able to predict the interactions between photosynthesis, respiration and nitrogen assimilation is yet to be developed.
The availability of the substrate of photosynthesis, i.e., CO₂, is controlled by stomatal conductance and mesophyll conductance. Stomatal conductance is influenced by an array of internal metabolic processes and external environmental factors [33–37]. Models of stomatal conductance with varying degrees of mechanistic basis have been built [38,39], but a fully mechanistic model of stomatal conductance is yet to be developed. Mesophyll conductance is another critical factor controlling leaf photosynthetic efficiency. Highly mechanistic models of mesophyll conductance have been built in recent years [7,40,41].
Leaf anatomy controls leaf photosynthesis by influencing the leaf internal light environment and the leaf internal CO₂ and temperature profiles [42,43]. Efforts to model photosynthesis by considering leaf anatomy have been made recently [24,44]. Furthermore, considering the close interaction between plant primary metabolism and secondary metabolism, kinetic systems models need to be combined with genome-scale models of metabolic and regulatory processes [45] to enhance the prediction accuracy of future systems models. Such a combination will ultimately enable the model to predict not only crop yield potential but also the quality of harvestable components, since the anabolism of different metabolites related to quality, such as starch composition and aroma-related compounds, can be explicitly represented in such models.
Secondly, ePlant needs to predict the complete crop growth and developmental process [1]. Along this line, models with different degrees of mechanistic basis have been built to simulate plant developmental processes, e.g., the gradual formation of 3D canopies over time [46,47], flowering [48], shoot patterning [49], the flow of photosynthate from source to sink [50,51] and 3D root growth dynamics [52,53]. So far, however, a mechanistic model of the partitioning of assimilates among different organs is yet to be developed [54].
Thirdly, ePlant needs to predict the acclimation of plant metabolism and structure under different environments. Hence, modules simulating the gene regulatory and signal transduction processes related to crop growth and development are needed. Predicting the variation of phenotypes under different genotype × environment × management combinations is the holy grail of crop systems modeling research [55]. Most of the research on this topic is still in its infancy. Development of a gene regulatory network (GRN), which incorporates the interactions of all involved regulatory cis-elements and trans-factors, is a critical task towards such a goal. Various bioinformatics algorithms, based on correlation, feature selection, probabilistic graphical models, etc., have been developed that use genome-scale omics data, in particular transcriptomics data, to build GRNs [56–58]. Only a limited number of GRNs have been developed for plant-related processes, such as flowering date determination [59], photoperiod and circadian clock [60], and seed setting [61]. So far, these GRNs have mostly not been linked to current crop systems models, although it is worth noting that a GRN related to the circadian clock has been linked to an Arabidopsis model [62]. Such disconnection between GRNs and crop models partially explains why current models, though capable of predicting the performance of crops in the regions where they are parameterized, cannot predict crop performance as accurately beyond the regions, environmental conditions or cultivars used for their parameterization. Here we emphasize that though some work has been done in developing GRNs based on transcriptomics data, models predicting the regulatory processes at post-transcriptional levels, including transcript stability and degradation, translation, and post-translational modification, are largely lacking. There is a long way to go before any realistic model for predicting the acclimation of plants under different environments becomes available.
Fourthly, in addition to the biological processes discussed above, ePlant needs to include models of the interaction between plants and their surrounding soil and atmosphere. These interactions control plant growth and development. Modeling plant-environment interaction requires simulation of soil hydraulic dynamics, nutrient cycles, temperature profiles, etc., which are the basis for predicting the soil water status and nutrient availability to roots. The microclimate inside the canopy, such as light, temperature, CO₂, humidity and wind speed, also needs to be incorporated in crop systems models [19,63] to ensure an accurate prediction of the exchange of gas, water and momentum between canopy and atmosphere. Models of soil-related processes have been well developed, e.g., the CENTURY model [64,65], while fully integrated canopy photosynthesis and microclimate models are yet to be developed. ePlant needs to integrate the above-ground processes with the below-ground processes to develop a fully integrated microclimate model, including linking soil water status with leaf biological and hydrological processes [38,66,67].
USING DIVIDE-CONQUER AND TRANSFER VARIABLES TO REALIZE THE MULTI-SCALE, MULTI-PHYSICS ePLANT
As discussed above, ePlant includes modules describing processes at different temporal and spatial scales, with each process at a particular scale potentially represented by different modules and each module potentially using different methods (see Figure 1 and Table 1). Therefore, ePlant is not a single model; rather, it is an assembly of modules which can be combined to form models with different temporal, spatial and physical resolutions. ePlant development follows a two-step strategy, i.e., first divide-and-conquer to develop individual modules and then integration of the modules through transfer variables. When we divide plant growth and developmental processes into different units, i.e., modules, we follow the principle of maximizing connections within modules while minimizing connections between modules, as was done during the development of the ePhotosynthesis models [20,29,78]. The connectivity between the photosystem II (PSII) unit and the other components of ePhotosynthesis is minimal, which justifies the development of an independent module for PSII photochemistry and biophysical processes [78]; similarly, the connectivity between photosynthetic carbon metabolism and the photosynthetic light reactions is relatively low, which justifies the development of an independent model of photosynthetic carbon metabolism [20]. Another principle that can be used to divide modules is to separate reactions or processes occurring at drastically different time scales, because every process in ePlant can be viewed as dynamic at a higher time resolution and as a steady-state process at a lower time resolution. Processes at similar temporal and spatial scales can be grouped together as a module. ePlant hence includes modules working at different temporal and spatial scales, i.e., the ecosystem level, the crop physiology level, the metabolism level, and the gene regulatory network level (Figures 1 and 2). Transfer variables, defined as outputs of lower-level modules that are at the same time inputs to higher-level modules, are used to integrate modules at these different scales (Figure 1).
Photosynthetic CO₂ uptake occurs at different temporal and spatial scales. Here we use modules of photosynthetic CO₂ uptake to illustrate how transfer variables are used to integrate modules at different scales. At the ecosystem scale, photosynthetic CO₂ uptake can be predicted using a sunlit-shaded model, which calculates canopy photosynthesis by summing the CO₂ uptake rates of sunlit and shaded leaves [72]. At the leaf scale, photosynthetic CO₂ uptake can be predicted using models which explicitly describe both leaf anatomy and leaf metabolic processes [24]. Leaf-scale photosynthetic CO₂ uptake can also be predicted with a steady-state biochemical model that considers the stoichiometric relationships between reactions [73]. The photosynthetic CO₂ uptake rate at the metabolism scale can be predicted with a dynamic systems model that considers both stoichiometry and enzyme kinetics [20]. The photosynthetic CO₂ uptake rate at the level of the gene regulatory network can be predicted with a detailed consideration of the regulatory processes influencing photosynthesis [79]. If we need to integrate a physiological model of canopy photosynthesis, e.g., a sunlit-shaded model [72], with a dynamic systems model of C₃ photosynthesis, the maximal Rubisco-limited RuBP carboxylation rate (Vcmax) and the maximal electron transfer rate (Jmax) can be used as transfer variables. Specifically, we can use a dynamic systems model of photosynthetic metabolism, such as the C₃ carbon metabolism model [20], to predict the response of the photosynthetic CO₂ uptake rate (A) to different CO₂ levels, from which Vcmax and Jmax can be inferred. These two transfer variables can then be used as inputs to the physiological-level models to predict photosynthesis at the canopy level under different environments. Such an integration of a canopy photosynthesis model and a metabolism model enables examination of the impacts of manipulating different enzymes on canopy photosynthesis. Similarly, if a kinetic model of the gene regulatory processes controlling photosynthesis development is available, it can be used to predict the quantities of the different proteins involved in photosynthesis, which can then be used as transfer variables for metabolic systems models.
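The transfer-variable step described above can be illustrated with a minimal Python sketch. It is not the implementation used for ePlant. The function lower_level_model is a hypothetical placeholder for a dynamic metabolic model run to steady state; here it simply returns values from the widely used steady-state C₃ biochemical equations so that the example is self-contained. The constants (GAMMA_STAR, KM_EFF, RD), the parameter values and the CO₂ threshold ci_split that separates Rubisco-limited from RuBP-limited points are illustrative assumptions; in practice the transition point has to be identified from the simulated curve. The value fitted from the RuBP-limited points is the electron transport rate at the simulated light level, which approximates Jmax only under saturating light.

```python
# Illustrative constants for a C3 leaf at 25 degrees C (assumed values, not from the paper).
GAMMA_STAR = 42.75  # CO2 compensation point without day respiration, umol mol-1
KM_EFF = 710.0      # effective Rubisco Km, Kc * (1 + O / Ko), umol mol-1
RD = 1.0            # day respiration, umol m-2 s-1, assumed known


def farquhar_a(ci, vcmax, jmax):
    """Net CO2 uptake as the minimum of the Rubisco- and RuBP-limited rates."""
    ac = vcmax * (ci - GAMMA_STAR) / (ci + KM_EFF)
    aj = jmax * (ci - GAMMA_STAR) / (4.0 * ci + 8.0 * GAMMA_STAR)
    return min(ac, aj) - RD


def lower_level_model(ci):
    """Hypothetical placeholder for a dynamic metabolic model at steady state."""
    return farquhar_a(ci, vcmax=100.0, jmax=180.0)


def slope_through_origin(xs, ys):
    """Least-squares slope k of y = k * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def infer_transfer_variables(ci_values, a_values, ci_split):
    """Infer Vcmax from points below ci_split and Jmax from points at or above it."""
    low = [(c, a) for c, a in zip(ci_values, a_values) if c < ci_split]
    high = [(c, a) for c, a in zip(ci_values, a_values) if c >= ci_split]
    vcmax = slope_through_origin(
        [(c - GAMMA_STAR) / (c + KM_EFF) for c, _ in low],
        [a + RD for _, a in low],
    )
    jmax = slope_through_origin(
        [(c - GAMMA_STAR) / (4.0 * c + 8.0 * GAMMA_STAR) for c, _ in high],
        [a + RD for _, a in high],
    )
    return vcmax, jmax


# Simulate an A-Ci response with the lower-level model, infer the transfer
# variables, then hand them to the higher-level (physiological) model.
ci_grid = [50.0, 100.0, 150.0, 200.0, 250.0, 800.0, 1000.0, 1200.0, 1500.0]
a_sim = [lower_level_model(ci) for ci in ci_grid]
vcmax_hat, jmax_hat = infer_transfer_variables(ci_grid, a_sim, ci_split=500.0)
a_new_condition = farquhar_a(600.0, vcmax_hat, jmax_hat)
```

In this arrangement the higher-level model never needs to know the internal state of the metabolic model; only the two transfer variables cross the module boundary, which is what allows either module to be replaced independently.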
The model integration process discussed above works well if the models working at different scales all describe continuous processes. However, models of continuous processes and models of discrete processes cannot be integrated using this method. Under such circumstances, a probabilistic regulation of metabolism algorithm, which has been developed to link GRNs to a constraint-based genome-scale metabolism model for E. coli [80], can be used. In all these model integration processes, it is important to ensure that known constraints, such as the stoichiometric constraints on biomass composition and growth rate [81], are maintained.
Here we emphasize that ePlant will not be one model; rather, it will be a series of continuously evolving models with gradually increasing mechanistic detail. The level of mechanistic detail needed for any particular realization of ePlant depends on the question to be addressed. Therefore, though development of the first integrative ePlant model is a concrete goal, the development and improvement of ePlant will be continuously ongoing work. Because modules describing different plant processes have different levels of mechanistic detail, ePlant developed at any particular time point will inevitably be a mosaic of modules with different levels of mechanistic detail.
A THEORETICAL FRAMEWORK TO SUPPORT PREDICTIVE AND QUANTITATIVE PLANT BIOLOGY RESEARCH IN THE BIG DATA ERA
ePlant is a mathematical representation or integration of the current knowledge about a living plant. Each component of, process in, or action on plants can be abstracted as a term used to describe a component, function or application of ePlant (Table 2). In a broad sense, every researcher has his or her own model, which is used to interpret experimental observations; this is analogous to fitting parameters to a mathematical model, though in a qualitative way. During a typical research project, we explore the unknown and extend the boundary of our knowledge by studying a difference that cannot be explained by current knowledge, or the current “model”. Pushing this analogy even further, when experiments are designed and results are compared between different groups, we are in some sense studying phenotypic variations with the different models embraced by different labs. Unfortunately, due to the complex nature of plants, every “model” is right only to a certain degree and no “model” is absolutely right [82]. The process of pushing “models” closer and closer to the absolute “truth” can be seen as the essence of scientific research. The same process occurs during the development of ePlant and its component models, i.e., ePlant will become a better and better representation of reality as its capacity to predict the distribution of output variables from the distribution of input variables gradually improves (Figure 3).
When simulations are used to draw conclusions about a particular process, the reliability of the model prediction is crucial. Four types of errors can potentially create “artificial” differences between model predictions and real plants: i) errors caused by inaccurate and imprecise experimental techniques or operations; ii) errors introduced during model parameterization, i.e., parameters measured in vitro may not represent those in vivo, and even parameters estimated in vivo may still be biased due to technological limitations; iii) errors due to uncertainty of the model, especially when the model is used to represent a process for which a complete mechanistic understanding is unavailable, either due to unknown variables or unknown relationships among variables, so that empirical equations or relationships derived from limited data are used; iv) errors due to the model structure. Simulation of a particular phenomenon needs a model with appropriate spatial and temporal scales. If a model’s temporal and spatial resolution is too high for the question under study, too many unnecessary assumptions will be introduced, magnifying potential structural errors. If a model’s temporal or spatial resolution is too low for the question under study, the model is unlikely to generate novel insights regarding that question.
A theoretical framework therefore needs to be developed to enable the study of these different errors and their impact on model behavior. Minimally, the framework needs to address the following questions: How much will bias in measurable and non-measurable parameters influence the reliability of our simulations? How much will the uncertainty of the model itself influence the reliability of model simulations? How much will the scale of the model influence the reliability of model simulations? How can models developed with different temporal and spatial resolutions and mechanistic details be unified while maintaining the essential prediction capacity? How should the potential bias of experimental measurements be interpreted? How much will this bias influence the comparison between experiment and simulation, and the reliability of conclusions? If no mechanistic understanding is available for a particular phenomenon, how can information from experimental data still be effectively used in model simulations?
In this respect, mathematical theories such as information geometry [83] can potentially be adapted and used to support the studies discussed above. Theoretically, information geometry treats a model as a function, or mapping, between model parameters (inputs) and experimental measurements (outputs); the model structure is therefore equivalent to a certain shape of a manifold in a hyperspace [83]. Although the relationship between model input parameters and measured phenotypic output parameters is many-to-one, it is possible to estimate the confidence intervals of the model output variables from the variation of the model parameters, i.e., by creating an ensemble of input parameters and using it to predict the distribution of model outputs, from which the confidence intervals of the model outputs can be derived. Conversely, if the variation of a particular physiological parameter (or model output) is known, it is possible to deduce the potential variation of certain input model parameters as well. The deduced variability of model parameters can inform us about the feasibility and effectiveness of engineering a particular plant trait for a desired biological output. If a deduced input variable shows little variation, it would be less feasible to manipulate this variable; furthermore, even if a deduced input variable shows large variation, if it has little impact on the output parameter, it is unlikely to be an effective parameter to modify (Figure 4).
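The forward direction of this idea, i.e., propagating an ensemble of input parameters through a model to obtain an interval for its output, can be sketched in a few lines of Python. This is only a plain Monte Carlo illustration under stated assumptions, not an information-geometry method and not part of ePlant: toy_model is a hypothetical stand-in for any ePlant module, the parameter names and values are invented for illustration, and each parameter is assumed to vary independently and normally with the same relative standard deviation.

```python
import random
import statistics


def toy_model(vcmax, km):
    """Hypothetical stand-in for a module: a saturating response at a fixed CO2 level."""
    ci = 280.0
    return vcmax * ci / (ci + km)


def ensemble_interval(model, means, rel_sd, n=2000, level=0.95, seed=0):
    """Sample an ensemble of parameter sets and return (lower, median, upper) of the output."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    outputs = []
    for _ in range(n):
        # Assumption: independent normal variation around each mean value.
        params = {name: rng.gauss(mu, rel_sd * mu) for name, mu in means.items()}
        outputs.append(model(**params))
    outputs.sort()
    lower = outputs[int((1.0 - level) / 2.0 * (n - 1))]
    upper = outputs[int((1.0 + level) / 2.0 * (n - 1))]
    return lower, statistics.median(outputs), upper


nominal_parameters = {"vcmax": 100.0, "km": 700.0}
low, mid, high = ensemble_interval(toy_model, nominal_parameters, rel_sd=0.1)
```

Repeating the calculation while varying one parameter at a time would indicate which inputs contribute most to the width of the output interval, which relates to the question raised above of which parameters are effective targets for modification.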
In this sense, the concept of ePlant will include not only the model itself but also a theoretical framework to enable predictive and quantitative plant science research. Finally, we emphasize that although great amounts of experimental data have been collected by the plant science research community, most of these data cover only a limited number of variables and thus have limited value in identifying new knowledge gaps in current plant science using ePlant. To study the questions discussed above, carefully designed, internally consistent data sets need to be collected systematically, in particular on the parameters related to the expanded model components. Here, internally consistent data sets refer to data collected on the same plants grown under the same conditions and at the same developmental stages. Such data will be crucial to verify each module and the integration of different modules.
With a validated model available, any further difference between model simulations and new experimental observations can help target potential causes, design specialized experiments, and discover unknown factors or mechanisms related to a particular area [84]. Such a process will also spur the development of new methodologies and technologies to measure the key parameters limiting the development of current knowledge and models. Such an iterative process of model development, validation and improvement, akin to a supervised learning process, has the potential to become a new paradigm for future quantitative and predictive plant science research.
ePlant will become a crucial tool to integrate and use the diverse data of the big data era. Big data, including genomes, transcriptomes, proteomes, metabolomes, and different phenomes, can be regarded as either inputs or outputs for ePlant or its component models. Mapping between ePlant or its component models and the natural variations in these data poses a tremendous challenge and offers huge opportunities for the development of new algorithms, tools and frameworks (Figure 3). Only after these tools and frameworks are fully established can the great promise offered by ePlant to help guide future crop engineering, breeding, and agronomy be realized. From this perspective, the creation of the ePlant model itself is only the first step on this New Long March.
FINAL COMMENTS: THE GLOBAL EFFORTS
Model plant species, such as Arabidopsis, rice and maize, for which vast amounts of genetic resources, background knowledge and efficient transformation protocols are available [85–88], are likely to be the first set of plants used to realize ePlant. Here we highlight a number of recent advances in the development of ePlant or its equivalents. Chew et al. [62] developed a multi-scale digital Arabidopsis which can predict organ and whole-organism growth. Zhu et al. [89] proposed the development of a collaborative model development platform, i.e., Plant in silico, which includes not only the basic modules and the data for model parameterization and validation, but also the basic algorithmic tools for model application, visualization, etc. With the development of Plant in silico as a goal, a Crop in silico international consortium was recently proposed [90]. The Department of Energy of the United States started an Integrative Plant Air Soil Systems (iPASS) initiative, aiming at the creation of an integrative plant systems model which, when combined with plant ecosystems phenomics, can be used to study the interactions between plants, the microbiome, the atmosphere and the soil [91]. It is foreseeable that the development of ePlant and the associated algorithms and resources, both for model development and for model application, will become a nucleus for integrating research activities spanning diverse disciplines, including plant biology, computer science, computer vision, high performance computing, agronomy and phenomics, for decades to come. Put simply, ePlant can function as the nexus of future predictive and quantitative plant science research, which has the potential to transform future agriculture by harnessing the power of model-guided crop engineering, breeding and agronomy.
Higher Education Press and Springer-Verlag Berlin Heidelberg