INTRODUCTION
Promoter strength, or activity, plays a key role in regulating the transcription of downstream genes. Living cells have evolved promoters spanning a wide range of strengths to fine-tune the expression of key genes and thereby achieve specific physiological functions. For the construction of artificial biological systems, promoters and other regulatory elements of various strengths are likewise indispensable tools for designing controllable circuits or networks. Although random-mutation-based libraries have practical applications [
1–
3], quantitative modeling strategies are still required to improve efficiency and reduce costs when designing large-scale networks and systems. Therefore, methodologies based on model calculation and prediction for designing element sequences are likely to become a future trend. Several modeling methods have been explored to achieve precise prediction or even
de novo design of element sequences, including a series of rational methods (e.g., biophysical modeling [
4,
5]) and irrational methods (e.g., position weight matrix modeling [
6], partial least squares regression modeling [
7], and machine learning based modeling [
8], etc.). Recent progress on construction of such quantitative models was reviewed and discussed by our previous work [
9]. Specifically, as a machine learning based method, an artificial neural network (ANN) was employed to characterize the highly nonlinear relationship between promoter sequence and strength [
8], and a high regression correlation coefficient of
R2 = 0.96 was achieved for both model training and testing, far outstripping modeling methods based on linear regression (or its derivatives). The success of this example demonstrates the application prospects of machine learning methods in predicting promoter strength. Another important machine learning method, the Support Vector Machine (SVM), was therefore applied to build such models in this work.
SVM was developed by Vapnik [
10,
11] in the 1990s based on statistical learning theory. It applies kernel functions to map the input data into a higher-dimensional feature space, turning a nonlinear problem into a linear one in that space. Compared with ANN, SVM is a newer and more principled machine learning algorithm. Traditional learning methods such as ANN use the Empirical Risk Minimization (ERM) criterion to minimize the error on training samples, which easily leads to overfitting. In contrast, SVM adopts the Structural Risk Minimization (SRM) criterion rather than ERM, which helps it avoid the local minima and overfitting that commonly occur in ANN modeling and thus improves the generalization ability of the model. This advantage is especially prominent for small samples. Other advantages over ANN mainly include [
12]: i) automatic structure selection; ii) better performance in nonlinear, high-dimensional pattern recognition and function regression; and iii) a more rigorous mathematical derivation and proof. Accordingly, SVM is well suited to both classification and regression problems. Owing to these advantages, SVMs have been widely employed in many fields of artificial intelligence, such as handwritten character recognition, face recognition, text classification, and data mining [
12].
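For context, the standard ε-insensitive support vector regression formulation introduced by Vapnik is summarized here for the reader; the notation is generic textbook notation rather than taken from this work. The method seeks a function f(x) = w^{T}\phi(x) + b by solving

\min_{w,\,b,\,\xi,\,\xi^{*}} \ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^{*}\right)
\quad \text{s.t.} \quad
y_i - f(x_i) \le \varepsilon + \xi_i,\ \ f(x_i) - y_i \le \varepsilon + \xi_i^{*},\ \ \xi_i,\,\xi_i^{*} \ge 0,

where \phi is the feature map induced by the kernel, C balances model complexity (the SRM term \lVert w\rVert^{2}) against training error, and \varepsilon is the precision of the ε-insensitive loss; the RBF kernel used below corresponds to K(x, x') = \exp\left(-\lVert x - x'\rVert^{2}/(2\sigma^{2})\right).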
In the field of life sciences, SVM is also a powerful tool for building effective prediction models in bioinformatics and computational systems biology, such as protein structure and stability prediction [
13,
14], RNA secondary structure prediction [
15], bacterial transcription start sites prediction [
16], virtual screening for drug discovering [
17–
20], drug metabolism prediction [
21], disease prognosis and prediction [
22,
23], as well as promoter recognition and structure analysis [
24–
32]. However, SVM has not previously been reported for predicting the strength of promoters or other regulatory elements. Given its advantages over ANN, SVM should be able to yield a precise model for predicting promoter strength even when trained on a small dataset. To this end, we set out to construct a high-performing SVM model for prediction of promoter strength. After multi-parameter optimization, model training and testing, we obtained a best-performing model that accurately outputs a predicted strength value from a promoter sequence (Figure 1).
RESULTS
Model construction and training
The complex relationship between promoter sequence (
x) and its strength (
y) is assumed to be captured by an SVM regression function y = f(x). To this end, SVM models were constructed according to Vapnik
et al. [
10,
11]. The SVM toolbox [
33] running on the Matlab platform was employed to build, train and test the SVM models for promoter strength prediction. The performance of the constructed SVM models was evaluated by the following two indexes:
i) the Mean Squared Error (MSE),

MSE = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2},

and ii) the squared correlation coefficient (R2),

R^{2} = \left[\frac{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)\left(\hat{y}_i-\bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^{2}\,\sum_{i=1}^{n}\left(\hat{y}_i-\bar{\hat{y}}\right)^{2}}}\right]^{2},

where \hat{y}_i and y_i are the predicted and experimentally measured strength values, respectively. Several kernel functions, including the polynomial, sigmoid, and radial basis function (RBF) kernels, were tried one by one in preliminary experiments, and the RBF kernel was found to be most suitable for fitting the data.
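The kernel comparison can be sketched as follows. The paper performed this step with an SVM toolbox in Matlab [33]; the sketch below uses scikit-learn's SVR as a stand-in, with the paper's squared correlation coefficient computed as the squared Pearson correlation. The names `X` (one-hot encoded sequences) and `y` (relative strengths) and the fixed C and epsilon values are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error

def compare_kernels(X, y, kernels=("poly", "sigmoid", "rbf")):
    """Cross-validated MSE and squared correlation for each candidate kernel."""
    results = {}
    for kernel in kernels:
        model = SVR(kernel=kernel, C=1.0, epsilon=0.1)
        y_pred = cross_val_predict(model, X, y, cv=5)   # 5-fold CV predictions
        mse = mean_squared_error(y, y_pred)
        r2 = np.corrcoef(y, y_pred)[0, 1] ** 2          # squared Pearson correlation, as in the text
        results[kernel] = (mse, r2)
    return results
```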
A mutation library containing 100 promoter sequences and their corresponding strength values [
8] was randomly divided into a training set and a test set for training and testing of the SVM models, respectively. Different numbers of training sequences (from 10 to 90) were tried one by one to determine the minimum training-set size required to reach the best prediction performance. Each size was independently and randomly sampled five times, and the maximum
R2 and minimum
MSE values for prediction of the test set were calculated. As a result,
R2 increases and
MSE decreases as the size of the training set increases (Figure 2). The best prediction performance was achieved with 90 sequences, similar to that of the ANN [
8]. Therefore, the model was trained on 90 sequences and tested on the remaining 10 sequences.
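A minimal sketch of this training-set size sweep, again using scikit-learn's SVR in place of the Matlab toolbox; the five random samplings per size follow the description above, and all variable names are placeholders.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def size_sweep(X, y, sizes=range(10, 100, 10), repeats=5, seed=0):
    """Best R2 and lowest MSE on the held-out set for each training-set size."""
    rng = np.random.RandomState(seed)
    summary = {}
    for n_train in sizes:
        r2_list, mse_list = [], []
        for _ in range(repeats):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=n_train, random_state=rng.randint(1 << 30))
            model = SVR(kernel="rbf").fit(X_tr, y_tr)
            y_pred = model.predict(X_te)
            r2_list.append(np.corrcoef(y_te, y_pred)[0, 1] ** 2)
            mse_list.append(mean_squared_error(y_te, y_pred))
        summary[n_train] = (max(r2_list), min(mse_list))  # as reported per size
    return summary
```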
To achieve a smaller MSE or a higher R2, a range of feasible values of each parameter (the balance factor C of the loss function and the width σ of the RBF kernel) was tried in different combinations, under different settings of the precision error ε, to find the best parameters. In general, the MSE increased with large C and σ, for both the training (Figure 3A‒3D) and test (Figure 3E‒3H) sets. Although the minimum mean MSEs (0.561 for training and 0.394 for test) and the minimum maximum MSEs (2.39 for training and 2.32 for test; Figure 3D and 3H) were both achieved under one setting of ε, the C and σ giving the lowest MSE occurred under a different setting. Hence, this best combination of C, σ and ε was chosen to retrain the model, yielding the best model, termed ‘OptModel’. The fit of ‘OptModel’ to the training set shows a high correlation, with R2 > 0.99 (Figure 4A).
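The MSE surfaces shown in Figure 3 can be reproduced, panel by panel, with a sketch like the one below (scikit-learn SVR again standing in for the Matlab toolbox). The conversion gamma = 1/(2σ²) expresses the kernel width σ in scikit-learn's RBF parameterization; the grids themselves are placeholders.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

def mse_surface(X_tr, y_tr, X_te, y_te, C_grid, sigma_grid, epsilon):
    """Test-set MSE over a (C, sigma) grid at a fixed epsilon, i.e. one Figure-3-style panel."""
    surface = np.zeros((len(C_grid), len(sigma_grid)))
    for i, C in enumerate(C_grid):
        for j, sigma in enumerate(sigma_grid):
            # gamma = 1/(2*sigma^2) maps the kernel width onto scikit-learn's RBF parameter
            model = SVR(kernel="rbf", C=C, gamma=1.0 / (2 * sigma ** 2), epsilon=epsilon)
            y_pred = model.fit(X_tr, y_tr).predict(X_te)
            surface[i, j] = mean_squared_error(y_te, y_pred)
    return surface
```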
Model test, prediction and evaluation
The performance of ‘OptModel’ was evaluated by applying it to predict the test set. The fitting results indicate that a strong correlation was achieved (
R2 > 0.98, Figure 4B), and the model accurately predicts the strength of each promoter in the test set (Figure 4C), indicating that overfitting was successfully avoided. The
R2 values for both training and test exceed the corresponding values (
R2 = 0.96 for training and test) obtained by the ANN model presented in our previous work [
8]. For a more intuitive comparison, the fitting results were further plotted against those of the ANN (Figure 5). The SVM predictions are more tightly concentrated along the diagonal, indicating a better fit and more precise prediction than the ANN model in this case.
Next, the effect of single-base mutations on promoter strength was evaluated with OptModel. Each base of the wildtype sequence was mutated to each of the other three bases (e.g., ‘A’ was changed to ‘C’, ‘G’, or ‘T’), and the strength of the resulting sequence was predicted one by one (Figure 6A). The highest strength, 1.42, occurs at mutation 209
A→G, and the lowest strength 0.48 appears at 196
A→G. An average strength of 0.9 was obtained over all 672 single-base mutations, which is lower than that of the wildtype sequence. In addition, the ‘key points’ that strongly influence promoter strength (≥1.2 or ≤0.8) were picked out from Figure 6A. Of these ‘key points’, 82 are negative mutations (≤0.8), far more than the 8 positive ones (≥1.2). Most of the mutations (541/672) agree with the ANN predictions [
8], and only ~19% show a significant difference (absolute difference ≥ 0.2) between the SVM and ANN predictions (Figure 6B), indicating comparable prediction performance of the two methods.
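The single-base mutation scan can be sketched as follows, assuming a trained model `opt_model` with a scikit-learn-style predict interface and an `encode` function implementing the one-hot rule described in METHODS; these names are placeholders rather than the paper's own code.

```python
import numpy as np

BASES = ("A", "G", "C", "T")

def single_base_scan(opt_model, wildtype, encode):
    """Predict the strength of every single-base mutant of the wildtype promoter."""
    results = {}  # (position, new_base) -> predicted relative strength
    for pos, wt_base in enumerate(wildtype):
        for base in BASES:
            if base == wt_base:
                continue
            mutant = wildtype[:pos] + base + wildtype[pos + 1:]
            x = encode(mutant).reshape(1, -1)                    # single-sample input
            results[(pos + 1, base)] = float(np.abs(opt_model.predict(x))[0])
    return results  # 3 * len(wildtype) predictions, e.g. 672 for a 224-bp promoter
```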
DISCUSSION
Numerous studies have demonstrated a direct correlation between the sequence of a regulatory element and its strength/activity, and many quantitative prediction models have been constructed to bridge the gap between sequence and strength [
9]. Among these models, machine learning based methods such as ANN have been introduced in recent years, and their better prediction performance indicates the potential of this methodology for synthetic regulatory element design [
8]. In this study, another important machine learning method, SVM, was used to build similar models, and its prediction performance proved to be no worse than that of ANN, demonstrating a promising application prospect for SVM in predicting prokaryotic promoter strength as well. Moreover, the methodology generalizes well across problems and is expected to be applicable to eukaryotic promoters, and even to other regulatory elements such as terminators.
Machine learning technologies have been widely used in artificial intelligence (AI) and have made tremendous progress; in particular, the rapid development of intelligent robots is heralding the era of ‘Industry 4.0’. Early this year, a powerful ANN-based machine named AlphaGo, designed by Google DeepMind to play the game of Go, surprisingly beat the world champion Lee Sedol by a clear margin. In the life sciences, the introduction of machine learning based methods has greatly promoted the development of the discipline, especially modeling in bioinformatics and systems biology. Beyond the applications mentioned above, this work exemplifies the use of SVM modeling in a new field, the prediction of promoter strength. Considering the small sample sizes typically produced by biological experiments, as well as the advantage of SVM in learning from small samples, this methodology should also be suitable for building prediction models on small datasets such as the one used here.
Although the introduction of machine learning methods can raise prediction accuracy to a very high level compared with traditional methods, some limitations of these AI algorithms remain difficult to overcome. It is well known that the prediction performance of machine learning algorithms depends directly on the ‘knowledge’ they have learnt. In this study, for example, the mutated sequences were generated by error-prone PCR based on the wildtype promoter, which introduced a mutation rate of only <30% relative to the initial sequence; the model therefore cannot learn adequately from this “pseudo-random” mutation data, and its generalization ability is correspondingly weakened. The best model trained on this library may not precisely predict the strength of sequences with a ≥30% mutation rate. Furthermore, 90 training sequences are negligible compared with the full sequence space (4^224 possible sequences for a 224-bp promoter), so the information the model can learn is very limited. Although SVM is supposed to construct high-performance models from small samples, the prediction performance of such models degraded significantly when a smaller training set (fewer than 90 sequences) was used (see Figure 2). To be sure, the more information a machine learning algorithm learns, the more powerful the resulting model can be, just as it is now very hard for a human player to beat the well-trained AlphaGo. With the rapid growth of experimental data, more powerful and intelligent AI models may be constructed that learn extensively from multiple datasets, including the strengths of different regulatory elements from various species, thereby greatly improving precision and generalization ability and freeing us from repetitive and laborious experiments.
METHODS
Computational platform and tools
Matlab 2013a (Mathworks Inc., http://www.mathworks.com/) was run on a personal computer with the Microsoft Windows 10 operating system (Microsoft Inc., http://www.microsoft.com/). The SVM Toolbox [
33] was integrated into Matlab and served as the computational tool for SVM model construction, training and prediction. All calculations and simulations were programmed and run in the Matlab environment with the SVM Toolbox.
Data sources and dataset preparation
A mutation library of
Escherichia coli Trc promoter, containing 100 promoter sequences and their corresponding strength values, was constructed in our previous work [
8]. Briefly, error-prone PCR was performed on the Trc promoter region of plasmid pTrcHis2B (224 bp, including the -35 box, -10 box, RBS, and other regions) to introduce random mutations; then a
gfp gene was inserted into the plasmid as a reporter; finally, the promoter strength was assayed by detecting GFP expression in
E. coli using flow cytometry. The strength of each sequence is a relative value compared with that of the wildtype sequence. In this work, the library was used to train and test the SVM models. As in the previous work, the library was randomly divided into two datasets, a training set and a test set (Supplementary Dataset S1 shows an example sampling used to train the best-performing model ‘OptModel’).
The original sequence data, coded by ‘A’, ‘G’, ‘C’ and ‘T’, were translated into a digital matrix for SVM model input according to the following orthogonal encoding: ‘A’ = [1, 0, 0, 0], ‘G’ = [0, 1, 0, 0], ‘C’ = [0, 0, 1, 0], and ‘T’ = [0, 0, 0, 1]. For instance, the sequence ‘AGTGCC’ is translated into the ‘0‒1’ digital series [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0] under this conversion rule.
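A minimal sketch of this encoding rule follows; the function name `encode` and the use of NumPy are my own choices for illustration, not from the paper.

```python
import numpy as np

# Orthogonal (one-hot) codes as defined in the text
CODE = {"A": [1, 0, 0, 0], "G": [0, 1, 0, 0], "C": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def encode(sequence):
    """Translate a DNA string into a flat 0/1 vector (4 positions per base)."""
    return np.array([bit for base in sequence.upper() for bit in CODE[base]])

# Example from the text: 'AGTGCC' -> 24-element 0/1 series
assert encode("AGTGCC").tolist() == [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
                                     0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
```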
Model training and test
A range of values was set for each parameter when training the SVM model, including the balance factor C, the width σ of the kernel function, and the precision error ε. Other settings included the ‘eInsensitive’ loss function and the ‘rbf’ kernel function. A two-layer nested loop was employed to search for the best parameters during model training and test. The mean squared error (MSE) and squared correlation coefficient (R2) were calculated as indexes to evaluate the performance of model training and test under each parameter setting. An absolute-value operation was applied to the predictions for the training and test data, since the strength values are non-negative. Finally, the best combination of parameters (C, σ and ε) was used to retrain and generate the best model.
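A sketch of this parameter search is given below, with scikit-learn's SVR and GridSearchCV standing in for the Matlab SVM Toolbox; note that GridSearchCV scores candidates by cross-validation within the training set rather than by the paper's fixed test-set evaluation, and the grids passed in are illustrative placeholders. The conversion gamma = 1/(2σ²) again maps the kernel width onto scikit-learn's RBF parameter.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

def search_and_retrain(X_train, y_train, C_grid, sigma_grid, eps_grid):
    """Grid-search C, sigma and epsilon, then return the best model retrained on all training data."""
    param_grid = {
        "C": list(C_grid),
        "gamma": [1.0 / (2 * s ** 2) for s in sigma_grid],
        "epsilon": list(eps_grid),
    }
    search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                          scoring="neg_mean_squared_error", cv=5)
    search.fit(X_train, y_train)
    opt_model = search.best_estimator_          # refit on the full training set by default
    y_fit = np.abs(opt_model.predict(X_train))  # strengths are non-negative, as in the text
    return opt_model, y_fit, search.best_params_
```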