1. Department of Electrical Engineering and Computer Science, National Institute of Technology, Matsue College, Shimane 690-8518, Japan
2. Graduate School of Medicine, Kyoto University, Kyoto 606-8507, Japan
3. Riken Quantitative Biology Center, Hyogo 650-0047, Japan
morihiro@matsue-ct.jp
History
Received: 2017-09-01
Accepted: 2017-12-13
Published: 2018-06-11
Issue Date: 2018-05-09
Revised Date: 2017-12-04
Abstract
Background: To understand biological cellular systems, it is important to analyze interactions between protein residues and RNA bases. A method based on conditional random fields (CRFs) was developed for predicting contacts between residues and bases. It takes multiple sequence alignments of the given protein and RNA sequences as input, and learns a model with many parameters describing relationships between neighboring residue-base pairs by maximizing the pseudo-likelihood function.
Methods: In this paper, we propose a novel CRF-based model with more complicated dependency relationships between random variables than the previous model, but with fewer parameters in order to avoid overfitting to training data.
Results: We performed cross-validation experiments to evaluate the proposed model, and took the average of the AUC (area under the receiver operating characteristic curve) scores. The results suggest that the proposed CRF-based model without L1-norm regularization (lasso) outperforms the existing model with and without the lasso under several input observations to CRFs.
Conclusions: We proposed a novel stochastic model for predicting protein-RNA residue-base contacts, and improved the prediction accuracy in terms of the AUC score. This implies that more dependency relationships in a CRF can be controlled by fewer parameters.
Morihiro Hayashida, Noriyuki Okada, Mayumi Kamada, Hitoshi Koyano.
Improving conditional random field model for prediction of protein-RNA residue-base contacts.
Quant. Biol., 2018, 6(2): 155-162 DOI:10.1007/s40484-018-0136-7
It is important to understand biological cellular systems from a molecular point of view. Interactions between proteins and RNAs play essential roles in the regulation of gene expression, the stabilization of protein complexes, and processes from the maturation of mRNA to its trafficking [1]. Therefore, disruption of RNA-binding proteins can lead to various diseases. In many interactions between proteins and RNAs, the protein and the RNA recognize specific sites on each other. It was reported that DNA-protein interactions are different from RNA-protein interactions, and that RNA bases make more direct contacts with proteins than DNA bases do [2]. Known RNA-binding regions of proteins include the K-homology (KH) domains [3], double-stranded RNA-binding domains (dsRBD) [4,5], DEAD-box domains [6], the Pumilio repeat domain [7], and zinc fingers [8]. In contrast, binding regions of RNAs have not been well investigated. Gupta and Gribskov reported that different bases are preferred in base-specific and base-nonspecific interactions, and that RNA structures in protein-binding regions can be sufficiently distinguished from those in non-binding regions [9].
Several computational methods for detecting RNA-binding sites and protein-RNA interactions have been developed. Peled et al. proposed a de novo function prediction approach based on identifying biophysical features [10]. In their method, random forest (RF) [11] was employed because it yielded better results than neural networks and support vector machines (SVMs). Kumar et al. made use of evolutionary information and position-specific scoring matrix (PSSM) profiles, and employed SVMs [12,13]. Perez-Cano and Fernandez-Recio developed an ad hoc algorithm using protein-RNA interface propensities calculated from nonredundant X-ray structures of protein-RNA complexes [14]. Liu et al. combined a new interaction propensity with features based on sequences and structures, and achieved an accuracy of 84.5% [15].
Zhang et al. proposed a hidden Markov model (HMM)-based algorithm to predict clustered functional RNA-binding sites of proteins by integrating the number and spacing of individual motif sites, the accessibility in RNA secondary structures, and cross-species conservation [16]. Zhao et al. developed a method based on structural alignment to known protein-RNA complex structures [17]. Ren and Shen proposed new structural features based on accumulated distances from template patches extracted from RNA-binding interfaces [18]. Wang et al. proposed an extended naive Bayes classifier for de novo prediction of protein-RNA interactions [19]. Sun et al. proposed structural features of residue electrostatic surface potential and triplet interface propensity according to the statistical and structural analysis of protein-RNA complexes [20]. These methods predict RNA-binding sites of proteins and interactions of proteins and RNAs. In this paper, we focus on interactions between both sites of amino acid residues and bases in protein-RNA interactions.
Lafferty et al. developed conditional random fields (CRFs) to segment and label sequence data [21]. CRFs have been applied to many problems in the fields of image recognition, natural language processing, and bioinformatics [22–24]. Statistical models based on CRFs have been developed for predicting protein-protein interactions [25], protein residue-residue contacts [26], and protein-RNA residue-base contacts [27,28]. CRFs require observations as evidence, and mutual information (MI) between residues and bases, which is calculated from multiple sequence alignments, was introduced for this purpose. In general, it is considered that an amino acid residue at an interacting site has coevolved with its partner RNA base to keep the interaction. MIp was developed to improve residue-residue contact prediction, and is calculated by subtracting a bias value from MI [29]. A prediction method for residue-base contacts in protein-RNA complexes was developed using a CRF-based model [27]. In the model, relationships between neighboring residue-base pairs were considered. Since the model has many parameters, L1-norm regularization (lasso) [30] was applied to improve the prediction accuracy [28]. In this study, we propose a novel CRF-based model with more complicated dependency relationships and fewer parameters than the existing one. In addition to MIp, we examine the pseudolikelihood maximization direct-coupling analysis (plmDCA) [31], which was developed to infer a protein tertiary structure from its sequence and tries to separate direct interactions between residues from indirect ones. To evaluate the proposed CRF-based model, we perform cross-validation computational experiments, and show that the proposed model without the lasso regularization outperforms the existing model with and without the lasso under both input observations, MIp and plmDCA, to CRFs.
RESULTS
To evaluate the proposed CRF-based model, we used the same dataset as that in the previous paper, which was extracted from tertiary structures of protein-RNA complexes in PDB [32], and consists of the residue-base pairs included in thirteen protein-RNA pairs as shown in Table 1.
Here, the sequences stored in PDB for these proteins and RNAs were the same as those included in multiple sequence alignments of the corresponding Pfam [33] and Rfam [34] entries, respectively, and the sequence in a PDB entry was the same as that in UniProt [35]. Table 1 lists the following: the UniProt identifier of a protein sequence, its length, the GenBank [36] identifier of an RNA sequence, its length, the Pfam and Rfam identifiers of the alignments, the PDB identifier, and the number of contacts. It was assumed that a residue and a base interact with each other if the Euclidean distance between an atom of the residue and an atom of the base is less than or equal to 3 Å, because the distances of hydrogen bonds between oxygen and nitrogen atoms (OH-O, OH-N, NH-O, and NH-N) are about 2.7 to 2.9 Å.
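The contact criterion above can be sketched in a few lines. The following is a minimal illustration, assuming that atom coordinates are already available as arrays of 3-D positions in angstroms; the function name `in_contact` and the sample coordinates are hypothetical and are not taken from the dataset.

```python
import numpy as np

def in_contact(residue_atoms, base_atoms, cutoff=3.0):
    """Return True if any residue atom is within `cutoff` angstroms of any base atom."""
    r = np.asarray(residue_atoms, dtype=float)  # shape (n_r, 3)
    b = np.asarray(base_atoms, dtype=float)     # shape (n_b, 3)
    # Pairwise Euclidean distances between every residue atom and every base atom.
    d = np.linalg.norm(r[:, None, :] - b[None, :, :], axis=-1)
    return bool(d.min() <= cutoff)

# Hypothetical coordinates (angstroms), for illustration only.
residue = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
base_near = [[4.3, 0.0, 0.0], [9.0, 0.0, 0.0]]  # closest atom pair is about 2.8 apart
base_far = [[6.0, 0.0, 0.0]]                    # closest atom pair is 4.5 apart
```

Taking the minimum over all atom pairs matches the wording "an atom of the residue and an atom of the base": a single close pair is enough to label the residue-base pair as a contact.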
To calculate MIp and plmDCA, we used the file “Pfam-A.full” of Pfam database (release 26.0) and “Rfam.full” of Rfam database (release 10.1) for getting multiple sequence alignment data of proteins and RNAs, respectively. We used an implementation of plmDCA available from https://github.com/pagnani/PlmDCA. In counting the frequencies of amino acids and bases, we also examined several classifications of amino acids with 8, 10, and 15 groups proposed by Murphy et al. [37] as shown in Table 2.
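As an illustration of counting over a reduced alphabet, the sketch below maps each amino acid to a representative letter of its group before frequencies are counted. The 10-group partition shown is the one commonly quoted from Murphy et al. [37] and is assumed here to agree with Table 2; the names `MURPHY_10`, `build_mapping`, and `reduce_sequence` are hypothetical.

```python
# Assumed 10-group partition (commonly quoted from Murphy et al.); check against Table 2.
MURPHY_10 = ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"]

def build_mapping(groups):
    """Map every amino acid letter to the first letter of its group."""
    return {aa: group[0] for group in groups for aa in group}

def reduce_sequence(seq, mapping):
    """Rewrite an aligned sequence in the reduced alphabet.

    Symbols without a mapping (gaps, ambiguity codes) are kept unchanged.
    """
    return "".join(mapping.get(c, c) for c in seq)

mapping = build_mapping(MURPHY_10)
reduced = reduce_sequence("MKV-ST", mapping)
```

With fewer symbols per column, the joint frequency table of an amino acid group and a base has fewer cells, so each cell is estimated from more sequences.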
To estimate the parameters of the CRF-based models, we employed the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method [38,39], a quasi-Newton method that approximates the Hessian matrix, to maximize the likelihood function. We used the implementation in libLBFGS (version 1.10), available from http://www.chokkan.org/software/liblbfgs/, with default options. For the contact inference, we modified an implementation of the sequential tree-reweighted message passing (TRW-S) algorithm [40] from the MRF energy minimization software (version 2.1), available from http://vision.middlebury.edu/MRF/code/. TRW-S iteratively updates messages from one node to another in the graph, and reweights edges to minimize an upper bound of the objective function for a maximization problem.
We performed cross-validation and took the average of the AUC (area under the ROC curve) scores as in the previous work, where each round used all residue-base pairs contained in one protein-RNA pair of the dataset for testing, and those in the other protein-RNA pairs for training. For the previous model, the lasso was also applied to the parameter estimation in this study, with the coefficient of the regularization term set to the value that gave the best result in the previous study.
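The evaluation protocol can be sketched as follows. This is a minimal illustration, assuming that each protein-RNA pair is stored as a list of scores or features with 0/1 contact labels; the `train` and `predict` callables and the toy data are hypothetical stand-ins for CRF training and inference.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def leave_one_pair_out(dataset, train, predict):
    """dataset maps a pair name to (features, labels); returns the mean held-out AUC."""
    aucs = []
    for held_out in dataset:
        training = {k: v for k, v in dataset.items() if k != held_out}
        model = train(training)               # fit on all other protein-RNA pairs
        feats, labels = dataset[held_out]
        aucs.append(auc(predict(model, feats), labels))
    return sum(aucs) / len(aucs)

# Toy stand-ins: no real training, and the feature itself is used as the score.
toy = {
    "pairA": ([0.9, 0.1, 0.4], [1, 0, 0]),
    "pairB": ([0.2, 0.8, 0.7, 0.6], [0, 1, 0, 1]),
}
mean_auc = leave_one_pair_out(toy, train=lambda d: None, predict=lambda m, f: f)
```

Holding out a whole protein-RNA pair, rather than random residue-base pairs, keeps neighboring pairs from the same complex out of the training data for that round.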
Table 3 shows the average AUC scores on test data for the proposed and previous CRF-based models using MIp and plmDCA as input observations with 8, 10, 15, and 20 groups of amino acids. For both input observations, MIp and plmDCA, the average AUC score of the proposed model was larger than that of the previous model. The average AUC score of the proposed model with plmDCA and 10 or 15 groups of amino acids was larger than the others.
Figure 1 shows the average ROC (receiver operating characteristic) curves of the proposed and previous CRF-based models with 15 groups of amino acids. The curve of the proposed model with plmDCA was above the other curves, and the prediction accuracy, which is the ratio of the number of correctly predicted residue-base pairs to the total number of residue-base pairs, was 0.997. These results suggest that the proposed CRF-based model outperforms the existing model even if the lasso regularization is not applied to the proposed model.
CONCLUSION
We improved the existing model for predicting residue-base contacts between proteins and RNAs, and developed a novel model based on conditional random fields with more complicated dependency relationships and fewer parameters. To evaluate the proposed model, we performed cross-validation computational experiments, and took the average of the AUC scores. The results suggest that the proposed CRF-based model without L1-norm regularization (lasso) outperforms the existing model with and without the lasso under both input observations, MIp and plmDCA, to CRFs. The number of parameters of the proposed model is 86 without using any classification of amino acids, whereas that of the existing model is 960. It can be considered that the lasso regularization increased the average AUC score of the existing model by automatically selecting effective parameters. In contrast, the proposed model did not need the lasso and obtained a better result because it has a sufficiently small number of parameters and rich dependency relationships between a target residue-base pair and its neighboring pairs. As future work, we would like to further improve the prediction accuracy for understanding detailed mechanisms of protein-RNA interactions. For instance, in addition to evolutionary relationships calculated from multiple sequence alignments, our model could take other features such as structural and biophysical features.
METHODS
In this section, we briefly review the existing CRF-based model and the coevolution measures MIp and plmDCA, which are input observations to CRFs calculated from multiple sequence alignments of given protein amino acid and RNA base sequences. In addition, we describe the proposed CRF-based model with more complicated dependency relationships and fewer parameters.
Conditional random field (CRF)-based models
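Conditional random fields were developed by extending Markov random fields (MRFs) [21]. Suppose that $G = (V, E)$ is a graph with a set $V$ of nodes and a set $E$ of edges, and that $Y_v$ and $X_v$ are random variables corresponding to a node $v \in V$, with $Y = (Y_v)_{v \in V}$ and $X = (X_v)_{v \in V}$. Let $N_v$ be the set of nodes neighboring $v$, that is, $N_v = \{ u \in V \mid (u, v) \in E \}$. Then, $(X, Y)$ is a conditional random field if all $Y_v$ follow the Markov property under the observations $X$ according to the graph $G$. This means that the probability of $Y_v$ given $X$ and $Y_u$ for all $u \neq v$ is equal to the probability of $Y_v$ given $X$ and $Y_u$ for only the neighboring nodes $u \in N_v$, that is, $P(Y_v \mid X, Y_u, u \neq v) = P(Y_v \mid X, Y_u, u \in N_v)$. A conditional random field with a strictly positive density can be written as

$$P(y \mid x) = \frac{1}{Z(x)} \exp\Bigl( \sum_{v \in V} \Phi_v(y, x) \Bigr), \qquad (1)$$

where $Z(x)$ denotes the normalization constant $Z(x) = \sum_{y} \exp\bigl( \sum_{v \in V} \Phi_v(y, x) \bigr)$, and $\Phi_v$ denotes a potential function concerning the node $v$.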
For our purpose, given a protein sequence and an RNA sequence , a node in is corresponding to a residue-base position pair . Figure 2 illustrates residue-base pairs around . A set of neighboring nodes of is defined as . is a random variable, and if residue and base at positions and interact with each other, otherwise. Suppose that , , and is a 0‒1 constant vector with size that the element of the amino acid-base pair corresponding to is 1 and the others are 0.
Then, the conditional probability of given and , and sequences in the previous work [28] was defined using parameter vectors and by
where denotes the normalization constant, denotes the transpose of , , , and denotes the Kronecker product, for example, . The number of parameters is equal to the sum of dimensions of and , that is, . Mutual information (MI) and the improved MI calculated from multiple sequence alignments were used as input observations .
Figure 3 illustrates the dependency relationship between random variables by an element of and . In this model, the number of parameters , to be estimated is large, and the L1-norm regularization (lasso) was utilized to improve the prediction accuracy. In addition, depends on only , , and in . Hence, we propose the potential function with more complicated dependency relationships and fewer parameters having the following local features , by adding other association with .
where if residue and base at positions and interact with each other, otherwise, the conditional probability is written by Equation (2), and denotes the direct sum, for example, . The number of parameters and to be estimated in the training phase is .
Figure 4 illustrates the dependency relationship between random variables by an element of and . depends on input observations of all the neighboring nodes according to the maximum and minimum of for all in .
For both CRF-based models, parameters can be estimated from training data of N protein-RNA sequence pairs , and contacts by maximizing the following pseudo-likelihood function.
For the sake of reducing redundant parameters of and , in the previous model, we used the lasso, and maximized , where is a positive constant, and denotes the norm of .
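To make the shape of this objective concrete, the sketch below evaluates an L1-penalized log pseudo-likelihood for a toy model. It is an assumption-laden illustration rather than either CRF-based model: the labels are 0/1 on a residue-by-base grid, each node sees only its four axis-aligned neighbors, and the three parameters `a0`, `a1`, and `b` (bias, observation weight, neighbor weight) are hypothetical.

```python
import numpy as np

def log_pseudo_likelihood(theta, x, y):
    """Sum over grid nodes of log P(y_ij | neighboring labels, x_ij) in a toy autologistic model."""
    a0, a1, b = theta
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    padded = np.pad(y, 1)  # zero padding: positions outside the grid count as non-contacts
    # Sum of the labels of the four axis-aligned neighbors of each node.
    nb = padded[:-2, 1:-1] + padded[2:, 1:-1] + padded[1:-1, :-2] + padded[1:-1, 2:]
    z = a0 + a1 * x + b * nb  # log-odds that y_ij = 1 given its neighbors
    # log P(y | z) = y * z - log(1 + exp(z)), computed stably with logaddexp.
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

def penalized_objective(theta, x, y, lam):
    """Quantity to maximize: log pseudo-likelihood minus lam times the L1 norm of theta."""
    theta = np.asarray(theta, dtype=float)
    return log_pseudo_likelihood(theta, x, y) - lam * float(np.sum(np.abs(theta)))

# Hypothetical 2 x 3 grid of observations and contact labels.
x_toy = np.array([[0.2, 1.5, 0.1], [0.0, 1.1, 0.3]])
y_toy = np.array([[0, 1, 0], [0, 1, 0]])
```

Each node contributes one conditional probability given its neighbors' observed labels, so the normalization over all joint labelings is never needed; this is what makes the pseudo-likelihood tractable for training. The L1 term pulls individual parameters to exactly zero, which is how the lasso removes redundant ones.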
In the prediction phase, is determined for test data using the estimated parameters and input observations. Then, the problem of finding maximizing for all under trained parameters and is NP-hard as generally discussed in [40].
Coevolution measure
We examine the improved mutual information MIp [29] and the pseudolikelihood maximization direct-coupling analysis (plmDCA) [31] as input observations to CRFs. We have two multiple sequence alignments for protein and RNA sequences (see Figure 5).
Let $f_i(a)$ and $f_j(b)$ be the observed frequencies of amino acid $a$ at position $i$ and of base $b$ at position $j$, respectively. Let $f_{ij}(a, b)$ be the joint frequency of amino acid $a$ and base $b$ at positions $i$ and $j$, where the sequence in which $a$ appears must belong to the same species as the sequence in which $b$ appears. These frequencies are divided by the total number of sequences in a multiple alignment. Then, the mutual information between positions $i$ and $j$ is defined by $MI_{ij} = \sum_{a, b} f_{ij}(a, b) \log \frac{f_{ij}(a, b)}{f_i(a) f_j(b)}$.
For removing background noise of MI, MIp was proposed to be for protein residue-residue contacts. For our purpose of predicting residue-base contacts, MIp is modified to
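The computation can be sketched as follows. This is a minimal illustration under stated assumptions: the two alignments are already paired row by row by species, the logarithm is natural, and the correction is taken to be the average-product form of Dunn et al. [29] with the row mean over base positions, the column mean over residue positions, and the overall mean of the residue-by-base MI matrix. The function names and the toy alignments are hypothetical.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """MI between two alignment columns whose k-th entries come from the same species."""
    n = len(col_a)
    fa, fb, fab = Counter(col_a), Counter(col_b), Counter(zip(col_a, col_b))
    return sum((c / n) * math.log((c / n) / ((fa[a] / n) * (fb[b] / n)))
               for (a, b), c in fab.items())

def mi_matrix(protein_rows, rna_rows):
    """MI for every (residue position, base position) pair."""
    prot_cols = list(zip(*protein_rows))
    rna_cols = list(zip(*rna_rows))
    return [[mutual_information(p, r) for r in rna_cols] for p in prot_cols]

def apc_corrected(mi):
    """Subtract (row mean * column mean / overall mean) from each entry (assumed form)."""
    n, m = len(mi), len(mi[0])
    row = [sum(r) / m for r in mi]
    col = [sum(mi[i][j] for i in range(n)) / n for j in range(m)]
    mean = sum(row) / n
    if mean == 0:
        return [r[:] for r in mi]  # nothing to correct when all MI values are zero
    return [[mi[i][j] - row[i] * col[j] / mean for j in range(m)] for i in range(n)]

# Toy paired alignments: row k of each alignment is from the same species.
protein_rows = ["KA", "KA", "RA", "RA"]
rna_rows = ["GU", "GC", "CU", "CC"]
mi = mi_matrix(protein_rows, rna_rows)
mip = apc_corrected(mi)
```

In the toy data, the first residue column and the first base column vary together across species and so carry positive MI, while a fully conserved column carries none. The subtracted term estimates the background shared by a row and a column, so that only the excess over that background remains.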
Ekeberg et al. developed plmDCA for predicting the tertiary structure of a protein by solving the inverse Potts problem. A generalized Potts model can reproduce the empirically observed amino acid frequencies and , and is defined as
where and are parameters to be determined by the constraints, and . From a multiple sequence alignment of a given protein sequence, is determined. Then, the score of plmDCA between amino acid residues is defined bywhere denotes the Frobenius norm of , which is the zero-sum gauge of . For our purpose, we concatenate two multiple sequence alignments of protein and RNA sequences into one alignment such that the species of a protein sequence is the same as that of an RNA sequence.
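The norm described above can be sketched for a single coupling matrix. This is a minimal illustration, assuming that the coupling parameters for one pair of alignment columns are given as a matrix with one row per symbol of the first column and one column per symbol of the second; the function names are hypothetical, and any further correction applied to the score is not shown.

```python
import numpy as np

def zero_sum_gauge(J):
    """Shift a coupling matrix so that every row and every column sums to zero."""
    J = np.asarray(J, dtype=float)
    return J - J.mean(axis=0, keepdims=True) - J.mean(axis=1, keepdims=True) + J.mean()

def frobenius_score(J):
    """Frobenius norm of the coupling matrix after moving to the zero-sum gauge."""
    return float(np.linalg.norm(zero_sum_gauge(J)))  # default norm of a 2-D array is Frobenius

# Hypothetical 2 x 2 coupling matrix for illustration.
J_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
score = frobenius_score(J_toy)
```

The gauge shift matters because adding a constant to a whole row or column of a coupling matrix can be absorbed into the single-site parameters without changing the model; fixing the gauge makes the norm insensitive to such shifts. The same code applies to a rectangular matrix, as arises when one column is a residue position and the other is a base position in the concatenated alignment.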
REFERENCES
[1]
Re, A., Joshi, T., Kulberkyte, E., Morris, Q. and Workman, C. T. (2014) RNA-protein interactions: an overview. Methods Mol. Biol., 1097, 491–521
[2]
Lejeune, D., Delsaux, N., Charloteaux, B., Thomas, A. and Brasseur, R. (2005) Protein-nucleic acid recognition: statistical analysis of atomic interactions and influence of DNA structure. Proteins, 61, 258–271
[3]
Siomi, H., Matunis, M. J., Michael, W. M. and Dreyfuss, G. (1993) The pre-mRNA binding K protein contains a novel evolutionarily conserved motif. Nucleic Acids Res., 21, 1193–1198
[4]
Feng, G. S., Chong, K., Kumar, A. and Williams, B. R. (1992) Identification of double-stranded RNA-binding domains in the interferon-induced double-stranded RNA-activated p68 kinase. Proc. Natl. Acad. Sci. USA, 89, 5447–5451
[5]
St Johnston, D., Brown, N. H., Gall, J. G. and Jantsch, M. (1992) A conserved double-stranded RNA-binding domain. Proc. Natl. Acad. Sci. USA, 89, 10979–10983
[6]
Gorbalenya, A. E., Koonin, E. V., Donchenko, A. P. and Blinov, V. M. (1989) Two related superfamilies of putative helicases involved in replication, recombination, repair and expression of DNA and RNA genomes. Nucleic Acids Res., 17, 4713–4730
[7]
Parisi, M. and Lin, H. (2000) Translational repression: a duet of Nanos and Pumilio. Curr. Biol., 10, R81–R83
[8]
Hall, T. M. (2005) Multiple modes of RNA recognition by zinc finger proteins. Curr. Opin. Struct. Biol., 15, 367–373
[9]
Gupta, A. and Gribskov, M. (2011) The role of RNA sequence and structure in RNA–protein interactions. J. Mol. Biol., 409, 574–587
[10]
Peled, S., Leiderman, O., Charar, R., Efroni, G., Shav-Tal, Y. and Ofran, Y. (2016) De-novo protein function prediction using DNA binding and RNA binding proteins as a test case. Nat. Commun., 7, 13424
[11]
Ho, T. (1995) Random decision forests. Proc. Third Int. Conf. on Document Analysis and Recognition, 1, 278–282
[12]
Kumar, M., Gromiha, M. M. and Raghava, G. P. (2008) Prediction of RNA binding sites in a protein using SVM and PSSM profile. Proteins, 71, 189–194
[13]
Kumar, M., Gromiha, M. M. and Raghava, G. P. (2011) SVM based prediction of RNA-binding proteins using binding residues and evolutionary information. J. Mol. Recognit., 24, 303–313
[14]
Pérez-Cano, L. and Fernández-Recio, J. (2010) Optimal protein-RNA area, OPRA: a propensity-based method to identify RNA-binding sites on proteins. Proteins, 78, 25–35
[15]
Liu, Z. P., Wu, L. Y., Wang, Y., Zhang, X. S. and Chen, L. (2010) Prediction of protein-RNA binding sites by a random forest method with combined features. Bioinformatics, 26, 1616–1622
[16]
Zhang, C., Lee, K. Y., Swanson, M. S. and Darnell, R. B. (2013) Prediction of clustered RNA-binding protein motif sites in the mammalian genome. Nucleic Acids Res., 41, 6793–6807
[17]
Zhao, H., Yang, Y. and Zhou, Y. (2011) Structure-based prediction of RNA-binding domains and RNA-binding sites and application to structural genomics targets. Nucleic Acids Res., 39, 3017–3025
[18]
Ren, H. and Shen, Y. (2015) RNA-binding residues prediction using structural features. BMC Bioinformatics, 16, 249
[19]
Wang, Y., Chen, X., Liu, Z. P., Huang, Q., Wang, Y., Xu, D., Zhang, X. S., Chen, R. and Chen, L. (2013) De novo prediction of RNA-protein interactions from sequence information. Mol. Biosyst., 9, 133–142
[20]
Sun, M., Wang, X., Zou, C., He, Z., Liu, W. and Li, H. (2016) Accurate prediction of RNA-binding protein residues with two discriminative structural descriptors. BMC Bioinformatics, 17, 231
[21]
Lafferty, J., McCallum, A. and Pereira, F. (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proc. Int. Conf. on Machine Learning 2001, pp. 282–289
[22]
Sha, F. and Pereira, F. (2003) Shallow parsing with conditional random fields. Proc. HLT-NAACL 2003, pp. 134–141
[23]
Yao, K., Peng, B., Zweig, G., Yu, D., Li, X. and Gao, F. (2014) Recurrent conditional random field for language understanding. 2014 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pp. 4077–4081
[24]
Vemulapalli, R., Tuzel, O., Liu, M. Y. and Chella, R. (2016) Gaussian conditional random field network for semantic segmentation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3224–3233
[25]
Hayashida, M., Kamada, M., Song, J. and Akutsu, T. (2011) Conditional random field approach to prediction of protein-protein interactions using domain information. BMC Syst. Biol., 5, S8
[26]
Kamada, M., Hayashida, M., Song, J. and Akutsu, T. (2011) Discriminative random field approach to prediction of protein residue contacts. In IEEE International Conference on Systems Biology, pp. 285–291
[27]
Hayashida, M., Kamada, M., Song, J. and Akutsu, T. (2012) Predicting protein-RNA residue-base contacts using two-dimensional conditional random field. In 2012 IEEE International Conference on Systems Biology
[28]
Hayashida, M., Kamada, M., Song, J. and Akutsu, T. (2013) Prediction of protein-RNA residue-base contacts using two-dimensional conditional random field with the lasso. BMC Syst. Biol., 7, S15
[29]
Dunn, S. D., Wahl, L. M. and Gloor, G. B. (2008) Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics, 24, 333–340
[30]
Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, 58, 267–288
[31]
Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M. and Aurell, E. (2013) Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Phys. Rev. E Stat. Nonlin. Soft Matter Phys., 87, 012707
[32]
Rose, P. W., Beran, B., Bi, C., Bluhm, W. F., Dimitropoulos, D., Goodsell, D. S., Prlic, A., Quesada, M., Quinn, G. B., Westbrook, J. D., et al. (2011) The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Res., 39, D392–D401
[33]
Punta, M., Coggill, P. C., Eberhardt, R. Y., Mistry, J., Tate, J., Boursnell, C., Pang, N., Forslund, K., Ceric, G., Clements, J., et al. (2012) The Pfam protein families database. Nucleic Acids Res., 40, D290–D301
[34]
Gardner, P. P., Daub, J., Tate, J., Moore, B. L., Osuch, I. H., Griffiths-Jones, S., Finn, R. D., Nawrocki, E. P., Kolbe, D. L., Eddy, S. R., et al. (2011) Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res., 39, D141–D145
[35]
The UniProt Consortium. (2010) The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res., 38, D142–D148
[36]
Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J. and Sayers, E. W. (2011) GenBank. Nucleic Acids Res., 39, D32–D37
[37]
Murphy, L. R., Wallqvist, A. and Levy, R. M. (2000) Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Eng., 13, 149–152
[38]
Bertsekas, D. P. (1999) Nonlinear Programming. Nashua: Athena Scientific
[39]
Nocedal, J. (1980) Updating quasi-Newton matrices with limited storage. Math. Comput., 35, 773–782
[40]
Kolmogorov, V. (2006) Convergent tree-reweighted message passing for energy minimization. IEEE Trans. Pattern Anal. Mach. Intell., 28, 1568–1583
RIGHTS & PERMISSIONS
Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature