A survey of current trends in computational predictions of protein-protein interactions

Yanbin WANG, Zhuhong YOU, Liping LI, Zhanheng CHEN

Front. Comput. Sci., 2020, Vol. 14, Issue 4: 144901. DOI: 10.1007/s11704-019-8232-z
REVIEW ARTICLE


Abstract

Proteomics has become an important research area in the life sciences since the completion of the Human Genome Project. This discipline studies the characteristics of proteins at the large-scale data level in order to gain a holistic and comprehensive understanding of disease occurrence and cell metabolism at the protein level. A key issue in proteomics is how to efficiently analyze the massive amounts of protein data produced by high-throughput technologies. Low-cost, short-cycle computational technologies are becoming the preferred methods for solving important problems in the post-genome era, such as the prediction of protein-protein interactions (PPIs). In this review, we focus on computational methods for PPI detection and present recent advances in this critical area from multiple aspects. First, we analyze in detail the challenges faced by computational methods for predicting PPIs and summarize the available PPI data sources. Second, we describe the state-of-the-art computational methods recently proposed on this topic. Finally, we discuss several important technologies that can promote PPI prediction and the development of computational proteomics.

Keywords

proteomics / protein-protein interactions / protein feature extraction / computational proteomics

Cite this article

Yanbin WANG, Zhuhong YOU, Liping LI, Zhanheng CHEN. A survey of current trends in computational predictions of protein-protein interactions. Front. Comput. Sci., 2020, 14(4): 144901 https://doi.org/10.1007/s11704-019-8232-z

1 Introduction

The success of deep learning-based network intrusion detection systems (NIDS) relies on large-scale, labeled, realistic traffic [1,2]. However, automated labeling of realistic traffic, for example by sandbox or rule-based approaches, is prone to errors [3], which in turn degrades deep learning-based NIDS.
Several effective schemes for learning with noisy labels (LNL) have been proposed, among which training parallel networks that select samples for each other has been shown to be effective. The argument is that parallel networks can discriminate samples from different perspectives, adding extra information to the training process. The centerpiece is maintaining disagreement between the networks. Both Co-teaching [4] and Co-teaching+ [5] train two networks with the same structure and the same inputs but different initial weights. Co-learning [6] trains a two-headed encoder that collaboratively performs sample selection and weight updates. These methods take only a single input and rely solely on different initial network states to maintain disagreement, ignoring the multimodality of the data. However, multimodality is naturally present in traffic [7,8], and it is difficult to optimize a model using a single modality.
To solve this problem, we propose MMCo, a Co-teaching-like method that uses multimodal information and parallel heterogeneous networks to detect malicious traffic with noisy labels. Unlike existing methods, (1) MMCo is the first LNL method that uses multimodality to maintain disagreement; and (2) the parallel networks in MMCo are heterogeneous and take different modalities of the samples as input, which mitigates self-control degradation and enhances robustness.

2 Architecture

The architecture of MMCo is shown in Fig.1. Multimodal information is extracted from the raw traffic and fed into parallel-trained networks, which perform collaborative training and sample selection from two data perspectives: local trend variation and long-term behavior. Together, the trained networks form a malicious traffic classifier.
Fig.1 Architecture of MMCo


Notation Let $\mathcal{D} := \{(x^{a}, x^{b}, \hat{y})\}$ be a noisy dataset, where $x^{a}$ and $x^{b}$ denote the two modalities of a sample and $\hat{y}$ denotes its noisy label. $F_{\mathrm{CNN}}$ and $F_{\mathrm{RNN}}$ refer to the selected CNN and RNN with weight parameters $\theta_{\mathrm{CNN}}$ and $\theta_{\mathrm{RNN}}$, respectively, with corresponding inputs $\mathcal{D}^{(a)} := \{(x^{a}, \hat{y})\}$ and $\mathcal{D}^{(b)} := \{(x^{b}, \hat{y})\}$. For simplicity, we refer to both inputs as $\mathcal{D}$.
Multimodal information extractor We segment the raw traffic and extract different modalities, such as a semantic modality and a spatio-temporal modality. The semantic modality includes metadata of the protocol stack and of the encryption handshake, since encrypted packets are highly structured. These metadata are unencrypted, reflect the communication setup and the negotiation of cipher suites and extensions, and can help identify individual endpoints. We then extract a semantic embedding of these fields with a projection network. The spatio-temporal modality consists of the packet-size and arrival-time sequences that characterize behavior; for example, sending short packets at regular intervals may indicate a Trojan's keep-alive behavior. Both modalities help identify malicious traffic from different perspectives, and we feed them into suitable networks for training with noisy labels.
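For concreteness, the following sketch shows how one spatio-temporal modality (per-flow packet-size and inter-arrival-time sequences) could be pulled out of a pcap. It assumes Scapy is available; the flow key, the sequence-length cap, and the function name are illustrative choices, not the implementation used in MMCo.

```python
# Minimal sketch of spatio-temporal feature extraction, assuming Scapy is installed.
# The 4-tuple flow key and max_len cap are illustrative, not the paper's exact settings.
from collections import defaultdict
from scapy.all import rdpcap, IP, TCP, UDP

def extract_spatio_temporal(pcap_path, max_len=32):
    """Return, per flow, the packet-size sequence and inter-arrival-time sequence."""
    flows = defaultdict(list)
    for pkt in rdpcap(pcap_path):
        if IP not in pkt:
            continue
        proto = TCP if TCP in pkt else UDP if UDP in pkt else None
        if proto is None:
            continue
        key = (pkt[IP].src, pkt[IP].dst, pkt[proto].sport, pkt[proto].dport)
        flows[key].append((float(pkt.time), len(pkt)))

    features = {}
    for key, pkts in flows.items():
        pkts.sort(key=lambda p: p[0])                 # order packets by arrival time
        times = [t for t, _ in pkts][:max_len]
        sizes = [s for _, s in pkts][:max_len]
        iats = [0.0] + [t2 - t1 for t1, t2 in zip(times, times[1:])]
        features[key] = {"sizes": sizes, "iats": iats}
    return features
```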
Parallel training Algorithm 1 shows the training process of MMCo. In this stage, we choose a CNN and an RNN to learn the two modalities respectively. The semantic modality consists of the control and exchange information carried in the packets, which is aligned and therefore well suited to a CNN, while the spatio-temporal modality is essentially two variable-length time series, which RNNs are good at handling.
In each mini-batch, $F_{\mathrm{CNN}}$ and $F_{\mathrm{RNN}}$ are fed different modalities of the same subset of samples. Each network selects for the other the samples it considers more reliable, i.e., the samples with the smallest loss within the mini-batch. Only these samples are used to update the network parameters.
$\mathcal{L}$ estimates the loss of the samples. In steps 5 and 6 of Algorithm 1, the samples with the smallest loss in each mini-batch are selected. In steps 7 and 8, $\theta_{\mathrm{CNN}}$ and $\theta_{\mathrm{RNN}}$ are updated using the subsets selected by the other network. $\mathcal{L}$ measures the relative entropy between the observed and the predicted labels:
$$\mathcal{L}(\mathcal{D};\theta) = -\frac{1}{|\mathcal{D}|}\sum_{(x,\hat{y})\in\mathcal{D}} \hat{y}\,\log F(x;\theta).$$
$\lambda$ determines how many samples are selected in each epoch to update $\theta_{\mathrm{CNN}}$ and $\theta_{\mathrm{RNN}}$. Benefiting from the memorization properties of neural networks, the two networks learn correct knowledge preferentially. As training proceeds and the networks begin to fit the noisy label distribution, the impact of noisy labels becomes increasingly significant. Therefore, $\lambda$ is decreased monotonically with the epoch; its range in this paper is $[0.5, 0.9]$.
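As a minimal sketch of one cross-update step (PyTorch; the model, optimizer, and variable names are placeholders, and the schedule that decays $\lambda$ from 0.9 to 0.5 is assumed rather than taken from the paper), the small-loss selection and exchange described above could look as follows:

```python
import torch
import torch.nn.functional as F

def mmco_step(f_cnn, f_rnn, opt_cnn, opt_rnn, x_a, x_b, y_noisy, keep_ratio):
    """One mini-batch of MMCo-style co-teaching.

    x_a: semantic modality batch (CNN input); x_b: spatio-temporal batch (RNN input).
    keep_ratio plays the role of lambda: the fraction of small-loss samples kept.
    """
    n_keep = max(1, int(keep_ratio * y_noisy.size(0)))

    # Per-sample losses on each modality (no reduction, so samples can be ranked).
    with torch.no_grad():
        loss_cnn = F.cross_entropy(f_cnn(x_a), y_noisy, reduction="none")
        loss_rnn = F.cross_entropy(f_rnn(x_b), y_noisy, reduction="none")

    # Each network selects the samples it considers clean (smallest loss) ...
    idx_from_cnn = torch.argsort(loss_cnn)[:n_keep]
    idx_from_rnn = torch.argsort(loss_rnn)[:n_keep]

    # ... and that subset is used to update the *other* network (cross-update).
    opt_cnn.zero_grad()
    F.cross_entropy(f_cnn(x_a[idx_from_rnn]), y_noisy[idx_from_rnn]).backward()
    opt_cnn.step()

    opt_rnn.zero_grad()
    F.cross_entropy(f_rnn(x_b[idx_from_cnn]), y_noisy[idx_from_cnn]).backward()
    opt_rnn.step()
```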


Decision fusion In the decision fusion stage, we use classical late fusion, i.e., a weighted linear combination of the scores of the two classifiers:
$$H(x) = w_{\mathrm{CNN}}\,F(x;\theta_{\mathrm{CNN}}) + w_{\mathrm{RNN}}\,F(x;\theta_{\mathrm{RNN}}).$$
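A minimal sketch of this late-fusion step (PyTorch; the softmax normalization and equal default weights are illustrative assumptions, with $w_{\mathrm{CNN}}$ and $w_{\mathrm{RNN}}$ expected to be tuned on held-out data):

```python
import torch

def fuse(f_cnn, f_rnn, x_a, x_b, w_cnn=0.5, w_rnn=0.5):
    """Weighted linear combination of the two classifiers' class scores."""
    with torch.no_grad():
        p_cnn = torch.softmax(f_cnn(x_a), dim=-1)   # CNN scores on the semantic modality
        p_rnn = torch.softmax(f_rnn(x_b), dim=-1)   # RNN scores on the spatio-temporal modality
    scores = w_cnn * p_cnn + w_rnn * p_rnn
    return scores.argmax(dim=-1)                    # predicted class per sample
```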

3 Experiment and analysis

Datasets We need to extract multimodal information from raw traffic; therefore, we choose the pcaps provided by CICIDS-2017 [9] and DoHBrw-2020 [10], which are divided into training and validation sets. Labels in the training set are flipped randomly to simulate noise. We set up symmetric and asymmetric scenarios according to realistic situations: in the symmetric scenarios, every label may be flipped, whereas in the asymmetric scenarios, only some classes are flipped. All labels in the validation set remain unchanged.
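The label-flipping procedure can be sketched as follows (NumPy; the asymmetric class mapping shown here is an illustrative assumption, not the exact pairing used in the experiments):

```python
import numpy as np

def flip_labels(labels, noise_rate, num_classes, class_map=None, rng=None):
    """Return a randomly corrupted copy of an integer label array.

    Symmetric noise (class_map=None): every label may be flipped, uniformly to another class.
    Asymmetric noise: only labels whose class appears in class_map may be flipped,
    each to its mapped target class (the mapping itself is an illustrative assumption).
    """
    rng = rng or np.random.default_rng(0)
    noisy = labels.copy()
    flip_mask = rng.random(len(labels)) < noise_rate
    if class_map is None:                                   # symmetric scenario
        idx = np.where(flip_mask)[0]
        offsets = rng.integers(1, num_classes, size=len(idx))
        noisy[idx] = (noisy[idx] + offsets) % num_classes   # always a *different* class
    else:                                                   # asymmetric scenario
        for src, dst in class_map.items():
            idx = np.where(flip_mask & (labels == src))[0]
            noisy[idx] = dst
    return noisy
```

For instance, flip_labels(y, 0.2, num_classes) would produce a Sym-20% training set, while passing a class_map restricted to a few classes produces an asymmetric scenario.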
Results The disagreement between the two networks is shown in Fig.2, and the accuracy on the validation set is shown in Fig.3. After 200 epochs, the classification networks of MMCo still maintain about 10% disagreement and reach a final classification accuracy of 90%, whereas the disagreement between the two networks of the other methods is close to zero. At that point, those networks are in a state of self-control degradation and can hardly learn further knowledge. In contrast, MMCo maintains higher disagreement, which helps the classifiers learn more correct knowledge and yields about 10% higher accuracy.
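The disagreement plotted in Fig.2 can be read as the fraction of validation samples on which the two networks predict different classes; a minimal sketch of such a measurement (PyTorch, with placeholder names):

```python
import torch

def disagreement_rate(f_cnn, f_rnn, x_a, x_b):
    """Fraction of samples for which the two networks predict different classes."""
    with torch.no_grad():
        pred_cnn = f_cnn(x_a).argmax(dim=-1)
        pred_rnn = f_rnn(x_b).argmax(dim=-1)
    return (pred_cnn != pred_rnn).float().mean().item()
```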
Fig.2 Disagreement of networks (Sym-20%)


Fig.3 Accuracy on validation set (Sym-20%)


Tab.1 compares MMCo with the other methods under different noise patterns and noise levels; MMCo outperforms the state-of-the-art methods in every scenario.
Tab.1 Accuracy under different noise scenarios

Acc (%)         Asym. 20%   Asym. 50%   Asym. 70%   Sym. 20%   Sym. 40%
Co-teaching        83.13       81.17       73.55       83.61      73.11
Co-teaching+       80.49       79.47       71.88       79.89      72.83
Co-learning        80.24       80.11       74.71       80.35      75.54
MMCo (ours)        92.89       90.76       85.31       92.86      81.31

4 Conclusion and future work

In this paper, we validated the feasibility of LNL using multimodal information. We found that MMCo maintains higher disagreement by using networks with different structures and different modal inputs, which improves the robustness of the classifier. Thus, MMCo can alleviate the self-control degradation of parallel models to which current LNL methods are prone.
In future work, we will further analyze the representations learned by the two networks using explainable artificial intelligence, which may help identify and clean malicious traffic with noisy labels.

References

[1] Colinge J, Bennett K L. Introduction to computational proteomics. PLoS Computational Biology, 2007, 3(7): e114
[2] Matthiesen R. Methods, algorithms and tools in computational proteomics: a practical point of view. Proteomics, 2010, 7(16): 2815–2832
[3] Jones S, Thornton J M. Principles of protein-protein interactions. Proceedings of the National Academy of Sciences of the United States of America, 1996, 93(1): 13–20
[4] Phizicky E M, Fields S. Protein-protein interactions: methods for detection and analysis. Microbiological Reviews, 1995, 59(1): 94–123
[5] Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan N J, Chung S, Emili A, Snyder M, Greenblatt J F, Gerstein M. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 2003, 302(5644): 449–453
[6] Rhodes D R, Tomlins S A, Varambally S, Mahavisno V, Barrette T, Kalyanasundaram S, Ghosh D, Pandey A, Chinnaiyan A M. Probabilistic model of the human protein-protein interaction network. Nature Biotechnology, 2005, 23(8): 951–959
[7] Oti M, Snel B, Huynen M A, Brunner H G. Predicting disease genes using protein-protein interactions. Journal of Medical Genetics, 2006, 43(8): 691–698
[8] Sprinzak E, Sattath S, Margalit H. How reliable are experimental protein-protein interaction data? Journal of Molecular Biology, 2003, 327(5): 919–923
[9] Letovsky S, Kasif S. Predicting protein function from protein-protein interaction data: a probabilistic approach. Intelligent Systems in Molecular Biology, 2003, 19: 197–204
[10] Xenarios I, Salwinski L, Duan X J, Higney P, Kim S, Eisenberg D. DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research, 2002, 30(1): 303–305
[11] Chatr-Aryamontri A, Breitkreutz B J, Heinicke S, Boucher L, Winter A, Stark C, Nixon J, Ramage L, Kolas N, O'Donnell L. The BioGRID interaction database. Nucleic Acids Research, 2013, 41: D816–D823
[12] Bader G D, Betel D, Hogue C W V. BIND: the biomolecular interaction network database. Nucleic Acids Research, 2001, 31(1): 248–250
[13] Cherry J M, Adler C, Ball C A, Chervitz S A, Dwight S S, Hester E T, Jia Y, Juvik G, Roe T, Schroeder M. SGD: Saccharomyces Genome Database. Nucleic Acids Research, 1998, 26(1): 73–79
[14] Peri S, Navarro J D, Amanchy R, Kristiansen T Z, Jonnalagadda C K, Surendranath V, Niranjan V, Muthusamy B, Gandhi T K, Gronborg M. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Research, 2003, 13(10): 2363–2371
[15] Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Mark P, Stumpflen V, Mewes H. The MIPS mammalian protein-protein interaction database. Bioinformatics, 2005, 21(6): 832–834
[16] Samuel K, Bruno A, Lionel B, Alan B, Fiona B C, Carol C, Margaret D, Marine D, Marc F, Ursula H. The IntAct molecular interaction database in 2012. Nucleic Acids Research, 2012, 40(Database issue): 841–846
[17] Wei L, Xing P, Zeng J, Chen J, Su R, Guo F. Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier. Artificial Intelligence in Medicine, 2017, 83: 67–74
[18] Ding Y, Tang J, Guo F. Predicting protein-protein interactions via multivariate mutual information of protein sequences. BMC Bioinformatics, 2016, 17(1): 398
[19] Wang T, Li L, Huang Y, Zhang H, Ma Y, Zhou X. Prediction of protein-protein interactions from amino acid sequences based on continuous and discrete wavelet transform features. Molecules, 2018, 23(4): 823
[20] Wang Y, You Z, Li L, Huang Y, Yi H. Detection of interactions between proteins by using Legendre moments descriptor to extract discriminatory information embedded in PSSM. Molecules, 2017, 22(8): 1366
[21] Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang H. Predicting protein-protein interactions based only on sequences information. Proceedings of the National Academy of Sciences of the United States of America, 2007, 104(11): 4337–4341
[22] Guo Y, Yu L, Wen Z, Li M. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Research, 2008, 36(9): 3025–3030
[23] Cosic I, Hearn M T. Studies on protein-DNA interactions using the resonant recognition model: application to repressors and transforming proteins. FEBS Journal, 2010, 205(2): 613–619
[24] Yang L, Xia J F, Gui J. Prediction of protein-protein interactions from protein sequence using local descriptors. Protein & Peptide Letters, 2010, 17(9): 1085–1090
[25] Hu L, Chan K C C. Extracting coevolutionary features from protein sequences for predicting protein-protein interactions. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017, 14(1): 155–166
[26] Wei Z S, Yang J Y, Yu D J. Predicting protein-protein interactions with weighted PSSM histogram and random forests. In: Proceedings of International Conference on Intelligent Science and Big Data Engineering. 2015, 326–335
[27] Zahiri J, Yaghoubi O, Mohammad-Noori M, Ebrahimpour R, Masoudi-Nejad A. PPIevo: protein-protein interaction prediction from PSSM based evolutionary information. Genomics, 2013, 102(4): 237–242
[28] Lin C Y, Chen Y C, Lo Y S, Yang J M. Inferring homologous protein-protein interactions through pair position specific scoring matrix. BMC Bioinformatics, 2013, 14(S2): S11
[29] Wang Y, You Z, Li X, Chen X, Jiang T, Zhang J. PCVMZM: using the probabilistic classification vector machines model combined with a Zernike moments descriptor to predict protein-protein interactions from protein sequences. International Journal of Molecular Sciences, 2017, 18(5): 1029
[30] Li L P, Wang Y B, You Z H, Li Y, An J Y. PCLPred: a bioinformatics method for predicting protein-protein interactions by combining relevance vector machine model with low-rank matrix approximation. International Journal of Molecular Sciences, 2018, 19(4): 1029
[31] Song X Y, Chen Z H, Sun X Y, You Z H, Li L P, Zhao Y. An ensemble classifier with random projection for predicting protein-protein interactions using sequence and evolutionary information. Applied Sciences, 2018, 8(1): 89
[32] An J Y, Meng F R, You Z H, Fang Y H, Zhao Y J, Zhang M. Using the relevance vector machine model combined with local phase quantization to predict protein-protein interactions from protein sequences. BioMed Research International, 2016, 2016: 1–9
[33] Cheung W, Hamarneh G. n-SIFT: n-dimensional scale invariant feature transform. IEEE Transactions on Image Processing, 2009, 18(9): 2012
[34] Bay H, Ess A, Tuytelaars T, Van Gool L. Speeded-up robust features. Computer Vision & Image Understanding, 2008, 110(3): 404–417
[35] Žunić J, Hirota K, Rosin P L. A Hu moment invariant as a shape circularity measure. Pattern Recognition, 2010, 43(1): 47–57
[36] Khotanzad A, Hong Y H. Invariant image recognition by Zernike moments. IEEE Transactions on Pattern Analysis & Machine Intelligence, 1990, 12(5): 489–497
[37] Zhang F, Liu S Q, Wang D B, Guan W. Aircraft recognition in infrared image using wavelet moment invariants. Image & Vision Computing, 2009, 27(4): 313–318
[38] Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of International Conference on Computer Vision and Pattern Recognition. 2005, 886–893
[39] Whitehill J, Omlin C W. Haar features for FACS AU recognition. In: Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition. 2006, 97–101
[40] Ojala T, Pietikainen M, Harwood D. Performance evaluation of texture measures with classification based on Kullback discrimination of distributions. In: Proceedings of the 12th International Conference on Pattern Recognition. 1994, 582–585
[41] He D C, Wang L. Texture features based on texture spectrum. Pattern Recognition, 1991, 24(5): 391–399
[42] Qian S, Chen D. Discrete Gabor transform. IEEE Transactions on Signal Processing, 1993, 41(7): 2429–2438
[43] Zeng J, Li D, Wu Y, Zou Q, Liu X. An empirical study of features fusion techniques for protein-protein interaction prediction. Current Bioinformatics, 2016, 11(1): 4–12
[44] Tipping M E. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 2001, 1(3): 211–244
[45] Tipping M E. The relevance vector machine. In: Proceedings of the 12th International Conference on Neural Information Processing Systems. 2000, 652–658
[46] Wei L, Yang Y, Nishikawa R M, Wernick M N, Edwards A. Relevance vector machine for automatic detection of clustered microcalcifications. IEEE Transactions on Medical Imaging, 2005, 24(10): 1278
[47] Zhou Z. Learnware: on the future of machine learning. Frontiers of Computer Science, 2016, 10(4): 589–590
[48] Rong W, Peng B, Ouyang Y, Li C, Xiong Z. Structural information aware deep semi-supervised recurrent neural network for sentiment analysis. Frontiers of Computer Science, 2015, 9(2): 171–184
[49] Mikolov T, Karafiat M, Burget L, Cernocký J, Khudanpur S. Recurrent neural network based language model. In: Proceedings of the 11th Annual Conference of the International Speech Communication Association. 2010, 1045–1048
[50] Gregor K, Danihelka I, Graves A, Rezende D J, Wierstra D. DRAW: a recurrent neural network for image generation. In: Proceedings of International Conference on Machine Learning. 2015, 1462–1471
[51] Sainath T N, Vinyals O, Senior A, Sak H. Convolutional, long short-term memory, fully connected deep neural networks. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. 2015, 4580–4584
[52] Dyer C, Ballesteros M, Ling W, Matthews A, Smith N A. Transition-based dependency parsing with stack long short-term memory. Computer Science, 2015, 37(2): 321–332
[53] Sak H, Senior A, Beaufays F. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. In: Proceedings of the 15th Annual Conference of the International Speech Communication Association. 2014
[54] Li Z, Wang Y, Zhi T, Chen T. A survey of neural network accelerators. Frontiers of Computer Science, 2017, 11(5): 746–761
[55] Lazib L, Qin B, Zhao Y, Zhang W, Liu T. A syntactic path-based hybrid neural network for negation scope detection. Frontiers of Computer Science, 2020, 14(1): 84–94
[56] Sprinzak E, Margalit H. Correlated sequence-signatures as markers of protein-protein interaction. Journal of Molecular Biology, 2001, 11(4): 681–692
[57] Bock J R, Gough D A. Predicting protein-protein interactions from primary structure. Bioinformatics, 2001, 17(5): 455–460
[58] Martin S, Roe D, Faulon J L. Predicting protein-protein interactions using signature products. Bioinformatics, 2004, 21(2): 218–226
[59] Ben-Hur A, Noble W S. Kernel methods for predicting protein-protein interactions. Intelligent Systems in Molecular Biology, 2005, 21(1): 38–46
[60] Chou K, Cai Y. Predicting protein-protein interactions from sequences in a hybridization space. Journal of Proteome Research, 2006, 5(2): 316–322
[61] Wang Y, You Z, Li L, Cheng L, Zhou X, Zhang L, Li X, Jiang T. Predicting protein interactions using a deep learning method-stacked sparse autoencoder combined with a probabilistic classification vector machine. Complexity, 2018, 2018: 1–12
[62] Sun T, Zhou B, Lai L, Pei J. Sequence-based prediction of protein-protein interaction using a deep-learning algorithm. BMC Bioinformatics, 2017, 18(1): 277
[63] Almagro Armenteros J J, Sønderby C K, Sønderby S K, Nielsen H, Winther O. DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics, 2017, 33(21): 3387–3395
[64] Yi H C, You Z H, Huang D S, Li X, Jiang T H, Li L P. A deep learning framework for robust and accurate prediction of ncRNA-protein interactions using evolutionary information. Molecular Therapy Nucleic Acids, 2018, 11: 337–344
[65] Wang Y B, You Z H, Li X, Jiang T H, Chen X, Zhou X, Wang L. Predicting protein-protein interactions from protein sequences by a stacked sparse autoencoder deep neural network. Molecular BioSystems, 2017, 13(7): 1336–1344
[66] Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016, 1715–1725
[67] Kudo T. Subword regularization: improving neural network translation models with multiple subword candidates. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2018, 66–75
[68] Kudo T, Richardson J. SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2018, 66–71
[69] Rebentrost P, Mohseni M, Lloyd S. Quantum support vector machine for big data classification. Physical Review Letters, 2013, 113(13): 130503
[70] Crawford D, Levit A, Ghadermarzy N, Oberoi J S, Ronagh P. Reinforcement learning using quantum Boltzmann machines. 2016, arXiv preprint arXiv:1612.05695
[71] Qiu D, Li L. An overview of quantum computation models: quantum automata. Frontiers of Computer Science, 2008, 2(2): 193–207

RIGHTS & PERMISSIONS

2020 Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature

