Modeling the specificity of protein-DNA interactions
Gary D. Stormo
The specificity of protein-DNA interactions is most commonly modeled using position weight matrices (PWMs). First introduced in 1982, PWMs have been adapted to many new types of data, and many different approaches have been developed to determine their parameters. New high-throughput technologies rapidly provide large amounts of data and offer an unprecedented opportunity to determine accurately the specificities of many transcription factors (TFs). Taking full advantage of the new data, however, requires advanced algorithms that take into account the biophysical processes involved in generating the data. The new large datasets can also help determine when the PWM model is inadequate and must be extended to provide accurate predictions of binding sites. This article provides a general mathematical description of a PWM and how it is used to score potential binding sites, followed by a brief history of the approaches that have been developed and the types of data that are used, with an emphasis on algorithms that we have developed for analyzing high-throughput datasets from several new technologies. It also describes extensions that can be added when the simple PWM model is inadequate, and further enhancements that may be necessary. Finally, it briefly describes some applications of PWMs in the discovery and modeling of in vivo regulatory networks.
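As a rough illustration of what scoring a potential binding site with a PWM means, the sketch below builds a small log-odds matrix and sums one weight per position. It is not taken from the article: the count matrix, the uniform background, the pseudocount, and the function names (`counts_to_pwm`, `score_site`, `scan`) are all assumptions chosen for the example.

```python
import math

BASES = "ACGT"


def counts_to_pwm(counts, background=None, pseudocount=1.0):
    """Turn per-position base counts into log2-odds weights.

    Assumption: a uniform background (0.25 per base) unless one is given,
    and a simple additive pseudocount to avoid log(0).
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pwm


def score_site(pwm, site):
    """Score one candidate site: the sum of one weight per position."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal the PWM width")
    return sum(column[base] for column, base in zip(pwm, site))


def scan(pwm, sequence):
    """Score every window of PWM width along a sequence (one strand only)."""
    width = len(pwm)
    return [
        (start, score_site(pwm, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


# Hypothetical counts for a 3-bp motif; these are not real binding data.
EXAMPLE_COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 9, "T": 0},
]
EXAMPLE_PWM = counts_to_pwm(EXAMPLE_COUNTS)
```

Calling `scan(EXAMPLE_PWM, "TTACGTT")` returns one `(position, score)` pair per window, and the window matching the most frequent base at each position scores highest. The additive form assumes that positions contribute independently, which is the simplification the abstract says may need to be extended when the simple PWM model is inadequate.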