INTRODUCTION
Arsenic and its derivatives exhibit two sides for human health. Chronic exposure to arsenic (especially inorganic arsenic compounds) can cause a number of diseases,
e.g., cancers, diabetes, as well as hepatic and digestive diseases [
1]. Thus, the International Agency for Research on Cancer (IARC) and the US Environmental Protection Agency (EPA) have defined arsenic as a human carcinogen. Additionally, the US Agency for Toxic Substances and Disease Registry (ATSDR) has ranked arsenic as No. 1 on the Substance Priority List for many years. Arsenic and its derivatives also have significant medical applications, especially in treating hematologic diseases [
2]. For instance, the US Food and Drug Administration (FDA) and the European Medicines Agency (EMA) have approved arsenic trioxide combined with all-trans retinoic acid for the treatment of acute promyelocytic leukemia (APL). Because the combination use of arsenic trioxide and all-trans retinoic acid for APL patients shows a superior five-year survival rate [
3]. In addition to the excellent clinical performance on APL, arsenic also exhibits a broad-spectrum anti-cancer ability against both hematologic malignancies and solid tumors [
4].
To understand the mechanism for the anti-cancer ability of arsenic and its derivatives, we start from the arsenic-binding proteins in human cells. By now, only several arsenic-binding proteins are experimentally confirmed in human cells. Promyelocytic leukemia protein (PML) is a tumor suppressor protein in APL. Trivalent arsenic directly binds the RING-finger domain of PML and PML-retinoic acid receptor
a (RAR
a) fusion protein in APL cells to induce PML oligomerization, enhancing the interaction between PML and UBC9, further accelerating the ubiquitination and degradation of PML and PML-RAR
a fusion protein [
5]. The ubiquitin E3 ligases c-CBL (Casitas B-lineage lymphoma) is another experimentally confirmed arsenic-binding protein. Similar to PML, arsenic directly binds the RING-finger domain of c-CBL to inhibitor its self-ubiquitination and degradation, up-regulating the protein expression of c-CBL [
6]. In addition to PML and c-CBL, arsenic has a significant impact on the thioredoxin system by directly binding to the thioredoxin reductase (TrxR) [
7].
The identification of arsenic-binding proteins is limited by lacking effective experimental methods and theoretical tools. Position specific score matrix (PSSM) and machine learning algorithms have been successfully used to predict the binding sites of the co-factors of proteins and enzymes,
e.g., metal ions and acid radical ions [
8]. These computational methods need to utilize a large amount of experimental data as training sets to generate accurate computational models. For instance, the prediction of calcium-binding proteins needs 179 proteins with 329 experimentally validated calcium-binding sites as a training set to reach an accuracy of more than 90% with PSSM and machine learning algorithms [
9]. However, not enough experimentally validated arsenic-binding proteins and arsenic-binding sites are available for prediction. Only four human proteins with arsenic-binding sites have been experimentally validated (
e.g., mass spectrum data), which is not enough for a training set for machine learning.
To address this issue, we used binding free energies for the single mutation to replace the multiple sequence alignment in the PSSM generation. Firstly, we constructed a structural model for arsenic binding based on the crystal structure of the arsenic methyltransferase ArsM and generated a number of single mutations based on the structural model for arsenic binding. Subsequently, we performed molecular dynamics simulations on these mutations and calculated the free energy profile for arsenic binding using the free energy perturbation approach. The analysis of molecular dynamics simulations and binding free energies for arsenic binding can provide useful information to understand the molecular mechanism for the binding specificity of arsenic. Based on these calculations, we predicted all the potential arsenic-binding sites and arsenic-binding proteins in human cells, further performing systematical analysis on these arsenic-binding proteins and providing a novel insight into the biological functions of arsenic in human cells.
RESULTS AND DISCUSSION
Molecular basis for the arsenic-binding specificity
As trivalent arsenic shows a high affinity to sulfhydryl groups, arsenic is inclined to bind cysteine residues in proteins. As shown in Figure 1, the residue contribution to the free energy for arsenic binding indicates that the cysteine residue shows the most favorable binding free energy for arsenic (−5.0 kcal/mol). In addition to cysteine residues, histidine (−1.0 kcal/mol), lysine (−1.2 kcal/mol), asparagine (−1.0 kcal/mol), glutamine (−1.1 kcal/mol), as well as arginine (−0.8 kcal/mol) residues are preferred for arsenic binding. All the residues favored for arsenic binding are polar amino acids and can be classified into two types: non-ionized amino acids (cysteine, asparagine, and glutamine) and alkaline (histidine, lysine, and arginine) amino acids. These polar amino acids can provide an additional lone pair electrons for arsenic binding.
Both experimental and computational evidence indicate that arsenic specifically binds to reduced cysteine residues. How does arsenic choose the specific cysteine to bind? To answer this question, we generated ten single-cysteine motifs and four double-cysteine motifs based on the template structure (VISNCVCNLS), and calculated their free energies for arsenic binding. As a result, the double-cysteine motifs exhibit more favorable free energies for arsenic binding than the single-cysteine motif (Supplementary Table S1). This observation provides an indication that adjacent cysteine residues are preferred for arsenic binding. To figure out the favorable binding motif for arsenic, we constructed a single mutation free energy profile for arsenic binding using the free energy perturbation approach. Based on the single mutation free energy profile for arsenic binding, we generated a PSSM for arsenic binding (as shown in Table 1), which can be further used to predict arsenic-binding sites and arsenic-binding proteins. After analyzing the single mutation free energy profile, we obtained a consensus sequence motif xxxCxxCxCx for arsenic binding (Figure 2A).
Computational validation for the arsenic-binding PSSM
To examine the predicting ability and accuracy of the arsenic-binding PSSM, we collect nine proteins with experimentally verified arsenic-binding sites for the computational validation. For detailed information, please see the Supplementary Table S2. A slidable window with a width of 10 AA is applied to slide along the local sequence of the collected proteins, to generate a number of 10-AA segments. The 10-AA segments that contain the experimentally verified binding sites for arsenic are considered as the arsenic-binding motifs. Conversely, those without experimentally verified binding sites for arsenic are considered as non-binding motifs. The independent test set is composed of the arsenic-binding motifs and non-binding motifs mentioned above, and the single mutation free energy profile based arsenic-binding PSSM is subsequently applied to the test set for assessing its predicting ability and accuracy. As a result, the single mutation free energy profile based arsenic-binding PSSM shows 88.7% sensitivity, 90.6% specificity and 92.3% accuracy.
To further check the predicting ability of the single mutation free energy profile based arsenic-binding PSSM, we also used two independent datasets (named as MALDI-MS dataset [
10] and protein microarray dataset [
11] respectively) for validation. For detailed information, please see Supplementary Table S3 and Table S4. Only eight MALDI-MS proteins are overlapped with protein microarray datasets. The arsenic-binding PSSM is subsequently carried out on the proteins in MALDI-MS and protein microarray datasets. As a result, 94.0% proteins in the MALDI-MS dataset and 91.9% in the protein microarray dataset are successfully predicted as the arsenic-binding proteins (Figure 2B). These findings suggest that our computational model can repeat the majority of previous experimental results. Fairly reliable prediction of identifying arsenic-binding proteins is possible by usage of the tool.
Identifying arsenic-binding proteins in the human genome
Using the single mutation free energy profile based arsenic-binding PSSM, we scan the entire human genome to search for arsenic-binding proteins in human cells. In total, 11,315 arsenic-binding sites are identified with a threshold of q-value less than 0.05. Together with the chromosome location information, the identified arsenic-binding sites are mapped into genes using the BioMart community portal [
12]. After mapping these computational predictions into the human genome, nearly 2,800 proteins are finally discovered as arsenic-binding proteins. Additional function analysis indicates that these arsenic-binding proteins in the human genome are mainly enriched in signaling transduction and related pathways, especially in Notch signaling, Wnt, and TGF-beta signaling pathways (Figure 2C).
In the Notch signaling pathway, 16 arsenic-binding proteins are enriched with a significant P-value of 8.6×10
-35 (Figure 3 and Supplementary Table S5). The Notch signaling is essential for many fundamental cellular processes, such as cell proliferation and apoptosis, and its dysfunction can result in many diseases,
e.g., degenerative diseases, fibrosis, and cancers. Our computational prediction indicates that arsenic directly binds to Notch receptors (Notch1–4) and Notch ligands (Delta-like proteins and Jagged families). Additionally, ADAM proteases, and transcription factor CSL, which is regulated by the released intracellular domain of Notch receptors, are also detected to have arsenic-binding sites. Thus, arsenic binding to these proteins will lead to the dysfunction of Notch signaling, further affecting cell proliferation and apoptosis. This opinion is also supported by the experimental results in human keratinocyte cell [
13] and glioma cell lines [
14].
In addition to Notch signaling pathway, arsenic-binding proteins are found to enrich in the Wnt and TGF-beta signaling pathways (Figure 3 and Supplementary Table S5). In these signaling pathways, arsenic directly binds to Wnt family proteins, TGF-beta, as well as their regulators, such as porcupine homolog (PORCN), secreted frizzled related proteins (SFRPs), Wnt inhibitory factors (WIFs), latent transforming growth factor beta binding proteins (LTBPs), and thrombospondins (THBSs). Besides, sclerostin (SOST) and dickkopf Wnt signaling pathway inhibitors (DKKs) are also detected to have arsenic-binding sites. These proteins are regarded to have significant impacts on the Wnt signaling pathway by regulating the expression and biological functions of Wnt receptors (frizzled class receptor FZD and LDL receptors). Thus, arsenic binding to these proteins is able to induce significant inhibitions on their biological activities, further regulating the corresponding signaling pathway, which is partly supported by mouse model studies [
15].
Transcriptional regulation is an essential process in all living organisms. The dysfunction of transcriptional regulation can lead to a variety of diseases including cancers. Many biological studies confirm that arsenic is able to bind the RING finger domain of transcriptional factors [
16]. In the current study, functional analysis shows that a number of arsenic-binding proteins are transcription factors, enriched in the transcriptional misregulation pathway. For instance, a series of zinc finger and BTB domain-containing proteins (
e.g., ZBTB16, ZBTB17, and ZBTB27) are identified as arsenic-binding proteins. These proteins contain C
2H
2 zinc finger domain and BTB domain (also known as the POZ domain), functioning as transcriptional factors in many diseases,
e.g., leukemia and neuroblastoma [
17].
In addition to these transcription factors, some significant fusion proteins, which are considered to play crucial roles in cancers, are also identified to have arsenic binding sites. Among these fusion proteins, PML-RAR
a and ZBTB16-RAR
a fusion proteins are essential in APL, both of which are predicted to have arsenic binding sites by the arsenic PSSM. PML-RAR
a fusion protein has been demonstrated to be a direct target for arsenic binding in APL cells, having a significant impact on its ubiquitination and degradation, to achieve an effective strategy against APL. ZBTB16-RAR
a fusion protein is a transcriptional repressor to recruit the SMRT/N-CoR deacetylase transcriptional repression complex [
18]. The binding of arsenic causes dysfunctions of ZBTB16-RAR
a fusion protein, going against APL. Another interesting fusion protein having arsenic binding sites is TMPRSS2-KCNH2 fusion protein. This fusion protein can induce the activation of the androgen receptor pathway and prostate cancer [
19]. If Arsenic binding could accelerate the degradation of TMPRSS2-KCNH2 fusion protein, it would be a new strategy to treat prostate cancer.
CONCLUSION
In summary, we developed a single mutation free energy profile for arsenic binding. Based on this binding energy profile, we generated a position-specific score matrix for identifying arsenic-binding proteins. Both computational validation and experimental evidence show that our computational model is able to predict arsenic-binding proteins with satisfying accuracy. Thus, we scanned the human genome to identify all the potential arsenic-binding proteins. Additional functional analysis indicates that the arsenic-binding proteins in the human genome show a wide range of biological functions, especially in signaling conduction and related pathways. Hopefully, these findings can provide a genomic insight for in-depth understanding of the effective ability of arsenic against malignancies and solid tumors.
MATERIALS AND METHODS
The structural model for arsenic-binding
The structural model for arsenic-binding was constructed based on the crystal structure of arsenic methyltransferase (ArsM) from
Cyanidioschyzon sp. 5508 (RCSB PDB entry 4 fsd) [
20]. ArsM is a homolog of human AS3MT that catalyzes transfer to the methyl group of S-adenosylmethionine (SAM) to trivalent arsenic. It is also an important methyltransferase for arsenic biomethylation. The crystal water molecules and non-polar hydrogen atoms in the crystal structure 4 fsd were removed [
21]. pKa values for each residue were calculated based on a Poisson-Boltzmann solver with a dielectric constant of 4.0 [
22]. Hydrogens were subsequently added based on the computational pKa values using H++ web server [
23]. The structural template was then solvated in a rectangular simulation box with explicit TIP3P water molecules, and subjected to steepest descent energy minimization for 5,000 steps, followed by conjugate gradient for the next 5,000 steps.
Estimating arsenic-binding free energy
Amber force field parameters were used to parameterize the structural models of arsenic-binding [
24]. The force field parameters of arsenic were obtained from the previous density function calculation of arsenic compounds [
25]. The free energy perturbation (FEP) approach [
26] was used to estimate arsenic-binding free energies. In the FEP calculations, the free energy distinction between the arsenic binding state and the free state is expressed by the equation:
where
kB is the Boltzmann constant,
T is the temperature, and
Hc(
r,p) and
Hf(
r,p) are the Hamiltonians characteristic of the arsenic binding state and free state, respectively.<…>
f denotes an ensemble average over configurations representative of the arsenic free state. Molecular dynamics simulations with the periodic boundary condition and NPT ensemble using NAMD package [
27] were used to generate the configurations for arsenic binding as well as free states. The weak-coupling algorithm [
28] and Berendsen thermostat [
29] were applied to maintain the simulation system at a constant temperature and pressure. The SHAKE algorithm [
30] was used to constrain all the chemical bonds, and atom velocities for start-up runs were obtained according to the Maxwell distribution at 310 K. The electrostatic interaction was treated by the PME algorithm [
31] with an interpolation order of 4.0 and a grid spacing of 0.12 nm. The van der Waals interaction was computed using a cut-off of 12 angstroms. All the molecular dynamics simulations were performed with a time step of 2 fs, and the simulation coordinates were saved every 1 ps.
Generating binding free energy profiles for single mutations
Based on the template model of arsenic-binding mentioned above, we abstracted a 10-amino-acid (10-AA) segment (VISNCVCNLS) as the template structure for arsenic-binding. Subsequently, we mutated each residue of the template structure into the other 19 residues to generate a number of single mutation segments. Arsenic-binding free energies for each single mutation segment were then calculated using the free energy perturbation approach mentioned above. The binding free energy profile for single mutations was subsequently converted into a 10 × 20 matrix, representing the free energy distinction for arsenic-binding from the 10-AA segment (VISNCVCNLS) in the template structure. This matrix can be also used as a position-specific scoring function . Ms,i is the score of the residue S at the ith position in the binding free energy profile for single mutations and Si is the residue at the ith position of the 10-AA segment. d is represented for the binding free energy distinction for arsenic-binding from the 10-AA segment (VISNCVCNLS) in the template structure.
Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature