1 Introduction
Stem cells are often referred to as primitive, undifferentiated cells that can reproduce themselves indefinitely (self-renewal) and can generate various cell types on receiving appropriate external or internal cues (pluripotency or multipotency) [
1]. Stem cells are classified into two main groups, embryonic stem cells (ESCs) and adult stem cells. ESCs are mostly derived from the inner cell mass of the blastocyst and give rise to the fetus. In particular, ESCs have a nearly infinite self-renewal ability and the potential of differentiating into almost any cell type [
2,
3]. Adult stem cells, including somatic and germline stem cells, can maintain, replenish, and regenerate the tissue from which they originate in mature organisms [
2]. Both ESCs and adult stem cells have been intensively and widely studied in several fields of science and medicine; the results promise to be very useful in many clinical applications, bringing new treatments and perhaps even curing some currently incurable diseases.
It is of biological and clinical importance to explore the molecular mechanisms involved in stem cell self-renewal, proliferation, and differentiation. Recent advances in so-called “omics” technologies have provided researchers with new opportunities for an overall understanding of the biological features of stem cells. Omics spans a growing number of branches of biology and enables precise analyses of biological processes and structures in an increasing number of life science fields. These fields range from genomics (examining protein-coding genes, noncoding regions, and regulatory elements), transcriptomics (analyzing transcription, gene expression, and alternative splicing), and proteomics (protein identification, quantification, and posttranslational modifications) to epigenomics (DNA methylation, histone modification, and chromatin remodeling). The “omics” technologies are largely responsible for dramatic advances in postgenomic biology and medicine [
4,
5]. Therefore, comprehensive, genome-wide analyses combining the techniques of genomics, proteomics, transcriptomics, and epigenomics will provide new insights into stem cell biology and its potential clinical applications. In this review, we focus on transcriptomics and proteomics of stem cells.
The development of the transcriptomic and proteomic technologies has enabled the investigation of stem cells using systems biology tools. In particular, high-throughput screening techniques have generated large amounts of data, facilitating systemic understanding of relationships between molecular components [
6].
In this article, various aspects of transcriptomic and proteomic studies of stem cells are reviewed, and important findings regarding stem cell self-renewal, proliferation, and differentiation are highlighted and discussed.
2 The characteristics of stem cell transcriptomics
The transcriptome is the complete set of RNAs in a cell at a particular developmental stage or under specific physiological conditions [
7]. Research on the transcriptome is a crucial step in discovering the functional components of the genome, revealing the molecular features of particular cells and tissue, and understanding developmental processes and mechanisms underlying diseases [
7]. Pluripotent stem cells are characterized by high levels of global transcriptional activity, leading to their plasticity, while loss of pluripotency and lineage specification cause a considerable reduction in the transcription of a few portions of the genome [
8].
2.1 mRNA
Comparative analyses, with data mining, of transcriptional profiles of ESCs can quantify the alterations in the expression of each transcript and identify the key factors involved in stem cell self-renewal, proliferation, and differentiation. Examining undifferentiated human embryonic stem cells (hESCs), we can identify the differentially and specifically expressed genes in these cells [
9]. Ginis
et al. compared gene expression profiles of mouse and human ESCs and revealed differences in molecular signatures associated with maintaining pluripotency. These differences are species-specific rather than arising from differences in cell culture conditions [
10]. A few genes that are expressed exclusively or predominantly in ESCs have been identified and verified in many studies. These are genes such as
OCT4,
SOX2,
NANOG,
REX-1,
UTF1,
TERT,
ABCG2,
NODAL,
TDGF1,
LEFTB,
BEX1, and
GATA4, and some components of signaling pathways, such as FGF, WNT, and BMP. These genes are important for maintaining pluripotency [
9,
11–
13]. In addition, there are different factors involved in stem cell differentiation into different lineages. Djouad
et al. compared the transcriptomes of multipotent human mesenchymal stem cells (MSCs) and MSC-derived chondrocytes cultured in micropellets. They observed that the expression of 676 genes was upregulated in MSC-derived chondrocytes in comparison with the original MSCs. In particular,
Foxo3A was highly expressed at day 21 of the culture and in mature chondrocytes. Furthermore, they suggested that upregulation of
Foxo3A expression during chondrogenic differentiation plays a dual role; it inhibits the differentiation toward hypertrophy and promotes cell apoptosis [
14].
However, in different expression profile studies, there is little overlap between the lists of the genes overexpressed in ESCs [
13]. For example, Ivanova
et al. [
15] and Ramalho-Santos
et al. [
16] independently identified >200 genes upregulated in ESCs, but the two studies have only six genes in common, although they used the same cell types and identical microarray chips [
17]. We also downloaded several stem cell data sets from the Gene Expression Omnibus (GEO) and used hierarchical cluster analysis to sort the expressed genes in ESC.1 (H1 cell line, GEO accession number: GSM817221), ESC.2 (H1 cell line, GEO accession number: GSM1006724), ESC.3 (H7 cell line, GEO accession number: GSM1273672), ESC.4 (human embryonic stem cell line, GEO accession number: GSM922224), HESC (human hematopoietic stem cells, GEO accession number: GSM1185603), 48hrESC (ESC.2 differentiated for 48 h, GEO accession number: GSE41009), MSC (mesenchymal stem cells, GEO accession number: GSE37521), and CSC (cancer stem cells, GEO accession number: GSE33912) (Fig. 1) [
18–
23]. Interestingly, ESC.1, ESC.2, ESC.3, and ESC.4 did not cluster together, whereas the differentiated cells (48hrESC) clustered with pluripotent cells. Genetic diversity among cell lines, differences in cell culture conditions, and sampling issues may account for such discrepancies among gene expression data sets. However, we cannot exclude the possibility that flawed methods were employed in such comparative analyses [
13]. Therefore, combined analyses of multiple experiments using appropriate statistical methods are essential in reaching coherent conclusions.
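The hierarchical cluster analysis mentioned above can be illustrated with a minimal sketch. It is not the pipeline used for Fig. 1: the distance measure (1 minus the Pearson correlation), the average-linkage rule, the sample names, and the expression values are all assumptions chosen for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(profiles):
    """Agglomerative clustering of samples with distance 1 - Pearson r.

    profiles: dict mapping sample name -> list of log-expression values.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of sample names.
    """
    names = list(profiles)
    dist = {frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
            for i, a in enumerate(names) for b in names[i + 1:]}
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average of all pairwise sample distances between clusters.
                pairs = [dist[frozenset((a, b))]
                         for a in clusters[i] for b in clusters[j]]
                d = sum(pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Made-up log2 expression values for four hypothetical samples.
toy_profiles = {
    "ESC_a": [9.1, 8.7, 2.0, 1.5, 7.9],
    "ESC_b": [8.8, 9.0, 2.3, 1.2, 8.1],
    "MSC_a": [2.1, 1.9, 8.5, 9.0, 3.0],
    "MSC_b": [2.4, 2.2, 8.9, 8.6, 2.7],
}
merge_history = average_linkage(toy_profiles)
for left, right, distance in merge_history:
    print(left, "+", right, "at distance", round(distance, 3))
```

With real data, the choice of normalization, distance, and linkage can change which samples group together, which is one reason why the clustering of ESC lines should be interpreted with care.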
It has been established that alternative splicing plays a critical role in regulating ESC pluripotency and differentiation [
24]. Using the Solexa sequencing system, Wu
et al. demonstrated that greater splice junction diversity is observed in hESCs than in the cells undergoing neural differentiation [
25]. This suggests that this high diversity of isoforms may contribute to the pluripotency of hESCs. The presence of large numbers of specialized transcripts, highest in undifferentiated hESCs and decreasing upon differentiation, is a part of the phenomenon termed isoform specialization [
25].
Global transcriptional profiling also enables the detection of unknown RNAs. For example, Brandenberger
et al. [
26] identified 16 000 (approximately 50% of total tags) potentially novel expressed sequence tags (ESTs) in hESC, and Anisimov
et al. [
27] identified 16 000 (approximately 35% of total tags) potentially novel tags by serial analysis of gene expression (SAGE) in mouse ESCs. These transcripts may themselves be functional or may regulate the key factors that help stem cells maintain their particular characteristics.
2.2 MicroRNAs
Although the functions of many protein-coding genes have been extensively studied, little is known about the regulatory effects of microRNAs (miRNAs) on transcription. miRNAs are a family of small, noncoding RNAs that can bind to the 3′ untranslated region (3′UTR) of target mRNAs to regulate their expression [
28]. miRNA studies can give us a new insight into the molecular mechanism of stem cell functions.
Various cloning studies have demonstrated that miR-368, miR-200c, miR-154*, miR-371, miR-372, miR-373, and miR-373* are expressed specifically in hESCs to maintain stem cell self-renewal [
2,
29]. miR-301, miR-374, miR-21, miR-29, and miR-29b play crucial roles in stem cell differentiation and their expression increases after the induction of differentiation [
2,
29]. However, not all miRNAs in the genome have been characterized. A thorough exploration of the global expression profiles of miRNAs (the miRNome) during stem cell self-renewal, proliferation, and differentiation would profoundly influence stem cell research. miRNome analysis of ESCs and epiblast stem cells (EpiSCs) derived from mouse embryos has shown that miR17-92, miR290-295, and a large repetitive cluster on chromosome 2 are highly expressed in ESCs, whereas miR-302d, miR-34c, miR-367, and let-7e are highly expressed in EpiSCs [
30]. These data may indicate that miRNAs play dual roles of redundant and specific factors in the fine-tuning of pluripotency during stem cell development [
30]. During T cell development, miRNA expression is a highly regulated and dynamic process. Using next-generation sequencing technology, 645 miRNAs were obtained from these cells [
31]. In addition, Marson
et al. used ChIP-seq to generate accurate genome-wide occupancy maps in mESCs for the pluripotency factors Oct4, Nanog, Sox2, and Tcf3 and for the histone modification H3K4me3. Their studies demonstrate that Oct4, Nanog, Sox2, and Tcf3 promote the ESC miRNA expression program, thus integrating miRNAs into the regulatory network that controls ES cell identity [
32].
2.3 lncRNAs
Non-protein-coding RNAs (or noncoding RNAs) participate in many processes, such as cellular regulation, development and disease [
33]. Besides miRNAs, there is another noteworthy class of potential regulatory RNAs in the transcriptome: long noncoding RNAs (lncRNAs), which are non-protein-coding transcripts longer than 200 nucleotides. The limited number of functional studies of lncRNAs suggests that they play important roles in stem cell pluripotency and differentiation. lncRNAs display their function in many ways, such as expression, regulation, and mutation. Some lncRNAs have been found to play important roles in the pluripotency of stem cells by regulating the expression of key factors. Mohamed
et al. found that two of these lncRNAs, AK028326 (Oct4-activated) and AK141205 (Nanog-repressed), are direct targets of Oct4 and Nanog in mESCs and are associated with alterations in lineage-specific gene expression and in the pluripotency of mESCs [
34]. In human ESC, the transcription of most lncRNA genes is coordinated with transcription of protein-coding genes, which implies that these lncRNAs have positive transcriptional regulation functions [
19]. However, some lncRNAs may have functions other than transcriptional regulation. Dinger
et al. identified 945 ncRNAs expressed during embryoid body differentiation, of which 174 were differentially expressed, many correlating with pluripotency or specific differentiation events, in some cases through engagement of the epigenetic machinery [
35]. Two such lncRNAs,
Six3os and
Dlx1as, are also found to play important roles in the glial-neuronal lineage specification of multipotent adult stem cells [
36].
3 The characteristics of stem cell proteomics
Transcriptome approaches provide genome-wide coverage of mRNAs and miRNAs. However, because of posttranscriptional events, these methods do not always reflect protein dynamics in stem cells. Proteomic analysis provides relative quantitation of proteins and peptides, protein identification, and subcellular localization, and it identifies protein-protein interactions and posttranslational modifications (PTMs) [
37]. The application of proteomics to study the processes controlling stem cell self-renewal, proliferation, and differentiation will provide valuable insight into the molecular mechanisms of the factors involved in the differentiation of these cells to specific lineages [
38].
3.1 Proteins
Most of the stem cell proteomics studies aimed to examine the changes in the cytoplasmic protein content to identify markers, novel key proteins, and protein interaction maps during different stages of stem cell development [
39,
40]. Nagano
et al. identified markers of ESC such as transcription factors Oct-3/4 and UTF-1; alkaline phosphatase; and others including nidogen 2, hepatoma-derived growth factor (HDGF), cadherin 1, catenin α1, transgelin, and disabled homolog 2 [
20]. Comparing monkey ESCs during proliferation and at different stages of spontaneous differentiation (days 3, 6, 12, and 30), Nasrabadi
et al. observed changes in the expression of novel key proteins involved in transcription regulation, cell proliferation (CDV3, RCN1, PCNP and homolog), Ras signaling (G3BP and TTC1), and chromatin remodeling (RUVBL1 and HDGF) [
40]. SILAC proteomics of planarians identifies Ncoa5 as a conserved component of pluripotent stem cells [
41]. In addition, the key proteins and biomarkers of cancer stem cells have been widely studied. DAC2 and CTNNB1 were detected as prognostic markers in the malignant transformation of hESCs in a recent study [
42]. Using a proteomic strategy, p63 was found to play an important role in cancer development by regulating key steps of glycolysis in colon cancer stem cells [
43].
Approximately 3400 genes have been predicted to encode single-pass transmembrane or secreted proteins in mammalian cells [
44]. It is thus necessary to explore the physiological activities of the extracellular proteome during stem cell self-renewal, proliferation, and differentiation [
45]. Gonzalez
et al. analyzed the complete extracellular proteome of hESCs and suggested that activation of the pigment epithelium-derived factor (PEDF) receptor-Erk1/2 signaling pathway by the PEDF is sufficient to maintain the self-renewal of undifferentiated hESCs [
45]. Moreover, ERK1/2 is also identified as a potential pathway correlated with processes that characterize tumorigenic potential and stemness of cancer stem cells in osteosarcoma, which exhibit a surface protein signature different from differentiated cells [
46].
3.2 Phosphorylation
Cell-fate determination is also regulated by protein phosphorylation, a critical determinant of cell signaling [
47,
48]. Phosphorylation status exhibits dynamic changes during the differentiation period. Four recent phosphoproteomic analyses of hESCs, using different cell culture conditions and different technologies, have identified 3067, 2546, 11 995, and 23 522 protein phosphorylation sites [
47–
50]. Approximately 50% of these sites presented dynamic changes in the phosphorylation status during 24 h of differentiation [
47,
50]. Among the dynamically phosphorylated proteins, CDK1/2 was identified as a central factor in controlling stem cell self-renewal and lineage specification [
47]. Brill
et al. also discovered that 389 proteins contained more phosphorylation site identifications in undifferentiated hESCs, whereas 540 proteins contained more such identifications in differentiated derivatives [
48]. Moreover, numerous phosphoproteins in receptor tyrosine kinase (RTK) signaling pathways were present in hESCs [
48]. Understanding the phosphorylation landscape that controls stem cell pluripotency, self-renewal and differentiation will also improve our ability to develop stem cell-based therapies.
4 Systemic analysis of transcriptomic and proteomic data
With the progress of high-throughput approaches, such as RNA sequencing and high-throughput protein studies, the omics data sets are increasing rapidly, demanding new developments in bioinformatics approaches for further analysis of these data. Databases for stem cell omics data encourage researchers to share their experimental stem cell data globally (Table 1). Increasing numbers of computational methods and open source or commercial software packages are being developed. Comparisons of omics data obtained for stem cells and other kinds of cell lines at different regulatory levels give us an increasingly comprehensive view of the molecular mechanisms underlying self-renewal, proliferation, and differentiation of stem cells.
4.1 mRNA-seq data analysis
Numerous technologies have been applied to detect and quantify the transcriptome of stem cells at different differentiation and developmental stages. These include EST, SAGE, massively parallel signature sequencing, microarray analysis, and high-throughput sequencing [also known as “next-generation sequencing (NGS)”] [
63–
66]. Each of these technologies has its own advantages and limitations.
Microarrays have been widely used for obtaining genome-wide expression profiles of stem cells at different stages [
67]. However, microarray technology suffers from insufficient sensitivity, narrow dynamic range, and nonspecific hybridizations [
68]. In addition, this technology can only provide information regarding the transcripts hybridizing with the probes included on the array. Unlike microarrays, SAGE is a
de novo sequencing method, which can identify novel genes; this method needs very little knowledge of sequences for probe construction [
64]. However, the cloning and sequencing steps in this technique are laborious, which significantly limits its use [
64]. The NGS technology, using SOLiD sequencing system, Solexa genome analyzer, and 454 GS FLX sequencer, overcomes the limitations of the traditional sequencing technologies and provides a high-speed, high-throughput, yet low-cost method for both mapping and quantifying transcriptomes [
7,
66]. Researchers often combine several technologies for transcriptome study of stem cells.
With the development of NGS technology, tools for NGS data analysis have emerged rapidly. One of the critical steps in an RNA-seq experiment is mapping the short reads to a reference genome, and mapping tools with different strategies have appeared to address this difficulty. TopHat is a fast mapping tool that aligns RNA-seq short reads to the reference genome using the high-throughput sequence aligner Bowtie; the splice junctions between exons can then be determined by analyzing the mapping results [
69]. PALMapper combines the powerful mapping tool GenomeMapper with the splice alignment tool QPALMA, so it can exploit the quality information of RNA-seq reads and predict splice sites, which improves the accuracy of alignment [
70]. SeqMap can detect multiple substitutions and insertions/deletions of the nucleotide bases in the sequences [
71]. Unlike the above tools, MapSplice is characterized by its sensitivity and specificity in splice detection and by its efficient use of CPU and memory [
72]. The algorithm used by this tool does not depend on splice site features or intron length, so novel canonical and noncanonical splice junctions can be detected [
72]. Other programs such as Scripture, SpliceMap, SOAP and BWA are also often used for RNA-seq mapping [
73–
75].
Other methods and software packages have been developed for downstream analyses, such as transcript assembly, FPKM/RPKM estimation, detection of significant changes in transcript expression, and identification of gene fusions and alternative splicing. Cufflinks is a widely used tool for RNA-seq data analysis. It assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-seq samples [
76]. Bioconductor (www.bioconductor.org) is also widely used. It is an open-source project for the analysis of genomic data, and it includes packages for RNA-seq analysis. Combining several such tools facilitates rigorous RNA-seq data analysis.
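As a brief illustration of RPKM estimation, the sketch below applies the standard definition (reads per kilobase of transcript per million mapped reads) to made-up counts. It is not how Cufflinks computes abundances, which relies on a statistical model of fragment assignment; the gene names, lengths, and counts are hypothetical. FPKM uses the same formula with fragments (read pairs) counted in place of individual reads.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads * 1e9 / (transcript length in bp * total mapped reads)
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

# Hypothetical genes: name -> (mapped reads, transcript length in bp).
toy_counts = {
    "geneA": (500, 2000),
    "geneB": (500, 500),
    "geneC": (0, 1500),
}
toy_library_size = 10_000_000  # total mapped reads in this made-up sample

toy_rpkm = {gene: rpkm(reads, length, toy_library_size)
            for gene, (reads, length) in toy_counts.items()}
for gene, value in toy_rpkm.items():
    print(gene, value)
```

Here geneA and geneB receive the same number of reads, but the shorter geneB obtains a fourfold higher RPKM, which shows why length normalization matters when comparing transcripts within a sample.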
4.2 miRNA-seq data analysis
Deep-sequencing technologies, such as miRNA-seq, provide a powerful strategy to explore miRNA populations with high specificity and sensitivity. Multiple computational approaches have been established to analyze miRNA-seq data, allowing differential expression analysis, identification of known and novel miRNAs, and prediction of miRNA targets. miRDeep is a software package for miRNA-seq data analysis that identifies known and novel miRNAs [
77]. It scores compatibility of the position and frequency of sequenced RNA with the secondary structure of the miRNA precursor by constructing a probabilistic model that simulates miRNA biogenesis process [
77]. miRNAkey implements the basic functions of miRNA-seq data analysis and adds some unique features, such as multiple read determination and data statistics. The tool provides an innovative platform for data mining of miRNA deep-sequencing results [
78]. miRanalyzer and DSAP are commonly used web server tools for analyzing deep-sequencing data of miRNAs [
79,
80]. A few online databases such as TargetScan [
81], PicTar [
82], miRanda [
83], and DIANA-microT [
84] are often used for the prediction of miRNA targets.
4.3 Proteome and phosphoproteome data analysis
Various issues associated with the proteome, such as abundance of proteins and peptides, stability, subcellular localization, PTMs, and their interactions, can be elucidated using different technologies [
38]. Two-dimensional polyacrylamide gel electrophoresis (2D-PAGE), mass spectrometry (MS), and liquid chromatography (LC)-MS/MS techniques are widely applied to proteomic analyses. 2D-PAGE is a common tool for isolating proteins from a complex mixture on the basis of two independent parameters in two distinct steps. High-resolution 2D-PAGE of proteins is the fundamental technique of proteomics and can simultaneously analyze thousands of proteins [
38]. Although 2D-PAGE has been broadly used for proteome analysis, it has several limitations such as low resolution and low sensitivity [
85]. Application of MS has been a significant breakthrough in proteomics. This technique can identify proteins in the femtomole to picomole range and has replaced the classic Edman N-terminal sequencing, which is less sensitive, less automated, and requires an unblocked N terminus [
86,
87]. Liquid chromatography-mass spectrometry is now routinely used for the identification of peptides [
6]. LC-MS-based approaches are preferred over 2D-PAGE-based approaches for detecting proteins. New chemical methods such as isotope-coded affinity tag (ICAT), stable isotope labeling with amino acids in cell culture (SILAC), and isobaric tag for relative and absolute quantification (iTRAQ) can further enhance the sensitivity [
6]. We can also generate a global view of stem cell proteome dynamics using protein microarrays, a high-throughput technique for obtaining protein abundance and functional data [
6,
88].
The analysis of the dynamics of protein expression during stem cell self-renewal and differentiation can provide important clues regarding the progression of stem cell differentiation [
89]. Different bioinformatics tools have been developed for analyzing proteomic data from different sources. MSQuant, which is capable of handling multiple labeling strategies and supports several vendor data formats, is widely used for SILAC proteomics data analysis [
90]. Other tools such as ASAPRatio [
91], XPRESS [
92], MaxQuant [
93], and PVIEW [
94] can also be used for SILAC proteomic data analysis. For data analysis using isobaric labels, Multi-Q [
95], iTracker [
96], IsobariQ [
97], and Libra [
98] are freely available software programs. These programs can import preprocessed MS/MS data from Sequest or Mascot. Another proteomic technique, label-free quantification, is a widely used alternative to label-based approaches. Software tools for label-free quantification, such as Corra [
99], IDEAL-Q [
100], MSQuant [
101], and MaxQuant [
93], also allow the analysis of low-resolution data. However, because of the large dynamic range covered by most complex protein extracts, the biophysical properties of proteins, and posttranslational modifications, the coverage of a proteome is still not comprehensive [
102].
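To make the idea of SILAC-based relative quantification concrete, the following sketch converts paired light and heavy intensities into log2 heavy/light ratios and centres them on the median. This is a simplified illustration, not the algorithm implemented in MSQuant or MaxQuant; median centring assumes that most proteins are unchanged between the two conditions, and the protein names and intensities are made up.

```python
import math
from statistics import median

def normalized_log2_ratios(intensities):
    """Median-centred log2(heavy/light) ratios.

    intensities: dict mapping protein -> (light_intensity, heavy_intensity).
    Proteins with a non-positive intensity in either channel are skipped.
    """
    raw = {protein: math.log2(heavy / light)
           for protein, (light, heavy) in intensities.items()
           if light > 0 and heavy > 0}
    centre = median(raw.values())  # corrects a global mixing bias
    return {protein: ratio - centre for protein, ratio in raw.items()}

# Hypothetical intensities in which the heavy channel is globally doubled.
toy_intensities = {
    "protein_1": (100.0, 400.0),
    "protein_2": (100.0, 200.0),
    "protein_3": (100.0, 100.0),
}
toy_ratios = normalized_log2_ratios(toy_intensities)
for protein, ratio in toy_ratios.items():
    print(protein, ratio)
```

After centring, protein_2 has a ratio of zero (unchanged), while protein_1 and protein_3 appear as twofold up and twofold down, respectively.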
4.4 Combined omics data analysis
Expectations for the future of genomic and proteomic technologies are high. Nonetheless, for transcriptomic and proteomic data to achieve their potential, computational integration must be performed to link together all the information generated. A few computational algorithms and software packages have recently been developed that can use multidimensional experimental data sets for stem cells to construct models and regulatory networks. A general strategy for integrating mRNA and microRNA expression profiles is to perform correlation analysis. First, we can use software to predict the mRNA targets of each differentially expressed miRNA. For each of these miRNAs, we should then perform a statistical test to determine whether the number of predicted target mRNAs that are differentially expressed is higher than that expected by chance (P < 0.01 or P < 0.05). Furthermore, we can perform gene ontology (GO) analysis and network analysis using a variety of bioinformatics databases and software [
103].
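The statistical test described above is not specified further in the text. One common choice, assumed here purely for illustration, is a one-sided hypergeometric test on the overlap between the predicted targets of a miRNA and the differentially expressed mRNAs. All counts in the example are made up.

```python
from math import comb

def hypergeom_upper_tail(overlap, n_genes, n_de, n_targets):
    """P(X >= overlap) under the hypergeometric distribution.

    n_genes:   total number of genes considered (the background)
    n_de:      number of differentially expressed genes in the background
    n_targets: number of predicted targets of the miRNA
    overlap:   number of predicted targets that are differentially expressed
    """
    upper = min(n_de, n_targets)
    numerator = sum(comb(n_de, i) * comb(n_genes - n_de, n_targets - i)
                    for i in range(overlap, upper + 1))
    return numerator / comb(n_genes, n_targets)

# Hypothetical numbers: 20 000 genes, 1000 of them differentially expressed,
# 200 predicted targets, 30 of which are differentially expressed.
toy_p = hypergeom_upper_tail(overlap=30, n_genes=20000, n_de=1000,
                             n_targets=200)
print("enriched at P < 0.01:", toy_p < 0.01)
```

In this toy case about 10 differentially expressed targets would be expected by chance, so an overlap of 30 is called enriched. When many miRNAs are tested, the P values should additionally be corrected for multiple testing.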
Although the quality of proteomic data is not yet as satisfactory as that of transcriptomic data, comprehensive analysis of the two data sets is becoming increasingly common. First, the short reads from the raw transcriptome data and the amino acid sequences from the proteome data must be mapped to the reference genome. Comparison and integration of transcriptomic and proteomic data show that, except for a small number of examples, the two data sets are complementary rather than comparable [
102,
104,
105]. For instance, Liu
et al. found that around 40%–60% of the proteins detected in
S. japonicum were consistent with the transcripts [
104]. The two data sets differ not only because the technologies for omics analysis are imperfect, but also because the presence and quantity of transcripts and their corresponding protein products depend on a series of posttranscriptional regulatory and metabolic processes [
102]. Extensive overlap between the two data sets is therefore not expected. The differences themselves may indicate where posttranscriptional regulation and metabolic processes occur, and may help to clarify the events between transcription and translation in the near future. Moreover, Unwin
et al. compared large numbers of mRNA transcripts with the expression levels of their associated proteins in dynamic systems of primary hematopoietic stem cells and found that the proteome and transcriptome generally change in the same direction [
106].
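A simple way to quantify how well transcript and protein changes agree is sketched below. It computes the Spearman rank correlation of log fold changes for genes measured in both data sets, together with the fraction of genes that change in the same direction. This is an illustrative calculation under assumed inputs, not the method used in the cited studies; the gene names and fold changes are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# Hypothetical log2 fold changes; only genes seen in both sets are compared.
toy_mrna = {"g1": 2.0, "g2": -1.5, "g3": 0.8, "g4": -0.3, "g5": 1.1}
toy_protein = {"g1": 1.2, "g2": -0.4, "g3": 0.1, "g4": 0.5, "g6": 2.0}

shared = [g for g in toy_mrna if g in toy_protein]
rho = spearman([toy_mrna[g] for g in shared],
               [toy_protein[g] for g in shared])
same_direction = sum(1 for g in shared if toy_mrna[g] * toy_protein[g] > 0)
concordance = same_direction / len(shared)
print("shared genes:", shared)
print("Spearman rho:", rho, "direction concordance:", concordance)
```

Genes present in only one data set (g5 and g6 here) are excluded from the comparison, which mirrors the incomplete proteome coverage discussed above.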
Several laboratories have used other omics data sets to perform a comprehensive analysis of stem cell data. To analyze transcriptome and epigenomic data altogether, Xu
et al. developed a classifier based on support vector machines to predict stemness membership genes associated with self-renewal and pluripotency of mESCs [
107]. The Stem Cell Discovery Engine (SCDE), a new platform for analysis of multiomics data, allows users to consistently describe, share, and compare multiomics data at the gene and pathway level [
108].
5 Perspectives
With the continuous development and improvement of experimental techniques and computational methods, omics research on stem cells has progressed substantially. This has been prompted primarily by major breakthroughs in stem cell biology, the potential of stem cells for biomedical application, and the awareness that transcriptomics and proteomics may accelerate this progress further and possibly open as yet unexplored areas of research. At the same time, these achievements bring new and greater challenges. One of the major problems is how to use the existing experimental data more efficiently in high-level analysis. To achieve this, we must eliminate the discrepancies caused by differences between platforms and technologies. Only then can we make useful parallel comparisons of data from different sources. To date, we are no more than halfway to achieving this goal. However, the integration of multilayered omics data is not the end. Our final goal should be the formulation of new hypotheses based on the results of transcriptomic and proteomic data analysis, followed by testing in a low-throughput setup to obtain functional verification. The field will then be able to move ahead more quickly to uncover the characteristics of stem cells, benefiting clinical applications such as stem cell transplantation and alternative therapies (Fig. 2).
Higher Education Press and Springer-Verlag Berlin Heidelberg