1 Introduction
Since being discovered in the late 1970s [
1,
2], mouse lymphocyte antigen-6 (Ly-6) had been regarded as a prototype of the concept of “differentiation antigens,” which predicted that all cells/tissues in a single individual should possess, and could be distinguished by, different cell-surface antigens [
3]. Interestingly, the complexity of the Ly-6 serological pattern implied the existence of “Ly-6 complex” antigens [
4], and subsequent molecular and genomic studies indeed confirmed that these antigens are encoded by multiple genes located in clustered chromosomal regions [
5,
6]. Structurally, all Ly-6 antigens contain an evolutionarily conserved LU domain (named after Ly-6 and urokinase plasminogen activator receptor (uPAR)), which is characteristic of 10 conserved cysteine residues that form disulfide bridges and thereby create a “three-finger” structure [
7]. In addition, many Ly-6 antigens also contain a glycosylphosphatidylinositol (GPI) anchor domain, thus being tethered to the cell membrane [
8], whereas those lacking a GPI anchor domain may act as secreted proteins. According to the information provided by the Human Genome Organization Gene Nomenclature Committee (HGNC) and the Mouse Genomic Nomenclature Committee (MGNC), human and mouse genomes have 35 and 61 LU domain-containing genes, respectively [
6] (updated information on human LU domain-containing genes is available on a specific page of the HGNC website).
Although the Ly-6 antigens were originally identified from lymphocytes, their expression turns out not to be restricted to a single tissue. Indeed, the founding member of the family, mouse Ly-6A (encoded by the
Ly6a gene), has currently been widely known as stem cell antigen-1 (Sca-1) because it has been commonly used as a cell-surface marker of several types of tissue stem cells, including hematopoietic stem cells [
9], mammary stem cells [
10], prostate-regenerating cells [
11], and lung bronchioalveolar stem cells [
12], as well as various cancer stem/initiating cells (for review, see [
13]). However, despite a considerable evolutionary conservation of the Ly-6 antigens between humans and mice, a human homolog(s) to the mouse
Ly6a gene has been paradoxically missing [
6]. In this regard, there are great mysteries why the mouse Ly-6A/Sca-1 has been selected (relative to its homologs) through evolution to be expressed in stem cells, whether humans (used to) have a homolog that may be similarly expressed in the corresponding types of cells, and how this gene(s) exerts biological functions in these cells.
In efforts to study the function of Ly-6A/Sca-1, two genetic Ly-6A/Sca-1 loss-of-function mouse models had been generated, and studies with both models showed that homozygous Ly-6A/Sca-1 null mice are essentially healthy in normal development [
14,
15], indicating that Ly-6A/Sca-1 is not required for the normal generation and functions of the cells that express it. Nevertheless, more careful phenotyping revealed that, under stress conditions (e.g., transplantation, immune stimulation, aging, injury), the Ly-6A/Sca-1 null mice exhibited notable, though sometimes controversial, phenotypes in several types of cells, including hematopoietic stem and progenitor cells [
14,
16,
17], T cells [
14,
18,
19], B cells [
14,
20], bone marrow stromal cells [
21], and skeletal muscle cells [
22,
23]. These observations therefore open perspectives for understanding the potential function of Ly-6A/Sca-1 under stress and disease-like conditions. Indeed, studies with mouse models have led to new observations that Ly-6A/Sca-1 could be potentially associated with cancer-related mechanisms, such as the c-Kit [
16,
17] and TGF-β signaling pathways [
24,
25]. Thus, these findings imply that, although Ly-6A/Sca-1 seems to be nonessential for normal development, it may play an important role in tumorigenesis.
In this study, we performed an exhaustive survey of LU domains in mouse and human genomes and clarified a total of 130 LU domains which belong to 37 human and 64 mouse LU domain-containing genes. Among them, we discovered a previously unannotated human gene that likely encodes the ortholog of mouse Ly-6A/Sca-1, which was determined by the “reciprocal best hit” analysis. We cloned the full-length cDNA of this human LY6A gene from pituitary tumors, in which LY6A is aberrantly overexpressed compared with normal pituitary tissues and may contribute to tumorigenesis. Protein expression experiments validated that the LY6A protein can act the same as mouse Ly-6A/Sca-1 to be properly processed and tethered on the cell membrane through its GPI anchor domain. Interestingly, LY6A largely overlaps with, and is transcribed from the opposite strand of, a known lncRNA gene. The possible mutual regulation between these genes and technical issues in conventional transcriptome analysis may explain why this gene is rarely detected and thus has been “hidden” for decades.
2 Materials and methods
2.1 “Reciprocal best hit” analysis
The “reciprocal best hit” algorithm was used to search and define orthologs between different species as previously described [
26,
27]. In brief, by using the amino acid sequence of the mouse Ly-6A protein as a search query, a “tblastn” analysis was performed against the human genomic sequence. The matched human homologous genomic regions were used as queries to search the mouse nr protein database with “blastx”. Then, the retrieved and original amino acid sequences were compared with “blastp”. If they represented the same protein, the original mouse gene and the matched human homolog(s) were defined as orthologs. The GENSCAN program was used to predict exon/intron structures [
28], which were further validated by mapping RNA-seq reads and molecular cloning.
2.2 Sequence alignment and phylogenetic analysis
Amino acid sequences of the LU domains (or the full-length Ly6a family proteins) were aligned with the ClustalX 2.1 program [
29] with the BLOSUM series matrix; the end gap separation option was turned on. The resulting alignments were manually investigated using the BioEdit program. Based on the alignments, phylogenetic trees were constructed with the Molecular Evolutionary Genetics Analysis program (version 6) [
30] using the neighbor-joining method with 1000 bootstrap replicates.
2.3 RNA-seq and data analyses
Pituitary tumor RNA-seq libraries were generated with mRNA purified with poly(T) oligos using the VAHTS Universal V6 RNA-seq Library Prep Kit for Illumina (Vazyme, NR604-01/02) and sequenced with the HiSeq X Ten or the Novaseq 6000 S4 platform (Illumina). Gene expression profiling of the pituitary tumor samples was performed as described, and our comprehensive analysis of the samples of 180 patients has recently been published [
31]. The RNA-seq data of acute myeloid leukemia (AML) samples from patients and cell lines were derived from previously published studies [
32–
35]. To validate the exon/intron structures, all the RNA-seq reads that were mapped to the genomic loci were retrieved, and each of them was manually examined to be exactly aligned to the genome by adhering to the “GT-AG” rule of the splice site pairs. This analysis allowed us to distinguish the correct transcripts from the opposite strands and exclude any contamination from genomic DNA or nonspecific transcripts. RNA-seq data were retrieved from the Encyclopedia of DNA Elements (ENCODE) project and the Genotype-Tissue Expression (GTEx) project to obtain a comprehensive profile of
LY6A expression in different human tissues. The raw RNA-seq data were searched for
LY6A-specific splicing junction reads with the “blastn” program using the sequences of exons 1–2 (5′-ATGGAGAGAGCTCAGGGTCT-3′) and exons 2–3 (5′-GTCTTTGGCAGTAAAGTGAG-3′) as queries. The numbers of junction reads were normalized to counts per million reads mapped. Gene set enrichment analysis and hierarchical cluster analysis of the RNA-seq data were performed as previously described [
36].
2.4 RT-PCR analysis of gene expression
Total RNAs of various cells/tissues were extracted using MagZol reagent (Magen, R4801), and cDNA was synthesized with HiScript II Q Select RT SuperMix for qPCR (+ gDNA Wiper, Vazyme, R233). PCR was performed with 2× Taq Master Mix (Dye Plus, Vazyme, P112). A 341-bp fragment of the LY6A cDNA was amplified with a pair of primers (5′-CCTGCTGAGCATGGAGAGAG-3′ and 5′-AGAGGAAGACTGACCCCAGG-3′), and a 632-bp full-length LY6A cDNA was amplified with a pair of primers (5′-GTGCAGGACTCATTCGATCC-3′ and 5′-CACATCCTTCCCCTGATCTC-3′). The PCR products were analyzed with agarose gel electrophoresis and further validated by sequencing. As a loading control, GAPDH cDNA was amplified with the following pair of primers: 5′-GGAGCGAGATCCCTCCAAAAT-3′ and 5′-GGCTGTTGTCATACTTCTCATGG-3′.
2.5 Full-length cDNA cloning
A pituitary tumor sample was used for cloning the full-length LY6A cDNA. Total RNA was extracted with TRIzol reagent (Thermo Fisher Scientific, 15596018), and cDNA was synthesized with the PrimeScript RT Reagent Kit with gDNA Eraser (Takara, RR047A). PCR was performed with Ex Taq DNA Polymerase (Takara, RR001B). Full-length LY6A cDNA was amplified by PCR with the following primers: 5′-GTGCAGGACTCATTCGATCC-3′ and 5′-CACATCCTTCCCCTGATCTC-3′. The PCR product was purified with the Axygen AxyPrep PCR Clean-up Kit (Corning, AP-PCR-250), cloned into the T-Vector pMD19 (Takara, 3271), and subjected to sequencing. The sequence of the full-length cDNA was deposited in NCBI GenBank with accession number MW603613.
2.6 Induced expression by interferon-γ (IFN-γ)
Primary pituitary tumor cells were treated with recombinant human IFN-γ (10 ng/mL; Novoprotein, C014) for 24 h. RNA was extracted using MagZol reagent (Magen, R4801), and cDNA was synthesized by HiScript II Q Select RT SuperMix for qPCR (+ gDNA Wiper, Vazyme, R233). TB Green Premix Ex Taq II (Tli RNase H Plus, TaKaRa, RR820) was utilized to perform RT-qPCR with primers 5′-TGAAGACCTTGTCCCTGGTCC-3′ and 5′-GAGAGCTTGTTCTCCCCTCTCA-3′ on the Applied Biosystems ViiA 7 system. GAPDH was analyzed as the internal reference and for normalization with primers 5′-GGAGCGAGATCCCTCCAAAAT-3′ and 5′-GGCTGTTGTCATACTTCTCATGG-3′.
2.7 Protein expression and detection
The cDNAs encoding human LY6A (hLY6A) and mouse Ly-6A (mLy-6A), as well as those with deletion of the GPI anchor domain (ΔGPIa), were cloned into the pLVX-3 × FLAG–Halo vector. HEK293T cells in six-well plates were transfected with the indicated plasmids and cultured for 48 h. The culture medium was replaced with new medium containing 100 nmol/L HaloTag Alexa Fluor 660 Ligand (Promega, G8471) and incubated for 1 h. The cells were washed twice with PBS and lysed with RIPA buffer (Yoche Biotech, YSD0100) containing protease inhibitors (NCM Biotech, P001). The total protein concentrations of whole-cell lysates were analyzed with a BCA protein assay kit (Yoche Biotech, YSD-500T) and measured with a Biotek 800 TS microplate reader at the wavelength of 562 nm. Same amounts of the whole-cell lysates of different samples were loaded with SDS‒PAGE loading buffer (250 mmol/L Tris-HCl pH 6.8, 10% SDS, 50% glycerol, 0.05% bromophenol blue, and 5% β-mercaptoethanol), separated by SDS‒PAGE, and transferred to polyvinylidene difluoride membranes (Millipore, ISEQ10100). The fluorescent signals of the HaloTag Alexa Fluor 660 Ligand on the membranes were analyzed with a LI-COR Odyssey XF Infrared Imaging System. Then, the membranes were subjected to immunoblot analysis with the indicated antibodies and imaged using the Tanon 5200 Chemiluminescent Imaging System.
2.8 Generation of monoclonal stable cell lines
Monoclonal stable cell lines were generated with a lentiviral system. Briefly, HEK293T cells were transfected with lentiviral plasmids and the packaging plasmids, pMD2.G and pSPAX2, respectively. Virus-containing media were collected at 48 and 72 h after transfection, mixed, filtered through 0.45 μm nitrocellulose filters (Millipore, SLH-V033RS), and used to infect HEK293T cells in the presence of 5 μg/mL polybrene (Yeasen, 40804ES76). The infected cells were selected with 2 μg/mL puromycin (Beyotime, ST551). One-hundred cells were plated in 10 cm dish (Corning, 430167) and cultured for 10 days to obtain the monoclonal stable cell lines. When the single cell-derived clones were clearly visible in the dishes, individual clones were picked for expansion to generate monoclonal cell lines. Every cell line was validated for proper protein expression by immunoblot analysis.
2.9 Cell proliferation assays
Trypsin-digested cells were centrifuged and resuspended in culture medium, and cell numbers were counted by the Beckman Coulter Z2 cell counter. Three-thousand cells per well were seeded in 96-well plates, and cell viability was measured with a cell counting kit-8 (CCK-8) kit (APExBIO, K1018) after 0, 24, 48, 72, and 96 h. After addition of the CCK-8 solution, the plate was incubated at 37 °C for 1–4 h, and the absorbance at 450 nm was determined with the BioTek 800 TS microplate reader.
3 Results
3.1 Systematic survey and comparative genomic analysis of human and mouse LU domain-containing genes
A multiple-round, exhaustive “blast” search of human and mouse LU domains was performed against the nonredundant (nr) protein and genome databases to identify full-length proteins/genes and individual LU domains. First, for the annotated protein sequences, we identified 36 human and 64 mouse LU domain-containing proteins (Tab.1), which are more than those provided by the HGNC and MGNC databases. Second, many of these proteins contain multiple LU domains; thus, we researched the individual LU domains in these proteins by elaborate homology analysis. We recognized 129 unique LU domains, including many previously unrecognized domains (Tab.1 and Fig.1–1C). For example, homology analysis demonstrated a complicated arrangement of multiple LU domains in mouse and human CD177 proteins (Fig.1), and indeed, they were clarified to contain 8 and 4 LU domains, respectively (Fig.1). Third, we found several matched hits located in genomic regions (e.g., human chromosome 8q24.3) where no annotated protein-coding gene resides; this notion may suggest the existence of novel genes or pseudogenes that lost their open reading frames (ORFs) (see below for further analysis).
Because some of the LU domain-containing proteins have multiple LU domains and nonhomologous regions, they are not readily applicable for direct sequence alignment and evolutionary analysis. Therefore, we performed a phylogenetic analysis with all the individual LU domains (Fig.1). The results revealed that, while most of the human and mouse genes can form clear orthologous groups (though several gene expansion events occur within the groups), mouse
Ly6a is classified in an exceptional mouse-specific cluster without a human ortholog (Fig.1). In the mouse genome, the
Ly6a cluster resides in a well-defined region in chromosome 15qD3 that contains several other
Ly-6 family genes, and this region is syntenic with a region of human chromosome 8q24.3 (Fig.2). Immediately adjacent to the mouse
Ly6a cluster are
Ly6e and
Ly6l, both of which clearly have their human orthologs and, according to the phylogenetic relationships, neither could serve as a parental gene of the
Ly6a cluster through evolution. Notably, however, in the human genomic region corresponding to the
Ly6a cluster, there is a single annotated human-specific gene named
C8orf31 (also known as
LINC02904) (Fig.2), which is currently considered as a lncRNA gene and thus a possible regulator of gene expression [
37].
3.2 Identification of the missing human LY6A gene
As our LU domain searching revealed several novel hits residing in the syntenic 8q24.3 region of the human genome, we set out to determine whether this region may harbor unannotated gene(s) homologous to the mouse
Ly6a cluster genes and whether the human
C8orf3/
LINC02904 gene has any possible relevance. We used the “reciprocal best hit” approach [
27,
38] to reanalyze this ~63.5 kb region, followed by validation of the precise splice sites, molecular cloning, and structural analysis of the candidate homolog gene(s) (Fig.2). As a result, we first verified three regions of the human genomic sequence whose encoded amino acid sequences showed considerable homology with the LU domain of mouse Ly-6A, especially in the conserved cysteines (Fig.2). Two of these regions lying ~1.3 kb apart matched successive parts of the query Ly-6A protein, ranging from the very N-terminus of the LU domain to a part of the GPI anchor domain; the third region, being located ~17.4 kb away, matched the N-terminal part of the LU domain with relatively lower homology (Fig.2). In combination with GENSCAN exon‒intron prediction, the results suggest that two putative genes may be located within this region, and each of them contains notable potential exons. Interestingly, putative gene #1 resides in the
C8orf3/
LINC02904 gene locus but is transcribed from the opposite strand, which implies an unusual regulatory pattern.
We reasoned that the exon‒intron structures of these genes could be clarified by analyzing the sequences of specific reads in RNA-seq data. This approach could also provide convincing evidence for the expression of these genes. To this end, we analyzed our RNA-seq raw data of AML and pituitary tumor samples. From each sample, we first retrieved all the reads that were mapped to these loci and then performed a manual examination of how individual reads were exactly aligned to the genome (Fig.2). Notably, only the reads spanning spliced exon‒exon junctions were applicable for this purpose, because (1) these reads could exclude contaminations from genomic DNA or nonspecific transcripts and (2) their transcription directions could be determined by sticking to the “GT-AG” rule of the splice site pairs, which was important to distinguish the transcripts from opposite strands (Fig.2). As a result, we found that putative gene #1 was properly spliced in some pituitary tumors (though not in the AML samples), whereas putative gene #2 was not spliced in any sample. Furthermore, detailed analysis of the spliced reads revealed that, while the C8orf3/LINC02904 gene has 10 exons (including alternative exons) and produces multiple splicing isoforms, putative gene #1 contains 3 exons that jointly form a major ORF (Fig.2). Therefore, putative gene #1 was likely the only active Ly6a homologous gene within this region and was thus named human LY6A.
3.3 Full-length cloning, interferon induction, and gene expression profiling of LY6A
We then set out to clone the full-length cDNAs of LY6A from pituitary tumor samples whose RNA-seq data have already been validated for LY6A expression. Given that the three exons of LY6A largely overlap with those of C8orf3/LINC02904, we found that in many conditions, the PCR amplicons contained both LY6A and C8orf3/LINC02904 cDNAs, but they could eventually be distinguishable according to the splicing patterns. Only the use of LY6A-specific primers, which must be located in very limited regions (e.g., the 5′ end of exon 3 of LY6A), could produce specific LY6A cDNAs (Fig.2). This analysis showed that LY6A was expressed in pituitary tumors but not in AML cell lines (Fig.3). Full-length LY6A cDNA was cloned from a pituitary tumor, and the sequence was deposited in NCBI GenBank with accession number MW603613.
An important characteristic of the expression of mouse
Ly6a is that it can be induced by IFNs [
39,
40]. We therefore tested whether human LY6A also possesses this characteristic. We found that treating pituitary tumor cells with IFN-γ indeed remarkably increased
LY6A expression (Fig.3). Data mining and RT-PCR validation showed that
LY6A expression could also be upregulated by IFN-γ in other cell types, such as the stomach carcinoma cell line MKN-1 (data not shown). These results indicate the conserved transcriptional regulatory mechanism of
LY6A between mice and humans.
Direct sequencing of the
LY6A-specific PCR amplicons validated the splice sites (exons 1–2 and 2–3) of
LY6A (Fig.3). Using these splice junction site sequences as queries, we were able to obtain an overall expression profile of
LY6A in searching the ENCODE [
41] and GTEx databases [
42]. The results showed that
LY6A has relatively higher expression in the spleen, testis, and subtypes of adipose tissues (Fig.3), which appeared distinct from the expression patterns of other LU domain-containing genes in human tissues (Fig.3). These patterns were validated by our RT-PCR analysis of several collected human tissues (Fig.3 and data not shown). Furthermore, we obtained a publicly available data set of RNA-seq analysis of human spleen tissues [
43], which includes the sequencing results of multiple samples of the total RNA, polyA RNA, and small RNA of spleen tissue, splenic endothelial cells, and splenic fibroblasts. Our reanalysis of this data set revealed that
LY6A-specific reads were detected in the total RNA and polyA RNA (but not small RNA) of the total spleen tissue. By contrast, no
LY6A-specific read was detected in the samples of splenic endothelial cells and fibroblasts (Fig.3). Therefore, these results further support the notion that the
LY6A gene can indeed be actively transcribed into polyadenylated mRNA within spleen tissue, whereas the absence of
LY6A mRNA from splenic endothelial cells and fibroblasts may imply cell-type specificity.
3.4 Aberrant LY6A overexpression in pituitary tumors and its potential relevance to tumor growth advantage
The gene expression profiles suggest that
LY6A might have very low or no expression in normal pituitary tissues (Fig.3). This result was confirmed by our RT-PCR experiments, in which, for example,
LY6A could be clearly amplified from the spleen and a subtype of adipose tissue but was not detectable in two pituitary tissue samples (Fig.3). However,
LY6A was found to be remarkably expressed in many pituitary tumors (Fig.2 and data not shown). In particular, our RNA-seq analysis of pituitary tumor samples revealed that 3 out of 5 patients with Cushing’s disease had relatively high
LY6A expression levels (Fig.4). Importantly, gene set enrichment analysis and unsupervised hierarchical cluster analyses of these tumor samples showed that tumors with high
LY6A expression possess transcriptomic features of higher energy metabolism, proliferation, and neuron synapses and lower innate immune response (Fig.4 and 4C). This aberrant
LY6A overexpression in human pituitary tumors is consistent with a recent study with a mouse pituitary tumor model, in which Ly-6A/Sca-1
+ cells showed tumor growth advantages [
44]. Thus, these findings collectively suggest the possible functional contributions of
LY6A to tumorigenesis.
3.5 Evolutionary conservation of the LY6A-encoded protein sequence, domain architecture, and exon‒intron structure and identification of a high-frequency nonsynonymous coding single-nucleotide polymorphism (SNP)
We then set out to analyze whether LY6A encodes a functional protein. Alignment of the encoded protein by LY6A with those of the mouse Ly6a cluster showed substantial conservation among these family members, including the conserved cysteines and the three functional domains (i.e., signal peptide, LU domain, and GPIa domain; Fig.5). Moreover, the residue positions corresponding to the exon‒exon junctions of human LY6A are also well conserved compared with all the mouse Ly6a cluster genes (Fig.5). Furthermore, the phylogenetic analysis of the LY6A protein showed that it is clearly classified into a clade of the Ly6a cluster (Fig.5). Therefore, in combination with the abovementioned “reciprocal best hit” analysis and the conserved IFN induction of expression, these results collectively suggest that the human LY6A gene may represent not only a common ancestor of the mouse Ly6a cluster genes, but also the gene encoding the candidate ortholog of mouse Ly-6A/Sca-1 protein.
Interestingly, in our patients with pituitary tumor, we found a genetic variation of LY6A: two alleles of LY6A differ in a single nucleotide that changes the amino acid residue 112 from glycine (G) to serine (S) (Fig.5 and 5C). Although this variation has been annotated as a noncoding SNP (accession number rs11136300) in dbSNP, it should now be defined as a nonsynonymous coding SNP according to its ability to change the encoded amino acid sequence. Moreover, the two alleles of this SNP have relatively high frequencies (overall, 64% of the “G” allele and 36% of the “S” allele), and their frequencies seem to be remarkably different between races of people (Fig.5 and data not shown), implying the potential role of LY6A in human evolution.
3.6 Proper expression, processing, and membrane tethering of the LY6A protein
Biosynthesis of GPI-anchored proteins requires complicated, tightly regulated mechanisms, including transmembrane, signal peptide cleavage, posttranslational modification, and processing in the endoplasmic reticulum. We expressed LY6A in HEK293T cells with insertions of two tags, namely, a 3×FLAG tag for detecting the total protein and a HaloTag for detecting its membrane tethering (as the HaloTag could be labeled by a cell-impermeant, fluorescent ligand), to validate its proper protein expression, processing, and GPI anchoring outside of the cell membrane (Fig.6). Mouse Ly6a was used as the positive control. The results of immunoblot and fluorescent imaging analyses showed that the human and mouse proteins can be properly expressed, processed, and tethered to the membrane and that their membrane tethering is definitely dependent on their GPIa domains (Fig.6 and 6B). To evaluate the effect of the enforced LY6A expression on cell proliferation, we generated multiple monoclonal HEK293T cell lines that stably express LY6A and used cell lines expressing the empty vector gene or mouse Ly6a as the controls (Fig.6). As a result, the cell lines expressing human LY6A exhibited slightly increased proliferation compared with the empty vector control cell lines (Fig.6). Notably, the HEK293T cell lines expressing mouse Ly6a did not show this increase in proliferation probably because of the interspecies differences between humans and mice. Together, the results indicate that the human LY6A gene identified herein is complete and intact in its gene/protein structure; its tightly regulated transcription, protein biosynthesis, and trafficking; and its residing in the proper place to exert its biological functions.
4 Discussion
In this study, we identified and cloned the long-missing human
LY6A gene. Concrete evidence of the existence and identity of this gene includes (1) the syntenic relationship of the gene clusters on human and mouse chromosomes; (2) the shared IFN-inducible expression; (3) the similar exon‒intron structures with precise splice consensus sequences; (4) the high homology of the encoded proteins, especially the conserved functional domains/residues; (5) the well-supported phylogenetic relationships; and (6) the proper protein expression, processing, and membrane-tethering. It is surprising that a novel human gene encoding the candidate ortholog of mouse Ly-6A/Sca-1 was discovered over 20 years after the complete sequencing of the human genome [
45,
46] and over 40 years after the first identification of the mouse Ly-6 antigens with great interest [
1,
2]. This notion may reflect that our understanding of the human genome and its function and regulation is still far from complete. Indeed, the present analysis of
LY6A has raised several interesting questions on the evolution, expression, and function of the
Ly-6 family genes, as well as some technical problems in the widely used transcriptome analysis technologies, which may miss some “hidden” genes.
The evolutionary origin of
LY6A was investigated by the “reciprocal best hit” algorithm, a definitive method for the prediction of one-to-one orthologs [
38]. The results showed that human
LY6A can only be reciprocally matched with mouse
Ly6a but not with any other
Ly-6 family members, thus defining the one-to-one orthologous relationship between the human
LY6A and mouse
Ly6a genes. Notably, in combination with our phylogenetic analysis, these findings suggest two interesting notions: (1) the human
LY6A gene represents a common ancestor of a cluster of expanded mouse homologous genes, and within this cluster, the mouse
Ly6a gene seems to be the most evolutionarily conserved (thus possibly the “oldest”) member; and (2) the tendency that the “older”
Ly6a and its “younger” homologs play their roles in stem cells versus differentiated progeny cells, respectively, may reflect the “Evo-Devo” discipline as a coordinative programming between evolution and development [
47]. These notions thus imply the possible evolutionary history of the
Ly-6 family genes that could be relevant to their functions.
Notably, compared with the overall high-level evolutionary conservation between humans and mice, the Ly-6 family genes are relatively divergent. This divergence is reflected by the existence of different gene numbers in humans and mice (37 vs. 64; including the human LY6A identified herein), as well as their possibly different expression patterns. The underlying mechanistic prerequisites for these phenomena may include the following: (1) the emergence of many of these genes during evolution was actually posterior to the specification of the cells that currently express them; and (2) these genes individually might not be essential for the survival and basic functions of the cells. These prerequisites may allow the (re)distribution or functional replacement of these genes in different cell types through evolution and thus lead to divergence in gene numbers and expression patterns.
How is
LY6A expression regulated? Based on current knowledge, the expression specificity of
Ly-6 family genes falls into two categories: cell/tissue specificity and cytokine inducibility. Notably, mechanistic analysis of the
cis-regulatory elements of the mouse
Ly6a gene showed that the DNase I hypersensitive sites responsible for IFN induction largely overlap with those for its natural expression in specific cells [
48]. Therefore, the conserved IFN-induced expression of
LY6A may suggest that the human and mouse genes share transcriptional regulatory mechanisms and that an evolutionary analysis of these mechanisms will be helpful to understand how the specific expression patterns of these genes are regulated.
Another interesting question is how the dual genes (i.e.,
LY6A and
C8orf3/
LINC02904) originate and cooperatively evolve. Considering the long evolutionary history of the
Ly-6 gene family,
LY6A likely emerged much earlier than
C8orf3/
LINC02904. Subsequently,
C8orf3/
LINC02904 seems to dominate their competition, whereas
LY6A still doggedly survives. However, despite such competition, they must withstand selective pressure together, because their opposite overlapping pattern indicates that many nucleotides of the “noncoding”
C8orf3/
LINC02904 are in fact selected based on the “coding”
LY6A [
49]. Indeed, although thousands of overlapping gene pairs (including those on the same and opposite strands) have been observed in mammalian genomes, a gene pair overlapping with the majority of their exonic sequences is rare [
50]. Thus, the identification of the human
LY6A gene provides a rare and interesting model gene for evolutionary studies.
A unique feature of the human
LY6A gene is that it is well “hidden” within the noncoding RNA gene
C8orf3/
LINC02904, and this situation may provide possible explanations for why human
LY6A has been missed for decades: from a mechanistic perspective,
LY6A expression may be specifically repressed by
C8orf3/
LINC02904 at the transcriptional and posttranscriptional levels [
51], and, technically, the short reads produced by next-generation sequencing may be misreported as the transcripts of
C8orf3/
LINC02904 but not the currently unannotated
LY6A. Indeed, our analyses clearly indicate that only a spliced read that spans an intron ended by splice donor and acceptor sites can be defined as a
LY6A transcript, whereas the nonspliced reads cannot be defined correctly at all; therefore, using a regular standard (e.g., FPKM, RPKM) to evaluate the expression levels of
LY6A or
C8orf3/
LINC02904 is not reliable.
Lessons from this study reflect several technical issues in transcriptome analysis technologies that may lead to data misinterpretation and missing “hidden” genes. First, because the current method for sequence alignment usually does not consider the “GT-AG” rule of the splice sites, some reads could be split incorrectly, leading to the misalignment of exons. Therefore, the findings suggest that sequence alignment methods could be optimized at least in some settings to involve the algorithms that recognize splicing site pairs (as well as other gene features), as did the GENSCAN program for accurate gene prediction [
28]. Second, for genomic regions containing overlapping genes (which are not rare if considering many noncoding transcripts), unsplit reads should be used with caution for evaluating gene expression levels because distinguishing which genes they are derived from is impossible. In this situation, what is suggested is to use only split reads that fit the “GT-AG” rule or to consider full-length sequencing approaches as optional methods. Third, because the majority of the exonic regions of the
LY6A and
C8orf3/
LINC02904 mRNAs overlap, they can form a stable double-stranded RNA “duplex”. It has been known that stable double-stranded or highly structured RNAs may affect the efficiency of cDNA synthesis (e.g., through reverse transcriptase drop-off), adapter ligation, and PCR and thus affect the analysis of the quantity and integrity of RNAs in regular experimental conditions [
52]. Therefore, optimized methods need to be developed to address these problems and to achieve a faithful profile of the vast variety of RNAs in cells [
53,
54].
Finally, the identification of human
LY6A may provide a new biomarker for cancer. As many
Ly-6 family genes show highly specific expression patterns in a cellular differentiation stage- and lineage-specific manner [
6], they are commonly used as cell-surface markers for the isolation of specific cell populations [
13]. In cancer, abnormal expression patterns of these genes are frequently observed and associated with prognosis, but the underlying mechanisms remain unclear [
55]. In this regard, the identification of human
LY6A may provide a new biomarker for these purposes. Although
LY6A was not detectable in normal pituitary tissues, its expression in pituitary tumors and apparent correlation with tumor-promoting transcriptomic features suggest that
LY6A could be repressed in normal pituitary tissues but derepressed through a certain mechanism, and might play a functional role, during tumorigenesis.