Introduction
Microarray technology, introduced in the 1990s, enables researchers to measure the expression levels of thousands of genes simultaneously. In 1999, Golub et al. [
1] successfully applied microarray technology in a leukemia study. In the same year, Alon et al. [
2] used microarray technology in a study of colon cancer. Since then, microarray technology has become a standard tool in cancer studies, and almost all common cancers have been profiled [
1-
7]. A typical microarray experiment contains several dozen to hundreds of samples, and each sample has expression values for thousands of genes, so researchers face a situation in which the number of features is far larger than the sample size. This characteristic increases the difficulty of microarray data analysis. Over the past decade, pattern recognition methods have been applied or developed to address these problems. An overview of these methods will not only help us understand the state of the art in this area, but may also inspire better ideas for solving these problems.
In this review, we will focus on the microarray data analysis methods used in oncology studies. To interpret the microarray data analysis problem more easily, we formulate the problem as follows.
A microarray dataset can be described as a data matrix X of size N×M, where M is the number of samples and N is the number of features/genes. Each column represents a sample Si, i ∈ {1, 2, …, M}, and each row represents a gene Gi, i ∈ {1, 2, …, N}. For most problems there is also a vector Y that contains the clinical information of each sample. The clinical information can be either discrete (e.g., normal vs. cancer, or different subtypes of cancer) or continuous (e.g., survival time).
To avoid the effect of noise and shorten the computation time, researchers usually perform preliminary gene filtering before the analysis to remove non-informative genes. There are four commonly used criteria for defining non-informative genes. First, the fold change (defined as the ratio between the expression value in treated samples and that in control samples) is less than a pre-determined cutoff value; here the treated samples can be a group treated with a particular drug or a group of cancer patients, and the control samples are a different, untreated or non-cancerous group. Second, the percentage of missing data is higher than a pre-determined cutoff value. Third, the percentage of present calls is less than a pre-determined cutoff value. Last, the standard deviation is less than a pre-determined cutoff value. The choice of cutoff value is usually arbitrary, and different studies may use different combinations of criteria. For example, van’t Veer et al. filtered the genes by requiring that an informative gene show at least a two-fold difference and a p-value of less than 0.01 in more than 5 tumors [
3]. Kapp et al. removed the genes which had a standard deviation of less than 1.5 and had less than 90% of data present [
8].
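As a rough illustration of such filtering (not taken from any of the cited studies), the following Python sketch applies hypothetical versions of these criteria with numpy; the data and all cutoff values are arbitrary placeholders.

```python
import numpy as np

# Toy stand-in for a genes x samples matrix of log2 ratios, with missing values as NaN.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 60))
X[rng.random(X.shape) < 0.05] = np.nan

fold_ok    = np.nanmax(np.abs(X), axis=1) >= 1.0   # at least a two-fold change (log2 scale) somewhere
missing_ok = np.isnan(X).mean(axis=1) <= 0.10      # no more than 10% missing values
spread_ok  = np.nanstd(X, axis=1) >= 0.5           # enough variation across samples
keep = fold_ok & missing_ok & spread_ok

X_filtered = X[keep]                               # retain only the informative genes
```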
Based on whether clinical information is incorporated in the data analysis scheme, the data analysis methods can be separated into two categories: unsupervised analyses, which use only the expression data, and supervised analyses, which also utilize information beyond the expression data (typically the clinical labels). We will discuss these two categories separately.
Unsupervised analysis methods
The main purpose of this type of method is to aid class discovery, where the classes of interest may be classes of samples as well as classes of genes. One of the most commonly used unsupervised methods in microarray data analysis is hierarchical clustering [
9-
11]. Because this type of analysis clusters genes and samples step by step and generates a clustering tree that indicates the distance between genes or samples, it is often used as the first step when analyzing data, giving researchers an overall view of the dataset. For clustering methods, there are three key issues: the distance measure, the optimization function, and the evaluation of clusters. Different choices may require different solutions and may lead to different discoveries.
Distance measurement
Different genes have very different scales of expression values, so measuring distance directly in Euclidean space carries little biological meaning. In biological studies, it is more meaningful to ask whether two genes or samples share the same expression pattern (e.g., two genes that are consistently high or low in the same samples may have a similar function). For this reason, the Pearson correlation coefficient is commonly used as a distance measure [
9,
12]. Although measuring distance directly in Euclidean space may not be appropriate, standardizing the data onto the same scale or using ratio values makes such measurements more reasonable. Instead of measuring Euclidean distance on microarray data, Nilsson et al. used the Isomap algorithm to measure the geodesic distance between two points [
13], i.e., the length of the shortest path between them. Recently, with more and more gene annotation information available, Boratyn et al. proposed measuring distance based on both expression data and biological data, the latter describing the biological function of each gene [
14]. The new measure is the sum of the distance computed from expression data and the distance computed from biological data. The distance based on biological data is defined by three criteria: the distance between two genes with similar functions is smaller than that between two genes with different functions; the distance between an annotated and an un-annotated gene is smaller than that between two un-annotated genes; and the distance between two un-annotated genes is smaller than that between two genes with different functions. Although the correct answer is unknown in most oncology studies, these improvements in distance measurement provide interesting applications.
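A minimal sketch of correlation-based hierarchical clustering, assuming Python with scipy and toy data; the cited studies differ in details such as linkage choice and centering.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))              # genes x samples

# 1 - Pearson correlation as the pairwise distance between genes
dist = pdist(X, metric='correlation')
tree = linkage(dist, method='average')      # average-linkage clustering tree
gene_clusters = fcluster(tree, t=5, criterion='maxclust')   # cut the tree into 5 gene clusters
```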
Optimization function
Almost all clustering problems reduce to optimization problems. The aim is to maximize the distance between different clusters while minimizing the distance within each cluster. To this end, various optimization functions have been proposed, along with procedures for solving them. Bagirov et al. designed an optimization function that finds the cluster centroids by minimizing the sum of the deviations of all tissue samples from their nearest centroids [
15]:
f(C1, C2, …, Cq) = Σ_{j=1..M} min_{s=1..q} ‖Sj − Cs‖²,
where q is the total number of clusters and C1, C2, …, Cq are the cluster centroids. This optimization problem was solved by iteratively adding one centroid at a time. Gao et al. proposed using the sparse non-negative matrix factorization (NMF) method for clustering [
16]. In this method, the data matrix X is factored into the product of two non-negative matrices W of size N×k and H of size k×M, where k is the number of clusters. Given a sparseness control parameter λ, the optimization function combines the factorization error ‖X − WH‖² with a λ-weighted penalty that enforces sparseness in H, and it can be minimized iteratively. After H has been computed, sample j is placed in cluster i if hij is the largest entry in column j.
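The cluster-assignment step can be sketched as follows, using scikit-learn's plain (non-sparse) NMF as a stand-in for the sparse variant used by Gao et al.; the data and the number of clusters are toy assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((2000, 40))                  # non-negative genes x samples matrix
k = 3                                       # assumed number of clusters

model = NMF(n_components=k, init='nndsvd', max_iter=500, random_state=0)
W = model.fit_transform(X)                  # N x k basis matrix
H = model.components_                       # k x M coefficient matrix

clusters = H.argmax(axis=0)                 # sample j joins the cluster i with the largest h_ij
```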
Sometimes other background information is available. Although such information is not sufficient for a supervised analysis, it can help the clustering process if it is used appropriately. Sese et al. added additional features to each sample; the optimization function then maximizes the interclass variance under the restriction that only splits explainable by a common feature are allowed [
17]. Dotan-Cohen et al. proposed a hierarchical tree snipping method which incorporates the gene annotation information from Gene Ontology (GO) to obtain clusters that are substantially enriched in genes participating in the same biological process [
18].
Since a gene can function in several pathways and a sample can have multiple clinical statuses, restricting each sample or gene to a single cluster may lead to a loss of information. Fuzzy clustering methods have been applied in this area (e.g., Belacel et al. successfully applied fuzzy
K-means and fuzzy
J-means clustering to breast cancer data [
19]).
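A bare-bones fuzzy c-means sketch in plain numpy, shown as a generic formulation rather than the fuzzy K-means/J-means implementations of Belacel et al.; the data, fuzzifier m, and cluster count are assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """X: samples x features. Returns the soft membership matrix U (samples x c) and the centers."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)                    # each sample's memberships sum to one
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                            # avoid division by zero
        U_new = d ** (-2.0 / (m - 1.0))
        U_new /= U_new.sum(axis=1, keepdims=True)        # standard fuzzy membership update
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, centers

U, centers = fuzzy_c_means(np.random.default_rng(1).normal(size=(60, 50)), c=3)
```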
The optimization functions discussed above consider either genes or samples at a time, but sometimes genes and samples need to be considered simultaneously. The coupled two-way clustering (CTWC) method introduced by Getz et al. addresses this issue [
20,
21]. The method performs clustering iteratively: it first clusters genes and samples separately, and then repeatedly clusters the sub-matrices defined by pairs of clusters from the gene and sample dimensions. Kluger et al. applied spectral bi-clustering methods to tackle this problem [
22]. They model the expression of each gene as xij = eij + ρi + χj, where eij is the base expression level for gene i in sample j, ρi is the tendency of gene i to be expressed in all samples, and χj is the tendency of all genes to be expressed in sample j. So the objective in the simultaneous clustering of genes and samples is, given the data matrix, to find the underlying block structure of E.
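scikit-learn ships a spectral biclustering implementation that can serve as a rough stand-in for this kind of simultaneous clustering; the toy data and block counts below are assumptions, and the details differ from Kluger et al.'s method.

```python
import numpy as np
from sklearn.cluster import SpectralBiclustering

rng = np.random.default_rng(0)
X = rng.random((200, 40)) + 0.1            # genes x samples, strictly positive values

model = SpectralBiclustering(n_clusters=(3, 2), method='log', random_state=0)
model.fit(X)
gene_blocks = model.row_labels_            # block label for each gene
sample_blocks = model.column_labels_       # block label for each sample
```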
Evaluation of clusters
Since there is no gold standard for unsupervised problems, it is hard to evaluate clustering results, especially when the number of clusters is unknown. Usually, researchers choose the number of clusters based on their knowledge of the data, or they try several possible numbers and select those that yield obvious patterns. Such choices are commonly made for hierarchical clustering,
K-means clustering, self-organizing map (SOM), etc. [
9-
12,
23-
25]. Currently, there are many methods that focus on estimating the number of clusters and evaluating the clusters. Some of them work in an iterative way. For example, Hsu et al. initially separated the samples into
k classes, identified the differentially expressed genes (i.e., genes whose expression values have different distributions in the different classes), and built a classifier to predict the label of each sample. Finally, they iteratively refined the clustering results [
26]. This method can both suggest the number of clusters and generate stable clusters. Li et al. also provided a program called SamCluster to iteratively discover stable clusters [
27].
Some of the evaluation methods are based on re-sampling [
28-
32]. The basic idea of these methods is to re-sample several datasets from the original data, obtain the clustering results for each, and then evaluate the stability of the clustering with some statistic. Instead of a re-sampling technique, Swift et al. used a set of different clustering methods to generate clusters and evaluate their stability [
33].
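A minimal resampling-stability check, assuming Python with scikit-learn; the statistic used here (adjusted Rand index against a reference clustering) is only one of several possibilities explored in the cited methods.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))                         # samples x genes
k = 3
reference = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

scores = []
for b in range(50):                                    # 50 resampled datasets
    idx = rng.choice(len(X), size=len(X), replace=True)
    labels = KMeans(n_clusters=k, n_init=10, random_state=b).fit_predict(X[idx])
    scores.append(adjusted_rand_score(reference[idx], labels))

print(np.mean(scores))                                 # high average agreement suggests stable clusters
```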
Besides clustering methods, some unsupervised methods are used in oncology studies to reduce the dimensionality of the data, such as principal component analysis (PCA), singular value decomposition (SVD), and independent component analysis (ICA) [
12,
34-
37].
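For example, projecting the samples onto the first two principal components takes only a few lines; this is a generic sketch on toy data, not a reproduction of any cited analysis.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 40))            # genes x samples

pca = PCA(n_components=2)
sample_coords = pca.fit_transform(X.T)     # each of the 40 samples mapped to 2 dimensions
print(pca.explained_variance_ratio_)       # share of variance captured by each component
```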
Supervised analysis methods
In microarray studies, the number of features/genes is almost always far larger than the sample size, so reducing the feature dimension is usually the first step in a supervised analysis. In this section, we review the dimension reduction methods and the subsequent analysis methods separately.
Dimension reduction
Reducing the feature dimension not only avoids over-fitting in model building, but also saves computation time and makes the data easier to interpret. Feature selection is a commonly used approach; its purpose is to remove the less informative features and keep the more informative ones. The most common way to judge whether a gene is informative in microarray studies is to test whether it is differentially expressed between classes. Feature selection methods can usually be separated into two categories: filter methods and wrapper methods.
Filter methods evaluate each feature based only on the data. In most cases, the features are ranked by some statistic and, given a cutoff value, the top-ranked features are selected. If we assume that the expression values of a gene in two classes follow two Gaussian distributions with different means, the t statistic can be used to evaluate how well each gene discriminates the two classes [
38-
42]. Some other statistical methods are used for selecting differentially expressed genes, such as Wilcoxon rank test [
41-
43] or F-test [
44]. Besides these statistical tests, several other measures are applied in the feature selection process: the signal-to-noise statistic, defined as (μ1i − μ2i)/(σ1i + σ2i), where μki is the mean value of gene i in class k and σki is its standard deviation in class k [1,45,46]; the ratio of the between-class sum of squares to the within-class sum of squares [46]; the correlation between the gene expression Gi and the class label Y [42,47]; and entropy-based measures [48]. Because most filter methods consider only one gene at a time, they fail to capture the combined effect of several genes. Bø et al. proposed a method to evaluate the discriminating ability of a pair of genes [
49].
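A filter-style ranking with Welch's t statistic can be sketched as follows; the toy data and the cutoff of 100 genes are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 40))            # genes x samples
y = np.array([0] * 20 + [1] * 20)          # class label per sample

# per-gene t statistics and p-values comparing the two classes
t, p = stats.ttest_ind(X[:, y == 0], X[:, y == 1], axis=1, equal_var=False)
top_genes = np.argsort(p)[:100]            # keep the 100 genes with the smallest p-values
```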
Wrapper methods embed feature selection in the model optimization process. Usually, the performance of the selected features is evaluated by the classification error rate. Since feature selection is an NP-hard problem and the number of features in microarray studies is very large, many optimization methods have been applied to this problem. Support vector machine recursive feature elimination (SVM-RFE) [
50], recursive-SVM [
51], and entropy-based recursive feature elimination (E-RFE) [
52] all use similar search strategies to reach a near-optimal result. In each iteration, the classifier is trained on the remaining features, the features are ranked according to some criterion, and one or more of the worst-performing features are eliminated; this process is repeated until a stopping criterion is reached. Heuristic algorithms such as sequential forward selection (SFS), sequential forward floating selection (SFFS), genetic algorithms (GA), and evolutionary algorithms (EA) have all been combined with different classifiers to select features, for example: SFS with Fisher’s linear discriminant analysis (LDA) [
53], SFS with IB1, Naïve Bayes, C4.5 and CN2 [
54], SFS, SFFS with LDA, logistic regression and SVM [
55], GA with SVM [
56,
57], GA with
k-nearest neighbor (
k-NN) [
58], GA with a maximum likelihood classification method [
59], EA with
k-NN [
60,
61], EA with LDA [
61].
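A wrapper-style selection in the spirit of SVM-RFE can be sketched with scikit-learn's generic RFE wrapped around a linear SVM; the toy data, step size, and number of retained genes are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2000))            # samples x genes
y = rng.integers(0, 2, size=40)

# eliminate the 10% lowest-ranked genes per iteration until 20 remain
rfe = RFE(estimator=SVC(kernel='linear'), n_features_to_select=20, step=0.1)
rfe.fit(X, y)
selected_genes = np.flatnonzero(rfe.support_)
```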
Besides the filter and wrapper methods, there are feature selection methods that rank features during the model training process and select the important ones; Saeys et al. called these embedded methods [
62]. Krishnapuram et al. used a sparse Bayesian approach to model the problem, and then a set of important features were selected based on the weight of each feature [
63]. Also, BLogReg [
64] adopted a similar approach to this problem.
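An embedded-style selection can be sketched with an L1-penalized logistic regression, in which genes with non-zero weights are retained; this is a generic stand-in, not the sparse Bayesian model or BLogReg themselves, and the regularization strength is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2000))            # samples x genes
y = rng.integers(0, 2, size=40)

clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
clf.fit(X, y)
selected_genes = np.flatnonzero(clf.coef_[0])   # genes surviving the sparsity penalty
```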
Similar to the PCA approach, partial least squares (PLS) has also been used for dimension reduction when sample labels are available [
65,
66].
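PLS-based dimension reduction can be sketched with scikit-learn; the toy data and the number of components are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2000))            # samples x genes
y = rng.integers(0, 2, size=40)            # class labels used to guide the projection

pls = PLSRegression(n_components=3)
pls.fit(X, y)
components = pls.transform(X)              # 3 label-informed components per sample
```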
Class prediction
Class prediction problems for microarray data are similar to class prediction problems in other areas, so most of the classic class prediction methods have been applied to microarray data analysis: LDA projects the samples onto a one-dimensional space that maximizes the distance between classes while minimizing the distance within each class [
46];
k-NN predicts the test sample based on the class labels of the
k samples near the test sample [
37,
45,
46,
58,
60]; a decision tree is a classifier with a tree structure in which each node either assigns the label of the test sample or specifies a test that selects which sub-tree to follow [
67]; SVM aims to find an optimal hyper-plane that separates the two classes while maximizing the distance between the hyper-plane and the data points closest to it [
24,
47,
50-
52,
56,
57]; and artificial neural networks take genes as input nodes and the class label as the output node, learning the parameters that connect the nodes of different layers in order to predict unknown samples [
68-
70]. Modifications of the classic methods have also been applied to this problem. For example, Linder et al. proposed the subsequent artificial neural network (ANN) method, which uses two levels of ANNs: the first level acts as a pre-selection step that narrows down the possible labels of the test sample, and the second-level ANN gives the final prediction [
71].
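The classic classifiers mentioned above are all available off the shelf; a minimal, hedged comparison on toy, pre-filtered data might look like the following sketch (not a reproduction of any cited study).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 100))             # samples x (already filtered) genes
y = rng.integers(0, 2, size=80)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for clf in (LinearDiscriminantAnalysis(),
            KNeighborsClassifier(n_neighbors=5),
            SVC(kernel='linear')):
    acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(type(clf).__name__, round(acc, 2))
```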
Besides these classic methods, there are other methods used in microarray class prediction, such as the weighted voting scheme introduced by Golub et al. [
1]. The basic idea of this method is that each selected informative gene votes for the class whose typical expression the test sample is closer to, and the sample is assigned to the class that receives more votes. Van’t Veer et al. successfully used a correlation-based classification method on a breast cancer dataset [
3]. West et al. applied a Bayesian regression model to predict the clinical status of breast cancer patients [
35]. Zhang et al. introduced a vector α = {α1, α2, …, αn} to indicate whether the class label is switched, e.g., αi = 1 indicates that the label of sample i is switched. With this new variable and a regression method, Zhang et al. tried to find the mislabeled samples in an iterative way [
72]. Gevaert et al. proposed three ways to integrate clinical data into Bayesian network learning: full integration, in which clinical data and expression data are combined to train a single Bayesian network; decision integration, in which two Bayesian networks are trained separately and their predictions are combined; and partial integration, in which the network structures are learned separately and then merged for parameter learning and prediction. Their results show that only partial integration and decision integration perform significantly better than either data source alone [
73].
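A sketch of a Golub-style weighted voting rule for two classes, with the signal-to-noise statistic as the gene weight; this is an illustrative reading of the scheme described above, not the original implementation, and the choice of weight and midpoint is an assumption.

```python
import numpy as np

def weighted_vote(X_train, y_train, x_test):
    """X_train: informative genes x samples; y_train: 0/1 labels; x_test: expression of one sample."""
    mu0 = X_train[:, y_train == 0].mean(axis=1)
    mu1 = X_train[:, y_train == 1].mean(axis=1)
    s0 = X_train[:, y_train == 0].std(axis=1)
    s1 = X_train[:, y_train == 1].std(axis=1)
    weight = (mu0 - mu1) / (s0 + s1)                 # signal-to-noise weight per gene
    midpoint = (mu0 + mu1) / 2.0
    votes = weight * (x_test - midpoint)             # positive votes favour class 0
    v0, v1 = votes[votes > 0].sum(), -votes[votes < 0].sum()
    return 0 if v0 > v1 else 1
```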
Survival analysis is a specific class of class prediction problem. Unlike the problems we have discussed above, the class label here is a continuous value. Several regression-based methods have been applied to this kind of data analysis [
74-
78]. All these methods are derived from the same basic model
λ(t | x) = λ0(t)·exp(β1x1 + β2x2 + … + βnxn),
where λ0(t) is a baseline hazard function, β = {β1, β2, …, βn} are the regression coefficients, and x = {x1, x2, …, xn} are the expression levels of the sample studied. Depending on their focus, these methods modify the model in different ways. Goeman et al. added the assumption that the regression coefficients of the genes are random and independent, with mean zero and a common variance τ2 [
74]. Kaderali et al. proposed to solve the regression problem with a hierarchical Bayesian approach which introduced a prior distribution for regression coefficients [
77]. One advantage of regression analysis is that it can handle different sources of data simultaneously (e.g., Fernandez-Teijeiro et al. incorporated both gene expression and clinical information in their survival analysis [
79]).
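A minimal survival-regression sketch using the lifelines package (assumed to be installed); the ridge penalty below is only one of several ways the cited methods constrain the coefficients when genes outnumber samples, and all data are toy values.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n_samples, n_genes = 80, 20                               # a handful of pre-selected genes
df = pd.DataFrame(rng.normal(size=(n_samples, n_genes)),
                  columns=[f'gene_{i}' for i in range(n_genes)])
df['time'] = rng.exponential(scale=24, size=n_samples)    # follow-up time
df['event'] = rng.integers(0, 2, size=n_samples)          # 1 = event observed, 0 = censored

cph = CoxPHFitter(penalizer=0.5)          # shrink coefficients, cf. the random-coefficient assumption above
cph.fit(df, duration_col='time', event_col='event')
risk = cph.predict_partial_hazard(df)     # relative risk score per sample
```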
Though feature selection and sophisticated analysis methods can to some extent avoid over-fitting and improve classification results, we cannot be sure whether improved results are a consequence of improved methods or of chance. Several approaches aim to evaluate the analysis results or to increase the stability of the classifier. Testing the classifier on an independent dataset is usually an unbiased way to measure its performance. If an independent test set is unavailable, cross validation is a common way to estimate classifier performance: n-fold cross validation means separating the original dataset into n subsets and, in each round, using one subset as the test set and the others as the training set; leave-one-out cross validation means leaving out a single sample as the test set each time and training on the others. In microarray data analysis, a feature selection step usually precedes classification, and there are two schemes for evaluating performance when feature selection is used. As defined by Zhang et al. [
51], CV1 is the scheme in which feature selection is performed using all the samples and cross validation is applied only to the classification step, whereas CV2 applies cross validation to both feature selection and classification, i.e., the test set is held out first and feature selection and classification are performed with the remaining samples. Because the CV1 scheme uses all the samples for feature selection, which leaks information into the classification step, its cross validation result will be optimistic, while that of CV2 is more faithful. Although cross validation provides a way to evaluate the performance of data analysis methods, there is still large variation in cross validation results. A permutation method is often used, in which the sample labels are permuted to estimate the null distribution of a statistic or of the classification accuracy; it is a non-parametric way to test the significance of an analysis result [
45,
80,
81]. Boosting is used to increase the accuracy and stability by increasing the weight of misclassified samples [
43,
82], and bagging serves the same purpose by combining several classifiers trained on bootstrapped datasets [
82]. There are also some methods to detect possible mislabeled samples in an iterative way [
72,
83].
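The difference between the CV1 and CV2 schemes can be made concrete with scikit-learn; on pure-noise data, the CV1 estimate is typically optimistic while CV2 stays near chance. This is a toy sketch with arbitrary parameter choices.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5000))                  # 60 samples x 5000 genes of pure noise
y = np.array([0, 1] * 30)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# CV1: feature selection on ALL samples, cross validation only for the classifier (optimistic)
X_sel = SelectKBest(f_classif, k=50).fit_transform(X, y)
cv1 = cross_val_score(SVC(kernel='linear'), X_sel, y, cv=cv).mean()

# CV2: the test fold is held out before feature selection (more faithful)
pipe = Pipeline([('select', SelectKBest(f_classif, k=50)),
                 ('clf', SVC(kernel='linear'))])
cv2 = cross_val_score(pipe, X, y, cv=cv).mean()
print(cv1, cv2)
```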
Biology behind data
Simply clustering the data, selecting some differentially expressed genes, or building a classifier is not always enough for microarray analysis. Sometimes we need to know more about the biology behind the data, such as which pathway or biological function is associated with a specific type of cancer. As more gene functions become known, methods have been proposed to analyze expression data together with known gene annotations.
Gamberoni et al. proposed a novel approach that studies the correlations between genes and their relation to Gene Ontology (GO) [
84]. The method first selects significantly correlated pairs of genes, and then counts the number of gene pairs linked to the same GO term. Finally, a bootstrap analysis is used to evaluate whether the number of gene pairs linked to the same GO term is significantly higher than that expected from a simulation study. Subramanian et al. proposed a method called gene set enrichment analysis (GSEA) to evaluate whether a pre-defined group of genes has different expression patterns in two classes of samples [
85]. Similar to the GSEA approach, Al-Shahrour et al. proposed a method to find groups of genes that have common functional labels which are significantly over- or under-expressed as a block [
86].
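The core significance calculation behind many annotation-based analyses is a simple over-representation test; a hypergeometric version can be sketched as follows. The counts are hypothetical, and GSEA itself uses a rank-based statistic rather than this test.

```python
from scipy.stats import hypergeom

N_universe = 10000    # genes on the array
n_term     = 150      # genes annotated with a given GO term
n_selected = 200      # differentially expressed genes
k_overlap  = 12       # selected genes carrying that GO term

# probability of seeing at least k_overlap annotated genes by chance
p_value = hypergeom.sf(k_overlap - 1, N_universe, n_term, n_selected)
print(p_value)
```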
Discussion
Microarray data analysis is a biology-driven problem; different biological purposes require different analysis methods. Although the existing methods have met many data analysis needs, several directions still need more attention. For example, most of the existing classification methods cannot use clinical data and expression data simultaneously; this raises the question of how to effectively remove the dominant factor in the dataset.