Recently, an increasing number of algorithms have been proposed. scMerge (Lin
et al. 2019) first identifies single-cell stably expressed genes (scSEGs) to serve as “negative controls” for estimating the unwanted factors, which are then used to correct batch effects with the fastRUVIII algorithm (Risso
et al. 2014), an RNA-seq normalization method based on factor analysis of control genes or samples. ZINB-WaVE (Risso
et al. 2018) uses a zero-inflated negative binomial (ZINB) model that accounts for zero inflation (dropouts), over-dispersion, and the count nature of the data. ZINB-WaVE is an extension of the RUV model that includes observed and unobserved sample-level covariates and enables normalization for batch effects. Although ZINB-WaVE models single-cell data in a sophisticated way, it fails to handle large-scale data sets with tens of thousands of cells. Matching mutual nearest neighbors (MNN) (Haghverdi
et al. 2018) identifies cells that have mutually similar expression profiles between different experimental batches or replicates. The authors assume that any differences between these cells in the high-dimensional gene expression space are driven by batch effects and do not represent the underlying biology of interest. The algorithm extracts the systematic difference between these cells and uses it to correct the batch effects. The idea of MNN has had a great impact on later algorithms. MNN was originally designed for batch correction between a pair of datasets; if more than two datasets are to be integrated, they are projected sequentially onto one user-selected reference dataset. The result can be inconsistent if some pairs of datasets share no common cell states. Scanorama (Hie
et al. 2019) integrates data from heterogeneous scRNA-seq experiments by finding common cell types among all pairs of datasets. Only cells that are mutually linked between two datasets are kept, which excludes spurious links from the neighbor search. Conos (Barkas
et al. 2019) also applies MNN to integrate multiple datasets in a low-dimensional space. Instead of correcting gene expression, Conos aims to construct a joint graph of cells. Because MNNs are detected using L2-normalized gene expression, substantial differences between batches may obscure their identification. To overcome this, Seurat v2 (Butler
et al. 2018) uses canonical correlation analysis (CCA) to integrate datasets, projecting the cells onto the most correlated components between two data sets. Rare cells that cannot be explained by CCA are flagged for further analysis. A nonlinear “warping” algorithm is then used to align the data sets in a conserved low-dimensional space. Although CCA can identify shared biological markers and conserved gene correlation patterns, different cell types should not be aligned together. Therefore, Seurat v3 combines CCA with MNN and introduces “anchors”, cell pairs that encode the cellular relationships across datasets. The anchors are used to determine the correction vectors, which are then applied to correct the gene expression profiles. LIGER (Liu
et al. 2020a) takes multiple datasets (batches) as input and learns a low-dimensional space using integrative non-negative matrix factorization. LIGER enables the identification of shared cell types across batches, as well as dataset-specific features, offering a unified analysis of heterogeneous single-cell datasets. In addition to learning a low-dimensional space, Harmony (Korsunsky
et al. 2019) introduces an iterative procedure that alternates between soft clustering of cells and correction of the cluster centers. After clustering, each dataset has cluster-specific centroids that are used to compute cluster-specific linear correction factors. Owing to the linear nature of the clustering and correction steps, Harmony scales to large datasets. Although batch effects can be corrected successfully in particular scenarios, most methods presume that the datasets share the same cell types. Incorrect anchors identified between two datasets by some algorithms will lead to erroneous correction.
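
To make concrete the mutual-nearest-neighbor idea that MNN, Scanorama, Conos, and Seurat v3 build on, the following minimal sketch finds MNN pairs between two batches on L2-normalized expression and derives a single batch vector from them. It is an illustration under simplifying assumptions, not the implementation of any of the cited tools: the function names (`l2_normalize`, `knn_indices`, `mutual_nearest_neighbors`, `batch_vector`) are hypothetical, the neighbor search is brute force, and one global average vector stands in for whatever correction each published method actually applies.

```python
import numpy as np


def l2_normalize(x):
    """Scale each cell (row) to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def knn_indices(query, reference, k):
    """For each row of `query`, return the indices of its k nearest rows in `reference`."""
    # Brute-force squared Euclidean distances; adequate only for small toy data.
    d2 = ((query[:, None, :] - reference[None, :, :]) ** 2).sum(axis=2)
    return np.argsort(d2, axis=1)[:, :k]


def mutual_nearest_neighbors(batch_a, batch_b, k=3):
    """Return sorted (i, j) pairs such that cell i of A and cell j of B
    appear in each other's k-nearest-neighbor lists."""
    a, b = l2_normalize(batch_a), l2_normalize(batch_b)
    a_to_b = knn_indices(a, b, k)  # neighbors in B for every cell of A
    b_to_a = knn_indices(b, a, k)  # neighbors in A for every cell of B
    b_sets = [set(int(i) for i in row) for row in b_to_a]
    pairs = [
        (i, int(j))
        for i in range(a.shape[0])
        for j in a_to_b[i]
        if i in b_sets[int(j)]
    ]
    return sorted(pairs)


def batch_vector(batch_a, batch_b, pairs):
    """Average expression difference (B minus A) over the MNN pairs.
    Assumption: a single global vector, used here only for illustration."""
    if not pairs:
        raise ValueError("no mutual nearest neighbors found")
    idx_a = [i for i, _ in pairs]
    idx_b = [j for _, j in pairs]
    return (batch_b[idx_b] - batch_a[idx_a]).mean(axis=0)
```

With two batches stored as cells-by-genes arrays `A` and `B`, `pairs = mutual_nearest_neighbors(A, B, k=20)` followed by `B - batch_vector(A, B, pairs)` would shift batch B toward batch A. If the batches share no cell type, the pairs that are found link unrelated cells and the resulting vector is misleading, which is the failure mode described above.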