Introduction
Traditional and molecular epidemiological approaches have solved many problems in infectious disease epidemiology, such as the identification of outbreaks and transmission. However, they cannot by themselves reveal the phylogenetic origins and pathogenic potential of an etiological agent or delineate individual transmission events within an outbreak [
1].
During the last decade, the rapid development of sequencing technology, especially the transition from traditional Sanger sequencing to next-generation sequencing (NGS), has provided new opportunities for applications in various fields, including infectious disease epidemiology. With the aid of NGS, considerable advances in speed, high throughput, and cost savings in DNA sequencing have made bacterial whole-genome sequencing (WGS) feasible even in small research and clinical laboratories. The use of genomic data in epidemiological analyses of bacterial infectious diseases is a new means for disease prevention and control [
2]. The development of NGS and its intensive use across infectious disease studies have introduced a new set of terms to the epidemiology of infectious diseases. Among them, genomic epidemiology is the latest breakthrough in molecular epidemiology [
3]. Classic cases of genomic epidemiology were reported in succession between 2011 and 2012 [
4–
8]. Since the publication of these studies, genomic sequencing of bacterial pathogens has become a widely accepted method to investigate bacterial disease outbreaks. This technology also plays a crucial role in furthering the understanding of infectious disease outbreaks in health-care settings and at community, national, and international scales. In this article, we review the development and application of genome sequencing in the epidemiological investigation of emerging and re-emerging bacterial infectious diseases.
Discovering cluster cases based on the genomic sequences of bacterial pathogens
Bacterial genomes, that is, the DNA sequences of the chromosomes and plasmids of bacteria, are generally smaller than those of eukaryotes and range from 130 kbp to 14 Mbp. Even within one species, the genomes of different strains present remarkable diversity. Analysis of over 2000
Escherichia coli genomes revealed an
E. coli core genome of approximately 3100 gene families and a total of approximately 89 000 different gene families [
9]. Unlike eukaryotes, which evolve mainly through modification of existing genetic information, bacteria acquire a large percentage of their genetic diversity by the horizontal transfer of genes [
10].
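The distinction between a core genome and the total set of gene families can be illustrated with a minimal sketch. It assumes that each strain has already been reduced to a set of gene-family identifiers; the strain names and gene labels below are hypothetical and are not taken from the cited analysis.

```python
from typing import Dict, Set, Tuple


def core_and_pan_genome(gene_families: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (core, pan) gene-family sets from per-strain gene-family sets.

    core: families present in every strain.
    pan:  families present in at least one strain.
    """
    if not gene_families:
        return set(), set()
    per_strain = list(gene_families.values())
    core = set.intersection(*per_strain)
    pan = set.union(*per_strain)
    return core, pan


# Hypothetical toy input: three strains with a handful of gene families each.
strains = {
    "strain_A": {"gyrB", "recA", "stx2", "blaCTX-M"},
    "strain_B": {"gyrB", "recA", "stx2"},
    "strain_C": {"gyrB", "recA", "fimH"},
}

core, pan = core_and_pan_genome(strains)
print(sorted(core))  # ['gyrB', 'recA']
print(len(pan))      # 5
```

In this toy example the accessory genome (the pan genome minus the core) holds the families that differ among strains, which is where horizontally acquired genes would be expected to appear.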
WGS is suitable for the discovery of genetic variations between outbreak and non-outbreak strains; this technique can also establish a detailed comparison among isolates within an outbreak or cluster [
9]. The use of genomic information during routine monitoring and during long-term and/or geographically broad surveillance can reveal clusters of cases with identical or highly similar genomic sequences, which suggests an outbreak [
11]. Consequently, the source and route of the outbreak can be accurately inferred, thereby providing support for disease prevention and control in the early stages of outbreaks or epidemics. Additionally, given that the genomes of bacterial pathogens are obtained in the survey, important genetic characteristics, such as antibiotic resistance and virulence factors, and other new genetic characteristics of the genome of pathogenic bacteria will be revealed [
11].
Methods and strategies of bacterial genome sequencing in surveillance and outbreak investigation
On the basis of the development of sequencing technology [
12], high-quality sequencing data of the whole genome or nearly whole genome of bacteria can be obtained rapidly and easily. Sequence variations, such as single-nucleotide polymorphisms (SNPs), insertions/deletions, and accessory genes, can be identified rapidly and precisely based on different comparative strategies from raw read data or genome assembly [
13]. These variations can be used as biomarkers to identify and subtype pathogens, analyze phylogenetic relationship, and further provide the molecular basis for discovering epidemiological association [
14,
15]. In a narrow sense, applications of genome sequencing in surveillance and outbreak investigation consist of inferring the relationships among isolates from genomic data and, on that basis, discovering the transmission chain, tracing the outbreak source, interpreting the origin and spread of epidemics in combination with information on the natural and social environments, and understanding the transmission patterns of pathogens and the modes by which diseases emerge and spread [
16–
19].
The general workflow of this procedure should include the following: (1) obtain isolates; (2) generate high-throughput sequence data; (3) screen genomic polymorphisms and identify informative loci; (4) infer the genetic relationship among isolates based on genomic information; and (5) reconstruct the transmission chain by combining these relationships with biological characteristics (phenotype and/or genotype from other methods), results from field epidemiological investigations, and big data on natural and social factors (Fig. 1) [
18,
19]. The first step depends on biological procedures, and the methods in the second step are well established. The fifth step requires multiple types of data from different sources; however, no standards or well-established procedures are available for it, and it should be improved as big data technology develops. The third and fourth steps are the core processes that translate high-throughput genomic data into reliable epidemiological information through bioinformatics and genomics strategies. Several strategies, including sequence-based (core genome, whole/core genome SNPs) [
4–
8] and allele-based (whole/core genome multilocus sequence typing (wgMLST/cgMLST)) methods [
20–
30], have been used in recent years.
Sequence-based methods use genomic sequences directly to obtain information and estimate the relationships among a large number of isolates. Because aligning whole-genome sequences commonly requires high-performance computing or is otherwise too slow, an alternative is to identify informative loci, mainly SNPs, by comparing each isolate with a reference genome, screen for high-quality loci, and concatenate the bases at these loci from each isolate [
4–
8]. The concatenated sequences can then be used to perform subtyping, compute genomic similarity, and construct phylogenetic relationships. This process includes the following: (1) sequencing and assembly, (2) identification of SNPs, (3) creation of a multiple sequence alignment from selected SNPs, and (4) inference of phylogeny from the multiple sequence alignment. Several pipelines, including GenomeTrakr’s CFSAN SNP Pipeline [
31], SnpFilt [
32], and Lyve-SET [
33], can be used to identify SNPs.
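Steps (2) and (3) above, and the similarity computation that follows them, can be sketched in a few lines. The sketch assumes that each isolate has already been mapped to the same reference so that all sequences share one coordinate system and length; the isolate names and sequences are invented for illustration. It is not the implementation used by any of the pipelines named above, and the rule of discarding any position with an ambiguous base is only one simple choice of quality filter.

```python
from itertools import combinations
from typing import Dict, List, Tuple

VALID_BASES = set("ACGT")


def variable_sites(aligned: Dict[str, str]) -> List[int]:
    """Positions where every isolate has an unambiguous base and the bases differ."""
    seqs = list(aligned.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for i in range(length):
        column = {s[i].upper() for s in seqs}
        # Skip columns containing N, gaps, or other ambiguity codes.
        if column <= VALID_BASES and len(column) > 1:
            sites.append(i)
    return sites


def snp_alignment(aligned: Dict[str, str]) -> Dict[str, str]:
    """Concatenate the bases at the variable sites for each isolate."""
    sites = variable_sites(aligned)
    return {name: "".join(seq[i].upper() for i in sites) for name, seq in aligned.items()}


def pairwise_snp_distances(snps: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Count differing positions between every pair of concatenated SNP strings."""
    return {
        (a, b): sum(x != y for x, y in zip(snps[a], snps[b]))
        for a, b in combinations(sorted(snps), 2)
    }


# Hypothetical reference-aligned sequences (position 6 of iso3 is ambiguous).
aligned = {
    "iso1": "ACGTACGTAC",
    "iso2": "ACGTACGTAT",
    "iso3": "ACATACNTAT",
}

snps = snp_alignment(aligned)
distances = pairwise_snp_distances(snps)
print(snps)       # {'iso1': 'GC', 'iso2': 'GT', 'iso3': 'AT'}
print(distances)  # {('iso1', 'iso2'): 1, ('iso1', 'iso3'): 2, ('iso2', 'iso3'): 1}
```

The concatenated SNP strings correspond to the multiple sequence alignment of step (3); in practice they would be passed to a phylogenetic inference program for step (4), which is outside the scope of this sketch.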
Allele-based methods are similar to multilocus sequence typing (MLST), but thousands of genetic loci are used. These gene-by-gene approaches, that is, wgMLST or cgMLST, depend on the used gene sets and extend the MLST concept to the genome level on the basis of genomic comparison to construct a set of indexed loci and allele variants [
20–
22]. These approaches show a good typing ability of several pathogenic bacteria, including
Campylobacter jejuni [
21],
Neisseria meningitidis [
23],
Listeria monocytogenes [
24],
Mycobacterium tuberculosis [
25],
Staphylococcus aureus [
26],
Legionella pneumophila [
27],
Enterococcus faecium [
28], and
Klebsiella pneumoniae [
29,
30]. PulseNet International recently used wgMLST in a project performing real-time NGS-based typing of
L. monocytogenes [
11], and they recently committed to the wgMLST approach for their routine surveillance of foodborne diseases [
34].
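The gene-by-gene comparison underlying wgMLST/cgMLST can be sketched as a count of loci at which two isolates carry different allele numbers. The sketch assumes that allele calling has already been done and that an uncalled locus is recorded as `None`; the locus names, allele numbers, and the choice to ignore uncalled loci are illustrative assumptions rather than the rules of any particular typing scheme.

```python
from typing import Dict, Optional

# An allele profile maps a locus name to an allele number, or None if uncalled.
Profile = Dict[str, Optional[int]]


def allele_distance(p1: Profile, p2: Profile) -> int:
    """Number of shared loci, called in both profiles, whose alleles differ."""
    shared = p1.keys() & p2.keys()
    return sum(
        1
        for locus in shared
        if p1[locus] is not None and p2[locus] is not None and p1[locus] != p2[locus]
    )


# Hypothetical four-locus profiles; real schemes index thousands of loci.
isolate_a: Profile = {"locus1": 1, "locus2": 4, "locus3": 2, "locus4": None}
isolate_b: Profile = {"locus1": 1, "locus2": 5, "locus3": 2, "locus4": 7}
isolate_c: Profile = {"locus1": 3, "locus2": 5, "locus3": 2, "locus4": 7}

print(allele_distance(isolate_a, isolate_b))  # 1
print(allele_distance(isolate_a, isolate_c))  # 2
print(allele_distance(isolate_b, isolate_c))  # 1
```

Because each locus contributes at most one difference regardless of how many nucleotides changed, allele distances are less sensitive than SNP counts to events such as recombination that alter many bases in one gene at once.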
Whole-genome-based methods can resolve the transmission dynamics of an outbreak in more detail than traditional genotyping methods, such as MLST, PFGE, and MLVA, in which only a small fraction of the genome is used to distinguish individual strains and infer phylogenetic relationships. These WGS-based methods also exhibit excellent typeability, reproducibility, and typing system concordance, and they can be used to establish public databases and support networked applications.
Identifying outbreaks on the basis of bacterial genome sequencing
Some subtyping methods for analyzing the genetic diversity of isolates of pathogens have been used in laboratory and epidemiological assays to identify outbreaks. PFGE is the most frequently used genotyping method for outbreak detection in routine monitoring; this method is remarkably useful in detecting clusters [
35–
39]. However, PFGE provides no phylogenetic information [
40,
41]. Sequencing of isolates yields high-resolution, nucleotide-level data that can provide the necessary information to identify these cases [
11]. The resolution of WGS-based methods is sufficient to distinguish isolates from different outbreaks and provide additional information to the investigations within an outbreak [
42]. WGS is also an excellent tool for detecting national [
35,
36,
43–
47] and global outbreaks [
38,
45,
48].
An overview of how WGS can be used to identify suspected clusters and outbreaks is given in Fig. 2. In routine monitoring, isolates from sporadic cases that are identified through traditional epidemiological information (such as isolation times or isolation sites), and possibly any environmental and/or foodborne isolates, are sent for WGS. Isolates with the same or highly similar whole-genome sequences are identified through bioinformatic analysis and comparison, thereby indicating a cluster of cases or an outbreak. Laboratory WGS information and epidemiological information are communicated back to the infection investigation team to start an epidemiological investigation and to support targeted interventions and informed policy development.
One recent application case is a project performing real-time NGS-based typing of
L. monocytogenes [
11]. WGS can identify outbreaks, clustered cases, and correlations between patient and food strains that were not identified by PFGE. Furthermore, WGS can demonstrate that certain PFGE-defined clusters do not consist of highly related isolates [
11,
49,
50]. Real-time WGS and typing are used to investigate infections caused by
Salmonella Agona [
51], methicillin-resistant
S. aureus (MRSA), vancomycin-resistant
E. faecium, multidrug-resistant (MDR)
E. coli, and MDR
Pseudomonas aeruginosa [
52]; these techniques have produced precise actionable results and helped researchers detect suspected outbreaks and clustered cases, which facilitate immediate preventive and control measures.
A key challenge in detecting outbreaks and clustered cases in routine monitoring by using WGS is determining how much genetic variation can exist within an epidemiologically related cluster. New criteria should be established on the basis of existing genetic diversity within each species, its molecular clock, and the characteristics of the outbreak and endemicity.
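The role of such a criterion can be made concrete with a small sketch of threshold-based cluster detection. It groups isolates by single linkage: two isolates fall into the same cluster whenever a chain of pairwise distances at or below the threshold connects them. The distances and the threshold of 5 are hypothetical placeholders; as noted above, an appropriate value would have to be established for each species and setting, and single linkage is only one of several possible grouping rules.

```python
from typing import Dict, List, Set, Tuple


def cluster_by_threshold(
    distances: Dict[Tuple[str, str], int], threshold: int
) -> List[Set[str]]:
    """Single-linkage clusters: link any pair whose distance is <= threshold."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        root_a, root_b = find(a), find(b)
        if d <= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters: Dict[str, Set[str]] = {}
    for name in list(parent):
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Hypothetical pairwise SNP distances among four isolates.
snp_distances = {
    ("A", "B"): 2,
    ("A", "C"): 30,
    ("B", "C"): 29,
    ("C", "D"): 3,
}

print(cluster_by_threshold(snp_distances, threshold=5))
# Two clusters: {A, B} and {C, D}
```

Raising the threshold merges clusters and lowering it splits them, which is why the choice of threshold directly determines which cases are flagged as a suspected outbreak.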
Tracing pathogen transmission
The resolution of WGS methods is higher than that of PFGE for large-time-scale traceability of pathogens. A severe cholera outbreak occurred in Haiti in 2010. As of July 7, 2011, 386 429 cholera cases, including 5885 deaths, were recorded [
4]. PFGE could not distinguish the strains of this outbreak from those from South Asia and other regions. WGS-based typing was therefore used to analyze cholera isolates from Nepal and Haiti. The isolates from Haiti and Nepal clustered together in the resulting phylogenetic tree, with only 1–2 SNP differences, far fewer than their differences from other isolates [
5]. In another case, a comparative genomic analysis of 133
Yersinia pestis strains from China and other regions identified 2326 SNPs; the phylogenetic tree based on these SNPs shows that the plague may have originated in the eastern Tibetan Plateau and spread to other parts of the world through the Silk Road, Tea Horse Road, and Tangfangudao ancient trade routes [
53,
54].
WGS-based typing methods can be used to identify the source of an outbreak. In 2006–2008, a tuberculosis outbreak occurred in Canada, and the traditional MIRU-VNTR method could not identify the transmission relationships among isolates. Thirty-two outbreak isolates and four other isolates with the same MIRU-VNTR type were sampled and sequenced. Genome comparison identified 206 SNP sites among these strains and clearly separated them into two groups that derived from the same clone. This phylogenetic result, combined with a traditional epidemiological investigation, identified a “super spreader” who infected several people in this outbreak [
55]. In 2010, a large-scale outbreak of
S. Paratyphi A involving 601 cases of paratyphoid fever occurred in the Yuanjiang County in China. PFGE and WGS of the
S. Paratyphi A strains isolated from patients and environmental sources were performed to facilitate transmission analysis and source tracking. The maximum likelihood tree of 22
S. Paratyphi A strains based on 270 SNPs indicated that vegetables contaminated by hospital waste water caused this outbreak; these vegetables were also the main risk factor promoting a contamination cycle [
56].
Genomic epidemiological analysis of the relationships between samples from patients and medical staff in different periods can help rapidly determine the sources and transmission routes of nosocomial infections. In a 2010 study, different MRSA strains were isolated from a hospital in Thailand; five isolates from adjacent wards were genetically very close, which suggested that physical proximity was directly related to transmission in this nosocomial infection [
57]. A similar genome comparison method was used to trace the source of two outbreaks of
C. difficile in 2012–2014 in a hospital in China. Twenty-two
C. difficile isolates were sampled from a ward from March 2012 to May 2014, and a total of 62 SNPs were identified in these NAP1/BI/027 strains, which helped trace transmission among people. For example, only one SNP difference was identified between strains 4 and 7, which illustrates their close relationship. The phylogenetic relationships pointed to a toilet as the candidate source in this case [
58].
Discovering new modes or ways of transmission
The accuracy of epidemiological studies is markedly improved by using molecular methods and technologies to detect pathogens and perform subtyping and investigation [
59,
60]. Whole-genome sequence data are highly effective for determining whether individuals are part of a transmission chain, establishing or confirming epidemiological associations, and identifying transmission patterns and long-term dynamics [
16], thereby covering the gaps of former methods, such as MLST.
WGS-based methods revealed many epidemiological characteristics in an outbreak among humans infected by
Streptococcus suis that occurred in Sichuan, China in 2005 [
60] (Fig. 3). During the eight weeks of the outbreak, 215 cases with 39 deaths were reported in 203 villages of 12 cities. Investigations initially identified one case per village in 194 villages within a short time, but the cases were distributed over a large geographic area. Consequently, the epidemiological relevance was difficult to determine. Traditional epidemiological studies identified the source as pigs infected by or carrying
S. suis, and the outbreak was attributed to direct pig-to-human transmission. The pathogen was isolated from patients and infected pigs, but the laboratory data offered no information on transmission because all isolates showed the same serotype, PFGE pattern, and MLST type [
61]. The genomes of the isolates were sequenced and classified into six clades by using 160 whole-genome SNPs [
62]. A molecular clock analysis revealed that the clades responsible for the 2005 outbreak emerged separately from February 2002 to August 2004. A total of 41 lineages had emerged by the end of 2004 and rapidly expanded to 68 genome types through single-base mutations. This finding suggests that patients were infected locally at their respective geographic sites. This conclusion is supported by other information, such as the relationships between the breeding mode and geographic, traffic, and economic factors [
62]. In another study exploring the transmission patterns of
S. Paratyphi A in China [
63], a genome-wide SNP analysis was conducted on strains from four provinces with high morbidity. On the basis of the genomic and epidemiological data, the relationships of isolates and patients were inferred. Results showed that dominant isolates in these provinces, which may have first been imported from south-east Asia, originated from Zhejiang Province. The transmission in China showed two modes: sprawling spread from coastal areas to inland and jumping spread directly from coastal areas to nonadjacent inland provinces. In the mid-1990s, paratyphoid fever became an emerging problem in Zhejiang Province, and most inter-province transmission occurred between 1998 and 2002. This finding is also supported by social factors, such as population migration and economic transition [
63]. The transmission patterns of an outbreak or epidemic can differ or change with differences in natural and social environments and changes in human activities. For example, an investigation that used WGS and social network analysis to study a tuberculosis outbreak occurring over a three-year period in a medium-sized community in British Columbia, Canada concluded that socioenvironmental factors affect the transmission of tuberculosis and that the use of crack cocaine may have played a role in triggering and sustaining the outbreak [
55]. Transmission patterns can be clearly understood by combining genomic epidemiological studies with detailed epidemiological data, including natural and social factors.
Identifying new clones or clades
In many cases, new epidemic clones and outbreak isolates display genomic characteristics not present in previous isolates, such as new virulence or antimicrobial resistance genes or gene clusters. These new genomic characteristics often confer important phenotypes that cause high mortality and treatment failure in epidemics or outbreaks. Determining them is therefore highly helpful for the timely control of infectious diseases, and WGS can obtain them rapidly. In the 2010 Haiti cholera outbreak, the characteristics of
V. cholerae virulence genes, including structural variations in the super integron, VSP-II, SXT, and
ctxB regions, were determined by WGS [
4]. The structural variations within VSP-II, SXT, and
ctxB regions are characteristics of variant strains that have emerged and replaced previously dominant strains of the seventh pandemic in South Asia. These structural variations were revealed in the early stages of the outbreak and were used as markers for tracing Haiti outbreak strains in subsequent studies [
64,
65]. In the European
E. coli O104:H4 outbreak in 2011, virulence and antimicrobial resistance genes, including the
stx2 prophage and pESBL TY2482 plasmids, were identified by WGS [
8]. Knowledge of virulence and antimicrobial resistance is important for explaining the pathogenesis of isolates and avoiding treatment failure, thereby reducing the fatality rate of diseases. With traditional methods, only known characteristics can be obtained by phenotypic experiments and molecular detection of genes, whereas WGS can provide a detailed, panoramic view of genomic characteristics. The terms toxome and resistome were created to represent all toxin and antibiotic resistance genes, respectively, obtained by WGS [
66]. Describing virulence genes (toxome) and resistance genes (resistome) at the level of the whole genome deepens the understanding of the characteristics of outbreak isolates and contributes to subsequent prevention and control.
WGS-based typing can also be used to identify clones with special public health significance, including epidemic clones, which show potential to cause severe clinical infections and large-scale outbreaks or possess high pathogenicity. With bacterial evolution, some biological markers with important public health significance, including those associated with virulence or antimicrobial resistance, will be formed. Minimum core genome (MCG) typing of
S. suis is established [
67], and
S. suis is classified into seven MCG groups. All the isolates causing severe human infections, death, and outbreaks fall into MCG group 1. The MCG typing system for
S. suis separates the group containing human-associated isolates from the groups containing animal-associated isolates, outbreak isolates from sporadic isolates, and isolates causing severe clinical infections from those causing less severe infections [
67]. Phylogenetic analysis and virulence gene identification demonstrated that
S. suis may have progressively gained additional virulence genes as it evolved into a human pathogen [
67]. MCG typing is also applied to distinguish
L. pneumophila isolates with high or low intracellular growth ability [
68]. A population structure analysis based on MCG divides the 53
L. pneumophila isolates into nine MCG groups; eight of these groups show high intracellular growth ability, and one presents low intracellular growth ability. The isolates with low intracellular growth ability could serve as vaccine candidates in the future. Comparative genomic analysis identified 22 lethal mutations (premature stop, damaged start codon, and damaged stop codon) associated with 19 genes in the genomes of the low-intracellular-growth-ability group isolates; these mutations may have caused the reduced pathogenicity. MCG typing distinguishes isolates with different virulence phenotypes and identifies epidemic clones, such as the highly virulent isolates of the ST1 clone and the epidemic isolates of the ST7 clone of
S. suis [
67]. ST1 is the most prevalent ST among clinical and environmental
L. pneumophila isolates worldwide [
68].
Applications of bacterial genomic sequencing in regional and global surveillance networks for infectious diseases
The advent of WGS of bacterial pathogens has revolutionized the application of microbiology in public health. Given that infectious diseases can spread regionally and globally, some organizations have used WGS to detect pathogens and outbreaks and to construct and maintain surveillance networks among laboratories in different regions. The Food and Drug Administration (FDA) of the United States of America used WGS of
S. Tennessee to describe genomic diversity both across the serovar and among and within outbreak clades of strains associated with contaminated peanut butter [
46]. FDA considered that the use of WGS-based methods can delimit contamination sources for foodborne illnesses across multiple outbreaks and reveal otherwise undetected DNA sequence differences essential to the tracing of bacterial pathogens as they emerge. Public Health England implemented WGS as a typing tool for public health surveillance of
Salmonella and adopted a wgMLST approach as a replacement for traditional serotyping [
69].
WGS-based subtyping methods offer superior sensitivity, specificity, and timeliness in resolving outbreak clusters. Hence, PulseNet USA, a national public health laboratory network for the detection of bacterial outbreaks, and PulseNet International, which performs similar global surveillance of foodborne illnesses, are using WGS in their networks to identify, characterize, and subtype foodborne pathogens. WGS replaces existing phenotypic and molecular methods in support of preparedness and responsiveness to foodborne illnesses at the local, national, regional, and global levels. PulseNet USA has since been expanded into PulseNet International, which comprises the national, regional, and subregional laboratory networks of Africa, Asia Pacific, Canada, Europe, Latin America and the Caribbean, the Middle East, and the United States. PulseNet International is being developed to share information in real time within regional and national laboratory networks to support surveillance and outbreak response globally; the network also conducts cross-border outbreak investigations in which all participants use standardized protocols and analysis procedures [
70].
Global Microbial Identifier (GMI) aims to assist laboratories and partners globally to perform WGS to the highest degree of quality [
71]. GMI envisions a global system of DNA genome databases for microbial and infectious disease identification and diagnostics. Such a system will benefit those addressing individual problems at the frontline, such as clinicians, veterinarians, policy makers, regulators, and industries. A professional response to health threats will be available for all countries with basic laboratory infrastructure by enabling access to this global resource.
WGS improves molecular epidemiological studies, public health laboratory surveillance, and infectious disease control [
55,
72]. Bacterial genomic sequencing data are suitable for surveillance and outbreak detection purposes and addressing other questions, including those on antimicrobial resistance, transmission pattern, and population structure. The translation of WGS from research centers to public health and clinical laboratories has already begun. Ideally, WGS will enable all countries to detect current and emerging infectious diseases in real time and at low cost and share information in a standardized manner [
73]. Although WGS can provide additional resolution to the relatedness of isolates, it should not be used as the sole source of evidence. Careful review of all available epidemiological, retrospective, and laboratory data is critical for determining the source of outbreaks when enhanced molecular techniques are implemented.
Future directions
Rapid and accurate acquisition of pathogen genome data is critical for the development of a highly efficient tool for infectious disease surveillance and outbreak detection. Culture-independent metagenomic technology is a direction for future development. The current analytical capacity still requires the culturing of bacterial isolates, which is time consuming, especially for slow-growing bacteria, such as M. tuberculosis. Although culture-independent metagenomic technology displays disadvantages, such as contamination from environmental sources and the production of a large amount of non-target data, this technology can shorten response times considerably, which is beneficial when dealing with outbreaks.
Systematic evaluation and protocol standardization of WGS-based subtyping methods for bacterial pathogens applied in outbreak investigations are also needed. A set of model isolates, including representative strains from sporadic cases and outbreaks, of each species should be used to evaluate the typing ability of various methods comprehensively. Minimal evaluation parameters should include discriminatory power, reproducibility, and epidemiological concordance. A reference standard for determining the relationship among isolates during outbreaks, such as Tenover’s criteria for PFGE [
74], is also needed to interpret results. Furthermore, basic population, natural environment, and clinical information for each case should be added to the information system to be comprehensively analyzed, along with laboratory and epidemiological information, to formulate strategies for infectious disease prevention and control.
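Of the evaluation parameters listed above, discriminatory power lends itself to a short worked example. A commonly used measure is Simpson's index of diversity in the Hunter and Gaston formulation, D = 1 - Σ n_j(n_j - 1) / (N(N - 1)), where N is the number of unrelated isolates tested and n_j is the number assigned to type j. Its use here is an assumption made for illustration, since the text does not prescribe a specific index, and the type assignments below are invented.

```python
from collections import Counter
from typing import Hashable, Sequence


def simpsons_diversity(types: Sequence[Hashable]) -> float:
    """Simpson's index of diversity (Hunter-Gaston form) for a list of type labels.

    Interpreted as the probability that two isolates drawn at random,
    without replacement, are assigned to different types.
    """
    n = len(types)
    if n < 2:
        raise ValueError("at least two isolates are required")
    counts = Counter(types)
    same_type_pairs = sum(c * (c - 1) for c in counts.values())
    return 1.0 - same_type_pairs / (n * (n - 1))


# Hypothetical typing results for the same ten unrelated isolates.
low_resolution_types = ["P1"] * 6 + ["P2"] * 4             # two types only
high_resolution_types = [f"G{i}" for i in range(1, 11)]   # every isolate distinct

print(round(simpsons_diversity(low_resolution_types), 3))   # 0.533
print(round(simpsons_diversity(high_resolution_types), 3))  # 1.0
```

A value close to 1 means that the method rarely assigns unrelated isolates to the same type; the index says nothing about reproducibility or epidemiological concordance, which have to be assessed separately.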
Data sharing among different laboratories should be pursued because genomic data are best interpreted through comparison with other genomes. Improved public platforms must be developed to facilitate this process. As a prerequisite for genome comparison within a network, a genome sequence database should be established to collect the genome data of bacterial strains obtained during surveillance in different regions. This database may provide a platform for collecting genome sequence data, quality estimation, sequence comparison, cluster identification, and result dissemination among the laboratories in the network. Such genome sequencing and searching networks should first be established by country or region. Global networking can subsequently be developed on the basis of national or regional networks.
Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature