Molecular Evolution of the Mycobacterium tuberculosis Complex

Chapter 2: Molecular Evolution of the Mycobacterium tuberculosis Complex

by Nalin Rastogi and Christophe Sola

2.1. A basic evolutionary scheme of mycobacteria

Mycobacteria are likely to represent a very ancient genus of bacteria. Probably, the Mycobacterium genus originates from a common ancestor whose offspring specialized in the process of colonizing very different ecological niches. The evolutionary relationships between organisms of the genus Mycobacterium have been investigated on the basis of the analysis of derived similarities (“shared derived traits”, synapomorphies).

Since no contemporary living species may directly stem from another contemporary species, it is advisable to speak of «common ancestors», by building cladograms rather than genealogical trees when comparing a monophyletic group. Such cladistic analysis (the word clade is derived from the ancient Greek klados, meaning branch) forms an ideal basis for modern systems of biological classification. Cladograms so generated are invariably dependent on the amount of information selected by the researcher.

An ideal approach takes into account a wide variety of information in order to form a natural group of organisms (clade) which share a unique ancestor that is not shared with other organisms on the tree, i.e., each clade comprises a series of characteristics specific to its members (synapomorphies), and absent from the group of organisms from which it diverged. Such distinction involves the notion of outgroups (organisms that are closely related to the group but not part of it). The choice of an outgroup constitutes an essential step, since it can profoundly change the topology of a tree. Similarly, much attention is needed to distinguish between characters and character states prior to such analysis (e.g., “blue eyes” and “black eyes” are two character states of the character “eye-color”). A character state of a determined clade which is also present in its outgroups and its ancestor is designated as plesiomorphy (meaning “close form”, also called ancestral state). The character state which occurs only in later descendants is called an apomorphy (meaning “separate form”, also called the “derived” state). As only synapomorphies are used to characterize clades, the distinction between plesiomorphic and synapomorphic character states is made by considering one or more outgroups.
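The outgroup comparison described above can be sketched in a few lines of Python. This is only an illustration: the taxa, characters, and function names (`polarize`, `synapomorphies`) are hypothetical and are not taken from any published mycobacterial data set. It assumes that the state found in the outgroup is the ancestral one.

```python
from typing import Dict, List

# Hypothetical character matrix: one outgroup and three ingroup taxa.
outgroup: Dict[str, str] = {"eye-color": "black", "wing": "absent"}
ingroup: Dict[str, Dict[str, str]] = {
    "taxon_A": {"eye-color": "blue", "wing": "absent"},
    "taxon_B": {"eye-color": "blue", "wing": "present"},
    "taxon_C": {"eye-color": "black", "wing": "absent"},
}


def polarize(ingroup: Dict[str, Dict[str, str]],
             outgroup: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Label each ingroup character state by comparison with the outgroup.

    Assumption: a state identical to the outgroup state is ancestral
    (plesiomorphic); any other state is derived (apomorphic).
    """
    return {
        taxon: {
            character: "plesiomorphic" if state == outgroup[character] else "apomorphic"
            for character, state in states.items()
        }
        for taxon, states in ingroup.items()
    }


def synapomorphies(ingroup: Dict[str, Dict[str, str]],
                   outgroup: Dict[str, str],
                   clade: List[str]) -> Dict[str, str]:
    """Return derived states shared by every clade member and by no other taxon."""
    members = [ingroup[t] for t in clade]
    others = [ingroup[t] for t in ingroup if t not in clade]
    shared = {}
    for character, ancestral in outgroup.items():
        states = {m[character] for m in members}
        if len(states) != 1:
            continue  # members disagree: not a shared state
        state = next(iter(states))
        if state == ancestral:
            continue  # shared but ancestral: a plesiomorphy
        if any(o[character] == state for o in others):
            continue  # also found outside the proposed clade
        shared[character] = state
    return shared


print(synapomorphies(ingroup, outgroup, ["taxon_A", "taxon_B"]))
# {'eye-color': 'blue'}
```

Changing the outgroup (for example, to one with blue eyes) changes which states count as derived, which is why the choice of outgroup can alter the topology of the resulting tree.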

A collective set of plesiomorphies is commonly referred to as a ground plan for the clade or clades they refer to; and one clade is considered basal to another if it holds more plesiomorphic characters than the other clade. Usually, a basal group is very species-poor in comparison to a more derived group. Thus, conservative (plesiomorphic) branches, as opposed to anagenetic branches, represent species whose characteristics are closer to those of the ancestor than others.

Possibly, the founder of the genus Mycobacterium was a free-living organism and today’s free-living mycobacterial species (and also some saprophytic species?) represent the conservative branches of founding mycobacteria. The more distant organisms are probably the ones that live in association with various multicellular organisms. It has been suggested that the mycobacteria that created a long-lasting association with marine animals (probably placoderms) are at the root of this phylogenetic branch. Thus, Mycobacterium marinum would stem from the conservative branch, whereas other vertebrate-associated mycobacteria would build the anagenetic branch. Grmek speculates that the association of a mycobacterial species with a marine vertebrate may have occurred during the Upper Devonian (300 million years ago) (Grmek 1994). Figure 2-1 shows the phylogenetic position of the Mycobacterium tuberculosis complex species within the genus Mycobacterium based on a tree of the gene coding for the 16S ribosomal ribonucleic acid (rRNA).


Download of the entire textbook
(Tuberculosis 2007, 687 pages, PDF, 8.3 MB)

In the past, mycobacterial systematics relied on phenotypic characters; more recently, however, genetic techniques have boosted taxonomic studies (Tortoli 2003). The first natural characters used to distinguish between mycobacterial species were growth rate and pigmentation. Rapid growers (< 7 days) are free-living, environmental, saprophytic species, whereas slow growers are usually obligate intracellular, pathogenic species. The slow-fast grower division, which virtually always corresponds to the possession of one or two rRNA operons (rrn operon) (Jy 1994), was shown to be phylogenetically coherent (Stahl 1990, Devulder 2005).

In the ’50s, the hypothesis of co-evolution, or parallel evolution, between hosts and mycobacteria looked no more likely than the alternative hypothesis of «multiple, casual (furtive) introductions» of various saprophytes into different hosts. The traditional epidemiological belief for tuberculosis (TB) is that the anthropozoonosis due to M. tuberculosis may find its origin in a zoonotic agent, i.e., Mycobacterium bovis (Cockburn 1963). This view is still held by some authors (Smith 2006a). However, genetics brought some new clues into the debate (Brosch 2002). For example, the sequencing of the Mycobacterium leprae genome, by its defective nature, confirmed the previous history-driven hypothesis that M. leprae was a younger pathogen than M. tuberculosis (Cole 1998, Cole 2001). In the case of the M. tuberculosis complex, comparative genomics has also shown that the M. bovis genome is smaller than the M. tuberculosis genome, opening the way to a new scenario for the evolution of the tubercle bacillus (Brosch 2002). M. bovis genomic reduction (loss of genes) indeed suggests that it could be a younger pathogen than M. tuberculosis or, in other words, that human TB disease preceded bovine disease (Brosch 2002, Cockburn 1963).

Figure 2-2 shows that the common ancestor of members of the M.
tuberculosis complex is close to three of its branches: “Mycobacterium canettii”, Mycobacterium africanum and the ancestral East-African-Indian (EAI) clade. However, according to Smith et al., “until it is demonstrated that strains of M. africanum subtype I can be maintained in immunocompetent cells, the host-association of the most recent common ancestor of the M. tuberculosis complex remains unsolved” (Smith 2006b).

Figure 2-2: Scheme of the proposed evolutionary pathway of the M. tuberculosis bacilli illustrating successive loss of DNA in certain lineages (reproduced with permission from Brosch et al. 2002)

Ancient humans, bovids and mastodons experienced erosive diseases caused by M. tuberculosis. As an alternative to the classical hypothesis of TB spread being driven by human migration, bovids, mastodons, or simply diet might well be considered to be the natural epidemiological vehicle of TB. In this way, a poorly pathogenic environmental or animal Mycobacterium spp. would have progressively acquired some human-specific virulence traits (Rotschild 2001, Rotschild 2006a). The association of hyperdisease and endemic stability may have promoted a smooth and long-term transition from zoonosis to anthropozoonosis (Coleman 2001, Rotschild 2006b). Other complex anthropological parameters, such as the history of agriculture and livestock domestication, may also have been mediators of TB spread (Smith 1995, Bruford 2003).

In this sense, it is also logical to compare the pathogenicity of the various M. tuberculosis complex members in various host species. Interestingly, it has been observed that M. africanum apparently elicits a more attenuated T cell response to the 6 kiloDalton (kDa) early secreted antigen (ESAT-6) than M. tuberculosis in patients with TB. M. africanum could thus be considered to be an opportunistic human pathogen. If confirmed, these findings are new evidence that strain differences affect human interferon-based T cell responses (de Jong 2006).
Strain-related differences in lymphokine (including interferon-gamma) response in mice with experimental infection were also reported in 2003 (Lopez 2003).

2.2. M. tuberculosis complex population molecular genetics

Until recently, the question of individual genetic variation within the M. tuberculosis complex received little attention and most research on M. tuberculosis was organism- rather than population-centered. The advent of molecular methods, and their widespread use in population studies, introduced both new conceptual and new technological developments. The inference of phylogenies from molecular data goes back to the early ’90s with the development of software such as PHYLIP and PAUP (Felsenstein 1993, Swofford 1990, Swofford 1998). In particular, the study of the M. tuberculosis complex phylogeny closely followed the development of increasing numbers of sophisticated genotyping methods. The way was opened by M. tuberculosis fingerprinting by restriction fragment length polymorphism based on insertion sequence IS6110 (IS6110 RFLP) (van Embden 1993). However, the use of IS6110 RFLP in evolutionary genetics was of limited value for many reasons:

· fast variation rate of this evolutionary marker (de Boer 1999)
· complexity of the forces driving its transposition and risk of genetic convergence (Fang 2001)
· nature of the experimental data produced, which requires sophisticated software for analysis
· difficulty of building large sets of data (Heersma 1998, Salamon 1998)

The discovery in 1993 of the polymorphic nature of the Direct Repeat (DR) locus, and the subsequent development of the spoligotyping method based on DR locus variability, introduced more modern concepts and tools for M. tuberculosis complex genotyping (Groenen 1993, Kamerbeek 1997). Our research group bet that the highly diverse signature patterns observed by spoligotyping could indeed contain phylogenetic signals, and the construction of a diversity database was started de novo (Sola 1999).
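To make the idea of a spoligotype signature database more concrete, here is a minimal sketch of how isolates could be grouped by identical pattern. It assumes the usual representation of a spoligotype as a presence/absence string over 43 spacers, and it treats a pattern found in two or more isolates as a shared type; the isolate names, the patterns, and the function name `shared_types` are hypothetical and do not reproduce the actual database schema.

```python
from collections import defaultdict
from typing import Dict, List

N_SPACERS = 43  # assumption: standard 43-spacer spoligotyping format


def shared_types(isolates: Dict[str, str]) -> Dict[str, List[str]]:
    """Group isolates by spoligotype pattern and keep patterns seen at least twice.

    Each pattern is a string of '1' (spacer present) and '0' (spacer absent).
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, pattern in isolates.items():
        if len(pattern) != N_SPACERS or set(pattern) - {"0", "1"}:
            raise ValueError(f"invalid spoligotype pattern for {name!r}")
        groups[pattern].append(name)
    return {p: sorted(names) for p, names in groups.items() if len(names) >= 2}


# Hypothetical patterns, for illustration only.
pattern_a = "1" * 43
pattern_b = "1" * 20 + "0" * 3 + "1" * 20

isolates = {"iso1": pattern_a, "iso2": pattern_b, "iso3": pattern_b}
result = shared_types(isolates)
print(result[pattern_b])
# ['iso2', 'iso3']
```

In this toy data set `iso1` carries a pattern that no other isolate shares, so it is not assigned to any shared type.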
Today, a total of 62 M. tuberculosis complex clades/lineages are detailed in the Fourth International Spoligotyping Database (SpolDB4) which describes 1,939 shared-types representing a total of 39,295 M. tuberculosis strains from 122 countries (Brudey 2006). This database is available on the internet at SITVIT. Some of the major M. tuberculosis complex clades and their spoligotype signatures are described below under section 2.9. The assumption that the DR locus was neutral still remains speculative; however, the finding of other clustered regularly interspaced short palindromic repeats (CRISPR) loci in both Archaea and Bacteria has become a hot issue (Jansen 2002, Pourcel 2005, Makarova 2006).

Spoligotyping was immediately followed by the discovery of tandem repeat loci in the M. tuberculosis complex and the Variable Number of Tandem Repeats (VNTR) genotyping technique (Frothingham 1998). Later, the Mycobacterial Interspersed Repetitive Units (MIRU) technique (Supply 2001) was developed, which is also designated as Multiple Locus VNTR analysis (MLVA). Multi-Locus Sequence Typing (MLST) was introduced as an alternative method (Baker 2004). More recently, systematic Single Nucleotide Polymorphism (SNP) genotyping (Filliol 2006, Gutacker 2006) was described, followed by Large Sequence Polymorphism (LSP) analysis, the latter performed either by microarray or real-time Polymerase Chain Reaction (PCR) (Mostowy 2002, Tsolaki 2005).

2.3. Co-evolution of M. tuberculosis with its hosts

Simulation models reported in 1988 suggested that a social network with a size of 180 to 440 persons is required for TB to become endemic. In such conditions, host-pathogen coexistence would be maintained in populations (McGrath 1988). The concept of endemic stability, already mentioned above, suggests that an infectious disease may reach an epidemiological state in which the clinical disease is scarce, despite high levels of infection in the population (Coleman 2001).
Clearly, this concept may apply to TB since it is most likely to have been a vertically transmitted disease before being responsible for large outbreaks. The question of how many isolated communities of between 180 and 440 persons may have experienced, sequentially or concomitantly, the introduction of one or more founding genotypes of M. tuberculosis complex (each one with its own specific virulence), in other words, how TB was “seeded”, is of paramount importance. To provide the initial conditions of a dynamic epidemic system we must understand how these early founding genotypes spread in low demographic conditions.

Today, we can observe a phylogeographically structured global epidemic, built as a result of millennia of evolution. Some clones are extinct, others have an increased risk of emergence (Tanaka 2006). The evolution rate of TB is likely to have been successively slow (human and cattle migration and low endemicity or hyperendemicity but little or no disease), then moderate (five centuries of post-Columbus sail-based migration) with important anthropological changes that may have created bursting conditions linked to demographic growth and migration, and lastly, fast (since the introduction of air transportation), i.e. within the five decades of increasing movements of strains and people, concomitantly with new outbreaks in demographically active and resource-poor countries where the great majority of cases are now found. Consequently, the worldwide bacterial genetic snapshot of the TB epidemic is the result of a combination of slow, medium, and fast evolving superimposed pictures of various outbreak histories. Such a jigsaw puzzle will be difficult, if not impossible, to reconstruct. However, looking for rare and isolated genotypes, which may have undergone a slower evolution, as well as searching for ancient deoxyribonucleic acid (DNA) may constitute two complementary scientific strategies in attempting to reach this goal.
One recent success of the first strategy is exemplified by the finding of a peculiar highly genetically diverse “M. canettii” in the Horn of Africa. “M. canettii” was likely to be the source species of the M. tuberculosis complex, rather than just another branch of it (Fabre 2004). Further results confirm that, despite its apparent homogeneity, the M. tuberculosis genome is a composite assembly resulting from horizontal gene transfer events predating clonal expansion. The large amount of synonymous single nucleotide polymorphism (sSNP) variation in housekeeping genes found in the smooth strains of “M. prototuberculosis” suggests that the tubercle bacilli were contemporaneous with early hominids in East Africa, and may have thus been evolving with their human host much longer than previously thought. These results open new perspectives for unraveling the molecular bases of M. tuberculosis evolutionary success (Gutierrez 2005).

The second strategy has also provided interesting results that support the notion of TB’s ancient origin. The isolation and characterization of ancient M. tuberculosis DNA from an extinct bison, dated 17,000 years B.C., suggest the presence of TB in America in the late Pleistocene (Rotschild 2001). The extensive infection of many individuals of the Mammut americanum species with the M. tuberculosis agent also suggests that, apart from Homo sapiens, mastodons and bovids may have spread the disease during the Pleistocene (Rotschild 2006a, Rotschild 2006b). When looking at human remains, several DNA studies served to trace back the presence of TB to Egyptian mummies, where M. tuberculosis and also M. africanum genotypes were identified (Zink 2003). Figure 2-3 shows an ancient Egyptian clay artefact with a characteristic kyphosis suggestive of Pott’s disease.
The presence of TB in America before the arrival of the Spanish settlers is also well demonstrated both by paleopathological evidence and studies on ancient DNA (Salo 1994, Arriaza 1995). Recent paleopathological evidence also suggests the presence of leprosy and TB in South East Asian human remains from the Iron Age (Tayles 2004). Taken together, these results may argue that the limited number of different genogroups that we observe today are likely to stem from those that were seeded in the past, have remained isolated by distance during millennia, and have had time to co-evolve independently before gaining reasonable statistical chances to meet.

Figure 2-3: Egyptian clay artefact of an emaciated man with a characteristic angular kyphosis suggestive of Pott’s disease (reproduced from TB, Past, Present, 1999, TB Foundation)

2.4. M. tuberculosis through space and time

The concept of phylogeography was originally introduced by Avise (Avise 1987), as “the history of processes that control the geographic distribution of genes and lineages by constructing the genealogies of populations and genes”. The term was introduced as a way to bridge population genetics and molecular ecology and to describe geographically structured signals within species. This concept might well be applied to studies on the global spread of M. tuberculosis through time. If the ancestor of M. tuberculosis adapted specifically and slowly to human beings, it may have had the time to develop, via an extreme clonality, a deeply rooted and peculiar phylogeographical structure reflecting both the demographic history and the history of TB spread.

The geographic distribution of bacteriophage types was the only method to detect the geographic subdivision of the M. tuberculosis complex species during the ’70s and the ’80s (Bates 1969, Sula 1973); however, no phylogenetic relationships could be inferred at that time using mycobacteriophages. A numerical analysis of M.
africanum taxonomy also suggested differences between isolates from West and East Africa (David 1978). The naming of two M. africanum variants (subtype I and II) created confusion, and the status of M. africanum as a homogeneous sub-species of the M. tuberculosis complex is still uncertain. The existence of some major geographically and epidemiologically significant genetic variants of the M. tuberculosis complex was also recognized as early as 1982 (Collins 1982). Among these were the Asian, the bovine and the classical variants, in addition to africanum I and africanum II variants.

Lateral genetic transfer was presumed to be minor in M. tuberculosis, and the clonal structure of the M. tuberculosis complex was formally demonstrated by the finding of strong linkage disequilibrium within MIRU loci (Supply 2003). Only recently has the issue of M. tuberculosis complex lateral genetic transfer gained interest, particularly in regard to its links to genetic diversity and to potential acquisition of virulence (Kinsella 2003, Rosas-Magallanes 2006, Alix 2006). Knowing the extent of lateral genetic transfer in a species’ history is of primary importance to better understand its specificity. As for the members of the M. tuberculosis complex, with the exception of M. canettii, there is no evidence for this kind of transfer or for housekeeping gene recombination (Smith 2006a). However, recent evidence argues in favor of the existence of lateral genetic transfer in the precursor of the M. tuberculosis complex, and in favor of environmental mycobacteria being the source of certain genetic components in the M. tuberculosis complex. These findings reinforce the idea that the ancestor of the M. tuberculosis complex was an environmental Mycobacterium (Rosas-Magallanes 2006). Another source of exogenous DNA may be plasmids that have been shown to be present in modern species of mycobacteria, and sometimes to carry virulence genes (Le Dantec 2001, Stinear 2000, Stinear 2004).
The mosaic nature of the genome of ancestral “M. prototuberculosis” species also argues in favor of numerous gene transfer events and/or homologous recombination within ancient species of the M. tuberculosis complex (Gutierrez 2005).

2.5. Looking for robust evolutionary markers

When looking for robust evolutionary markers, the evolutionist will first choose markers that are presumably neutral in order to avoid debates on function or potential selection, whether positive or stabilizing. For the M. tuberculosis complex, the very existence of an obligate intracellular life, which provides a stable chemical and metabolic environment, suggests that a classical metabolic selection scheme must have played a minimal role in the evolution of the M. tuberculosis complex genome (Musser 2000). Host specialization and niche adaptation may have been more important. Changes towards acquisition of an intracellular life style may also be responsible for loss of function and hence, loss of genes.

Silent mutations in housekeeping genes were the first candidates to be selected as evolutionary markers. However, the amount of genetic diversity found in the genes selected in the original study was unexpectedly low, which led to the hypothesis that TB had spread only recently from a unique precursor. Indeed, the rate of genetically neutral synonymous mutations (sSNP) was shown to be as low as 1/10,000, whereas non-synonymous mutations (nsSNP) outnumbered sSNPs by almost 2 to 1 (Sreevatsan 1997).

As for spoligo- and MIRU typing, at first glance it seems reasonable to consider these markers as neutral. No evident role for the DR locus, a member of CRISPR sequences, has been proven yet; however, there is an increased interest in CRISPR and the CRISPR-associated genes cas, which may mean to the bacterial world what silencing RNAs mean for the eukaryotic world (Makarova 2006).
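The distinction between synonymous and non-synonymous SNPs used in this section can be illustrated with a short sketch. It is not the procedure of the cited studies: the function name `classify_snp` and the example codons are hypothetical, and the code simply compares the amino acid encoded before and after a single-base change, using the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon: str, position: int, new_base: str) -> str:
    """Return 'sSNP' if the substitution keeps the amino acid, else 'nsSNP'.

    codon    -- reference codon on the coding strand, e.g. 'CTG'
    position -- 0-based position of the substituted base within the codon
    new_base -- the variant base
    """
    mutated = codon[:position] + new_base + codon[position + 1:]
    if mutated == codon:
        raise ValueError("the variant base is identical to the reference base")
    same_residue = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "sSNP" if same_residue else "nsSNP"


print(classify_snp("CTG", 2, "A"))  # CTG (Leu) -> CTA (Leu)
# sSNP
print(classify_snp("AGC", 1, "C"))  # AGC (Ser) -> ACC (Thr)
# nsSNP
```

Because synonymous changes leave the protein unchanged, they are the ones usually treated as approximately neutral when selecting evolutionary markers.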
Apart from the senX3-regX3 two-component system, which was presumably involved in virulence, the function of MIRU loci remains poorly investigated (Parish 2003). In all cases, the phylogenetic information content obtained by studying the DR and the VNTR loci was previously shown to be rich (D. Falush 2003 – Prague, European Concerted Action Meeting, unpublished data).

2.6. Why repeated sequences were so useful at the beginning

The description of repeated sequences goes back to the early age of molecular biology (Britten 1968). Their role in the selection of new vital functions in life is indeed of paramount importance for genetic evolution (Britten 2005). In the M. tuberculosis complex, repetitive DNA sequences were used as probes and proved to be useful for fingerprinting strains in epidemiological studies (Eisenach 1988). Shortly after the characterization of the insertion sequence IS6110 (Thierry 1990), an international consensus method, IS6110 RFLP, was adopted almost concomitantly with the World Health Organization declaration of TB as a public health emergency (van Embden 1993). IS6110 RFLP changed the traditional belief that no more than 10 % of TB cases were due to recent transmission, and sparked a new hope for disease eradication by contributing to the adequate surveillance and prevention of TB transmission (Alland 1994, Small 1994).

For diverse reasons, however, the use of IS6110 was of little help in solving the phylogenetic structure of the M. tuberculosis complex because it turned out to be a poor phylogenetic marker (Fleischmann 2002). A rapidly emerging issue was that IS6110 was ineffective in a large part of the world, including South-East Asia (Fomukong 1994). Another insertion sequence, IS1081, was also suggested as an interesting potential phylogenetic marker; however, its generalized use in M. tuberculosis complex population genetics was also hampered, among other reasons, by the RFLP format (van Soolingen 1997, Park 2000).

2.7. Regions of difference (RDs) and SNPs in M. tuberculosis

One approach to understanding the molecular evolution of the M. tuberculosis complex and looking for virulence genes is to identify regions of difference (RD) between M. tuberculosis complex genomes (Inwald 2003) or to look for Single Nucleotide Polymorphisms (SNPs). Subtractive genomic hybridization was initially used to identify three distinct genomic regions between virulent M. bovis, M. tuberculosis, and the avirulent M. bovis bacille Calmette-Guérin (BCG) strain, designated respectively as RD1, RD2, and RD3 (Mahairas 1996). One of these regions, RD1, was shown to contain important virulence genes including the two immunodominant T-cell antigens ESAT6 and culture filtrate protein 10 (CFP10) (Pym 2002). In another study (Gordon 1999), restriction-digested bacterial artificial chromosome (BAC) arrays of the H37Rv strain were used to reveal the presence of 10 regions of difference between M. tuberculosis and M. bovis (RD1 to RD10), 7 of which (RD4-RD10) were deleted in M. bovis. The deletion pattern of M. africanum is closer to that of M. tuberculosis than to the pattern of M. bovis (Gordon 1999).

Brosch et al. analyzed the distribution of 20 variable regions resulting from insertion-deletion events in the genome of the tubercle bacilli in one hundred strains belonging to all sub-species of the M. tuberculosis complex (Brosch 2002). The authors showed that the majority of these polymorphisms resulted from ancient irreversible genetic events in common progenitor cells, the so-called Unique Event Polymorphisms (UEP). Based on the presence or absence of an M. tuberculosis specific deletion 1 (TbD1, a 2 kb sequence), M. tuberculosis can be divided into “ancient” TbD1 positive and “modern” TbD1 negative strains. This classification superimposes well with the previous principal genetic group (PGG) classification (Sreevatsan 1997); however, only two groups of strains, the EAI and the M.
africanum strains are TbD1 positive. The RD9 deletion identifies an evolutionary lineage represented by M. africanum, M. microti and M. bovis that diverged from the progenitor of the present M. tuberculosis strains before TbD1 occurred (Brosch 2002). These findings contradict the long-held belief that M. tuberculosis evolved from a precursor of M. bovis, suggesting a new evolutionary scenario of the M. tuberculosis complex. Since M. canettii and other ancestral M. tuberculosis complex strains lack none of these regions, they are supposed to be direct descendants of the tubercle bacilli that existed before the M. africanum-M. bovis lineage separated from the M. tuberculosis lineage (Brosch 2002).

This scenario was confirmed in a follow-up study in which in silico and macroarray-based hybridization experiments established the existence of a core set of 219 conserved genes shared by M. leprae and M. tuberculosis. Among these new phylogenetic markers is the pks 15/1 gene, which encodes one of the polyketide synthase enzymes required for the lipid metabolism of cell wall building. All modern strains show a 7-base pair (bp) frameshift deletion in this gene that induces a knock-out of the enzyme. M. canettii, most PGG1 ancestral EAI, and Beijing strains add two amino acids that do not interfere with pks function, whereas strains in the M. bovis lineage bear a 6-bp DNA deletion that involves deletion of these two extra amino acids (Constant 2002).

Three recent studies provide landmarks in TB molecular and phylogenetic population studies. The first one suggests the existence of six phylogeographical lineages, each associated with specific sympatric human populations (Gagneux 2006). These observations suggest that mycobacterial lineages are adapted to particular human populations. Whether these results are considered from either a “splitter” or from a “gatherer” perspective, they endorse the idea that there are probably just a small number of founding genogroups of the M.
tuberculosis complex. Also, these results support previous results on M. tuberculosis complex genetic diversity and our hypothesis that the M. tuberculosis complex is an ancient pathogen that co-evolved with its hosts (Sola 2001a, 2001b, Sebban 2002).

Two SNP-population-based phylogenies also provided similar results, i.e. a limited number of M. tuberculosis complex phylogeographical genogroups (Figure 2-4). According to a study led by Musser’s group, eight deeply branching genetic groups (I to VIII) were found; however, this was still not representative of the worldwide genetic diversity of M. tuberculosis because of a biased sampling, e.g., lack of Central Asian (CAS) strains (Gutacker 2002). A second study corrected this bias by creating one new subgroup for the CAS lineage (Gutacker 2006). This lineage is close to the root, which suggests that the Indian subcontinent played a major role in TB evolution and expansion.

Figure 2-4: Phylogenetic tree obtained on SNPs, adapted from Gutacker et al. 2006 and supplemental data. In blue: spoligotyping-based nomenclature or characteristics. In red: IS6110-based clade nomenclature with some characteristic IS6110 copy number or molecular weight data. In green: Musser’s principal genetic group (Sreevatsan 1997). In black: SNP-based designation of clades with some characteristic strains (CDC1551, H37Rv, strain 210).

Similar results were obtained independently by Alland et al., reinforcing the idea that unrelated lineages may acquire the same number of IS6110 copies by homoplasy (Alland 2003). The same group recently analyzed 212 SNPs in correlation with MIRU and spoligotyping on a worldwide representative collection of clinical isolates. Their results are illustrated in Figures 2-5 (A to C). The M. tuberculosis complex tree presented four main branches containing six SNP cluster groups (SCG1 to SCG6) and five subgroups as depicted in Figure 2-5 B (Filliol 2006).
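SNP-based trees of this kind start from a comparison of SNP profiles between isolates. The sketch below shows two elementary steps under simplifying assumptions: isolates with identical profiles are given the same sequence type (ST) number, and pairwise distances are counted as the number of differing SNP positions (Hamming distance). The profiles and function names are hypothetical; a tree-building method such as neighbor joining would take the resulting distances as input and is not implemented here.

```python
from itertools import combinations
from typing import Dict, Tuple


def sequence_types(profiles: Dict[str, str]) -> Dict[str, int]:
    """Give the same ST number to isolates that share an identical SNP profile."""
    st_of_profile: Dict[str, int] = {}
    assignment: Dict[str, int] = {}
    for isolate in sorted(profiles):
        profile = profiles[isolate]
        # A new profile receives the next unused ST number.
        st = st_of_profile.setdefault(profile, len(st_of_profile) + 1)
        assignment[isolate] = st
    return assignment


def hamming(a: str, b: str) -> int:
    """Count the SNP positions at which two profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same SNP positions")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(profiles: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Return pairwise Hamming distances, keyed by sorted isolate pairs."""
    return {
        (i, j): hamming(profiles[i], profiles[j])
        for i, j in combinations(sorted(profiles), 2)
    }


# Hypothetical profiles: one base per SNP position.
profiles = {"iso1": "ACGTA", "iso2": "ACGTA", "iso3": "ACGTC", "iso4": "TCGTC"}

print(sequence_types(profiles))
# {'iso1': 1, 'iso2': 1, 'iso3': 2, 'iso4': 3}
print(distance_matrix(profiles)[("iso1", "iso4")])
# 2
```

Since the SNP positions are chosen from a small number of reference genomes, the distances obtained this way inherit any bias in that choice.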
These results provide good congruence with spoligotyping and, to a lesser extent, with MIRU12, endorsing the latest genetic diversity studies on spoligotyping (Brudey 2006). Still, it can be argued that in both SNP-based studies, an identical bias could have been introduced since the SNPs analyzed in both cases were selected based on the four M. tuberculosis complex genome sequences available to date: M. tuberculosis strains 210, CDC1551, H37Rv and M. bovis strain AF2122.

Figure 2-5, A to C: (From Filliol et al. 2006 J. Bacteriol., reproduced with permission).
A: A distance-based neighbor-joining tree on 159 sSNPs resolves the 219 M. tuberculosis complex isolates into 56 sequence types (ST). STs are indicated by a dot with a numerical value and a color code for the SNP Cluster Group (SCG) to which they belong.
B: Model-based neighbor-joining tree based on a data set with 212 SNPs, which resolves 327 M. tuberculosis complex isolates into 182 STs with identical clusters (compare with A). SNP Cluster Groups are indicated by colors. Principal Genetic Groups (1 to 3) are also highlighted.
C: Distribution of the spoligotype clades on the SNP-based phylogeny.

Table 2-1 provides a nomenclature correlation between M. tuberculosis complex groups defined by spoligotyping and those defined by sSNPs. As shown in this table, the most ancient clade, EAI, defines SCG 1 or sSNP-I according to Alland’s or to Musser’s designation, respectively. SCG 2 and sSNP-II define the Beijing lineage. SCG 3a or sSNP-IIA defines the CAS or Delhi genogroup. SCG 3b or sSNP-III defines the Haarlem family of strains. SCG 3c and SCG 4, or sSNP-IV and sSNP-V, define the “IS6110 European low-banders” or X genogroup (Sebban 2002, Dale 2003, Warren 2004). SCG 5 or sSNP-VI is mainly constituted by the Latin American and Mediterranean (LAM) genogroup (Sola 2001a). SCG 6a and SCG 6b (sSNP-VII and sSNP-VIII) define the poorly characterized Principal Genetic Group 3 lineage that also includes some ill-defined T genotypes (Filliol 2002).
Last but not least, SCG 7 defines the bovine and seal M. tuberculosis complex subspecies whereas no counterpart is provided in Musser’s classification (Filliol 2006).

Table 2-1: Comparison of spoligotype and SNP terminology

PGG (Sreevatsan 1997)   Spoligotyping-based (Filliol 2003)   SCG-based (Filliol 2006)   SNP-based (Gutacker 2006)
PGG1                    EAI                                  SCG 1                      sSNP-I
PGG1                    Beijing                              SCG 2                      sSNP-II
PGG1                    CAS                                  SCG 3a                     sSNP-IIA
PGG1                    Bovis                                SCG 7                      M. tuberculosis complex
PGG2                    Haarlem                              SCG 3b                     sSNP-III
PGG2                    X1                                   SCG 3c                     sSNP-IV
PGG2                    X1, X2, X3                           SCG 4                      sSNP-V
PGG2                    LAM                                  SCG 5                      sSNP-VI
PGG3                    T (Miscellaneous)                    SCG 6                      sSNP-VII, sSNP-VIII

PGG = Principal Genetic Group
EAI = East African Indian
SCG = SNP cluster group
CAS = Central Asian (or Delhi)
SNP = Single nucleotide polymorphism

2.8. Looking for congruence between polymorphic markers

The concept of molecular clock, attributed to Zuckerkandl and Pauling in 1962, was originally based on hemoglobin evolution and later generalized to DNA evolution (Zuckerkandl 1987). As for M. tuberculosis, we are dealing with polymorphic markers, i.e. repeated sequences, which ar