Chapter 4: Genomics and Proteomics
by Patricia Del Portillo, Alejandro Reyes, Leiria Salazar, María del Carmen Menéndez and María Jesús García
4.1. Impact of new technologies on Mycobacterium tuberculosis genomics
A new wave in the analysis of the physiological secrets of microorganisms started more than a decade ago with the reading of the first complete genome sequence, corresponding to the bacterium Haemophilus influenzae (Fleischmann 1995). Nowadays, access to hundreds of bacterial genome sequences has changed our way of studying the bacterial world, including bacterial pathogens such as M. tuberculosis.
The overwhelming information displayed by genome sequences started the era of “omics” technologies. These technologies suit the fast pace of current research. A quick search in PubMed, limited to the last 10 years, returned more than 27,000 papers devoted to “omics” issues: more than three thousand concerning bacteria, and almost three hundred concerning Mycobacterium tuberculosis. Up to five different “omics” methodologies have been described so far, all concerning the global study of the target organism, analyzing all its genes, transcriptional products, proteins, etc.
- Genomics involves the study of all genes that are present in the genomes
- Transcriptomics concerns the analysis of the cellular functions at the messenger ribonucleic acid (mRNA) level
- Proteomics refers to the detection and identification of all proteins in a cell
- Metabolomics comprises the complete set of all metabolites formed by the cell and its association with its metabolism
- Fluxomics compares the cellular networks (Fiehn 2003, Nielsen 2005)
In the tuberculosis (TB) field, only papers concerning genomics, transcriptomics, and proteomics have been published. Integration of data derived from the several “omics” by bioinformatics will probably allow a rational insight into M. tuberculosis biology and its interactions with the host, leading to true control of the disease.
Undoubtedly, the biggest step in our knowledge of TB during the last decade was the description of the complete genome sequence of the laboratory reference M. tuberculosis strain H37Rv (Cole 1998a). For example, the identification of genes involved in the bacterial cell wall biosynthesis, the routes for lipid metabolism, the location of insertion sequences and the variability in the PE_PPE genes allowed scientists to merge the fragments of knowledge derived from the pre-genomic era in a more comprehensive way. The sequence of the genome, and its comparison to sequences of other microorganisms reported in several databases, allowed the assignment of precise functions to 40 % of the predicted proteins and the identification of 44 % of orthologues (genes with very similar functions in a different species), leaving 16 % as unique unknown proteins.
The elucidation of complete genome sequences and the development of microarray-based comparative genomics have been powerful tools in the progress of new areas by the application of robotics to basic molecular biology. Comparative genomics and genomic tools have also been used to identify factors associated with the pathogenicity of M. tuberculosis, such as virulence factors and genes involved in persistence of the pathogen in host cells. Moreover, these tools allowed a description of the evolutionary scenario of the genus (see Chapter 2).
Structural genomics was the starting point. As more accurate technologies became available, interest shifted to functional genomics. Thus, information on specific mRNA actively synthesized by bacteria inside macrophages or during in vitro starvation opened ways to the analysis of gene expression. Microarray technology was applied to the detection of global gene activity in M. tuberculosis under several environmental conditions. However, bacterial function cannot be understood by looking at the mRNA level alone.
A major barrier for genomic studies has been the great number of genes with unknown function that have been identified. Up to 60 % of the open reading frames (ORFs) had unknown functions after the initial annotation of genes (identification of the protein revealed by the corresponding ORF’s amino acid sequence) (Cole 1998a). The elucidation of protein function became possible with the global analysis of bacterial proteins, giving insights into the functional role of several previously unknown proteins. Thanks to the joint contributions of biochemical techniques and mass spectrometry, up to 1,044 non-redundant proteins were reported in different cellular fractions (Mawuenyega 2005). The upcoming task will be to assign them all a functional role. As more results are obtained from the proteomic analysis, it is expected that the function of more ORFs will be unveiled with the aid of new data on transcriptomics and proteomics.
Genomics and other molecular tools allowed studies on gene expression and regulation, which were unthinkable years ago. M. tuberculosis is a pathogen restricted to humans; therefore it must have developed mechanisms enabling its quick and efficient adaptation to a variety of “intra-human” environments, which are, in fact, its natural habitat. Understanding how the bacillus regulates its different genes according to environmental changes will probably lead to the comprehension of many interesting aspects of M. tuberculosis, including latency and host-adaptation.
This chapter will address the general basics, as well as the state of the art in genomics, transcriptomics and proteomics in relation to M. tuberculosis. Finally, a general overview will be given of lipids, the most peculiar metabolites of this bacterium.
4.2. M. tuberculosis genome
4.2.1. Genomic organization and genes
TB research made huge progress with the availability of the genome sequence of the type strain M. tuberculosis H37Rv (Cole 1998a). Expectations were generated on the elucidation of some unique characteristics of the biology of the tubercle bacillus, such as its characteristic slow growth, the nature of its complex cell wall, certain genes related to its virulence and persistence, and the apparent stability of its genome. This first available genome sequence of a pathogenic M. tuberculosis strain helped to answer some of these questions and, what is even more stimulating, to open many more. We describe herein the main characteristics of the M. tuberculosis genome sequences completed thus far and highlight some of the most interesting questions answered and opened with this advance in TB research.
M. tuberculosis H37Rv (Cole 1998a) was revealed to possess a sequence of 4,411,529 bp, the second largest microbial genome sequenced at that time. The characteristically high guanine plus cytosine (G+C) content (65.5 %) was found to be uniform along most of the genome, confirming the hypothesis that horizontal gene transfer events are virtually absent in modern M. tuberculosis (Sreevatsan 1997). Only a few regions showed a skew in this G+C content. A conspicuous group of genes with a very high G+C content (> 80 %) appears to be unique to mycobacteria and belongs to the family of PE or PPE proteins. Conversely, the few genes with particularly low (< 50 %) G+C content are those coding for transmembrane proteins or polyketide synthases.
This deviation to low G+C content is believed to be a consequence of the required hydrophobic amino acids, essential in any transmembrane domain, that are coded by low G+C content codons.
Fifty genes were found to code for functional RNAs. As previously described (Kempsell 1992), there was only one ribosomal RNA operon (rrn). This operon was found to be located at 1.5 Mbp from the origin of replication (oriC locus). Most eubacteria have more than one rrn operon located much closer to the oriC locus to exploit the gene-dosage effect during replication (Cole 1994). The possession of a single rrn operon in a position relatively distant from oriC has been postulated to be a factor contributing to the slow growth phenotype of the tubercle bacillus (Brosch 2000a).
One of the most thoroughly studied characteristics of M. tuberculosis is the presence and distribution of insertion sequences (IS). Of particular interest is IS6110, a sequence of the IS3 family that has been widely used for strain typing and molecular epidemiology due to its variation in insertion site and copy number (van Embden 1993, see Chapter 9). Sixteen copies of IS6110 were identified in the genome of M. tuberculosis H37Rv; some IS6110 insertion sites were clustered in sites named insertional hot-spots. The same strain was found to harbor six copies of the more stable IS1081, an insertion sequence that yields almost identical profiles in most strains when analyzed by Restriction Fragment Length Polymorphism (RFLP) (Sola 2001, Kanduma 2003). Another 32 different insertion sequences were found, of which seven belonged to the 13E12 family of repetitive sequences; the other insertion sequences had not been described in other organisms (Cole 1998b). Virtually all the ISs found in M. tuberculosis so far belong to previously described IS families (Chandler 2002). The only exception is IS1556, which does not fit into any known IS family (Cole 1999).
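The idea of an insertion sequence's copy number and insertion sites can be illustrated with a minimal sketch that scans both strands of a sequence for exact copies of an element. The sequences below are invented toy strings, not IS6110 or any real genome, and the function names are hypothetical; real analyses rely on alignment tools that tolerate mismatches and truncated copies.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = str.maketrans("ACGT", "TGCA")
    return seq.upper().translate(complement)[::-1]


def find_copies(genome: str, element: str) -> list[tuple[int, str]]:
    """List (position, strand) for every exact copy of `element` in `genome`.

    Positions are 0-based and refer to the forward strand. An element that
    equals its own reverse complement would be reported once per strand.
    """
    genome = genome.upper()
    patterns = {"+": element.upper(), "-": reverse_complement(element)}
    hits = []
    for strand, pattern in patterns.items():
        start = genome.find(pattern)
        while start != -1:
            hits.append((start, strand))
            # Continue one base further so overlapping copies are also found.
            start = genome.find(pattern, start + 1)
    return sorted(hits)


# Toy data (assumption): two forward copies and one reverse-strand copy.
toy_genome = "TTACGGTTTTACCGTTTACGGTAA"
toy_element = "ACGGT"
copies = find_copies(toy_genome, toy_element)
print(len(copies), copies)
```

Comparing the list of positions between two strains gives a crude picture of why both the number of copies and their locations can serve as typing markers.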
Two prophages were detected in the genome sequence; both are similar in length and also similarly organized. One is the prophage PhiRv1, which in the M. tuberculosis H37Rv genome interrupts a repetitive sequence of the family 13E12. This prophage is deleted or rearranged in other M. tuberculosis strains (Fleischmann 2002). The genome of M. tuberculosis possesses seven potential att sites for PhiRv1 insertion, which explains the variability of its position between strains (Cole 1999). The second prophage, PhiRv2, has proven to be much more stable, with less variability among strains (Cole 1999).
Regarding protein coding genes, it was determined that M. tuberculosis H37Rv codes for 3,924 ORFs accounting for 91 % of the coding capacity of the genome (Cole 1998a). The alternative initiation codon GTG is used in 35 % of cases compared to 14 % or 9 % in Bacillus subtilis or Escherichia coli respectively. This contributes to the high G+C bias in the codon usage of mycobacteria. A bias in the overall orientation of genes with respect to the direction of replication was also found. On average, bacteria such as B. subtilis have 75 % of their genes in the same orientation as that of the replication fork, while M. tuberculosis only has 59 %. This finding has led to the hypothesis that such a bias could also be part of the slow growing phenotype of the tubercle bacillus (Cole 1999). This conjecture, however, does not take into account the fact that E. coli, a bacterium that grows much faster than M. tuberculosis, has only 55 % of its genes oriented in the direction of replication (Li 2005).
The proteins predicted from the ORFs have been classified into 11 broad functional groups (Table 4-1) and, more precisely, into COG functional categories (http://www.ncbi.nlm.nih.gov/sutils/coxik.cgi?gi=135) according to the National Center for Biotechnology Information (NCBI) of the United States (US). The analysis of the codon usage showed a preference for G+C-rich codons.
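The two quantities discussed above, G+C content and the share of genes starting with GTG rather than ATG, are simple to compute once sequences are available. The following is a minimal sketch on invented toy sequences; the function names and the window size are assumptions made for illustration and do not reproduce the procedure used in the cited studies.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def windowed_gc(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """G+C percentage in successive windows, to spot regions that deviate."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def start_codon_usage(genes: list[str]) -> dict[str, float]:
    """Percentage of genes beginning with each start codon."""
    counts = Counter(gene[:3].upper() for gene in genes)
    total = sum(counts.values())
    return {codon: 100.0 * n / total for codon, n in counts.items()}


# Toy coding sequences (assumption): two start with ATG, two with GTG.
toy_genes = ["ATGAAA", "GTGCCC", "ATGGGG", "GTGTTT"]
print(gc_content("ATGC"))              # half of the bases are G or C
print(windowed_gc("GGGGAAAA", 4, 4))   # one G+C-rich and one G+C-poor window
print(start_codon_usage(toy_genes))
```

Applied along a whole genome, the windowed profile is one simple way to look for the few regions whose G+C content departs from the average.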
It was also found that the number of genes that arose by duplication is similar to the number seen in E. coli or B. subtilis, but the degree of conservation of duplicated genes is higher in M. tuberculosis. The lack of divergence of duplicated genes is consistent with the hypothesis of a recent evolutionary descent or a recent bottleneck in mycobacterial evolution (Brosch 2002, Sreevatsan 1997, see Chapter 2).
From the genome sequence it is clear that M. tuberculosis has the potential to switch from one metabolic route to another, including aerobic (e.g. oxidative phosphorylation) and anaerobic respiration (e.g. nitrate reduction). This flexibility is useful for survival in the changing environments within the human host, which range from high oxygen tension in the lung alveolus to microaerophilic/anaerobic conditions within the tuberculous granuloma. Another characteristic of the M. tuberculosis genome is the presence of genes for synthesis and degradation of almost all kinds of lipids, from simple fatty acids to complex molecules such as mycolic acids. In total, there are genes encoding 250 distinct enzymes involved in fatty acid metabolism, compared to only 50 in the genome of E. coli (Cole 1999).
Concerning transcriptional regulation, M. tuberculosis encodes 13 putative sigma factors and more than 100 regulatory proteins (see section 4.3 of this chapter).
Among the most interesting protein gene families found in M. tuberculosis are the PE and PPE multigene families, which account for almost 10 % of the genome capacity. The names PE and PPE derive from the motifs of Pro-Glu (PE) and Pro-Pro-Glu (PPE) found near the protein N-terminus in most cases. These proteins are believed to play an important role in survival and multiplication of mycobacteria in different environments (Marri 2006).
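As a rough illustration of the naming rule just described, the sketch below labels a protein by looking for the Pro-Pro-Glu or Pro-Glu motif near its N-terminus, using one-letter amino acid codes. The 20-residue search window, the function name and the toy sequences are assumptions made for the example; actual family assignment relies on similarity across the whole conserved N-terminal domain, not on the short motif alone.

```python
def classify_pe_ppe(protein: str, n_terminal_window: int = 20) -> str:
    """Return 'PPE', 'PE' or 'other' from a motif near the N-terminus.

    The window size is an illustrative assumption, not a published cutoff.
    """
    head = protein.upper()[:n_terminal_window]
    # Test the longer motif first, because every 'PPE' also contains 'PE'.
    if "PPE" in head:
        return "PPE"
    if "PE" in head:
        return "PE"
    return "other"


# Toy sequences (assumption), written in one-letter amino acid code.
toy_proteins = {
    "toy_1": "MSFVIAAPEALAAAA",
    "toy_2": "MVDFGALPPEINSARMY",
    "toy_3": "MKTAYIAKQR",
}
labels = {name: classify_pe_ppe(seq) for name, seq in toy_proteins.items()}
print(labels)
```

Because the motif is only two or three residues long, it occurs by chance in many unrelated proteins, which is why the sketch restricts the search to the start of the sequence and should still be read as a naming aid and not as a classifier.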
There are about 100 members of the PE family, which is further divided into three sub-families, the most important of which is the polymorphic GC-rich sequences (PGRS) class that contains 61 members. Proteins in this class contain multiple tandem repetitions of the motif Gly-Gly-Ala; hence, their glycine content exceeds 50 %. The PE_PGRS proteins have been found to be exclusive to the M. tuberculosis complex (Marri 2006) and resemble the Epstein-Barr virus nuclear antigens (EBNA), which are known to inhibit antigen presentation through the major histocompatibility complex (MHC) class I (Cole 1999).
Interestingly, the analysis of the deoxyribonucleic acid (DNA) metabolic system of M. tuberculosis indicates a very efficient DNA repair system, in other words, replication machinery of exceptionally high fidelity. The genome of M. tuberculosis lacks the MutS-based mismatch repair system. However, this absence is overcome by the presence of nearly 45 genes related to DNA repair mechanisms (Mizrahi 1998), including three copies of the mutT gene. This gene encodes the enzyme in charge of removing oxidized guanines whose incorporation during replication causes base-pair mismatching (Mizrahi 1998, Cole 1999).
With the aim of making the information publicly available and the search and analysis of information easier, the Pasteur Institute (http://www.pasteur.fr/recherche/unites/Lgmb/) has created a database system incorporating not only all genes and annotation but also search tools such as Blast or FastA, which allow the user to search for sequences homologous to a query sequence inside the M. tuberculosis genome. This database is freely available for use on the Internet and is known as the TubercuList Web Server (http://genolist.pasteur.fr/TubercuList/). As more information was generated, databases grew bigger, more experimental information became available, and better and more accurate algorithms for gene identification and prediction were released.
The initial genome annotation of the M. tuberculosis H37Rv strain soon became out of date. For this reason, a re-annotation of that genome sequence was published in 2002. This re-annotation incorporated 82 additional genes. The gene nomenclature was not altered; each new gene takes the name of the preceding gene followed by A, B or D. For example, two new ORFs were described between Rv3724 and Rv3725; hence, they were named Rv3724A and Rv3724B. The letter C was not used since it usually stands for “complementary”, which means that the gene is located in the complementary strand. As expected, the classes that exhibited the greatest numbers of changes were the unknown category and the conserved hypothetical category (Table 4-1). The re-annotation of the genome sequence allowed the identification of four sequencing errors, changing the sequence size from 4,411,529 to 4,411,532 bp (Camus 2002). As shown in Table 4-1, the information obtained from a single sequenced genome is enormous. The advances made on the analysis of such information have accelerated TB research.
Table 4-1: Functional classification of M. tuberculosis H37Rv and re-annotation*
| Class | Function | Number of genes (1998) | Number of genes (2002) |
| 0 | Virulence, detoxification, adaptation | 91 | 99 |
| 1 | Lipid metabolism | 225 | 233 |
| 2 | Information pathways | 207 | 229 |
| 3 | Cell-wall and cell processes | 516 | 708 |
| 4 | Stable RNAs | 50 | 50 |
| 5 | Insertion sequences and phages | 137 | 149 |
| 6 | PE and PPE proteins | 167 | 170 |
| 7 | Intermediary metabolism and respiration | 877 | 894 |
| 8 | Proteins of unknown function | 606 | 272 |
| 9 | Regulatory proteins | 188 | 189 |
| 10 | Conserved hypothetical proteins | 910 | 1,051 |
* Data taken from Fleischmann 2002
4.2.2. Comparative genomics
In recent times, new technologies have been developed at an overwhelming pace, in particular those related to sequencing and tools for genome sequence data management, storage and analysis.
As of April 2007, 484 microbial genomes have been finished and projects are underway to sequence another 1,155 microorganisms (http://www.genomesonline.org/gold.cgi). Mycobacteria are not an exception in this titanic genome-sequencing race; since 1998, when the first mycobacterial genome sequence was published (Cole 1998a), many genome projects have been initiated. As of April 2007, 34 projects on the genome sequencing of different mycobacterial species are finished or in progress. Of these, 15 are directed towards M. tuberculosis strains, and 5 towards other members of the M. tuberculosis complex. This information will be invaluable to improve the knowledge about M. tuberculosis in the next few years. Currently, there are only two M. tuberculosis (H37Rv and CDC1551) and two M. bovis (AF2122/97 and BCG Pasteur) genome sequences annotated and published. For this reason, these are the strains that have been used as reference strains for comparative genomics both in vitro and in silico.
The pioneering in vitro assays of comparative mycobacterial genomics compared restriction profiles obtained with low frequency restriction enzymes and pulsed-field gel electrophoresis (PFGE). These studies allowed a rough analysis of differences among M. bovis bacille Calmette-Guérin (BCG) isolates (Zhang 1995) and, most importantly, contributed to the construction of the first physical maps, which were essential for the generation of the first genome sequence (Philipp 1996). The next step in comparative genomics was the use of genomic subtractive hybridization or bacterial artificial chromosome hybridization for the identification of regions of difference among the strains under analysis (Mahairas 1996, Gordon 1999).
Mahairas et al. (Mahairas 1996) used subtractive hybridization to identify regions of difference that account for the avirulent phenotype of the vaccine strain M. bovis BCG.
As a result of their studies, they identified three regions of difference (RD1-RD3) in the genome of M. tuberculosis H37Rv that appeared to be absent from M. bovis BCG. Further studies of these regions showed that RD3 corresponded to the prophage PhiRv1, a sequence that has been shown to vary among M. tuberculosis clinical isolates and laboratory strains (see section 4.2.1). RD2 was only deleted in isolates of M. bovis BCG that were re-cultured after 1925. Finally, RD1 turned out to be the only sequence deleted from all M. bovis BCG strains and present in pathogenic strains. However, complementation assays did not reconstitute the full virulent phenotype in M. bovis BCG (Mahairas 1996). The RD1 region contains eight ORFs, including members of the Early Secretory Antigenic Target 6 (ESAT-6) gene cluster (Brosch 2000a). The ESAT-6 proteins have been shown to act as potent stimulators of the immune system (Brodin 2002). The genome of H37Rv contains 23 copies of ESAT-6 family proteins distributed in 11 different regions. Except for esxQ, all are clustered in pairs belonging to the ESAT-6 and CFP-10 protein families (Stanley 2003, Gey Van Pittius 2001).
Gordon et al. (Gordon 1999) used ordered bacterial artificial chromosome arrays to determine genomic differences between M. tuberculosis H37Rv and M. bovis BCG. As a result, they identified 10 regions of difference, including the three previously described (Mahairas 1996). Interestingly, two of the newly described regions (RD5 and RD8) also contained members of the ESAT-6 family of proteins. In addition, RD5 contained three genes coding for phospholipase C, an enzyme with a putative role in mycobacterial pathogenesis (Johansen 1996). Several members of the PE and PPE family proteins were also found in the regions of difference. One copy of IS1532 was identified in RD6 and one copy of IS6110 in RD5. Furthermore, the study searched for regions present in M. bovis BCG but absent from M. tuberculosis H37Rv.
Two regions with this characteristic were found and were named RvD1 and RvD2, standing for H37Rv Deleted. Almost all ORFs from these regions code for unknown proteins, so the role of these deletions has not been elucidated.
Until 2002, most studies concerning comparative genomics were based on differences between the type strain M. tuberculosis H37Rv and other tuberculous bacilli (Behr 1999, Brosch 1999, Brosch 2002). Different approaches using DNA hybridization techniques, such as microarrays, allowed identification of regions of difference with more accuracy and sensitivity than previous methodologies. In total, 16 regions of difference have been found in M. tuberculosis H37Rv that were deleted from M. bovis BCG. The basic idea behind the identification of regions of difference between the avirulent strain M. bovis BCG and the virulent laboratory strain M. tuberculosis H37Rv was the identification of specific deletions in all BCG strains that could be responsible for their lack of virulence. However, nine of the regions of difference were also absent in pathogenic isolates of M. bovis. Other studies have been done comparing M. tuberculosis H37Rv to its avirulent counterpart M. tuberculosis H37Ra (Brosch 1999), in which other Rv-deleted regions were identified. These regions, named RvD3 to RvD5, were found to be products of homologous recombination of adjacent IS6110, as with RvD2. Finally, only RD1 was found to be absent in all M. bovis BCG strains and present in other members of the complex. The regions of difference were used as markers of the molecular evolution of M. tuberculosis (Brosch 2002) and are represented in Figure 4-1. The use of deletions as molecular markers has been described in Chapter 2.
Besides the above mentioned deletions, two duplications were identified in the M. bovis BCG genome (Brosch 2000b). These duplications, named DU1 and DU2, apparently arose from independent events.
DU1 seems to be restricted to the BCG Pasteur strain and comprises the oriC locus, indicating that BCG Pasteur is diploid for oriC and some other neighboring genes. The DU2 region has been found in all BCG substrains tested and includes the sigma factor sigH, which has been related to the heat-shock response (Brosch 2001). Some excellent reviews on comparative genomics were written before the publication of the second M. tuberculosis genome (Cole 1998a, Brosch 2000a, Brosch 2000c, Brosch 2001, Domenech 2001, Cole 2002a, Cole 2002b).
In 2002, the second M. tuberculosis genome sequence was completed, namely that of the clinical strain CDC1551, which had been previously involved in a TB outbreak. This strain was considered to be highly transmissible and virulent for human beings (Fleischmann 2002). With the sequence of this second strain, a first approach to the bioinformatic analysis of intraspecies variability became possible. In the initial comparison by sequence alignment, H37Rv presented a total of 37 insertions (greater than 10 bp) relative to strain CDC1551; of these, 26 affected ORFs while the remaining 11 were intergenic. On the other hand, CDC1551 presented 49 insertions relative to M. tuberculosis H37Rv, 35 affecting ORFs and 14 intergenic. A total of 80 ORFs were inserted in either genome; 25 (31.2 %) of them were hypothetical or conserved hypothetical ORFs, while 36 (45 %) corresponded to the family of PE/PPE proteins, showing the potential role of this family of proteins in antigenic variability and thus in pathogenicity.
Figure 4-1: Distribution of deleted regions in M. tuberculosis complex members. The figure is a grid with one row per deletion (RD2, RD14, RD1, RD4, RD12, RD13, RD7, RD8, RD10, RD9, RvD1, TbD1) and one column per organism (M. tuberculosis H37Rv, M. africanum, M. microti, M. bovis, M. bovis BCG). Dark gray filled cells indicate the presence in all strains tested, light gray indicate the presence in some strains, white is absence from all strains tested.
Data taken from (Gordon 1999, Brosch 2002, Brosch 2000b, Marmiesse 2004).
Only one major rearrangement was found, consisting of the PhiRv1 (RD3), which was found in the genome of M. tuberculosis H37Rv at coordinate 1,779,312, associated with a protein of the REP13E12 family. In the genome of CDC1551, it was found to be located on the complementary strand at coordinate 3,870,803, also associated with a REP13E12 protein. M. tuberculosis CDC1551 was found to have four copies of IS6110 while M. tuberculosis H37Rv had 16. Interestingly, four of the 16 IS6110 copies found in M. tuberculosis H37Rv lacked the characteristic 3 to 4 base pair direct repeat and were adjacent to regions deleted in M. tuberculosis H37Rv relative to M. tuberculosis CDC1551, which suggests homologous recombination.
Since 2002, a large number of studies have been based on Large Sequence Polymorphisms (LSPs) and Single Nucleotide Polymorphisms (SNPs), identified by the comparison of the first two M. tuberculosis genome sequences (Hughes 2002, Gutacker 2002). These studies have been complemented with data obtained from the genome sequence of a third organism of the M. tuberculosis complex. The complete genome of Mycobacterium bovis AF2122/97, a fully virulent strain isolated from a diseased cow in 1997 in Great Britain, was published in 2003 (Garnier 2003). This genome was composed of 4,345,492 bp with a G+C content of 65.63 %, 3,952 putative coding genes, one prophage (PhiRv2), and four IS elements. As expected, similarity of more than 99.95 % to M. tuberculosis was found, with complete colinearity and without evidence of extensive rearrangements. With regard to LSPs, most of them have been described above as regions of difference. Sequencing confirmed the absence of 11 regions of difference, and the presence of only one insertion in comparison to the sequenced M. tuberculosis genomes, the region named M. tuberculosis specific deletion 1 (TbD1). This reflects that deletion events relative to M. tuberculosis have shaped the M. bovis genome. The comparison of the three genomes reflects the high degree of conservation among the members of the M. tuberculosis complex, as well as the divergence of M. bovis relative to M. tuberculosis strains. For specific proteins or genes that vary between M. bovis and M. tuberculosis, a detailed list can be found in Garnier et al. (Garnier 2003). However, it is important to mention that the greatest degree of variation among these bacilli is found in genes encoding cell wall components and secreted proteins. Extensive variations have been found in genes of the PE/PPE family of proteins as well as in genes from the ESAT-6 family, where six of the more than 20 members are absent or altered in M. bovis. Some other changes are registered in genes coding for lipid synthesis and secretion, such as the mmpL and mmpS families of genes. Deletions responsible for the M. bovis requirement of pyruvate as a carbon source were also identified (Garnier 2003).
The analysis of the genome sequence of members of the M. tuberculosis complex has led to great advances in the knowledge of the biology and pathogenesis of these bacteria. The sequencing of whole genomes of Mycobacterium leprae (Cole 2001), Mycobacterium avium subspecies paratuberculosis (Li 2005) and of other members of the genus, such as Mycobacterium smegmatis and M. bovis, has also made huge contributions to the understanding of the lifestyle of mycobacteria. Recently, a report compared the metabolic pathways shared among five of the mycobacterial genomes that have been sequenced (the genome sequence of M. smegmatis was not included in this report) (Marri 2006). The characteristics of the sequenced genomes of organisms in the genus Mycobacterium are presented in Table 4-2. The main differences were found in ISs, the PE/PPE gene family, genes involved in lipid metabolism and those encoding hypothetical proteins. The members of the M. tuberculosis complex had the highest number of IS elements, which might suggest higher intra-species variability in M. tuberculosis compared to other species of mycobacteria.
Table 4-2: Features of sequenced genomes of species belonging to the Mycobacterium genus*
| Feature | M. tuberculosis H37Rv | M. tuberculosis CDC1551 | M. bovis AF2122/97 | M. leprae | M. avium subsp. paratuberculosis | M. smegmatis |
| Genome size (bp) | 4,411,529 | 4,403,836 | 4,345,492 | 3,268,203 | 4,829,781 | 6,988,209 |
| Protein coding genes | 3,927 | 4,186 | 3,920 | 1,604 | 4,350 | 6,897 |
| G+C (%) | 65.6 | 65.6 | 65.6 | 57.79 | 69.3 | 67.40 |
| Protein coding (%) | 91.3 | ~ 91 | 90.8 | 49.5 | 91.5 | 92.42 |
| Gene density (bp/gene) | 1,114 | 1,052 | 1,099 | 2,037 | 1,112 | 1,013 |
| Average gene length | 1,012 | 952 | 995 | 1,011 | 1,015 | 936 |
| tRNAs | 45 | 45 | 45 | 45 | 45 | 47 |
| rRNA operon | 1 | 1 | 1 | 1 | 1 | 2 |
* Data taken from Li 2005, Marri 2006
The comparison of the proteins encoded within the five sequenced genomes revealed a core, or set of shared proteins, of 1,326 proteins, compared to the 219 core genes described by macroarray and bioinformatic analyses (Marmiesse 2004). Unique genes ranged between 966 (M. avium subsp. paratuberculosis) and 26 (M. tuberculosis H37Rv) depending on the genome, and most of these proteins are hypothetical. Regarding the PE/PPE family proteins, it is worth mentioning that M. tuberculosis and M. bovis contained the highest number of these proteins, while neither M. leprae nor M. avium subsp. paratuberculosis has PE_PGRS proteins. Also, a wide variation has been noted in the mmpL gene family, known to participate in lipid transport and secretion. It has been proposed that these variations could be involved in host specificity (Marsh 2005).
4.2.3. Comparing genomes of clinical strains of M. tuberculosis
Genome comparison has shown that gene content can vary between strains of M. tuberculosis. The analysis of complete genome sequences identified SNPs, LSPs, and regions of difference (RDs) when clinical isolates of M. tuberculosis were compared (Fleischmann 2002, Gutacker 2002, Tsolaki 2004, Filliol 2006). The microarray approach allows the comparison of a large number of genomes, providing information on the diversity, frequency, and phenotypic effects of polymorphisms in the population (Tsolaki 2004). This kind of genomic analysis is also useful for the investigation of outbreaks. Particularly when applied to genomics, DNA microarrays allow the identification of sequences present in the M. tuberculosis reference strain, but absent from different clinical isolates. Unfortunately, the microarray technique cannot detect genes present in a clinical isolate that are absent in the reference strain. These changes can originate from small deletions, deletions in homologous repetitive elements, point mutations, genome rearrangements, frame-shift mutations, and multi-copy genes (Ochman 2001, Schoolnik 2002).
Fleischmann et al. suggested that genetic variation among M. tuberculosis strains might denote selective pressure, and therefore might play an important role in bacterial pathogenesis and immunity (Fleischmann 2002). Although associations between host and pathogen populations seem to be highly stable, the evolutionary, epidemiological, and clinical relevance of genomic deletions and genetic variation regions remain ill-defined, as do the molecular bases of virulence and transmissibility (Hirsh 2004). Up to six M. tuberculosis lineages adapted to specific human populations have been described by Gagneux et al. using comparative genomics and molecular genotyping tools: the Indo-Oceanic lineage, East-Asian lineage, East-African-Indian lineage, Euro-American lineage, and two West-African lineages (Gagneux 2006, see Chapter 2). Specific deletions associated with the hypervirulent Beijing/W strains of M. tuberculosis were identified (Tsolaki 2005). Evidently