Bacteriophages: Genes and Genomes

Transcript of Part 3: Mycobacteriophage genomics

00:00:01.00 Hello. My name is Graham Hatfull.
00:00:03.08 I'm a professor at the University of Pittsburgh
00:00:05.23 and a Howard Hughes Medical Institute professor.
00:00:08.21 Today we are talking about bacteriophages, their genes and their genomes,
00:00:12.23 and in part three we are going to focus in on a comparative analysis
00:00:17.00 of a particular type of bacteriophages. These are the mycobacteriophages, phages that infect mycobacterial hosts.
00:00:26.09 And so I should explain why we would want to choose phages of a particular host.
00:00:37.00 And indeed, why we would want to focus on this particular group.
00:00:40.27 So, perhaps one of the most important aspects is that phages
00:00:46.10 that infect very different bacteria tend to be very unrelated to each other.
00:00:51.04 And therefore there is not much to be learned about the detailed mechanisms
00:00:56.02 of phage evolution by comparing them.
00:00:59.13 They are so different there is little to be learned.
00:01:02.23 On the other hand if we were to focus on the phages that infect a common bacterial host,
00:01:08.01 then we would argue that they must all be in some way
00:01:12.17 at least potentially, in genetic communication with each other.
00:01:18.22 And then comes the question as to, well, which bacterial host should we use
00:01:22.13 in order to isolate and characterize these viruses?
00:01:27.06 And there are, of course, many bacteria to choose from.
00:01:32.05 If we had to think of them as ones that would be the most useful, the most interesting,
00:01:36.20 we might want to think about focusing on some bacterial pathogens.
00:01:40.17 Or alternatively bacteria that are important for other criteria.
00:01:46.15 Environmentally important, or other key aspects of their biology.
00:01:53.17 So we focused on the mycobacteriophages.
00:01:58.07 And in part because we think that the mycobacterial hosts are of sufficient importance
00:02:06.00 that they really warrant taking advantage of the viral systems that we could develop.
00:02:13.10 Not just for understanding the viruses, but for understanding the hosts that they infect.
00:02:18.17 And so I'll mention two bacterial species within this genus.
00:02:26.22 One is Mycobacterium tuberculosis, which is the causative agent of human TB.
00:02:34.19 And I'll mention a relative of Mycobacterium tuberculosis, which is called Mycobacterium smegmatis,
00:02:41.10 and this is important because it is a very helpful surrogate for us to use in the lab.
00:02:47.07 Mycobacterium tuberculosis we can grow in the lab,
00:02:51.18 but we have to be very cautious and careful with it for two reasons.
00:02:55.25 Primarily because it is a rather nasty bacterial pathogen,
00:03:01.22 and we certainly don't want any of us working in the lab to be infected with that organism.
00:03:08.19 But it has another feature that somewhat complicates its growth and manipulation in the lab.
00:03:13.22 And that is that it grows extremely slowly. It has a doubling time of about 24 hours.
00:03:18.19 So it takes a day to go from one cell to two cells with Mycobacterium tuberculosis.
00:03:24.28 That makes research pretty slow going on M. tb.,
00:03:31.23 but you also have to be very careful about sterility and your aseptic technique
00:03:36.18 because almost everything out there grows faster than Mycobacterium tuberculosis,
00:03:41.24 and if you are not careful, you will end up growing that rather than M. tb.
00:03:46.09 Mycobacterium smegmatis, in contrast, is a non-pathogen.
00:03:51.20 It does not cause disease in healthy adult human beings,
00:03:56.20 and it grows relatively quickly. It has a doubling time of about three hours,
00:04:01.29 which means that we can grow a lawn, a smooth lawn, of Mycobacterium smegmatis
00:04:05.29 on Petri dishes in about 24 hours, and we can grow individual colonies in three to four days.
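To put the two doubling times side by side, here is a minimal sketch of the arithmetic, assuming ideal, uninterrupted exponential growth (real cultures have lag and stationary phases, so this is only an upper-bound illustration). The function name `fold_increase` is introduced here for illustration and is not from the talk.

```python
def fold_increase(hours: float, doubling_time_hours: float) -> float:
    """Fold increase in cell number after `hours` of ideal exponential growth."""
    return 2 ** (hours / doubling_time_hours)

# Approximate doubling times quoted in the talk.
M_TB_DOUBLING_H = 24.0    # Mycobacterium tuberculosis
M_SMEG_DOUBLING_H = 3.0   # Mycobacterium smegmatis

for name, td in [("M. tuberculosis", M_TB_DOUBLING_H),
                 ("M. smegmatis", M_SMEG_DOUBLING_H)]:
    # 24 h is one doubling for M. tuberculosis and eight doublings for M. smegmatis.
    print(f"{name}: {fold_increase(24, td):.0f}-fold in 24 h")
```

Under these assumptions, one day gives a 2-fold increase for M. tuberculosis and a 256-fold increase (2 to the power 8) for M. smegmatis, which is why any faster-growing contaminant quickly overtakes an M. tuberculosis culture.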
00:04:14.02 Mycobacterium tuberculosis is actually a very serious and important human pathogen.
00:04:21.13 About two million people a year die from Mycobacterium tuberculosis infections, from TB.
00:04:31.02 And it is estimated that Mycobacterium tuberculosis kills more people
00:04:35.27 in the world than any other single, infectious agent.
00:04:39.05 Many people that are infected with the organism actually don't get disease
00:04:45.09 because the bacterium establishes a latent infection and doesn't cause health problems.
00:04:53.20 Although, it can do either with old age,
00:04:58.15 or with a compromise of your immune system, such as for example with HIV infection.
00:05:04.29 Not only is tuberculosis a very prevalent disease,
00:05:10.24 but there is a growing and widespread concern
00:05:13.22 about drug-resistant strains of Mycobacterium tuberculosis,
00:05:17.27 that are either difficult to treat or effectively untreatable.
00:05:23.18 There's clearly a need for new strategies for diagnosis, prevention, and cure of tuberculosis.
00:05:32.02 And so we think these are good reasons to focus on the phages
00:05:36.20 that infect these organisms in the hope that they could contribute towards that specific cause.
00:05:43.25 And so this is an important point. The mycobacteriophages can really lead us in two directions.
00:05:51.10 They can tell us about viral diversity and the evolution of bacteriophages,
00:05:56.09 and at the same time they can provide tools for controlling TB
00:06:01.01 and in fact can provide elements that we need to manipulate TB to understand it and to work with it.
00:06:09.05 I am not going to focus here too much on the specific applications of the mycobacteriophages.
00:06:16.27 I thought that I would just mention one in passing,
00:06:19.28 which is the use of mycobacteriophages as a novel type of diagnostic system
00:06:24.28 in order to test whether a person is infected with TB
00:06:30.21 and indeed whether it is a drug resistant or a drug sensitive strain.
00:06:35.19 This is a strategy which was first described by my colleagues Bill Jacobs and Barry Bloom.
00:06:41.07 The idea is to make so called reporter mycobacteriophages,
00:06:45.26 recombinant phages that carry a gene that can report
00:06:50.27 and tell us about the metabolism of the mycobacterial cell.
00:06:57.04 So you can construct reporter phages that carry a gene
00:07:01.11 such as firefly luciferase that will make the bacteria emit light.
00:07:06.13 Or we can make reporter phages that carry green fluorescent protein from jellyfish
00:07:13.06 that when that is introduced by infection of the host, it makes the cell fluoresce.
00:07:18.12 And we can use these properties, fluorescence or light emission,
00:07:23.10 in order to then monitor what type of bacteria a particular patient is infected with.
00:07:30.18 And so this is an idea that I think shows considerable promise
00:07:33.05 and is currently undergoing further research and development.
00:07:37.20 If we want to compare the genomes of mycobacteriophages
00:07:48.13 in order to understand how they are related to each other,
00:07:52.06 how they've evolved, what their diversity is, well,
00:07:55.23 we need to have the mycobacteriophages in order to characterize them.
00:08:00.08 And so we have gone out over the past few years
00:08:03.21 to isolate new mycobacteriophages and to genomically characterize them.
00:08:10.04 And whilst this has been a major focus in my laboratory,
00:08:14.17 this has also proven a very successful approach for both high school students
00:08:22.23 and undergraduate students to become involved in research endeavors
00:08:28.28 by going out and isolating new mycobacteriophages and sequencing them.
00:08:33.03 And now with the Howard Hughes Medical Institute science education alliance,
00:08:38.27 there are hundreds of students who are contributing to this cause,
00:08:42.16 and because of this we now have many new mycobacteriophages to characterize and to compare.
00:08:51.02 The process is relatively simple.
00:08:53.11 We start with a sample of soil or compost or wherever you might think to go
00:09:01.20 and look to find out if there are some bacteriophages present.
00:09:04.18 The sample is mixed up with some liquid. The particulate matter is removed.
00:09:12.02 And we simply incubate some of that in the presence of our permissive bacterial host,
00:09:18.12 which is Mycobacterium smegmatis.
00:09:20.07 We lay those out on a Petri dish, as shown here, and we look for plaques,
00:09:26.02 for areas where a phage that was present in our original sample
00:09:32.09 has now infected these cells to form a plaque.
00:09:34.25 We can then pick an individual plaque, purify it, remove all of the other contaminants,
00:09:43.15 and we can propagate it in the laboratory until we have a high titer or a concentrated stock.
00:09:49.12 From that we can make DNA.
00:09:51.08 The DNA can be sequenced to give us tens of thousands of nucleotides of sequence information,
00:10:00.20 and then we use computational approaches and bioinformatics
00:10:04.20 to predict where all the genes are in these genomes, and then we can compare them.
00:10:11.01 So we are using Mycobacterium smegmatis as our host,
00:10:16.24 fast growing, non-pathogen, and our samples predominantly come from soil and compost.
00:10:23.06 We have usually just simply plated out the sample with our permissive host,
00:10:28.24 but because the specific phages that we're after can be present at relatively low concentrations,
00:10:37.05 there is an approach that can be used with enrichment, where you simply take your soil or your compost sample,
00:10:43.00 you mix it and incubate it with some permissive host cells,
00:10:47.21 in this case Mycobacterium smegmatis, that allows even the small number of particles
00:10:53.23 that may be present to infect, to reproduce themselves,
00:10:57.27 and so that when it comes to the plating and the identification of
00:11:03.06 plaques they're present at higher concentrations.
00:11:05.20 There's a couple of different approaches,
00:11:08.03 but this is a relatively reproducible and simple process for discovering new phages.
00:11:16.02 So by this point thousands of mycobacteriophages have been isolated
00:11:19.06 using Mycobacterium smegmatis as a host.
00:11:22.16 I should state I think that some of these infect smegmatis,
00:11:27.14 but don't infect Mycobacterium tuberculosis, whereas others do.
00:11:32.19 And so we use a surrogate strain, Mycobacterium smegmatis, as a host,
00:11:37.23 but it is likely that the host range, the cell preferences of the phages that we isolate
00:11:43.28 are going to be all over the place and at this stage are not well defined.
00:11:47.13 We've... the most recent publication that describes the characterization of these
00:11:56.07 appeared earlier this year in 2010, and described a comparative analysis of 60 of these.
00:12:04.14 But because of the impact of the science education alliance program
00:12:10.21 as well as the ongoing studies at Pittsburgh,
00:12:12.17 the number of new phages and sequenced genomes is positively exploding.
00:12:18.27 And at this point in the middle of October in 2010,
00:12:23.28 there are 154 completed genome sequences and much analysis waiting to be done.
00:12:33.11 All of these phages, it turns out, even though they don't have to be,
00:12:38.20 are double stranded DNA tailed phages.
00:12:42.14 We haven't isolated any RNA phages or any single stranded DNA phages.
00:12:48.02 They are all double stranded DNA, tailed phages.
00:12:51.12 Now in part one of this lecture we saw that perhaps the most common order of bacteriophages
00:12:59.06 are the Caudovirales, the double stranded DNA, tailed viruses.
00:13:02.21 Just like these that I showed you. I also told you that there's three common types.
00:13:09.06 The so-called Siphoviridae with the long flexible tails, the Myoviridae with the contractile tails,
00:13:14.24 and the Podoviridae that have short stubby tails.
00:13:17.21 If we just compare the morphotypes of these 60 genomes,
00:13:23.13 which have been analyzed and published,
00:13:27.09 53 of them are of this Siphovirus type, 7 of them are of the Myovirus type.
00:13:33.08 We have no Podoviruses at all.
00:13:37.21 And so these numbers appear to hold true for the larger collection
00:13:42.05 of mycobacteriophages, and therefore we have growing confidence in the idea
00:13:47.15 that there really are no Podoviruses amongst the mycobacteriophages.
00:13:53.00 We don't know whether this is because phages with the short stubby tails
00:13:58.20 are physically incapable of infecting bacteria like the mycobacteria
00:14:04.25 that have thick and chemically complex cell walls,
00:14:08.06 or whether it's just a reflection of a restriction
00:14:12.12 of evolutionary opportunities to generate those types of phages.
00:14:19.04 So that's a little bit of a mystery as we don't have any Podoviruses,
00:14:24.18 but we have lots of examples of these other two morphotypes.
00:14:28.28 When we look at the genomes there are some basic parameters
00:14:34.21 that we can see that are helpful in thinking about what these genomes are like.
00:14:38.11 First of all, the average length of all of them is 72,588 base pairs.
00:14:47.01 We don't really understand why mycobacteriophages would have that particular length.
00:14:51.07 Phages of other bacterial hosts often have very different average lengths
00:14:58.02 including those that are only half as long as the average mycobacteriophage genome.
00:15:03.09 And so we don't really know what determines this parameter,
00:15:08.02 either for the mycobacteriophages or indeed for any other phages.
00:15:12.28 There's also a large range in size from a little under 42,000 base pairs
00:15:20.23 up to about 164 and a half thousand base pairs.
00:15:25.06 So there is a lot of diversity in terms of size range.
00:15:29.25 The GC content on average for all of these 60 genomes is about 63 and a half percent.
00:15:37.00 A number which closely mirrors the GC content of the bacterial host Mycobacterium smegmatis.
00:15:44.08 And that's not a surprise because it has been seen from the analysis of phages of other bacterial hosts
00:15:52.27 that the GC content of the phages often mirrors that of the hosts.
00:15:57.09 What's perhaps more surprising, however, is that the range of GC content
00:16:03.10 amongst these phages is actually really amazingly broad
00:16:07.09 spanning from 56.3% at the lower end up to 69% at the upper end.
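GC content is simply the percentage of bases in a genome that are G or C. The sketch below shows one straightforward way to compute it; the function name `gc_content` and the toy sequence are made up for illustration, and ambiguous bases such as N are ignored by assumption.

```python
def gc_content(seq: str) -> float:
    """Percent G+C among the unambiguous bases (A, C, G, T) of a DNA sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / total

# Toy sequence (not a real phage genome): 9 of its 12 bases are G or C.
toy = "GCGCATGCCGGA"
print(gc_content(toy))  # 75.0
```

Applied to each of the 60 genomes, a calculation like this would give the per-genome values whose average (about 63.5%) and range (56.3% to 69%) are quoted above.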
00:16:14.28 And we've been trying to think for some time as to what this span of GC content reflects.
00:16:23.16 One attractive idea although it remains to be fully tested is that these particular mycobacteriophages
00:16:33.13 whilst they have a common host in Mycobacterium smegmatis
00:16:37.12 may not necessarily have been infecting Mycobacterium smegmatis
00:16:43.17 as their preferred bacterial host in the environment from which we recovered the phages
00:16:49.13 in their recent ecological and evolutionary times.
00:16:54.11 In other words, they may have preferences for infecting some other bacterial host
00:17:00.20 that we have yet to figure out what that is.
00:17:03.24 But that might account for the range of GC content that we would see.
00:17:09.06 And so one of the things that we would like to do to test this idea
00:17:11.28 is to actually determine the specific host range
00:17:15.18 on a whole range of bacteria that are related to the Mycobacteria
00:17:21.24 to see if we can discern a pattern or a correlation between GC content and the host preferences.
00:17:27.17 And finally if we look at the number of genes that are present,
00:17:31.21 of these 60 genomes there is a total of 6858 open reading frames or putative protein coding genes, ORFs,
00:17:40.19 about a hundred and fourteen ORFs on average per genome.
00:17:45.27 And interestingly the average ORF size, the average size of an open reading frame, is only 616 base pairs.
00:17:55.14 That's about two thirds of the average size of a bacterial gene.
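The figures quoted here can be cross-checked with a little arithmetic. The sketch below uses only numbers stated in the talk; the derived coding fraction is a rough estimate that assumes the ORFs do not overlap, which is a simplification.

```python
# Figures quoted in the talk for the 60 published genomes.
total_orfs = 6858
n_genomes = 60
mean_orf_bp = 616
mean_genome_bp = 72588

# Average number of ORFs per genome.
orfs_per_genome = total_orfs / n_genomes

# Rough coding fraction: ORF base pairs per average genome over genome length.
# Assumes non-overlapping ORFs, so treat it as an approximation.
coding_bp = orfs_per_genome * mean_orf_bp
coding_fraction = coding_bp / mean_genome_bp

print(f"{orfs_per_genome:.1f} ORFs per genome")      # 114.3 ORFs per genome
print(f"{coding_fraction:.2f} of the genome in ORFs")  # 0.97 of the genome in ORFs
```

So the "about a hundred and fourteen" figure follows directly from 6,858 divided by 60, and the same numbers suggest that roughly 97% of an average genome lies inside predicted ORFs.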
00:18:02.22 And this appears to be a parameter which is true not just for the mycobacteriophages,
00:18:08.08 but for other bacteriophages that people have looked at.
00:18:11.01 And we've been interested as to why this number
00:18:13.13 should be quite so different from that of the bacterial host.
00:18:17.14 It fits, however, I think, with the idea that illegitimate recombination
00:18:23.10 is playing a key role in how these genomes evolve.
00:18:29.06 And in fact we can see that many of the segments of DNA that appear to have come in
00:18:34.26 relatively recently from other genomes tend to be on the small side.
00:18:39.07 And therefore we can think of this process of evolution, as we talked about in part two,
00:18:46.29 may actually contribute to driving the average gene size down.
00:18:52.20 So we can take our 60 genomes, and we can ask the question:
00:18:58.25 "how are they related to each other at the nucleotide sequence level?"
00:19:03.27 And we can use an approach that we saw in part two,
00:19:10.17 which is where we can compare the nucleotide sequences in a dot plot analysis.
00:19:16.20 And one way of doing this is illustrated here.
00:19:21.29 Now what we've done is to take our 60 genomes,
00:19:24.29 and we've simply joined them together end to end to make a long concatamer,
00:19:30.26 and we've done that in random order.
00:19:33.01 We've just taken our sixty sequences joined them together
00:19:36.14 to get a long span and then simply compared them with each other.
00:19:40.02 Not surprisingly there is a diagonal line from the top left to the bottom right
00:19:45.22 because that simply tells us that every phage genome is identical to itself.
00:19:51.22 That is a good thing.
00:19:53.00 And then there's a number of diagonal lines you can see
00:19:57.05 where a particular phage in this part of the array
00:20:01.24 is similar to a second phage that is sitting in a different part of the array.
00:20:09.08 And because the genomes are in a random order in this concatamer,
00:20:13.10 these various types of relationships are scattered over this dot plot.
00:20:21.06 And we can see though, I think, that we have phage genomes that are similar to each other,
00:20:27.16 but there must be many that are completely dissimilar to each other at the nucleotide sequence level.
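The idea behind the concatenated dot plot can be sketched in a few lines. The talk does not name the software used, so this is a naive exact-match version written for illustration: a dot is recorded wherever the same short word (k-mer) occurs at position i on one axis and position j on the other. The function name `dot_plot`, the word length `k`, and the three toy sequences are all assumptions.

```python
from collections import defaultdict

def dot_plot(seq_x: str, seq_y: str, k: int = 6):
    """Return (i, j) pairs where seq_x[i:i+k] equals seq_y[j:j+k]."""
    # Index every k-mer of seq_y by its start positions.
    index = defaultdict(list)
    for j in range(len(seq_y) - k + 1):
        index[seq_y[j:j + k]].append(j)
    # Look up every k-mer of seq_x in that index.
    dots = []
    for i in range(len(seq_x) - k + 1):
        for j in index.get(seq_x[i:i + k], []):
            dots.append((i, j))
    return dots

# Three toy "genomes"; the first and third share the stretch ACGTTGCAAG.
genome_a = "ACGTTGCAAGGCTA"
genome_b = "GGGCCCTTTAAACG"
genome_c = "TTACGTTGCAAGCC"

# Join them end to end and compare the concatemer with itself.
# (k-mers spanning a junction are artifacts of the concatenation.)
concat = genome_a + genome_b + genome_c
dots = dot_plot(concat, concat, k=6)

off_diagonal = sorted((i, j) for i, j in dots if i < j)
print(off_diagonal)
```

Every position matches itself, which produces the main diagonal, and the shared stretch between the first and third toy genomes shows up as a short off-diagonal run starting at (0, 30). The second genome shares nothing with the others, so its rows and columns are empty away from the diagonal.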
00:20:33.06 So having done this and identified, generally speaking, who is most closely related to who else,
00:20:40.13 what we can do is we can take each of the genomes
00:20:44.03 and we can change the order in which we've arrayed them in this concatamer,
00:20:50.10 and then repeat this computational comparison.
00:20:54.09 So when we do that, this is what the plot looks like.
00:20:57.14 And so all we've done is simply to group the genomes together that are similar to each other.
00:21:04.26 So for example if you look in the top left hand corner all of those genomes that are similar to each other are positioned
00:21:10.10 next to each other in the top left hand part of the plot.
00:21:13.03 We can take this gross nucleotide sequence similarity
00:21:18.02 to put the genomes together into what we refer to as clusters.
00:21:24.04 Such as Cluster A, Cluster B, C, D, E, etc.
00:21:27.10 And so those clusters go up to cluster I,
00:21:30.13 and on the right hand side where it says Sin,
00:21:36.05 this corresponds to what we refer to as singleton genomes.
00:21:41.02 And out of these 60 genomes, there are 5 that are singletons,
00:21:45.11 which means that each of those has no close relatives
00:21:49.10 either here or anywhere through the biological world.
00:21:56.00 There is some important texture to this grouping and these clusterings,
00:22:01.25 and we can readily identify some clusters as having more than one closely related type.
00:22:11.14 And we therefore subdivide the cluster into sub-clusters.
00:22:16.09 You can see here for the cluster C that there are many of these genomes,
00:22:21.13 in fact almost all of them are very similar to each other,
00:22:25.08 and constitute sub-cluster C1, and then there is a single genome over here
00:22:31.10 which is related to the other C cluster genomes, but less so, so that constitutes sub-cluster C2.
00:22:41.10 So we have a large number of different types of genomes,
00:22:43.27 more than twenty substantially different types of genomes,
00:22:47.02 just within this group of 60 that we are looking at.
00:22:51.21 And so each of these genomes, and you can see them identified by name here,
00:22:57.05 as we zoom in on the different clusters and sub clusters.
00:23:00.18 Here we are looking at clusters A through to E.
00:23:03.15 Cluster C as I indicated can be divided into sub-cluster C1
00:23:10.25 with Bxz1, Cali, Catera, Rizal, ScottMcG, and Spud.
00:23:16.26 And then Myrna is the sole member of sub-cluster C2.
00:23:20.22 And these are the remaining clusters, F, G, H, and I.
00:23:27.24 And then here are the singletons over on the right hand side here:
00:23:31.14 Corndog, Giles, TM4, Wildcat, and Omega.
00:23:37.00 And so we can take each of these genomes that we've assorted with each other
00:23:47.17 according to their nucleotide sequence similarity,
00:23:50.18 or if they are singletons, they're one of a type.
00:23:53.14 We can generate the genome maps,
00:23:55.26 and we can see what features they have and what they look like.
00:23:58.18 This is showing Giles, which I introduced previously in part 2 of the lecture,
00:24:03.29 and you can see its densely packed genes with the rightwards transcribed genes above the DNA,
00:24:12.22 and the leftwards transcribed genes below the DNA.
00:24:15.20 It is densely packed and we've color coordinated these genes according to their relatives.
00:24:22.03 And so we now have these genome maps for all of these phage genomes
00:24:28.11 and these maps then can be compared,
00:24:31.01 and in fact the genes and the predicted proteins can be compared as well.
00:24:35.23 So we look at these 60 mycobacteriophages,
00:24:39.20 and we see that the genes are tightly packed with few non-coding regions.
00:24:42.29 There are many, many genes, but there appear to be few operons.
00:24:49.11 Meaning that we think that there may be a hundred genes, but there may be only 2, 3, or 4 sites
00:24:56.09 for transcription initiation or promoters that are used to express these genes.
00:25:02.19 We actually know very little about the patterns of gene expression of any of these phages,
00:25:07.05 but the bioinformatic predictions are that there will be blocks of genes that are transcribed together.
00:25:15.16 The virion genes, those are the genes that encode the structural components, the heads and the tails,
00:25:24.23 those genes typically tend to be grouped together in the genome,
00:25:28.19 and they have a common order or synteny
00:25:32.09 which is conserved even though the genomic sequences may be extremely different to each other.
00:25:40.23 Especially once we examine the parts of the genomes outside of these virion genes,
00:25:46.14 we find vast numbers of genes, many of them relatively small,
00:25:51.03 which have a completely unknown function.
00:25:54.25 And we have failed to predict what they can do simply from comparing them with other genomes.
00:26:01.22 And so what we've done is to create a computer program.
00:26:08.06 This was a program called Phamerator, and it was written by a colleague of mine, Dr. Steve Cresawn,
00:26:13.27 which can then begin to analyze all of the genes and how they are related to each other
00:26:19.05 by comparing them at the amino acid sequence level.
00:26:22.24 This is really important because so far I have shown you how
00:26:26.29 we can compare genomes at the nucleotide sequence level.
00:26:30.05 I also showed you that we have lots of examples, because we have many different types of genomes
00:26:36.08 that appear not to share nucleotide sequence similarity
00:26:41.10 even though they are in genetic communication with each other, at least in principle,
00:26:46.10 because of the use of the common host.
00:26:49.05 Just because they don't have nucleotide sequence similarity
00:26:53.00 doesn't mean that they are completely unrelated.
00:26:56.00 And in fact, once we start to look at the gene relationships
00:27:01.16 by comparing the amino acid sequences
00:27:04.22 we can begin to see the patterns that reflect the common origins of the phages,
00:27:10.15 even though they no longer share nucleotide sequence similarity.
00:27:14.06 And so this program that Steve Cresawn wrote
00:27:19.07 called Phamerator facilitates this in a very important process.
00:27:25.07 What it does is it takes each of these open reading frames out of 60 genomes,
00:27:30.06 we have these 6,858 genes.
00:27:33.14 It takes each of the predicted proteins
00:27:37.10 and compares them with everything else
00:27:39.23 using alignment programs such as BLASTp and Clustal.
00:27:45.23 Genes which are related to each other because
00:27:49.21 they meet a particular threshold of similarity we group together.
00:27:55.13 And we put them in groups, and those groups are called phamilies or phams.
00:27:59.20 And of these 6,858 genes we have a total of 1,523 distinct phamilies of sequences.
00:28:12.04 A large proportion of those are what we refer to as "orphams".
00:28:18.13 They are phamilies but they only contain a single member.
00:28:21.25 Not because we believe that other members don't exist
00:28:26.14 but because this population of phages appears to be very diverse
00:28:31.12 and presumably quite large,
00:28:33.10 and we simply haven't yet identified the relatives
00:28:36.29 of these orphams that constitute these phamilies.
00:28:40.03 And so this is about 45% of all of our phamilies only have a single member.
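The grouping of genes into phams can be sketched as follows. This is not Phamerator's actual code: it assumes the pairwise comparisons (for example from BLASTp) have already been run and filtered, so the input is simply a list of gene pairs that passed the similarity threshold. The function name `build_phams` and the gene identifiers are hypothetical.

```python
def build_phams(genes, hits):
    """Group genes so that any two linked by a chain of hits share a pham.

    genes: iterable of gene identifiers.
    hits:  pairs of genes whose proteins passed the similarity threshold.
    """
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the group representative, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in hits:
        parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for g in parent:
        groups.setdefault(find(g), []).append(g)
    # Largest phams first, ties broken alphabetically.
    return sorted((sorted(m) for m in groups.values()), key=lambda m: (-len(m), m))

# Hypothetical genes from three genomes A, B, and C.
genes = ["A_1", "A_2", "B_1", "B_2", "C_1"]
hits = [("A_1", "B_1"), ("B_1", "C_1")]

phams = build_phams(genes, hits)
orphams = [p for p in phams if len(p) == 1]
print(phams)                     # [['A_1', 'B_1', 'C_1'], ['A_2'], ['B_2']]
print(len(orphams) / len(phams))
```

In this toy case two of the three phams are orphams. For the real data set, 45% of 1,523 phams works out to roughly 685 orphams.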
00:28:45.16 This Phamerator program is extremely helpful for generating the maps
00:28:51.25 and displaying the relationships that help us understand
00:28:54.19 the mosaic components by which these are put together.
00:28:59.10 And so here I am showing segments of four genomes,
00:29:01.22 that you can see, just parts that are aligned
00:29:07.03 showing the boxes here and the numbers above the boxes such as here at the top in the middle, 1406,
00:29:16.13 refers to a particular phamily. That's a phamily number for which that gene is a member.
00:29:22.12 And then in this display we can color coordinate the degree
00:29:27.14 of sequence similarity at the nucleotide level between the various genomes.
00:29:32.26 And this is actually reflecting a part that I showed you... a part of these genomes
00:29:36.22 that we talked about in part two.
00:29:39.18 Now we can do this type of representation with large numbers of these genomes.
00:29:47.05 When we look at particular clusters, any particular cluster
00:29:51.03 can have genomes that are very similar to each other,
00:29:55.28 or they can be actually quite diverse, depending on the particular cluster
00:30:00.25 that you look at and the degree of sequence similarity.
00:30:04.03 I am just illustrating this with the Cluster G phages,
00:30:09.06 for which in our expanded set we actually have 4 members now,
00:30:13.29 and the color coordination, the purple between these four genomes illustrates how very closely related they are.
00:30:21.25 And when we compare the colors of the genes at the protein levels,
00:30:25.21 you can see that these are also very similar.
00:30:30.13 This method is very powerful in part because it is a very easy way
00:30:35.18 of seeing rather smaller differences
00:30:37.27 that nonetheless have played a key role in how these genomes have evolved.
00:30:42.20 For example, down in the right hand end you can see these convolutions
00:30:46.22 here of segments that have been lost from one genome or gained by another.
00:30:51.13 And in fact this illustrates the finding of a new mobile genetic element,
00:30:57.10 a new ultra small transposon that appears to play a role in these particular...
00:31:02.08 in the evolution of these particular genomes.
00:31:05.09 So in part two we saw a lot about how mosaicism is the key architectural feature
00:31:14.16 of bacteriophage genomes. Because now we are looking at this group of mycobacteriophages
00:31:22.12 infecting a common host,
00:31:24.15 we have lots of examples where even though there is no nucleotide sequence similarity,
00:31:29.23 we can see that the genes are shared through common amino acid sequence similarity.
00:31:37.15 And therefore we can look at patterns that are contributing in generating the process of genome mosaicism
00:31:46.09 even in the absence of substantial sequence similarity.
00:31:51.23 And what we find is a massive amount of mosaicism
00:31:56.11 where the modules that contribute to the structure of the genome often correspond simply to single genes.
00:32:08.14 So modules correspond to single genes when we conduct this type of analysis.
00:32:15.03 And we've developed a particular tool for representing this, representing the phylogenies if you like,
00:32:24.08 where we can take individual Phams- here is one Pham 233 and here is another Pham, Pham 471-
00:32:31.27 and in these representations, we've simply drawn as points around the circle
00:32:38.24 all of the genomes that we have available to us,
00:32:41.16 and for that particular sequence family we've drawn an arc between those genomes
00:32:48.28 that have a member of that Pham.
00:32:52.29 And therefore it essentially represents or reflects the phylogeny
00:32:57.26 or the evolutionary history of this particular family of sequences.
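The circle representation described here can be sketched in code. This is a minimal illustration, not the tool used for the figures: genomes are spaced evenly around a unit circle, and an arc is recorded for every pair of genomes that both contain a member of the pham. Connecting all such pairs is an assumption, and the function name `pham_circle` and the short genome list are chosen only for the example.

```python
import math
from itertools import combinations

def pham_circle(genomes, members):
    """Place genomes on a unit circle and list arcs for one pham.

    genomes: ordered list of genome names (points around the circle).
    members: set of genome names that contain a gene in this pham.
    """
    n = len(genomes)
    positions = {
        name: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i, name in enumerate(genomes)
    }
    # One arc per pair of genomes that both carry a member of the pham.
    arcs = [(a, b) for a, b in combinations(genomes, 2)
            if a in members and b in members]
    return positions, arcs

# A handful of genome names from the talk; a real plot would use all of them.
genomes = ["Omega", "Cjw1", "KBG", "Giles", "TM4"]

# The talk says Omega gene 127 has a relative in KBG (gene 84).
# Whether other genomes also carry this pham is not stated, so this
# membership set is illustrative only.
positions, arcs = pham_circle(genomes, {"Omega", "KBG"})
print(arcs)  # [('Omega', 'KBG')]
```

Drawing the same circle for the pham containing Omega gene 126 would connect Omega to Cjw1 instead, which is how two neighboring genes can be seen at a glance to have different histories.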
00:33:02.09 In the top part of the figure I've just shown a small segment of phage Omega from genes 125 to 128.
00:33:11.27 Gene 126 in Omega has a relative that we can see through amino acid sequence similarity
00:33:19.23 to a gene in this genome called Cjw1.
00:33:24.23 Gene 127 in Omega has a relative in a genome called KBG.
00:33:32.11 In that case, gene 84. But, and this is important, the context, the flanking sequences in each case are different.
00:33:46.15 Ok, the sequences to the left of Omega 126, which corresponds to Omega gene 125,
00:33:53.06 are completely unrelated to Cjw1 gene 72,
00:33:58.17 which is at the left part of that gene in Cjw1.
00:34:03.10 And the same goes for the KBG comparison.
00:34:08.26 So in this case we can see that we don't have any nucleotide sequence similarity between these,
00:34:13.15 but we can dissect these evolutionary relationships that show
00:34:21.09 that these two adjacent genes in this example in Omega
00:34:24.16 have clear and distinct evolutionary histories. They have different phylogenies.
00:34:30.02 And this is one example, but we have clearly thousands of examples of Phams
00:34:36.00 which share and exhibit these types of relationships.
00:34:40.26 And this has considerable importance when you start to think
00:34:46.02 about questions of phylogeny of whole phage genomes.
00:34:50.23 Why not just take whole phage genomes and construct a phylogenetic tree
00:34:54.16 so you can see how they are all related to each other?
00:34:58.16 The problem is that all of the bits of the genomes because they are mosaic,
00:35:03.10 built from modules and pieces, and all those bits and those pieces have distinct evolutionary histories.
00:35:10.04 They have different phylogenies. There is arguably no single, clear, evident
00:35:16.17 phylogeny for a genome as a whole.
00:35:18.27 The genome represents an individual phage,
00:35:23.04 and its evolutionary history is reflected in a multiplicity of events
00:35:28.17 that have put those pieces together in that particular combination, in that order, in that particular virus.
00:35:36.02 I am just showing some other genome maps here of
00:35:43.19 some of these genomes illustrating again that for some of these genes
00:35:52.05 that are encoding the structural proteins, we know what they do.
00:35:55.20 But for most of the other genes, and there are large numbers of them,
00:36:00.02 we really have absolutely no idea what they do.
00:36:03.01 And we would certainly like to know what their functions are and indeed what their structures are.
00:36:08.00 And so now if we expand our analysis to include the unpublished information,
00:36:13.12 and this was done for 153 genomes that are completely sequenced,
00:36:19.09 we have over 17,000 open reading frames and almost 3000 Phamilies of distinct and different protein sequences.
00:36:26.12 The number of Orphams has come down slightly.
00:36:30.26 It is about 41% as we have started to find some of the relatives of genes that were previously Orphams.
00:36:38.22 And amazingly, if we take these almost 3000 Phamilies,
00:36:43.11 and we compare them against the sequence databases,
00:36:46.13 we find that about 80% of them are novel genes. They are novel sequences.
00:36:52.21 They have no relatives in the databases, either in other phages or in anything else that has been sequenced.
00:36:59.18 Even of the 20% of Phams that do match, so you know there is a related protein out there in the databases,
00:37:08.14 about half of those match genes whose function people don't know anyway.
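As a rough worked example of the figures just quoted, the arithmetic can be sketched as follows. This is an illustration only: it rounds "almost 3000" Phamilies up to 3000 and treats "about 80%" and "about half" as exact values, all of which are assumptions made for the sake of the calculation.

```python
# Figures quoted in the talk, rounded for illustration (assumptions, not exact counts).
TOTAL_PHAMS = 3000            # "almost 3000" Phamilies, rounded up
NOVEL_FRACTION = 0.80         # "about 80%" have no database match
UNKNOWN_AMONG_MATCHED = 0.5   # "about half" of matched Phams have no known function


def pham_breakdown(total, novel_fraction, unknown_among_matched):
    """Split Phams into novel, matched-but-unknown, and matched-with-known-function."""
    novel = round(total * novel_fraction)
    matched = total - novel
    matched_unknown = round(matched * unknown_among_matched)
    matched_known = matched - matched_unknown
    return {
        "novel": novel,
        "matched": matched,
        "matched_unknown": matched_unknown,
        "matched_known": matched_known,
        "informative_fraction": matched_known / total,
    }


breakdown = pham_breakdown(TOTAL_PHAMS, NOVEL_FRACTION, UNKNOWN_AMONG_MATCHED)
for label, value in breakdown.items():
    print(f"{label}: {value}")
```

Under these rounded figures, only about one Phamily in ten has a database match that says anything about function, which is why database searching alone provides so little information here.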
00:37:14.09 So database searching is an interesting exercise with these bacteriophage genes.
00:37:23.17 It provides rather little information as to what the functions of the genes are.
00:37:27.06 It is obviously very helpful when it does,
00:37:29.24 but the amazing thing is we just don't know what most of these genes do, and we would like to.
00:37:34.27 In this particular system we've made some headway
00:37:40.05 in developing a tool that can now help us address this question.
00:37:46.03 It is called BRED, or bacteriophage recombineering of electroporated DNA,
00:37:50.26 and it provides a simple, reproducible technique
00:37:56.19 for constructing mutants in mycobacteriophage genomes, whether deletions, insertions, or point mutations.
00:38:04.26 This method is published, and I won't go through its details here,
00:38:09.02 but it really just requires a simple electroporation step,
00:38:13.24 an ability to put phage DNA and a synthetic substrate together
00:38:18.04 inside a cell, and the techniques for doing so
00:38:20.23 are well established, followed by nothing more complicated than
00:38:27.27 a polymerase chain reaction, or PCR, screen
00:38:30.27 amongst a dozen or so of the progeny plaques that are recovered
00:38:37.04 in order to find those that have the mutation that you need.
00:38:40.11 And this is all accomplished through the establishment of a so-called mycobacterial recombineering system
00:38:47.01 that enables this to happen
00:38:49.12 at much higher frequency than you would normally see it.
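To get a feel for why screening only a dozen or so plaques can be enough, here is a minimal sketch. It assumes each plaque independently carries the mutation with the same probability. The mutant fractions tried below are hypothetical values chosen for illustration, not figures from the talk or from the published method.

```python
def chance_of_at_least_one_mutant(mutant_fraction, plaques_screened):
    """Probability that at least one screened plaque carries the mutation.

    Assumes every plaque is an independent draw with the same probability
    of containing the mutant (a simplifying assumption).
    """
    if not 0.0 <= mutant_fraction <= 1.0:
        raise ValueError("mutant_fraction must be between 0 and 1")
    if plaques_screened < 0:
        raise ValueError("plaques_screened must be non-negative")
    return 1.0 - (1.0 - mutant_fraction) ** plaques_screened


# Hypothetical mutant fractions, screened across a dozen plaques.
for fraction in (0.05, 0.10, 0.25):
    probability = chance_of_at_least_one_mutant(fraction, 12)
    print(f"mutant fraction {fraction:.2f}: P(at least one hit in 12) = {probability:.2f}")
```

The point of the sketch is only that the higher the recombination frequency, the fewer plaques need to be screened, which is what the recombineering system is described as providing.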
00:38:52.20 And so this is very powerful because we can use that type of approach
00:38:57.05 to now go and ask what those genes do.
00:39:00.05 And indeed we can use it to try to develop applications for some of what we've found and what we are learning
00:39:07.23 that might be useful for the genetics of mycobacteria or specifically control of tuberculosis.
00:39:13.17 And I will give one brief example of that which is a couple of genes called Lysin A and Lysin B.
00:39:21.04 In this case I am again showing Giles as an example.
00:39:27.06 In part one we talked about an important step that happens at the conclusion of lytic growth,
00:39:34.01 and that is that in order for the phage particles that have been generated by infection to get out
00:39:41.06 then the cell wall needs to be compromised. It needs to be broken open. The cell needs to be lysed.
00:39:49.03 And the phage encodes the enzymes that enable that to happen.
00:39:52.09 We know very little about the process in mycobacteriophages,
00:39:58.16 but we were surprised in looking at the genomes that there are two candidate genes that are involved,
00:40:04.14 Lysin A and Lysin B, and we were able to use this engineering technique
00:40:12.14 to construct mutants in which we've removed either one of these genes
00:40:17.19 and to examine what the behaviors of the phages were.
00:40:20.26 That enabled us to figure out exactly what roles these genes are playing in lysis.
00:40:27.20 And I won't show you all the detailed experiments that gave us the conclusions as to what these do,
00:40:35.29 but I think the results are very clear.
00:40:39.19 This is a portrayal of what the mycobacterial cell wall looks like,
00:40:44.13 where you have an inner membrane.
00:40:45.20 You have the peptidoglycan of the cell wall.
00:40:49.14 There is a sugar layer called arabinogalactan,
00:40:53.20 and covalently attached to this is the so-called mycobacterial outer membrane,
00:40:59.14 which is composed of an interesting type of lipid called mycolic acids.
00:41:04.23 And this is found in the mycobacteria, including Mycobacterium tuberculosis,
00:41:08.11 but most bacteria don't have this type of outer wall structure.
00:41:16.15 So not surprisingly we find that genomically the Lysin B enzyme
00:41:24.18 is encoded predominantly by mycobacteriophages.
00:41:28.28 And that was part of what clued us in to Lysin B playing a role in perhaps degrading this cell wall structure.
00:41:37.00 What we now know is that Lysin A is the enzyme that degrades the peptidoglycan.
00:41:43.11 And Lysin B is this novel enzyme that actually cleaves the mycobacterial outer membrane
00:41:50.27 from this arabinogalactan layer and therefore facilitates complete lysis
00:41:56.28 of the cell during the process of release of the progeny viruses at the conclusion of the lytic cycle.
00:42:06.16 And we are obviously very interested in these enzymes
00:42:09.05 because they are enzymes that degrade the cell walls of mycobacteria.
00:42:14.27 And therefore we like the idea that these enzymes could
00:42:19.19 play potentially useful roles, either in the lab to try to break open and destroy mycobacteria,
00:42:26.29 or perhaps even in a clinical setting, either to help inactivate mycobacteria
00:42:34.05 or perhaps to act synergistically with antibiotics
00:42:37.14 to make them work better and quicker in killing
00:42:40.11 Mycobacterium tuberculosis in an infected patient.
00:42:44.21 So I gave you just one example there of how we can begin to identify what these genes do
00:42:51.21 and how some of them may be useful.
00:42:53.07 We have seen that mycobacteriophages are highly diverse.
00:42:56.14 They have these architecturally mosaic genomes,
00:42:59.17 and we can dissect this mosaicism not just by looking at the nucleotide sequence similarities,
00:43:05.10 but by comparing amino acid sequence similarities, an approach that is greatly enhanced
00:43:12.26 and aided by the fact that we now have this large number of phages
00:43:19.01 and phage genomes that infect a common host.
00:43:22.06 And I think that that raises the idea that there is probably a lot to be learned
00:43:27.12 from generating similar collections of bacteriophages that infect other bacterial hosts.
00:43:34.26 And the larger these collections grow, the greater the insights
00:43:39.04 and the resolution of the information that we can gain
00:43:42.12 about how similar they are, how they are related to each other, and the specific mechanisms by which they have evolved.
00:43:48.14 80% of the genes are of unknown function, and we and others have our work cut out
00:43:56.01 to try to find out what these are, what they do,
00:43:59.17 what they look like structurally, and why they are there.
00:44:04.09 We are beginning to learn about how they got to be there in these genomes.
00:44:08.12 Now we need to know what they do.
00:44:10.04 I've told you that the techniques have now been established
00:44:14.02 so that we can begin to readily manipulate these genomes,
00:44:17.00 tools that one could again imagine applying to other bacterial hosts
00:44:22.14 and other viruses in order to address these questions.
00:44:26.10 And I think that we now have a very powerful toolbox
00:44:30.14 in this large set of phages, in this large number of genomes,
00:44:36.07 that can be used to understand what makes
00:44:41.06 Mycobacterium tuberculosis, a major human pathogen, tick.
00:44:46.26 And how we can exploit and use those genes and those genomes
00:44:51.21 to contribute towards the diagnosis, prevention, and cure of human TB.
00:45:00.10 I would like to finish by acknowledging those who have helped to support this research,
00:45:07.19 both the National Institutes of Health and the Howard Hughes Medical Institute.
00:45:11.23 All the work that I have talked about was performed by a truly stunning set of colleagues,
00:45:18.28 and I've listed many of their names there.
00:45:25.01 As I mentioned throughout, the genomic work has in part been done
00:45:30.26 by a large number of undergraduate students and high school students,
00:45:35.23 both in Pittsburgh and beyond. I don't have you all listed here,
00:45:39.13 but the contributions I think are really massive,
00:45:42.20 and I acknowledge that contribution, and thank you for that.
00:45:46.24 And so thank you for your attention to this iBioSeminars lecture.