Introduction to Protein Design and Protein Design Algorithms
Transcript of Part 1: Introduction to Protein Design
00:00:07.17 Hi. 00:00:08.19 I'm David Baker of the University of Washington, 00:00:11.14 and today I'm going to give you an introduction 00:00:13.09 to protein design. 00:00:16.00 Proteins function 00:00:18.17 by folding to unique native structures, 00:00:21.08 and some representative native structures 00:00:23.00 are shown on this slide. 00:00:25.24 Proteins are encoded in genes 00:00:28.12 in our genomes. 00:00:30.07 Each gene encodes one protein, 00:00:32.05 and the proteins fold up to these 00:00:34.05 unique native structures 00:00:36.03 in order to carry out their biological function. 00:00:40.04 Native structures of proteins 00:00:42.00 are likely the lowest energy states 00:00:44.17 for the protein sequence, 00:00:47.13 so for each amino acid sequence 00:00:50.17 of a protein 00:00:52.11 there corresponds an energy landscape, 00:00:54.25 of which I've shown a cartoon here, 00:00:57.10 and there are many different possible conformations 00:01:00.01 a protein can have. 00:01:01.29 The native state of a protein 00:01:03.13 is the lowest energy state, 00:01:05.04 which I've shown here. 00:01:08.28 There are two research problems 00:01:10.18 I'm going to describe today. 00:01:12.10 The first problem 00:01:14.03 is the problem of predicting protein structure. 00:01:16.28 In our genomes, 00:01:18.28 we have on the order of 30,000 different genes. 00:01:22.20 Each encodes a unique protein, 00:01:24.22 and each organism that exists on Earth 00:01:27.23 has a different genome 00:01:29.17 with a different complement of genes, 00:01:31.09 and hence proteins. 00:01:33.06 So, there's a general problem 00:01:35.03 of predicting what the structures and functions 00:01:37.12 of these proteins are. 00:01:39.04 So, the top arrow 00:01:42.13 shows going from an amino acid sequence 00:01:45.02 to a 3-dimensional structure. 00:01:48.06 So, in this case 00:01:50.01 we have a fixed amino acid sequence 00:01:52.08 and we have to find the lowest-energy structure. 
00:01:55.05 The inverse problem 00:01:56.26 is the protein design problem, 00:01:58.20 which I'm going to focus on today. 00:02:00.10 In this case, 00:02:01.18 we don't start with a naturally occurring amino acid sequence 00:02:03.27 or a naturally occurring structure. 00:02:05.20 Rather, we start with a brand new structure 00:02:08.12 that we'd like to make 00:02:09.29 and we go backwards 00:02:11.12 to find an amino acid sequence 00:02:13.13 which will fold up to that structure. 00:02:16.23 Both of these problems, 00:02:18.14 the protein structure prediction problem 00:02:20.15 and the protein design problem, 00:02:21.23 are very hard problems, 00:02:23.16 and I'm going to tell you why in the next few slides. 00:02:26.18 The first reason they're hard 00:02:28.13 is that a polypeptide chain 00:02:30.25 can have a very large number of different possible conformations. 00:02:33.20 For each side chain in a... 00:02:36.20 for each amino acid in a protein chain, 00:02:39.19 there are many rotatable bonds, 00:02:41.24 as shown in this schematic, 00:02:43.28 so each side chain, 00:02:45.12 each amino acid can have 00:02:47.11 on the order of 3 different conformations. 00:02:50.24 So, if you have a 100 residue protein, 00:02:53.03 that means you have 3 conformations 00:02:55.07 for the first one, 00:02:56.18 3 for the second one, 00:02:58.05 and the number of possible conformations, total, 00:03:00.00 you get by multiplying together 00:03:01.23 all of these possibilities. 00:03:03.13 So, it's 3 times 3 times 3... 00:03:05.22 up to 100 times. 00:03:08.05 So, more generally, 00:03:09.27 if you have... 00:03:11.25 if Nres is the number of amino acids in the protein, 00:03:13.24 the number of different conformations 00:03:15.20 is 3 to that power, so 3^Nres. 00:03:18.11 And this is an astronomical number. 
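[Editor's sketch] The counting argument above can be checked with a few lines of Python. This is only an illustration of the arithmetic in the talk; the function name `count_conformations` and its parameters are made up for this example, and the assumption that every residue independently has exactly 3 states is the talk's rough estimate, not a measured value.

```python
def count_conformations(n_res: int, states_per_residue: int = 3) -> int:
    """Size of the search space if each residue independently
    adopts one of `states_per_residue` conformations."""
    return states_per_residue ** n_res


# The 100-residue example from the talk: 3 multiplied by itself 100 times.
n = count_conformations(100)
print(f"{n:.2e}")    # scientific notation, on the order of 10^47
print(len(str(n)))   # number of decimal digits in the exact count
```

Python integers are arbitrary precision, so the exact count is available even though it is far too large to enumerate.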
00:03:21.13 The second reason that these problems are hard, 00:03:24.26 in particular the design problem is hard, 00:03:26.21 is there's also an astronomical number of protein sequences. 00:03:29.09 So again, the first residue 00:03:30.27 can be any 1 of the 20 different amino acids. 00:03:33.05 The second position 00:03:34.20 can also be any 1 of the 20 amino acids, 00:03:37.29 so the number of possible sequences 00:03:39.22 is 20 times 20 times 20... 00:03:41.12 to the Nres power, 00:03:42.29 which is again a very, very large number. 00:03:45.24 The third reason that these are hard problems 00:03:48.06 is that we need to find the lowest energy structure 00:03:52.08 for a sequence, 00:03:53.23 for example, in the protein structure prediction problem. 00:03:56.00 It's hard because calculating energies 00:03:57.20 is difficult to do accurately 00:03:59.28 because proteins have many, many atoms 00:04:02.25 and they're surrounded by water molecules, 00:04:05.05 which also have many atoms. 00:04:07.03 Each water only has three atoms, 00:04:09.11 but there are many, many water molecules. 00:04:11.07 So, we need to calculate energies accurately 00:04:13.22 for systems that have many 1000s of atoms. 00:04:17.22 And now what I'm going to do 00:04:19.10 is tell you about how we go about 00:04:21.23 solving these problems. 
00:04:23.21 So, to search through the possible 00:04:26.22 conformations for a protein, 00:04:29.00 we try and mimic the actual folding process, 00:04:33.14 and here you see a movie 00:04:37.06 depicting the computer calculation 00:04:39.04 -- this is using the Rosetta methodology 00:04:41.08 which my group and others 00:04:43.06 have been developing for the last 15 years or so -- 00:04:46.06 we try and simulate the actual process of folding 00:04:48.27 so we can sample through 00:04:51.12 and find the lowest energy structures 00:04:53.04 much more quickly than we could 00:04:55.08 if we were sampling all possible configurations, 00:04:57.29 which is essentially impossible. 00:05:00.25 So, this calculation that you see here 00:05:04.19 takes not much longer 00:05:06.16 than it takes you to watch it to actually calculate, 00:05:09.03 to actually carry out on a computer. 00:05:11.24 The challenge is that 00:05:14.01 every folding calculation like this, 00:05:16.03 or nearly every one, 00:05:17.24 will end up in a different final structure, 00:05:19.20 so what we need to do is many, many of these 00:05:22.20 independent calculations 00:05:24.27 to build up a picture 00:05:27.01 of what that energy landscape looks like 00:05:29.02 and where the lowest energy structure is. 00:05:33.03 The second problem that I mentioned 00:05:35.11 -- searching through the space of sequences -- 00:05:37.21 we handle as shown in this animation. 00:05:42.29 Starting with a protein backbone 00:05:45.09 for which we want to find a very low-energy sequence, 00:05:48.16 we carry out a calculation 00:05:51.05 which at each step 00:05:53.01 we're randomly substituting in a different amino acid identity, 00:05:57.25 and different side chain conformation for that amino acid, 00:05:59.29 at a randomly selected position. 00:06:02.19 We can do these substitutions very rapidly, 00:06:05.11 we evaluate the energy, 00:06:07.14 and we accept the change 00:06:09.07 if the energy got lower. 
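[Editor's sketch] The substitution loop just described (pick a random position, try a random amino acid, keep the change only if the energy drops) can be sketched as follows. Everything here is a toy: `toy_energy`, the `HYDROPHOBIC` grouping, and the buried/exposed pattern are assumptions invented for illustration and are not the Rosetta energy function or its actual search protocol. The sketch also changes only the amino acid identity and leaves out the side chain conformations mentioned in the talk.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWC")  # simplistic grouping, assumed for this toy


def toy_energy(sequence, buried):
    """Hypothetical energy: one penalty point for every position whose
    residue type does not match its burial (hydrophobic where buried,
    polar where exposed)."""
    return sum((aa in HYDROPHOBIC) != is_buried
               for aa, is_buried in zip(sequence, buried))


def design_sequence(buried, n_steps=2000, seed=0):
    """Random single-site substitutions, accepted only if energy drops."""
    rng = random.Random(seed)
    sequence = [rng.choice(AMINO_ACIDS) for _ in buried]
    energy = toy_energy(sequence, buried)
    for _ in range(n_steps):
        pos = rng.randrange(len(sequence))         # randomly selected position
        old = sequence[pos]
        sequence[pos] = rng.choice(AMINO_ACIDS)    # random new identity
        new_energy = toy_energy(sequence, buried)
        if new_energy < energy:
            energy = new_energy                    # accept: energy got lower
        else:
            sequence[pos] = old                    # reject: restore old residue
    return "".join(sequence), energy


# Hypothetical 10-position backbone: True means the position is buried.
pattern = [True, False, True, True, False, False, True, False, True, False]
designed, final_energy = design_sequence(pattern)
print(designed, final_energy)
```

Because each step touches a single position, a real implementation only needs to re-evaluate the interactions involving that position, which is part of why such substitutions can be tried very rapidly.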
00:06:11.02 So, in this way, 00:06:12.21 we can scan through a very large number of possible sequences 00:06:15.21 and quite rapidly 00:06:17.17 identify the lowest energy sequence 00:06:19.25 for a structure. 00:06:22.02 The third problem, 00:06:24.04 the necessity 00:06:25.28 to calculate energies accurately, 00:06:28.18 we solve in the following way. 00:06:30.10 We use a model in which 00:06:32.01 we try and capture 00:06:33.25 the detailed interactions between atoms 00:06:35.16 as accurately as we can, 00:06:38.16 so there are terms in the energy function 00:06:40.23 that favor close atomic packing, 00:06:43.13 but the atoms can't be overlapping, 00:06:46.07 they penalize the burial of polar atoms 00:06:48.21 that would like to interact with solvent, 00:06:51.29 they penalize the burial of such atoms 00:06:53.18 away from water, 00:06:55.14 they favor the formation of hydrogen bonding interactions 00:06:58.04 between polar atoms, 00:06:59.25 we model the electrostatic interactions, 00:07:02.05 the favorability of positive and negative charges 00:07:04.20 to be close together, 00:07:06.10 and we also model 00:07:08.00 the bending preferences 00:07:09.21 of the polypeptide chain. 00:07:12.15 So, given what I've told you, 00:07:14.29 the algorithms for searching 00:07:17.06 for the lowest-energy structure 00:07:19.02 for a given amino acid sequence, 00:07:21.03 that was in the movie where the protein structure 00:07:23.23 was moving around, 00:07:26.12 and the algorithm 00:07:27.29 for searching for the lowest-energy sequence 00:07:29.29 for a fixed structure, 00:07:31.16 there are again two problems 00:07:33.24 which we can approach. 00:07:35.11 The first problem is the structure prediction problem 00:07:37.20 where, again, we are going from genome sequences 00:07:41.01 to try to... 00:07:43.20 starting from those 00:07:45.18 and predicting the structures and functions 00:07:47.09 of the proteins that are encoded by those genes. 
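[Editor's sketch] Two of the energy terms listed above, close packing without overlap and the attraction between opposite charges, have textbook functional forms that make the idea concrete. The code below uses a Lennard-Jones 12-6 potential and Coulomb's law in reduced units. It is an assumption-laden illustration of what a pairwise energy term looks like, not the energy function actually used in Rosetta; the weights, units, and the function names are hypothetical.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """12-6 Lennard-Jones potential: strongly repulsive when atoms
    overlap, mildly attractive when they pack closely."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, q1, q2, k=1.0):
    """Coulomb interaction: negative (favorable) for opposite charges."""
    return k * q1 * q2 / r


def toy_pair_energy(r, q1, q2, w_lj=1.0, w_elec=1.0):
    """Weighted sum of the two terms for one pair of atoms.
    The weights are placeholders chosen for illustration."""
    return w_lj * lennard_jones(r) + w_elec * coulomb(r, q1, q2)


# The Lennard-Jones minimum sits at r = 2^(1/6) * sigma with depth -epsilon.
r_min = 2 ** (1 / 6)
print(lennard_jones(r_min))           # close to -1.0 in reduced units
print(lennard_jones(0.8))             # positive: overlapping atoms are penalized
print(toy_pair_energy(r_min, 1, -1))  # packing plus charge attraction
```

A full energy function would sum terms like these over all atom pairs and add the solvation, hydrogen bonding, and backbone bending terms described in the talk.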
00:07:50.08 The second problem is the design problem, 00:07:53.03 where we start with something completely new 00:07:55.18 that we would like to make 00:07:57.24 and work backwards 00:07:59.28 to identify a sequence 00:08:01.28 which is predicted to fold up to that structure. 00:08:05.03 And, for the remainder of this talk, 00:08:07.24 I'm going to describe some examples 00:08:10.20 of the second type of calculation, 00:08:12.29 the design calculation. 00:08:16.15 First I want to give you an overview 00:08:18.16 of the different types of protein structures 00:08:20.09 found in nature. 00:08:22.25 There in the top left is a depiction of 00:08:26.04 a globular protein, 00:08:29.23 where the secondary structure elements, 00:08:31.14 the alpha-helices and the beta-sheets, 00:08:33.15 come together and form a roughly spherical protein 00:08:36.23 with hydrophobic residues buried in the interior, 00:08:40.09 and it's the burial of those hydrophobic residues 00:08:42.15 away from solvent 00:08:44.07 which stabilizes the protein. 00:08:46.04 On the right is a protein 00:08:49.01 that consists of long helices packed together 00:08:51.28 to make, for example 00:08:54.07 in the case of what's shown, 00:08:55.26 a channel protein. 00:08:58.00 In the lower left is a repeat protein 00:09:01.02 in which a very simple module 00:09:02.22 is repeated over and over and over again 00:09:04.19 to make a long filament. 00:09:07.17 And then finally, on the bottom right 00:09:10.06 is a small protein 00:09:12.14 which is held together with disulfide bonds, 00:09:14.19 which are shown in yellow. 00:09:16.28 And, nature accomplishes 00:09:19.06 all the great diversity of biological functions, 00:09:22.11 in our bodies and in all living things, 00:09:24.25 through different... 00:09:26.27 by utilizing these different types of proteins 00:09:28.26 in different circumstances 00:09:30.13 where each one is most appropriate. 
00:09:32.05 So, what I'm going to describe now 00:09:34.18 is our efforts to design 00:09:37.00 ideal versions of these classes of proteins, 00:09:40.08 not a protein that exists in nature, 00:09:42.25 but sort of like the Platonic ideal 00:09:44.22 of a globular protein 00:09:46.08 or a repeat protein. 00:09:48.21 In contrast, what's... 00:09:52.03 what has come through evolution 00:09:54.01 has been the result of natural selection, 00:09:56.09 so random amino acid substitutions, then selection... 00:10:00.01 the process that... 00:10:01.28 and so what the result is... 00:10:03.16 the proteins you actually get have a lot of history in them 00:10:05.25 and they may have initially functioned in one way 00:10:08.18 and then they were co-opted for something else, 00:10:10.26 so each protein has a lot of idiosyncrasies 00:10:13.03 because of its history. 00:10:14.04 What I'm going to now describe to you 00:10:15.20 is taking what we've learned about 00:10:18.05 these classes of proteins 00:10:19.19 and the algorithms I've described to make, 00:10:20.23 again, sort of idealized protein structures 00:10:23.02 which are free of those types of idiosyncrasies. 00:10:27.14 And, the way this works 00:10:29.15 is I've outlined how the calculations... 00:10:32.03 how we calculate a sequence 00:10:33.25 which is predicted to fold up to a given structure, 00:10:37.08 but that's just the first step. 00:10:39.00 The next step is, 00:10:40.19 since we've designed the protein, 00:10:42.13 we know what its amino acid sequence is 00:10:44.11 because we came up with that amino acid sequence... 00:10:47.00 from the amino acid sequence 00:10:48.17 we can work back to the DNA sequence, 00:10:51.07 that's using the genetic code 00:10:53.01 which was worked out in the 1960s... 00:10:55.18 once we know the DNA sequence 00:10:57.17 we can write down... 
00:11:00.17 we can essentially buy, 00:11:03.04 or make very easily in the lab, 00:11:05.04 a synthetic piece of DNA 00:11:07.10 that encodes this protein. 00:11:09.08 So, the protein we've designed on the computer 00:11:10.11 will have never existed in nature, 00:11:12.12 it's something completely new, 00:11:14.29 and the real miracle of this 00:11:16.26 is that it's so easy to manufacture DNA these days 00:11:19.20 that we can, for any crazy protein we design on the computer, 00:11:23.17 we can very, very easily 00:11:26.12 make a gene that encodes that protein 00:11:28.27 and once we have that gene 00:11:30.20 we can make the protein in the laboratory 00:11:33.13 by putting the gene into bacteria, 00:11:35.25 growing up the bacteria, 00:11:37.20 we can extract the protein out, 00:11:39.13 and then we can determine 00:11:41.01 whether that protein folds up to the structure 00:11:43.15 that we designed, 00:11:45.15 and we can also measure other properties of the protein. 00:11:49.00 So, what I'm going to tell you about 00:11:50.23 are several design calculations. 00:11:53.09 We set out to make a brand new protein 00:11:54.26 that was an idealized version 00:11:56.16 of what exists in nature. 00:11:58.21 We carried out the design calculation, 00:12:00.26 we designed a gene encoding the designed protein, 00:12:03.20 we put it into bacteria, 00:12:05.00 purified the protein, 00:12:06.16 and then solved the structure. 00:12:07.28 So, I'm going to be showing you the designed models 00:12:10.01 and then the crystal structures 00:12:11.26 of those designs 00:12:13.16 that we determined experimentally. 00:12:16.18 So, the first example 00:12:18.16 is of the class of globular proteins, 00:12:20.27 which are composed of regular secondary structure elements 00:12:23.11 surrounding a hydrophobic core. 
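[Editor's sketch] The step of working back from a designed amino acid sequence to a DNA sequence with the genetic code can be illustrated with a simple lookup table. The table below picks one valid codon from the standard genetic code for each amino acid; which codon to prefer is an arbitrary choice made for this example, and real gene synthesis typically also optimizes codon usage for the host organism. The function name `back_translate` and the example peptide are hypothetical.

```python
# One valid standard-genetic-code codon per amino acid (arbitrary choice
# among synonymous codons, made only for this illustration).
PREFERRED_CODON = {
    "A": "GCT", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGT", "H": "CAC", "I": "ATC",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTC", "P": "CCG",
    "S": "TCT", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTT",
}
STOP_CODON = "TAA"


def back_translate(protein: str, add_stop: bool = True) -> str:
    """Return a DNA sequence that encodes `protein` (one-letter codes)."""
    codons = []
    for aa in protein.upper():
        if aa not in PREFERRED_CODON:
            raise ValueError(f"unknown amino acid: {aa!r}")
        codons.append(PREFERRED_CODON[aa])
    if add_stop:
        codons.append(STOP_CODON)
    return "".join(codons)


# Hypothetical short designed peptide.
print(back_translate("MKW"))
```

Because most amino acids have several synonymous codons, many different DNA sequences encode the same designed protein; this sketch returns just one of them.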
00:12:27.23 After we do the design calculation, 00:12:30.00 where we come up with a sequence 00:12:31.12 that's predicted to adopt the structure, 00:12:34.19 and the two structures I'm talking about here 00:12:36.24 are the ones that are shown 00:12:38.19 under the design column on this slide, 00:12:40.25 again they're idealized so all the helices are perfect helices, 00:12:43.13 the strands are perfect strands, 00:12:45.09 and the loops are very regular, 00:12:47.25 there's one more step. 00:12:49.24 We take advantage of the protein structure prediction calculation 00:12:52.11 I described. 00:12:53.29 So, we take those sequences 00:12:55.20 and we send them out to volunteers 00:12:57.07 all around the world 00:12:58.23 who participate in a project called Rosetta@home, 00:13:00.19 and these volunteers 00:13:02.11 predict what the structure is 00:13:05.21 of that sequence; 00:13:07.00 they search for the lowest-energy state 00:13:08.08 of that sequence. 00:13:09.26 And, in the plots on the left, 00:13:11.20 you see many, many red dots. 00:13:13.22 Each red dot is the result 00:13:15.07 of a different Rosetta@home volunteer. 00:13:18.00 On the y-axis is the energy 00:13:19.23 that's calculated by the Rosetta program 00:13:22.12 that's running on their computer, 00:13:24.05 and on the x-axis 00:13:26.02 is how far away that low-energy structure they found 00:13:29.24 was from the structure we're trying to make, 00:13:32.00 the one that's in the design column. 00:13:34.02 And, you can see, first of all, 00:13:35.13 how big and complicated the space is 00:13:37.04 by the fact that 00:13:39.04 many of these lowest-energy structures that are found 00:13:41.07 are very far away from the structure 00:13:44.10 that we're targeting. 00:13:45.19 So, the x-axis is root-mean-squared deviation 00:13:47.24 in the atomic coordinates. 00:13:50.09 So, these structures on the right of these plots 00:13:53.22 are 10 Ångstroms... 
each atom is on average 10 Ångstroms away 00:13:56.26 from where it was supposed to be 00:13:58.13 in the designed model. 00:14:00.18 So, you can see that different people land 00:14:02.18 in different local minima on the landscape, 00:14:04.18 so different ones of those bumps 00:14:06.09 or those wells 00:14:07.28 that I showed in that schematic near the beginning. 00:14:10.01 But, what you can see is true for both of these sequences 00:14:12.18 is that the lower the energy, 00:14:14.10 that's again on the y-axis... 00:14:16.03 the lower the energy 00:14:18.02 the more the structure tends toward 00:14:20.28 the designed model, 00:14:22.13 and so there's almost a funnel shape 00:14:23.25 to these plots where, 00:14:25.22 as you go to lower and lower RMSD, going left, 00:14:28.23 the energy gets lower and lower. 00:14:30.17 So, the lowest-energy structures 00:14:32.08 found by our Rosetta@home volunteers, 00:14:36.05 who really play a critical role in our research, 00:14:38.18 the lowest-energy structures 00:14:40.11 are almost identical to the designed model. 00:14:42.09 When we see this property, 00:14:43.29 which is the one that we are looking for, 00:14:46.05 we then manufacture a gene, 00:14:48.15 a synthetic piece of DNA that encodes the design, 00:14:51.00 we make it in the lab, 00:14:52.22 and then we solve the structure, 00:14:54.09 in this case by nuclear magnetic resonance, 00:14:56.06 with colleagues 00:14:59.00 in the NESG structural genomics consortium. 00:15:02.09 And, on the right 00:15:04.08 the column marked NMR 00:15:06.12 shows the experimentally determined structure, 00:15:08.23 and you can see it's very similar 00:15:10.07 to the designed models 00:15:12.04 in the second column. 00:15:13.21 And, then on the far right are superpositions... 
00:15:17.25 blow-up superpositions 00:15:19.28 of the designed model and the experimental structure, 00:15:21.24 and they show that the side chains in these designs are, 00:15:24.20 in actuality, 00:15:28.01 where we designed them to be. 00:15:30.09 So, we've been able to make such structures 00:15:33.19 pretty routinely now, 00:15:35.07 so we can make brand new globular protein structures like this 00:15:38.23 quite effectively. 00:15:40.04 In fact, a new student coming to my laboratory 00:15:42.02 typically is assigned the project 00:15:43.26 of making up a brand new protein structure 00:15:45.29 and proving that the design... 00:15:47.20 designing it and then 00:15:49.28 characterizing the design in the laboratory. 00:15:53.12 Now, we can get to larger structures in this way... 00:15:58.25 we can make these Platonic ideals of globular proteins 00:16:01.29 and we can put them together 00:16:04.12 to make larger and more complex structures. 00:16:06.27 So, this shows an example of taking two of the... 00:16:09.26 two idealized building blocks 00:16:11.16 we've solved the structure of, fusing them together, 00:16:14.04 and in the lower panel on the left 00:16:15.23 is the designed model 00:16:17.20 and on the right is the crystal structure. 00:16:19.09 So again, this is a completely made up protein, 00:16:21.18 but when we solve its structure experimentally 00:16:23.20 it comes out exactly as we designed it. 00:16:28.14 Now, the second class of proteins I described 00:16:31.03 are not globular, they're not spherical, 00:16:33.16 they can be long and elongated, 00:16:35.09 and this is actually a protein that's very close to my heart 00:16:37.14 because I designed it myself. 00:16:39.09 This protein... 00:16:40.25 a schematic of it is shown on the top right. 
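[Editor's sketch] The comparisons between designed models and experimental structures, and the x-axis of the energy plots discussed earlier, rest on the root-mean-squared deviation (RMSD) in atomic coordinates. A minimal version of that calculation is sketched below. It assumes the two structures have the same atoms in the same order and have already been superimposed; optimal superposition is a separate step that this sketch does not perform. The function name and the example coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-squared deviation between two equally long lists of
    (x, y, z) coordinates, assumed to be already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical three-atom "design" and a copy displaced by 10 Å along x.
design = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted = [(x + 10.0, y, z) for x, y, z in design]
print(rmsd(design, design))   # identical structures
print(rmsd(design, shifted))  # every atom displaced by the same 10 Å
```

When every atom is displaced by the same distance, the RMSD equals that distance, which matches the reading of the plots given in the talk.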
00:16:42.28 This is composed of 80-residue helices, 00:16:45.18 and I made it taking advantage 00:16:47.14 of the equations that Francis Crick worked out 00:16:50.24 whereby a backbone structure can be described 00:16:53.27 by a small number of parameters, 00:16:55.29 and I can make many, many different such structures 00:16:58.28 by sampling through different possibilities for these parameters. 00:17:01.26 I do that 00:17:03.21 and then I design each possibility 00:17:05.17 and choose the lowest-energy structures. 00:17:08.01 When this protein is manufactured in the lab... 00:17:12.06 when it was manufactured... 00:17:14.11 I did some initial tests 00:17:16.02 and found it was very stable, 00:17:17.29 and then Joe Rogers, a graduate student in England, 00:17:20.12 was asking me for a protein to do experiments on 00:17:23.14 so I sent him this protein 00:17:25.18 and he sent back this result, which is really quite remarkable. 00:17:30.03 In order to unfold this protein, 00:17:33.18 you have to add extremely high amounts 00:17:35.18 of a chemical denaturant called guanidine, 00:17:37.21 that's on this plot on the left, 00:17:40.17 and the unfolding... 00:17:42.27 you can see that these lines, 00:17:46.07 as you add more guanidine, are pretty flat, 00:17:48.10 and then at very high concentrations, over 7 molar, 00:17:50.22 the protein starts to unfold, 00:17:52.15 but only really does this at very high temperature. 00:17:54.26 So, this is something that's simply not seen 00:17:56.22 for naturally occurring proteins. 00:17:58.15 These designed proteins can be more ideal, 00:18:00.07 so much more stable. 00:18:01.26 And, when the crystal structure was solved of this protein, 00:18:03.25 it was found to be nearly identical 00:18:05.19 to the designed model. 00:18:07.01 So, we can make this class of proteins also. 
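[Editor's sketch] The idea of describing a backbone with a small number of parameters can be illustrated with something much simpler than Crick's coiled-coil equations: a single straight, ideal alpha-helix whose C-alpha positions follow from three numbers. The sketch below is a simplified stand-in, not the parameterization used for the design in the talk. The default values (radius about 2.3 Å, rise 1.5 Å per residue, 100 degrees per residue) are standard textbook figures for an alpha-helix, and the function name is hypothetical.

```python
import math


def ideal_helix_ca_trace(n_res, radius=2.3, rise=1.5, turn_deg=100.0):
    """C-alpha coordinates (in Å) of a straight ideal helix along z,
    generated from three parameters: radius, rise per residue, and
    rotation per residue."""
    coords = []
    for i in range(n_res):
        angle = math.radians(turn_deg * i)
        coords.append((radius * math.cos(angle),
                       radius * math.sin(angle),
                       rise * i))
    return coords


# An 80-residue helix, the length mentioned in the talk.
trace = ideal_helix_ca_trace(80)
print(len(trace))
print(math.dist(trace[0], trace[1]))  # spacing between neighboring C-alphas
print(trace[-1][2])                   # total length along the helix axis
```

Sampling different values of such parameters yields a family of candidate backbones, each of which could then be passed to a sequence design step and ranked by energy, as described above.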
00:18:10.13 I mentioned repeat proteins, 00:18:12.15 that was a third class, 00:18:14.16 and we've also been able to make 00:18:16.29 idealized versions of these types of proteins. 00:18:19.22 So, in the second column here, 00:18:23.09 you see a repeat protein 00:18:25.22 that goes on indefinitely, 00:18:27.23 and on the left is 00:18:30.02 a comparison of the designed model in red 00:18:33.11 to the crystal structure in grey. 00:18:35.07 You can see they're nearly identical. 00:18:37.28 And, on the right you see another example 00:18:40.08 of an infinitely extending repeat protein 00:18:42.27 where we've made one subsegment of it in the lab, 00:18:46.10 and you again see that the crystal structure 00:18:48.29 is nearly identical to the designed model. 00:18:51.27 So, we're very excited about these 00:18:53.20 as the basis for new types of nanomaterials. 00:18:56.07 We can make rods, 00:18:58.05 straight rods and curved rods, 00:18:59.29 and start building things out of them. 00:19:04.04 And the final class of proteins, 00:19:06.11 those small disulfide-bonded proteins, 00:19:08.10 are very interesting because they could form the basis 00:19:10.24 of new types of therapeutics 00:19:12.14 because they're very small and easy to make. 00:19:15.13 And, here this shows examples of... 00:19:18.28 this is work by Vikram Mulligan, a postdoc in the lab, 00:19:22.00 where he's designed 00:19:23.20 very short peptides 00:19:25.16 that are predicted to fold up to unique structures, 00:19:28.28 and there are three examples in the top row of this slide 00:19:31.25 of designs he made, 00:19:33.25 then below that are NMR structures of these peptides 00:19:36.04 when they're actually made in the lab. 00:19:38.15 And again, these peptides 00:19:40.19 come out with very, very similar structures 00:19:42.28 to the designed models. 00:19:45.08 So, what I hope I've shown you today 00:19:47.07 is I've given you... 00:19:50.01 explained something about how... 
00:19:53.11 about the protein structure prediction problem 00:19:56.02 and the protein design problem. 00:19:57.18 I've told you how we go about 00:19:59.13 approaching these problems, 00:20:01.01 and then I've shown you that we can start to design 00:20:03.08 sort of idealized versions 00:20:05.05 of the different classes of proteins 00:20:07.02 that are found in nature, 00:20:08.29 and these proteins are likely... 00:20:12.11 will be the basis for designing a whole new world 00:20:15.18 of functional proteins to solve modern day problems, 00:20:20.05 and I'll talk about that in another iBio seminar. 00:20:24.04 And, I want to acknowledge 00:20:25.26 the fantastic people 00:20:27.24 who have actually done most of this work. 00:20:30.00 So, Nobu and Rie Koga 00:20:32.20 developed these rules for making idealized protein structures, 00:20:36.01 and I showed you... 00:20:38.07 took you through the design of two of their structures. 00:20:40.21 Vikram Mulligan, I mentioned, 00:20:42.08 did the designed cyclic peptide work. 00:20:44.06 TJ Brunette, 00:20:46.23 Po-Ssu Huang, 00:20:48.18 and Fabio did the work on the repeat proteins. 00:20:53.03 And thank you for your attention.