Introduction to Protein Design and Protein Design Algorithms

Transcript of Part 1: Introduction to Protein Design

00:00:07.17	Hi.
00:00:08.19	I'm David Baker of the University of Washington,
00:00:11.14	and today I'm going to give you an introduction
00:00:13.09	to protein design.
00:00:16.00	Proteins function
00:00:18.17	by folding to unique native structures,
00:00:21.08	and some representative native structures
00:00:23.00	are shown on this slide.
00:00:25.24	Proteins are encoded in genes
00:00:28.12	in our genomes.
00:00:30.07	Each gene encodes one protein,
00:00:32.05	and the proteins fold up to these
00:00:34.05	unique native structures
00:00:36.03	in order to carry out their biological function.
00:00:40.04	Native structures of proteins
00:00:42.00	are likely the lowest energy states
00:00:44.17	for the protein sequence,
00:00:47.13	so for each amino acid sequence
00:00:50.17	of a protein
00:00:52.11	there corresponds an energy landscape,
00:00:54.25	of which I've shown a cartoon here,
00:00:57.10	and there are many different possible conformations
00:01:00.01	a protein can have.
00:01:01.29	The native state of a protein
00:01:03.13	is the lowest energy state,
00:01:05.04	which I've shown here.
00:01:08.28	There are two research problems
00:01:10.18	I'm going to describe today.
00:01:12.10	The first problem
00:01:14.03	is the problem of predicting protein structure.
00:01:16.28	In our genomes,
00:01:18.28	we have on the order of 30,000 different genes.
00:01:22.20	Each encodes a unique protein,
00:01:24.22	and each organism that exists on Earth
00:01:27.23	has a different genome
00:01:29.17	with a different complement of genes,
00:01:31.09	and hence proteins.
00:01:33.06	So, there's a general problem
00:01:35.03	of predicting what the structures and functions
00:01:37.12	of these proteins are.
00:01:39.04	So, the top arrow
00:01:42.13	shows going from an amino acid sequence
00:01:45.02	to a 3-dimensional structure.
00:01:48.06	So, in this case
00:01:50.01	we have a fixed amino acid sequence
00:01:52.08	and we have to find the lowest-energy structure.
00:01:55.05	The inverse problem
00:01:56.26	is the protein design problem,
00:01:58.20	which I'm going to focus on today.
00:02:00.10	In this case,
00:02:01.18	we don't start with a naturally occurring amino acid sequence
00:02:03.27	or a naturally occurring structure.
00:02:05.20	Rather, we start with a brand new structure
00:02:08.12	that we'd like to make
00:02:09.29	and we go backwards
00:02:11.12	to find an amino acid sequence
00:02:13.13	which will fold up to that structure.
00:02:16.23	Both of these problems,
00:02:18.14	the protein structure prediction problem
00:02:20.15	and the protein design problem,
00:02:21.23	are very hard problems,
00:02:23.16	and I'm going to tell you why in the next few slides.
00:02:26.18	The first reason they're hard
00:02:28.13	is that a polypeptide chain
00:02:30.25	can have a very large number of different possible conformations.
00:02:33.20	For each side chain in a...
00:02:36.20	for each amino acid in a protein chain,
00:02:39.19	there are many rotatable bonds,
00:02:41.24	as shown in this schematic,
00:02:43.28	so each side chain,
00:02:45.12	each amino acid can have
00:02:47.11	on the order of 3 different conformations.
00:02:50.24	So, if you have a 100 residue protein,
00:02:53.03	that means you have 3 conformations
00:02:55.07	for the first one,
00:02:56.18	3 for the second one,
00:02:58.05	and the number of possible conformations, total,
00:03:00.00	you get by multiplying together
00:03:01.23	all of these possibilities.
00:03:03.13	So, it's 3 times 3 times 3...
00:03:05.22	up to 100 times.
00:03:08.05	So, more generally,
00:03:09.27	if you have...
00:03:11.25	if Nres is the number of amino acids in the protein,
00:03:13.24	the number of different conformations
00:03:15.20	is 3 to that power, so 3^Nres.
00:03:18.11	And this is an astronomical number.
00:03:21.13	The second reason that these problems are hard,
00:03:24.26	in particular the design problem is hard,
00:03:26.21	is there's also an astronomical number of protein sequences.
00:03:29.09	So again, the first residue
00:03:30.27	can be any 1 of the 20 different amino acids.
00:03:33.05	The second position
00:03:34.20	can also be any 1 of the 20 amino acids,
00:03:37.29	so the number of possible sequences
00:03:39.22	is 20 times 20 times 20...
00:03:41.12	to the Nres power,
00:03:42.29	which is again a very, very large number.
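
The counting argument above can be checked directly. The following is a minimal Python sketch, assuming the talk's round figures of about 3 conformations per residue and 20 amino acids per position; the function names are illustrative and not taken from any library.

```python
def conformation_count(n_res: int, per_residue: int = 3) -> int:
    """Number of chain conformations if every residue has `per_residue` states."""
    return per_residue ** n_res


def sequence_count(n_res: int, alphabet: int = 20) -> int:
    """Number of distinct sequences of length `n_res` over `alphabet` amino acids."""
    return alphabet ** n_res


if __name__ == "__main__":
    n = 100  # the 100-residue protein used as the example in the talk
    print(f"3^{n} has {len(str(conformation_count(n)))} decimal digits")
    print(f"20^{n} has {len(str(sequence_count(n)))} decimal digits")
```

For a 100-residue protein this gives roughly 5 x 10^47 conformations and roughly 1.3 x 10^130 sequences, which is why neither space can be enumerated exhaustively.
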
00:03:45.24	The third reason that these are hard problems
00:03:48.06	is that we need to find the lowest energy structure
00:03:52.08	for a sequence,
00:03:53.23	for example, in the protein structure prediction problem.
00:03:56.00	It's hard because calculating energies
00:03:57.20	is difficult to do accurately
00:03:59.28	because proteins have many, many atoms
00:04:02.25	and they're surrounded by water molecules,
00:04:05.05	which also have many atoms.
00:04:07.03	Each water only has three atoms,
00:04:09.11	but there are many, many water molecules.
00:04:11.07	So, we need to calculate energies accurately
00:04:13.22	for systems that have many thousands of atoms.
00:04:17.22	And now what I'm going to do
00:04:19.10	is tell you about how we go about
00:04:21.23	solving these problems.
00:04:23.21	So, to search through the possible
00:04:26.22	conformations for a protein,
00:04:29.00	we try and mimic the actual folding process,
00:04:33.14	and here you see a movie
00:04:37.06	depicting the computer calculation
00:04:39.04	-- this is using the Rosetta methodology
00:04:41.08	which my group and others
00:04:43.06	have been developing for the last 15 years or so --
00:04:46.06	we try and simulate the actual process of folding
00:04:48.27	so we can sample through
00:04:51.12	and find the lowest energy structures
00:04:53.04	much more quickly than we could
00:04:55.08	if we were sampling all possible configurations,
00:04:57.29	which is essentially impossible.
00:05:00.25	So, this calculation that you see here
00:05:04.19	takes not much longer
00:05:06.16	than it takes you to watch it to actually calculate,
00:05:09.03	to actually carry out on a computer.
00:05:11.24	The challenge is that
00:05:14.01	every folding calculation like this,
00:05:16.03	or nearly every one,
00:05:17.24	will end up in a different final structure,
00:05:19.20	so what we need to do is many, many of these
00:05:22.20	independent calculations
00:05:24.27	to build up a picture
00:05:27.01	of what that energy landscape looks like
00:05:29.02	and where the lowest energy structure is.
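
The idea of running many independent calculations and keeping the lowest-energy result can be illustrated on a toy problem. This is only a sketch: the one-dimensional `toy_energy` landscape, the greedy random walk, and all parameter values are assumptions made for illustration, and none of it is the Rosetta folding protocol.

```python
import math
import random


def toy_energy(x: float) -> float:
    """Hypothetical rugged 1-D landscape: one broad well plus many ripples."""
    return 0.05 * (x - 2.0) ** 2 + math.sin(3.0 * x)


def one_trajectory(rng, steps=2000, step_size=0.1):
    """Greedy random walk from a random start; returns (final_x, final_energy)."""
    x = rng.uniform(-10.0, 10.0)
    energy = toy_energy(x)
    for _ in range(steps):
        trial = x + rng.uniform(-step_size, step_size)
        trial_energy = toy_energy(trial)
        if trial_energy < energy:  # keep only downhill moves
            x, energy = trial, trial_energy
    return x, energy


def sample_landscape(n_runs=200, seed=0):
    """Run many independent trajectories; each may end in a different minimum."""
    rng = random.Random(seed)
    return [one_trajectory(rng) for _ in range(n_runs)]


if __name__ == "__main__":
    results = sample_landscape()
    best_x, best_energy = min(results, key=lambda r: r[1])
    print(f"distinct end points: {len({round(x, 1) for x, _ in results})}")
    print(f"lowest energy found: {best_energy:.3f} at x = {best_x:.2f}")
```

Each trajectory gets trapped in whichever local minimum is nearest its starting point, so the lowest-energy state only emerges from the collection of runs, not from any single one.
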
00:05:33.03	The second problem that I mentioned
00:05:35.11	-- searching through the space of sequences --
00:05:37.21	we handle as shown in this animation.
00:05:42.29	Starting with a protein backbone
00:05:45.09	for which we want to find a very low-energy sequence,
00:05:48.16	we carry out a calculation
00:05:51.05	in which, at each step,
00:05:53.01	we're randomly substituting in a different amino acid identity,
00:05:57.25	and different side chain conformation for that amino acid,
00:05:59.29	at a randomly selected position.
00:06:02.19	We can do these substitutions very rapidly,
00:06:05.11	we evaluate the energy,
00:06:07.14	and we accept the change
00:06:09.07	if the energy got lower.
00:06:11.02	So, in this way,
00:06:12.21	we can scan through a very large number of possible sequences
00:06:15.21	and quite rapidly
00:06:17.17	identify the lowest energy sequence
00:06:19.25	for a structure.
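
The substitution procedure just described can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `toy_energy` score (hydrophobic residues favored at buried positions, polar ones at exposed positions) stands in for a real energy function, side-chain conformations are left out, and the accept-only-if-lower rule follows the wording of the talk.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AFILMVWY")


def toy_energy(sequence, buried):
    """Hypothetical score: -1 if a residue type suits its position, +1 otherwise."""
    score = 0
    for aa, is_buried in zip(sequence, buried):
        suits = (aa in HYDROPHOBIC) == is_buried
        score += -1 if suits else 1
    return score


def design_sequence(buried, n_steps=5000, seed=0):
    """Random single-site substitutions on a fixed backbone, kept only if energy drops."""
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in buried]
    energy = toy_energy(seq, buried)
    for _ in range(n_steps):
        pos = rng.randrange(len(seq))      # randomly selected position
        old = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)  # randomly selected identity
        new_energy = toy_energy(seq, buried)
        if new_energy < energy:
            energy = new_energy            # accept the change
        else:
            seq[pos] = old                 # reject and restore
    return "".join(seq), energy


if __name__ == "__main__":
    pattern = [True, False] * 6  # hypothetical buried/exposed pattern
    print(design_sequence(pattern))
```

Practical design programs often also accept some uphill moves (a Metropolis criterion) so that the search can escape local minima; the strictly greedy rule is kept here to match the description above.
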
00:06:22.02	The third problem,
00:06:24.04	the necessity
00:06:25.28	to calculate energies accurately,
00:06:28.18	we solve in the following way.
00:06:30.10	We use a model in which
00:06:32.01	we try and capture
00:06:33.25	the detailed interactions between atoms
00:06:35.16	as accurately as we can,
00:06:38.16	so there are terms in the energy function
00:06:40.23	that favor close atomic packing,
00:06:43.13	but the atoms can't be overlapping,
00:06:46.07	they penalize the burial of polar atoms
00:06:48.21	that would like to interact with solvent,
00:06:51.29	they penalize the burial of such atoms
00:06:53.18	away from water,
00:06:55.14	they favor the formation of hydrogen bonding interactions
00:06:58.04	between polar atoms,
00:06:59.25	we model the electrostatic interactions,
00:07:02.05	the favorability of positive and negative charges
00:07:04.20	to be close together,
00:07:06.10	and we also model
00:07:08.00	the bending preferences
00:07:09.21	of the polypeptide chain.
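
To make the idea of an energy function built from separate terms concrete, here is a minimal Python sketch with just two of the terms mentioned: a packing term that rewards close contact but penalizes overlap, and an electrostatic term. It is not the Rosetta energy function. The Lennard-Jones and Coulomb forms, the parameter values, and the weights are textbook-style assumptions, and the solvation, hydrogen-bonding, and backbone terms are omitted.

```python
import math


def lennard_jones(r, r_min=4.0, depth=0.2):
    """12-6 potential: minimum of -depth at r = r_min, steeply repulsive below it."""
    ratio = r_min / r
    return depth * (ratio ** 12 - 2.0 * ratio ** 6)


def coulomb(r, q1, q2, dielectric=10.0):
    """Coulomb energy in kcal/mol for charges in units of e and r in Angstroms."""
    return 332.0 * q1 * q2 / (dielectric * r)


def total_energy(atoms, w_pack=1.0, w_elec=1.0):
    """Weighted sum over all atom pairs; `atoms` is a list of (x, y, z, charge)."""
    total = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            xi, yi, zi, qi = atoms[i]
            xj, yj, zj, qj = atoms[j]
            r = math.dist((xi, yi, zi), (xj, yj, zj))
            total += w_pack * lennard_jones(r) + w_elec * coulomb(r, qi, qj)
    return total


if __name__ == "__main__":
    # Hypothetical pair of opposite charges 4 Angstroms apart.
    print(total_energy([(0.0, 0.0, 0.0, 1.0), (4.0, 0.0, 0.0, -1.0)]))
```

The pairwise double loop also shows why the cost grows quickly with system size: the number of atom pairs scales with the square of the number of atoms.
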
00:07:12.15	So, given what I've told you,
00:07:14.29	the algorithms for searching
00:07:17.06	for the lowest-energy structure
00:07:19.02	for a given amino acid sequence,
00:07:21.03	that was in the movie where the protein structure
00:07:23.23	was moving around,
00:07:26.12	and the algorithm
00:07:27.29	for searching for the lowest-energy sequence
00:07:29.29	for a fixed structure,
00:07:31.16	there are again two problems
00:07:33.24	which we can approach.
00:07:35.11	The first problem is the structure prediction problem
00:07:37.20	where, again, we are going from genome sequences
00:07:41.01	to try to...
00:07:43.20	starting from those
00:07:45.18	and predicting the structures and functions
00:07:47.09	of the proteins that are encoded by those genes.
00:07:50.08	The second problem is the design problem,
00:07:53.03	where we start with something completely new
00:07:55.18	that we would like to make
00:07:57.24	and work backwards
00:07:59.28	to identify a sequence
00:08:01.28	which is predicted to fold up to that structure.
00:08:05.03	And, for the remainder of this talk,
00:08:07.24	I'm going to describe some examples
00:08:10.20	of the second type of calculation,
00:08:12.29	the design calculation.
00:08:16.15	First I want to give you an overview
00:08:18.16	of the different types of protein structures
00:08:20.09	found in nature.
00:08:22.25	There in the top left is a depiction of
00:08:26.04	a globular protein,
00:08:29.23	where the secondary structure elements,
00:08:31.14	the alpha-helices and the beta-sheets,
00:08:33.15	come together and form a roughly spherical protein
00:08:36.23	with hydrophobic residues buried in the interior,
00:08:40.09	and it's the burial of those hydrophobic residues
00:08:42.15	away from solvent
00:08:44.07	which stabilizes the protein.
00:08:46.04	On the right is a protein
00:08:49.01	that consists of long helices packed together
00:08:51.28	to make, for example
00:08:54.07	in the case of what's shown,
00:08:55.26	a channel protein.
00:08:58.00	In the lower left is a repeat protein
00:09:01.02	in which a very simple module
00:09:02.22	is repeated over and over and over again
00:09:04.19	to make a long filament.
00:09:07.17	And then finally, on the bottom right
00:09:10.06	is a small protein
00:09:12.14	which is held together with disulfide bonds,
00:09:14.19	which are shown in yellow.
00:09:16.28	And, nature accomplishes
00:09:19.06	all the great diversity of biological functions,
00:09:22.11	in our bodies and in all living things,
00:09:24.25	through different...
00:09:26.27	by utilizing these different types of proteins
00:09:28.26	in different circumstances
00:09:30.13	where each one is most appropriate.
00:09:32.05	So, what I'm going to describe now
00:09:34.18	is our efforts to design
00:09:37.00	ideal versions of these classes of proteins,
00:09:40.08	not a protein that exists in nature,
00:09:42.25	but sort of like the Platonic ideal
00:09:44.22	of a globular protein
00:09:46.08	or a repeat protein.
00:09:48.21	In contrast to what's been...
00:09:52.03	has come through evolution
00:09:54.01	has been the result of natural selection,
00:09:56.09	so random amino acid substitutions, then selection...
00:10:00.01	the process that...
00:10:01.28	and so what the result is...
00:10:03.16	the proteins you actually get have a lot of history in them
00:10:05.25	and they may have initially functioned in one way
00:10:08.18	and then they were coopted for something else,
00:10:10.26	so each protein has a lot of idiosyncrasies
00:10:13.03	because of its history.
00:10:14.04	What I'm going to now describe to you
00:10:15.20	is taking what we've learned about
00:10:18.05	these classes of proteins
00:10:19.19	and the algorithms I've described to make,
00:10:20.23	again, sort of idealized protein structures
00:10:23.02	which are free of those types of idiosyncrasies.
00:10:27.14	And, the way this works
00:10:29.15	is I've outlined how the calculations...
00:10:32.03	how we calculate a sequence
00:10:33.25	which is predicted to fold up to a given structure,
00:10:37.08	but that's just the first step.
00:10:39.00	The next step is,
00:10:40.19	since we've designed the protein,
00:10:42.13	we know what its amino acid sequence is
00:10:44.11	because we came up with that amino acid sequence...
00:10:47.00	from the amino acid sequence
00:10:48.17	we can work back to the DNA sequence,
00:10:51.07	that's using the genetic code
00:10:53.01	which was worked out in the 1960s...
00:10:55.18	once we know the DNA sequence
00:10:57.17	we can write down...
00:11:00.17	we can essentially buy,
00:11:03.04	or make very easily in the lab,
00:11:05.04	a synthetic piece of DNA
00:11:07.10	that encodes this protein.
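
Working back from an amino acid sequence to a DNA sequence can be sketched as a table lookup. This Python example assumes the standard genetic code and picks one codon per amino acid arbitrarily; the `CODON_FOR` table, the choice of stop codon, and the function name are illustrative, not a description of how the lab actually chooses codons.

```python
# One arbitrarily chosen codon per amino acid from the standard genetic code.
CODON_FOR = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}
STOP_CODON = "TAA"


def back_translate(protein: str) -> str:
    """Return one DNA sequence that encodes `protein`, ending in a stop codon."""
    return "".join(CODON_FOR[aa] for aa in protein.upper()) + STOP_CODON


if __name__ == "__main__":
    print(back_translate("MKV"))
```

Because most amino acids have several codons, many different DNA sequences encode the same protein; in practice codons are usually chosen to suit the organism that will express the gene.
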
00:11:09.08	So, the protein we've designed on the computer
00:11:10.11	will have never existed in nature,
00:11:12.12	it's something completely new,
00:11:14.29	and the real miracle of this
00:11:16.26	is that it's so easy to manufacture DNA these days
00:11:19.20	that we can, for any crazy protein we design on the computer,
00:11:23.17	we can very, very easily
00:11:26.12	make a gene that encodes that protein
00:11:28.27	and once we have that gene
00:11:30.20	we can make the protein in the laboratory
00:11:33.13	by putting the gene into bacteria,
00:11:35.25	growing up the bacteria,
00:11:37.20	we can extract the protein out,
00:11:39.13	and then we can determine
00:11:41.01	whether that protein folds up to the structure
00:11:43.15	that we designed,
00:11:45.15	and we can also measure other properties of the protein.
00:11:49.00	So, what I'm going to tell you about
00:11:50.23	are several design calculations.
00:11:53.09	We set out to make a brand new protein
00:11:54.26	that was an idealized version
00:11:56.16	of what exists in nature.
00:11:58.21	We carried out the design calculation,
00:12:00.26	we designed a gene encoding the designed protein,
00:12:03.20	we put it into bacteria,
00:12:05.00	purified the protein,
00:12:06.16	and then solved the structure.
00:12:07.28	So, I'm going to be showing you the designed models
00:12:10.01	and then the crystal structures
00:12:11.26	of those designs
00:12:13.16	that we determined experimentally.
00:12:16.18	So, the first example
00:12:18.16	is of the class of globular proteins,
00:12:20.27	which are composed of regular secondary structure elements
00:12:23.11	surrounding a hydrophobic core.
00:12:27.23	After we do the design calculation,
00:12:30.00	where we come up with a sequence
00:12:31.12	that's predicted to adopt the structure,
00:12:34.19	and the two structures I'm talking about here
00:12:36.24	are the ones that are shown
00:12:38.19	under the design column on this slide,
00:12:40.25	again they're idealized so all the helices are perfect helices,
00:12:43.13	the strands are perfect strands,
00:12:45.09	and the loops are very regular,
00:12:47.25	there's one more step.
00:12:49.24	We take advantage of the protein structure prediction calculation
00:12:52.11	I described.
00:12:53.29	So, we take those sequences
00:12:55.20	and we send them out to volunteers
00:12:57.07	all around the world
00:12:58.23	who participate in a project called Rosetta@home,
00:13:00.19	and these volunteers
00:13:02.11	predict what the structure is
00:13:05.21	of that sequence;
00:13:07.00	they search for the lowest-energy state
00:13:08.08	of that sequence.
00:13:09.26	And, in the plots on the left,
00:13:11.20	you see many, many red dots.
00:13:13.22	Each red dot is the result
00:13:15.07	of a different Rosetta@home volunteer.
00:13:18.00	On the y-axis is the energy
00:13:19.23	that's calculated by the Rosetta program
00:13:22.12	that's running on their computer,
00:13:24.05	and on the x-axis
00:13:26.02	is how far away that low-energy structure they found
00:13:29.24	was from the structure we're trying to make,
00:13:32.00	the one that's in the design column.
00:13:34.02	And, you can see, first of all,
00:13:35.13	how big and complicated the space is
00:13:37.04	by the fact that
00:13:39.04	many of these lowest-energy structures that are found
00:13:41.07	are very far away from the structure
00:13:44.10	that we're targeting.
00:13:45.19	So, the x-axis is root-mean-squared deviation
00:13:47.24	in the atomic coordinates.
00:13:50.09	So, these structures on the right of these plots
00:13:53.22	are 10 Ångstroms... each atom is on average 10 Ångstroms away
00:13:56.26	from where it was supposed to be
00:13:58.13	in the designed model.
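
The x-axis quantity, root-mean-squared deviation, can be computed as follows. This is a minimal Python sketch that assumes the two structures are already superimposed and list the same atoms in the same order; finding the optimal superposition is a separate step that is not shown.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-squared deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    design = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    # Hypothetical prediction: every atom displaced by the vector (3, 4, 0).
    predicted = [(3.0, 4.0, 0.0), (4.0, 4.0, 0.0)]
    print(rmsd(design, predicted))
```
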
00:14:00.18	So, you can see that different people land
00:14:02.18	in different local minima on the landscape,
00:14:04.18	so different ones of those bumps
00:14:06.09	or those wells
00:14:07.28	that I showed in that schematic near the beginning.
00:14:10.01	But, what you can see is true for both of these sequences
00:14:12.18	is that the lower the energy,
00:14:14.10	that's again on the y-axis...
00:14:16.03	the lower the energy
00:14:18.02	the more the structure tends toward
00:14:20.28	the designed model,
00:14:22.13	and so there's almost a funnel shape
00:14:23.25	to these plots where,
00:14:25.22	as you go to lower and lower RMSD, going left,
00:14:28.23	the energy gets lower and lower.
00:14:30.17	So, the lowest-energy structures
00:14:32.08	found by our Rosetta@home volunteers,
00:14:36.05	who really play a critical role in our research,
00:14:38.18	the lowest-energy structures
00:14:40.11	are almost identical to the designed model.
00:14:42.09	When we see this property,
00:14:43.29	which is the one that we are looking for,
00:14:46.05	we then manufacture a gene,
00:14:48.15	a synthetic piece of DNA that encodes the design,
00:14:51.00	we make it in the lab,
00:14:52.22	and then we solve the structure,
00:14:54.09	in this case by nuclear magnetic resonance,
00:14:56.06	with colleagues
00:14:59.00	in the NESG Structural Genomics consortium.
00:15:02.09	And, on the right
00:15:04.08	you see the column marked NMR
00:15:06.12	shows the experimentally determined structure,
00:15:08.23	and you can see it's very similar
00:15:10.07	to the designed models
00:15:12.04	in the second column.
00:15:13.21	And, then on the far right are superpositions...
00:15:17.25	blow-up superpositions
00:15:19.28	of the designed model and the experimental structure,
00:15:21.24	and they show that the side chains in these designs are,
00:15:24.20	in actuality,
00:15:28.01	where we designed them to be.
00:15:30.09	So, we've been able to make such structures
00:15:33.19	pretty routinely now,
00:15:35.07	so we can make brand new globular protein structures like this
00:15:38.23	quite effectively.
00:15:40.04	In fact, a new student coming to my laboratory
00:15:42.02	typically is assigned the project
00:15:43.26	of making up a brand new protein structure
00:15:45.29	and proving that the design...
00:15:47.20	designing it and then
00:15:49.28	characterizing the design in the laboratory.
00:15:53.12	Now, we can get to larger structures in this way...
00:15:58.25	we can make these Platonic ideals of globular proteins
00:16:01.29	and we can put them together
00:16:04.12	to make larger and more complex structures.
00:16:06.27	So, this shows an example of taking two of the...
00:16:09.26	two idealized building blocks
00:16:11.16	we've solved the structure of, fusing them together,
00:16:14.04	and in the lower panel on the left
00:16:15.23	is the designed model
00:16:17.20	and the right is the crystal structure.
00:16:19.09	So again, this is a completely made up protein,
00:16:21.18	but when we solve its structure experimentally
00:16:23.20	it comes out exactly as we designed it.
00:16:28.14	Now, the second class of proteins I described
00:16:31.03	are not globular, they're not spherical,
00:16:33.16	they can be long and elongated,
00:16:35.09	and this is actually a protein that's very close to my heart
00:16:37.14	because I designed it myself.
00:16:39.09	This protein...
00:16:40.25	a schematic of it is shown on the top right.
00:16:42.28	This is composed of 80-residue helices,
00:16:45.18	and I made it taking advantage
00:16:47.14	of the equations that Francis Crick worked out
00:16:50.24	whereby a backbone structure can be described
00:16:53.27	by a small number of parameters,
00:16:55.29	and I can make many, many different such structures
00:16:58.28	by sampling through different possibilities for these parameters.
00:17:01.26	I do that
00:17:03.21	and then I design each possibility
00:17:05.17	and choose the lowest-energy structures.
00:17:08.01	When this protein is manufactured in the lab...
00:17:12.06	when it was manufactured...
00:17:14.11	I did some initial tests
00:17:16.02	and found it was very stable,
00:17:17.29	and then Joe Rogers, a graduate student in England,
00:17:20.12	was asking me for a protein to do experiments on
00:17:23.14	so I sent him this protein
00:17:25.18	and he sent back this result, which is really quite remarkable.
00:17:30.03	In order to unfold this protein,
00:17:33.18	you have to add extremely high amounts
00:17:35.18	of a chemical denaturant called guanidine,
00:17:37.21	that's on this plot on the left,
00:17:40.17	and the unfolding...
00:17:42.27	you can see that on these lines...
00:17:46.07	as you add more guanidine are pretty flat,
00:17:48.10	and then at very high concentrations, over 7 molar,
00:17:50.22	the protein starts to unfold,
00:17:52.15	but only really does this at very high temperature.
00:17:54.26	So, this is something that's simply not seen
00:17:56.22	for naturally occurring proteins.
00:17:58.15	These designed proteins can be more ideal,
00:18:00.07	so much more stable.
00:18:01.26	And, when the crystal structure was solved of this protein,
00:18:03.25	it was found to be nearly identical
00:18:05.19	to the designed model.
00:18:07.01	So, we can make this class of proteins also.
00:18:10.13	I mentioned repeat proteins,
00:18:12.15	that was a third class,
00:18:14.16	and we've also been able to make
00:18:16.29	idealized versions of these types of proteins.
00:18:19.22	So, on the second column here,
00:18:23.09	you see a repeated protein
00:18:25.22	that goes on indefinitely,
00:18:27.23	and on the left is
00:18:30.02	a comparison of the designed model in red
00:18:33.11	to the crystal structure in grey.
00:18:35.07	You can see they're nearly identical.
00:18:37.28	And, on the right you see another example
00:18:40.08	of an infinitely extending repeat protein
00:18:42.27	where we've made one subsegment of it in the lab,
00:18:46.10	and you again see that the crystal structure
00:18:48.29	is nearly identical to the designed model.
00:18:51.27	So, we're very excited about these
00:18:53.20	as the basis for new types of nanomaterials.
00:18:56.07	We can make rods,
00:18:58.05	straight rods and curved rods,
00:18:59.29	and start building things out of them.
00:19:04.04	And the final class of proteins,
00:19:06.11	those small disulfide-bonded proteins,
00:19:08.10	are very interesting because they could form the basis
00:19:10.24	of new types of therapeutics
00:19:12.14	because they're very small and easy to make.
00:19:15.13	And, here this shows examples of...
00:19:18.28	this is work by Vikram Mulligan, a postdoc in the lab,
00:19:22.00	where he's designed
00:19:23.20	very short peptides
00:19:25.16	that are predicted to fold up to unique structures,
00:19:28.28	and there are three examples in the top row of this slide
00:19:31.25	of designs he made,
00:19:33.25	then below that are NMR structures of these peptides
00:19:36.04	when they're actually made in the lab.
00:19:38.15	And again, these peptides
00:19:40.19	come out with very, very similar structures
00:19:42.28	to the designed models.
00:19:45.08	So, what I hope I've shown you today
00:19:47.07	is I've given you...
00:19:50.01	explained something about how...
00:19:53.11	about the protein structure prediction problem
00:19:56.02	and the protein design problem.
00:19:57.18	I've told you how we go about
00:19:59.13	approaching these problems,
00:20:01.01	and then I've shown you that we can start to design
00:20:03.08	sort of idealized versions
00:20:05.05	of the different classes of proteins
00:20:07.02	that are found in nature,
00:20:08.29	and these proteins are likely...
00:20:12.11	will be the basis for designing a whole new world
00:20:15.18	of functional proteins to solve modern day problems,
00:20:20.05	and I'll talk about that in another iBio seminar.
00:20:24.04	And, I want to acknowledge
00:20:25.26	the fantastic people
00:20:27.24	who have actually done most of this work.
00:20:30.00	So, Nobu and Rie Koga
00:20:32.20	developed these rules for making idealized protein structures,
00:20:36.01	and I showed you...
00:20:38.07	took you through the design of two of their structures.
00:20:40.21	Vikram Mulligan, I mentioned,
00:20:42.08	did the designed cyclic peptide work.
00:20:44.06	TJ Brunette,
00:20:46.23	Po-Ssu Huang,
00:20:48.18	and Fabio did the work on the repeat proteins.
00:20:53.03	And thank you for your attention.