What is protein folding? A brief explanation (rootsofprogress.org)
314 points by jasoncrawford on Dec 1, 2020 | 78 comments



Apologies if I spam this stuff too much but I'm so intrigued by the dynamic complexity of this whole thing.

The following three videos describe the process of DNA Transcription (copying a single gene's worth of RNA from the DNA strand), Translation (subject of TFA, creation of the protein from that RNA), and Replication (the process of making a complete copy of DNA for purposes of cell division).

Keep in mind that the functional components you see in here are themselves (generally) proteins that are created (and have evolved) by a similar process.

Transcription - https://www.youtube.com/watch?v=SMtWvDbfHLo

Translation - https://www.youtube.com/watch?v=TfYf_rPWUdY

Replication - https://www.youtube.com/watch?v=I9ArIJWYZHI


If you want to have your mind blown even more, remember that all of these molecules are packed into a super tight space (there is not a lot of room around them). Here is what a small bacterium would actually look like (an artist's rendition, of course):

https://cdn.rcsb.org/pdb101/goodsell/png-800/mycoplasma-myco...


Really need this in higher res!

My go-to mind-blowing visualization is this incredible video: https://vimeo.com/260291601/505c3bfec8


I watched the version with narration (https://vimeo.com/260291607/5bcaf19961), and was impressed by how many distinct mechanisms are at work here, different molecules specialized for taking advantage of and working with different parts of the (host) human cell. All of this in a structure measured in nanometers, that evolved via randomness and selection for survival.


Couple other things that are crazy to me:

- ATP molecules are floating around the cell like little rechargeable batteries. Any portions of these operations that require chemical energy to be input not only need to have the structural features required to support the reaction, they also need to have the structures required to allow ATP to power that reaction.

- How sensitive the body is to various compounds. For example, vitamin D is all the rage these days. In a typical adult, there are clinically relevant distinctions between a dose of 10 micrograms and 50 micrograms. For a 100 kg person, that's less than one part in a billion by mass.

- I just learned today that the polar nature of water causes it to differentially conform to various charge domains in the surface topology of proteins and that this provides essential functions for their operation - https://www.pnas.org/content/116/39/19274
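To put a number on the vitamin D point above, here is a quick back-of-the-envelope sketch. The function name is made up for illustration, and it simply compares the dose mass to the body mass, which is only a rough proxy for how concentrated the compound ends up anywhere in the body.

```python
def mass_fraction_ppb(dose_micrograms: float, body_mass_kg: float) -> float:
    """Dose mass divided by body mass, expressed in parts per billion."""
    dose_grams = dose_micrograms * 1e-6
    body_grams = body_mass_kg * 1e3
    return dose_grams / body_grams * 1e9


for dose in (10, 50):
    # Doses and the 100 kg body mass are the figures quoted in the comment.
    print(f"{dose} micrograms in 100 kg: {mass_fraction_ppb(dose, 100):.2f} ppb")
```

Both doses come out below one part per billion, so the quoted figure is the right order of magnitude.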


Great video, but with no commentary and very few captions, it's nearly impossible to understand what's going on.

Is there a version with commentary or a richer description?


From the Vimeo description:

https://vimeo.com/260291607/5bcaf19961

Looks like there are probably more as well. This is amazing and ultra creepy at the same time.

Edit: There is so much detail in this video that is missed by the narration. For example, if you watch after the ‘budding’ of the replicated viral envelope, there is an enzyme that disassembles the polymerized teal dome that assisted with the fission of the membrane.

I did a little looking and haven’t found what it is. The teal protein is called ESCRT-III, tons of papers on its analysis and behaviors. Here’s one if you don’t feel like googling.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3070458/


Found it! AAA ATPase Vps4

"Vps4 disassembles an ESCRT-III filament by global unfolding and processive translocation"

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4456219/


Thanks for this. It was as creepy as it was amazing.


I feel like I have just watched a movie like Inception: a well-coordinated plan of attack is executed. Those attacks and defenses are occurring in our bodies. Why have we evolved to be unconscious of this battle? If we were conscious of our bodies being attacked, we could deliver more intelligent strategies.


"Keep in mind that the functional components you see here are themselves (generally) proteins ..."

The word "generally" means there are exceptions.

One from memory: The ribosome is a functional component made up of ribosomal RNA and ribosomal proteins. As I recall, most of the ribosome consists of rRNA.


How far are we from simulating these interactions? Does the recent breakthrough in protein folding provide any indication?


That was an awesome explanation of this concept. I wish more complex concepts were explained with so much clarity and concision.


Could someone ELI5 why proteins fold in a unique way? Why can't there be multiple "valid" ways for them to fold (ignoring mirror symmetries)?


There can be multiple valid folds. The most famous examples are prions, which are a different fold of a protein than the native one.

The explanation why proteins typically only adopt one fold is not in the process of folding itself. But misfolding is really problematic for the cell, and there are various mechanisms to manage it. A misfolded protein is at best useless, and as you can see with prions can also be actively harmful. The cell has chaperones that help proteins fold, and it has quality control mechanisms that try to remove misfolded proteins before they are released into the cell.

But one major factor is evolution, proteins that fold reliably are selected over proteins that can fold into unproductive or harmful conformations. So the proteins we tend to look at already have been selected for their folding properties to some extent.


[Levinthal's paradox](https://en.wikipedia.org/wiki/Levinthal's_paradox) was the thought experiment that "proved" that protein folding is non-random. This is currently the dominant attitude in the field.

As others have stated, misfoldings do occur, and the protein is trying to achieve its lowest energy conformation, but the simplest answer is that we don't have a complete answer, hence the desire to come up with a reliable computational solution. We know how certain residues affect tertiary structure, and we know the measurements of secondary structures with a surprising degree of accuracy, but for a protein with potentially hundreds of residues, there are too many degrees of freedom to simply tack on amino acids and come up with a structure (although this is essentially the approach that Markov chain Monte Carlo solutions use).
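For a sense of scale, here is a rough sketch of the usual Levinthal-style estimate. The numbers are assumptions taken from the common textbook framing rather than from this thread: a 100-residue chain, two rotatable backbone angles per peptide bond, three stable states per angle, and an optimistic 10^13 conformations sampled per second.

```python
RESIDUES = 100                 # assumed chain length
ANGLES_PER_BOND = 2            # phi and psi
STATES_PER_ANGLE = 3           # assumed number of stable states per angle
SAMPLES_PER_SECOND = 1e13      # assumed (very optimistic) sampling rate
SECONDS_PER_YEAR = 365.25 * 24 * 3600

peptide_bonds = RESIDUES - 1
conformations = STATES_PER_ANGLE ** (ANGLES_PER_BOND * peptide_bonds)
years_to_enumerate = conformations / SAMPLES_PER_SECOND / SECONDS_PER_YEAR

print(f"conformations: about 10^{len(str(conformations)) - 1}")
print(f"years to try them all: about {years_to_enumerate:.1e}")
```

Under those assumptions, exhaustive search would take vastly longer than the age of the universe, which is the point of the paradox: real proteins cannot be finding their fold by trying everything.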

[This article gives a pretty good synopsis of where computational approaches to the protein folding problem stand.](https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp...)


Markdown doesn't work on HN, so that link got corrupted. Fixed version for the lazy: https://en.wikipedia.org/wiki/Levinthal's_paradox


Haha I see that now, I've been on forums and reddit too long.


> Levinthal's paradox

What I was getting at here is that the answer to the question "does every protein have a default native conformation" is "yes."


Well, there are. Actually, misfolded proteins are implicated in some diseases.

I'm not sure if every possible sequence of amino acids has a unique dominant folding. But ones that don't, wouldn't be nearly as useful biologically, because you couldn't rely on them to do their jobs. So they would not be selected for. The ones that actually get coded for by genes fold up more consistently.


> because you couldn't rely on them to do their jobs. So they would not be selected for.

Wouldn't a 20% chance of a novel, useful interaction and 80% chance of no effect be almost as good if you are producing a lot of them?


Bacteria produce a lot more mutations in stressful environments because it is evolutionarily advantageous. In other words, if your situation is killing you, playing Russian roulette with your genes is a better bet than accepting certain doom. Maybe you produce a deadly mutation that kills you and maybe you produce a mutation that helps you survive certain doom.

My recollection is that snails can reproduce either sexually or asexually and they preferentially reproduce sexually in stressful environments and asexually in environments that make staying the same more advantageous.

Sickle Cell is protective against malaria. Sickle Cell trait is protective without causing Sickle Cell Anemia, which is a horrible condition. So one copy of the mutation and you are more likely to survive in an area where malaria is prevalent and two copies and you are jacked up, but maybe less jacked up than with malaria.

Some studies suggest that Cystic Fibrosis is a predominantly Caucasian disorder because having one copy of the gene is protective against certain diseases that were sweeping through Europe at one time. Two copies tend to kill people gruesomely at young ages.

So I think generally speaking the answer is that species seem to seek mutations when what they are doing currently isn't working and seek stability when what they are doing currently is working.

Also I have read that it is believed that half or more of all human pregnancies probably end in the first two weeks and result in a heavier-than-normal period without the woman even realizing she was ever pregnant in most cases because those fetuses are simply not viable. Laying bets on "Will this novel mutation or novel combination work?" tends to get a result of "Nope. It so doesn't work, it's not worth investing precious resources in to bring the baby to term and let it be born."

We mutate more when it is "mutate or die" and less when mutating is the thing more likely to kill you.


There are a few things in here that are fun to think about.

In the situation you describe, the answer is often "yes" and that's how you end up with evolution. Let's say you have a gene that makes a protein that digests glucose. And then one day, your cell messed up when replicating and accidentally made an extra copy of that gene. Well now you have an extra copy of that gene that isn't under purifying selection. It's redundant. It can mutate but as long as you have the first copy, you're ok. And eventually it mutates away from being good at digesting glucose. It can do it a little bit, but it's not great. Maybe it's 20% as effective as it originally was. But you have another gene that's still 100% effective so you don't even notice.

Now we have a protein that really doesn't do anything bad... It just doesn't do much good either. And since it isn't subject to purifying selection, every round of replication it keeps mutating. Until all of a sudden, it mutates into something that can digest lactose. Now, you have an evolutionary advantage from a protein that first had to get bad at binding glucose, before it could benefit you. But evolution has no foresight, so it didn't know. So it took getting rid of purifying selection to make it happen. But now as you come to rely on lactose, that protein will wind up back under purifying selection and become "fixed".

So now let's consider an alternative situation. You only have one gene that can digest glucose. If it mutates to be 20% effective, you will at best grow only 20% as fast as your competitors. Maybe you even die and become an evolutionary dead end. In that case, it is an extreme disadvantage to have a protein that can't do its job reliably, and organisms that don't have a malfunctioning variant will grow better and pass on their genetic material to more offspring, until you are eventually outcompeted and go extinct.

We can also imagine another scenario. Your glucose-digesting protein mutates into something that can still bind glucose, but can't digest it. Then the glucose remains stuck to the protein, producing no energy, and becomes a waste for the cell. That is actively harmful and will likely kill the cell very quickly.

So to answer your question: it depends


Someone will probably need to correct me, but here's how I understand it.

Proteins naturally fold into a shape where they have the lowest "potential energy". There are several useful metaphors to explain what "lowest potential energy" means and why the proteins are attracted to the shape with the lowest potential energy.

In "the real world", an object's "altitude" is a form of potential energy. A ball on a hill will roll down hill until it settles into the lowest valley it can — the place where its potential energy is lowest. Balls roll downhill to the place of lowest potential energy, and proteins fold into the shape in which they contain the lowest potential energy.

You can also think of a fresh protein as a stretched out spring. The stresses in the spring from being stretched out of shape are a form of potential energy. The spring will contract until it is completely relaxed so there is the minimum amount of "springy" potential energy remaining. Springs contract into the shape that is most "relaxed" and has the lowest potential energy, and proteins fold into a shape having the lowest potential energy.

If the protein was to fold into any other shape, there would still be some potential energy left in the protein that could be relieved if only the protein could get itself folded into the "correct" shape. If the rolling ball gets stuck on a rock or the spring gets snagged and can't completely relax, both objects would be stuck with a higher potential energy than they would if the ball reached the bottom of the hill or if the spring were allowed to fully relax.
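The ball-on-a-hill picture can be sketched in a few lines. This is a toy, not a protein model: the one-dimensional energy curve below is invented for illustration, with a deep valley near x = -1.30 and a shallower one near x = 1.13, and the "ball" just follows the slope downhill in small steps.

```python
def energy(x: float) -> float:
    """Made-up landscape with one deep valley and one shallow valley."""
    return x**4 - 3 * x**2 + x


def slope(x: float) -> float:
    """Derivative of energy(x)."""
    return 4 * x**3 - 6 * x + 1


def roll_downhill(x: float, step: float = 0.01, iterations: int = 5000) -> float:
    """Follow the slope downhill until the ball settles."""
    for _ in range(iterations):
        x -= step * slope(x)
    return x


from_left = roll_downhill(-2.0)   # settles in the deep valley
from_right = roll_downhill(2.0)   # gets "stuck" in the shallow valley
print(from_left, energy(from_left))
print(from_right, energy(from_right))
```

Starting on the right, the ball ends up in the shallower valley with energy left over, which is the "stuck on a rock" case: a stable shape that is not the lowest-energy one.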

Hopefully this explains why there is a single shape that proteins are most attracted to when folding. But, it doesn't explain why other shapes are somehow "invalid".

Proteins are like pieces of cellular or chemical "machinery". Like the parts of a mechanical machine, the protein's shape is part of what defines how the protein works, what it can "do", and how it fits together with other pieces of the cellular machine. And, since "correctly" folded proteins always have the same shape, "machines" can be built with them.

When proteins are misfolded, they have a different shape from the shape that all of the other machinery expects. Like a gear without teeth cut into it, the misfolded protein doesn't perform the function that it, as part of a cellular machine, is supposed to perform. The protein might "jam" the machine up or even cause the machine to malfunction and start doing something completely unintended.

I hope this comment is correct enough and clear enough for an ELI5 — though it might be more of an ELI15.


To expand on that (great explanation btw) and to get at something the original question was asking:

A protein can have many different, stable conformations. Those conformations depend on the chemical environment, and any interactions the protein is making. Basically they alter the lowest energy to be a different arrangement.

However, basic elements of the fold, with a few exceptions, will never change. We call these secondary structure elements, and they are limited by the phi and psi dihedral angles of the backbone (the rotations about the N-Cα and Cα-C bonds on either side of each peptide bond). These secondary structure elements are thought to form before the protein is even fully synthesized, and are extremely difficult to undo. However, the spatial relationship between these elements is much more dynamic depending on what the protein is doing.

The ELI5 version is basically that proteins will have a basic shape, and they can wiggle around that shape, but can't really radically change because it would take too much energy.


The article terms them self-assembling nanomachines. It might be more helpful to think of them as tools with a specific purpose.

So if you have, for example, cystic fibrosis, you have a defect in a cell channel called the Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) and its job is to handle traffic of specific molecules into and out of the cell.

So your question is a little like asking "Why can't the cell just spit out random tools that randomly do different things?" And the answer is that it's not helpful to the cell if wrenches sometimes randomly morph into hammers or screw drivers when you ordered X number of wrenches because DNA is the blueprint instructions for creating this tiny little factory of life called a cell.

It's coded to create wrenches and it's coded to create a specific number of wrenches and it's not necessarily more useful for misfolds to turn into random other tools instead of the pile of junk that misfolds create.


> Proteins are long chains of amino acids. Your DNA encodes these sequences, and RNA helps manufacture proteins according to this genetic blueprint.

What does it mean when it says “Your DNA encodes these sequences”?


Maybe I can explain this in computer terms.

Think of DNA as read-only. It's a series of molecules that represent data. Cells read the topology and molecular interactions of those molecules and determine information from them.

Now, because the DNA is read-only, the permissions are super restrictive. Which means if you want to access that data in the DNA, you need to go through an intermediate, which, when it leaves the restricted area, is ok if it gets destroyed, because that original copy is still intact. That's where RNA comes in. RNA is designed to be mutable and temporary. I think of it like system memory. Reboot the system, and you lose it all. Want to protect it? Write it to the disk (DNA). Similarly, you don't mess with files right on the disk - first you read them to memory.

So, RNA is basically DNA that's been read to memory, and can now be messed with. You can do something like execute it, which would be analogous to translating it to proteins. The process is similar. While 0s and 1s might translate to microcode calls (i.e. physical action on part of the computer sending electrons around), RNA translates to amino acids, the building blocks of proteins, which are the physical components of cells, and do things like...well...move electrons around, among other things.

The way this works is that while 1010010101110 (I have no idea) might be some microcode call like OR or NOR, RNA bases (which are derived from the transcribed DNA) might say things like AUG, which tells the cell "OK go get the Methionine amino acid" or UUU which means "ok go get the phenylalanine amino acid". Chain enough amino acids together, and you get a protein, which essentially is a function in the cell. For instance, a protein of a specific sequence of amino acids might go fight off viruses. A protein of a different, but still specific sequence, might go produce energy.


When RNA is used to create amino acids, is only a specific segment read (if so, by what mechanism?), or does a cell create all the amino acids defined by RNA in one go?


The RNA is only a recipe for how to string amino acids together; the amino acids themselves already float around in the cell (they are small, simple molecules, like bricks for a house, to confusingly change the metaphor). Only a piece of DNA is copied at a time into RNA, like you would load only one program into memory and not the entire hard drive. The mechanism for where to start and stop is fairly simple in simpler organisms: you look for a start and termination sequence ("codon") (like ASCII control characters). In more complex organisms like multicellulars, this is more complicated. There are also other types of RNA that do other things.


So RNA doesn't create amino acids - they already need to exist (and there's a lot of metabolism that explains where they come from). What the RNA does is tell a piece of cellular machinery to combine the amino acids in a specific order to make a protein.

Depending on the organism, there are a few ways to do this. If you have a simple organism, like an archaeal cell, here's how it goes:

The DNA is all one giant circle. That circle has maybe 3 million bases, or 3 megabits. Those three megabits maybe encode 4000 or so distinct protein-coding genes. There's also some space between genes. Even though it's a circle, it is directional because the molecules connect in a specific way (5' OH-3'OH but that's just details).

That DNA is sequences of four bases, which we represent by the first letter of their human names - A for adenine, T for thymine, G for guanine, and C for cytosine.

If we wanted to denote a stretch of DNA that goes Adenine-adenine-guanine-adenine, we'd write AAGA. The cell obviously has no idea that's what it is, but it can read the topology that stretch of DNA would make. i.e., the cell recognizes the shape of that specific sequence of DNA.

OK, so we have one giant "file" with 3 million A, T, G, and Cs in there. Inside that file, there's roughly 4 thousand functions (proteins) that we can check. And just like humans, cells have developed syntax.

To get just the RNA we need, i.e. to call a function, the cells look for the following DNA sequences (they all vary a little depending on what organism):

The BRE: CCCTCC. A specific protein called "Transcription Factor B" recognizes this and grabs onto it.

The TATA box: TTAAAATTA. A specific protein called TATA-binding protein will bind this.

The BRE and TATA box are a little bit in front of each gene. So they appear some ~4k in the genome, before each protein-coding gene. (this is simplified but you get the idea). The job of those spots is to be bound by their partner proteins, and those partner proteins then will instruct the cell to copy the corresponding gene (usually right next to them) into RNA. Once you get to the end of the gene, there's a terminator sequence[0] which the proteins that are copying the DNA into RNA get blocked by, and can go no further. So that's how you get just the RNA for your gene.
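As a very rough sketch of the "syntax" idea above, here is a plain string search for a BRE followed closely by a TATA box. The two motifs are the ones quoted in the comment; the toy sequence and the `max_gap` spacing limit are made up for illustration, and real promoter recognition is done by proteins tolerating sequence variation, not by exact matching.

```python
BRE = "CCCTCC"           # motif quoted above
TATA_BOX = "TTAAAATTA"   # motif quoted above


def find_promoters(dna: str, max_gap: int = 10) -> list[tuple[int, int]]:
    """Return (BRE position, TATA box position) pairs where the TATA box
    starts within max_gap bases after the end of the BRE (assumed spacing)."""
    hits = []
    bre_at = dna.find(BRE)
    while bre_at != -1:
        bre_end = bre_at + len(BRE)
        tata_at = dna.find(TATA_BOX, bre_end, bre_end + max_gap + len(TATA_BOX))
        if tata_at != -1:
            hits.append((bre_at, tata_at))
        bre_at = dna.find(BRE, bre_at + 1)
    return hits


# Hypothetical stretch of DNA: filler, BRE, two spacer bases, TATA box, filler.
toy_dna = "GATTACA" + BRE + "AG" + TATA_BOX + "GGCATGCAA"
print(find_promoters(toy_dna))
```

Each hit marks a spot where, in the comment's simplified picture, the partner proteins would bind and start copying the gene next to it.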

Now - the RNA can go to the ribosome to be "translated" into the protein. This is where the triplets come in. The first triplet is usually AUG, which codes for the amino acid Methionine. That's called the "start" codon. However, imagine our RNA sequence looks like this:

AUG CAA ACC AUA CAA GUA UCC AAA ACG GAG CUG AAG UCC CUC GCU

That should code for the following protein, where each letter indicates an amino acid:

MQTIQVSKTELKSLA

But we have a problem. How on earth do we make sure that the ribosome starts at AUG? What if it instead started at UG C, essentially "slipping" by one base? Then, our protein would be totally different, and would look like this:

CKPYKYPKRS.SPS

So organisms also transcribe, at the beginning of the gene, a "ribosome binding site", or "Shine-Dalgarno/Kozak" sequence. That sequence looks like this:

AGGAGG

So your whole gene now looks like this in RNA form:

AGG AGG AUG CAA ACC AUA CAA GUA UCC AAA ACG GAG CUG AAG UCC CUC GCU

The AGG AGG makes sure that you bind the RNA in the right spot, and start reading from the "start" codon (AUG) until you reach the end of the transcript.
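The reading-frame problem above is easy to reproduce in code. This is a minimal sketch using the standard genetic code. The function names are made up, stop codons are written as "." to match the example above, and, like the example, it reads to the end of the transcript rather than halting at a stop codon. The ribosome binding site handling is also simplified to "start at the first AUG after AGGAGG".

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; "." marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
RBS = "AGGAGG"  # ribosome binding site quoted above


def translate(rna: str, frame: int = 0) -> str:
    """Translate every complete triplet starting at offset `frame`."""
    rna = rna.replace(" ", "")
    codons = (rna[i:i + 3] for i in range(frame, len(rna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def translate_from_rbs(rna: str) -> str:
    """Find the binding site, then read from the first AUG after it."""
    rna = rna.replace(" ", "")
    site = rna.find(RBS)
    start = rna.find("AUG", site + len(RBS)) if site != -1 else -1
    if start == -1:
        raise ValueError("no binding site followed by a start codon")
    return translate(rna[start:])


gene = "AUG CAA ACC AUA CAA GUA UCC AAA ACG GAG CUG AAG UCC CUC GCU"
print(translate(gene))                           # MQTIQVSKTELKSLA
print(translate(gene, frame=1))                  # CKPYKYPKRS.SPS
print(translate_from_rbs("AGG AGG " + gene))     # MQTIQVSKTELKSLA
```

Shifting the frame by a single base gives a completely different chain with a stop in the middle, which is why anchoring the start position matters so much.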

[0] The terminator sequence is tough to conceptualize, and not every gene uses it, but essentially it's a bunch of self-complementary bases that, when the DNA is unwound for transcription to RNA, knot back up on themselves so the proteins that are converting DNA to RNA can't go any further. See the image on this page:

https://parts.igem.org/Terminators


> That circle has maybe 3 million bases, or 3 megabits.

Nitpick, but isn't 3 million bases equivalent to 6 megabits, since each base has four possibilities, thus representing 2 bits of data?
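The arithmetic behind that nitpick, as a small sketch. It treats each base as an independent choice among four equally likely letters, which is the usual upper bound on information content; real genomes are somewhat compressible.

```python
import math

BASES_IN_GENOME = 3_000_000          # figure quoted above
bits_per_base = math.log2(4)         # four possible letters per position
total_bits = BASES_IN_GENOME * bits_per_base

print(f"{bits_per_base:.0f} bits per base")
print(f"{total_bits / 1e6:.0f} megabits, or {total_bits / 8 / 1e3:.0f} kilobytes")
```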


Yes, I'd agree with that


Wow!!! Really good explanation. Thank you so much


So I can look at them like this: DNA (source, compiled), RNA (IR/AST), protein (executable program)?


RNA is definitely not like an "abstract syntax tree" or "intermediate representation". It is the final machine code that is executed by ribosomes to build proteins. Proteins aren't really "executable programs", they are either passive building blocks or simple tools that do the same thing over and over again.

There are "programs" implemented by proteins and other components at a higher level (even things like clocks), but trying to include everything in one metaphor will stretch it to the breaking point.


Nice one. Thanks!


DNA is like the source code for protein. That's its function. Specific triplets of bases along the chain correspond to specific amino acids, so a DNA strand ultimately results in production of a specific protein molecule. In turn, protein molecules do most of the specialized work in a cell, including controlling the intake and expulsion of small non-protein molecules from the surrounding medium.

In turn, it's mainly the shape of protein molecules that makes them behave differently. And some proteins change shape under changing conditions within the cell, thus being able to function in a regulatory fashion.


Based on the animation above - basically a punch-card system.


Cool video that explains how RNA is used to make protein.

https://dnalc.cshl.edu/view/15501-Translation-RNA-to-protein...


Animated explanation - https://youtu.be/5MfSYnItYvg


I worked beside my wife on some protein folding problems. People often want to know why it is so expensive.

I usually describe it like this: someone gives you all the Lego pieces in a Lego box. The job of the folding algorithm is to re-create what's on the picture on the package without looking.

I think it's easy to see, even if you only had 2 Lego bricks, that you have quite a lot of options for putting the 2 bricks together. That grows quickly the more bricks you have.


The Playstation 3 used to have Folding@Home that users could opt-in to. I've always thought that the cycles that crypto currencies burn for mining could be much better used for something like performing protein folding calculations (or something else beneficial) and mine some coin in the process.


Algorithms used for proof-of-work have to produce solutions that are very efficient to verify (it needs to be much more efficient to verify than to compute, and it also just needs to be very efficient to verify overall because every potential block processed by each node has one), it needs to have a smoothly adjustable difficulty factor, and if the solutions are economically valuable to anyone, it weakens the value of it as a proof-of-work algorithm because it can cause mining to be a much less level playing field and cause double-spend attacks to be more profitable.
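To make the first two requirements concrete, here is a minimal sketch of a hash-based puzzle of the general kind Bitcoin-style chains use. It is heavily simplified, and the function names and the 8-byte nonce encoding are choices made for this illustration. Finding a valid nonce takes many hash evaluations, checking one takes a single hash, and `difficulty_bits` is the adjustable knob: each extra bit roughly doubles the expected work.

```python
import hashlib


def pow_hash(data: bytes, nonce: int) -> int:
    """SHA-256 of the data plus nonce, read as a 256-bit integer."""
    digest = hashlib.sha256(data + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")


def verify(data: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Cheap check: one hash, compared against the target threshold."""
    return pow_hash(data, nonce) < (1 << (256 - difficulty_bits))


def mine(data: bytes, difficulty_bits: int) -> int:
    """Expensive search: try nonces until one meets the target."""
    nonce = 0
    while not verify(data, nonce, difficulty_bits):
        nonce += 1
    return nonce


block = b"example block"
nonce = mine(block, difficulty_bits=12)   # about 4096 hashes on average
print(nonce, verify(block, nonce, difficulty_bits=12))
```

A protein folding result has neither property in any obvious way: checking a proposed structure is not a single cheap operation, and there is no clean dial for making the problem exactly twice as hard.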

Imagine a university gets a huge grant to run protein folding calculations, and the grant lets them do more than everyone else including miners: now they own >50% of the mining power of the cryptocurrency without making any investments intended toward mining and can now do 51% attacks at no added cost.

Or imagine someone with less than 50% of the mining power trying to do a double-spend attack: normally this is discouraged by the fact that double-spend attacks are probabilistic, and the attacker has to pay the full cost for doing the attack regardless of whether it's successful. But if the proof-of-works are valuable outside of their purpose within cryptocurrency as proof-of-works, then the attacker can recoup some of the costs of their attack by selling the results of their proof-of-works. Double-spend attacks will then have higher benefits and may be profitable to attempt now.

Additionally, proof-of-work systems used by a decentralized cryptocurrency can't rely on a centrally-produced dataset. If the Folding@Home group produced workloads for the cryptocurrency miners to solve, then the group could construct fake workloads that they already knew the solutions to, and then secretly insta-mine those workloads themselves to earn mining fees or to do double-spend attacks on the network.

I think it's better for cryptocurrencies to instead focus on developing solutions that just don't need proof-of-work mining to happen at all, like proof-of-stake.


"Proof of useful work" has been attempted for a few things, including this. None have really taken off. I don't know all the difficulties involved super well, but sibling commenter seems on point.

https://foldingcoin.net/ https://primecoin.io/


Many of the attempts to use folding as proof-of-work for cryptocurrency devolve to: have some central authority keep track of how much work everyone does, then hand out coins as appropriate. And if you have a central trusted authority that can issue currency arbitrarily by fiat, you have very literally just reinvented fiat currency, in which case you might as well just create a normal mysql db instead and call it a day.


I disagree. I think centralized issuance with decentralized transactions is useful, because maybe you trust a centralized authority not to abuse issuance, but do not trust it not to block transactions arbitrarily.


check Gridcoin.us


Quick question on AlphaFold: is it a one-time effort, like human genome sequencing?

If we 3D-map all 170 million sequences (using AlphaFold or whatever), we are done, right? We can just look them up in the library.

New protein sequences don't continually get created?


> Like human genome sequencing

Human genome sequencing was not a one time effort. The Human Genome Project was a one time push to establish the capabilities of human genome sequencing, but the field has seen a lot of refinement since then, to apply it both to individuals and to construct a better human reference genome.

> New protein sequences don't continually get created?

As others have said, there are new proteins being engineered by humans. Apart from that there are a lot of organisms that haven't been sequenced yet (not even counting niche organisms that haven't been discovered yet). Especially in the plant world, sequencing advancements with nanopore sequencing only recently made it economical (and possible) to sequence a lot of the plant genomes that are many times larger than the human one.

So Alphafold might be comparable to the Human Genome Project in specific ways.


From my understanding it was mostly useful for understanding new sequences in medical and genetic engineering; questions like can we modify this part of the protein and keep the overall structure the same.


Well, there's genetic engineering. We might want to create new proteins—enzymes for chemical reactions, for instance.


>> "DeepMind seems to be calling the protein folding problem solved, which strikes me as simplistic".

I have a couple of questions, which might be related

(1) Why did you call it out as "simplistic"?

(2) Will the DeepMind team have to throw the same amount of computer power at each structure they seek to solve?


(1) Simplistic because it isn't binary, solved vs. not solved. There's still improvement to do in the solution. And we won't really know if it's “solved” until we start trying to use these structures for practical purposes. Don't mean to detract from the accomplishment overall though, it seems tremendous.

(2) There is a certain amount of computing needed for each new protein (quite significant, I think, but not prohibitively expensive). And I'm sure this can be improved over time.


If a protein is a self assembling nano machine, why do we have to fold them manually in simulations?


That's a solid question.

We know the sequence of amino acids and what order they are in, but remember they are assembled one at a time inside the cell. And so the way they fold one at a time is different from simulating how the whole thing will fold end-to-end.

Another thing to be aware of is the environment in which it folds. The local pH can affect which -OH groups exist as OH or as O(minus), which affects the way it folds. Then there are local salt concentrations, and various proteins incorporate calcium, magnesium or other metals into the structure.

There are also post-translational modifications, where other enzymes come in and snip bits off, add sugars to proteins, catalyse S-S bonds, etc.
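The pH effect can be estimated with the Henderson-Hasselbalch relation, which gives the fraction of an acidic group that has lost its proton at a given pH. This is a sketch only: the pKa of about 10 for a tyrosine-like -OH is an approximate textbook value used as an assumption here, and the real pKa of a group inside a protein shifts with its local surroundings.

```python
def fraction_deprotonated(ph: float, pka: float) -> float:
    """Henderson-Hasselbalch: share of groups present as O(minus) rather than OH."""
    return 1.0 / (1.0 + 10 ** (pka - ph))


ASSUMED_PKA = 10.0  # rough value for a tyrosine-like -OH; an assumption

for ph in (7.0, 10.0, 12.0):
    share = fraction_deprotonated(ph, ASSUMED_PKA)
    print(f"pH {ph}: {share:.1%} deprotonated")
```

Moving the pH across the pKa flips the group from almost entirely OH to almost entirely O(minus), which changes the charges the chain has to accommodate as it folds.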

Finally, we want to know how they are folded because we want to be able to modify their behavior. If you know that an enzyme is key in a disease you might want to design a drug molecule that can fit into its active site, for that we need to know how the protein is folded.


Another way to determine how they are folded is to isolate the enzyme, crystallise it, and scatter X-rays through it. This is no mean feat; it takes a lot of time and a certain instinct (or brute force with a matrix of conditions) to pull off.

Even if you manage to crystallise the protein, you need to check that it still performs the same function, otherwise you won't be getting the active structure.

More recently NMR has been used, which gives information about the environment that the various nuclei occupy, and what their neighbors are. This has become more sophisticated over the last few years but only accounts for a small percentage of solved structures on the protein databank [0]

[0] https://www.rcsb.org/


Very well written overview of the problem and the solution approach for a non-biologist. Thank you!


Slightly tangential, but does anyone have recommendations for pop sci books on modern genetics / epigenetics? (Not specifically molecular cell biology/biochemistry).


I enjoyed The Gene by Siddhartha Mukherjee. Depending on what you're looking for it might be relatively light on the science side and includes autobiographical digressions, but overall gives a high-level overview of key developments in the field.

https://en.wikipedia.org/wiki/The_Gene:_An_Intimate_History


One thing the article didn't mention: what else in protein folding is untouched by AlphaFold? I've read somewhere that dynamic folding is still unsolved?


Essentially all of it. AlphaFold won't tell you anything about protein dynamics, side-chain positions, water interactions, charges, binding pockets or induced conformational changes. All of these things matter as much as or more than the structure of the backbone itself.

One way to think about it is this: AlphaFold gives you the rough shape of the skeleton, but a protein, just like a human, has a lot more going on than just the skeleton.


I think at least some of this can be solved for or inferred from the final shape once it is known.


Surprisingly little. Maybe things like binding pockets and broad classes of movement (hinges, etc.) can be inferred (sometimes). Otherwise, it's the other way around: the shape of the backbone is determined by the detailed interactions. Just think about alpha helices and beta sheets -- they're stabilized by hydrogen bonding patterns along the protein backbone, but the propensity of a particular residue to form a helix or a sheet is determined by interactions amongst the side-chain atoms in a particular context. The atomic interactions are the story, and the "fold" is really just the outline.

Also, of course, it's not just the atoms in the protein itself: add in water, ions, small molecules, other proteins, etc., and you have a real mess. A small molecule (say, a drug), can displace a water molecule, which has a cost, which causes something else to move a bit to compensate, stretching a bond, and so on and so forth, sometimes resulting in dramatic differences in the bound conformation of a protein. It's not uncommon -- it's essentially the norm. And it's not like there's just one conformation. The whole thing is really like a wiggly, jiggly, metastable mass, and the introduction of a slight perturbation is enough to send the blob spiraling into a new metastable state.

When you start to ask questions about how a protein moves over time, it becomes a much more complicated problem than predicting the overall conformation. Which is itself quite hard.


I think I read somewhere that the water molecules are thought to form semi-crystalline structures from protein surface charges that contribute to their function; you could imagine these forming channels or some other mechanism to orchestrate the liquid around the protein to aid its function (so it's not all random molecules bumping into the spots they need to go). I have no references for this though.


Ooh that’s wild, like a slippery elastic coating to guide interactions. If you’ve ever touched ferromagnetic fluid under the influence of a magnetic field, that’s how I’m imagining it.



Is there a simple Python example, with protein data and an algorithm to fold it?


This is a third-party reimplementation of last year's entry from DeepMind:

https://github.com/dellacortelab/prospr


I don't actually know. I'm just futzing around online while unable to get to sleep. But I googled "python play with protein folding" and came up with this:

https://stackoverflow.com/questions/18891472/monte-carlo-sim...
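For a flavour of what a toy version can look like, here is a minimal sketch of Metropolis Monte Carlo on the 2D HP lattice model, a classic teaching simplification in which each residue is either hydrophobic (H) or polar (P) and the chain sits on a square grid. It is not the code behind either link above and has nothing to do with how AlphaFold works; the names `coords`, `energy` and `fold`, the move set and the default parameters are all made up for illustration.

```python
import math
import random

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def coords(directions):
    """Lattice position of each residue, starting at the origin."""
    x, y = 0, 0
    path = [(x, y)]
    for d in directions:
        dx, dy = MOVES[d]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

def energy(sequence, path):
    """-1 for every pair of H residues that touch on the lattice
    without being neighbours along the chain."""
    index = {p: i for i, p in enumerate(path)}
    e = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        # Look right and up only, so each contact is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                e -= 1
    return e

def fold(sequence, steps=20000, temperature=0.5, seed=0):
    """Return (lowest energy seen, directions that produced it)."""
    rng = random.Random(seed)
    dirs = ["R"] * (len(sequence) - 1)  # start fully stretched out
    e = energy(sequence, coords(dirs))
    best = (e, list(dirs))
    for _ in range(steps):
        # Propose changing the direction of one randomly chosen bond.
        trial = list(dirs)
        trial[rng.randrange(len(trial))] = rng.choice("UDLR")
        trial_path = coords(trial)
        if len(set(trial_path)) < len(trial_path):
            continue  # chain would overlap itself, reject
        trial_e = energy(sequence, trial_path)
        # Metropolis rule: always accept downhill, sometimes accept uphill.
        if trial_e <= e or rng.random() < math.exp((e - trial_e) / temperature):
            dirs, e = trial, trial_e
            if e < best[0]:
                best = (e, list(dirs))
    return best

best_energy, best_dirs = fold("HPHPPHHPHPPHPHHPPHPH")
print(best_energy, "".join(best_dirs))
```

Even this cartoon shows the core difficulty: the number of self-avoiding shapes grows exponentially with chain length, so random search quickly stops finding the best fold.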


This is the TLDR from the article of the answer to the question "What is protein folding?":

If a protein is essentially a self-assembling nanomachine, then the main purpose of the amino acid sequence is to produce the unique shape, charge distribution, etc. that determines the protein’s function.

The article goes on to talk about other things, like "What if, instead of measuring a protein’s structure, we could predict it?" and that's cool, but it's sort of outside of the scope of the question posed in the title.

I only kind of skimmed that section because it's about this project that is trying to figure out how to predict what shapes proteins fold into. Nifty tech, but not something I care a whole lot about.

(Edit: If that tech interests you, there's also this on HN today: https://news.ycombinator.com/item?id=25253488 and I imagine that's why this was even posted -- because some people are probably wondering "What's a 101 explanation of protein folding?")

Not covered by the article:

Misfolded proteins are the crux of the problem with most genetic disorders. Your body produces this string of protein sequences and it fails to fold up into the unique shape that makes it a useful tool that does a specific job.

(I always think of some old cartoon where two kids are playing with "gender neutral spiffy spiffy high minded description" toy from Switzerland -- think LEGOS only blocks you string together and bend as you see fit -- and their two mothers are talking and then one of the kids folds the bendable blocks into the shape of a gun and says "Bang" and one of the moms is all "Never mind.")

Chemical derangement of the cell can interfere with protein folding. You have these little factories inside the cell that produce protein strings and as they come out, if the climate is chemically deranged (usually pH balance or salt imbalance are the culprits), then it won't fold. It just lies there, a stringy useless mess of protein.

When you cook an egg, the egg white turns white and solidifies due to proteins being denatured. Denatured means they got unfolded.

Sometimes when a protein is denatured -- unfolded -- it's reversible. Sometimes -- like with egg whites -- it's not.

So in some cases if you could figure out why the protein is failing to fold properly, the body can potentially re-use denatured proteins and turn them back into working nanomachines.

And sometimes it can't, and then it has to go to some kind of garbage chute in the cell. Some serious genetic disorders involve a failed garbage chute, so misfolds can't be effectively removed from the cell, which is a really huge problem. The "garbage chute" is something that basically digests the proteins.


You raise very good points about the role of protein folding and quality control in disease. But folding by itself is essentially a physicochemical process. A protein will fold (or its mutant will misfold) within a cell just as it would in a test tube.

Don't underestimate the role of structure prediction in human disease. There are proteins known to be involved in disorders where a structure has been impossible to determine by experimental methods (membrane proteins are a known example). And there are still plenty of barely characterized proteins where a structural model might shed light on their function.

> Your body produces this string of protein sequences

A small nitpick: you meant "string of amino acids". In other words, a protein sequence.

> Sometimes -- like with egg whites -- it's not.

Under the right conditions even proteins in coagulated egg white can be renatured.


TIL: Under the right conditions even proteins in coagulated egg white can be renatured.

> A small nitpick: you meant "string of amino acids". In other words, a protein sequence

Thank you. You are correct.

> Don't underestimate the role of structure prediction in human disease.

I'm not. I just don't really find the tech behind it fascinating. When they start coming out with articles saying "This project has successfully predicted X and this is clinically useful in this way," I'm sure I will read it on the edge of my seat like a gripping thriller.


It's going to take quite some time until this gets accepted at the same level as experimental data, if ever.


I honestly have no idea what your point is.


> When they start coming out with articles saying "This project has successfully predicted X and this is clinically useful in this way," I'm sure I will read it on the edge of my seat like a gripping thriller.

My point is that's a long way off. Like all data-driven science, this will be great for forming hypotheses. However, without a falsifiable hypothesis and empirical data, it is unlikely to ever reach the point where it's useful clinically.


Okay. I don't see why anyone would feel the need to say that. It seems to miss the fact that my point is that I am not dismissing the importance of this to disease. I'm merely not fascinated by the tech. I said that explicitly, so I would never guess that anyone would think I would need it explained that "This is a long way off."

Edit: The comment I replied to was edited after I began my reply, before I hit "submit." The comment I replied to initially only said "My point is that's a long way off."


thank you



