No, DeepMind has not solved protein folding (occamstypewriter.org)
369 points by jocker12 on Dec 4, 2020 | 154 comments



Jeez. Clickbait title, tons of fallacy, like goalpost moving.

Some context on myself: I have 15+ years of postgraduate chemistry/molecular bio/biophysics/biochemistry experience, then quit to go into tech, worked for AI hardware and now AI-driven software startups, mostly as backend (not implementing ML models, but I know how to do that and have built some small models for myself). I'm a pessimist about this AI innovation cycle's prospects, both in general AI and in drug discovery in particular.

Has protein folding been solved? Yes. As a practitioner, what I want to be able to do is pull a sequence that I've retrieved from DNA, drop it into a computer, and get the structure out. Intuitive insight can FOLLOW from those results. For example: In one of my projects, I was able to look at the structure (a homolog had been solved, so I just did a dumb alignment and threading), identify that it was acting as an NPN transistor (unpublished details), fix the electron flow through the enzyme, and improve the yield (https://link.springer.com/article/10.1186/1754-1611-7-17). Later, I looked at the structure of the enzyme, modified the surface charge in one particular part of it, and improved electron throughput again (https://www.mdpi.com/1422-0067/16/1/2020). This was with primitive "protein folding through homology" tools; now there's a good chance I could do these sorts of things with proteins I don't have a homologous structure for.

These are the sorts of things that protein folding enables. One more thought -- I bet that DeepMind can do things like make it obvious where there are certain posttranslation sites (like FeS clusters) or make it obvious where there is a cryptic phosphorylation or glycosylation site (sequence holdover from a previous mutant that no longer has its expected functionality) because it's been buried.

Will there be corner cases? Yes. Probably deepmind will have difficulty solving the fold of amyloid fibrils. Probably deepmind will have some difficulty with super-strange post-translational modifications (think things like GFP's core fluorophore), or if you design a drug where you splice in an unnatural or D-amino acid, or oddities like that.

> we are not at the point where this AI tool can be used for drug discovery

I disagree. Sure, it probably won't be able to find a small molecule binding site. It almost certainly won't be able to design a drug that has a long-range allosteric effect (think Gleevec's super strange mechanism of action). But, deepmind WILL be able to help design biologics that, for example, can interact with bump-hole mutations.

There was never an expectation that "protein folding" solves every problem in the drug discovery pipeline. That's out of scope for the basic problem.

As for this:

> AI methods rely on learning the rules of protein folding from existing protein structures.

Come on. There is no method that doesn't rely on learning the rules of protein folding from existing structures. Even de novo MD modeling has fudge factors to tweak (we could call them "dark biochemical fields" -- think: what is the expected dielectric constant around a tyrosine residue? No way we're calculating that from the Schrödinger equation) that are empirically derived to get your results.


Since this is getting a pretty high rank, maybe I should use it to pitch an interesting experiment to the DeepMind team (if they are reading this). Take a random amino acid sequence, maybe biasing the draw from the bag of letters toward a reasonable distribution, then pick sequences that AlphaFold2 thinks will collapse into a volume at all smaller than their expected free-chain radius; then start mutating those sequences, selecting for ones that decrease the chain radius. Sort of a GA descent on chain radius.
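
Roughly, in Python (alphafold2_predict and radius_of_gyration here are hypothetical placeholders for whatever interface the model ends up exposing; the GA loop itself is the point):

  import random

  AA = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

  def random_sequence(length=100):
      return "".join(random.choice(AA) for _ in range(length))

  def mutate(seq, n_mut=1):
      seq = list(seq)
      for _ in range(n_mut):
          seq[random.randrange(len(seq))] = random.choice(AA)
      return "".join(seq)

  def compactness(seq):
      # hypothetical calls: fold the sequence and measure the predicted
      # radius of gyration (smaller = more collapsed)
      structure = alphafold2_predict(seq)       # placeholder, not a real API
      return radius_of_gyration(structure)      # placeholder

  def ga_descent(pop_size=50, generations=100, seq_len=100):
      population = [random_sequence(seq_len) for _ in range(pop_size)]
      for _ in range(generations):
          ranked = sorted(population, key=compactness)
          survivors = ranked[: pop_size // 2]   # keep the most compact half
          children = [mutate(random.choice(survivors))
                      for _ in range(pop_size - len(survivors))]
          population = survivors + children
      return min(population, key=compactness)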

My questions are as follows:

1. Does AlphaFold ever converge on structures in this fashion at all? As in, is the heuristic able to identify partial folds, or is foldability somehow only encoded through evolutionary history?

2. If it does, are the structures that it identifies truly compact (measurable through sedimentation centrifugation) and solvable using conventional techniques? I suspect that if the first, then definitely the second, because of sample bias.

3. If it does identify compact structures, are they novel sequences or sequences homologous to existing structures? If the former, then AlphaFold is truly extrapolating. If the latter, then it's only interpolating.

It would be super cool if AlphaFold discovered a "new fold" (for those who are unfamiliar with the jargon, a fold is a family of homologous sequences that create defined structure, novelty is gauged by homology, not the end structure; you can have the same structure created by wildly different sequences). It would be even cooler if AlphaFold discovered a new motif (besides alpha helix, beta sheet, and the known turns/chains). But I think that is rather unlikely.

4. If you change your bag distribution by using a primitive aa set (delete histidines, tryptophans, enrich for leucines and serines) how do your search dynamics change?


Lots of cool questions! I'm very excited for code and training data to be made available (and in a perfect world the pre-trained model and hosted prediction as-a-service). Then folks can answer all of these questions themselves, without having to go through DeepMind.


Thank you for detailed and easy-to-understand explanations. This is why I come to HN.

As someone who used to work in an adjacent field, I have a question that I hope you can help me with. DL usually requires a large volume of data, but I don't know whether there is a huge pile of experimentally determined protein structures for training. Did DeepMind find a clever method to get around the data-size issue, or are there really a lot of known protein structures? I skimmed one of their earlier papers and it seems the training data was in the tens of thousands. I am surprised that is enough for training DL structure prediction.


PDBs are not huge files; they are a human-readable text format. Here is a rather large protein clocking in at kilobytes: https://files.rcsb.org/download/2RIK.pdb
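
To give a sense of how simple the format is: ATOM records are fixed-width text, so a few lines of Python (column offsets per the PDB spec) will pull the coordinates out:

  def parse_atoms(pdb_path):
      """Yield (atom_name, residue_name, x, y, z) for each ATOM record."""
      with open(pdb_path) as fh:
          for line in fh:
              if line.startswith("ATOM"):
                  yield (line[12:16].strip(),   # atom name
                         line[17:20].strip(),   # residue name
                         float(line[30:38]),    # x, in angstroms
                         float(line[38:46]),    # y
                         float(line[46:54]))    # z

  atoms = list(parse_atoms("2RIK.pdb"))
  print(len(atoms), "atoms")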

Also I am not in ML research, but IIRC the spooky/counterintuitive nature of high-dimensional gradient-based search is that you can get better results from less data by increasing the number of parameters, as long as you have a sane set of regularization techniques (or have we even ditched those too?).


> as long as you have a sane set of regularization techniques (or have we even ditched those too?).

Maybe. In the context of natural language at least, Transformers require less and less data to reach the same result as you increase the number of parameters. No regularization needed. See Figure 2 in the paper Scaling Laws for Neural Language Models (2001.08361).
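
For reference, the joint data/parameter scaling law they fit is, if I remember it right (fitted constants omitted), of the form

  L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}

where N is parameter count and D is dataset size in tokens; increasing N lowers the D needed to reach a given loss.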

It's quite odd. Who knows if that will hold for other domains, like protein folding. It may very well be the case though, since AFAIK DeepMind's folding model used attention to reach these landmark results.


In figure 2, those curves are all from one epoch each right? Which is to say the model has never seen the same data twice?

It's true you don't need regularization if you've never seen the same data twice, but that's a similar regularization to early stopping. You'd expect the larger number of parameters would make the training error drop faster with fewer tokens as well due to improved optimizability. But rather than "larger models need less data", I'd say the take away is more "larger models need fewer steps to optimize training error". None of the models get "good" until they've seen a number of tokens similar to the number of parameters.

Unless you've run many epochs on small data sets and seen the same results, in which case that's pretty weird/cool.


Figure 4 is for dataset size and seems to show the same thing.


Yeah, figure 4 is more clear. This is the early stopped loss though, so the regularization is more explicit. If you trained the large models to completion on small data sets they would do much worse on test error due to overfitting.

It is interesting that larger models with regularization (early stopping) seem to work better than training smaller models to convergence, though.


I remember reading somewhere that this was likely to be the consequence of how distance metrics work in high-dimensional manifolds.


My understanding, based only on having watched Yannic Kilcher's video, is that they had on the order of 10k training cases. So not a lot.

But they added a ton of highly domain specific features. And since it's essentially a natural phenomenon, I would expect the signal to be relatively strong.

I recommend the video, or any video from Yannic:

  https://youtu.be/B9PL__gVxLI


Be careful about who you trust. The author of the article is a professor of structural biology (the field of protein structure) at Imperial College London. The article is well written and the criticisms are fair and carefully stated. Don't come to HN to get scientific opinions. This site is pretty mediocre at that. There are exceptions of course.


sure but also keep in mind that a professor in structural biology is not necessarily a consumer of structural biology (versus me, plus I have citations proving that I have consumed them) ;)


"It is difficult to get a man to understand something, when his salary depends upon his not understanding it."


Talking about the professor or the commenter?


The AI that I currently help orchestrate is not in the sciences. I do plan to go back into biotech someday, but not in anything that requires this in a major way (the intended disruption is socioeconomic, not technological)


Pretty sure Prof. Stephen Curry could crush most of his critics in 1-on-1 basketball as well.


I think they said they used 140k structures as their training data.


"Jeez. Clickbait title, tons of fallacy, like goalpost moving."

Pretty much everything in the post is true, the title isn't clickbait, and I didn't see any fallacy. It's only "goalpost moving" if you confused the location of the goalposts in the first place. For the record: I have worked on protein folding, I did a PhD in a closely related area, and I work in computational structural biology professionally now.

The OP is saying that the general problem isn't solved, and provided (IMO factual, true) explanations why there's still a long way to go. Arguing that it's OK for some uses (which is what you're doing) is true, but not a rebuttal.

Maybe it's possible to use this method to come up with models "good enough" for rigid-body docking, or to get the overall idea of a fold. But it's also true that these structures aren't going to be good enough for any kind of drug discovery, which is what TFA says.


I did my PhD at what was, at the time, one of the world's foremost structural biology programs (TSRI). I've solved loose structures of metastable peptides by NMR (by hand -- https://pubs.acs.org/doi/abs/10.1021/bi800828u; those assignments were painstakingly conducted by yours truly with assistance from Jane Dyson). I've done thousands of crystal drops with insulin to try to cocrystallize it (and failed; the x-ray structure I got did not have the cocrystal partner). I provided the critical insight for the prep that enabled the EM structure solved by the postdoc at the bench next to me: https://www.nature.com/articles/nature04339 . I'm pretty familiar with many aspects of the world of structural biology, but more importantly it's never been my primary focus - I've always been a consumer of the results first, a practitioner of it second.

It is my judgement that if you presented these results in 2003 anyone would have said "the protein folding problem has been solved". Nobody then expected "the solution to protein folding" to solve every single protein, and it's silly to now. If you're complaining about the 33% error rate, you don't know enough about ML to realize that that gap is going to get closed, and rapidly.

My argument is that it is "good enough for the uses that are in the domain of protein folding". There are a lot of things that are adjacent to protein folding (enzyme pocket analysis, e.g.), but they have never been considered protein folding per se.


So you're a structural biologist. Good for you? I'm not sure how it affects the substance of the argument you're making. I'm not trying to argue from authority here -- I'm pointing out that I have direct experience in this specific area, just so that you know I'm not some random commenter.

"It is my judgement that if you presented these results in 2003 anyone would have said "the protein folding problem has been solved"."

Your judgment is your own. At the very least, your use of the word "anyone" should give you pause.

"Nobody then expected "the solution to protein folding" to solve every single protein, and it's silly to now."

Not the argument I'm making.

"If you're complaining about the 33% error rate, you don't know enough about ML to realize that that gap is going to get closed, and rapidly."

Not the argument I'm making.

"My argument is that it is "good enough for the uses that are in the domain of protein folding"."

I have no idea what that means. Uses "in the domain of protein folding" are equivalent to uses "in the domain of structural biology". So obviously, it depends on the quality of the structure being generated.

"There are a lot of things that are adjacent to protein folding (enzyme pocket analysis .e.g), but they never have been considered to protein folding per se."

You seem to have some internal dialogue where there is a field called "protein folding", where there is an unambiguous threshold for success that has now been crossed. There is no such field, and there is no such threshold.

Are these structures good enough to do some kinds of things, like structural genomics or docking? I dunno...maybe? Are they good enough to do structure-based drug design? Not from what I've seen. TFA seems to be saying the same thing, and your comment was that TFA is just wildly wrong, which seemed...unfair, at best.


> FTA: For one thing, only two-thirds of DeepMind’s solutions were comparable to the experimentally determined structure of the protein.

This seems to be the core of the article's arguments. Is this ratio high enough to claim the problem is "solved"?


If your goal is drug discovery by checking out relevant protein structures, then you'll be lucky with useful structure information ~2/3 of the time (perhaps less, if you are depending on interactions of more than one protein).

If it is possible to accelerate drug discovery by looking at computed protein structures rather than experimentally determined ones, it has now become plausible to do so. It has crossed a tipping point.

Since I'm not aware that there has been any drug discovery that has strongly depended on computational protein folding, this possible benefit remains to be seen.

[I worked in a lab which was heavily involved in CASP, the protein folding competition, but I was not myself involved]


Algorithms only get better. Did Google "solve" computer Go when AlphaGo beat Lee Sedol, when Master beat Ke Jie, or when AlphaZero learned Go from scratch?

Even though Google only won 4/5 games against Lee Sedol, that was the moment that sticks in people's minds as when Google solved the problem of computer Go. What happened afterwards was a process of perfecting the algorithm. It seems likely the next few years will bring the same for AlphaFold.


But in this case AlphaFold is only marginally better than other algorithms. Since we are comparing computer algorithms (and not a human to an algorithm), it'd be like saying Usain Bolt "solved" the 100m dash when he set a new world record. Is it solved though, without knowing how fast those other algorithms are advancing?


have you seen the performance graph? it went from 40% (other teams) to 60% (DeepMind 2 years ago) to 92% now.


here is a nice case where alphafold did way better than others

https://twitter.com/TassosPerrakis/status/133353559400213299...

Tassos is also a professor in structural biology. Discovering structures the experimental way.

EDIT: on the other hand he also agrees that it's not the end of experimental structural biology

https://twitter.com/TassosPerrakis/status/133402467831107993...


No it’s not. AlphaFold is a huge advancement on all the other solutions. Solutions that were based on the already massive improvement that DeepMind brought in the previous edition.


Could you shed some light on the 66% part, too? As a layman and general sceptic of hype topics, this one is the biggest obstacle for me. How are we going to proceed with models that are 33% wrong? In other engineering areas, uncertainty is usually covered by the modeler. But how would that work here?


who knows. You can't generalize 1 error out of 3 to 66%. Statistics of small numbers. Knowing how ML works, I'm surprised it didn't give any indication of low confidence.

Not to give DeepMind too much credit, but it's also entirely possible that "both are correct". By the nature of how targets are selected for CASP, year over year there is higher likelihood that "pathologically difficult" proteins are presented for the competition, for example - a protein that exhibits polymorphic structure where the technique (say Xtal vs NMR) biases the "solved" fold in a huge way. I believe CASP tries to weed out proteins that we know are polymorphic, but you can never be sure, and again, as time marches on, those are the proteins that fundamentally are harder to solve, so it's likely they will be enriched in the test pool.

An extreme example is insulin. The structure of insulin has been solved for 60 years (I think?), but when it's in the environment of its receptor, it looks TOTALLY different (solved 5? years ago). Having said that, it's doubtful that DeepMind could ascertain that structure, since the environment is super-super different.

I think that the modestly high error rate is an indication that deepmind is mostly interpolating and has solved the broad protein folding heuristic. It probably will get better.


> You can't generalize 1 error out of 3 to 66%

Right, but we also can't generalize from 2 out of 3, and without knowing this figure it's really hard to say how useful this is, no?

> Knowing how ML works, I'm surprised it didn't give any indication of low confidence.

This is actually a hard problem in ML, for example in NLP, many people assume a high log prob score means a high confidence but it is not true at all.


> for example in NLP, many people assume a high log prob score means a high confidence but it is not true at all

Thank you for the clarification. I wasn't aware of this, as I'm most familiar with super-basic/old NLP techniques like BOW/RNN/LSTM/GRU, where log prob scores seemed to me to be roughly correlated with result quality. I'm aware the landscape has changed recently with insights about high-dimensional searches...


There is no difference in the log prob between LSTMs and transformers. It just depends on the number of updates you make (batch size, epochs); if you train it to overfit, then your log probs will be pushed closer to 1 or 0.

But after training you can recalibrate the temperature of the softmax on the test set and still get meaningful confidence scores (temperature calibration). Or you can use a variation of cross entropy called Focal Loss that will leave your logits un-squashed.
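
Temperature calibration is just a one-parameter fit after the fact; a minimal PyTorch-style sketch (assuming you've kept the logits and labels from a held-out calibration split):

  import torch

  def fit_temperature(logits, labels, steps=200, lr=0.01):
      """Fit a single scalar T so that softmax(logits / T) is better calibrated.
      logits: (N, num_classes) tensor, labels: (N,) tensor of class indices."""
      log_t = torch.zeros(1, requires_grad=True)    # optimize log T so T stays positive
      opt = torch.optim.Adam([log_t], lr=lr)
      nll = torch.nn.CrossEntropyLoss()
      for _ in range(steps):
          opt.zero_grad()
          loss = nll(logits / log_t.exp(), labels)
          loss.backward()
          opt.step()
      return log_t.exp().item()

  # usage: T = fit_temperature(val_logits, val_labels)
  #        calibrated_probs = torch.softmax(test_logits / T, dim=-1)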


An interesting paper on this subject from just last week if you want to know a bit more:

http://phontron.com/paper/jiang20lmcalibration.pdf


I’m not a specialist in the biology domain, but more generally, for some models 66% accuracy is amazing. For example, a model predicting the next word with 66% accuracy.


Exactly. I might not buy a satnav that was 66% accurate, but I would definitely buy a lottery number predictor with that accuracy. I don't know whether protein folding is closer to the former or the latter, but you can't make a blanket statement without considering the context.


It's not strictly 60% or 92%, it depends on the threshold of what is considered "good enough", 0.3Å or 1.6Å, which depends on what you want to use the prediction for.

When you set a threshold you improve precision at the detriment of recall. It's a tradeoff you can play with, but the score depends on it.


My machine learning professor would always remind us that 50% is analogous to random chance in a binary decision, but in a model where the prediction is not so black and white, 66% doesn't sound half bad!


I guess, but wouldn't that only be useful if there were only two possibilities? If the next "thing" in a sequence is from thousands, millions, or quintillions of possibilities then 67% is hellatiously better than 1 in $really_huge_number


It seems like this protein folding problem is something like the binary decision "guesses the winning lotto numbers" vs. "does not guess the winning lotto numbers".

If it got the answer right 1 time in 100, that would be amazing and you'd be foolish not to use it!


>If it got the answer right 1 time in 100, that would be amazing and you'd be foolish not to use it!

Except you have no way of knowing if the answer it gives you is one of the 66 right predictions vs one of the 33 wrong predictions. You could say it's likely correct, but not to a high enough degree of confidence that you could really trust it without verifying using the old established techniques.


You missed my point - there are not 33 but billions (an infinite number, really) of wrong protein structures. There is only one right structure, not 66.

Also important - verifying that a given model has a signature that matches the established techniques is far easier than using those techniques to generate the complete model from scratch.


The 33 vs 66 is in reference to the percentage chance that a prediction is correct. But if you have no way to tell if a given prediction is correct or not without doing the tests that you were trying to avoid in the first place then it's not really worthwhile except for perhaps some exploratory research.

I'm not really sure what your point is.


> But if you have no way to tell if a given prediction is correct or not without doing the tests that you were trying to avoid in the first place then it's not really worthwhile except for perhaps some exploratory research.

This isn't really how it works.

To quote the CASP competition organisers:

The organizers even worried DeepMind may have been cheating somehow. So Lupas set a special challenge: a membrane protein from a species of archaea, an ancient group of microbes. For 10 years, his research team tried every trick in the book to get an x-ray crystal structure of the protein. “We couldn’t solve it.”

But AlphaFold had no trouble. It returned a detailed image of a three-part protein with two long helical arms in the middle. The model enabled Lupas and his colleagues to make sense of their x-ray data; within half an hour, they had fit their experimental results to AlphaFold’s predicted structure. “It’s almost perfect,” Lupas says. “They could not possibly have cheated on this. I don’t know how they do it.”[1]

So you have experimental results, but still don't know how it folds. You aren't trying to avoid the all the experiments, just understand them.

[1] https://www.sciencemag.org/news/2020/11/game-has-changed-ai-...


> Has protein folding been solved? Yes.

No. Look at the actual competition results.

> As a practitioner, what I want to be able to do is pull a sequence that I've retrieved from DNA, drop it into a computer, and get the structure out. Intuitive insight can FOLLOW from those results.

According to the competition results, you cannot do that using AlphaFold2.


"We have been stuck on this one problem – how do proteins fold up – for nearly 50 years. To see DeepMind produce a solution for this, having worked personally on this problem for so long and after so many stops and starts, wondering if we’d ever get there, is a very special moment. "

- Professor John Moult

Co-Founder and Chair of CASP, University of Maryland

Gee, the co-founder of the competition calls it a solution. I wonder if that might be because this is a solution to the problem, as he A) knows much more about the problem than you, B) has actually personally seen the results, and C) has far more context than you on which to base his decision to call it a solution.


He should also look at the actual competition results.


According to the competition organisers, you can:

So Lupas set a special challenge: a membrane protein from a species of archaea, an ancient group of microbes. For 10 years, his research team tried every trick in the book to get an x-ray crystal structure of the protein. “We couldn’t solve it.”

But AlphaFold had no trouble. It returned a detailed image of a three-part protein with two long helical arms in the middle. The model enabled Lupas and his colleagues to make sense of their x-ray data; within half an hour, they had fit their experimental results to AlphaFold’s predicted structure. “It’s almost perfect,” Lupas says. “They could not possibly have cheated on this. I don’t know how they do it.”

Sure, you want to validate things experimentally, but that isn't different from imaging.

[1] https://www.sciencemag.org/news/2020/11/game-has-changed-ai-...


just for clarification, I don't know for certain, but it sounds like this is what they are talking about in science-journalism speak:

https://news.ycombinator.com/reply?id=25262458&goto=threads%...

So they bootstrapped the phases of the X-ray diffraction pattern with the phases from AlphaFold. This does mean, though, that we must be critical - it is possible that they are converging to nonsense that AlphaFold has generated. To be truly sure, you must have some sort of independent confirmation. Might still be "good enough" depending on what your application is.


I know what I'm about to say is off-topic, but:

>I have 15+ years of postgraduate chemistry/molecular bio/biophysics/biochemistry experience, then quit to go into tech, worked for AI hardware and now AI-driven software startups, mostly as backend

What made you quit biochemistry and enter tech and AI?


I have a strong anti-IP position: Didn't want to work for pharma, lost the battle to get into academia, didn't appreciate how much academia doesn't care about your science and how much you have to play politics to get in. I was always good at programming, figured it pays well (it does). Open source feels good, even with its warts.


so the answer is essentially YMMV depending on what project you're working on. This will accelerate certain aspects of structural biology and drug discovery, for others not so much. I don't see cryo-em, x-ray crystallographers, and NMR spectroscopists all losing their jobs anytime soon. At the end of the day you're going to have to validate and verify the claims of your simulation.


Could you use AlphaFold to just get in the ballpark, then use that to bootstrap classic brute force methods (like Folding @ Home)?


probably depends on just how far off the "wrong" structures are. I presume the tally of wrong structures will get lower as the model gets refined.


If protein folding has been solved, can DeepMind be used to engineer new prions for study and/or weapon use?


> Probably deepmind will have difficulty solving the fold of amyloid fibrils


A weaponized prion might be the scariest thing I've ever heard of.


Eh, having studied prions (and related conditions, like diabetes and Alzheimer's), I think it's just because they're "exotic". If I just called prion disease "contagious Alzheimer's"... You'd probably be more scared of Ebola, because the way it kills you is more painful, and you're going to be isolated, so you will die alone except for people in spacesuits; the only major difference is that the prion diseases are currently incurable. But if an agent were being successfully weaponized, infrastructure for curing more conventional conditions would be stressed to the point where you would likewise be incurable (we were stressed in many places by COVID-19, and outside of the risk that you go critical, that is very much not a scary condition for a huge chunk even of the people that get it).


I think the "completely incurable" aspect of it is the scary part.

I'm not AS concerned about something like Ebola, because, while horrible, its horribleness is self-limiting: people die (relatively quickly) before they can spread it too far.

On the other hand, weaponized vCJD could have a population die off en masse years after they're infected. Since it can be transmitted via blood transfusions, it'd essentially shut down blood donations (although if everyone had it, I guess that wouldn't matter). Also, just the fact that prion diseases can cause psychological symptoms makes the implications different from something like a classical infectious disease.


> On the other hand, weaponized vCJD could have a population die off en masse years after they're infected.

Maybe the infection could use genetic mechanisms so that the infected pass it on to their children. Maybe this already happened thousands of years ago. X-Files Theme.


Personally, I would definitely choose to die of Ebola rather than Fatal Insomnia. You get slowly driven insane by your inability to sleep. Truly one of the worst ways to go.


Why are you pessimistic about AI for drug discovery in this innovation cycle?


I don't think AI will be able to dock small molecules with proteins.

You can bin all astounding ML results into one of two categories: convolutional forms or sequential forms. The problem space does not fit into either of those so cleanly.

Yes, you can turn a molecule into a graph, and there are graph ML techniques, but how that interacts with a protein also does not have a lot of source data (real or synthetic), nor is the interaction space obviously reducible to a differentiable form. Either would be sufficient for today's breed of ML to excel at this.


You want to increment your count of useful Google services? ;)


Take all my upvotes please, this is a wonderful response.


Mohammed AlQuraishi elaborated on this in his 2018 "What Just Happened?" blog post that was discussed on HN when the news was announced. [1]

It's a fairly long read, but he goes into a lot more detail as to why the problem isn't yet solved. Interestingly, he also notes that pharma and academia should feel some embarrassment from DeepMind's achievement:

"What is worse than academic groups getting scooped by DeepMind? The fact that the collective powers of Novartis, Pfizer, etc, with their hundreds of thousands (~million?) of employees, let an industrial lab that is a complete outsider to the field, with virtually no prior molecular sciences experience, come in and thoroughly beat them on a problem that is, quite frankly, of far greater importance to pharmaceuticals than it is to Alphabet. It is an indictment of the laughable “basic research” groups of these companies, which pay lip service to fundamental science but focus myopically on target-driven research that they managed to so badly embarrass themselves in this episode."

[1] https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp...

(Edited to clarify that the blog post was not recent.)


That quote is from 2018.

Here is what Mohammed AlQuraishi said in 2020:

CASP14 #s just came out and they’re astounding—DeepMind looks to have solved protein structure prediction. Median GDT_TS went from 68.5 (CASP13) to 92.4!!!! Cf. their 2nd best CASP13 struct scored 92.8 (out of 100). Median RMSD is 2.1Å. I think it's over [0]

and

“I think it’s fair to say this will be very disruptive to the protein-structure-prediction field. I suspect many will leave the field as the core problem has arguably been solved,” he says. “It’s a breakthrough of the first order, certainly one of the most significant scientific results of my lifetime.” [1]

[0] https://twitter.com/MoAlQuraishi/status/1333383634649313280

[1] https://www.nature.com/articles/d41586-020-03348-4


I can't square this with the actual competition results. There are tons of targets where AlphaFold2 didn't score over 70%. It was typically only 10% ahead of the next best program. The metric everyone is looking at, GDT_TS, gives marks for atoms that are within eight angstroms of where they should be - by experimental standards, eight angstroms is a huge miss! There are targets where on some important metrics, like C-alpha RMSD, AlphaFold2 did worse than another program!
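
For anyone unfamiliar, GDT_TS is roughly computed like this (the official CASP calculation also searches over superpositions to maximize each fraction, which this sketch skips; numpy here is just for the arithmetic):

  import numpy as np

  def gdt_ts(pred_ca, true_ca):
      """pred_ca, true_ca: (N, 3) arrays of C-alpha coordinates, assumed
      already superposed. Returns the 0-100 GDT_TS score."""
      dists = np.linalg.norm(pred_ca - true_ca, axis=1)
      fractions = [(dists <= cutoff).mean() for cutoff in (1.0, 2.0, 4.0, 8.0)]
      return 100.0 * float(np.mean(fractions))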


You're making extremely strong claims, counter to scientists with no incentive to praise Google being quoted with reactions such as expecting a mass exodus from computational biology. I, and I assume we, are open to hearing more but I'm not sure cherry-picking a couple examples is enough to credit your claims contra theirs.

For instance, the gentleman who was presented as a skeptic is instead shown to also say it's solved, and in reply to this you say "everyone" is looking at the "wrong" metric, and that it allows errors of 8+ atom widths - he notes the median error is 2.1 angstroms, or 2.1 atom widths.


I feel like there's a certain je ne sais quoi to the comments Google employees write where I can just immediately sense them without even needing to check.


You are not helping your case by calling out people based on where they work instead of what they say. Refulgentis is raising factual points.


Pharma companies don't feel any embarrassment because it's really not much of a priority for them. Protein structure determination simply isn’t a rate-limiting step in drug discovery in general. [1]

[1] https://blogs.sciencemag.org/pipeline/archives/2019/09/25/wh...


That is true but every pharma has a structural biology department that solves product-related crystal structures with the intent of using that data to speed development. It's an established thing, $100+M/year budget per company. Generally well-respected and considered some of the most important IP the company can generate.

They also have theory groups which try to do structure prediction and drug docking but I don't think those groups get any respect inside the companies any more.


Counterpoint: no-one needs to feel any embarrassment. The advancement of human understanding is not a boxing match.

The author acknowledges as much in the preamble, "my mood lifted during the meeting due to the general excitement and quality of discussions, [...] my tribal reflexes gave way to a cooler and more rational assessment".


Tell that to said company's stockholders and get their response :). In the big picture, though, new, correct knowledge is always welcome in the pool.


I think it's arguably the exact opposite. It would have been surprising if medical companies had shown that machine learning can model protein folding so much better before the world's leading AI research group could. The demonstration was one of machine learning research being applied to medical problems, not medical research being applied to machine learning problems, so DeepMind was the one with the head start on expertise.


And let's not forget that deep learning started really exploding and becoming mainstream around 2015-2016. I'm not sure how you could expect someone like Pfizer to hire top experts and build a competitive lab in such a short time, particularly since Pfizer would want these people to focus exclusively on biotech, which would not be that attractive to most deep learning researchers.


OpenAI was founded in Dec 2015. They have a smaller budget than Pfizer and work on less practical problems.

It's a lack of vision and leadership from pharma labs.


> It's a lack of vision and leadership from pharma labs.

That's probably true, but I don't think Pfizer could have beat DeepMind, even if they had really tried. DeepMind is in a unique position to recruit lots of young deep learning researchers.


> The fact that the collective powers of Novartis, Pfizer, etc, with their hundreds of thousands (~million?) of employees

Deepmind have at least a few thousand employees, I'm willing to go out on a limb and say that the industry doesn't have "hundreds of thousands" of people working on protein folding alone.


well, not everyone at DeepMind is working on protein folding either presumably (disclosure: I work at Google, but I know nothing about DeepMind other than whatever's public). I think the point they're making is that DeepMind has an order of magnitude or two fewer people working on this for a much shorter period of time than pharma companies.


In the video on the announcement page, there was a short scene where they showed a Zoom call where the results were announced. I counted 20 people. The paper they published in Nature has a comparable number of authors, so it matches up. Probably this is not the precise team size, but it gives you a rough estimate.

What's important, I think, is which people they hired, as well as DeepMind-internal infrastructure. You can just walk over to one of the world's experts in deep learning and ask them how to address some issue.

In pharma companies they have people who are really smart about drugs and biology, but ML experts would all have to be hired externally, and likely don't represent the top of the field as well as Deepmind does.


Has Deepmind ever released how many employees it has? Several thousand seems much larger than I would have thought. Please cite your source on that.


Here's a June 2020 article that says they have ~1,000 people: https://www.cnbc.com/2020/06/05/google-deepmind-alphago-buzz...


At first it sounded like a lot, but when I remembered my FAMANG org of similar size, it's actually not that big. A 1000-person org probably nets you around 15-30 teams, which have to include frontend, backend, infra, theory, experimental, internal tooling, etc. I guess the biggest question is how many people in DeepMind are sales/consulting etc. Do they have any external for-profit contracts?


There’s the rub. Depending on who you ask DeepMind appears to, as of a year or two ago, be running deep in the red.


DeepMind will always be a loss center. That's what research groups are.

I don't know how UK R&D tax works, but most countries allow companies to write off some profits against R&D. Apart from the fact Google wants to do research, having a large center based in the UK probably helps their UK (and formerly EU?) tax position.


How would DeepMind get any revenue? I thought it was solely for research.


This post is from 2018, when AlphaFold won the CASP competition, but nobody was claiming that it "solved" the problem, as you are now hearing in response to AlphaFold's performance at the 2020 competition.


The same person who wrote said post in 2018 wrote about the 2020 AlphaFold results:

> CASP14 #s just came out and they’re astounding—DeepMind looks to have solved protein structure prediction. Median GDT_TS went from 68.5 (CASP13) to 92.4!!!! Cf. their 2nd best CASP13 struct scored 92.8 (out of 100). Median RMSD is 2.1Å. I think it's over

https://twitter.com/MoAlQuraishi/status/1333383634649313280


This blog post is two years old, whereas the linked blog post for this overall discussion is current.


You're right, I should have pointed that out. I edited my original post to clarify. Thank you.


DeepMind's first entry into the competition was two years ago.


Right, so it's not exactly pertinent to whether the most recent work "solved" (for whatever definition of solved) protein folding.


And they have improved significantly in the two years since then, so many if not all of these criticisms that apply to their efforts from two years ago may now be out of date.


My impression is that DeepMind's performance has improved significantly since then.


I think it is also very embarrassing to IBM, which has been trying to get into healthcare with Watson.


I bet they are hiring BERT-ologists now.


In the spirit of clarification, I want to share snippets from three people mentioned in the Nature article [1]:

John Moult: ''' “This is a big deal,” says John Moult, a computational biologist at the University of Maryland in College Park, who co-founded CASP in 1994 to improve computational methods for accurately predicting protein structures. “In some sense the problem is solved.” '''

Andrei Lupas: ''' AlphaFold is unlikely to shutter labs, such as Brohawn’s, that use experimental methods to solve protein structures. But it could mean that lower-quality and easier-to-collect experimental data would be all that’s needed to get a good structure. Some applications, such as the evolutionary analysis of proteins, are set to flourish because the tsunami of available genomic data might now be reliably translated into structures. “This is going to empower a new generation of molecular biologists to ask more advanced questions,” says Lupas. “It’s going to require more thinking and less pipetting.” '''

Mohammed AlQuraishi: ''' “I think it’s fair to say this will be very disruptive to the protein-structure-prediction field. I suspect many will leave the field as the core problem has arguably been solved,” he says. “It’s a breakthrough of the first order, certainly one of the most significant scientific results of my lifetime.” '''

[1]: https://www.nature.com/articles/d41586-020-03348-4


I share these quotes because people don't have the same ideas about what "solving" protein folding means. I'd suggest finding clearer concepts. For example:

* What degree of accuracy is attainable via each technique?

* How much (a) wall clock time / (b) overall compute resources are required for each technique?

* What use case(s) fit best with each technique?

I offer these because I see a lot of energy expended in people digging in and defending their definitions, rather than understanding what other people mean.


I think we can just go from prior works. DNA sequencing uses AI to sample the DNA at certain spots, then predict what the amino acids/genes will be from there.

There isn't much significant research toward finding new ways of doing DNA sequencing, since this method is good enough and can improve with more data and better models.

We consider it "Solved" in this case.

Tons of tooling was built on top of it and until we can get the true sequence of amino acids quickly and cheaply, it's not going away.


You don't get determinism when working within the body.

Having a high enough accuracy will give you a "good idea" of the interactions it might have with other proteins and substances, but can't account for the millions of other interactions they might have with other particles.

Most of bioinformatics isn't deterministic; it relies on stochastic measures. DNA sequencing is done by sampling, then predicting the rest, and as far as my own biology training goes, it's categorized as "solved". Sequencing might get better, but we've accepted it as a solution for the moment.

Arguing over the term "solved" is just pedantic. We know it's not 100%, but improving the prediction by a few more angstroms isn't going to make a huge practical difference.

What matters is that we can start actually building tooling and lab tests on this method.


The protein folding problem isn't part of the body- it's a reduction in which the protein is folded in regular water solvent with no other components.

I'm not sure what you mean by "DNA sequencing is done by sampling then predicting the rest". DNA sequencing works by oversampling and then making a "call" about the specific base in a position given the evidence. Regions without data are described as N with estimated length M, rather than a "prediction".
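
As a toy illustration of the "call" step (real basecallers and variant callers weigh base quality scores rather than raw counts):

  from collections import Counter

  def call_base(observed_bases, min_coverage=5):
      """Majority-vote consensus for a single position; observed_bases is the
      list of bases seen in reads covering that position, e.g. ['A','A','G','A'].
      Returns 'N' when coverage is too low to call confidently."""
      if len(observed_bases) < min_coverage:
          return "N"
      base, _ = Counter(observed_bases).most_common(1)[0]
      return base

  print(call_base(list("AAAGAAATAA")))   # -> A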


The fragment amplification stage is definitely oversampling. I feel the original poster might have been referring to how the input DNA could've been heterogeneous, as it is from multiple cells. Could've also been confounding GWAS with sequencing, too, I suppose.

Given the technical limitations of single cell techniques it seems you both could be technically correct regarding the sampling.


Protein folding is often aided by chaperones.

Usually chaperones just accelerate the process, but some proteins require chaperones to fold properly. Some chaperones prevent or correct damage caused by misfolding.


The protein folding problem (as tested by the CASP competition, and "won" by DeepMind here) does not include proteins that require chaperones. It covers spontaneous reversible folders like ribonuclease A.


Shouldn't the next step just be predicting a distribution of possible structures, instead of just the most likely? That doesn't seem crazy given where we're at on the machine learning side; predicting distributions is almost the default at this point, in many areas.


That is actually one of the "classic" ways of doing it. There's a bunch of different Bayesian models for protein structure - as an example, I implemented a model for structural changes in evolution for my master's thesis

The big challenge for these kind of models is the curse of dimensionality. Since every atom in the structure can potentially interact with every other atom, it's tricky to make a joint distribution for the entire sequence and it's rare to have a model that's both accurate and parallelizable, so the field hasn't benefitted much from the advances in for instance GPU computing


In practice it doesn't matter. What DeepMind has done far outstrips previous results. Now that we know neural networks for protein prediction are not a dead end, the accuracy will easily improve with time.


Yes, we will just map out/prepare more training data, train better networks. Prepare more and more training data... until we have mapped out all proteins known and there is nothing left to predict for the networks...


The 2020 AlphaFold team mapped ORF8, a protein associated with COVID-19. It seems unlikely that mapping out everything we currently know of would be the end of their research. COVID-19 is probably not the last deadly pathogen we will encounter.


This is exactly how I feel; protein folding is almost chaotic in the sense that tiny differences in specific atomic locations end up having huge functional impacts, which is completely unlike, say, neural machine translation, where a slightly garbled translation is still intelligible for the most part. I don't quite see how this approach to protein folding helps if you can't actually be sure about the predicted structure's functionality without doing the expensive experimental verification.


Well, the other way to think about the problem is that protein structure can be robust to sequence changes. Among natural proteins, proteins that look similar can differ by up to about 70% in sequence identity (since evolution had a hand in making sure the structure stayed folded as the sequence diverged from its ancestor). So long as some critical members of the sequence are preserved, the protein folds roughly to the same structure. AlphaFold does take advantage of this, since part of the algorithm looks for sequence alignments with known proteins.
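
("Sequence identity" here just means the fraction of matching positions between two already-aligned sequences; a toy calculation, with the alignment assumed to come from something like BLAST or HMMER:)

  def percent_identity(aligned_a, aligned_b):
      """aligned_a and aligned_b are equal-length strings from an alignment,
      with gaps as '-'."""
      pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
      matches = sum(a == b for a, b in pairs)
      return 100.0 * matches / max(len(pairs), 1)

  print(percent_identity("MKV-LLA", "MRVALLA"))   # ~83% over aligned positions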


Huh, I had no clue that this was the case. Do you have examples of specific proteins that differ in sequence to a high degree but are similar in function?


It helps the experimental phase speed up by a lot. It basically caches previous knowledge so you don't need to repeat experiments in each possible configuration.

Isn't it amazing that the same model (transformer) is now SOTA in both language and proteins? Seems like the real story here is the benefits we could get from the transformer in many different fields, not just NLP.


I think part of the answer here is that you can more easily verify than you can determine. Are x y and z where we expect them? Yes? Looks good.


Except you can’t, right? Figuring out where we expect them to be involves finding out where they are in the first place.


This isn't right.

You can conduct cheap-and-easy experiments to verify the results, as opposed to imaging which doesn't always work anyway.


I don't quite understand this; what's an example of an experiment that verifies whether or not you predicted structure is accurate without determining the actual structure?


I mean, use the ML model to predict where they will be and then do an experiment to confirm it. My understanding from what I've read is that it's easier to make sense of experimental data (and likely cheaper) when you have a good idea what you're looking for.


The more useful outcome is CAD for proteins. Studying existing proteins is just a small fraction of the possible uses.


> until we have mapped out all proteins known and there is nothing left to predict for the networks...

I would imagine there's also benefits to studying _unknown_ proteins, and even being able to work backwards from desired characteristics to discover possible new ones.


The bottom line? Understanding disease biology and biological networks is the rate-limiting step in revolutionizing drug discovery. No matter how well the biophysics and structural biology progresses, medicinal chemists could be targeting the wrong protein the whole time. That being said, this achievement almost seems miraculous given where we were decades ago, and it's waiting for us as we work hard to figure out the disease biology piece of the puzzle.


Yes, this 100%. Back when I started my PhD program in biophysics, the exact opposite was stated: "if you can determine the 3D structure of the protein that causes a disease, you can target it with a drug." I wasted decades believing that paradigm because really smart people kept repeating it.

Unfortunately, there is no clear next step for the disease biology part of the study, as far as I can tell, except to collect enormous amounts of high-quality data about diseases, typically one or a few at a time, and hope you get lucky finding something (i.e., serendipity is just as important as intelligence).


A couple of other valuable perspectives:

https://explainthispaper.com/ai-solving-protein-folding/

Another supporting article from Derek Lowe's blog (think medical science's Stratechery; highly acclaimed and usually cynical): https://blogs.sciencemag.org/pipeline/archives/2020/11/30/pr...


Highly recommend the recent book Science Fictions by Stuart Ritchie, which discusses in one chapter how scientists often write the press releases themselves and then the media just copy them almost verbatim. This is obviously a little different because the scientists are working for a company rather than a university, but I think the same thing applies here.


This is very different. DeepMind submitted their solutions to a hidden test set at CASP14, with a known metric and a value that the subfield considered to mean solving protein folding. They cleared the bar with AlphaFold2. This is not a PR stunt by a company trying to spin their meager results.


Really? I'm only familiar with university PR people writing them on the scientists' behalf, inducing a lot of cringe on the part of the latter.


To summarize:

1. "...only two-thirds of DeepMind’s solutions were comparable to the experimentally determined structure of the protein. ..."

2. "... the average or root-mean-squared difference (RMSD) in atomic positions between the prediction and the actual structure is 1.6 Å (0.16 nm). That’s about the size of a bond-length."

3. It may be "... more difficult to predict the structures of proteins with folds that are not well represented in the database of solved structures."

4. "... the method cannot yet reliably tackle predictions of proteins that are components of multi-protein complexes."


Now that we have this inevitable discussion venue after everyone has digested the PR (and some of the people here might have even had a chance to ask some questions of DeepMind), some remarks (questions, if anyone from the team is here):

- I thought that open code was becoming the standard and also being pushed by Google? But apparently this does not apply to DeepMind, because of $$$s? For the original AlphaFold (which was actually 3 models), no code is available except where it had to be (their Nature publication on one of the three models used in the competition).

- why did they not participate with all the available proteins? I guess it's some loophole in the rules, to allow for greater "improvements" even when models are not super general, but from a naive scientific view that is absolutely stupid.

- maybe they went fully generative this time, but in CASP13, 2/3 of their models were just blackbox predictors, which they optimized with simulated annealing. Given that the configurational sample space for the protein is huge, doing that still seems rather costly. I wonder how the actual spectrum fitting works and compares to that, and why experimentalists could not go this route as well (just do simulated annealing until the spectrum fits).

- they already trained on all known proteins, yet with some they are still far off. Seems like it's not solved really, though results are certainly impressive and it could be a great tool for any person interested in frozen protein structures.


I see a lot of negative reactions in the comments, but at least this part seems fair to me:

> That advance will be much clearer once their peer-reviewed paper is published (we should not judge science by press releases), and once the tool is openly available to the academic community

DeepMind has a tendency for hyper inflated PR (they're not the only ones, mind you); waiting for the scientific process to run its course before claiming victory sounds good to me.


> DeepMind has a tendency for hyper inflated PR

Can you point at some?

AlphaZero (emphasis: AlphaZero, not AlphaGo) is probably the most significant AI breakthrough of the last 25 years (maybe more - I think it's more significant than AlexNet on ImageNet) and there was very little hype about it.

AlphaGo got quite a lot of PR, but a lot of that came from the Go community, especially in Korea.


We can't consider this solved, or claim that we understand how the process works, until we can do it ab initio. Improvements in machine-learning and other experiment-based algorithms are great, since they're the best we have.

We need a breakthrough in chemistry and the ability to solve the Schrödinger equation in 3D (or something similar) to truly solve this, i.e. generate the electronic structure statically without using arbitrary, tuned constants, then evolve it over time. We know the rules and constraints, but unfortunately, we can't solve it without using approximations and fitting to experimental data.

Machine-learning approaches will always suffer from over-fitting; they can produce practical results for known cases, but their predictive power is limited. (But still impressive!)


As dnautics implies, this is like you make a model of a mousetrap - along comes a mouse - snap, what is the shape of the closed mousetrap? By this I mean proteins fold into a shape, and DM gives us some idea what that looks like, then it enters into a reaction - what is the shape now? Many are enzymes and snap and release, snap and release. Some poisons have the same initial shape - snap - then no release - reaction site has been poisoned - the body thinks it has lots, but the body dies of the poison. Some toxins that kill by ribosomal inactivation are like this, mushroom poisons, ricin are like this. Mankind would love to be able to reverse ricin and mushroom poisonings, which usually happen days after the toxic event. Not being an expert, this is just an analogy.


There's one thing this critique didn't entirely make clear to me, hoping someone here could answer. Is there a relatively efficient algorithm to tell that a particular folding solution is correct (or within a certain distance of correct)? If there was, at least then this AlphaFold would either tell you "here, this is more or less what this protein folds into" or "I can't figure out the folding". Even if it told you it can't figure out the folding a large portion of the time, at least it isn't giving you garbage data.


Complete lay person here, so please bear with me if the question is very silly. Can this help the work on Perovskite solar cells [1] by any chance? I ask because it appears AlphaFold has applications in crystallography and (to the lay person) it seems that finding the right crystal structure is key in making a Perovskite solar cell that can last for decades.

[1] https://en.wikipedia.org/wiki/Perovskite_solar_cell


I wonder if these select industry labs with full creative freedom (DeepMind, Brain, OpenAI) will end up being a Xerox PARC / Bell Labs moment for our generation.

A group of incredibly intelligent people allowed to do whatever fundamental research they want at the expense of the parent company's business, whose value will not be appreciated for a couple of decades, until everything new in technology points back to a fundamental innovation that came out of that lab.


Does anyone know where the latest code for this is? I can see the CASP 13 code is here: https://github.com/deepmind/deepmind-research/tree/master/al...


This isn't the full CASP13 system; it's one of three DeepMind models from CASP 13 (and it doesn't contain any of the feature-related code). And not even the most interesting one, imho (the variational LSTM/CNN autoencoder, which generates structures from sequences). Just be aware that they are doing this for money and PR, not for the good of the people as their website suggests, although they really try to put a smokescreen around that.


The aspect of this story that fascinates me is that one could argue that DeepMind has not, in fact, solved anything. An inscrutable, black-box ML model has. While such an algorithm could ideally predict protein folding in any case, it can't explain anything about why, and therefore can't really advance the science.

An analogy might be that if you trained an AI model on billiard balls, it could become really good at telling you where a ball will end up when you hit it, but it could never tell you that the reason is that F = ma, meaning it will do nothing to advance the science.


We've known the "why" of protein folding for many years. You can plug your amino acid chain into NWChem and have it spit out the system Hamiltonian. From there, it's just a matter of using the Schrödinger equation to evolve the system in time. Good luck finding a computer that can do that before the universe goes dark, though.

I guess you're hoping for some higher-level heuristic that would let us skip a lot of the computation. Maybe it exists. That isn't how nature does it, though.
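
To give a flavor of what "evolving the system in time" means numerically, here's a toy sketch: unitary propagation of a state under a stand-in Hamiltonian matrix. The dimension below is made up and laughably small; for a protein, the Hilbert space is so large the matrix could never even be stored, which is the whole problem.

    import numpy as np
    from scipy.linalg import expm

    dim = 2 ** 10   # toy Hilbert-space dimension; a protein's would be astronomically larger

    # Stand-in Hermitian "Hamiltonian" (random numbers); in reality this would come
    # from something like NWChem and would not remotely fit in memory for a protein.
    A = np.random.randn(dim, dim) + 1j * np.random.randn(dim, dim)
    H = (A + A.conj().T) / 2

    # One step of time evolution: U = exp(-i * H * dt), with hbar = 1
    dt = 0.01
    U = expm(-1j * H * dt)

    psi = np.zeros(dim, dtype=complex)
    psi[0] = 1.0                      # arbitrary initial state
    for _ in range(5):
        psi = U @ psi                 # propagate in time
    print(abs(np.vdot(psi, psi)))     # norm stays ~1: unitary evolution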


Uhhhhh, nobody has convincingly shown that applying the time-dependent Schrödinger equation to a protein Hamiltonian would solve the folding problem, even given infinite computer time. (Note that I am one of the few people who has tried, using extremely large amounts of computer time and classical force fields.)


Reading your other comments, you say you have a PhD in biophysics, so you clearly know a lot more about this than me, and I'm interested: are there reasons to believe it isn't just a matter of scale (obviously we're talking about exponential scaling here, i.e. completely infeasible, but you know)? The field of quantum computing has also tied a lot of its value to efficient Hamiltonian simulation and what that can do for biology and materials science.


I only mean that nobody has done the experiment (nor is it feasible to run). My base hypothesis is that, given enough computer time and an accurate potential function, molecular dynamics would indeed solve the limited protein folding problem (small domains that are soluble in water), but I also think there must be a far simpler and less expensive way to do it (most likely using deep neural nets that incorporate a wide range of information).

That said, I also believe that QC could in principle be a way to address this effectively as well, but I'm waiting until I see somebody demonstrate something interesting and useful before I get excited.
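
For anyone curious what "computer time and an accurate potential function" cashes out to, classical MD is conceptually just the loop below (velocity-Verlet with a toy pairwise potential). The potential, bead count, and step count here are placeholders for illustration, nowhere near a real protein force field or a real folding run.

    import numpy as np

    def lj_forces(pos, eps=1.0, sigma=1.0):
        # Pairwise Lennard-Jones forces: a stand-in for a real force field
        # (bonds, angles, torsions, electrostatics, solvent), all omitted here.
        forces = np.zeros_like(pos)
        n = len(pos)
        for i in range(n):
            for j in range(i + 1, n):
                r = pos[i] - pos[j]
                d2 = r @ r
                inv6 = (sigma ** 2 / d2) ** 3
                f = 24.0 * eps * (2.0 * inv6 ** 2 - inv6) / d2 * r
                forces[i] += f
                forces[j] -= f
        return forces

    def velocity_verlet(pos, vel, mass, dt, steps):
        # Integrate Newton's equations; real MD adds thermostats, constraints, PME, etc.
        f = lj_forces(pos)
        for _ in range(steps):
            vel += 0.5 * dt * f / mass
            pos += dt * vel
            f = lj_forces(pos)
            vel += 0.5 * dt * f / mass
        return pos, vel

    rng = np.random.default_rng(0)
    # Ten beads near the LJ minimum spacing; a folding run would need ~10^4-10^6
    # atoms and on the order of 10^12 or more femtosecond-scale steps.
    pos = np.arange(10.0)[:, None] * np.array([1.12, 0.0, 0.0]) + rng.normal(scale=0.02, size=(10, 3))
    vel = np.zeros((10, 3))
    pos, vel = velocity_verlet(pos, vel, mass=1.0, dt=1e-3, steps=1000)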


We already understand the physics and chemistry underlying protein folding (the why), but proteins are composed of so many building blocks that applying this understanding by brute force is woefully impractical.

By analogy, billiards is nothing but high-school physics, but understanding high-school physics does not on its own make you a master billiards player able to sink any shot.


Painstaking experimental ascertainment of protein structure doesn't really tell anyone "why" it ended up that way either, but it's still been worth doing and advances science a great deal.

https://www.alpco.com/dorothy-hodgkins-discovery-insulins-3d...


Right. The challenging thing about protein folding is that we've been able to probe it in numerous ways, but none of them actually gives a direct picture of what the process of folding looks like; that is, the specific physical configuration trajectories/envelope followed by an ensemble of folding proteins.


Just because you don't understand exactly how something works doesn't mean it's not useful. Someone who doesn't understand exactly how everything works - like how bikes stay upright or how gravity works - can still use a bike to acquire food or advance science.


> Just because you don't understand exactly how something works doesn't mean it's not useful

The OP didn't say that it is not useful; what they implied was that it is not actually science, which is correct. Science is a system that produces and organises knowledge. Chomsky made this point many years ago in a similar debate in linguistics: statistical learning might produce results, but it tells us virtually nothing about the underlying laws or structures that govern language use.

ML in its current form is effectively the modern version of behaviourism, and it will suffer (or already does) from the same issues.


"Science" is a very broad field.

In some sense, protein folding is a chemistry problem, in that it is entirely about determining the structure of (a very specific type of) molecules.

In another sense, protein folding is a computational problem that is a necessary input to answering higher level questions in the field of biology.

Put another way, this means researchers in biology no longer need to care about the science of protein folding itself.


Yes, they did.


Another academic professor trying to undermine something that makes their entire way of doing things look stupid - that's all there is to it.

When the human genome was being sequenced, another entrepreneur (Venter) came in, said "You guys are morons to spend billions over decades, let me show you how an actually smart team would do it", and beat them to it in a fraction of the time and cost. Yet the consortium that spent billions on the Human Genome Project still got congratulated. What's ironic is that when they said they had sequenced the human genome, that statement carried as many asterisks as this broad DeepMind statement does. In fact, the human-genome claim was probably more disingenuous than DeepMind's because - and this is important - DeepMind didn't make up the definition; the CASP organizers did. This brilliant team simply met a predefined standard for what counts as "solving protein folding".

Academics calling this hyperbole should first fire every university PR team, because the amount of hyperbole those teams add to every press release about a paper that "cured" cancer in mice is 10x larger than this.

Academia is fundamentally broken; the cracks started appearing in the sixties (read Hamming's lecture notes) and have since metastasized throughout, especially in biology. We are all dying faster because of this (Google the Alzheimer's cabal). Academia is now just a bunch of overperforming hacks who are honestly not that good at anything except sitting in circles on NIH grant review panels handing millions to each other, while offering "constructive criticism" like this to what is clearly a monumental achievement, because if they don't, they reinforce how useless they are as a group.

Going back to this hacky article, of course this is the first step, and there is much to be ironed out. But in the words of Sydney Brenner, a great scientist from a time when actually smart, humble people became professors, "The entry of large numbers of American ... into the field will ensure that all the chemical details ... will be elucidated." [1]

We are now definitely on a fairly deterministic path towards figuring out protein folding for all practical purposes. The few academics who still have humility and foresight see this. DeepMind is obligated to release enough of their results, and there are still enough sane minds left in academia that they will do what they are okay at, which is filling in the details.

[1] http://nemaplex.ucdavis.edu/General/Biographies/SBrenner.htm


Yeah a bit too much hype around ML.

I was at my org's conference, and the guy who heads the AI team was stating how deep learning is going to change the world - that it'll be writing software in a few years (he's from the applied math field). He also glossed over many ongoing problems with deep learning.

We're in the healthcare industry, and this guy is pushing a Theranos-like level of snake oil.

People who don't know a lick about ML or statistics (or both) rely on these people. The doctors are relying on that dude and me for these inputs, and the dude is selling Theranos-like stuff. I think the problem is that medicine, CS, and statistics are all high-level fields requiring years of training to acquire mastery and knowledge, so it's rare for someone to have all three and be able to form an impartial view or a good overview of the pros and cons.


Huh? Who exactly is pushing "Theranos-like level of snake oil" here? What "dude" are you referring to?



