Which part of genetics are you thinking of? Much of genetics isn’t amenable to this kind of ML, because it isn’t some kind of optimisation problem. And many other parts don’t require ML because they can be modelled very closely using exact methods. ML does get used here, and sometimes to great effect (e.g. DeepVariant, which often outperforms other methods, but not by much — not because DeepVariant isn’t good, but rather because we have very efficient approximations to the exact solution).
Genetics is amenable because the genome is a sequence that can be language modeled/auto-regressed for depth of understanding by the network.
There are plenty of inferences that you would want to do on genetic sequences that we can't model exactly and there is some past work on doing stuff like this, although biology is usually a few years behind.
Not sure what you mean by that. Genetics is a field of research. The genome is a sequence. And yes, that sequence can be modelled for various purposes but without a specific purpose there’s no point in doing so (and furthermore doing so without specific purpose is trivial — e.g. via markov chains or even simpler stochastic processes — but not informative).
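To make the "trivial" case concrete, here is a minimal sketch of a first-order Markov chain over nucleotides. The function names, toy sequences, and pseudocount are made up for illustration; nothing here comes from a real pipeline.

```python
import math

def fit_markov(seqs, alphabet="ACGT", pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | current base)."""
    # Start every count at the pseudocount so unseen transitions keep non-zero probability.
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for s in seqs:
        for cur, nxt in zip(s, s[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for a in alphabet:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in alphabet}
    return probs

def log_likelihood(seq, probs):
    """Log-probability of the transitions in seq (the first base is taken as given)."""
    return sum(math.log(probs[cur][nxt]) for cur, nxt in zip(seq, seq[1:]))

# Toy sequences, invented for this example.
toy = ["ACGTACGTACGT", "ACGTTACGAACG"]
model = fit_markov(toy)
```

Fitting this takes a few lines, which is the point: the model captures local base composition and little else, so on its own it doesn't answer any particular biological question.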
> There are plenty of inferences that you would want to do on genetic sequences
I’m aware (I’m in the field). But, again, I was looking for specific examples where you’d expect ML to provide breakthroughs. Because so far, the reason why ML hasn’t provided many breakthroughs is less about the lack of research and more because it’s not as suitable here as for other hard questions. For instance, polygenic risk scores (arguably the current “hotness” in the general field of genetics) can already be calculated fairly precisely using GWAS, it just requires a ton of clinical data. GWAS arguably already uses ML but, more to the point, throwing more ML at the problem won’t lead to breakthroughs because the problem isn’t compute bound or vague, it’s purely limited by data availability.
I could imagine that ML can help improve spatial resolution of single-cell expression data (once again ML is already used here) but, again, I don’t think we’ll see improvements worthy of being called breakthroughs, since we’re already fairly good.
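For readers outside the field, the polygenic risk score mentioned above is, in its simplest form, a weighted sum of risk-allele counts using per-variant effect sizes from a GWAS. A minimal sketch follows, assuming that simple additive form; the variant names, effect sizes, and genotype are invented for illustration.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of risk-allele dosages (0, 1 or 2) over the scored variants.

    Variants missing from `dosages` are treated as dosage 0, which is an
    assumption of this sketch rather than a standard convention.
    """
    return sum(beta * dosages.get(snp, 0) for snp, beta in effect_sizes.items())

# Hypothetical per-variant effect sizes (e.g. log odds ratios from a GWAS).
effect_sizes = {"snp_a": 0.12, "snp_b": -0.05, "snp_c": 0.30}
# Hypothetical genotype: number of risk alleles carried at each variant.
person = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

score = polygenic_risk_score(person, effect_sizes)
```

The arithmetic is trivial; the hard part is estimating good effect sizes, which is where the data requirement comes from.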
I spoke loosely, my mind skipped ahead of my writing, and I didn't realize that we were parsing so closely. "Genetics (the field) is amenable because the object of its study (the genome) is a sequence" would have been more correct but I thought it was implied.
> without a specific purpose there’s no point in doing so
Well yes, prior to the success of transfer learning I could see why you would think that is the case, but if you've been following deep sequence research recently then you would know there are actually immense benefits to doing so because the embeddings learned can then be portably used on downstream tasks.
> it’s purely limited by data availability.
Yes, and transfer learning on models pre-trained on unsupervised sequence tasks provides a (so-far under-explored) path around labeled data availability problems.
I already linked to a paper showing a task that these sorts of approaches outperform, and that is without using the most recent techniques in sequence modeling.
Maybe read the paper in Nature that uses this exact LM technique to predict the effect of mutations before assuming that it doesn't work: https://sci-hub.do/10.1038/s41592-018-0138-4
I am not directly in the field, you are right - but I think you are also being overconfident if you think that these approaches are exactly the same as the HMM/Markov chain approaches that came before.
Thanks for the paper, I’ll check it out; this isn’t my speciality so I’m definitely learning something. Just one minor clarification:
> Maybe read the paper … before assuming that it doesn't work
I don’t assume that. In fact, I know that using ML works on many problems in genetics. What I’m less convinced by is that we can expect a breakthrough due to ML any time soon, partly because conventional techniques (including ML) already have a handle on some current problems in genetics, and because there isn’t really a specific (or flashy) hard, algorithmic problem like there is in structural biology. Rather, there’s lots of stuff where I expect to see steady incremental improvement. In fact, in Wikipedia’s list of unsolved biological problems [1] there isn’t a single one that I’d characterise specifically as a question from the field of genetics (as a geneticist, that’s slightly depressing).
But my question was even more innocent than that: I’m not even that sceptical, I’m just not aware of anything and genuinely wanted an answer. And the paper you’ve posted might provide just that, so I’ll go and do my research now.
Not being in the field, I would term what I see in this story as a ‘bottom up’ approach to understanding genetics/molecular biology. More akin to applied sciences than medicine or health. This, for example, seems to be very important but it still leaves us with a jello jigsaw puzzle with 200 million pieces and probably far removed from immediate utility in health outcomes.
Then there are the more clinically oriented approaches of looking at effects and trying to find the associated genes/mutations, along with whatever mechanisms exist in between to cause a desirable or undesirable outcome. I’d call that ‘top down’.
I’m sure the lines get blurred more every day, but is there a meaningful distinction into these and/or more categories that are working the problem from both ends? If so, are there associated terms of art for them?
I cannot give constructive feedback to something which is incomprehensible.
"the genome is a sequence that can be language modeled/auto-regressed for depth of understanding by the network"
The genome is not a sequence so much as a discrete set of genes which are themselves sequences which specify construction plans for proteins. That distinction is important.
Language modeling in the context of machine learning typically means NLP methods. Genetics is nothing like natural language.
Auto-regression is using (typically time series) information to predict the next codon. This makes very little sense in the context of genetics since, again, the genetic code is not an information carrying medium in the same sense as human language. Being able to predict the next codon tells you zilch in terms of useable information.
"Depth of understanding by the network" ... what does that even mean???
The above sentence is a bunch of popular technical jargon from an unrelated field thrown together in a nonsensical way. AKA word salad.
> The genome is not a sequence so much as a discrete set of genes which are themselves sequences which specify construction plans for proteins. That distinction is important.
aka a sequence. "a book is not a sequence so much as a discrete set of chapters which are themselves sequences of paragraphs which are themselves sequences of sentences" -> still a sequence
these techniques are already being used, such as in the paper I just linked.
> Being able to predict the next codon tells you zilch in terms of useable information.
You have absolutely no way of knowing that a priori. And autoregressive tasks can be more sophisticated than just next codon.
> bunch of popular technical jargon from an unrelated field thrown together in a nonsensical way
Okay, feel free to think that.
There's always this assumption of it "will never work on my field." I've done work on NLP and on proteins and read others' work on genetics. I think you will end up being surprised, although it might take a few years.
It is incomprehensible to you because you simply do not understand what your parent is talking about. You are the ignorant one here and indeed quite rude. It doesn't matter that genetics is not natural language. The point is that we can train large transformers autoregressively, and the representations they learn turn out to be useful for a) all kinds of supervised downstream tasks with minimal fine-tuning and b) interpreting the data by analysing the attention weights. There is a huge amount of literature on this topic and what your parent says is quite sensible.
That statement you quote is completely understandable.
Let's say you have discrete sequences that are a product of a particular distribution.
Unsupervised methods are able, by just reading these sequences, to construct a compact representation of that distribution.
The model has managed to untangle the sequences into a compact representation (weights in a neural network) that allows you to use it for other, higher level supervised tasks.
For example, the transformer model in NLP allowed us to not have to do part-of-speech tagging, dependency parsing, named entity recognition or entity relationship extraction for a successful language-pair translation system. The compact transformer model managed to remap the sequences into a representation that allows direct translation (people have inspected these models and figured out the internal workings of it and realized it does have latent information about a parse tree of a sentence or part-of-speech of a word).
Another interesting note is that designers of the transformer architecture did not incorporate any prior linguistic knowledge when they were designing it (meaning that the model is not designed to model language but just a discrete sequence).
FWIW, transformers are to sequences what convnets are to grids, modulo important considerations like kernel size and normalization. Think of transformers as really wide (N) and really short (1) convolutions. Both are instances of graphnets with a suitable neighbor function. Once normalization was cracked by transformers, all sorts of interesting graphnets became possible, though it's possible that stacked k-dimensional convolutions are sufficient in practice.
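One way to read the analogy above, as a rough sketch only: a self-attention layer mixes all N positions with weights computed from the inputs, while a 1D convolution mixes a small window with fixed weights. The code below uses identity query/key/value projections and a single head, which is a simplification for illustration and not how production transformers are built.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(x):
    """Single-head self-attention with identity projections (a simplification).

    Every output position is a weighted average over ALL input positions,
    and the weights depend on the inputs themselves.
    """
    d = len(x[0])
    out = []
    for q in x:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in x])
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

def conv1d(x, kernel):
    """Zero-padded 1D convolution applied per channel.

    The weights are fixed and depend only on the relative offset.
    """
    half = len(kernel) // 2
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        row = [0.0] * d
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:
                for c in range(d):
                    row[c] += w * x[j][c]
        out.append(row)
    return out

# Toy sequence of three 2-dimensional vectors, invented for this example.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(x)
smoothed = conv1d(x, [0.25, 0.5, 0.25])
```

The two functions have the same shape of computation (a weighted sum over neighbors); what differs is the neighborhood size and whether the weights are learned constants or computed from the data.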
I work in the field, I don't need the difference explained to me.
> Think of transformers as really wide (N) and really short (1) convolutions
Modern transformer networks are not "really short" and you're also conflating the difference between intra- and inter- attention.
There is still a pitched battle being waged between convnets and transformers for sequences: transformers seem to have the upper hand accuracy-wise right now, but convnets are competitive speed-wise.
It'll be genetics next.
e: although AlphaFold appears to be convolutionally based! I suspect that'll change soon.