> Genetics is amenable because it is a sequence
Not sure what you mean by that. Genetics is a field of research. The genome is a sequence. And yes, that sequence can be modelled for various purposes, but without a specific purpose there’s no point in doing so (and doing so without a specific purpose is trivial, e.g. via Markov chains or even simpler stochastic processes, but not informative).
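To make the "trivial but not informative" point concrete, here is a minimal sketch of a first-order Markov chain fitted to a sequence. The toy sequence and the function names are made up for illustration; this is not taken from any particular genomics tool.

```python
import random
from collections import Counter, defaultdict


def fit_markov_chain(sequence):
    """Estimate first-order transition probabilities P(next base | current base)."""
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return {
        base: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for base, followers in counts.items()
    }


def sample(transitions, start, length, seed=0):
    """Generate a sequence by repeatedly drawing the next base from the fitted chain."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        options = transitions.get(out[-1])
        if not options:  # no observed successor for this base, so stop early
            break
        bases, probs = zip(*options.items())
        out.append(rng.choices(bases, weights=probs)[0])
    return "".join(out)


# Made-up toy sequence, purely illustrative (an assumption, not real data).
toy_genome = "ACGTACGTTAGCATCGATCGGCTA"
model = fit_markov_chain(toy_genome)
generated = sample(model, "A", 20)
print(model)
print(generated)
```

The model reproduces local base-to-base statistics and nothing more, which is the sense in which fitting it without a question in mind tells you very little.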
> There are plenty of inferences that you would want to do on genetic sequences
I’m aware (I’m in the field). But, again, I was looking for specific examples where you’d expect ML to provide breakthroughs. So far, the reason why ML hasn’t provided many breakthroughs is less about a lack of research and more that it’s not as suitable here as for other hard questions. For instance, polygenic risk scores (arguably the current “hotness” in the general field of genetics) can already be calculated fairly precisely using GWAS; it just requires a ton of clinical data. GWAS arguably already uses ML but, more to the point, throwing more ML at the problem won’t lead to breakthroughs because the problem isn’t compute-bound or vague; it’s purely limited by data availability.
I could imagine that ML can help improve the spatial resolution of single-cell expression data (once again, ML is already used here) but, again, I don’t think we’ll see improvements worthy of being called breakthroughs, since we’re already fairly good.
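As a rough illustration of why this is data-bound rather than compute-bound: in its simplest form a polygenic risk score is just a weighted sum of risk-allele counts, with the weights being the per-variant effect sizes estimated by a GWAS. The sketch below assumes that simple additive form; the variant names, effect sizes, and dosages are hypothetical, and treating a missing genotype as zero is a simplification.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of risk-allele counts over variants that have an effect estimate.

    dosages: variant -> number of risk alleles carried (0, 1 or 2)
    effect_sizes: variant -> per-allele effect size (e.g. a log odds ratio)
    Variants missing from `dosages` are counted as 0 (a simplifying assumption).
    """
    return sum(beta * dosages.get(variant, 0) for variant, beta in effect_sizes.items())


# Hypothetical numbers, purely illustrative.
effect_sizes = {"variant_a": 0.12, "variant_b": -0.05, "variant_c": 0.30}
dosages = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

score = polygenic_risk_score(dosages, effect_sizes)
print(round(score, 2))
```

The arithmetic is cheap; the hard part is getting effect-size estimates that are any good, and that is where the clinical data comes in.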
I spoke loosely, my mind skipped ahead of my writing, and I didn't realize that we were parsing so closely. "Genetics (the field) is amenable because the object of its study (the genome) is a sequence" would have been more correct but I thought it was implied.
> without a specific purpose there’s no point in doing so
Well yes, prior to the success of transfer learning I could see why you would think that is the case, but if you've been following deep sequence research recently then you would know there are actually immense benefits to doing so, because the learned embeddings can then be reused on downstream tasks.
> it’s purely limited by data availability.
Yes, and transfer learning on models pre-trained on unsupervised sequence tasks provides a (so-far under-explored) path around labeled data availability problems.
I already linked to a paper showing a task that these sorts of approaches outperform, and that is without using the most recent techniques in sequence modeling.
Maybe read the paper in Nature that uses this exact LM technique to predict the effect of mutations before assuming that it doesn't work: https://sci-hub.do/10.1038/s41592-018-0138-4
I am not directly in the field, you are right - but I think you are also being overconfident if you think that these approaches are exactly the same as the HMM/Markov chain approaches that came before.
Thanks for the paper, I’ll check it out; this isn’t my speciality so I’m definitely learning something. Just one minor clarification:
> Maybe read the paper … before assuming that it doesn't work
I don’t assume that. In fact, I know that using ML works on many problems in genetics. What I’m less convinced by is that we can expect a breakthrough due to ML any time soon, partly because conventional techniques (including ML) already have a handle on some current problems in genetics, and because there isn’t really a specific (or flashy) hard, algorithmic problem like there is in structural biology. Rather, there’s lots of stuff where I expect to see steady incremental improvement. In fact, in Wikipedia’s list of unsolved biological problems [1] there isn’t a single one that I’d characterise specifically as a question from the field of genetics (as a geneticist, that’s slightly depressing).
But my question was even more innocent than that: I’m not even that sceptical, I’m just not aware of anything and genuinely wanted an answer. And the paper you’ve posted might provide just that, so I’ll go and do my research now.
Not being in the field, I would term what I see in this story as a ‘bottom up’ approach to understanding genetics/molecular biology. More akin to applied sciences than medicine or health. This, for example, seems to be very important but it still leaves us with a jello jigsaw puzzle with 200 million pieces and probably far removed from immediate utility in health outcomes.
Then there are the more clinically oriented approaches: looking at effects and trying to find the associated genes/mutations, plus whatever mechanisms exist in between, that cause a desirable or undesirable outcome. I’d call that ‘top down’.
I’m sure the lines get blurred more every day, but is there a meaningful division into these and/or more categories, with people working the problem from both ends? If so, are there associated terms of art for them?