Exploring DNA with Deep Learning (floydhub.com)
116 points by ReDeiPirati on Aug 16, 2019 | 13 comments



No examples of "unreasonable effectiveness", or even plain "effectiveness", are given; just a semi-plausible technique and some questions that might be worth answering. I hope the term doesn't get diluted with more examples like this.

It's also not clear why a 2D local representation is being used for 1D data. There is no meaning to the ordering of the rows (the different individuals who make up the samples in the genome), so it doesn't make much sense to encode that into the image. I would also presume that not much meaning comes from neighboring mutations that are separated by a long stretch with no mutations, so the in-row locality should be broken into chunks. Neither of these basic considerations is motivated in the text.

I would guess CNNs have no real effectiveness on this data set at all, and that a different statistical technique should be used instead...


In the paper they try both rows ordered at random and rows ordered by a similarity criterion. They get slightly better results with the ordered version.

I'm still not convinced that it is a good idea to use the "convolutional" part, which in some sense compares each row with its neighboring rows. They get results that are slightly better than other methods, but the improvement is not very clear. (Perhaps the CNN is just calculating an average of the neighboring rows?)

EDIT: Remember that you can edit your comment.


This is the paper: https://academic.oup.com/mbe/article/36/2/220/5229930

Almost every article I've seen that uses that naming trope fundamentally misunderstands Wigner's point in the original (https://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.htm...). What Wigner was hinting at (he never really comes out and says it) is that he thought math and science are two totally different domains, so it's surprising that mathematical models of physical systems would be able to make generalized out-of-scope predictions. Ultimately, the best thing any theory can do is predict something we didn't expect from the previous models, and then have an experimentalist show that the new theory's prediction is more consistent with natural observations. Both relativity and QM have done that repeatedly, which surprised Wigner because he believed math was an independent domain untethered to physics. Many today assume the universe is effectively a physical embedding of a mathematical structure, and that our mathematical theories are simplified approximations of that structure, so it's not super surprising that a good math model makes good physical predictions; I think his article was him hinting at that idea without quite coming out and saying it.

As for why CNNs are useful here... in the generalized genotype-to-phenotype problem, where you are trying to take a list of a person's mutations and predict their physical attributes, there are some phenotypes/traits which are absolutely and totally explained by a single mutation in one gene. In those cases you could train classifiers using simple binary features ("has_mutation_TtoAatPosition37OfChromosome1") and make pretty good predictions.

But most traits are only predictable with complex non-linear models that take many more locations into account, along with the interactions between those locations. In some cases it's 1-2 mutations near each other in a single gene; in others it's 100 different mutations spread throughout the genome; and in others it's many thousands (the variance of height in humans is a good example where a large number of effects combine non-linearly). CNNs are great for dealing with non-linear data with non-local interactions.
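As a toy illustration of that difference, here is a sketch of the two regimes (scikit-learn assumed; all data and site indices are synthetic, with column 37 standing in for something like has_mutation_TtoAatPosition37OfChromosome1):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    # Toy genotype matrix: one row per individual, one binary column per variant site
    # (1 = carries the mutation, 0 = does not).
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(2000, 50))

    # Monogenic trait: fully explained by a single site -> trivial for a linear model.
    y_mono = X[:, 37]
    print(LogisticRegression().fit(X, y_mono).score(X, y_mono))    # ~1.0

    # "Epistatic" trait: an XOR-like interaction of two sites. A linear model on raw
    # binary features can't represent it; a non-linear model can (train accuracy only,
    # to show representability, not generalization).
    y_epi = X[:, 3] ^ X[:, 17]
    print(LogisticRegression().fit(X, y_epi).score(X, y_epi))      # around chance level
    print(RandomForestClassifier().fit(X, y_epi).score(X, y_epi))  # close to 1.0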

Sequence models also work well (I always find it funny that you end up running ML sequence models on DNA sequences) because so much of the signal is found in the neighboring bases. For example, with transcription factors, where a protein recognizes a short chunk of DNA, the recognized window is only 10-20 base pairs long and has significant internal predictability.
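Roughly the kind of thing people build for that (a minimal Keras sketch; the 15 bp kernel width and the random toy data are my assumptions, not anything from the article):

    import numpy as np
    from tensorflow import keras

    # One-hot encoded DNA: shape (batch, sequence_length, 4) for A/C/G/T.
    seq_len = 200
    x = np.eye(4, dtype="float32")[np.random.randint(0, 4, size=(64, seq_len))]
    y = np.random.randint(0, 2, size=(64, 1))   # toy labels: "motif present" or not

    model = keras.Sequential([
        keras.layers.Conv1D(32, kernel_size=15, activation="relu",
                            input_shape=(seq_len, 4)),   # ~15 bp window, motif-sized
        keras.layers.GlobalMaxPooling1D(),               # "was the motif seen anywhere?"
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(x, y, epochs=1, verbose=0)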


Interesting point about Wigner, but about the article...

This study does not examine phenotypes, though the approach might be applied to them somehow. Instead:

>we use simulation to show that CNNs can leverage images of aligned sequences to accurately uncover regions experiencing gene flow between related populations/species, estimate recombination rates, detect selective sweeps, and make demographic inferences

I believe this works well because the sorting of the data (Fig. 2) introduces phylogenetic information into the image to be analyzed. This reminds me of neighbor joining [0], but has some differences. Without this ordering, their method does not work as well.

[0] https://en.wikipedia.org/wiki/Neighbor_joining
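Something in the spirit of that sorting step (a sketch only; scipy's hierarchical clustering leaf order is my stand-in for the paper's actual ordering criterion):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, leaves_list

    # Alignment "image": rows = sampled haplotypes, cols = segregating sites (0/1).
    aln = np.random.randint(0, 2, size=(40, 128))

    # Reorder rows so that similar haplotypes end up adjacent; the leaf order of a
    # hierarchical clustering is one cheap way to push phylogeny-like structure
    # into the spatial layout the 2D CNN gets to see.
    order = leaves_list(linkage(aln, method="average", metric="hamming"))
    aln_sorted = aln[order]   # this reordered matrix is what would be fed to the CNN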


> sorting of the data

Perhaps the main advantage is that it can filter out non-adaptive, non-functional (noisy) mutations this way by simply averaging similar genomes? In that case the rows of the learned M x N kernels should be nearly identical, and one could have simply averaged M data rows at a time and fed them to a 1D CNN.
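i.e. something like this (a rough numpy/Keras sketch of that alternative; M=4 and all shapes are arbitrary choices of mine):

    import numpy as np
    from tensorflow import keras

    M = 4
    # Similarity-sorted alignment: rows = haplotypes, cols = segregating sites.
    aln_sorted = np.random.randint(0, 2, size=(40, 128)).astype("float32")

    # Average each block of M adjacent (already similar) rows...
    pooled = aln_sorted.reshape(-1, M, aln_sorted.shape[1]).mean(axis=1)

    # ...and treat the averaged rows as channels of a 1D signal over the sites,
    # instead of letting a 2D kernel re-learn "average your neighbor rows".
    x = pooled.T[np.newaxis]                 # shape (1, n_sites, n_rows // M)
    model = keras.Sequential([
        keras.layers.Conv1D(16, kernel_size=5, activation="relu",
                            input_shape=x.shape[1:]),
        keras.layers.GlobalAveragePooling1D(),
        keras.layers.Dense(1),
    ])
    print(model(x).shape)                    # (1, 1)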

What other phylogenetic information could possibly be inferred?

Edit: It could also be thought of as data augmentation, since it effectively creates novel inputs each time. IIRC there was also a technique for hardening against adversarial examples which simply fed the network averaged data points along with the original data.


The article's title was edited from "unreasonable effectiveness of CNN in genetic population" after I read it but before my comment was finished.


Oh, I see. The name is taken from the scientific paper linked at the top of the article. I couldn't read it; my comments were based on the article alone.


I like how you make the effort to go back and update your previous comments made in haste.


The real problem with this technique is, of course, normalizing and labeling data.

I have worked on a project which has harmonized and labelled all public RNA samples: https://www.refine.bio


TL;DR: a very entry-level article about the field of popgen ("what is a genome?") and how the "breakthrough" is representing multiple sequence alignments as compact binary matrices. There's very little explanation or actual examples beyond that, so unless you're a complete layman the article probably won't satiate you.

Applying deep learning to genomic data is something of a fad these days - the bioinformatics world has caught up with the DL hype of the early 2010s and is trying to use DL on nearly anything that moves for easy papers.

The main issue with DL frameworks in the context of genomics is the format of the input data. You pretty much want all your data to be a matrix of fixed size (if you want to use CNNs, at least, and that's what everyone is interested in anyway), but that's just not how genomics data works. Sequences vary in length (I see the problem of nucleotide gaps, let alone short indels, is left unanswered), alignments are not absolute (they are very much aligner-dependent, and secondary alignments are a thing), the alignments themselves may stem from different data sources (long reads cover longer stretches of DNA but are less reliable than short ones), and there is no mention of how ploidy is handled (especially in plants!).

Somehow you're supposed to transform all of that into a neat 48x48 array to feed to Keras. Wait, thousands of them. Did I mention that human or plant genomes are often billions of base pairs long? Waiting for bwa to finish mapping on your cluster is the xkcd equivalent of "can't do work, compiling!"
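To make the shoehorning concrete, here's the sort of lossy gymnastics a fixed-size input forces (a toy sketch; the 48x48 shape is just the number I used above, and the pad/crop policy is an arbitrary assumption):

    import numpy as np

    def to_fixed(aln, size=48):
        """Pad/crop a variable-shaped 0/1 alignment into a fixed size x size matrix.
        Anything outside the crop is simply thrown away."""
        out = np.zeros((size, size), dtype=np.int8)
        r, c = min(aln.shape[0], size), min(aln.shape[1], size)
        out[:r, :c] = aln[:r, :c]
        return out

    # Alignments of wildly different shapes all collapse into identical-looking inputs.
    rng = np.random.default_rng(1)
    batch = np.stack([
        to_fixed(rng.integers(0, 2, size=(rng.integers(10, 200), rng.integers(10, 5000))))
        for _ in range(8)
    ])
    print(batch.shape)   # (8, 48, 48)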

So yeah, sorry to put a damper on this but I'm waiting for something within the reach of practical workability (and believe me the standards of bioinformaticians for workable stuff are low) before getting hyped.


> throwing away all the non-mutated locations in the genome because they carry no meaningful information

How can they claim this with certainty? Maybe the context of any given mutation contains useful information, similar to how a word embedding vector captures information such as relatedness to other words.


The non-differing parts are worthless because you are looking for differences. For example, take a group of five people with highly sensitive hearing and a hundred people with normal hearing. You don't care what the sensitives have in common with the normals; you care about whether you can find a mutation present in all the sensitives but in none of the normals. So it makes sense to look only at the differences and throw away the identical parts; keeping them just wastes computational resources.
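In matrix terms that filtering is a one-liner (toy numpy sketch; the 5-vs-100 split mirrors the example above, everything else is made up):

    import numpy as np

    # Rows = individuals (first 5 "sensitives", then 100 "normals"), cols = genome positions.
    rng = np.random.default_rng(2)
    geno = rng.integers(0, 2, size=(105, 1000))

    # Keep only the columns where people actually differ; identical columns carry no signal.
    segregating = geno[:, geno.min(axis=0) != geno.max(axis=0)]

    # The interesting candidates: present in every sensitive, absent in every normal.
    candidates = np.where(geno[:5].all(axis=0) & ~geno[5:].any(axis=0))[0]
    print(segregating.shape, candidates)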


But isn't the genetic code transcribed sequentially and in the context of previously decoded structures? So it's similar to reading a text that references parts elsewhere in the text. If you only keep the changes between multiple subtly mutated texts, then what those changes actually mean becomes an enigma, and, what's worse, the changes are now in the context of nearby changes, which may create completely different meanings.



