As DNA reveals its secrets, scientists are assembling a new picture of humanity (statnews.com)
109 points by yawz on Oct 7, 2016 | 30 comments



Google is using graph learning for representations of stuff where obtaining a huge, representative, "labeled" corpus is hard.

As many in the field can tell you, "labeling" genomes is hard, too. You can stick labels like "died at 39 of AVCR" or "had 5 primary tumors by age 28" or "insane multi-substance abuser with extra toes" on a person, but that doesn't really encapsulate all their traits. I claim that genome analysis is a great candidate for semi-supervised learning. Looks like Ben Paten already had that thought...

Haussler, Paten, McVean, and the usual suspects are working on a tractable graph representation to replace (say) hg39, i.e. instead of a new "reference genome", there should be thousands. This makes more sense when you look at how common structural variants (say, inversions) are, and then when you do something like ask Haussler at ASHG "how do we represent inversions in this thing?" you realize how fcking hard it really is to get it right.

McVean & co. made it work for the major histocompatibility complex, which immunologists can explain better than I can, but that's perhaps the most diverse bit of genome there is. It's riddled with ancient repeat element insertions and generally fascinating. It's also a source of Little Problems like organ transplant rejections, graft vs. host disease in stem cell transplants, maybe bits of schizophrenia and neurodegenerative disease... in short, if it can be done for the MHC, it's probably doable for the whole enchilada. It's not easy, though. In fact it's so difficult to get right that a sub-field of accurate HLA typing and immunotyping evolved in parallel to large-scale genome analysis, because it's really, really important not to fuck this up. So that's a note of caution from historical precedent.

Nonetheless, it turns out that your pals at Google spearheaded the effort to have a Hadoop-like "data lake" for genomes (sign up for the GA4GH mailing lists if you like to watch professors bikeshed an API, and occasionally produce incredible insights by accident). Maybe this is going to converge in an interesting way. It won't happen overnight, but it will happen, and the mathematicians will be vindicated.

* Seven Bridges is a little genomics company with a funny name. Unless you're familiar with Eulerian paths and the Seven Bridges of Konigsberg, in which case it's sort of obvious why they chose their name. Nice people, other than the patent, which infuriated most everyone else.


Hi, I'm a pathology resident and I would like to present a journal club on this. My undergrad is in physics and I have taught myself some bioinformatics, like Durbin's use of Markov models, Ukkonen's suffix tree, BWT, BLAST, but clearly I'm no informatics expert. Two questions:

1) in the absence of a reference genome, does Burrows-Wheeler not apply?

2) Could you recommend any good articles to start from?


There's probably, in spirit, a graph version of the BWT, but I'm not familiar enough to know it. The approach is less straightforward because you're not just modifying the BWT to allow for indels/errors; you need something more like a graph compression algorithm that allows fragment searching along the vertices of a sequence graph. That said, it has to happen eventually. It's been a while since I took comp bio (sequences & graphs) but what you mentioned is what I learned (and I took it from Waterman, so what I'm saying is that I think you've got it).

You can look at Pall's work to see how the current approaches may evolve into something compressible:

https://github.com/GFA-spec/GFA-spec

As far as articles? McVean's, without a doubt!

http://www.nature.com/ng/journal/v47/n6/full/ng.3257.html

An implementation paper for graph assembly HLA typing is at:

http://biorxiv.org/content/early/2015/12/24/035253

Interesting times ahead, with people recognizing that GxE and GxExE matter far more than G alone.


Here is a paper that presents a graph version of the BWT:

http://bioinformatics.oxfordjournals.org/content/29/13/i361....


It's not 100% clear to me that this is directly compatible with the graph reference, but if it isn't, that seems like a simple matter for Batzoglou & co. Those guys are really, really good, and the multi-reference target is a type of graph anyway, so I imagine it's just a bit more generalization (if any) to make BWBBLE work on a graph-structured reference assembly as implemented (there seem to be inconsistencies between implementations in how structural variants are represented).

The other thing that would be neat is that then you'd have a direct tie-in to ancestral recombination graphs and could, in principle, get IBS/IBD for the same cost as high-confidence genotyping for any two individuals. Come to think of it, there's probably a way to recast this as shortest paths and get all admissible traversals between a population of genotyped individuals (given an ARG) for the same price as any two. Hmmm. This is a little disturbing.


There's got to be a lot more to the story that I don't understand.

Why wouldn't it have been obvious 16 years ago that a thoughtfully designed data model was necessary, possibly using graphs, to account for variability and other attributes of the genome?

Surely it was foreseeable that tooling would be crucial and that a solid software foundation would be invaluable to enable efficient and flexible processing for years to come?

Who were the lead developers supporting the original public genome project and what were they thinking?


A lot of the people who worked on the first reference genome are working on these algorithms, or have mentored (or are mentoring) the scientists in this article. I would say there's a lot more to the analysis of genomes than you might expect; there are many different comparisons that make sense. Initially the most informative comparisons were to other species, where a genome graph made less sense at the time, given the amount of data and compute available to bioinformatics scientists.

The rate of technology change for sequencing capabilities in the past 16 years makes Moore's law look like the rate of change in battery technology.

16 years ago I didn't think it would be possible to sequence individuals in the clinic before 2050 or so. Now we are building the technology to analyze the variation in a million genomes.

I guess your question is kind of like, "why didn't computer scientists build systems like Kubernetes or Mesos in the 80s?" The problems and challenges were just different 16 years ago, and there was more than enough to work on between then and now. We don't need genome graphs at this instant, but we will in the coming years. And it's likely that newer representations will come about, too, as more math and theory is invented.


It isn't just an issue of coming up with a good representation of the data. Actually doing something with the graph, like answering the query "is this string a path through the graph?", is hard to do at the necessary scale (you might make a billion such queries after sequencing a genome). The classic string indexing approaches (suffix arrays, fm-index, etc) don't easily generalize to graphs and this is a very active research topic.
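To make the scaling problem concrete, here's a deliberately naive sketch of that query on a toy sequence graph (node names, sequences, and edges are all invented; plain Python, no index at all). Every branch point multiplies the work, which is exactly what an FM-index-like structure for graphs would have to avoid:

  # Naive "does some path through the graph spell this string?" check.
  # Toy node sequences and edges, invented purely for illustration.
  nodes = {"a": "ACGT", "b": "GG", "c": "GA", "d": "TTC"}     # node id -> sequence
  edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}  # adjacency list

  def spells_from(query, node, offset):
      # True if `query` can be read along some path starting at nodes[node][offset]
      rest = nodes[node][offset:]
      if len(query) <= len(rest):
          return rest.startswith(query)
      if not query.startswith(rest):
          return False
      return any(spells_from(query[len(rest):], nxt, 0) for nxt in edges[node])

  def is_path(query):
      # Brute force: try every starting offset in every node
      return any(spells_from(query, n, i)
                 for n in nodes for i in range(len(nodes[n])))

  print(is_path("GTGGT"))  # True: reads ...GT|GG|T... via a -> b -> d
  print(is_path("GTGAC"))  # False for these toy sequences

Multiply that backtracking by a billion reads and it's clear why the indexing question matters.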


They were extremely limited in the resources they could bring to bear on the genome.

Sequencing was expensive and complicated, so even by sequencing several people the best they could do was make one composite image of the whole genome. Representing things as a graph would have provided no benefit at the time, and so people didn't do it.

Now we have thousands of public sequenced genomes. We need a new model for how we manage genomes. We are also learning how much is missed in the standard linear model of the genome, and need a way to incorporate new information into the reference. This all takes time.

Graphs are a good way of encoding our prior knowledge of genomes, but they are also difficult to understand for researchers who have grown up on linear representations.


Simplicity first. How do you start thinking of a complicated data model when collecting one sample costs nearly 100 million dollars? A good data model develops when you know what questions you want to ask of the data.

Looking back and expecting otherwise is probably a case of hindsight bias. Also worth emphasising is that the technology (and methodology) is moving very fast.


There are some sophisticated answers being given, but the simplest answer is that everything is obvious once it's been explained. It's a good principle to remember.


There has always been a fight in genomics between bench-types and computational biologists. Traditionally molecular biology yielded very binary yes/no answers requiring little mathematical analysis -- even when I was in grad school in the 1990s, I had a professor say in all seriousness "If you need statistics to understand the results of your experiment, you did the wrong experiment". Things are changing and many of the current generation of grad students are becoming hybrid bench/computational biologists, which is a good thing.


I interviewed at Seven Bridges.

Very interesting company. They had a typical whiteboard interview process.

What they were doing didn't seem that technically hard, and they were more concerned with prior credentials (like most bioinformatics companies) than with what you were able to do.

They also seem to think JavaScript is not worth their time :( The problem they gave me was algorithmic and I just used the tool that was available to me. They apparently write most of it in C++ due to "Speed".

The only reason I even got a face-to-face, even though I didn't even have a Master's degree, was that I did the assignment better than their PhD candidates (their words).

This piece seems like a submarine article for their proprietary platform.


Genetics question:

How many cells do you need in a population so that there is at least one variant at each position (ie a SNP)? This can be for any species or cell line for which the info is available.


Humans have two copies of ~3 billion base pairs. A reasonable error rate of DNA replication (not under stress) is about 1 write error in 1 billion reads. Many of those errors are actually immediately corrected by post-replication error correction mechanisms. Also, as cells divide, errors that happen early will be propagated more times than errors that happen late in development. Further, mutations are not uniformly distributed throughout the genome. A human has > 10^14 cells, so any given human might have a few thousand different variants among its cells. The error rate of a human polymerase is far lower than that of a bacterial polymerase. And a human genome is far larger than a bacterial (or viral) genome. Viruses actually have mechanisms to increase the error rate of DNA replication.

For a human, you would need a LOT more than a single human to have a variant at each position. Something like HIV can actually have a small enough genome, a high enough error rate, and a large enough population in an infected patient, that it can realistically have a unique variant at each position in its genome within a single human host.


Thanks, I later found this paper that claims something different. Can you explain where this logic has gone wrong? They seem to start out with a similar error rate (~10^-9 per site per division):

"For example, the intestinal epithelium contains approximately 10^6 independent stem cells, each of which generates transient daughter cells every week or two. Thus, the intestinal epithelium of a 60-y-old is expected to harbor >10^9 independent mutations. This implies that, not far beyond the age of 60 y, nearly every genomic site is likely to have acquired a mutation in at least one cell in this single organ." http://www.pnas.org/content/107/3/961.full

Edit:

>"A reasonable error rate of DNA replication (not under stress) is about 1 write error in 1 billion reads"

I think the issue is that you are using the 10^-9 value as per genome, while that reference uses it as per basepair.
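For what it's worth, the per-basepair reading does reproduce their numbers. A rough sketch with round figures (my assumptions, not the paper's exact parameters):

  # Back-of-the-envelope check of the PNAS claim, taking ~1e-9 as the rate
  # per basepair per division (round numbers, purely illustrative)
  stem_cells   = 1e6       # independent intestinal stem cells
  div_per_week = 1 / 1.5   # one division every week or two
  years        = 60
  genome_bp    = 6e9       # diploid
  mu_per_bp    = 1e-9      # mutations per basepair per division

  total_divisions   = stem_cells * div_per_week * 52 * years   # ~2e9
  muts_per_division = mu_per_bp * genome_bp                    # ~6
  total_mutations   = total_divisions * muts_per_division      # ~1e10

  print(f"{total_divisions:.1e} divisions, {total_mutations:.1e} mutations")
  print(f"mutations per genomic site: {total_mutations / genome_bp:.1f}")

With more mutations than there are sites, "nearly every genomic site mutated in at least one cell" roughly follows. Read the same 10^-9 as per genome instead, and muts_per_division collapses by about ten orders of magnitude, which I think is where the disagreement comes from.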


Write error == (fixed) base pair mutation. A classic example is a methylated cytosine spontaneously deaminating to yield thymine. Our DNA repair enzymes can't tell the difference (in terms of which is the "right" base) between the thymine and the guanine left behind, but since they don't match, one of them has got to go. Thus there is a 50-50 chance that the mutation will be fixed. That's the easiest example because it's not an "error" per se (rather a thermodynamics problem) but genuine proofreading errors also occur. The net rate is about one in a billion bases.

These estimates ignore indels and SVs, but empirical evidence supports the rule of thumb that "everyone over 50 has a 50/50 chance of at least one adult stem cell harboring a mutation in at least one interesting gene". My personal favorite is

http://www.nature.com/nature/journal/v518/n7540/abs/nature13...

but Druley's follow up was equally awesome:

http://www.nature.com/articles/ncomms12484

and the recent survey of adult stem cell mutations is nice:

http://www.nature.com/nature/journal/vaop/ncurrent/full/natu...

One take-away from all this is that, while 95% of a sensitively surveyed population of 50-60 year olds had at least one stem cell with at least one known preleukemic mutation, it is equally clear that most people aren't walking around with anything resembling an acute leukemia. The natural conclusion is that in individuals with a competent immune system and diverse enough pools of healthy stem cells, it's not that big of an issue. Only when bad luck and/or stresses to which the mutants are adapted (e.g. TP53 mutations in therapy-related leukemia) afflict people, or the natural diversity of their stem cell populations collapses (as with really old people and individuals whose immune system actively attacks their stem cells, as in severe aplastic anemia) do you see the sort of massive, life-endangering takeover that we recognize clinically as disease.

Furthermore, nearly all of us are born with 5-10 predicted-to-be-lethal variants in our genomes. Clearly, we're also not dead, so our conception of "lethal" can't be quite right. There is an enormous amount of complexity in how real live multicellular organisms deal with variation and mutation, something we're really only just starting to grasp, and of course all of that then interacts with the person's environment to manifest (or not) their genetic tendencies. We build models of reality because the actual thing is too complicated to be tractable; it's important never to confuse the two :-)


p.s. I edited the earlier response too many times already...

But my point is that I think a key epidemiological variable should be the age of peak incidence. This has been largely missed due to various common practices like:

  1) Binning into 5/10 year age groups
  2) Looking at age adjusted data
  3) Truncating the age-specific incidence data at 70-85 
     years old because the later data is deemed unreliable


>" The natural conclusion is that in individuals with a competent immune system and diverse enough pools of healthy stem cells, it's not that big of an issue."

Thanks, I have been thinking along those lines for a few years now after looking at the age-specific incidence of a bunch of different cancers from SEER. You see that many cancers peak consistently year after year at a given age, while the height of the curve may change drastically. The same was true when I looked at some data from other countries, although I never followed up very much on that aspect.

Then if you read the paper which spawned the multi-stage model of cancer that has been widely adopted[1], you see they make some assumptions for computational reasons that are unnecessary in these days of cheap computing power:

  p*t ~ 1 - (1-p)^t,  if p << 1,
  where
  p = probability a required mutation occurs during a given time interval
  t = number of elapsed time intervals (ie age)

Then by the product rule of probability they derive that, if cancer is due to accumulation of errors (usually considered to be mutations), the incidence at a given age would be:

  I(t) = k*p1*p2*...*pn*t^n = k*(p'*t)^n
  where
  I(t) = incidence at age t
  n    = number of required mutations
  p'   = geometric mean of the probabilities for mutations 1:n
  k    = a constant determined by the number of cells in each tissue,
         the proportion of times that a detectable tumor forms from 
         the carcinogenic cell, and possibly the sequence in which the
         mutations occur

If you use the non-simplified version of their theory you would instead get:

  I(t) = k*(1 -q^t)^n
  where
  q = 1-p'

In contrast to the model that was simplified for computational reasons, this has a turnover. By setting the second derivative to zero you can get the age at which the peak incidence should occur as a function of the number of required mutations (n) and the geometric mean p' of the per-mutation probabilities (with q = 1-p'):

  t_peak = log(1/n, base = q)

From this you will see that either the multi-stage model is totally wrong, the error rate must be much higher than commonly thought, and/or the cell division rate of the error-accumulating cells must be much higher than commonly thought (age is usually taken as a stand-in for the number of divisions). The last two possibilities suggest that we are constantly generating these cancerous cells and they are being cleared somehow.
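To see the turnover numerically, here is a small sketch of the full form with illustrative values of n and p' (my numbers, chosen only to show the shape, not fitted to anything):

  import math

  def incidence_full(t, n, p, k=1.0):
      # Full (non-simplified) form from above: I(t) = k*(1 - q^t)^n, with q = 1 - p'
      q = 1.0 - p
      return k * (1.0 - q**t) ** n

  def t_peak(n, p):
      # Setting the second derivative to zero gives q^t = 1/n,
      # i.e. t_peak = log(1/n, base = q)
      q = 1.0 - p
      return math.log(1.0 / n) / math.log(q)

  n, p = 6, 1e-3          # 6 required mutations, p' = 1e-3 per time interval
  print(t_peak(n, p))     # ~1791 intervals; with p' = 1e-2 it drops to ~178
  print([f"{incidence_full(t, n, p):.2e}" for t in (20, 40, 60, 80)])  # rises monotonically toward k

Plug in anything close to the commonly quoted per-site, per-division mutation rates and t_peak lands far beyond a human lifespan, which is the point above: either the model, the error rates, or the division counts have to give.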

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2007940/


Armitage-Doll was a nice advance over previous models, but it's also incomplete (and probably flat wrong in some cases, although working on pediatric malignancies has convinced me that a second cooperating event usually is mandatory).

In normal stem cells, it appears that attrition and immune clearance get rid of damaged cells when they cycle, and of senescent cells all the time (subject to some variation, not entirely age related, at least in our volunteers). We may expect higher rates in filter organs, but liver cancer isn't too common, and the paper I referenced earlier shows that this can't be just an issue of fewer divisions (I despise the oversimplified Tomasetti & Vogelstein paper because the facts simply don't support it). Colorectal is probably more common because the crypts are "facing out" a la melanocytes, and thus prone to accumulating lots of environmental damage.

Anyways, the latter of your possibilities (proliferative mutants divide faster and err more often than their normal counterparts) makes the most sense -- the eventual "winner" in a tumor is the cell that produces the most progeny and best resists apoptosis due to stress. It's probably not a coincidence that these are traits which adapt a mutated cell to survive chemotherapy as well. However, spawning nonself mutations willy-nilly is a great way to attract immune attention -- particularly if you haven't blown the immune system away by nuking it with chemotherapy. :-/


>"pediatric malignancies"

Maybe, or you could use the full Armitage and Doll model I described above and replace t with something like a discrete exponential decay, where N(t) = number of divisions since the zygote as a function of time, i.e. N(t) = N0*(1 - k)^t + 1, where N0 = N_birth - N_adult.

That is, take the difference between the division rate at birth and the division rate as an adult, and fit a constant k between zero and one. It is just a first approximation at best, because data on division rate by age in various tissues doesn't seem to be available...

https://s18.postimg.org/9cn5vi8t5/div_Rate.jpg


The majority of pediatric malignancies are either germline related, in utero (de novo mutation/SV, potentially caused or facilitated by maternal environmental exposures), or a reverse lottery winner. There simply isn't enough time for somatic mutation to cause the sort of devastating fallout that you see in DIPG or infant leukemias. (Furthermore, even the point mutations seen in pediatric cases are characteristic and rare or absent in adults; some structural variants are also observed in adults, but they are much rarer and accompanied by fewer cooperating events)


>"There simply isn't enough time for somatic mutation to cause the sort of devastating fallout that you see in DIPG or infant leukemias."

This depends on the error and division rates (along with the number of cells) in that tissue at that age, though. From my research, there isn't really such data available on any of those terms. Also, the errors need not be somatic point mutations. For example, chromosomal missegregation may be much more common and potent, since it can mess up the expression of many genes at once:

"Nevertheless, the rate of chromosome missegregation in untreated RPE-1 and HCT116 cells is  0.025% per chromosome" https://www.ncbi.nlm.nih.gov/pubmed/18283116

I'm just saying there are a number of other assumptions being made here, and if we get rid of the standard ones the Armitage-Doll model is capable of fitting the data surprisingly well.


NCBI has dbSNP, a database of important SNPs, with statistics on the variation of alleles among different populations. See e.g. [1]. From there you could compute the answer to your question.

[1] https://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs6564851


Is there an open source project that tries to replicate the Seven Bridges proprietary graphing technology? It seems like a logical next step.


There are many people working on various aspects of DNA graphs - for example, there's now the FASTG format, a replacement for FASTA that is essentially a graph in text form: http://fastg.sourceforge.net/

Some assemblers (SPAdes?) have started to support this format, but most downstream software only uses the FASTA format (the non-graph version from a single genome/representation).

Heng Li (now somewhat the godfather of bioinformatics) wrote a blog post about various implementations here: https://lh3.github.io/2014/07/25/on-the-graphical-representa...


Just realised that most of the above is from 2014 - here are some recent works:

A novel data structure to store the whole graph: https://almob.biomedcentral.com/articles/10.1186/s13015-016-... Software is here: https://www.uni-ulm.de/in/theo/research/seqana.html

This one maps the whole DNA space using markers: http://www.nature.com/articles/ncomms7914

Here's a paper that looks at the total genetic space of several individuals, but with read mapping alone, no graphs: http://genomebiology.biomedcentral.com/articles/10.1186/s130...

Most of this is based on open source or openly available software. A big company that's closed source is NRGene from Israel; I've read good things about their DeNovoMAGIC/PanMAGIC, but I'm unsure how that stuff works exactly (apart from massive short-read coverage).


The more relevant continuation of Heng Li's posts is Pall Melsted et al's work on GFA: https://github.com/GFA-spec/GFA-spec

This is pretty close to my area...
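For anyone who hasn't opened it: GFA is just tab-separated text, with S lines for sequence segments (nodes) and L lines for links between oriented segment ends (edges). A toy reader, written from memory of GFA 1.0, so check the spec above for the authoritative field definitions:

  # Minimal GFA 1.0 reader: S = segment (node), L = link (edge).
  # Records are invented for illustration; fields are tab-separated.
  gfa_lines = [
      "H\tVN:Z:1.0",
      "S\t1\tACGT",
      "S\t2\tGGTA",
      "S\t3\tTTC",
      "L\t1\t+\t2\t+\t0M",   # segment 1 (+) links to segment 2 (+), 0M overlap
      "L\t1\t+\t3\t-\t0M",
  ]

  segments, links = {}, []
  for line in gfa_lines:
      fields = line.split("\t")
      if fields[0] == "S":
          segments[fields[1]] = fields[2]    # name -> sequence
      elif fields[0] == "L":
          links.append(tuple(fields[1:6]))   # (from, from_orient, to, to_orient, overlap)

  print(segments)  # {'1': 'ACGT', '2': 'GGTA', '3': 'TTC'}
  print(links)     # [('1', '+', '2', '+', '0M'), ('1', '+', '3', '-', '0M')]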


The main one is vg, which I think Benedict Paten is actively supervising, along with Richard Durbin. The main developer is Erik Garrison. I'm pretty surprised it wasn't mentioned by name: https://github.com/vgteam/vg


I'm highly interested in learning about genomics. What's the best way for an electrical engineer (signal processing, information theory, graph theory) to start gaining more insight into this field?





