Darwin: a genomics co-processor provides up to 15,000x acceleration (acolyer.org)
158 points by godelmachine on April 19, 2018 | 49 comments



> The long read technology comes with a drawback though – high error rates in sequencing of between 15%-40%

These error rate estimates are seriously outdated (the quoted number is from a 2015 paper but it was obsolete pretty much as soon as that paper was published). Long-read technologies are evolving rapidly, so this is important. The state of the art is working with error rates of at most 15% (but usually much less), which, after correction, go down below 8% [1].

This is crucial because it makes the difference between a successful and a failed assembly: 40% error rate essentially means that you need different algorithms to perform sequence assembly (as the article notes, error correction takes up “orders of magnitude” more time than actual assembly).

I’d therefore be curious how this Darwin setup performs against conventional, state-of-the-art sequence assembly with state-of-the-art long read sequencing data.

[1] https://www.nature.com/articles/nbt.4060


Read page 8 [206] of the paper to see the numbers they're reporting. The PacBio reads are listed at up to a 15% error rate; their experiments used synthetic data with 15-40% error rates.

A few comments:

1. This was published in ASPLOS, not a comp bio journal, which should raise some eyebrows.

2. As another commenter said, error correction requires a reference, whether by alignment, k-mer filtering, or consensus. However, this is usually part of any assembly pipeline.

3. Like you said, they failed to compare against actual state-of-the-art alignment tools. They cite Canu, but they don't use it as their baseline. The issue here is that the current best long-read assemblers use locality-sensitive hashing (see the sketch below). The Canu paper even states that miniasm can produce a higher-quality assembly for CHM1 in 1/400th of the CPU time (though their discussion of the table selectively ignores this fact). The fact that a combination of algorithm and implementation alone provides a 400x speedup makes this result look significantly less impressive, and a comparison of CPU time is slightly unfair to miniasm given how well it scales across threads, which would make the other tools look even worse. Additionally, for problems where you really need the improvement in performance, the GPU RAM limitations and communication overhead would likely incur significant penalties for their method. It's not simply a matter of how many reads per second you can process, because overlap is an inherently quadratic problem.
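
To make the LSH point concrete, here is a minimal MinHash sketch in Python (the flavor of LSH used by MHAP; minimap/miniasm use the related minimizer idea). It is purely illustrative, not any tool's actual code: the k-mer size, number of hash functions and candidate threshold are arbitrary choices. The point is that cheap per-read signatures let you shortlist candidate overlaps instead of aligning every read against every other read.

    import hashlib
    from itertools import combinations

    def kmers(seq, k=15):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def minhash_signature(seq, k=15, num_hashes=64):
        kms = kmers(seq, k)
        sig = []
        for h in range(num_hashes):
            # Salting each k-mer with the hash index simulates a family of hash functions.
            sig.append(min(
                int.from_bytes(hashlib.blake2b(f"{h}:{km}".encode(), digest_size=8).digest(), "big")
                for km in kms))
        return sig

    def estimated_jaccard(sig_a, sig_b):
        # The fraction of matching minimums approximates the k-mer Jaccard similarity.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    reads = {
        "r1": "ACGTACGTGGCTAGCTAGGATCCGATCGATCGGCTAGCTAACGTACGTGG",
        "r2": "GGCTAGCTAGGATCCGATCGATCGGCTAGCTAACGTACGTGGTTTTCCCC",  # shares a 42 bp overlap with r1
        "r3": "TTTTAAAACCCCGGGGTTTTAAAACCCCGGGGTTTTAAAACCCCGGGGAA",  # unrelated
    }
    sigs = {name: minhash_signature(seq) for name, seq in reads.items()}
    for a, b in combinations(reads, 2):
        score = estimated_jaccard(sigs[a], sigs[b])
        if score > 0.1:  # only these candidate pairs would go on to full alignment
            print(a, b, round(score, 2))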


Their 40% number comes from the paper I referenced (i.e. that’s the one the Darwin paper references). I completely agree with your point 2. I’m not sure I agree with your first point: I think it’s pretty reasonable for it to have been published in that venue.


In general, error correction requires alignment, so the correct comparison is error correction using Darwin for alignment vs. error correction using whatever long-read aligner you consider state-of-the-art.


Serious question: if these processing problems are classified as NP, couldn't the processing be outsourced to a blockchain? Why have miners solve useless problems with no lasting impact, like counting zeros in a SHA hash, instead of processing data such as this? I can easily see the mining problem in a given blockchain being based on real scientific problems that need solving.

I'm looking for someone to help me understand why the above isn't possible.


Seeing this genuinely surprised me; I have been working on exactly the topic you mention. Coinami[0] (Coin-Application Mediator Interface) is a blockchain application in which scientific problems are distributed grid-style and solvers are rewarded with cryptocurrency. I won't go into detail, but our first prototype distributed read mapping to volunteers.

Now, I will list the problems we have encountered along the way.

First, you need problem suppliers. A blockchain doesn't need one: the hash puzzle generates itself, and its difficulty is adjustable.

Second, the proof-of-whatever must be very easy to verify; the ratio between the cost of solving it and the cost of verifying it must be really high. Unfortunately, read mapping is not such a problem. Yes, it is still NP-complete, but verification takes far longer than a blockchain can tolerate (see the toy hash-puzzle sketch after this list).

Finally, privacy concerns. You must use encrypted, anonymized and distributed data, and current read-mapping techniques do not accommodate such needs. There are proposals[1] to satisfy these requirements, but they focus on hybrid-cloud techniques, which means network bandwidth is not a concern for them. Blockchain applications, however, are greatly constrained by network bandwidth.
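
For contrast, here is a toy proof-of-work sketch in Python (not any real coin's code) showing why the hash puzzle is so convenient as "work": the problem generates itself from the block contents, its difficulty is a single tunable parameter, and verification is one hash call no matter how long the search took. The 8-byte nonce and the specific difficulty value are arbitrary choices for the example.

    import hashlib

    def solve(block_data: bytes, difficulty_bits: int) -> int:
        """Search for a nonce; expected work doubles with every extra difficulty bit."""
        target = 1 << (256 - difficulty_bits)
        nonce = 0
        while True:
            digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce
            nonce += 1

    def verify(block_data: bytes, nonce: int, difficulty_bits: int) -> bool:
        """Verification is a single hash, regardless of how hard the search was."""
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

    block = b"some transactions"
    nonce = solve(block, difficulty_bits=20)         # roughly 2^20 hashes on average
    print(verify(block, nonce, difficulty_bits=20))  # True, checked with a single hash

Read mapping offers nothing like that solve/verify gap.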

*edit: references

[0] https://arxiv.org/abs/1602.03031

[1] https://www.nature.com/articles/ncomms15311#


Awesome! thanks for the reading material :D

1) The problem supply can be tackled in many many ways.

2) The verification problem is exactly why I stated NP as the problem class: the verification process is often relatively simple and straightforward. Obviously, if that isn't the case, then perhaps the problem isn't well suited to this kind of environment.

3) The bandwidth issue is not something that I had considered. Makes absolute sense when you put it that way.


> The verification problem is exactly why I stated NP as the problem class: the verification process is often relatively simple and straightforward

Just for reference, don't forget that NP behaviour is asymptotic: checking a solution in polynomial time can still be pretty costly in practice.


Thank you for posting this. I learned a lot about Coinami that I didn’t know.


A blockchain is a data structure; it doesn't have processing power.

Miners solve nonsense problems because that's the simplest kind of problem that makes the proposition work; anything on top of that requires more complexity.

So it's not that it isn't possible; it's just that it would be charity, from the point of view of the developers of the software, to spend brain cycles on something they don't strictly need to achieve their goals. I'm pretty sure that at some point there will be 'green' cryptocurrencies, call them 'greencoins', that make a play at avoiding the energy waste that digital currencies are becoming associated with.

The same happened in other industries; I don't see why digital currencies couldn't follow a similar path.


I think it’s clear what he meant and you’re being purposefully obtuse. If I said I was going to put a problem “on the internet”, you would know I didn’t mean the networking infrastructure was going to produce a solution.


When Ethereum implements Proof of Stake it will be quite green. In my opinion this will be the death knell for wasteful PoW like bitcoin.


I thought about this as well, but the problem is that you don't want just ANY NP problem: you want an NP problem (it doesn't even have to be NP-complete) whose difficulty you can easily adjust.

How do you adjust the difficulty on something like this?

Also, you always have to calculate the "hashes" of the blocks. I guess you could encode a block as a graph and its "hash" could be the shortest path that visits all the edges, but how useful would that be?

When I looked into this in the past, I stumbled into Gridcoin[1], but I don't know how they deal with the difficulty/usefulness of the solutions.

[1]: https://www.gridcoin.us/


Thanks for the link. I'll look into it!

> you can easily adjust the difficulty.

Take prime factorization, for instance, which, like you say, isn't even NP-complete: you can easily adjust the difficulty by increasing the size of the number in question.
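
A rough sketch of that asymmetry (naive trial division, purely for illustration; the primes and the factoring method are arbitrary choices): finding the factors gets slower as the number grows, while verifying a claimed factorization is a single multiplication.

    import time

    def trial_division(n):
        """Naive factorization; the cost grows with the size of n's smallest prime factor."""
        factors, d = [], 2
        while d * d <= n:
            while n % d == 0:
                factors.append(d)
                n //= d
            d += 1
        if n > 1:
            factors.append(n)
        return factors

    # A semiprime built from the 10,000th and 100,000th primes.
    p, q = 104729, 1299709
    n = p * q

    start = time.perf_counter()
    factors = trial_division(n)
    print(factors, f"({time.perf_counter() - start:.3f}s to find)")

    # Verification, by contrast, is a single multiplication.
    assert factors[0] * factors[1] == n

Pick bigger primes and the search cost explodes while the verification stays a one-liner.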


I'm not sure if that's completely true for prime factorization. There are numbers that are larger, yet easier to factorize.

For example, I think that a power of two, like 1024, is much faster to factorize than, let's say, 1001 (7 × 11 × 13). (I'm not quite sure this example is true.)

Here's a more detailed description: https://mathoverflow.net/questions/249266/classes-of-numbers...


Gridcoin relies on BOINC for the work distribution: essentially, you get credit based on how long it took you to complete the task. To prevent cheating, most projects rely on a quorum of identical results, with each task sent to several users.

This is not used as a direct mining mechanism; the blockchain is secured using Proof of Stake rather than Proof of Work.
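
A toy sketch of the quorum idea (illustrative only, not BOINC's actual validator logic; the function name and threshold are made up): the same work unit goes to several volunteers, and credit is only granted for results that agree with the majority.

    from collections import Counter

    def validate_quorum(results, min_quorum=3):
        """results maps volunteer id -> submitted answer for one work unit.
        Returns (canonical answer, credited volunteers), or None without a majority."""
        if len(results) < min_quorum:
            return None
        answer, votes = Counter(results.values()).most_common(1)[0]
        if votes <= len(results) // 2:
            return None
        credited = [user for user, r in results.items() if r == answer]
        return answer, credited

    print(validate_quorum({"alice": "42", "bob": "42", "carol": "41"}))
    # ('42', ['alice', 'bob'])  -- the divergent result earns no credit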


One fundamental problem is that the problem you solve must in some way "sign" or validate the set of transactions you want to add to the blockchain. You can't just solve anything and say "I did some work".

So while we could hypothetically use NP-complete problems (although several issues arise around keeping the difficulty consistent), it's hard to do useful work, because we need the input to be fundamentally linked to the transactions in the block (and to the hash of the previous block).
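
A sketch of that linkage (heavily simplified; real chains commit to transactions through a Merkle root and carry more header fields): the nonce search runs over a digest that already commits to the previous block's hash and the transactions, so the finished "work" is worthless for any other block.

    import hashlib, json

    def block_header(prev_hash: str, transactions: list, nonce: int) -> bytes:
        # Committing to prev_hash and the transactions ties the proof of work
        # to exactly this block at exactly this position in the chain.
        return json.dumps({"prev": prev_hash, "tx": transactions, "nonce": nonce},
                          sort_keys=True).encode()

    def mine(prev_hash, transactions, difficulty_prefix="0000"):
        nonce = 0
        while True:
            digest = hashlib.sha256(block_header(prev_hash, transactions, nonce)).hexdigest()
            if digest.startswith(difficulty_prefix):
                return nonce, digest
            nonce += 1

    txs = ["alice->bob: 5", "bob->carol: 2"]
    nonce, digest = mine("00ab...hash of previous block...", txs)

    # Changing a single transaction (or the previous hash) invalidates the nonce:
    tampered = hashlib.sha256(
        block_header("00ab...hash of previous block...", ["alice->bob: 50", "bob->carol: 2"], nonce)
    ).hexdigest()
    print(digest.startswith("0000"), tampered.startswith("0000"))  # True, almost certainly False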


This answers my question. Thanks :D


It's difficult to generate hard instances of problems. Many (all?) NP-complete problems are easy on average. For example, almost all uniformly random SAT instances can be solved quite easily.


That's not my understanding of the NP domain. Take prime factorization, for example, which isn't even NP-complete, or Sudoku, which is. Both of these problems can be made really hard by changing one small variable: size.

Edit: I understand that your argument is about the average case and not the worst case.


If you pick a random number, the chances are really good that it has many small factors and hence is a lot easier to factorize than its number of bits might suggest. Half of all numbers are divisible by two; two thirds of all numbers are divisible by two or three.

Regarding Sudoku, my hunch is that there is a critical density of givens at which random problems are harder than average. That's the case for SAT: there is a phase transition in the clause-to-variable ratio where, above a critical value, almost all instances are unsatisfiable and, below it, almost all are satisfiable. Instances right at the threshold tend to be hard for current solvers.
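
To put a number on the "two thirds" claim (a quick empirical check, nothing more): by inclusion-exclusion, 1/2 + 1/3 - 1/6 = 2/3 of integers are divisible by 2 or 3, and a random sample agrees. The sample size and range below are arbitrary.

    import random

    random.seed(0)
    sample = [random.randrange(2, 10**12) for _ in range(100_000)]

    div_2_or_3 = sum(1 for n in sample if n % 2 == 0 or n % 3 == 0) / len(sample)
    print(f"divisible by 2 or 3: {div_2_or_3:.3f}  (1/2 + 1/3 - 1/6 = {2/3:.3f})")

    # Most random numbers shed small factors almost immediately, so a "random"
    # factoring challenge is usually far easier than its bit length suggests.
    has_small = sum(1 for n in sample if any(n % p == 0 for p in (2, 3, 5, 7, 11, 13))) / len(sample)
    print(f"has a prime factor <= 13: {has_small:.3f}")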


One argument is that if the blockchain problem coincides with a "real scientific problem" that needs solving, and that scientific problem leads to some application that can yield a monetary reward (maybe a big "if"), it becomes cheaper to launch a 51% attack against the blockchain: the attacker could subsidize their attack by being paid to do science.


Not so serious answer: Hurray for that? :D

Serious answer: Things would obviously have to be designed differently. This isn't about distributing "economic power into the hands of the people" or some other Satoshi notion. The thought I have is more along the lines of Folding@Home:

http://folding.stanford.edu/


If this isn't about some "Satoshi notion", then what does blockchain have to do with it? Just do something like Folding@Home. There are several projects like that.


What makes the attacker different? If there was a possibility to subsidize mining by being paid to do science, why would other miners not take advantage of this also?


Generally, blockchain "mining" requires a problem that is hard to solve, easy to verify. If I understand it correctly, the genome sequencing problem is hard to solve, hard to verify.


No, a given assembly is easy to verify. The complexity in actual application comes from the fact that performing the sequencing part (the biological/mechanical part) again will yield a different assembly due to biological and technical variability. Another complexity comes from the fact that you can define different measures of “goodness” (and hence optimality), depending on which errors you model.

But from a computational standpoint, sequence assembly (using a given error metric) is simply NP-hard.


What you just said is the "definition" of NP problems, which is why I had it in my question to begin with. Even if genome sequencing is not a good problem for mining, there are plenty of other scientific problems that are. Yet mining today is not tackling any of them :/


The miner's identity (more or less) has to be one of the inputs to the problem, so that everyone is actually working on a different problem and you can't "steal" a block just by broadcasting another person's solution faster and wider.

Maybe there could be useful problems with this property, but it's not trivial to find them.


Would the solving of the problem have a "difficulty" that could be adjusted to ensure that blocks appeared approximately every ten minutes? Is checking that the solution is valid a very fast operation? Are we sure that there is no "final" answer which would make the calculation unusable at some point? (Those are the three main factors for choosing what kind of work to do, as far as I can see)


> Would the solving of the problem have a "difficulty" that could be adjusted to ensure that blocks appeared approximately every ten minutes

I am fairly certain you can design your problem to behave in a similar manner, or in some manner that increases the complexity based on some criterion.

> Is checking that the solution is valid a very fast operation?

That's the NP part of my question. There are plenty of scientific problems out there that are fast to verify but hard to compute, e.g. prime factorization.

> Are we sure that there is no "final" answer which would make the calculation unusable at some point?

I would argue, and I might be wrong due to my ignorance on the subject, that we can never "know" that. That's why P=NP is still an open question.


There are droves of distributed computing projects, mostly based on BOINC [1], including projects aimed at biochemistry applications [2].

[1] https://boinc.berkeley.edu/

[2] https://boinc.berkeley.edu/projects.php


There have been a lot of co-processors in the genomics and sequencing world, and they have all failed for obvious business reasons. It's not even clear that any genomics problem that exists today can't be solved with conventional hardware. Most genomics programs reach only a few percent of the hardware's capacity, so I think more software tuning and design is the right approach.


I agree. The problem in genomics isn't how fast alignment goes; it's where to put the gobs and gobs of data.

The bottleneck in speed is how fast you can flip a DNA sequencer, and right now, with an Illumina NovaSeq, that bottleneck is a few days per run. That gives plenty of time to process each run's data on an HPC cluster with current software for alignment and SNP calling.

However, when each runfolder takes up 30 TB and you're flipping 20 runfolders a week, that's roughly 600 TB of new data every week: the scale of storage gets large, fast.


That’s why compression of sequencing data is becoming so important. CRAM is nowhere near the theoretical limit (CRAM achieves ~2x over BAM; commercial products currently achieve up to 6x).

(COI disclaimer: I work for such a company.)


No need for the disclaimer; post a link to your benchmarking results.

We don't compress as well as CRAM, but distributed reads across an Apache Spark cluster are much more efficient via Parquet

http://adam.readthedocs.io/en/latest/benchmarks/storage/


It's not clear you want to store your sequence data maximally compressed unless you're archiving it. If you're doing live work on the data, it's better to have it modestly compressed with a fast decompressor and a good index, so you can minimize your total access.


That’s plausible but it turns out not to be true. Disk read latency is larger than the overhead of our decompression, even on fast storage media. As a consequence, better compression actually leads to (slight) performance improvements. Additionally, genomic data is often accessed via relatively slow network storage on clusters or, worse, via the internet. Increasing throughput trumps all other considerations here. You’re right concerning random access and indexing, of course.


I counter your argument: I have constructed systems that did this. Disk read latency doesn't matter for streaming reads.

The actual math for this for a product at scale is interesting; I built such a product and did the math, and we found that moderate compression gave higher rates. However, this is based on a production-class infrastructure.


I should also mention: using more CPU to decompress highly compressed data means you have less CPU available (in a multithreaded app in a resource-managed environment) for other work. Since after decompressing you're going to be doing a bunch of other work (and a lot of that work will be going on simultaneously in other threads), using less CPU to decompress can give higher throughput. But, again, this is a very complex problem.
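
As a toy illustration of that tradeoff (generic zlib on a synthetic payload, nothing like a genomic-grade codec such as CRAM; the sizes and levels are arbitrary): higher compression levels buy a smaller payload for more CPU, and whether that wins depends on how fast your storage is and how busy your cores are.

    import random, time, zlib

    random.seed(1)
    # Synthetic payload: one random 150 bp "read" repeated 2000 times, purely so
    # there is something compressible to measure.
    data = ("".join(random.choice("ACGT") for _ in range(150)) * 2000).encode()

    for level in (1, 6, 9):
        t0 = time.perf_counter()
        compressed = zlib.compress(data, level)
        t_compress = time.perf_counter() - t0
        t0 = time.perf_counter()
        zlib.decompress(compressed)
        t_decompress = time.perf_counter() - t0
        print(f"level {level}: {len(compressed) / len(data):6.2%} of original, "
              f"compress {t_compress * 1000:.1f} ms, decompress {t_decompress * 1000:.1f} ms")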


Oddly enough, I found a similar sentiment echoed last night when I was looking up how widespread the use of ASICs is in various scientific fields. I had discovered an FPGA startup called Edico Genome [1] that seemed to offer substantial advantages in processing speed, and after searching through Reddit I came across a post noting that there is precedent for FPGA/ASIC companies going under in the genomics sphere [2]. The poster there attributed this to an unwillingness to use tools that aren't widely cited, however.

[1] http://edicogenome.com/

[2] https://www.reddit.com/r/bioinformatics/comments/73wcd3/fail...


To deal with the citation problem, many FPGAs and ASICs that implemented BLAST guaranteed "bit-identical output". You could easily verify anything found by the accelerator using BLAST, just more slowly.


Slides from a talk by the author that give a good introduction to the landscape and the previous algorithms: https://platformlab.stanford.edu/Seminar%20Talks/Yatish_Tura...


Cool. But I think the current state of the art uses minhashes (minimap2) to avoid doing expensive all-vs-all alignments. There is still a place for more sensitive alignments, though.
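
Strictly speaking, minimap2 is built on minimizers rather than classic MinHash; a stripped-down sketch of minimizer selection (illustrative k and window size, and lexicographic ordering instead of minimap2's actual hashing):

    def minimizers(seq, k=15, w=10):
        """In every window of w consecutive k-mers, keep only the smallest one.
        Reads that overlap substantially share many minimizers, so an index over
        minimizers finds candidate overlaps without all-vs-all alignment."""
        kmer_list = [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]
        selected = set()
        for start in range(len(kmer_list) - w + 1):
            selected.add(min(kmer_list[start:start + w], key=lambda x: x[1]))
        return selected

    a = "ACGTTGCAACGTTGCCATTGACGGCGTAGCTAGCTAGGATCCGATCGAT"
    b = "GCAACGTTGCCATTGACGGCGTAGCTAGCTAGGATCCGATCGATTTTTT"  # overlaps the last 44 bp of a
    shared = {km for _, km in minimizers(a)} & {km for _, km in minimizers(b)}
    print(len(shared), "shared minimizers suggest a candidate overlap")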


Very interesting write-up too! I often read that computational biology is a different beast from the rest of the computational landscape (which usually revolves around solving PDEs) and this was a nice window into at least one problem on that side of the pond. A "best substring" search sounds, to me, closer to the kind of problems CS people think about than to what computational cats do.


“Up to x% faster”: another marketing term that used to work.


Oh no, someone should tell them that Darwin is the name of the kernel inside iOS/macOS.


... as well as a British biologist.


That was clearly the joke.



