New businesses eye the opportunities in managing genome data (economist.com)
38 points by roye on June 24, 2016 | 7 comments



At least on the storage side, the data is very compressible. An LZ77 variant for genomic data called GDC [1] got a compression ratio of ~9500, reducing the incremental size from 100 GB to about 10 MB.

[1]: http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&proj...
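(Quick arithmetic sanity check on those numbers:)

    # a compression ratio of ~9500 on 100 GB does land around 10 MB
    print(100e9 / 9500 / 1e6)   # ~10.5 (MB)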


GDC isn't directly relevant for most of the large-scale storage needs in genomics. There are a handful of "most important" file formats in the kind of genomics discussed by the article:

- FASTA: a sequence or set of sequences representing an entire genome. Usually used (in human genomics) to represent the "reference genome", but very rarely individual genomes.

- FASTQ: a set of sequences with associated probabilities at each position (representing the probability that the base at that position is correct). Used to represent the output from a sequencing experiment, where you may get 100s of millions to billions of short reads (order of 100 to 10000 bases in length) from a biological sample.

- BAM: "binary alignment map". Stores the data from the FASTQ (sequence and quality) in a way that "aligns" it to a reference genome -- identifying where the read "came from" (more formally, mapping each base in a read to the most likely corresponding base in the reference genome).

- VCF/gVCF: "variant call format". You can think of this as a diff between the individual sequenced and the reference genome.

In most cases in human genomics you wouldn't construct a full FASTA from the individual (something you would do through a process called assembly). Instead, you would sequence the sample (producing FASTQ), align it to a FASTA reference genome (producing BAM), and call variants (producing VCF). The VCF is much smaller than the other formats, and that size differential is most of where GDC would get its performance: all in all, we're usually not that different from the reference.
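To make that concrete, here's a toy illustration in plain Python of what a FASTQ record and a VCF data line look like (the read and the variant are made up, and this isn't tied to any particular library):

    # A FASTQ record is four lines: read name, bases, a "+" separator,
    # and one quality character per base (made-up example).
    fastq_record = "@read1\nGATTTGGGGTTCAAAGCAGT\n+\nIIIIIIIIIHHHGGFFFCCB\n"
    name, seq, _, qual = fastq_record.strip().split("\n")
    assert len(seq) == len(qual)

    # A VCF data line is tab-separated; the first eight columns are
    # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO -- effectively one
    # "diff" entry against the reference (also made up).
    vcf_line = "chr1\t12345\t.\tA\tG\t50\tPASS\t."
    chrom, pos, _id, ref, alt, vcf_qual, filt, info = vcf_line.split("\t")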

The big storage problem usually ends up being storing and manipulating FASTQ and BAM files, because these are the (almost complete) original data from the sequencing run, and occasionally there's a need to keep them around:

- you may want to run a new analysis that wasn't done and so wasn't encompassed in the original VCF

- you may want to know the underlying quality of the data that created a variant call. Sequencing is a stochastic process, subject to a variety of types of error. Even though VCF calls typically have an estimate of quality, in many cases there's no substitute for looking at the original underlying data.

- you may have a legal or contractual obligation to maintain this data (e.g., under CLIA regulations, laboratories offering clinical sequencing may be required to store the raw data underlying a clinical result for a number of years).

So, how do you store FASTQ or BAM more compactly? BAM is already compressed -- block-gzipped, to be specific -- but it still stores all the information explicitly. The obvious first step is reference-based compression (e.g., CRAM: http://www.ebi.ac.uk/ena/software/cram-toolkit), which elides sequence data that is identical to the reference genome and gets rid of almost all of the storage needed for the sequences.
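As a rough sketch of the reference-based idea (concept only, substitutions only -- not the actual CRAM codec), you store just the positions where an aligned read differs from the reference slice it maps to:

    # Encode an aligned read as its mismatches against the reference slice
    # it maps to; matching bases are implied and cost nothing to store.
    def encode_read(read, ref_segment):
        return [(i, b) for i, (b, r) in enumerate(zip(read, ref_segment)) if b != r]

    def decode_read(mismatches, ref_segment):
        bases = list(ref_segment)
        for i, b in mismatches:
            bases[i] = b
        return "".join(bases)

    ref  = "ACGTACGTAC"               # reference slice the read aligned to
    read = "ACGTTCGTAC"               # one substitution at offset 4
    diff = encode_read(read, ref)     # [(4, 'T')] -- usually empty or tiny
    assert decode_read(diff, ref) == read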

Other than some easily compressible stuff, like read names, the main thing left over is the quality scores -- the sequence of numbers telling you how "high-quality" each sequenced base was. In raw form, the qscores take the same number of bytes as the sequence reads, but once we've used reference-based compression to elide the bases, they're by far the dominant component. Unfortunately, compressing quality scores well is a difficult problem. There's been a lot of work done in this area (e.g., http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3832420/), but it's by no means solved.
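For a sense of why the qscores cost as much as the bases: each base gets one quality character, Phred-scaled and (in FASTQ) ASCII-encoded with the usual +33 offset. A toy decoding (the quality string is made up):

    qual = "IIIIIHHGF#"                  # one character per sequenced base
    phred = [ord(c) - 33 for c in qual]  # Phred+33 decoding
    print(phred)                         # [40, 40, 40, 40, 40, 39, 39, 38, 37, 2]
                                         # Q40 ~= 1-in-10,000 estimated error rate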

So if you're into data compression, coming up with a better way to compress FASTQ/BAM files is still a valuable line of work.


As someone who works on DNA sequencing software, the dirty secret is that qscores are excessively precise. They technically span the range 0 to 99, but due to limitations in how well qscores can be predicted (you can't really get closer than ±5), the real range is more like (0 to 7)*10. With better bit-packing it should be possible to compress them much better.
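A minimal sketch of that idea, assuming a hypothetical binning of Phred scores into 8 levels so each one fits in 3 bits instead of a full byte:

    def quantize(q):
        # map Phred 0-99 into one of 8 coarse bins: 0-9 -> 0, 10-19 -> 1, ..., 70+ -> 7
        return min(q // 10, 7)

    def pack3(scores):
        # pack the 3-bit bins into bytes
        out, bits, nbits = bytearray(), 0, 0
        for q in scores:
            bits = (bits << 3) | quantize(q)
            nbits += 3
            while nbits >= 8:
                nbits -= 8
                out.append((bits >> nbits) & 0xFF)
        if nbits:
            out.append((bits << (8 - nbits)) & 0xFF)   # left-align the remainder
        return bytes(out)

    print(len(pack3([38, 40, 12, 2, 37, 25, 33, 40])))   # 3 bytes instead of 8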


Range isn't an issue (any reasonable scheme won't assign codes to scores that never occur), but you're right that qscores are precise without necessarily being accurate. Quantization of quality scores is a common solution (it is optional in CRAM, mentioned above). There's also recent work on better methods for doing it [1,2].

The problem is that if you have to maintain BAMs for regulatory reasons, lossy qscores may not be sufficient for compliance, because you 1) have lost part of the original data, and 2) may not be able to exactly reconstruct your analysis results (unless, of course, you did the analysis on the quantized scores).

Thus, it would still be interesting to see better lossless compression methods.

[1] http://bioinformatics.oxfordjournals.org/content/early/2014/...

[2] http://web.stanford.edu/~iochoa/publishedPublications/2015_q...


The qscores do occur; it's just that the error bars on them are quite large in practice. It certainly would be interesting to see whether the analysis can be reconstructed from quantized qscores.


Compressing the data is fine, but if you want to do live analysis, what matters more is indexing it, plus the metadata for lookup. If you're just storing for compliance, then systems like Glacier are ideal because you rarely need to retrieve the data.
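For the live-analysis side, the standard indexed formats already give random access; for example, a quick sketch with pysam (assuming a coordinate-sorted BAM with its .bai index alongside; the file name and region here are hypothetical):

    import pysam

    bam = pysam.AlignmentFile("sample.bam", "rb")
    # the .bai index lets this jump straight to the region instead of scanning the file
    for read in bam.fetch("chr1", 1000000, 1000500):
        print(read.query_name, read.reference_start)
    bam.close()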


I am not surprised, since there will be plenty of subsequences that repeat, but that's still an amazing compression ratio.



