Data Sketching: the approximate approach is often faster and more efficient (acm.org)
79 points by yarapavan on June 7, 2017 | 8 comments



This is gaining popularity in bioinformatics too.

When you sequence a genome, the sequencing machine reads it in tiny pieces, and you get the same piece several times. Coverage is not even: some pieces show up a lot and some show up rarely, so you effectively have to sequence the genome several times over (30 to 80 times is the norm).

A lot of that data is redundant to get to those rare pieces, so 'data sketching' is getting more popular.

Examples:

- The ABySS genome assembler has (experimentally) a Bloom filter to decrease memory requirements and run time, at the expense of accuracy and having to fiddle with the Bloom filter parameters. The paper assembles a genome using 35 GB of RAM instead of a few hundred GB.

- The khmer package contains many functions to downsample sequencing runs using, again, Bloom filters. It also has a HyperLogLog implementation to quickly count k-mers (similar to n-grams, but over the ATGC alphabet instead of words).

- MinHash sketching is a popular first step in aligning messy PacBio reads (implemented, for example, in the Canu assembler); a toy k-mer MinHash sketch is shown below this list. People also use this to map population structure using sequencing reads directly, and to quickly assign a taxon to the data of an unknown species.

- There are some approaches with the Oxford Nanopore sequencer (that tiny thing you can hook up to your laptop) where you stream the incoming data directly into the assembler or another program and tell the sequencer to stop once you have found what you're looking for, or your coverage is high enough.

And so on. It's a goldmine for fast algorithms; I haven't seen anyone use a Cuckoo filter yet, for example.
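
For a flavour of the MinHash idea mentioned above, here is a toy bottom-k sketch over k-mers, assuming plain Python and hashlib (real tools use canonical k-mers and proper hash families; the function names are just for illustration):

    import hashlib
    import random

    def kmers(seq, k=21):
        """Yield all k-length substrings of a DNA sequence."""
        for i in range(len(seq) - k + 1):
            yield seq[i:i + k]

    def minhash_signature(seq, k=21, sketch_size=128):
        """Bottom-k sketch: keep the sketch_size smallest k-mer hash values."""
        hashes = {int(hashlib.md5(km.encode()).hexdigest(), 16) for km in kmers(seq, k)}
        return set(sorted(hashes)[:sketch_size])

    def jaccard_estimate(sig_a, sig_b, sketch_size=128):
        """Estimate k-mer Jaccard similarity from two bottom-k sketches."""
        merged = set(sorted(sig_a | sig_b)[:sketch_size])
        return len(merged & sig_a & sig_b) / len(merged)

    random.seed(0)
    a = "".join(random.choice("ACGT") for _ in range(5000))
    b = a[:4000] + "".join(random.choice("ACGT") for _ in range(1000))
    print(jaccard_estimate(minhash_signature(a), minhash_signature(b)))

The point is that two tiny fixed-size signatures are enough to estimate how much k-mer content two sequences share, without ever comparing the full k-mer sets.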


In response to this challenge, the model of streaming data processing has grown in popularity. The aim is no longer to capture, store, and index every minute event, but rather to process each observation quickly in order to create a summary of the current state. Following its processing, an event is dropped and is no longer accessible. The summary that is retained is often referred to as a sketch of the data.

This article introduces the ideas behind sketching, with a focus on algorithmic innovations. It describes some algorithmic developments in the abstract, followed by the steps needed to put them into practice, with examples. The article also looks at four novel algorithmic ideas and discusses some emerging areas.


It is interesting to see that this topic, which has been around in the academic world since 2004, is only now gaining popularity.

An interesting thing is that you can find some of these concepts implemented in Apache Spark (approximate counting) and in BlinkDB. From a theoretical point of view, many results are related to statistical bounds like the Chernoff bound and Chebyshev's inequality. During my PhD I found this topic interesting and promising for mining data with verified error bounds on precision/recall and elapsed time.
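
If you want to try this in Spark, the RDD API exposes countApprox and countApproxDistinct (the latter HyperLogLog-based); a quick sketch, assuming a local PySpark installation:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("sketch-demo").getOrCreate()
    rdd = spark.sparkContext.parallelize(range(1_000_000))

    # Approximate count: return the best estimate available within 1000 ms.
    print(rdd.countApprox(1000, confidence=0.95))

    # Approximate distinct count (HyperLogLog-based), ~5% relative error.
    print(rdd.countApproxDistinct(relativeSD=0.05))

    spark.stop()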

To go deeper into this topic I suggest the book Knowledge Discovery from Data Streams (http://www.liaad.up.pt/area/jgama/DataStreamsCRC.pdf) and other books written by Joao Gama.


Agreed, streaming algorithms and sketches are a fascinating topic. The article's author, Graham Cormode, has been conducting research in that area for a long time and is one of the co-authors of the Count-Min Sketch paper.

Basically, a Count-Min Sketch is structurally the same thing as an undersized counting Bloom filter, but it is used quite differently, which allows for loads of interesting operations. The basic query is the estimated number of occurrences of an element (how often has it been added?), but it can support more complex things as well, such as quantiles and heavy hitters.
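
To make the point query concrete, here is a toy Count-Min Sketch in Python (point queries only; quantiles and heavy hitters need extra machinery on top, and a real implementation would use a pairwise-independent hash family rather than seeded SHA-256):

    import hashlib

    class CountMinSketch:
        def __init__(self, width=2048, depth=5):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _index(self, item, row):
            # One hash function per row, derived by salting with the row number.
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            return int(digest, 16) % self.width

        def add(self, item, count=1):
            for row in range(self.depth):
                self.table[row][self._index(item, row)] += count

        def estimate(self, item):
            # Collisions can only inflate a counter, never decrease it,
            # so the minimum across rows is the tightest estimate.
            return min(self.table[row][self._index(item, row)]
                       for row in range(self.depth))

    cms = CountMinSketch()
    for word in "the quick brown fox jumps over the lazy dog and the cat".split():
        cms.add(word)
    print(cms.estimate("the"))   # 3 (may overestimate, never underestimates)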


Here are implementations of some of the sketching algorithms mentioned in the article for those interested:

https://github.com/ct-clmsn/CHIUW2017


We (set.gl) have been working on a few really exciting ways to do this on-device, primarily with location data. Here's our talk this fall at Strata (New York): https://conferences.oreilly.com/strata/strata-ny/public/sche...


Sorting a dataset in order to draw k samples is completely nuts. Pick k random integers between 1 and N, rerolling any duplicates.

Is your data a stream? Use reservoir sampling for O(n) vs O(n log n).

Sorting a collection of values in order to sample from it doesn't make any sense.
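
For reference, the classic reservoir sampling loop (Algorithm R) is only a few lines; a minimal sketch, assuming the stream is any Python iterable:

    import random

    def reservoir_sample(stream, k):
        """Uniform random sample of k items from a stream of unknown length."""
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                # Keep item i with probability k / (i + 1), evicting a random slot.
                j = random.randrange(i + 1)
                if j < k:
                    reservoir[j] = item
        return reservoir

    print(reservoir_sample(range(1_000_000), 5))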


The sampling algorithm you describe becomes pretty slow as k grows. There are much faster algorithms; we wrote a paper on fast sampling: https://arxiv.org/pdf/1610.05141.pdf (page 11; what you described is labelled "H", for hash table, in the plot. Sorry about the single-letter names, it's a weird convention for sampling algorithms).

You can see that as soon as k is more than a few thousand, "R" ("recursive") is much faster. What it does is recursively split the range (initially 1...N) in half and decide how many samples come from the left half and how many from the right. It does this by drawing hypergeometrically distributed numbers, which is the distribution for sampling without replacement. Its base case is the algorithm you describe (H) once there are fewer than ~1000 elements left. Algorithm R is also very easy to do in parallel. ("B_MKL" and "B_GPU" are vectorised, non-portable implementations of a Bernoulli-sampling-based algorithm using Intel MKL and Nvidia CUDA, respectively.)
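
Roughly, the recursion looks like this (a simplified sketch, not the tuned implementation from the paper; numpy's hypergeometric generator stands in for a proper one, and the names and cutoff are illustrative):

    import random
    import numpy as np

    def sample_range(lo, hi, k, cutoff=1024):
        """Draw k distinct integers uniformly at random from [lo, hi)."""
        n = hi - lo
        assert 0 <= k <= n
        if k == 0:
            return []
        if n <= cutoff:
            # Base case ("H"): hash set with rerolls, fine for small ranges.
            seen = set()
            while len(seen) < k:
                seen.add(random.randrange(lo, hi))
            return list(seen)
        mid = lo + n // 2
        # How many of the k samples land in the left half? That count follows a
        # hypergeometric distribution (sampling without replacement).
        k_left = np.random.hypergeometric(mid - lo, hi - mid, k)
        return (sample_range(lo, mid, k_left, cutoff)
                + sample_range(mid, hi, k - k_left, cutoff))

    print(len(sample_range(0, 10**7, 10_000)))   # 10000 distinct values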

Reservoir sampling can be done much faster than O(n). It's possible in Θ(k log(N/k)). This is a tight bound, so things like O(k) are impossible, because every position in the reservoir will be replaced approximately log(N/k) times. See https://www.researchgate.net/profile/Jeffrey_Vitter/publicat... for a proof.

But you're right that sorting in order to draw a sample is not particularly efficient most of the time. Not to mention that a selection algorithm would be sufficient as well, and it only requires expected time O(N): https://en.wikipedia.org/wiki/Quickselect
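
Concretely, the selection-based version could look like this: give every element a uniform random key and take the k smallest keys with a partial selection (numpy's argpartition) instead of a full sort. A small sketch; the function name is just for illustration:

    import numpy as np

    def sample_by_selection(values, k, seed=None):
        """Uniform sample of k elements without replacement via partial selection."""
        rng = np.random.default_rng(seed)
        keys = rng.random(len(values))          # one uniform random key per element
        idx = np.argpartition(keys, k)[:k]      # indices of the k smallest keys, expected O(N)
        return [values[i] for i in idx]

    print(sample_by_selection(list(range(1_000_000)), 5))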




