Algorithms Every Data Scientist Should Know: Reservoir Sampling (cloudera.com)
120 points by Irishsteve on July 1, 2013 | 40 comments



The core problem with Reservoir sampling is the huge number of calls to the random number generator. That quickly drains the entropy from PRNGs.

Also, for some sampling applications, the calls to the PRNG are the slowest part of the algorithm. So you should prefer an algorithm that makes the fewest possible calls.

Lastly, you should take care in how the algorithm is implemented. For example, if the PRNG yields consecutive 32-bit values, computing randbits modulo n leads to small selection biases whenever n doesn't evenly divide 2^32. There are a lot of ways around this problem, but naive implementations won't give equi-distributed results.
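(A minimal Python sketch of one standard fix, rejection sampling, with a 32-bit generator as a stand-in; not from the article:)

    import random

    def unbiased_below(n, rand32=lambda: random.getrandbits(32)):
        # Plain rand32() % n is biased whenever n doesn't divide 2^32,
        # because the low residue classes each get one extra 32-bit value.
        # Rejecting draws from the uneven tail removes the bias.
        limit = (2 ** 32 // n) * n        # largest multiple of n that fits in 32 bits
        while True:
            r = rand32()
            if r < limit:                 # accept: every residue class is now equally likely
                return r % n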

For the most part, you should choose some other sampling algorithm unless you have exotic needs: not knowing the population size until you loop over it, AND not caring about the huge number of calls to the PRNG, AND having a limitless supply of entropy for your PRNG, AND having an easy means of getting equidistributed values in the range 0 to n-1, OR the quality of your results isn't important.


Another way would be to simply calculate hash(prefix + obj) and then choose the object with the lowest hash value. You can use a GPU-fast hash like SHA-2, and you can even parallelize it. It's not guaranteed to be evenly distributed, but I think it would be close enough for most purposes.
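(A rough Python sketch of that min-hash idea, using hashlib's sha256 in place of a GPU hash and assuming the objects can be turned into strings; the function name is made up:)

    import hashlib

    def min_hash_sample(stream, prefix=""):
        # Keep the item whose hash(prefix + item) is smallest.  Deterministic
        # for a given prefix, trivially parallelizable (each worker keeps its
        # local minimum, then you merge), and roughly uniform if the hash is.
        best_item, best_hash = None, None
        for item in stream:
            h = hashlib.sha256((prefix + str(item)).encode()).digest()
            if best_hash is None or h < best_hash:
                best_item, best_hash = item, h
        return best_item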


Ingenious! This should work -- the uniform distribution property of hash functions is relied upon very heavily by other streaming algorithms such as HyperLogLog, and there has been sound empirical evaluation by the guys at Aggregate Knowledge of the properties of various hash functions:

http://blog.aggregateknowledge.com/2012/02/02/choosing-a-goo...

(Dunno why you include a prefix in the calculation above. hash(obj) should be sufficient.)


You need the prefix because without it, you will end up picking the same object every time you run the algorithm (e.g., on the next day's stream of data).

Depends, of course, upon the size of the stream and the range in variety of the objects in the stream. A stream of bytes would not be a good use case, for example.


I don't understand why you would need many calls to random. Are there any obvious reasons it wouldn't suffice to get one random number and then reuse it throughout the stream?


Looks like that doesn't work,

https://gist.github.com/tantalor/5911137

The distribution ends up weighted towards the front with a spike at the end.

Looks like this: https://docs.google.com/spreadsheet/oimg?key=0AhzL62E3zw7-dG...
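(A quick simulation sketch in Python of the "reuse one random number" scheme, if you'd rather not open the gist; the exact numbers there may differ:)

    import random
    from collections import Counter

    N, TRIALS = 20, 100_000
    counts = Counter()
    for _ in range(TRIALS):
        r = random.random()           # one random number, reused for the whole stream
        pick = None
        for i in range(1, N + 1):
            if r < 1.0 / i:           # "replace with probability 1/i", but r never changes
                pick = i
        counts[pick] += 1

    for i in range(1, N + 1):
        print(i, counts[i] / TRIALS)  # roughly 1/2, 1/6, 1/12, ... with a spike of 1/N at the end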


Another thing about the algorithm and the article: this is supposed to be a "real world problem" that the interviewer asks because they're struggling with it themselves. But what would a single randomly chosen item from a huge set mean to "real world" operations? That you would very occasionally get a very unusual value? You'd have to run the scan many times for that, or for whatever else you were thinking of.

Basically, the algorithm is a "cool exercise in probability" that, on any deeper look, turns out to have just about zero use or practicality in the actual "real world". Which kind of throws it into the zone of obnoxious interview brain teasers, despite claims to the contrary.


The problem was supposed to be based on a real-world problem. Almost always, when you simplify a problem you have down to its core, you end up with a very simple question that appears to have no practical applications.

Normally, this simple question is also easy, in which case simplifying it is the only skill you need. In other cases the question is hard, in which case you need to be able to solve it. Both of these skills are valuable, but should not be tested in the same question.

Also, if you are trying to pick a candidate's brain, there is no reason to make them redo your work of simplification when you are stuck on the actual problem.


Not at all. The exercise is trivially extended to allow for selecting 1,000 or 1,000,000 items from the set. This can be useful if you want a representative sample from the stream but require a well-defined and limited number of samples.
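(For reference, a minimal Python sketch of the k-item reservoir, as usually described; not lifted from the article:)

    import random

    def reservoir_sample(stream, k):
        reservoir = []
        for i, item in enumerate(stream):   # i counts items seen so far, starting at 0
            if i < k:
                reservoir.append(item)      # the first k items fill the reservoir
            else:
                j = random.randint(0, i)    # uniform over the i+1 items seen so far
                if j < k:
                    reservoir[j] = item     # i.e. replace a random slot with probability k/(i+1)
        return reservoir                    # each item seen ends up kept with probability k/N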


> That quickly drains the entropy from PRNGs.

If you're worrying about the entropy in your PRNG, you are (probably) doing it wrong—just use Fortuna and be done with it.

> Also, for some sampling applications, the calls to the PRNG are the slowest part of the algorithm.

That's a valid concern. Of course, how often do you need cryptographic security and performance in a sampling application?


Is there any reason you couldn't use a fast pseudo-random number generator?

Even so, you are processing a stream of data; that sounds like an easy place to get entropy.


Sounds like the sort of problem the Mersenne Twister PRNG is good for.


I love reservoir sampling. When I first heard about the algorithm I had been preparing a lab session for a course I was a TA on. (I had a lot of freedom to create labs in this course -- a good and bad thing!) I made dinner in a bit of a daze while I proved in my head that reservoir sampling worked. It's a simple inductive proof, which is a good thing because if it was more complicated I might not have been able to do it without paper, and I might not have eaten that night.
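(For anyone who wants the induction spelled out, here is the inductive step for the single-item reservoir in LaTeX; this is the standard argument, not necessarily the exact derivation above:)

    % Keep item n with probability 1/n.  Induction hypothesis: after n-1 items,
    % each of them is the current pick with probability 1/(n-1).  Then, for i < n:
    P(\text{pick} = x_n) = \frac{1}{n},
    \qquad
    P(\text{pick} = x_i) = \frac{1}{n-1}\left(1 - \frac{1}{n}\right) = \frac{1}{n}.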

I've since learned there are many other interesting sampling algorithms that apply in a streaming setting. A few are given here: http://people.cs.umass.edu/~mcgregor/slides/10-jhu1.pdf


Good link, thanks. Here are a couple of people doing basic algorithms research:

http://www.cs.princeton.edu/~arora/publist.html

http://www.cs.rutgers.edu/~muthu/streams.html


I've always loved the one-liner of this in Perl: http://learn.perl.org/faq/perlfaq5.html#How-do-I-select-a-ra...


Let's unpack that,

    rand($.) < 1 && ($line = $_) while <>;
We all know rand and while, but if you don't know Perl the rest is hard.

<> is a common way to read stdin, and the value is assigned to the $_ special variable.

The one I didn't know was $., the current input line number. In other words, it's your loop index that automatically increments.

http://www.kichwa.com/quik_ref/spec_variables.html


That's good until your input stream exceeds 32768 lines [1]. There are better generators available from CPAN, like Math::TrulyRandom.

[1] - http://www.perl.com/doc/FMTEYEWTK/random


Hmm,

Given the article's windup, I'd be a bit skeptical of the naive solution - at input n, with probability 1/n replace the current choice.

Once you've reached input number 10^8 or whatever, you've got all sorts of chances to have arithmetic overflow and/or the weirdness of that many pseudo-random operations screw you.

I'd rather keep a list of x's; let x1 be randomly chosen from the last 10, x2 randomly chosen from the last 100, x3 randomly chosen from the last 1000, etc., and when termination time comes, do a little fixup. That'd take O(log(n)) memory instead of O(1), but if this matters, shouldn't you want some sample of what's happening?


It's easy to do the right thing using arbitrary-precision arithmetic and a random-bit generator. Basically, generate bits after the binary point until the possible range for the final random number does not include 1/n. The number of bits you have to generate doesn't depend on n: the expected number is 2 bits[1]. If your random bits are biased, that's easy to correct too[2].

[1] You always need at least one bit. After the first, each additional bit divides the space in which 1/n can lie in half, so it's just 1 + (1/2 + 1/2(1/2 + 1/2(...))) = 1+1/2+1/4+... = 2. It's a lot like arithmetic coding[3]

[2] http://en.wikipedia.org/wiki/Fair_coin#Fair_results_from_a_b...

[3] http://en.wikipedia.org/wiki/Arithmetic_coding
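(A rough Python sketch of that bit-at-a-time comparison, using Fraction for the exact arithmetic; the function name and structure are mine, not the parent's:)

    import random
    from fractions import Fraction

    def keep_with_prob_one_over_n(n):
        # A uniform real r in [0, 1) is revealed one binary digit at a time;
        # stop as soon as the digits seen so far pin r to one side of 1/n.
        # The expected number of bits drawn is 2, independent of n.
        threshold = Fraction(1, n)
        low, high = Fraction(0), Fraction(1)   # interval known to contain r
        while True:
            mid = (low + high) / 2
            if random.getrandbits(1):          # next bit of r is 1: upper half
                low = mid
            else:                              # next bit of r is 0: lower half
                high = mid
            if high <= threshold:
                return True                    # r is certainly below 1/n
            if low >= threshold:
                return False                   # r is certainly at or above 1/n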


Just FYI, what Wikipedia is talking about is a biased but still random distribution, which thus wouldn't have the possible problem of correlation between successive "tosses".

Not that I have any knowledge that any given pseudo-random number generator would have trouble with long series of zeros. Still, the scheme in my parent post was aiming to avoid potential problems of that kind.


> Once you've reached input number 10^8 or whatever, you've got all sorts of chances to have arithmetic overflow and/or the weirdness of that many pseudo-random operations screw you.

Why? Double floats don't underflow until 10^-300 or so, and RNGs give uniform numbers from 0 to 1 just fine...


Most RNGs that generate numbers from 0 to 1 actually just generate a random 64-bit value and divide by 2^64, which means that they do underflow relatively easily. (This is relatively easy to overcome using a better (& more expensive) random bits -> double algorithm.)


I'm not sure how your solution gives a uniformly distributed sample?


So you take the first item in the stream. Then you take the second. Flip a coin, pick the winner, discard the loser, and promote the winner to level 2. Take in the next two items and do the same: promote the winner to level 2, discard the loser. Pick a winner from the two level-2 items, discard the loser, and promote the winner to level 3. Any given element ends up as the winner at the top level k with probability 1/2^k. If there were 2^k elements, this is exactly the probability it should have; and by the symmetry of selection at each level, it's uniform.

There are some complications if the number of elements is not a power of two (or a power of whatever branching factor you picked for each level). Essentially, though, you know the number of elements at that point, and you can weight the probability of selection so that the winner has exactly 1/N probability of having been selected.
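(A rough Python sketch of that tournament, including one way of doing the end-of-stream fixup; the fixup details are my reading of the parent, not a quote:)

    import random

    def tournament_sample(stream):
        # levels[k] holds the element waiting for an opponent at level k.
        # Two level-k elements play a fair coin flip; the winner moves up to
        # level k+1.  A survivor at level k won a block of 2^k elements.
        levels = []
        for item in stream:
            k = 0
            while True:
                if k == len(levels):
                    levels.append(item)         # no opponent yet at this level
                    break
                if levels[k] is None:
                    levels[k] = item
                    break
                item = item if random.getrandbits(1) else levels[k]
                levels[k] = None                # winner moves up, loser is discarded
                k += 1
        # Fixup for a non-power-of-two count: weight each survivor by the
        # number of original elements it stands in for, i.e. 2^k.
        survivors = [(elem, 2 ** k) for k, elem in enumerate(levels) if elem is not None]
        total = sum(w for _, w in survivors)    # equals the number of elements seen
        r = random.randrange(total)             # raises on an empty stream
        for elem, w in survivors:
            if r < w:
                return elem
            r -= w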


Right, that works; but it seems different from "Pick one from first 10, pick one from next 100, pick one from next 1000".


Fundamentally it's the same.


I first saw that sampling technique in The Practice of Programming by Kernighan and Pike. It's used in their Markov Chain example.


Just yesterday this blog post was on r-bloggers that shows how to use it with R: http://things-about-r.tumblr.com/post/53690776834/time-is-on...


This is a great algorithm, entertaining to read about. I can say pretty safely that I wouldn't have come up with it on the spot.

Maybe that's why I'm not so enthusiastic about it as an interview question? ;)

More seriously - I do think that questions with specific optimal answers (or at least known answers that are considerably better than simple greedy solutions) can be good interview questions, as long as they contain a big middle, with plenty of opportunity to think about the problem and show some good problem solving skills.

Questions where you either know the answer or you don't are probably the worst. Questions where you either get the right answer or get nowhere aren't great either. Questions where you can get a good answer by thinking about it, even if you don't find the optimal answer, are probably best.

Another good way to approach this: how much would an interviewee/candidate's performance change if he or she were allowed to type "how to randomly sample from a list of unknown length" into Google? If that would make a huge difference, then it probably isn't a great question, unless you're really testing to see whether someone is already aware of an algorithm.


I deduced this myself when I wrote an AI for Ludo, so that it chooses one of the possible moves - each equally likely - as the moves are being generated :)

Some other similar logic - Selecting n distinct items out of N, all C(N,n) selections being equally likely, passing through the data only once :)

And the next step - if the data can be accessed randomly, the above can be done in O(n) time instead of O(N) through a modification of the algo, still giving a statistically equivalent sampling scheme.
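(I don't know which modification the parent has in mind, but for the random-access case one classic O(n) approach is Floyd's sampling algorithm; a Python sketch, assuming the data is an indexable sequence and n <= len(data):)

    import random

    def floyd_sample(data, n):
        # Pick n distinct indices, every size-n subset equally likely,
        # using only n random draws.
        N = len(data)
        chosen = set()
        for j in range(N - n, N):
            t = random.randint(0, j)            # uniform over indices 0..j
            chosen.add(t if t not in chosen else j)
        return [data[i] for i in chosen]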


Any data scientists on HN? I just ask because I wonder if it would be a good career move for me. Spent 10 years as a programmer and systems analyst. Got real tired of that.


"Data scientist" is a really vague term. It's really a matter of looking at your current skill set and interests, and seeing where they fit.

For example, many CS grads don't have strong stats/maths backgrounds. However there are still many "data science" jobs open to them. For example, you can work on the tooling that goes into storing and processing data (e.g. Hadoop), or you can build visualisations. If you want to work on, say, data analysis you need to have a stronger background in stats.


Since machine learning and AI are pretty much applied stats, there have got to be quite a lot of CS grads with good stats skills.


Some CS departments require undergrads to do the typical engineering maths sequence (Calculus, Linear Algebra, etc.) but many don't. Lots of PhD students in machine learning have undergrad degrees in physics, maths, or engineering.

Where I did my PhD (the top-ranked CS department in the UK by a recent survey) the undergrad students did one maths course in their entire degree, with most of it being discrete maths (IIRC). This really limited the ability of students to undertake research in machine learning or theoretical CS.


Since you're based in Birmingham I can take a good guess at where that is, because I studied there and live in Birmingham. And I agree, but the machine learning/AI modules themselves had to have the maths taught within them.

Nice to see Brummies on Hacker News. Since you're a startup guy, I guess you hang around Faraday Wharf?


I was at Faraday Wharf a few months ago, but I mostly work from home now. Drop me an email (address is in my profile) if you want to chat more.


I work for a big firm doing data science - or, specifically, risk, fraud, response, etc. prediction. It's easy money. The problem that you can run into is this:

The systems are archaic. By that I mean both the implementation platform and the people you will sell your solution to. To most of the senior staff this is new and mysterious - they would rather "play it safe", and may ask you instead to use traditional methods, put in "intuitive" variables, and keep the solution "simple" (simple being anything that they already know of from their past 10-20 years of experience).

But things are changing. In my firm too. And if you interview now as a data scientist, you will probably be given the right respect - and put into a team where the leaders may want people like you.

And you will be one of the very few in the company who know what this stuff is, rather than one of the myriad who know how to model using the company's already-established platform - which you will prove to be inferior. You may have to push the boundaries of the company for it, and you will pave the way. But you will also be welcome, as the realization now dawns on corporates that there may be better ways of doing what they have been doing.

Also, you will not be alone. You will soon find some smart people in the company you join. They will challenge you and be your friends. You will learn and think about things that you have not before... just as you will prompt others to think about things that they haven't before.

So go for it. Try it out. It's easy money.


I hope you find this. I was wondering if it's OK to ask you what "easy money" means? Quantitatively, that is. Care to share ballpark figures? I can never get accurate or reliable figures from salary sites online.


It is a good career move; the salaries tend to be about 50% more than those for great programmers. However, be prepared to spend at least ten years getting to the point where you are as comfortable as you are now. There's a whole lot to learn.


What about salaries? It took me forever to scale up to over 100k as a programmer. I'm guessing it's not much different with data science.



