Hacker News new | past | comments | ask | show | jobs | submit login

Their estimator does not require any assumption of uniformity. Notice that P(random ID is valid) = total_valid_IDs / total_IDs is the same regardless of the distribution of valid IDs.



I don't think this is correct. In order to generate a random number at all you first have to assume a distribution. If they generate a series of random ids, and those random ids are uniformly distributed across the space of all possible ids - while the valid ids have clusters - then won't their method still give a skewed estimate?


Nope. Let's take a small example. We have 20 bins. You are going to put 3 things in those bins, with each thing getting its own bin.

I'm going to roll a d20, which gives me a uniformly distributed integer from 1 through 20. I'll look into the bin with that number and if there is something there take it.

You don't want me to take any of your things. How can you distribute your 3 items among those 20 bins to minimize the chances that I will take one of your items?

If I were not using a uniformly distributed integer from 1 through 20 and you knew that and knew something about the distribution you could pick bins that my loaded d20 is less likely to choose.

But since my d20 is not loaded, each bin has a 1/20 chance of being the one I try to steal from. Your placement of an item has no affect on the probability that I will get it.

It works the same the other way around. If you place the items using a uniform distribution, then it doesn't matter if I use a loaded d20, or even just always pick bin 1. I'll have the same chance of getting an item no matter how I generate my pick.

In general when you have two parties each picking a number from a given space, if one of the parties is picking uniformly then nothing the other party can do affects the probability of both picking the same number.


> Let's take a small example. We have 20 bins. You are going to put 3 things in those bins, with each thing getting its own bin.

Now imagine a number of these bins _can hold zero things_ (not 3). Eg in a world where all bins are the same size, you can always steal 3 things from any of the bins, whereas in a world where the bin sizes vary. You'd hit a few bins which are guaranteed empty. Doesn't this directly affect the probabilities?


The key here is that the query sample is uniformly distributed, and that this is sufficient. I think some other comments in this thread give some decent intuitions why this is true. Cheers


No. Think about it: how could the underlying distribution change the probability of a hit from a uniform random sample?


I'm not sure they've sampled enough videos to be able to make that kind of central limit theorem assertion. A trillionth of a percent is an awfully small sample of the total space.


How did you get the trillionth of a percent? The total space is 2^64 and they found 24,964 videos. Based on their estimated number of videos (13,325,821,970) we can infer[1] that they have made the equivalent of a sample of size 3.46e13. I say equivalent because of the "cheating" they mention in the article which improves the efficiency of their method so their actual sample would be 32000 times smaller I guess). Anyway as shown in the link below that's a good sample size since it gives a precision of about 0.6%.

[1] https://news.ycombinator.com/item?id=38744227




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: