Wait a second. If the IDs are all allocated in a contiguous block, and the autho...

treesciencebot · 2023-12-23T02:14:38.000000Z

Is 32,000 a large enough number to estimate the entirety of YouTube's video space? It felt too little for what they are trying to accomplish (especially when they started doing year-by-year analysis).

cbolton · 2023-12-23T13:52:08.000000Z

32000 is just the "cheat factor" by which they increase the method's efficiency.

I'm not sure how much the "cheating" would affect the precision of the result. But assuming it has no effect, it's easy to estimate this precision:

They found X = 24964 videos in a search space of size S = 2^64. For the number of existing videos they report the estimate N = 13,325,821,970. From this we can find their estimate for the probability that a particular ID links to a video: p = N / S ≈ 7.22e-10. So the equivalent number of IDs that they have checked (the number of checks without cheating that would give the same information) is n = X / p ≈ 3.46e13.

Since X is binomially distributed, its variance is Var(X) = n⋅P⋅(1-P) (where P is the real proportion corresponding to the estimate p above). And N = X⋅S/n, so its variance is Var(N) = Var(X)⋅S^2/n^2. The standard deviation of N is thus σ = S⋅sqrt(P⋅(1-P)/n). Now we don't know P, but we can use our estimate p instead to find an estimate of σ!

We find that the standard deviation of their estimator for the number of YouTube videos is approximately S⋅sqrt(p⋅(1-p)/n) ≈ 8.43e7. That's just 0.633% of N so their estimate is quite precise.
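For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation above. It assumes, as stated, that the "cheating" has no effect on precision, and it uses only the figures quoted in this comment; the variable names just mirror the symbols used here.

```python
import math

# Figures quoted in the comment above.
X = 24964            # videos found
S = 2 ** 64          # size of the ID search space
N = 13_325_821_970   # reported estimate of the number of videos

# Estimated probability that a particular ID links to a video.
p = N / S

# Equivalent number of IDs checked without the "cheat factor".
n = X / p

# Estimated standard deviation of N, plugging p in for the unknown P.
sigma = S * math.sqrt(p * (1 - p) / n)

# Standard deviation relative to the estimate itself.
rel = sigma / N

print(f"p     = {p:.3g}")
print(f"n     = {n:.3g}")
print(f"sigma = {sigma:.3g}")
print(f"sigma / N = {rel:.3%}")
```

Because p is tiny, (1-p) is essentially 1 and the relative error collapses to about 1/sqrt(X), so the precision depends almost entirely on the number of hits.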

Dylan16807 · 2023-12-23T04:21:00.000000Z

When you're estimating a ratio between two outcomes, the rule of thumb is that you want at least 10-100 samples of each outcome, depending on how much precision you want.

They got 10,000 samples of hits, and a huge number of samples of misses. Their result should be very accurate. (32,000 was a different number)
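One way to see where a rule of thumb like this could come from is the sketch below. It assumes hits are rare and independent, so the relative standard error of the hit count is roughly 1/sqrt(hits); the helper name `relative_std_error` is made up for illustration.

```python
import math


def relative_std_error(hits: int) -> float:
    """Approximate relative standard error of a rare-event count.

    Assumes independent samples and a small hit probability,
    so the count's variance is close to its mean.
    """
    return 1 / math.sqrt(hits)


for hits in (10, 100, 10_000):
    print(f"{hits:>6} hits -> about {relative_std_error(hits):.1%} relative error")
```

Under that assumption, 10 hits gives an error of roughly a third, 100 hits about a tenth, and 10,000 hits about one percent.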