
> Even P99 estimates averaged from many tiny samples are at most inaccurate, they won't be biased.

Are you sure about this?




Sure, simulate it if you don't believe me.

    x <- rnorm(1000, mean=50, sd=10)
    quantile(x, 0.99)
    mean(sapply(1:20, function(i) {
      i <- i * 50
      # Parenthesize the range: in R, `:` binds tighter than `-`, so i-49:i means i - (49:i)
      quantile(x[(i-49):i], 0.99)
    }))


Matlab / GNU/Octave code:

    rand('seed',123);

    outer_rounds = 100;
    inner_rounds = 100;
    items = 500;

    q99_mean_vec = [];
    q99_complete_vec = [];

    for ior = 1:outer_rounds;

      q99_vec = [];

      x = rand( 1, inner_rounds*items );

      for iir = 1:inner_rounds

        q99_vec(iir) = quantile( x( (iir-1)*items+(1:items) )', 0.99 );

      end

      q99_complete_vec(ior) = quantile( x', 0.99 );
      q99_mean_vec(ior) = mean( q99_vec' );

    end

    hist( [ q99_complete_vec; q99_mean_vec ]', [0.988:0.0001:0.991] );
    xlim( [0.988, 0.991] );
    legend( 'from complete data', 'from means' );

This looks biased to me even though the distribution is uniform. I did not expect that. Maybe the quantile function is broken.

EDIT: I've installed R and run your code. You use a very low number of samples in the mean, so I changed it a bit. 1. I use the uniform distribution, so we know the true value of the 99th percentile is 0.99. 2. I've increased the number of samples:

    x <- runif(100000)
    quantile(x, 0.99)
    mean(sapply(1:2000, function(i) {
      i <- i * 50
      # Parenthesize the range: i-49:i would parse as i - (49:i)
      quantile(x[(i-49):i], 0.99)
    }))

When run multiple times, I've found that the mean is always lower than the quantile over the full sample.
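For the uniform case the size of the gap can be worked out by hand, which makes a useful sanity check on the simulation. This is only a sketch under stated assumptions: it assumes R's default `quantile` rule (type 7), independent Uniform(0, 1) draws, and chunks of 50 as in the snippet above. The seed, the number of chunks, and the variable names are arbitrary choices for illustration.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
n <- 50                           # chunk size used in the thread
p <- 0.99

# Expected value of R's default (type 7) sample quantile for Uniform(0, 1):
# h = (n - 1) * p + 1 is the fractional order-statistic index, and
# E[x_(k)] = k / (n + 1), so E[Q] = h / (n + 1) by linearity.
expected_type7 <- ((n - 1) * p + 1) / (n + 1)

# Simulated counterpart: 20000 independent chunks of n draws, one chunk per column.
chunks <- matrix(runif(20000 * n), nrow = n)
simulated <- mean(apply(chunks, 2, quantile, probs = p))

print(c(expected = expected_type7, simulated = simulated))
```

The expected value is 49.51 / 51, roughly 0.971, so with chunks of 50 the averaged estimate should land clearly below the true 0.99 no matter how many chunks are averaged.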


Actually, doh, you're right, my bad. It turns out sample quantiles are only asymptotically unbiased, so a better estimator would take some sort of weighted average of P99 and P99.5, with the weights depending on the sample size. (You still don't need the full population and you don't have to merge histograms, though.)
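A related knob that already exists in R is the `type` argument of `quantile`. This is not the weighted P99/P99.5 average described above, just a hedged illustration of how much the interpolation rule matters: type 6 uses plotting positions k/(n+1), which match the expected order statistics of a uniform sample. The sketch assumes Uniform(0, 1) data and a chunk size of 500 (an assumption; with chunks of 50 the P99 position falls beyond the largest observation, so no interpolation rule can fix it).

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
n_chunks <- 2000                  # assumed number of chunks
chunk_size <- 500                 # assumed chunk size, large enough that (n + 1) * 0.99 <= n

# One chunk per column.
x <- matrix(runif(n_chunks * chunk_size), nrow = chunk_size)

# Average of per-chunk P99 estimates under two interpolation rules.
mean_type7 <- mean(apply(x, 2, quantile, probs = 0.99, type = 7))  # R default
mean_type6 <- mean(apply(x, 2, quantile, probs = 0.99, type = 6))  # positions k / (n + 1)

print(c(type7 = mean_type7, type6 = mean_type6, truth = 0.99))
```

Under these assumptions the type 7 average should sit noticeably below 0.99 while the type 6 average stays close to it. That only holds for the uniform distribution; other distributions would need their own check.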


I find this part of statistics very hard to argue about. Simulations only work with well-known distributions, and even then not with all of them. Even when a simulation confirms hypothesis A1 using distribution D1, it says little about some pathological distribution D2.

Working it out with mathematical rigor might take years, and often there is no easy answer.

What I've been looking for is something that is fast and works "good enough". Estimating the 95th percentile and averaging it did not work well in my use case. Using the histogram method does work well although it certainly is not perfect.
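A minimal sketch of what a histogram-merging approach could look like, for comparison with averaging per-chunk percentiles. Everything here is an assumption for illustration: the fixed bin edges on [0, 1], the bin width of 0.001, the chunk size of 50, and the helper names `chunk_counts` and `hist_quantile` are all hypothetical, not the implementation referred to above.

```r
set.seed(1)                                # arbitrary seed, only for reproducibility
breaks <- seq(0, 1, by = 0.001)            # assumed fixed bin edges shared by all chunks
n_bins <- length(breaks) - 1

# Count how many values of one chunk fall into each bin.
chunk_counts <- function(chunk) {
  tabulate(findInterval(chunk, breaks, rightmost.closed = TRUE), nbins = n_bins)
}

# Estimate a quantile from merged counts: the upper edge of the first bin
# whose cumulative count reaches p * total.
hist_quantile <- function(counts, p) {
  k <- which(cumsum(counts) >= p * sum(counts))[1]
  breaks[k + 1]
}

x <- matrix(runif(2000 * 50), nrow = 50)   # 2000 chunks of 50 draws, one per column
merged <- rowSums(apply(x, 2, chunk_counts))
hist_p99 <- hist_quantile(merged, 0.99)
print(hist_p99)
```

Because the counts are summed before the percentile is taken, the estimate behaves like a quantile of the full sample, with an error bounded by the bin width plus ordinary sampling noise. The cost is that the bin edges have to be chosen in advance and have to cover the data range.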



Thanks, this is informative and pretty similar to what I've been implementing in JavaScript. I guess I'll have to take a deeper look into the literature when I next update that work.




