Notes

Sampling Samples

August 21, 2019

If you had a set of p90 samples, how would you get a p90 of the overall original data? Would you take the p90 of your p90 samples? Would you take the p50 median of your set of p90 samples? It turns out that neither are correct and it’s actually impossible to reliably recover original percentiles from derived percentile data:

If your original data is grouped into a [10, 10, 10, 10] sample and many [0] samples, your p90 samples sould be [10, 0, 0, 0, 0...] and your p90(p90(data)) would turn out to be 0 (your p50(p90(data)) would turn out to be 0 too).

Even with equal sized samples, p90 samples aren’t useful for finding an overall. If we had 10 sets of samples, [1->10], [11->20], etc. until [91->100], our overall p90 would be 90, but our p90(p90(data)) would be 89.