Averages Can Be Misleading: Try a Percentile (2014)

baq · on April 2, 2019

IMHO plotting the distribution should be the first step before trying to compute its statistics. If you know the shape, you can understand the values - otherwise it's guesswork.

jmngomes · on April 3, 2019

Agreed, this is actually demonstrated by Anscombe's quartet, a set of "four datasets that have nearly identical simple descriptive statistics, yet appear very different when graphed" (https://en.wikipedia.org/wiki/Anscombe%27s_quartet)

stargazer-3 · on April 3, 2019

... although median and percentiles would be different within the Anscombe's quartet.

Sean1708 · on April 3, 2019

But there are other datasets that would do the same thing with medians.

shittyadmin · on April 2, 2019

We've switched to box and whisker plots for most of our statistical reports, they give you a good idea for various important percentiles and adding average indicators is fairly simple if desired. Even for things like latency this can be quite useful.

the8472 · on April 2, 2019

Box plots can still obscure the nature of a distribution, e.g. it might be multi-modal.

Violin plots + outlier dots + additional markers are more helpful. Sometimes the CDF is also more useful than the PDF, e.g. for latencies.

b_tterc_p · on April 2, 2019

I dislike violin plots because they don’t give a good sense of how many points there are overall, and this can be very misleading if you’re trying to compare segments of different size. They also look like female genitalia and I’m 100% serious when I say this tends to distract people for laughs.

I like swarm plots for this kind of task.

https://seaborn.pydata.org/generated/seaborn.swarmplot.html

Edit: well, I suppose I should clarify the comment on violin plots is implementation dependent and biased by my personal preferences for visualization libraries

wyldfire · on April 2, 2019

> don’t give a good sense of how many points there are overall

Consider inner="stick" from [1] instead. It probably comes closer to what you're looking for.

[1] https://seaborn.pydata.org/generated/seaborn.violinplot.html

the8472 · on April 2, 2019

The plots on that page seem to exhibit some clustering and weird shapes which might not be all that great for discrete data or if you have sampling artifacts since it distracts from the overall shape of the distribution.

Showing the number of samples is good though. The combined version seems useful.

cafebeen · on April 3, 2019

I find raincloud plots to be the best of both worlds, and they have been my preference lately:

https://wellcomeopenresearch.org/articles/4-63/v1

They can visualize common statistics like box-plots, the distribution shape like violin plots, as well as the raw data!

dmix · on April 3, 2019

Oh, that is quite nice. Communicates a lot in a single chart.

minimaxir · on April 2, 2019

Violin plots are harder to parse at a glance.

Like all data visualization techniques, which plot type is better depends on the context.

wyldfire · on April 2, 2019

I wholeheartedly agree! Violin plots are a great way to get a dense appreciation for a distribution. Multi-modal distributions are completely masked by mean, even median+std.

pytyper2 · on April 3, 2019

This entire thread is great, 10 data scientists all want their own special chart to be the best. You are all wrong, you should have a view of all these charts!!!! hahaha

shittyadmin · on April 2, 2019

Seems like a good idea, I wish I had more of a statistical background for this kind of thing. Seems like it'd have proved more useful for most software purposes than the calculus I had to take instead as a prerequisite.

It's basically just my boss making suggestions and me implementing, so the results are probably less than optimal for this kind of thing.

photojosh · on April 2, 2019

Even more generally speaking, there's more than a few people saying we should elevate learning statistics over calculus in the general maths curriculum. Arguably it's even more important than trig and geometry.

[0] is just one of the first results from a search that makes this case.

[0] https://stanfordreview.org/calculus-is-overrated-why-we-shou...

viraptor · on April 3, 2019

Or a similar is to violin plots - heat map / waterfall.

jldugger · on April 3, 2019

Not a bad idea, but in my line of work, tools like statsite calculate the summary stats and throw the data away because aint nobody got storage for that. From this viewpoint, it's still important to teach engineers the mean is heavily influenced by outliers. If they're looking to know what the 'average' is, a median is often better, and if they want to know how broad the distribution is, p95/p99 is a good slice into the long tail.

Traster · on April 3, 2019

I'm not sure that's really relevant here, since this is monitoring over time. When you design this system sure you could look at the graphs of data, and you'd still end up with the same impression. The point is that over time you can't collect and analyse all the data always.

The important thing here is understand the underlying system and identify what it's modes of failure are. If the likely mode of failure was just to fail to serve some requests at all then the new approach in this article would be entirely useless since the latencies would be the wrong metric completely.

In this case they identified one mode of failure is for a small % of requests to deteriorate almost exponentially and so watching the 99th percentile is useful - this is very common in networking. That's great for this system but the overall message is to find the metric that actually reflects the ways your system will fail- and you need to do that in a practical way that may not necessarily involve recording and storing all the data.

memling · on April 2, 2019

> IMHO plotting the distribution should be the first step before trying to compute its statistics. If you know the shape, you can understand the values - otherwise it's guesswork.

My manager's dictum is histogram and time series to start, which falls under the same auspices.

KyleBrandt · on April 2, 2019

Kibana can do visualizations of distributions with elastic data. I agree it is the best way to get a clear picture of latency and other metrics within a period of time.

I think percentiles over time, given consistent sample sizes per bucket, can be good for understanding change over time.

In general I think of histograms for these sort of metrics providing higher fidelity of value, where as stats over time can provide higher fidelity of time (for roughly the same storage).

Animating the distribution over time can be pretty neat, but not common feature in my space (but can be done with a little R code).

I imagine 3 axis (3d) histograms over time can make some neat visualizations but have never experimented with that, and would probably want to slice or rotate the plot. More common are heatmaps where the darkness of the color is your third axis. Kind of like viewing the 3d plot of histograms over time from above where the cubes are partly transparent and get darker the deeper they are.

Rafuino · on April 2, 2019

This topic always leads me to think about this great talk from Gil Tene on how NOT to measure latencies (basically, don't use averages!).

https://www.youtube.com/watch?v=lJ8ydIuPFeU

I'm also a huge fan of how Dormando showed latency distributions in one of his recent Memcached Extstore posts. The default is 95th percentile but you can change the percentile to what matters to you (i.e. 99th percentile if you ask me!). Scroll down to see what he did and play with it.

https://memcached.org/blog/nvm-multidisk/

scott_s · on April 2, 2019

I felt a major point of his talk was you should also look at your max latency.

Rafuino · on April 3, 2019

Yeah good point. Max and the various 9s that matter to your organization.

cromulent · on April 2, 2019

There's a great story on 99% Invisible about averages, particularly when used to design cockpits for the average pilot.

https://99percentinvisible.org/episode/on-average/

sohkamyung · on April 3, 2019

Check out this comic on "Why Not to Trust Statistics" [1]. His book, "Math With Bad Drawings" [2] has a chapter on statistics and why not to trust a single statistical measure only.

[1] https://mathwithbaddrawings.com/2016/07/13/why-not-to-trust-...

[2] https://mathwithbaddrawings.com/2018/05/23/math-with-bad-dra...

camel_gopher · on April 3, 2019

Percentiles can be misleading, try a histogram - https://www.circonus.com/2018/11/the-problem-with-percentile...

novaleaf · on April 3, 2019

My own solution, which might be useful to those using javascript (nodejs or browser):

I use mathjs.quantileSeq() and log 0%, 25%, 50%, 75%, and 100%. This seems to be good for "casual metric logs".

I've found that this gives a good shape of the data, as well as the absolute min/max values. If you use 1% or 99% you'll miss the absolute worst performers, and I want to be at least aware of what the worst performance numbers are.

https://mathjs.org/

https://mathjs.org/docs/reference/functions/quantileSeq.html

LiamPa · on April 2, 2019

Site Reliabilty Engineering goes over this in a lot more detail.

https://landing.google.com/sre/books/

spenthil · on April 2, 2019

Specifically Chapter 4, under "Aggregation" https://landing.google.com/sre/sre-book/chapters/service-lev...

phosfox · on April 2, 2019

Reminds me of “Don’t cross a river if it is four feet deep on average.” — Nassim Nicholas Taleb

mitchtbaum · on April 3, 2019

thx.. good summary: http://greatesthitsblog.com/the-black-swan-nassim-nicholas-t...

mikorym · on April 3, 2019

I've used Elasticsearch + Kibana for agricultural data and similarly "expanded" the view out from averages to time series.

People in agriculture love averages and it makes a lot of sense in financial data since averages preserve totals e.g.:

50 ton / ha average over 100 ha = 5 000 tons

At the same time summing each individual ha gives you 5 000 tons total.

But once you realise that you can expand on this, things get really interesting. I don't know of other people working on the same problems that I am working on, but they are relevant both economically (in the sense of making money) and environmentally (in the sense of improving efficiency and managing climate).

SketchySeaBeast · on April 2, 2019

More knowledge is always better, but percentiles are a little misleading as well - the 99% at 867 ms latency makes you have a moment of panic, but when you see that 95% is 60 ms, then you really realize how few of your visitors are experiencing the slow response. Might it be a problem? Possibly, and I has brought awareness to that potential, but it also has the possibility to blow it out of proportion if you don't look at the rest of the data.

Edit: I'm not saying Averages are better, but that Percentiles can be misleading as well.

Xorlev · on April 2, 2019

> but that Percentiles can be misleading as well.

I'm not sure I agree. If they're computed wrong, sure, but this is what your system is actually doing. And honestly, the tail has a way of dictating your system's performance as a whole.

> the 99% at 867 ms latency makes you have a moment of panic, but when you see that 95% is 60 ms

It's easy to write off 1 in 100 users, but the reality is a little more dim. If your P(slow request) is normally distributed (it isn't always -- some requests are more expensive, some data is on worse disks, etc.), then you can compute the (extremely rough) probability a user will run into a slow request in a session:

P(slow request for user) = 1 - (0.99)^N

N = number of requests.

For example, lets say a user visits 15 pages in a session with that call in each. They have a ~13.9% chance of running into that 99th percentile. :(

Now if you're fanning out lookups (as one often does), you could easy have 50 lookups for a single request. Now you're at 39.5%! What happens in 1% of requests can become extremely important and essentially dictate your user's experience.

The Tail at Scale [1] talks a lot about this. I'd recommend it as reading.

[1] https://blog.acolyer.org/2015/01/15/the-tail-at-scale/

the8472 · on April 2, 2019

> the 99% at 867 ms latency makes you have a moment of panic, but when you see that 95% is 60 ms, then you really realize how few of your visitors are experiencing the slow response.

And then you realize that visitors are hitting your application with hundreds of requests per page load and the 99th percentile suddenly becomes your average.

And then you realize that you didn't plot the windowed maximum and have some crazy hangs every now and then that block entire page loads for a whole minute.

chewbacha · on April 2, 2019

Percentages without volume are still useless though. 99% percentile of 1 billion requests could still mean it's impacting a huge number of users.

moral: No one number is a panacea

_qwfv · on April 2, 2019

Having multiple percentiles is key, but I can't think of a time when I've ever found average to be useful.

dragontamer · on April 2, 2019

The average (or more specifically: arithmetic mean) has a number of key properties that allows for advanced analysis.

If you have two pieces of data: say the average roll of a D6 (6-sided dice) + average roll of D20 (20-sided dice), you wouldn't be able to do anything with percentiles.

The 90% percentile roll of D20 is 18. The 90% percentile roll of D6 is 5(ish). But the sum of these numbers tells us nothing (18 + 5 == 23, which tells us... nothing).

In contrast, the mean / average roll of D20 is 10.5, while the average roll of D6 is 3.5. The average roll of D20 + average of D6 is 14.

You can add and subtract averages just fine, combining separate pieces of data to make a larger conclusion. You cannot do the same with percentages.

----------

There's nothing "misleading" about averages. Its just that most people are awful at statistics. The real lesson is... learn statistics, so that you can learn to use these tools correctly.

Averages, and standard deviation / variance, are excellent for combining data. Its a blurry picture, but its still mathematically correct. Percentiles / quartiles / graphs are more precise and allow for deeper conclusions... but its not always possible to create a percentile graph, especially if you cannot directly measure some attribute. "Indirect measurement", by way of arithmetic, averages, and standard deviation, is quite useful.

----------

EDIT: While I'm talking statistics, don't forget about the three kinds of averages: Arithmetic Mean is the most common, but is often meaningless. You have to also understand geometric mean, and harmonic mean.

Multiplicative problems should use a geometric mean. Harmonic Mean is used when comparing speeds (roughly). Going 60MPH for 10 miles, and then going 120MPH for 10 miles is averaged using Harmonic Mean.

Benchmarks should typically use harmonic mean: the arithmetic mean is meaningless in the scope of benchmarks. Etc. etc.

But this is all statistics, that for some reason is rarely taught well in the US High School level. IMO, its an issue with our school system... we really should be teaching more statistics, especially given how common the analysis of data is in today's world.

coldtea · on April 2, 2019

>There's nothing "misleading" about averages

Well, the fact that the average of my wealth and a Bill Gates' wealth is dozens of billions of dollars shows why averages are misleading -- and frequently used to give a false image with statistics.

Misleading doesn't necessarily mean mathematically or factually wrong. Something can be totally true, and still give the false impression, and averages do that.

And saying "people should learn statistics" wont change this. Even knowing statistics, the average doesn't tell me much.

dragontamer · on April 2, 2019

> Well, the fact that the average of my wealth and a Bill Gates' wealth is dozens of billions of dollars shows why averages are misleading -- and frequently used to give a false image with statistics.

I'm not sure I follow your example. But let me explain my point of view first.

Lets say we have a game, where we flip a coin. Heads, you win your amount of money (lets say $100,000). Tails, you win Bill Gate's money (lets say $50 Billion).

If we play the game 50 times, how much money will you make on the average?

Well, that's just 50 times the expected value of the game. Even if we have grossly separate results for heads vs tails, the mathematical properties of the arithmetic mean / common average remains the same.

----------

"Proper use" of averages depends entirely upon the use of the number afterwards. How you're interpreting the data and why. That's solidly within the realm of statistics: understanding exactly what "Average" means, and using the math correctly.

I fully agree with you that very few people out there seem to understand the definition of "Average". In fact, I personally prefer the more precise term "arithmetic mean", specifically so that I don't mislead those who say "Average" (also, "average" is ambiguous in the statistical world: mean, median mode? If mean, then arithmetic, geometric, or harmonic??).

But that's more of a writing issue as opposed to an interpretation / mathematical issue. I've used ALL average calculations (median, mode, arithmetic mean, geometric mean, and harmonic mean) at some point in my life. They're useful, and I'm not even a statistician by trade. (Actually, I use those calculations mostly in my video-game analysis...)

ComputerGuru · on April 3, 2019

You have a knack for explaining statistics in an approachable way.

SketchySeaBeast · on April 2, 2019

> Even knowing statistics, the average doesn't tell me much.

Depends on what you're doing. Bill Gates and my wealth averaged won't tell me anything, but taking my average speed on the highway will give me a good estimate on when I'll get to my destination - in this case the 99th percentile is only interesting to highway patrol, and won't help me estimate when I'll get home. It depends on what you want out of the information.

lultimouomo · on April 3, 2019

99th percentile over your average speed across trips it's arguably the most useful number, if you want to make almost sure you're not late.

Armisael16 · on April 3, 2019

(the arithmetic mean is actually pretty bad for estimating travel time unless your speed stays in a tight band)

coldtea · on April 3, 2019

>but taking my average speed on the highway will give me a good estimate on when I'll get to my destination

Like in the Bill Gates example, if you had a clear highway and went with 100 mph for the first hour, and then hit a busy traffic and are going with 20 mph for the last 5 minutes (with no foreseeable end in sight), the average wont help you much.

It's only really useful for this purpose when there is constant speed (as in school math problems), or few extremities (or when any extremities neatly cancel out).

See how average can be misleading?

SketchySeaBeast · on April 3, 2019

If the last 5 minutes of a trip are incredibly slow, there's no chance you'll be able to use estimates to determine when you'll get there anyways, as at that point you're not actually estimating, you're now there.

coldtea · on April 4, 2019

Which is neither here, nor there.

This is not about literally seeing into the future, and knowing about unforeseen events.

This is about making the best prediction with the data you already have.

And if (as per example) you had "1000 miles of 100mph and 100 miles of 5mph in some traffic jam" with 500km left to go, the average will be misleading as it will lose too much info to be able to help you give the best prediction.

In that situation I wouldn't reach for the average, to make a prediction to the person waiting for me. Would you?

dragontamer · on April 4, 2019

> And if (as per example) you had "1000 miles of 100mph and 100 miles of 5mph in some traffic jam" with 500km left to go, the average will be misleading as it will lose too much info to be able to help you give the best prediction.

This is literally the use case of the harmonic mean, a type of average. Harmonic mean gives you the accurate estimate in this case.

As I stated before: its about using the correct calculation for the correct reasons. It requires an understanding of statistics, and the 5-different-kinds of average (mode, median, arithmetic mean, geometric mean, harmonic mean) to pick the correct one for any particular use case.

The arithmetic mean in this case is simply wrong. Just... wrong. It is meaningless. You have to use the harmonic mean.

--------

It takes experience and practice to know which "average" to use. But anybody can learn it if they put forth the effort. It is definitely teachable to high school, and even middle-school children. (Its just not commonly taught in America for some reason).

lottin · on April 2, 2019

Statistics is about summarising data. It is implicit in the notion of "summarising" that you won't get all the information... but that's the whole point.

coldtea · on April 3, 2019

It's not about getting all the information.

It's about retaining the relevant parts in the summary.

Some summaries are better in that than others.

And some summaries are used (e.g. by the media or corporations looking to mislead) precisely because they are handy for hiding important information in their summarized version.

roywiggins · on April 2, 2019

This works because the arithmetic mean of a large number of trials should converge to the "expected value", and expectation is linear: E(X+Y) = E(X)+E(Y). So this can be a useful way to reason about sums of random variables.

pmart123 · on April 2, 2019

Percentiles can be misleading if the data follows a Poisson distribution, especially if the lambda coefficient is closer to 1. Latencies I would imagine, would typically be Poisson due to arrival time.

Lightbody · on April 2, 2019

One of my favorite (short) talks on this topic. Well worth a few minutes of your time:

https://www.youtube.com/watch?v=coNDCIMH8bk

Aengeuad · on April 3, 2019

I know it's in the spirit of the talk, but the histogram at 10:45 and the related discussion about how the latency improved for most users but the average latency increasing meaning a worse experience for other users reminds me of the anecdote a Google engineer had when Youtube started rolling out their HTML5 player, the responsiveness of the page had increased but the average latency graphs went up. This wasn't down to it being a bad update, or some users getting a worse experience - not really anyway, but the switch to the HTML5 player allowed a wider audience to start using Youtube where they wouldn't have been able to do this previously. A change increasing average latency, even on a histogram, doesn't necessarily mean it's a bad change. Look at your data indeed.