In order to understand systems performance and usage, numerous metrics are collected, stored, and processed throughout infrastructures. But simply possessing that data is meaningless without knowing how to gain insights about the system from it.
Why use percentiles?
The simplest and most widespread first attempt at data analysis is to calculate an average from a dataset. While this is easy and can work in some cases, it is also unreliable if data points include extreme outliers. Percentiles instead provide insights into the real data points, by highlighting values from the dataset itself.
Their definitions look similar at first glance, but there are significant differences:
- Average (or mean average): Add up all the values and divide by the number of observations. While straightforward, the mean is highly sensitive to outliers.
- Median (or P50 / 50th percentile): This is the middle value in the sorted dataset, with half the values above it and half below. It's robust to extreme values and better reflects typical performance.
The outlier issue becomes obvious when looking at a specific example. Let's assume an API reports metrics for response times in milliseconds, and today's values are {45, 48, 50, 15000, 52, 49}. The average response time for the day would be 2540.67 ms, but the median is 49.5 ms. This highlights how the median ignores the extreme 15,000 ms value, while the mean average is skewed to a point where it no longer represents the typical response times throughout the day.
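These numbers are easy to verify; here is a minimal Python sketch using the standard library's statistics module on the same hypothetical response times:

```python
from statistics import mean, median

response_times_ms = [45, 48, 50, 15000, 52, 49]

print(mean(response_times_ms))    # ~2540.67 -> skewed by the single outlier
print(median(response_times_ms))  # 49.5     -> the typical response time
```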
Calculating percentiles
Calculating a percentile is quite a bit more complex than the mean average, and it can be impractical for very large or streaming data sets because the whole data set needs to be available. First of all, the data must be sorted in ascending order (lowest to highest value); then a fractional rank can be calculated using:
R = P / 100 * (n + 1)
The variable P is the percentile, for example 90 for P90. n is the total number of values in the data set, and R is the rank needed in the next step.
The next part is conditional: if the resulting rank R is an integer, you can use it directly. Assuming R is 3, the percentile is the 3rd value in the sorted data set (careful programmers: counting starts at 1, not 0!).
If R is not an integer, you need to compute two more values:
lower = floor(R)
upper = ceil(R)
fraction = R - lower
Then you look up the values at these positions. For example, if R was 2.5, then lower is 2 and upper is 3, with fraction at 0.5. Use these to index into the dataset: for lower = 2, lower_value is the 2nd element of the data set, and for upper = 3, upper_value is the 3rd element.
Finally, fill the variables into the calculation:
Percentile = lower_value + (fraction * (upper_value - lower_value))
The resulting number is the percentile in question. If the rank falls outside the data set, as it can for the 0th or 100th percentile, simply clamp it and use the first or last value respectively.
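The steps above translate directly into a short function. The following is a minimal Python sketch of the method as described here; other tools and libraries may use slightly different rank and interpolation rules, so their results can differ a little.

```python
import math

def percentile(data, p):
    """Percentile via the rank formula R = P / 100 * (n + 1) with linear interpolation."""
    values = sorted(data)                # the data must be in ascending order
    n = len(values)
    rank = p / 100 * (n + 1)             # fractional, 1-based rank

    # Clamp ranks that fall outside the data set (e.g. P0 or P100).
    if rank <= 1:
        return values[0]
    if rank >= n:
        return values[-1]

    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower

    lower_value = values[lower - 1]      # counting starts at 1, not 0
    upper_value = values[upper - 1]
    return lower_value + fraction * (upper_value - lower_value)

print(percentile([45, 48, 50, 15000, 52, 49], 50))  # 49.5, matching the median from earlier
```

For integer ranks the fraction is zero, so the interpolation simply returns the value at that position.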
P25
The 25th percentile can be used to measure what is typical for the lower quarter of the dataset. Its primary use is to identify underused resources or to pin down heavily varying data.
For example, if a Kubernetes cluster reports these metrics:
Node 1:
- Average CPU: 45%
- P25 CPU: 15%
Node 2:
- Average CPU: 42%
- P25 CPU: 12%
Node 3:
- Average CPU: 38%
- P25 CPU: 8%
It is clear to see that while all nodes seem to be decently utilized on average, each of them is heavily underused for at least 25% of the time. Using P25 to measure resource utilization can identify targets for optimization, like scaling down the cluster in the example above. Other applications include avoiding network congestion by spotting low-traffic hops in a network, or measuring a baseline of minimal performance.
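As a quick illustration, the sketch below applies the percentile function from the previous section to some hypothetical per-node CPU samples and flags likely scale-down candidates (the sample values and the 20% threshold are invented for the example):

```python
# Hypothetical per-minute CPU utilization samples (percent) per node.
nodes = {
    "node-1": [5, 10, 15, 40, 55, 60, 70, 85],
    "node-2": [4, 8, 12, 35, 50, 62, 68, 80],
    "node-3": [2, 5, 8, 30, 45, 55, 65, 75],
}

for name, samples in nodes.items():
    p25 = percentile(samples, 25)        # percentile() from the sketch above
    if p25 < 20:                         # arbitrary "often nearly idle" threshold
        print(f"{name}: P25 CPU {p25:.1f}% -> candidate for scale-down")
```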
In a different scenario, P25 can also be used to better understand a dataset. Assume you measured the latency of read and write queries in a database separately:
Read queries:
- Average time: 200ms
- P25: 50ms
Write queries:
- Average time: 180ms
- P25: 150ms
Again, the averages are very similar, but the P25 values reveal that the write queries perform very evenly, with 75% of them taking at least 150ms against a 180ms average, while the read queries vary a lot more, ranging from 50ms at the low end up to a 200ms average.
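The same comparison can be scripted. The sketch below uses hypothetical latency samples chosen to roughly reproduce the figures above, again reusing the percentile function from earlier:

```python
# Hypothetical latency samples in milliseconds, not real measurements.
read_ms  = [40, 50, 60, 90, 250, 320, 380, 410]
write_ms = [145, 150, 155, 170, 180, 190, 200, 250]

for name, samples in (("read", read_ms), ("write", write_ms)):
    average = sum(samples) / len(samples)
    p25 = percentile(samples, 25)        # percentile() from the sketch above
    print(f"{name}: average {average:.0f}ms, P25 {p25:.0f}ms")
```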
P50 (median)
The 50th percentile is commonly referred to as the median and represents the middle of the dataset, where half the dataset is lower and the other half is higher than the median value. It shows a "typical" value for a dataset and is a much more reliable way to find the "average behavior" of a system than the mean average itself.
The median represents normal operating conditions and is a great indicator to check a system's overall health without skewed data from critical error conditions or outages. It is used to measure anything from typical load times of websites to normal query execution speeds or hardware utilization levels. Knowing the median value for a metric allows operators to plan baseline capacity, anticipate what usage to expect under normal conditions, and identify trends over time.
P90
Using the 90th percentile gives a more complete picture of a dataset while still excluding extreme outliers that would unnecessarily skew the value. Only the highest 10% of values are discarded, giving operators a better understanding of whether a few values are dragging the average up, or whether the dataset is high overall.
Given an API reports P90 latencies for its endpoints:
- /users endpoint P90: 150ms
- /products endpoint P90: 400ms
- /checkout endpoint P90: 180ms
The 400ms value of the /products endpoint stands out immediately, with high confidence that it is not caused by one or two runaway worst-case transactions. The mean average would have included them, leaving doubts about the data, while P90 explicitly disregards the worst 10% of transactions, making the value much more reliable to use and base decisions on.
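To see how P90 shrugs off a single runaway request, here is a small sketch with made-up latency samples (18 typical requests plus one extreme outlier per endpoint), again reusing the percentile function from earlier:

```python
# Hypothetical latency samples in milliseconds; numbers are invented for illustration.
endpoints = {
    "/users":    [100 + 3 * i for i in range(18)] + [5000],
    "/products": [300 + 6 * i for i in range(18)] + [9000],
    "/checkout": [120 + 4 * i for i in range(18)] + [7000],
}

for path, samples in endpoints.items():
    mean = sum(samples) / len(samples)
    p90 = percentile(samples, 90)        # percentile() from the sketch above
    print(f"{path}: mean {mean:.0f}ms, P90 {p90:.0f}ms")
# The single outlier per endpoint dominates the mean but barely moves P90.
```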
P90 can be used to characterize system behavior under load spikes or peak usage conditions, but it is also commonly used in service-level agreements (SLAs) in the enterprise world, assuring customers that 90% of requests will complete within a set latency. Many auto-scaling systems look at P90 values of CPU or memory usage to decide when to scale a service up in order to maintain availability.
Other percentiles
There are many other percentiles one may use to derive insights from a data set. Common ones include P75, to catch early warning signs of overload or as a soft scaling hint, and P95 / P99 for more aggressive performance metrics and worst-case identification. In practice, you may use any percentile for measuring, but the common ones provide well-understood information about a set of metrics, while a custom percentile like P32 needs careful explanation to everyone else involved in the analysis, like coworkers or contractors. It is up to you to pick the right analysis tools to derive conclusions from your metrics.