Be aware: a total Grafana gotcha


This blog is mostly about my research on DevOps organizational structures. However, I think it’s also worth recording here some epiphanies related to DevOps technologies (example: post about Blunder). In this post, I’ll talk about a tricky detail in Grafana. Let’s go!

Consider the following configuration of a Grafana panel.

Configuration of a Grafana panel, using a counter metric and with the "Total" key turned on in the graph legend configuration

Figure 1 - Setting up a Grafana panel using a Prometheus counter metric

In this post, we will clarify what the “Total” setting, enabled in the side panel, actually means. But baby steps…

This panel reads data from a Prometheus instance and plots the volume of requests to a certain service. Each series in the graph represents a combination of client and accessed resource: the client is identified by its IP and a client name, while the resource is related to the URL accessed. The PromQL (Prometheus query) expression used is:

sum(increase(counter_requests{env="pro"}[1m])) by (clientName, resource, clientIp)

And what does it mean? This is important: each point on the graph represents the increase in the number of requests over the preceding minute for a given group (client × resource) in the production environment. We will call this “1 minute” (specified by “1m”) the moving average window (strictly speaking, it is the time range over which increase is computed, not an average). A really cool tutorial that explains Prometheus counters and the corresponding queries is this one. Well, see the result in the figure below.

Grafana panel corresponding to the configuration explained above

Figure 2 - Grafana panel in action using the configuration previously presented

Normally the graph shows several series (lines with data), as in the preview visible in Figure 1. In Figure 2, however, I clicked on the legend of one of the series. Note that the chart has one series per group and, correspondingly, one legend entry per series. When you click on a legend entry, Grafana shows only the corresponding series. Next to each legend entry, it is possible to display a value summarizing the displayed data. According to the options in the side menu, the available summaries are: min, max, avg, current, and total. The meaning of most of these values is quite evident… if we have 20 points in the graph (which is what happened to me), min, for example, will show the smallest of these values; max, on the other hand, will show the largest. But what about total? What does it mean? Well, total is the sum of the values (it could be called sum, couldn’t it?). But there is a detail…

Now comes the gotcha for the unsuspecting. If you followed closely, maybe you won’t be fooled, but I was, so here we go… the graph shows the evolution of our measure over the last 5 minutes. The tempting misinterpretation is that the total value is the increment in the number of requests from that group in the last 5 minutes.

So let’s be emphatic: the total value, summarizing the series, is not the increment in the number of requests from that group in the last 5 minutes.

So, what is it? That’s what we discussed… it’s the sum of the values of all points on the graph! And what does that mean? Well, in absolute terms, nothing interesting (although in relative terms, it may be helpful to compare different moments in the graph).

Let’s dig into the details. The points in Figure 2 have the following values: 2.4 3.6 3.6 3.6 2.4 3.6 3.6 3.6 2.4 3.6 3.6 3.6 2.4 3.6 3.6 3.6 2.4 3.6 3.6 3.6. By adding these values, we get 66, which is reasonably close to the 69.6 shown in the graph (remember, never expect exact values in Grafana; several rounding and interpolation issues are going on).
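To make the arithmetic concrete, here is a minimal Python sketch that recomputes the legend summaries from the 20 values listed above. The function name legend_summaries is my own, and the code only mirrors the meaning of the summaries as described in this post; it is not Grafana’s actual implementation.

```python
# Values read off Figure 2, as listed in the post: 20 points, each one an
# increase over the preceding 1-minute window.
points = [2.4, 3.6, 3.6, 3.6] * 5


def legend_summaries(values):
    """Plain arithmetic over the plotted points (hypothetical helper,
    mirroring the summaries as the post describes them)."""
    return {
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "current": values[-1],   # the last point on the graph
        "total": sum(values),    # just the sum of the plotted points
    }


summaries = legend_summaries(points)
for name, value in summaries.items():
    print(f"{name}: {value:.1f}")
```

The point to notice is that total is nothing more than sum(points): nothing in the computation knows about the 5-minute range or the 1-minute window.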

Let us now focus on the first two points. Remember, each point represents the increment in requests between that point and 1 minute before it. With 20 points spread over 5 minutes, the interval between two consecutive points is about 5*60/(20-1) ≈ 16 seconds. Now consider the following scheme:

                     P1      P2
|---------60s--------|--16s--|
|--16s--|---------60s--------|

Note… the value in P1 is the increment over its last minute. The value in P2 is the increment over its own last minute, which includes part of the time already considered by P1. In other words, P1 accumulates the requests that happened between time 0 and time 60, while P2 accumulates the requests that occurred between time 16 and time 16+60. Thus, requests that occur between time 16 and time 60 are counted both in the value of P1 and in the value of P2. So, if we do P1+P2, we are counting these requests twice! Consequently, the sum P1+P2 does not represent the increment of requests in the last 32 seconds (32 = 2*16, with 16 being the interval between the points).
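The double counting can be checked with a small simulation. This is only a sketch under assumptions of my own: a hypothetical counter growing at a perfectly steady rate of about 3.3 requests per minute, 20 evenly spaced points, and a simplified increase that just subtracts two counter readings (the real Prometheus increase() also extrapolates, which is ignored here).

```python
WINDOW_S = 60    # the "1m" in increase(...[1m])
RANGE_S = 300    # the panel shows the last 5 minutes
N_POINTS = 20    # points plotted in the panel


def counter_at(t, rate_per_s):
    """Hypothetical counter that grows at a perfectly steady rate."""
    return rate_per_s * t


def window_increase(t, rate_per_s):
    """Simplified increase over [t - WINDOW_S, t]: difference of two readings."""
    return counter_at(t, rate_per_s) - counter_at(t - WINDOW_S, rate_per_s)


rate = 3.3 / 60                     # assumed: ~3.3 requests per minute
step = RANGE_S / (N_POINTS - 1)     # ~16 s between consecutive points
timestamps = [WINDOW_S + i * step for i in range(N_POINTS)]
points = [window_increase(t, rate) for t in timestamps]

# What the legend "total" adds up: 20 overlapping 1-minute windows.
legend_total = sum(points)

# What actually happened between the first and the last plotted point.
true_increase = counter_at(timestamps[-1], rate) - counter_at(timestamps[0], rate)

print(f"step between points: {step:.1f}s")
print(f"legend total:        {legend_total:.1f}")
print(f"true 5-min increase: {true_increase:.1f}")
print(f"overcount factor:    {legend_total / true_increase:.1f}")
```

Under these assumptions, the overcount factor is roughly the number of points times the window size divided by the displayed range, so it changes whenever the panel resolution or the time range changes.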

Therefore, again: the total value, summarizing the series, is the sum of the displayed values, which differs from the increment in the number of requests for that group in the last 5 minutes.

Okay, but what to do?

Well, maybe the best option is not to use any summary in the series legend. After all, the other summaries also require some attention. The least problematic is current, which shows the increment in requests in the last minute, corresponding to the last point on the graph. However, a single point is seldom interesting… what matters more is the recent behavior, the last several points. That’s why min and max are not so attractive either, especially min. If you want to use max, remember: it shows the increment obtained in the minute with the largest increment within the analyzed period.

Finally, let’s go to avg, which is the average. It may be a more interesting summary. But first, a warning: the average always has its dangers. In steady-state use, with minor fluctuations, OK, the average can be interesting. But if we analyze a whole day… normally, business hours have many more requests than the early morning hours. In that case, maybe the average doesn’t mean much. And if you adopt the average, remember to ask: average of what? Here, it is the average increment per minute over all the minutes analyzed.
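As a rough illustration of “average of what?”, the sketch below does two things. First, it turns the avg of the Figure 2 points into an approximate increase for the whole 5 minutes, by multiplying the average per-window increment by the number of windows that fit in the range. Second, it shows two made-up series with the same average and very different behavior. The helper estimate_range_increase and the steady and day_night series are hypothetical, and the estimate is only an approximation that assumes the points sample the range evenly.

```python
from statistics import mean


def estimate_range_increase(avg_per_window, range_s, window_s):
    """Rough total increase over the range: average per-window increase
    times the number of windows that fit in the range (an approximation)."""
    return avg_per_window * (range_s / window_s)


# Values from Figure 2: increases per 1-minute window.
points = [2.4, 3.6, 3.6, 3.6] * 5
avg = mean(points)
estimate = estimate_range_increase(avg, range_s=300, window_s=60)
print(f"avg per minute:              {avg:.1f}")
print(f"estimated 5-minute increase: {estimate:.1f}")

# Two invented hourly series with the same average but a different story:
# one steady, one quiet at night and busy during the day.
steady = [10.0] * 24
day_night = [1.0] * 12 + [19.0] * 12
print(f"steady avg:    {mean(steady):.1f}")
print(f"day/night avg: {mean(day_night):.1f}")
```

The estimate lands far below the 66 obtained by summing the points, which is consistent with the overlap between windows discussed earlier.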

Ah, throughout the explanation, to simplify the text, I always said “minute”. However, where I said “minute”, please read “the moving average window”, which is specified in the PromQL expression. Finishing up… note that this window size is hidden from the panel user. So, to help them understand the summary, it’s good to make the window explicit in the chart title. For example: if the series legend shows the average value (avg), “Amount of requests per minute” helps more than “Amount of requests.”

If you notice any mistake in this post, please let me know via Twitter (@leonardofl).