0/ When large eng orgs rely on metrics for both monitoring *and* observability, they struggle with cardinality.

This is a thread about “the two drivers of cardinality.” And which one of those we should kill. :)

🧵👇

1/ Okay, first off: “what is cardinality, anyway?” And why is it such a big deal for metrics?

“Cardinality” is a mathematical term: it’s *the number of elements in a set*... boring! So why tf does anybody care??

Well, because people think they need it, and then, suddenly: “$$$$$$$.”
2/ When a developer adds a (custom) metric to their code, they might think they’re just adding, well, *one metric*. …
3/ … But when they add “tags” to that metric – like `method` or (shiver) `customer` – they are actually creating a *set* of metric time series, with the *cardinality* of that set being the total number of unique combinations of those tag values.
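To make the multiplication concrete, here’s a minimal sketch – the tag names and counts are hypothetical, not from any real system:

```python
# Cardinality is multiplicative: a tagged metric fans out into one
# time series per unique combination of tag values.
tag_values = {
    "method":   10,      # hypothetical: 10 RPC methods
    "status":   5,       # hypothetical: 5 status codes
    "customer": 50_000,  # uh oh: one value per customer
}

cardinality = 1
for distinct_values in tag_values.values():
    cardinality *= distinct_values

print(f"time series for this ONE metric: {cardinality:,}")
# -> time series for this ONE metric: 2,500,000
```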
4/ The problem is that some of those tags have many distinct values.

E.g., when I was working on Monarch at Google, a single Gmail metric grew to over 300,000,000 (!!!) distinct time series. In a TSDB, cardinality is the unit of cost.

So again, “$$$$$$$.”
5/ Okay, so that’s why people care about metrics cardinality. Now, what are the two *drivers* of that cardinality?

A) Monitoring: more detailed *health measurements*
B) Observability: more detailed *explanations of changes*

Let’s take these one at a time…
6/ First, “More Detailed Health Measurements” (monitoring):

Consider an RPC metric that counts errors. You need to independently monitor the error rate for different methods. And so – voila – someone adds a “method” tag, and now there’s 10x the cardinality for that metric.
7/ … And also 10x the cost. But that’s justifiable, as there’s a business case for independently monitoring the reliability of distinct RPC methods.

Put another way, you might rightly have different error budgets for different RPC methods, so their statistics must be separable.
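As a sketch of what this looks like in code (using the Prometheus Python client purely for illustration – the thread isn’t tied to any particular TSDB):

```python
from prometheus_client import Counter

# One metric declaration, but one time series per distinct "method" value.
RPC_ERRORS = Counter(
    "rpc_errors_total",
    "Count of failed RPCs",
    labelnames=["method"],  # each distinct value adds another series
)

def record_error(method: str) -> None:
    # 10 methods -> 10x the series (and cost), but each method's error
    # rate is now independently alertable against its own budget.
    RPC_ERRORS.labels(method=method).inc()
```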
8/ Bottom line: When it comes to “measuring health,” we often *need* cardinality in order to home in on the signals we actually care about. Increasing cardinality to *proactively* monitor the signals we care most about is usually a worthwhile tradeoff.
9/ Now what about using cardinality for “More Detailed Explanations of Changes?”

This is the real culprit! And, frankly, should be abolished. :) Metrics cardinality is the wrong way to do observability – to explain changes.

More on that…
10/ Say your monitoring tells you that there’s a problem with a critical symptom – e.g., you’re suddenly burning through an SLO error budget at an unsustainable clip.
11/ After a painful outage, say you realize a single customer DoS’d your service. So someone adds a `customer` tag “for next time.”

But this is unsustainable: each incident reveals a new failure mode, devs keep adding tags, and before long your metrics bill is out of control.
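The math on why this spirals (all numbers hypothetical):

```python
# Each post-incident tag multiplies into every existing combination.
series = 10  # start: a modest "method" tag
for new_tag_values in (50_000, 200, 30):  # customer, build, region...
    series *= new_tag_values

print(f"{series:,} time series")  # -> 3,000,000,000 time series
```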
12/ The problem, in a nutshell:

Distributed systems can fail for a staggeringly large number of reasons. *You can't use metrics cardinality to isolate each one.*
13/ How *should* we explain changes to production systems?

Understanding change is the central problem of observability. Effective workflows might *start* with metrics, but they must pivot towards a multi-telemetry, multi-service guided analysis.
14/ So, to sum up: spend your limited cardinality budget on *monitoring*, and then look for observability that (a) naturally explains changes and (b) relies on transactional data sources that do not penalize you for high/unbounded cardinality.
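For (b), the usual move is to attach high-cardinality detail to transactional telemetry (traces, structured events) instead of metrics. A minimal sketch with the OpenTelemetry Python API – the attribute names here are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("rpc.server")

def handle_rpc(method: str, customer_id: str) -> None:
    # On a span, `customer.id` is just an attribute on one trace event:
    # it costs the same whether you have 10 customers or 10 million,
    # because no per-unique-value time series is ever created.
    with tracer.start_as_current_span("handle_rpc") as span:
        span.set_attribute("rpc.method", method)
        span.set_attribute("customer.id", customer_id)
        ...  # actual request handling goes here
```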
15/ PS: For more on how to distinguish monitoring and observability, see this thread: https://t.co/2RHs36Nknc

PPS: If you’d like to discuss/debate/request-more-detail any of the above, reply to this thread or DM me!
