Scalable Prometheus Architecture
What might a scalable monitoring architecture look like?
Published: Tuesday, Aug 10, 2021 Last modified: Wednesday, Mar 29, 2023
tl;dr: don’t focus on real time; metrics are not events; scrape, don’t push
Prometheus (“prom”) is the industry-standard ecosystem for metrics collection within the observability space. Essentially it’s a time series database, typically queried via https://grafana.com/. Architecture considerations come into play if you have extreme requirements. Most of the time, however, you are simply scraping metrics from “metric exporters” across AWS accounts.
You can architect Proms to be in a hierarchy and use recording rules or plain federation to aggregate metrics upstream.
With Prometheus, remember you are sampling; you are not capturing every event.
You are looking to observe trends. My philosophy is that it’s not healthy to stare at “real time” metrics coming in and try to make real time decisions. If I notice that the global scrape_interval: 1m setting has been lowered to under 10s, I know people have misunderstood their metrics pipeline. Quick decisions are triggered off events from your Cloud provider, not metrics!
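For reference, the setting in question lives in the global block of prometheus.yml. This is a minimal sketch rather than a recommended configuration; the evaluation_interval line is included only for context and is not discussed in this post.

```yaml
# prometheus.yml (sketch)
global:
  scrape_interval: 1m      # the Prometheus default: each target is sampled once a minute
  evaluation_interval: 1m  # how often recording and alerting rules are evaluated
```

A one-minute sample is plenty for spotting a trend over hours or days, which is the job metrics are meant to do.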
Metrics are not events; metrics are for observing trends to pre-empt any Site Reliability issues. Good SRE is knowing a week ahead that you will run out of disk space (hint: predict_linear), not alerting when it happens!
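As a rough illustration of the predict_linear hint, an alerting rule along these lines could warn a week ahead. It assumes node_exporter is being scraped (node_filesystem_avail_bytes is one of its metrics); the group name, alert name, one-day lookback window and severity label are all illustrative choices, not recommendations.

```yaml
# disk_rules.yml (sketch; group and alert names are hypothetical)
groups:
  - name: disk_trends
    rules:
      - alert: DiskPredictedFullInAWeek
        # Fit a line to the last day of samples and extrapolate 7 days ahead
        # (7 * 24 * 3600 seconds). A negative result means the trend crosses
        # zero free bytes within the week.
        expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[1d], 7 * 24 * 3600) < 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill within a week"
```

The `for: 1h` clause is there so that a short burst of writes does not fire the alert; only a trend that holds for an hour does.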
Normal “observability” Architecture
Metrics are fed (scraped) back to a central “observability” account for shared analysis in one of three ways:
- Directly from the exporter’s /metrics endpoint (simplest)
- Federation; example on how to scrape via Federation…
- Recording rules, which are a way for a local team to express what’s important and deliver signal
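A rough sketch of how the last two options might fit together, following the patterns in the Prometheus documentation. The rule name job:http_requests:rate5m, the metric http_requests_total and the target team-prometheus.example.internal:9090 are placeholders, not taken from any real setup.

On the team’s own Prometheus, a recording rule condenses raw series into the signal the team cares about:

```yaml
# team_rules.yml (sketch; names are hypothetical)
groups:
  - name: team_signals
    rules:
      # Pre-compute a per-job request rate so upstream only needs this one series.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

On the central “observability” Prometheus, a federation job then scrapes only those aggregated series from the team’s /federate endpoint:

```yaml
# prometheus.yml on the central server (sketch; target is hypothetical)
scrape_configs:
  - job_name: 'federate'
    honor_labels: true          # keep the labels set by the team's Prometheus
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'   # only pull recorded (aggregated) series
    static_configs:
      - targets:
          - 'team-prometheus.example.internal:9090'
```

Restricting match[] to the recorded series keeps the upstream server small: the detail stays with the team and only the signal travels.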
Notice at any time one can scrape the various addressable target endpoints, without relying on a potential central point of failure. This is not the case when you push metrics.
The highly scalable Architecture
You probably do not need https://thanos.io/ unless, maybe, you are a Unicorn with a dangerous fetish for data retention.
Unfortunately, Thanos/Cortex/AWS encourage ingesting metrics via push-based remote write, which runs up against Prometheus hierarchical federation and my love for scraping.
❤️Scraping is good, why?
- Prom is designed to pull rather than push
- It’s easier to debug from /targets when things are going wrong! Endpoints should be addressable; if they are not, it’s like blocking ICMP (completely daft)
- You don’t need to worry about being overwhelmed: with scraping, you are always in control. Perhaps this can be argued with the Robustness Principle
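One concrete benefit of pulling: for every scrape, Prometheus records a synthetic up series per target (1 on success, 0 on failure), so a broken endpoint shows up as data rather than as silence. A minimal sketch of an alert built on it follows; the group name, alert name, five-minute window and severity label are illustrative assumptions.

```yaml
# scrape_health_rules.yml (sketch; group and alert names are hypothetical)
groups:
  - name: scrape_health
    rules:
      - alert: TargetDown
        # up is written by Prometheus itself on each scrape attempt.
        expr: up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} has failed its scrapes for 5 minutes"
```

With a push pipeline there is no equivalent built-in signal; a missing sender simply stops sending.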
Scraping sucks less and is the idiomatic Architecture of Prometheus.