Scalable Prometheus Architecture

What might a scalable monitoring architecture look like?

Published: Tuesday, Aug 10, 2021 Last modified: Saturday, Jun 14, 2025

tl;dr don’t focus on real time, metrics are not events, scrape don’t push

Prometheus (“prom”) is the industry standard ecosystem for metrics collection within the observability space. Essentially it’s a time series database, typically queried via https://grafana.com/. Architecture considerations come into play if you have extreme requirements; most of the time, however, you are simply scraping metrics from “metric exporters” across AWS accounts.

You can architect Proms to be in a hierarchy and use recording rules or plain federation to aggregate metrics upstream.
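As a rough sketch of what a recording rule in a team-level Prom might look like: the file name, group name, and the http_requests_total metric are assumptions for illustration, not taken from any real setup.

```yaml
# rules/team_aggregates.yml (hypothetical file, referenced from rule_files in prometheus.yml)
groups:
  - name: team_aggregates          # hypothetical group name
    interval: 1m                   # evaluate once per minute
    rules:
      # Collapse per-instance request rates into one series per job.
      # http_requests_total is an assumed application metric.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

The level:metric:operations naming of the recorded series makes it easy for an upstream Prom to select only aggregated series when it federates.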

With Prometheus, remember that you are sampling; you are not capturing every event. You are looking to observe trends. My philosophy is that it’s not healthy to stare at “real time” metrics coming in and try to make real time decisions. If I notice that the global scrape_interval has been changed from 1m to under 10s, I know people have misunderstood their metrics pipeline. Quick decisions should be triggered off events from your Cloud provider, not metrics!
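For reference, the setting in question lives in the global block of prometheus.yml. A minimal sketch, with evaluation_interval added as an assumption to keep rule evaluation in step with scraping:

```yaml
# prometheus.yml (fragment)
global:
  scrape_interval: 1m      # sample every target once a minute
  evaluation_interval: 1m  # assumed: evaluate rules at the same cadence
```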

Metrics are not events; metrics are for observing trends to pre-empt any Site Reliability issues. Good SRE is knowing a week ahead that you will run out of disk space (hint: predict_linear), not alerting when it happens!
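The predict_linear hint could be turned into an alerting rule along these lines. This is a sketch that assumes node_exporter is being scraped (it exposes node_filesystem_avail_bytes); the alert name, the 6h lookback, and the label values are illustrative choices.

```yaml
# rules/disk_capacity.yml (hypothetical file name)
groups:
  - name: disk_capacity
    rules:
      - alert: DiskWillFillInAWeek   # hypothetical alert name
        # Fit a linear trend to the last 6h of samples and extrapolate
        # 7 days ahead (in seconds); fire if the projection drops below zero.
        expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 7 * 24 * 3600) < 0
        for: 1h                      # require the trend to persist before firing
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} is predicted to run out of space within 7 days"
```

A 1m scrape interval is plenty for this: the rule looks at hours of history, not seconds.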

Normal “observability” Architecture

Metrics are fed (scraped) back to a central “observability” account for shared analysis in one of three ways:

Directly from the exporter’s /metrics endpoint (simplest)
Federation; example on how to scrape via Federation…
Recording rules, which let a local team express what’s important and deliver signal upstream
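A sketch of how the central Prom’s scrape configuration might cover the first two options. All hostnames are hypothetical, and the match[] selector assumes the downstream Proms publish recording rules with a job: prefix.

```yaml
# prometheus.yml on the central "observability" Prom (fragment)
scrape_configs:
  # Option 1: scrape an exporter's /metrics endpoint directly.
  - job_name: node
    static_configs:
      - targets:
          - exporter.team-a.example.internal:9100   # hypothetical host

  # Option 2: federate selected series from a team-level Prom.
  - job_name: federate
    honor_labels: true          # keep the labels set by the downstream Prom
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'  # assumed: only pre-aggregated recording rules
    static_configs:
      - targets:
          - prom.team-a.example.internal:9090       # hypothetical host
```

Restricting match[] to recorded series keeps the federated volume small and leaves the raw, high-cardinality data with the team that owns it.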

Notice that at any time one can scrape the various addressable target endpoints, without relying on a potential central point of failure. This is not the case when you push metrics.

The highly scalable Architecture

You probably do not need https://thanos.io/ unless you are a Unicorn, and maybe one with a dangerous fetish for data retention.

Unfortunately Thanos/Cortex/AWS encourage ingesting metrics via a push-based remote write, which clashes with Prometheus hierarchical federation and my love for scraping.
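For contrast, this is roughly what the push model looks like from the Prometheus side. The URL is a placeholder, not a real endpoint; the actual path and authentication depend on the receiving system.

```yaml
# prometheus.yml (fragment)
remote_write:
  # Every sample this Prom ingests is forwarded to the remote endpoint.
  - url: https://metrics.example.internal/api/v1/push   # hypothetical endpoint
```

Here the sender decides when and how much data arrives, which is the opposite of the central Prom choosing what to scrape and how often.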

❤️Scraping is good, why?

Prom is designed to pull rather than push
It’s easier to debug when things are going wrong, from /targets! Endpoints should be addressable; if they are not, it’s like blocking ICMP (completely daft)
You don’t need to worry about being overwhelmed because, with scraping, you are always in control. Perhaps this can be argued with the Robustness Principle

Scraping sucks less and is the idiomatic architecture for Prometheus.