At the OVH Observability (formerly Metrics) team, we collect, process and analyse most of OVH’s monitoring data. It represents about 500M unique metrics, pushing data points at a steady rate of 5M per second.
This data can be classified in two ways: host or application monitoring. Host monitoring is mostly based on hardware counters (CPU, memory, network, disk…) while application monitoring is based on the service and its scalability (requests, processing, business logic…).
We provide this service for internal teams, who enjoy the same experience as our customers. Basically, our Observability service is SaaS with a compatibility layer (supporting InfluxDB, OpenTSDB, Warp 10, Prometheus, and Graphite) that allows it to integrate with most of the existing solutions out there. This way, a team that is used to a particular tool, or has already deployed a monitoring solution, won’t need to invest much time or effort when migrating to a fully managed and scalable service: they just pick a token, use the right endpoint, and they’re done. Besides, our compatibility layer offers a choice: you can push your data with OpenTSDB, then query it in either PromQL or WarpScript. Combining protocols in this way results in a unique open-source interoperability that delivers more value, with no restrictions created by a solution’s query capabilities.
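To give an idea of what the compatibility layer looks like from a user’s perspective, here is a minimal Python sketch that pushes a single data point using the standard OpenTSDB HTTP write API. The endpoint URL, token, and authentication scheme are placeholders, not the actual Observability settings; once stored, the same series could then be queried back with PromQL or WarpScript.

```python
import time
import requests

# Hypothetical endpoint and token: the real OVH Observability endpoint and
# authentication scheme are not shown here.
ENDPOINT = "https://opentsdb.example.org/api/put"
TOKEN = "my-write-token"

datapoint = {
    "metric": "web.requests.count",   # arbitrary example metric
    "timestamp": int(time.time()),
    "value": 42,
    "tags": {"host": "web-01", "dc": "gra"},
}

# Standard OpenTSDB HTTP write API: a JSON data point POSTed to /api/put.
resp = requests.post(ENDPOINT, json=datapoint, auth=("token", TOKEN))
resp.raise_for_status()
print("pushed", datapoint["metric"])
```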
Scollector, Snap, Telegraf, Graphite, Collectd…
Drawing on this experience, we collectively tried most of the collection tools, but we always arrived at the same conclusion: we were witnessing metrics bleeding. Each tool focused on scraping every reachable bit of data, which is great if you are a graph addict, but can be counterproductive from an operational point of view when you have to monitor thousands of hosts. While it’s possible to filter the metrics, teams still need to understand the whole metrics set in order to know what needs to be filtered.
At OVH, we use laser-cut collections of metrics. Each host has a specific template (web server, database, automation…) that exports a defined set of metrics, used for health diagnostics and for monitoring application performance.
This fine-grained management gives operational teams a better understanding of their monitoring: they know what’s available, and can progressively add metrics to manage their own services.
Beamium & Noderig — The Perfect Fit
Our requirements were rather simple:
— Scalable: Monitor one node in the same way as we’d monitor thousands
— Laser-cut: Only collect the metrics that are relevant
— Reliable: We want metrics to be available even in the worst conditions
— Simple: Multiple plug-and-play components, instead of intricate ones
— Efficient: We believe in impact-free metrics collection
The first solution was Beamium
Beamium handles two aspects of the monitoring process: application data scraping and metrics forwarding.
Application data is collected in the well-known and widely used Prometheus format. We chose Prometheus as the community was growing rapidly at the time, and many instrumentation libraries were available for it. There are two key concepts in Beamium: Sources and Sinks.
The Sources, from which Beamium scrapes data, are just Prometheus HTTP endpoints. This means it’s as simple as supplying the HTTP endpoint, and optionally adding a few parameters. The scraped data is then routed to Sinks, and can be filtered during the routing process between a Source and a Sink. Sinks are Warp 10® endpoints, where we push the data.
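For reference, this is what a Source looks like from the application side: a plain HTTP endpoint serving the Prometheus text format. The sketch below uses the Python prometheus_client library; the metric names and port are arbitrary examples, not something Beamium requires.

```python
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
QUEUE_SIZE = Gauge("app_queue_size", "Current size of the work queue")

if __name__ == "__main__":
    # Serve the metrics in the Prometheus text format on
    # http://localhost:8000/metrics, ready to be declared as a Source.
    start_http_server(8000)
    while True:
        REQUESTS.inc()                          # simulate some traffic
        QUEUE_SIZE.set(random.randint(0, 10))   # simulate a fluctuating queue
        time.sleep(1)
```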
Once scraped, metrics are first stored on disk, before being routed to a Sink. This Disk Fail-Over (DFO) mechanism allows recovery from network or remote failures. This way, we keep the Prometheus pull logic locally, in a decentralized fashion, and reverse it into a push to feed the platform (a cycle sketched after the list below), which has several advantages:
- support for a transactional logic over the metrics platform
- recovery from network partitioning or platform unavailability
- dual writes with data consistency (as there’s otherwise no guarantee that two Prometheus instances would scrape the same data at the same timestamp)
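The following Python sketch illustrates the idea behind this scrape-then-spool-then-push cycle. It is not Beamium’s actual implementation (Beamium is written in Rust), and the spool directory and endpoints are placeholders.

```python
import pathlib
import time

import requests

SPOOL_DIR = pathlib.Path("/var/spool/scraper")          # hypothetical spool directory
SOURCE_URL = "http://localhost:8000/metrics"            # local Prometheus endpoint
SINK_URL = "https://warp10.example.org/api/v0/update"   # hypothetical sink (auth omitted)

def scrape_to_disk() -> pathlib.Path:
    """Pull the Source and persist the payload before any network forwarding."""
    payload = requests.get(SOURCE_URL, timeout=5).text
    path = SPOOL_DIR / f"{int(time.time() * 1000)}.metrics"
    path.write_text(payload)
    return path

def flush_spool() -> None:
    """Try to push every spooled file; keep files on disk if the Sink is unreachable."""
    for path in sorted(SPOOL_DIR.glob("*.metrics")):
        try:
            resp = requests.post(SINK_URL, data=path.read_text(), timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            break          # Sink unreachable: leave the files for a later retry
        path.unlink()      # acknowledged by the Sink: safe to delete

if __name__ == "__main__":
    SPOOL_DIR.mkdir(parents=True, exist_ok=True)
    scrape_to_disk()
    flush_spool()
```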
We have many different customers, some of whom use the Time Series store behind the Observability product to manage their product consumption or transactional changes over licensing. These use cases can’t be handled with Prometheus instances, which are better suited to metrics-based monitoring.
The second was Noderig
During conversations with some of our customers, we came to the conclusion that the existing tools needed a certain level of expertise if they were to be used at scale. For example, a team with a 20k-node cluster running Scollector would end up with more than 10 million metrics, just for the nodes… In fact, depending on the hardware configuration, Scollector would generate between 350 and 1,000 metrics from a single node.
That’s the reason behind Noderig. We wanted it to be as simple to use as Prometheus’s node-exporter, but with more fine-grained metrics production by default.
Noderig collects OS metrics (CPU, memory, disk, and network) using a simple level semantic. This allows you to collect the right amount of metrics for any kind of host, which is particularly suitable for containerized environments.
We made it compatible with Scollector’s custom collectors to ease the migration process and allow for extensibility. External collectors are simple executables that act as data providers, and their output is collected by Noderig like any other metric.
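As an illustration, here is what such an external collector could look like: a standalone executable printing metric lines to stdout in the Scollector-style simple format (metric, timestamp, value, optional tags). The metric name and tag below are hypothetical.

```python
#!/usr/bin/env python3
# Hypothetical external collector: prints one metric line per run in the
# Scollector-style simple format "metric timestamp value tag=value ...".
import time

def main() -> None:
    now = int(time.time())
    pending_orders = 12   # placeholder: fetch this from your own application
    print(f"shop.orders.pending {now} {pending_orders} region=eu")

if __name__ == "__main__":
    main()
```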
The collected metrics are available through a simple REST endpoint, allowing you to see your metrics in real time and easily integrate them with Beamium.
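To check what is currently being exposed, you can simply query that endpoint; the address in the snippet below is an assumption, so adjust it to your Noderig configuration.

```python
import requests

# The address below is an assumption; point it at wherever Noderig is listening.
print(requests.get("http://127.0.0.1:9100/metrics", timeout=2).text)
```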
Does it work?
Beamium and Noderig are extensively used at OVH, and support the monitoring of very large infrastructures. At the time of writing, we collect and store hundreds of millions of metrics using these tools. So they certainly seem to work!
In fact, we’re currently working on the 2.0 release, which will be a rework, incorporating autodiscovery and hot reload.
Stay in touch
For any questions, feel free to join our Gitter!
Follow us on Twitter: @OVH