If you are running an IT system, you are most likely running an observability stack alongside it. Nowadays, the question is no longer whether you need observability, but how you will compose your stack. At OVHcloud, we have been running a scalable timeseries backend for years now.
Over the last year, we had the opportunity to reassess our technical choices. Prometheus is the de facto standard, but choosing it is only the beginning of the process. Thanks to open source communities, there are a lot of possible choices.
The previous posts covered the process we followed to select our new backend; this one concludes the series and shares what we have chosen and why. In case you missed them, the series covers an introduction to Prometheus remote storage, and how to benchmark such a solution from both the write and read perspectives, the hard way or like a pro.
And the winner is… Grafana Mimir!
After all the experiments we ran, we have chosen Grafana Mimir. The first reason this solution is a good fit for us is that its read/write performance is excellent, as is its scalability. The main mission of my team, core-observability, is to provide a resilient and full-featured observability infrastructure. Many teams rely on us, and each of them has its own particularities. Multitenancy is a must-have for us, and with it we must be able to prevent side effects such as the “noisy neighbor”. This is why rate limiting was on our wish list. Mimir provides a lot of settings, at both the cluster level and the tenant level, to make sure one tenant does not impact the others or simply degrade the quality of service.
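To make the “noisy neighbor” idea concrete, here is a minimal sketch of per-tenant rate limiting with a token bucket. It is a generic illustration and not Mimir’s actual implementation; the class names, the `(rate, burst)` tuples and the tenant names are all hypothetical.

```python
class TokenBucket:
    """Generic token bucket: refills at `rate` tokens/second, capped at `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start full so an initial burst is accepted
        self.last = 0.0      # timestamp of the previous call

    def allow(self, n, now):
        # Refill according to the time elapsed since the previous call.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if n <= self.tokens:
            self.tokens -= n
            return True
        return False


class TenantLimiter:
    """One bucket per tenant, with a cluster-wide default and per-tenant overrides."""

    def __init__(self, default, overrides=None):
        self.default = default            # (rate, burst) applied to every tenant
        self.overrides = overrides or {}  # tenant -> (rate, burst)
        self.buckets = {}

    def allow(self, tenant, samples, now):
        if tenant not in self.buckets:
            rate, burst = self.overrides.get(tenant, self.default)
            self.buckets[tenant] = TokenBucket(rate, burst)
        return self.buckets[tenant].allow(samples, now)


limiter = TenantLimiter(default=(10, 20), overrides={"noisy": (1, 2)})
print(limiter.allow("noisy", 2, now=0.0))   # within the tenant's burst
print(limiter.allow("noisy", 1, now=0.0))   # rejected, bucket is empty
print(limiter.allow("quiet", 20, now=0.0))  # other tenants are unaffected
```

Because each tenant owns its bucket, a tenant that exhausts its budget is rejected without consuming anything from the others, which is the isolation property we were looking for.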
Like many cloud native technologies, Mimir relies on object storage, where the timeseries are stored. Doing so decouples compute from storage, so you do not have to add more computing power or bigger disks to offer the retention your users need. Data is compacted to keep the storage footprint as small as possible and therefore achieve cost efficiency.
As we said earlier, Prometheus is today’s de facto standard when it comes to timeseries. We wanted to offer our users the full experience: 100% compliance with PromQL, recording rules and alerting rules. Mimir is fully featured on this side, and it is even part of a bigger picture with more integrations, which is the icing on the cake. Let’s start with Grafana, which is of course fully compatible with Mimir; you can also manage your recording or alerting rules directly from the UI. Then comes Loki, which is like Prometheus but for logs; it allows you to query your logs just like your metrics. And finally Tempo, which covers the last observability pillar: distributed tracing.
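As an illustration of what PromQL compatibility means for a tenant, the sketch below builds an instant query against a Prometheus-compatible HTTP API. It assumes the API is served under a `/prometheus` prefix and that the tenant is selected with the `X-Scope-OrgID` header; the host name, the tenant name and the helper function are hypothetical, so adapt them to your own deployment.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_instant_query(base_url, tenant, promql):
    """Build (but do not send) an instant-query request for one tenant."""
    # Assumption: the Prometheus-compatible API lives under /prometheus.
    url = f"{base_url}/prometheus/api/v1/query?{urlencode({'query': promql})}"
    # Assumption: the tenant is identified by the X-Scope-OrgID header.
    return Request(url, headers={"X-Scope-OrgID": tenant})


# Hypothetical endpoint and tenant, for illustration only.
req = build_instant_query("http://mimir.example:8080", "team-a", "up")
print(req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. The same PromQL expression can be used unchanged in a Grafana panel or in a recording rule.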
On the operational side, there is no doubt that Mimir has been built with production stability and resiliency in mind. The default settings are production ready and the documentation is crystal clear, and you also have the material to facilitate the day-to-day care of Mimir in production. As an SRE running Mimir, you can rely on the maintainers’ knowledge base: ready-to-use dashboards, recording and alerting rules, and runbooks are at your disposal. Of course, deployments may differ from one another. This is a very good opportunity to contribute back to the vibrant open source community around Grafana Labs. No matter the size of the contribution, it is always welcomed and reviewed in a timely manner. Whether you need to adjust the dashboards, add a feature or build deb/rpm packages, you can always contribute.
The decisive reason why we chose Mimir is the core values of its maintainers. Kudos to them. They are welcoming, easygoing and, more importantly, they take open source seriously, just like us at OVHcloud. If you want a glimpse of that, come by their Slack to see how fast they answer.
My team can’t wait to see all the beautiful things our users will do with this backend. One thing is sure: we’ll contribute back and make sure Mimir thrives. Let’s save that part for a future blog post.
After 10 years as a sysadmin in High Performance Computing, Wilfried Roset is now part of OVHcloud as Engineering Manager for its Databases Product Unit. He focuses on industrialization, reliability and performance for both internal and public cluster offers.