Our colleagues in the K8S team launched the OVH Managed Kubernetes solution last week, in which they manage the Kubernetes master components and spawn your nodes on top of our Public Cloud solution. I will not describe the details of how it works here, but there are already many blog posts about it (here and here, to get you started).
In the Prescience team, we have been using Kubernetes for more than a year now. Our cluster includes 40 nodes, running on top of PCI (Public Cloud Instances). We continuously run about 800 pods, and generate a lot of metrics as a result.
Today, we’ll look at how we handle these metrics to monitor our Kubernetes Cluster, and (equally importantly!) how to do this with your own cluster.
OVH Metrics
Like any infrastructure, you need to monitor your Kubernetes Cluster. You need to know exactly how your nodes, cluster and applications behave once they have been deployed in order to provide reliable services to your customers. To do this with our own Cluster, we use OVH Observability.
OVH Observability is backend-agnostic, so we can push metrics in one format and query them in another. It can handle:
- Graphite
- InfluxDB
- Metrics2.0
- OpenTSDB
- Prometheus
- Warp10
It also incorporates a managed Grafana, in order to display metrics and create monitoring dashboards.
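For instance, pushing a single data point by hand is a one-liner. This is a minimal sketch: the endpoint is the Warp 10 one used later in this post, the X-Warp10-Token header and the input format come from the standard Warp 10 ingestion protocol, and os.test is just a placeholder metric.

# Push one data point in Warp 10 (GTS) input format: <timestamp_in_µs>// <class>{<labels>} <value>
curl -X POST 'https://warp10.gra1.metrics.ovh.net/api/v0/update' \
  -H 'X-Warp10-Token: <your-write-token>' \
  --data-binary "$(date +%s)000000// os.test{host=my-laptop} 42"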
Monitoring Nodes
The first thing to monitor is the health of nodes. Everything else starts from there.
In order to monitor your nodes, we will use Noderig and Beamium, as described here. We will also use Kubernetes DaemonSets to start the process on all our nodes.
So let’s start by creating a namespace…
kubectl create namespace metrics
Next, create a secret containing the Metrics write token, which you can find in the OVH Control Panel.
kubectl create secret generic w10-credentials --from-literal=METRICS_TOKEN=your-token -n metrics
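If you want to double-check the secret before going any further, you can decode it (METRICS_TOKEN is the variable Beamium will read through envFrom later on):

kubectl -n metrics get secret w10-credentials -o jsonpath='{.data.METRICS_TOKEN}' | base64 -d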
Copy the following metrics.yml into a file:
# This will configure Beamium to scrape Noderig
# and push the metrics to Warp 10
# We also add the HOSTNAME to the labels of the pushed metrics
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: beamium-config
  namespace: metrics
data:
  config.yaml: |
    scrapers:
      noderig:
        url: http://0.0.0.0:9100/metrics
        period: 30000
        format: sensision
        labels:
          app: noderig
    sinks:
      warp:
        url: https://warp10.gra1.metrics.ovh.net/api/v0/update
        token: $METRICS_TOKEN
    labels:
      host: $HOSTNAME
    parameters:
      log-file: /dev/stdout
---
# This is a custom collector that reports the uptime of the node
apiVersion: v1
kind: ConfigMap
metadata:
  name: noderig-collector
  namespace: metrics
data:
  uptime.sh: |
    #!/bin/sh
    echo 'os.uptime' `date +%s%N | cut -b1-10` `awk '{print $1}' /proc/uptime`
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: metrics-daemon
  namespace: metrics
spec:
  selector:
    matchLabels:
      name: metrics-daemon
  template:
    metadata:
      labels:
        name: metrics-daemon
    spec:
      terminationGracePeriodSeconds: 10
      hostNetwork: true
      volumes:
        - name: config
          configMap:
            name: beamium-config
        - name: noderig-collector
          configMap:
            name: noderig-collector
            defaultMode: 0777
        - name: beamium-persistence
          emptyDir: {}
      containers:
        - image: ovhcom/beamium:latest
          imagePullPolicy: Always
          name: beamium
          env:
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: TEMPLATE_CONFIG
              value: /config/config.yaml
          envFrom:
            - secretRef:
                name: w10-credentials
                optional: false
          resources:
            limits:
              cpu: "0.05"
              memory: 128Mi
            requests:
              cpu: "0.01"
              memory: 128Mi
          workingDir: /beamium
          volumeMounts:
            - mountPath: /config
              name: config
            - mountPath: /beamium
              name: beamium-persistence
        - image: ovhcom/noderig:latest
          imagePullPolicy: Always
          name: noderig
          args: ["-c", "/collectors", "--net", "3"]
          volumeMounts:
            - mountPath: /collectors/60/uptime.sh
              name: noderig-collector
              subPath: uptime.sh
          resources:
            limits:
              cpu: "0.05"
              memory: 128Mi
            requests:
              cpu: "0.01"
              memory: 128Mi
Don’t hesitate to change the collector levels if you need more information.
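For example, if you want more detail from the CPU or disk collectors, you could extend the Noderig args in the DaemonSet above before applying it. The --cpu and --disk level flags are assumptions based on Noderig's documentation; only --net is used in the manifest above.

# In metrics.yml, the noderig container args could become, for instance:
#   args: ["-c", "/collectors", "--net", "3", "--disk", "3", "--cpu", "3"]
# To check which flags your Noderig build actually supports (assuming the image entrypoint is the noderig binary):
docker run --rm ovhcom/noderig:latest --help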
Then apply the configuration with kubectl…
$ kubectl apply -f metrics.yml
# Then, just wait a minute for the pods to start
$ kubectl get all -n metrics

NAME                       READY   STATUS    RESTARTS   AGE
pod/metrics-daemon-2j6jh   2/2     Running   0          5m15s
pod/metrics-daemon-t6frh   2/2     Running   0          5m14s

NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/metrics-daemon    40        40        40      40           40                          122d
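If a pod doesn't become ready, or if you just want to see what Noderig exposes, you can port-forward to one of the pods and look at the raw Sensision output (use one of your own pod names):

kubectl -n metrics port-forward pod/metrics-daemon-2j6jh 9100:9100 &
curl -s http://127.0.0.1:9100/metrics | head -n 20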
You can import our dashboard into your Grafana from here, and view some metrics about your nodes straight away.
Kube Metrics
As the OVH Kube is a managed service, you don’t need to monitor the apiserver, etcd, or control plane: the OVH Kubernetes team takes care of this. So we will focus on cAdvisor metrics and Kube state metrics.
The most mature solution for dynamically scraping metrics inside the Kube (for now) is Prometheus.
In the next Beamium release, we should be able to reproduce the features of the Prometheus scraper.
To install the Prometheus server, you need to install Helm on the cluster…
kubectl -n kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller \
--clusterrole cluster-admin \
--serviceaccount=kube-system:tiller
helm init --service-account tiller
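Before installing the chart, you can quickly check that Tiller is ready (tiller-deploy is the deployment created by helm init):

kubectl -n kube-system rollout status deployment/tiller-deploy
helm version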
You then need to create the following two files: prometheus.yml and values.yml.
# prometheus.yml
# Based on https://github.com/prometheus/prometheus/blob/release-2.2/documentation/examples/prometheus-kubernetes.yml
serverFiles:
  prometheus.yml:
    remote_write:
      - url: "https://prometheus.gra1.metrics.ovh.net/remote_write"
        remote_timeout: 120s
        bearer_token: $TOKEN
        write_relabel_configs:
          # Filter metrics to keep
          - action: keep
            source_labels: [__name__]
            regex: "eagle.*|\
              kube_node_info.*|\
              kube_node_spec_taint.*|\
              container_start_time_seconds|\
              container_last_seen|\
              container_cpu_usage_seconds_total|\
              container_fs_io_time_seconds_total|\
              container_fs_write_seconds_total|\
              container_fs_usage_bytes|\
              container_fs_limit_bytes|\
              container_memory_working_set_bytes|\
              container_memory_rss|\
              container_memory_usage_bytes|\
              container_network_receive_bytes_total|\
              container_network_transmit_bytes_total|\
              machine_memory_bytes|\
              machine_cpu_cores"
    scrape_configs:
      # Scrape config for Kubelet cAdvisor.
      - job_name: 'kubernetes-cadvisor'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
        metric_relabel_configs:
          # Only keep important systemd services like docker|containerd|kubelet, plus kubepods.
          # We also want machine_cpu_cores, which has no id, so we add the metric name to the source labels in order to match it.
          # The source labels are concatenated (id, then name) with a ; separator:
          # `/;container_cpu_usage_seconds_total` OK
          # `/system.slice;container_cpu_usage_seconds_total` OK
          # `/system.slice/minion.service;container_cpu_usage_seconds_total` NOK, useless
          # `/kubepods/besteffort/e2514ad43202;container_cpu_usage_seconds_total` BestEffort pod, OK
          # `/kubepods/burstable/e2514ad43202;container_cpu_usage_seconds_total` Burstable pod, OK
          # `/kubepods/e2514ad43202;container_cpu_usage_seconds_total` Guaranteed pod, OK
          # `/docker/pod104329ff;container_cpu_usage_seconds_total` OK, container running on Docker but not managed by Kube
          # `;machine_cpu_cores` OK, there is no id on these metrics, but we want to keep them too
          - source_labels: [id,__name__]
            regex: "^((/(system.slice(/(docker|containerd|kubelet).service)?|(kubepods|docker).*)?);.*|;(machine_cpu_cores|machine_memory_bytes))$"
            action: keep
          # Remove useless parent keys like `/kubepods/burstable` or `/docker`
          - source_labels: [id]
            regex: "(/kubepods/burstable|/kubepods/besteffort|/kubepods|/docker)"
            action: drop
          # cAdvisor reports metrics per container, and sometimes also sums them up per pod.
          # As we already have the children, we will sum them up ourselves, so we drop the pod-level series and keep the container ones.
          # Pod-level series don't have a container_name, so we drop series that only have a pod_name.
          - source_labels: [container_name,pod_name]
            regex: ";(.+)"
            action: drop
      # Scrape config for service endpoints.
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_name
      # Example scrape config for pods
      #
      # The relabeling allows the actual pod scrape endpoint to be configured via the
      # following annotations:
      #
      # * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
      # * `prometheus.io/path`: If the metrics path is not `/metrics`, override this.
      # * `prometheus.io/port`: Scrape the pod on the indicated port instead of the
      #   pod's declared ports (default is a port-free target if none are declared).
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod_name
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: host
          - action: labeldrop
            regex: (pod_template_generation|job|release|controller_revision_hash|workload_user_cattle_io_workloadselector|pod_template_hash)
# values.yml
alertmanager:
  enabled: false
pushgateway:
  enabled: false
nodeExporter:
  enabled: false
server:
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: traefik
      ingress.kubernetes.io/auth-type: basic
      ingress.kubernetes.io/auth-secret: basic-auth
    hosts:
      - prometheus.domain.com
  image:
    tag: v2.7.1
  persistentVolume:
    enabled: false
Don’t forget to replace $TOKEN with your own Metrics write token!
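If you'd rather not edit the file by hand, a quick substitution works too. It only touches the $TOKEN placeholder, so the $1:$2 relabel replacements above are left alone (the token value below is, of course, a placeholder):

sed -i 's/\$TOKEN/your-write-token/' prometheus.yml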
The Prometheus scraper is quite powerful. You can relabel your time series, keep a few that match your regex, etc. This config removes a lot of useless metrics, so don’t hesitate to tweak it if you want to see more cAdvisor metrics (for example).
Install it with Helm…
helm install stable/prometheus \
    --namespace=metrics \
    --name=metrics \
    --values=values.yml \
    --values=prometheus.yml
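You can then check that the release is deployed and that the server pod is running (app=prometheus is the label set by the stable/prometheus chart):

helm status metrics
kubectl -n metrics get pods -l app=prometheus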
Add a basic-auth secret…
$ htpasswd -c auth foo
New password: <bar>
Re-type new password: <bar>
Adding password for user foo
$ kubectl create secret generic basic-auth --from-file=auth -n metrics
secret "basic-auth" created
You can access the Prometheus server interface through prometheus.domain.com.
You will see all the metrics for your Cluster, although only the ones you have filtered will be pushed to OVH Metrics.
The Prometheus interface is a good way to explore your metrics, as it’s quite straightforward to display and monitor your infrastructure. You can find our dashboard here.
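To validate the whole chain from outside the cluster, you can also query the Prometheus HTTP API through the ingress with your basic-auth credentials (replace foo:bar and the hostname with your own):

curl -s -u foo:bar 'https://prometheus.domain.com/api/v1/query?query=up'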
Resources Metrics
As @Martin Schneppenheim said in this post, in order to correctly manage a Kubernetes Cluster, you also need to monitor pod resources.
We will install Kube Eagle, which will fetch and expose some metrics about CPU and RAM requests and limits, so they can be fetched by the Prometheus server you just installed.
Create a file named eagle.yml.
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  labels:
    app: kube-eagle
  name: kube-eagle
  namespace: kube-eagle
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - pods
    verbs:
      - get
      - list
  - apiGroups:
      - metrics.k8s.io
    resources:
      - pods
      - nodes
    verbs:
      - get
      - list
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  labels:
    app: kube-eagle
  name: kube-eagle
  namespace: kube-eagle
subjects:
  - kind: ServiceAccount
    name: kube-eagle
    namespace: kube-eagle
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-eagle
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: kube-eagle
  labels:
    app: kube-eagle
  name: kube-eagle
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: kube-eagle
  name: kube-eagle
  labels:
    app: kube-eagle
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-eagle
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
      labels:
        app: kube-eagle
    spec:
      serviceAccountName: kube-eagle
      containers:
        - name: kube-eagle
          image: "quay.io/google-cloud-tools/kube-eagle:1.0.0"
          imagePullPolicy: IfNotPresent
          env:
            - name: PORT
              value: "8080"
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /health
              port: http
          readinessProbe:
            httpGet:
              path: /health
              port: http
$ kubectl create namespace kube-eagle
$ kubectl apply -f eagle.yml
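You can quickly check that Kube Eagle answers before looking for its series in Prometheus, by port-forwarding to the deployment and grepping its node metrics:

kubectl -n kube-eagle port-forward deploy/kube-eagle 8080:8080 &
curl -s http://127.0.0.1:8080/metrics | grep eagle_node_resource | head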
Next, import this Grafana dashboard (it’s the same dashboard as the Kube Eagle one, but ported to Warp 10).
You now have an easy way of monitoring your pod resources in the Cluster!
Custom Metrics
How does Prometheus know that it needs to scrape kube-eagle? If you look at the metadata in eagle.yml, you’ll see the following:
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"     # the port on which the metrics are exposed
  prometheus.io/path: "/metrics" # the path on which the metrics are exposed
These annotations trigger the Prometheus auto-discovery process (described in the kubernetes-pods scrape config of prometheus.yml).
This means you can easily add these annotations to pods or services that contain a Prometheus exporter, and then forward these metrics to OVH Observability. You can find a non-exhaustive list of Prometheus exporters here.
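For example, if one of your own services already exposes a Prometheus endpoint, annotating it is enough for the kubernetes-service-endpoints job above to pick it up (my-namespace, my-app and port 9102 are placeholders for your own exporter):

kubectl -n my-namespace annotate service my-app \
  prometheus.io/scrape=true \
  prometheus.io/port=9102 \
  prometheus.io/path=/metrics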
Volumetrics Analysis
As you saw in the prometheus.yml file, we’ve tried to filter out a lot of useless metrics. For example, with cAdvisor on a fresh cluster, with only three real production pods, plus the whole kube-system and Prometheus namespaces, you have about 2,600 metrics per node. With a smart cleaning approach, you can reduce this to 126 series.
Here’s a table to show the approximate number of metrics you will generate, based on the number of nodes (N) and the number of production pods (P) you have:
| | Noderig | cAdvisor | Kube State | Eagle | Total |
|---|---|---|---|---|---|
| Nodes | N * 13 (1) | N * 2 (2) | N * 1 (3) | N * 8 (4) | N * 24 |
| system.slice | 0 | N * 5 (5) * 16 (6) | 0 | 0 | N * 80 |
| kube-system + kube-proxy + metrics | 0 | N * 5 (9) * 26 (6) | 0 | N * 5 (9) * 6 (10) | N * 160 |
| Production pods | 0 | P * 26 (6) | 0 | P * 6 (10) | P * 32 |
For example, if you run three nodes with 60 pods, you will generate 264 * 3 + 32 * 60 ~= 2,700 metrics.
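If you want to compare these estimates with your own cluster, you can count the series your Prometheus currently holds. Note that this counts everything scraped locally, which is a superset of what is pushed to OVH Metrics after the remote_write filter:

curl -s -u foo:bar 'https://prometheus.domain.com/api/v1/query' \
  --data-urlencode 'query=count({__name__=~".+"})'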
NB: A pod has a unique name, so if you redeploy a deployment, you will create 32 new metrics each time.
(1) Noderig metrics: os.mem / os.cpu / os.disk.fs / os.load1 / os.net.dropped (in/out) / os.net.errs (in/out) / os.net.packets (in/out) / os.net.bytes (in/out)/ os.uptime
(2) cAdvisor nodes metrics: machine_memory_bytes / machine_cpu_cores
(3) Kube state nodes metrics: kube_node_info
(4) Kube Eagle nodes metrics: eagle_node_resource_allocatable_cpu_cores / eagle_node_resource_allocatable_memory_bytes / eagle_node_resource_limits_cpu_cores / eagle_node_resource_limits_memory_bytes / eagle_node_resource_requests_cpu_cores / eagle_node_resource_requests_memory_bytes / eagle_node_resource_usage_cpu_cores / eagle_node_resource_usage_memory_bytes
(5) With our filters, we will monitor around five system.slices
(6) Metrics are reported per container. A pod is a set of containers (a minimum of two: your container, plus the pause container for the network). So we can count 2 * 10 + 6 = 26 metrics per pod: 10 cAdvisor metrics for each container (7) and six for the network (8). For system.slice we have 10 + 6 = 16, because it’s treated as a single container.
(7) cAdvisor will provide these metrics for each container: container_start_time_seconds / container_last_seen / container_cpu_usage_seconds_total / container_fs_io_time_seconds_total / container_fs_write_seconds_total / container_fs_usage_bytes / container_fs_limit_bytes / container_memory_working_set_bytes / container_memory_rss / container_memory_usage_bytes
(8) cAdvisor will provide these metrics for each interface: container_network_receive_bytes_total * per interface / container_network_transmit_bytes_total * per interface
(9) kube-dns / beamium-noderig-metrics / kube-proxy / canal / metrics-server
(10) Kube Eagle pods metrics: eagle_pod_container_resource_limits_cpu_cores / eagle_pod_container_resource_limits_memory_bytes / eagle_pod_container_resource_requests_cpu_cores / eagle_pod_container_resource_requests_memory_bytes / eagle_pod_container_resource_usage_cpu_cores / eagle_pod_container_resource_usage_memory_bytes
Conclusion
As you can see, monitoring your Kubernetes Cluster with OVH Observability is easy. You don’t need to worry about how and where to store your metrics, leaving you free to focus on leveraging your Kubernetes Cluster to handle your business workloads effectively, like we have in the Machine Learning Services Team.
The next step will be to add an alerting system, to notify you when your nodes are down (for example). For this, you can use the free OVH Alert Monitoring tool.
Stay in touch
For any questions, feel free to join the Observability Gitter or Kubernetes Gitter!
Follow us on Twitter: @OVH