Network devices overheat monitoring

Problem to solve

Monitoring our network devices is important to ensure a good quality of service for our customers. In particular, avoiding any overheat safety shutdowns is critical, even if we ensure failure resilience in our infrastructure with device redundancy. In the best scenario, if a device fails, its twin is not impacted: it is able to handle the entire load and the redundancy is operational. In such a case, the twin device can correctly make up for the temporary loss of the first one. Even if the infrastructure is temporarily at risk, end users are not impacted, and this is what matters.

The problem is that the two devices of a twin pair have often been put in service at the same time: in other words, their hardware cooling factors may be similar (fan obstruction, thermal paste efficiency loss, dust…). In addition, since they are often located in close proximity to one another, their environment is often similar (surrounding air temperature). So, if the first one gets too hot, there is a good chance that, once loaded with the additional traffic, its twin will overheat too. Even with redundancy, overheating is definitely something we want to keep an eye on.

This post outlines the way we monitor our network devices and avoid overheat throttles and/or shutdowns.

Facts

  • 21000+: number of network devices we have in our datacenters
  • 150+: number of distinct vendor models – to avoid any confusion with machine learning models, from here on we won’t talk about ‘model’, instead we will use the term ‘device series’ to refer to vendor models.
  • 4000: number of distinct sensors – each device series comes with its own set of supported sensors, essentially meant to monitor the electrical power supply and the temperature at strategic places in the device (CPU core temperature, outgoing flow temperature…).
Huge number and variety of devices to monitor

How to find a unified way to monitor all these different devices?

For any given series, its associated vendor provides hard-coded sensor thresholds (sometimes thresholds can be configured by a network administrator). These thresholds trigger alerts, which can be classified into two types:

  • ‘Soft’ thresholds: when reached, an event will be recorded in the device log with no further hardware safety measures taken. This can be seen as a warning that your device’s temperature is not yet critical but abnormally high.
  • ‘Hard’ thresholds: when reached, some hardware safety measures will be triggered. For example, on your home computer, if your CPU gets too hot, it will be throttled, meaning its frequency will be reduced in an attempt to decrease its temperature. For a network device, on the other hand, the action is often not a CPU throttle but a full shutdown. There are two reasons for such a drastic safety measure:
    1. Cost of a single device: better safe than sorry
    2. Redundancy: better to power off a device completely and rely on its twin to take over rather than throttling its CPU and operating in a degraded state

So a question arises: why not just stick to the aforementioned vendor thresholds and wait for soft thresholds to trigger before taking action?

This was how we once monitored our devices. And actually, we still keep an eye on these soft/hard alerts – vendors do know their hardware best, so their thresholds should be carefully watched.

But it appeared that this was not enough, mostly because if you just wait for these vendor alerts to trigger, you can end up in the following situation: you’ve had no alerts for months, and then on a particularly hot day, you get all the alarms at once because your environment is warmer than usual. Once vendor alerts are triggered, it is actually too late. Datacenter operators become overwhelmed and have no way to prioritize, because among all the alerts you do not always know which ones are the most critical and will actually result in safety shutdowns. That is for soft thresholds. For hard thresholds, you have approximately 30 seconds to intervene before safety measures are taken.

To avoid being overwhelmed by alerts on hot days, we needed a couple of things:

  • Crash prediction model: during crisis situations (hot days), we wanted to be able to better forecast crashes/safety shutdowns, to give datacenter operators more time to intervene and provide them with a way to sort alerts according to the device crash risk they represent. We mined our device sensor data and crash records to find out which vendor series were most sensitive to overheating and which sensors (or sensor combinations) we could use to predict crashes. With the crash examples we had, we built a supervised machine learning model that learned to predict crashes up to two hours before they occurred – this left datacenter operators with enough time to intervene in such emergency cases.
  • Preventive planned maintenance: this is a continuous effort. The purpose is to detect and continuously maintain devices that could cause us trouble on hot days, so that we do not become overwhelmed. To achieve this, we built an unsupervised conditional outlier detection model that learns to detect devices which operate abnormally hot given certain relevant factors (environment temperature, load…).

Let’s take a look at both points in more detail.

Available data

As you can see in Table 1, for each device we get a multivariate and sparse time series of sensor records. The resulting matrix is sparse because, for any given device, only a small subset of the 4000 possible sensors is supported.

Next to these time series, we get device details (Table 2), allowing us to retrieve for every device its vendor series and location.

Finally, we get environmental data (temperature, humidity…) time series per location in Table 3, which we can join to the first time series thanks to the device details association table.
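To make the join concrete, here is a minimal pandas sketch of how these three tables could be assembled; the column names (device_id, sensor_name, ambient_temp…) and sample values are hypothetical and only illustrate the structure, not our actual schemas.

```python
import pandas as pd

# Table 1 (sketch): one row per (device, timestamp, sensor) record.
sensor_ts = pd.DataFrame([
    ("dev-1", "2019-06-29 14:00", "cpu_core_temp", 71.0),
    ("dev-1", "2019-06-29 14:00", "outgoing_flow_temp", 45.0),
    ("dev-2", "2019-06-29 14:00", "psu_temp", 38.0),
], columns=["device_id", "timestamp", "sensor_name", "value"])

# Table 2 (sketch): device details, giving the series and the location.
device_details = pd.DataFrame([
    ("dev-1", "series_A", "rbx", "room_1"),
    ("dev-2", "series_B", "gra", "room_3"),
], columns=["device_id", "series", "datacenter", "room"])

# Table 3 (sketch): environmental measurements per location.
environment_ts = pd.DataFrame([
    ("rbx", "room_1", "2019-06-29 14:00", 31.5, 0.40),
    ("gra", "room_3", "2019-06-29 14:00", 29.0, 0.45),
], columns=["datacenter", "room", "timestamp", "ambient_temp", "humidity"])

# Pivot the sensor records into one (sparse) column per sensor.
features = (
    sensor_ts
    .pivot_table(index=["device_id", "timestamp"],
                 columns="sensor_name", values="value")
    .reset_index()
)

# Attach the device series and location, then the environmental data
# for that location, through the device details association table.
features = features.merge(device_details, on="device_id", how="left")
features = features.merge(environment_ts,
                          on=["datacenter", "room", "timestamp"], how="left")
```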

Device series clustering

From the first table we can see several possible approaches for our crash prediction. We could keep the raw data (a high-dimensional and sparse time series) and build one machine learning model to tackle the more than 150 device series we have.

By doing so, we’d be forced to confront the curse of dimensionality:

  • Heavy compute and memory cost
  • Overfitting risk: for some device series, we only have a few devices, with no crash example at all. As this is not statistically meaningful, the model might just learn to predict that no crash will occur whenever it detects such a series signature (from the supported sensor set, for example). We may prefer to ignore these devices for now, since we have no positive examples for them. In addition, keeping too many features relative to the number of distinct samples is not good for the overfitting risk either.

At the other extreme, instead of building one single model to rule them all, we could build one crash prediction model per device series. This would leave us with more than 150 distinct models to maintain in production, which is not affordable.

So, we attempted to cluster device series by similarity (defined below), making a tradeoff between the two extremes mentioned above.

We tried the following approaches:

  • For each sensor of each device series, we started by estimating the 10%, 30%, 50%, 70% and 90% quantiles of the values taken by the sensor and used these quantiles to roughly describe the shape of its scaled distribution. Using these features, we could compute distances (Euclidean distance or cosine similarity), roughly capturing distribution similarity, and cluster (series, sensor) tuples accordingly. We quickly gave up on this approach because, as previously mentioned, some devices had too few samples to compute relevant quantiles and describe the sensor distribution properly. Furthermore, we could tell from their supported sensor name sets that some device series were close to other series with far more samples.
  • Using the observation made in the previous approach (some device series appearing to be close to others from their supported sensor sets), we decided to cluster sensors based on their name stems: we wanted to leverage the assumption that close device series have close hardware/firmware components and therefore close sensor names for the same sensor function. To do so, we grouped sensors using the DBSCAN algorithm coupled with the Levenshtein distance metric (sketched below).

Recursive definition of the Levenshtein distance (Wikipedia), where tail(x) is x without its first character:

lev(a, b) =
  |a|                                   if |b| = 0
  |b|                                   if |a| = 0
  lev(tail(a), tail(b))                 if a[0] = b[0]
  1 + min( lev(tail(a), b),
           lev(a, tail(b)),
           lev(tail(a), tail(b)) )      otherwise
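As an illustration, here is a minimal sketch of this sensor-name clustering, using a hand-rolled Levenshtein implementation and scikit-learn’s DBSCAN with a precomputed distance matrix. The sensor names and the eps/min_samples values below are made up for the example and are not our production settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein (edit) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical sensor names; the real list has ~4000 entries.
sensor_names = ["cpu_core_temp", "cpu_core_temp_2", "psu1_temp",
                "psu2_temp", "outgoing_flow_temp"]

# Pairwise distance matrix fed to DBSCAN as a precomputed metric.
dist = np.array([[levenshtein(a, b) for b in sensor_names]
                 for a in sensor_names], dtype=float)

clustering = DBSCAN(eps=3, min_samples=2, metric="precomputed").fit(dist)
print(dict(zip(sensor_names, clustering.labels_)))  # -1 marks unclustered names
```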

This gave us sensor groups as described in the table below:

Then, for every device series, we built the subset of supported sensor groups and clustered the device series with DBSCAN.
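A sketch of this second clustering step, assuming each series is encoded as a boolean vector of supported sensor groups and compared with a Jaccard distance; the group names, support matrix and threshold values are purely illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical: one row per device series, one boolean column per sensor
# group found in the previous step (True = the series supports that group).
sensor_groups = ["cpu_temp", "psu_temp", "flow_temp", "fan_speed"]
support_matrix = np.array([
    [1, 1, 1, 0],   # series_A
    [1, 1, 1, 0],   # series_B (same supported groups as series_A)
    [1, 0, 1, 1],   # series_C
    [1, 0, 1, 1],   # series_D
], dtype=bool)

# The Jaccard distance compares the supported-group sets of two series.
series_clusters = DBSCAN(eps=0.2, min_samples=2, metric="jaccard").fit(support_matrix)
print(series_clusters.labels_)   # e.g. [0 0 1 1]
```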

Then using these computed clusters, we split the original dataset:

Instead of building one big model, we built smaller models whose data scope became diagonal blocks of the original matrix. We obtained the trade-off we longed for, overcoming the sparseness and curse of dimensionality while keeping the total number of distinct machine learning models to maintain under control.

Once we have defined our device series clusters, we can focus on each cluster for which we have crash examples to build our supervised crash prediction model.

2-hour crash prediction

Collecting the ground truth

Collecting the ground truth (i.e. labeling your data) is a challenge that is often ignored when talking about supervised learning models. Yet, having reliable labels is critical to obtain a good dataset and subsequently a correct classifier. In many supervised machine learning examples, the labeled dataset is already available, so the most tedious part, which consists of collecting and labeling data, is often understated.

The first challenge we encountered was gathering the crashed devices, along with their crash timestamps, from our data history. For every single network device crash, we have a postmortem and a crash report. However, it is an unstructured text blob. We would need to extract the device identifier, date, and crash reason from the text. To do so, we would need to build a model specialized in Named Entity Recognition (NER) for these specific blobs. Not only would the effort have been quite significant, especially for people who are not experts in Natural Language Processing, it would also likely have failed to be a robust way to label our data, given the small number of reports our NER model would have been trained on.

Given that we only had around one hundred positive examples or fewer, manually extracting positive samples was definitely an option we considered, and not that tedious (a few hours of effort). The drawback was that we would then have had no way to automatically extract and refresh data later.

Fortunately, thanks to our sensor data, we also had the device uptime at our disposal. So, we designed the following workflow to automatically build our labeled dataset:

Workflow to automatically build dataset

First, using the uptime data we had at hand for our network devices, we could quickly see whether a device had rebooted. If not, it had not crashed (label: 0). If it had, it did not necessarily reboot because of overheating (a planned maintenance is more likely). We then looked at the sensors, more specifically we looked for device outliers on each sensor individually. This may seem surprising to data scientists, and incorrect at first glance (because you then only look for outliers along your predefined axes/features, not in the whole vector space), but it saved us some computing time and was acceptable in this specific case, under the assumption that most vendors implement their hardware safety measures against individual sensor thresholds. If the device was not an outlier for any of the temperature sensors, then no conclusion was drawn yet (label: -1). If it was an outlier, the record was pushed further along the labeling workflow. Finally, if the device had rebooted along with many other devices, then it was most likely a planned maintenance reboot and no conclusion was drawn (label: -1). Otherwise, the sample was labeled as a crash (label: 1).
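The workflow can be summarized with a small, purely illustrative helper; the boolean inputs stand for the checks described above and are of course computed from uptime and sensor data in practice.

```python
def label_record(rebooted: bool, temp_outlier: bool, many_other_reboots_nearby: bool) -> int:
    """Hypothetical encoding of the labeling workflow described above.

    Returns 1 for a probable overheat crash, 0 for "no crash",
    and -1 when no conclusion can be drawn yet.
    """
    if not rebooted:
        return 0     # device never went down: not a crash
    if not temp_outlier:
        return -1    # reboot, but temperatures looked normal
    if many_other_reboots_nearby:
        return -1    # mass reboot: most likely planned maintenance
    return 1         # lone reboot of an abnormally hot device: crash
```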

We ended up with an intermediate, semi-labeled dataset. To finish labeling it, we needed a strategy to fill in the -1 labels (choosing between 0 and 1). We opted for a trivial label spreading strategy, which consisted of filling the -1 labels with 0. We stuck to this strategy as it gave us decent enough results, as can be seen below.

Now we have a labeled dataset. We also have a way to automatically detect overheat crashes to refresh our data later whenever needed and, perhaps even more importantly, to detect crashes missed by our crash prediction models once in production.

Since we want to build a 2-hour crash prediction model, for each crashed device we spread the positive label across the timeline, over the records from the two hours preceding the crash.

Now our data look like this:

Note that rather than applying sequence approaches (like RNNs), we keep our tabular data shape by using the sensor records from one or several previous time steps as additional features.
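Here is a minimal pandas sketch of both ideas, label spreading over the two hours preceding a crash and lagged sensor columns; the column names (device_id, timestamp, crash_time) are hypothetical and timestamps are assumed to be datetimes.

```python
import pandas as pd

def add_labels_and_lags(df: pd.DataFrame, sensor_cols, n_lags: int = 2,
                        horizon: str = "2h") -> pd.DataFrame:
    """Hypothetical tabular dataset: one row per (device, timestamp), sensor
    readings as columns, and a `crash_time` column set for crashed devices."""
    df = df.sort_values(["device_id", "timestamp"]).copy()

    # Spread the positive label over the two hours preceding the crash.
    horizon = pd.Timedelta(horizon)
    df["label"] = ((df["crash_time"].notna())
                   & (df["timestamp"] >= df["crash_time"] - horizon)
                   & (df["timestamp"] < df["crash_time"])).astype(int)

    # Keep a tabular shape: previous sensor readings become extra columns.
    for col in sensor_cols:
        for lag in range(1, n_lags + 1):
            df[f"{col}_lag{lag}"] = df.groupby("device_id")[col].shift(lag)
    return df
```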

Undersampling the negative class in the training set

The obtained dataset is highly imbalanced (the positive ratio is on the order of 10⁻⁵). SMOTE was considered to oversample the positive class but, as usual, we started with an even more trivial approach: we undersampled the negative class (on the training set only, never the test set!). We kept all the records of any device that had a positive record and completed the negative class with randomly selected records from devices with no crashes, so that we obtained a positive ratio on the order of 0.01 in the training set.
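A sketch of this undersampling step, assuming the labeled dataset from the previous section with hypothetical device_id and label columns; the exact sampling logic we used may differ in its details.

```python
import pandas as pd

def undersample_negatives(train_df: pd.DataFrame, target_pos_ratio: float = 0.01,
                          seed: int = 0) -> pd.DataFrame:
    """Keep every record of devices that have at least one positive sample,
    then add randomly chosen records from crash-free devices until the
    positive ratio is roughly `target_pos_ratio` (training set only)."""
    pos_devices = train_df.loc[train_df["label"] == 1, "device_id"].unique()
    kept = train_df[train_df["device_id"].isin(pos_devices)]

    n_pos = int(kept["label"].sum())
    n_extra_neg = max(int(n_pos / target_pos_ratio) - len(kept), 0)

    neg_pool = train_df[~train_df["device_id"].isin(pos_devices)]
    extra_neg = neg_pool.sample(n=min(n_extra_neg, len(neg_pool)), random_state=seed)

    # Concatenate and shuffle the resulting training set.
    return pd.concat([kept, extra_neg]).sample(frac=1.0, random_state=seed)
```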

Then we trained a classifier, in this case a random forest, on these labeled and undersampled data.
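For reference, a minimal training sketch with scikit-learn; the train_set/test_set variables are assumed to come from the previous steps, and the hyperparameters and missing-value handling shown here are illustrative, not our production choices.

```python
from sklearn.ensemble import RandomForestClassifier

# train_set / test_set: the undersampled training set and the untouched,
# still imbalanced test set built in the previous steps (hypothetical names).
feature_cols = [c for c in train_set.columns
                if c not in ("device_id", "timestamp", "crash_time", "label")]

# Sparse sensors mean missing values; a simple sentinel keeps the sketch short.
X_train = train_set[feature_cols].fillna(-1.0)
X_test = test_set[feature_cols].fillna(-1.0)

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X_train, train_set["label"])

# Estimated probability of a crash within the next two hours, per record.
crash_risk = clf.predict_proba(X_test)[:, 1]
```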

Famous decision tree

Evaluation on test set

Finally, we evaluated it on the still-imbalanced test set. Here is an overview of the metrics obtained for a classifier related to a specific group of series. We won’t give the exact vendor series, but keep in mind that they were particularly vulnerable to overheating issues:

Metric                 Score
ROC AUC                0.999
PR AUC                 0.230
Accuracy               0.999
Precision              0.040
Recall                 1.00
False positive rate    0.001

The almost perfect ROC AUC and accuracy values should not be given much consideration, given the highly imbalanced nature of the dataset. Instead, we focused on the positive class: the PR AUC may seem bad at first glance, only 0.23, but keep in mind that a random/unskilled classifier would have obtained the positive ratio of the test set here, around 10⁻⁵, which means our classifier gained four orders of magnitude! Not bad. Our recall was perfect in this case, but at the cost of a high false positive rate and a low precision (only 4%).
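For completeness, such metrics can be computed on the untouched test set along these lines, with y_test and crash_risk being the true labels and predicted probabilities from the sketch above; as noted, the PR AUC of a random classifier is simply the positive ratio.

```python
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)

y_test = test_set["label"]
y_pred = (crash_risk >= 0.5).astype(int)   # default decision threshold

print("ROC AUC  :", roc_auc_score(y_test, crash_risk))
print("PR AUC   :", average_precision_score(y_test, crash_risk))  # random baseline ~ positive ratio
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
```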

Still, in our case we prefer having a good recall at the cost of a high false positive rate. As said above, the solution is viable if the number of raised alerts:

  1. provides enough time for datacenter operators to intervene
  2. does not overwhelm datacenter operators, especially on hot days

Real time simulation

Rather than focusing on these test set metrics, we took our freshly built model and used it in a monitoring simulation of our datacenters (on a different day than the one used to train it, of course). We conducted the simulation on 2019-06-29, a particularly hot day on which the device series covered by this model caused us trouble with many crashes in our datacenters. Here are the simulation results (we lowered the positive detection threshold from 0.5 to 0.4 in this simulation; any device crash predicted less than 30 minutes before it occurs is considered missed):

KPI                                          Value
Missed                                       0
Detected                                     10
Alerts                                       33
Precision                                    0.3
Recall                                       1
Proactivity mean (hours)                     1.7
Proactivity min (hours)                      0.5
Proactivity std (hours)                      0.85
Mean alerts per hour                         1.375
Max alerts per hour in a given datacenter    2.6

We still get the recall of 1 we want, and a nice surprise regarding precision, which is higher than the one obtained during the evaluation (0.3 > 0.04). Also note that some devices considered as false positives on this day (because they did not crash) actually crashed during the next heat wave (2019-07-24). Strictly speaking they are false positives, but raising an alert for them would still have been a good call. The mean proactivity time, at 1.7 hours, is a bit less than the announced 2 hours, with a high standard deviation, and the minimal proactivity value is 30 minutes (the prediction success condition in our simulation, as mentioned above). The 33 alerts were mainly concentrated during the hot hours of 2019-06-29 (3PM-7PM), across three distinct datacenter locations (Roubaix, Gravelines and Strasbourg); at the alert peak, 2.6 alerts per hour would have had to be handled. A lot, but still manageable with alerts now prioritized by the crash risk estimation we provided 🙂

On 2019-06-24, the system was not deployed yet, but here is a visualization of what the system would have done:

As you can see, the crash occurred around 2:30PM UTC. Provided we set the detection threshold to 0.5 (assuming operators were already highly loaded; otherwise we keep it lower, to intervene even more proactively), the device crash would have been predicted at 12:00PM, which would have left more than enough time to take preventive action (fan checks and dust cleaning are quick interventions that usually have a great benefit on the device’s temperature).

Feature importance and knowledge extraction for preventive maintenance

When digging into the models’ interpretability through feature importance and decision boundary visualization, let’s face it: we realized that the decision making was actually trivial. Each time, it focused on one or two sensors and consisted of regressing their successive values over time and comparing them with a given threshold (which physically matches the actual hardcoded vendor thresholds for these particular sensors).

In addition to giving us prediction models, this work provided a way to mine the large quantity of sensor data and learn which few meaningful sensors to focus on and monitor as a priority for each device series cluster. It thus helped us build a relevant preventive maintenance plan in a second step, which we describe below.
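Extracting this knowledge from a random forest can be as simple as reading its impurity-based feature importances (permutation importance or the treeinterpreter package listed in the references give finer-grained views); clf and feature_cols are assumed to come from the training sketch above.

```python
import pandas as pd

# Which sensors drive the crash predictions for this series cluster?
importances = pd.Series(clf.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False).head(10))
```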

Preventive maintenance

What is it?

This approach must be seen as a best effort. Rather than passively waiting for crashes to be predicted on hot days, we use the identified critical sensors to watch over our devices all year round. Contrary to the previous supervised models, which we could only build for clusters of series for which we had positive samples, this approach can be applied to any device series cluster; for the clusters concerned by crashes, though, we benefit from the knowledge previously gained regarding which sensors to focus on.

Preventive maintenance

If this preventive maintenance plan is efficient and we follow its maintenance recommendations, we should, in particular, see fewer alerts reported by the crash prediction models built above.

Outlier detection here will consist of detecting devices with a temperature considered to be abnormally high and planning maintenance operations on them.

How does it work?

We assume the temperatures measured by sensors depend on many factors/features. Among others, we could have: the device series (model), the dirt level on fans, the surrounding airflow temperature in the cold aisle or in the hot aisle, surrounding humidity, and of course the device traffic load.

An easy feature to act on is the dirt level, since fixing it is only a matter of minutes and it is generally a big game changer for temperature. No infrastructure-scale operation is required, as would be the case if, for example, we wanted to act on the device load.

If we built a classical outlier detection model with all the aforementioned features, the model would not only detect devices that are abnormally hot, but also those which are, for example, in a warmer or damper environment than usual, and we are not interested in that here. So we cannot do it this way. What we actually want is to retrieve the devices which are abnormally hot conditionally on the state of the features we cannot easily act on (device load, aisle temperature).

To get rid of the dependence on one or several of these features, one way we found was to partition the feature space into small parts (buckets) and project our samples into these buckets. Within a given bucket, we then performed a classical outlier detection on temperatures. You can project on as many features as you want; the only constraint when splitting the space into buckets is that you need to end up with enough samples in every bucket to perform a reliable outlier detection. For example, Gaussian mixtures require you to correctly estimate the means and standard deviations of the Gaussian components of your signal. In our case, we simply estimated temperature quantiles, so that we made no assumption on the distribution shape at all; but to estimate reliable quantiles (especially the extreme ones, which we are interested in for outlier detection), you need enough samples. Because of this constraint, we only projected our samples into buckets built on the surrounding hot aisle temperature feature.

In short, the efficiency of this method relies on the assumption that a device which is abnormally hot when the surrounding airflow temperature is 27°C is likely to also be abnormally hot when the temperature is only 20°C (thus it will show up as an outlier in winter as well).
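Here is a minimal sketch of this conditional outlier detection, bucketing by hot-aisle temperature and flagging samples above an extreme per-bucket quantile; the column names (hot_aisle_temp, sensor_temp) and the bucket width/quantile values are hypothetical.

```python
import pandas as pd

def flag_hot_outliers(df: pd.DataFrame, bucket_width: float = 1.0,
                      quantile: float = 0.99) -> pd.DataFrame:
    """Bucket samples by the surrounding hot-aisle temperature, then flag
    the records above an extreme temperature quantile within each bucket."""
    df = df.copy()
    df["bucket"] = (df["hot_aisle_temp"] // bucket_width).astype(int)

    # Per-bucket threshold: the empirical `quantile` of the sensor readings.
    # In practice, buckets with too few samples should be merged or dropped,
    # otherwise the extreme quantiles are not reliable.
    thresholds = df.groupby("bucket")["sensor_temp"].transform(
        lambda s: s.quantile(quantile)
    )
    df["is_outlier"] = df["sensor_temp"] > thresholds
    return df
```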

Below is an example of a sensor of interest identified during the crash prediction supervised learning step through feature importance.

The detected outliers are the red points on the graph. We report them as devices that would need a maintenance.

The graph above is interesting, as we can see that projecting on only one feature is not enough to separate the two modes (the two crocodile ‘jaws’ shapes you can see). It would definitely be interesting, in a next iteration, to project the samples conditionally on a second feature, like the device’s load. The two modes we observe might correspond to globally idle vs. globally loaded devices.

Overall results

To measure the impact of these two monitoring methods, we can compare device crash counts due to overheating before and after we put them into production.

In 2019, we had several particularly hot days (2019-06-24, 2019-06-29, 2019-07-24, 2019-07-25). For two particular device series, we encountered 39 crashes (35+4); fortunately, most of them had no impact on end users thanks to redundancy.

The system was put into production in early 2020 and, ever since, we have not encountered any crashes for these particular series. Note, however, that the results may be biased because, apart from this monitoring, other improvements were made regarding datacenter cooling.

References

Clustering

https://en.wikipedia.org/wiki/DBSCAN

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

https://en.wikipedia.org/wiki/Levenshtein_distance

Supervised learning

https://en.wikipedia.org/wiki/Random_forest

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Over- and undersampling

https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis

https://imbalanced-learn.org/stable/

Interpretability for decision trees based models

https://pypi.org/project/treeinterpreter/

Outlier detection

https://en.wikipedia.org/wiki/Anomaly_detection

https://scikit-learn.org/stable/modules/mixture.html

https://scikit-learn.org/stable/modules/outlier_detection.html
