SMAUG, the brand new OVHcloud backbone network infrastructure

The OVHcloud network has 34 PoPs (Points of Presence) located in Europe, North America and Asia-Pacific. With a global capacity of around 21Tbps, the OVHcloud network can handle much more traffic than other providers.

At OVHcloud, network traffic is constantly growing to meet the needs of the millions of internet users who utilise our 31 datacenters across the globe.

Over the past few years, OVHcloud has been working to improve the infrastructure of its worldwide backbone network.

Scaling up our network further

After almost two years of research and development, the OVHcloud network team has come up with a brand new infrastructure. Each datacenter is connected to multiple PoPs, which share traffic with other OVHcloud datacenters and exchange traffic with different providers (known as peers).

The existing architecture, known internally as the ‘sticky router’, is based on a router for routing and a switch that plays the role of a power board. For a few years it worked quite well (and at a low cost), but it eventually reached its limits in terms of bandwidth. We needed to design another system that could handle the increasing traffic. Our requirements were simple: a low-cost, power-efficient and scalable infrastructure.

Providing the best possible connectivity for our worldwide customers has been the company’s driving force since it was created by Octave Klaba. To achieve this, we want to be connected to local providers wherever possible, which requires multiple 10Gbps or 100Gbps ports.

Re-thinking our PoP topologies, we considered new, scalable technologies against the following requirements:

  • Port capacity: the new platform would have to be based on cost-efficient 100Gbps and 400Gbps ports, helping to eliminate any bottlenecks in the network, while still offering 10Gbps links for providers not yet ready for 100Gbps ports.
  • Easy to upgrade: the new architecture needed to be easy to upgrade in terms of capacity, partly to support the growth of the company, but also to maintain availability when network maintenance is required on the PoPs.
  • Power: the team needed to find the hardware that maximises power-consumption efficiency, especially in countries like Singapore where electricity is expensive.
  • Security: a key requirement was to work with our security teams to find the best solution for protecting the network against threats such as massive DDoS attacks.

After almost one year of research and tests, the design team came up with a brand new architecture that was scalable, power-efficient, easy to install and robust. The new architecture was named SMAUG, after the dragon in The Hobbit.

Overview of SMAUG

To be adaptable, the architecture comes in multiple capacity options, depending on how big the PoP is, because different amounts of traffic are exchanged at different datacenters. Each capacity option has its own characteristics, with the objective of avoiding any bottlenecks.

Spine and Leaf infrastructure

SMAUG is a ‘Spine and Leaf’ infrastructure. The ‘spine’ devices (called SBB, for SuperBackBone) aggregate the leaves and connect each datacenter, while the ‘leaf’ devices (called PB, for PeeringBox) are used to connect providers as well as internal services such as xDSL equipment or OVHcloud Connect.
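
To make the split between the two roles more concrete, here is a minimal, purely illustrative Python sketch of a SMAUG-style PoP. The device names, peer lists and the toy load-sharing rule are assumptions made for the example; they are not OVHcloud tooling or configuration.

```python
from dataclasses import dataclass, field

# Illustrative model of a SMAUG-style PoP: spines (SBB) aggregate the leaves
# and carry the long-haul links, while leaves (PB) face the peers and
# internal services. All names and numbers below are hypothetical.

@dataclass
class Spine:                       # "SBB" / SuperBackBone role
    name: str
    long_haul: list = field(default_factory=list)   # links towards DCs / other PoPs

@dataclass
class Leaf:                        # "PB" / PeeringBox role
    name: str
    peers: list = field(default_factory=list)       # transit, PNI, IX, internal services
    uplinks: list = field(default_factory=list)     # every leaf uplinks to every spine

def build_pop(spines, leaves):
    """Full-mesh the leaves onto the spines, as in a spine-and-leaf fabric."""
    for leaf in leaves:
        leaf.uplinks = list(spines)
    return spines, leaves

# A hypothetical small PoP: two spines and two leaves.
sbb = [Spine("sbb1", ["dc-sgp", "pop-remote"]), Spine("sbb2", ["dc-sgp", "pop-remote"])]
pb = [Leaf("pb1", peers=["transit-A", "ix-1"]), Leaf("pb2", peers=["pni-B", "xdsl"])]
build_pop(sbb, pb)

# Traffic entering from a peer crosses leaf -> spine -> long-haul link.
for leaf in pb:
    for peer in leaf.peers:
        spine = leaf.uplinks[hash(peer) % len(leaf.uplinks)]  # toy load-sharing only
        print(f"{peer} -> {leaf.name} -> {spine.name} -> {spine.long_haul[0]}")
```

The point of the full mesh is that any leaf can reach any long-haul link through any spine, which is what makes capacity upgrades and maintenance on a single device easier.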

The SMAUG infrastructure also changes the way a datacenter is connected: each datacenter is attached to a minimum of two PoPs, each in a different location. For example, in Singapore our datacenter is connected to two PoPs that are more than 30 km apart, which also satisfies the rule that two PoPs must never share the same power source.

To ensure redundancy, both PoPs need to be connected to each other with a huge amount of capacity, in 100Gbps or 400Gbps links depending on the PoPs. The transmission team was also involved, developing a new infrastructure called ‘GALAXY’. GALAXY is based on a different vendor’s hardware, combined with a simple-to-deploy, scalable operational model and lower energy consumption.

The role of the leaf is pretty simple and is similar to a top-of-rack switch in a datacenter. It has a huge amount of uplink capacity towards the spines and carries the configuration for connecting BGP peers, such as transit providers, private interconnections (PNI) and Internet eXchanges.

The spine role is more complex and has added functionality, including:

  • Long-haul links: it terminates the 100Gbps-based links towards the datacenter(s) and the other PoP(s).
  • Routing: it holds the entire routing table in order to take the best path towards the OVHcloud datacenters or towards external networks (see the sketch after this list).
  • Protection: the VAC team has been involved, developing a new protection tool (for more information: https://www.ovh.com/blog/using-fpgas-in-an-agile-development-workflow/) that helps protect the entire OVHcloud network from DDoS attacks.
  • Mitigation: it also supports the detection system used by the VAC infrastructure (https://www.ovh.co.uk/anti-ddos/).
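
As a rough illustration of the routing role described in the list above, the snippet below performs a longest-prefix-match lookup over a toy routing table. The prefixes, next-hop names and single local-preference tie-breaker are invented for the example; a real spine holds the full Internet routing table and runs the complete BGP best-path selection.

```python
import ipaddress

# Toy routing table: (prefix, next_hop, local_pref). The prefixes and next
# hops are hypothetical; documentation ranges are used for the examples.
ROUTES = [
    ("0.0.0.0/0",       "transit-A", 100),  # default route via a transit provider
    ("203.0.113.0/24",  "pni-B",     200),  # prefix learned from a private interconnection
    ("198.51.100.0/24", "sbb-dc",    300),  # prefix behind a datacenter-facing link
]

def best_path(destination: str):
    """Return (next_hop, matching_prefix) for a destination IP, or None."""
    dst = ipaddress.ip_address(destination)
    candidates = []
    for prefix, next_hop, local_pref in ROUTES:
        network = ipaddress.ip_network(prefix)
        if dst in network:
            candidates.append((network.prefixlen, local_pref, next_hop, network))
    if not candidates:
        return None
    # Longest prefix wins; among equal-length prefixes, prefer higher local-pref.
    _, _, next_hop, network = max(candidates, key=lambda c: (c[0], c[1]))
    return next_hop, network

print(best_path("203.0.113.42"))  # matches the /24 -> routed towards 'pni-B'
print(best_path("192.0.2.1"))     # only the default matches -> routed towards 'transit-A'
```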

Testing and deploying

Once the design team and management were sure this was the best architecture for the OVHcloud network, we started testing different brands of hardware to verify that all the functionality was implemented correctly. After all the tests had been completed in the laboratory, and we were confident that the solution was viable, we started deploying the infrastructure in Singapore. This region was one of the most important in terms of traffic growth, and it was also easier because we already had the dark fibers for the links between the datacenter and the PoPs.

In January, we ordered all the devices and transceivers, then prepared the migration plan in order to schedule the whole deployment for the end of March. At the end of February, we prepared the configuration and tested the brand new devices. When everything was in order, we shipped them all to the Singapore PoPs.

At the beginning, we planned to do this migration in mid-March 2020 and to send our technicians from France to Singapore, but due to COVID-19 we had to change our plans. We had to find another solution, and asked the local technicians working in our Singaporean datacenter to do the job. The migration plan became more complicated due to the reorganisation that had to be put in place for the pandemic.

After a long discussion between management, the network team and the Singaporean OVHcloud technicians, it was decided to migrate the first PoP at the beginning of April and the second at the end of April. The migration began by racking the brand new devices in two new racks, preparing the wiring for the migration, and running some checks before the hot swaps.

Migrating to SMAUG

The pressure for this migration to be successful was high, as we did not want it to impact our customers. On the first night of the migration, once we had drained the traffic from the first PoP, we asked the technicians to move all the long-haul links (towards the Singapore DC, Australia, France and the USA) to the new devices. After some tests, we put the new devices into production. The first step of the migration went well.

The next day wasn’t quite so smooth, as it was the first time we had put our new Peering Box into production with our new border protection system, based on FPGA servers. After we removed the traffic from our peers, so that it left the OVHcloud network via the second PoP, we moved the fibers to the new Peering Box via a hot swap performed by our datacenter supplier. Once everything was plugged into the new equipment, we began putting production traffic back slowly, in conjunction with our security team, to check that the new border protection system was working properly and not dropping legitimate traffic.
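
The following sketch only illustrates the kind of check implied by putting production back slowly: at each traffic step, compare the passed and dropped packet counters from the border protection and pause if the drop ratio looks abnormal. The counter source, threshold and step sizes are assumptions made for the example, not the actual tooling or figures from the migration.

```python
import random

# Hypothetical ramp-up check. In reality the counters would come from the
# device/FPGA telemetry; here they are simulated so the sketch is runnable.
DROP_RATIO_LIMIT = 0.01               # assumed acceptable drop ratio (1%)
TRAFFIC_STEPS = [5, 10, 25, 50, 100]  # % of peering traffic re-enabled

def read_counters(step_pct):
    """Simulated telemetry: (passed_packets, dropped_packets) for this step."""
    passed = step_pct * 1_000_000
    dropped = int(passed * random.uniform(0.0, 0.005))
    return passed, dropped

for step in TRAFFIC_STEPS:
    passed, dropped = read_counters(step)
    ratio = dropped / max(passed, 1)
    print(f"{step:3d}% traffic: passed={passed} dropped={dropped} ratio={ratio:.4f}")
    if ratio > DROP_RATIO_LIMIT:
        print("Drop ratio too high: pausing the ramp-up for investigation.")
        break
else:
    print("All steps validated: peering traffic fully restored.")
```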

On the last day of the migration, another technology was put in place by our transmission team. The goal here was to add capacity between the Singapore datacenter and the PoP where we had installed these new devices. After we isolated the traffic between both ends, we migrated the dark fiber to the new DWDM optical system (GALAXY) in order to add 400Gbps of capacity towards the datacenter. As the system was new, we had some trouble walking the on-site technicians through fixing a few cabling issues. Once everything was fixed and ready, we put the 4x100Gbps links into production one by one.

After completing all these steps, we analysed and fixed a few issues so that the migration of the second PoP could go faster while following the same schedule.

Once both PoPs were in production, we monitored how the new infrastructure was handling traffic and DDoS attacks. We also contacted our closest customers to ensure that there were no issues.

In conclusion

This new infrastructure, SMAUG, brings significant improvements in power and space efficiency, as well as reducing complexity in the overall network design. It will also help OVHcloud keep up with increasing application and service demands.

Looking ahead, we believe the flexibility of the SMAUG design will help OVHcloud scale network capacity faster and be ready for future technologies.

Thanks to the teams and industry partners that helped make this new infrastructure possible.

Senior Network Engineer at OVHcloud

Florian is a senior network engineer based in OVHcloud’s Australian office in Melbourne. A member of the core network team, he has developed expertise in building, troubleshooting and consulting on network architectures designed for more than three hundred thousand servers. He also works on improving the network by adding capacity, connecting new providers and installing new equipment across the worldwide network.