Loops: Powering Continuous Queries with Observability FaaS

We’re all familiar with that small snippet of code that adds reasonable value to your business unit. It can materialise as a script, a program, a line of code… and it will produce a report, new metrics, KPIs, or create new composite data. This code is intended to run periodically, to meet requirements for up-to-date information.

In the Observability team, we encounter these snippets as queries within the Time Series Database (TSDB), to express continuous queries that are responsible for automating different use cases like: deletes, rollups or any business logic that needs to manipulate Time Series data.

We already introduced TSL in a previous blog post, which demonstrated how our customers use the available OVH Metrics protocols, like Graphite, OpenTSDB, PromQL and WarpScript™. When it comes to manipulating, or even creating, new data, however, you don’t have a lot of options, although you can use WarpScript™ or TSL as a scripting language rather than a query language.

In most cases, this business logic requires building an application, which is more time-consuming than expressing the logic as a query targeting a TSDB. Building the base application code is the first step, followed by the CI/CD (or any delivery process), and setting up its monitoring. However, managing hundreds of little apps like these will add an organic cost, due to the need to maintain them along with the underlying infrastructure.

We wanted to ensure these valuable tasks did not stack up on the heads of a few developers, who would then need to carry the responsibilities of data ownership and computing resources, so we wondered how we could automate things, without relying on the team to set up the compute jobs each time someone needed something.

We wanted a solution that would focus on the business logic, without needing to run an entire app. This way, someone wanting to generate a JSON file with a daily data report (for example) would only need to express the corresponding query.

Running business logic over Loops

You shall not FaaS!

Scheduling jobs is an old, familiar routine. Be it bash cron jobs, runners, or specialised schedulers, when it comes to wrapping a snippet of code and making it run periodically, there is a name for it: FaaS.

FaaS was born with a simple goal in mind: reduce development time. We could have found an open source implementation to evaluate (e.g. OpenFaaS), but most of these relied upon a managed container stack. Having one container per query would be very costly, and warming up a container to execute the function and then freezing it would have been very counterproductive.

This would have required more scheduling and automation than we wanted for our end-goal, would have led to suboptimal performance, and would have introduced a new requirement for cluster capacity management. There is also a build time required to deploy a new function in a container, which is consequently not free.

#def <Loops>

That was when we decided to build “Loops”: an application platform where you can push the code you want to run. That’s all. The goal is to push a function (literally!) rather than a module, like all current FaaS solutions do:

function dailyReport(event) {
    return Promise.resolve('Today, everything is fine !')
}

You can then execute it either manually, with an HTTP call, or periodically, with a Cron-like scheduler.
Both of these aspects are necessary, since you might (for example) have a monthly report, but one day require an additional one, 15 days after the last report. Loops makes it easy to manually generate your new report, in addition to the monthly one.

There were some necessary constraints when we began building Loops:

  • This platform must be able to easily scale, to support OVH’s production load
  • It must be highly available
  • It must be language-agnostic, because some of us prefer Python, and others JavaScript
  • It must be reliable
  • The scheduling part mustn’t be correlated with the execution one (μService culture)
  • It must be secure and isolated, so anybody can push obscure code on the platform

Loops implementation

We chose to build our first version on V8. We chose JavaScript as the first language, because it’s easy to learn, and asynchronous data flows are easily managed using Promises. Also, it fits very well with a FaaS, since JavaScript functions are highly expressive. We built it around the new NodeJS VM module, which allows you to execute code in a dedicated V8 context.

A V8 context is like an object (JSON), isolated from your execution. In a context, you can find native functions and objects. However, if you craft a new V8 context, you will see that some variables or functions are not natively available (setTimeout(), setInterval() or Buffer, for example). If you want to use these, you will have to inject them into your new context. The last important thing to remember is that once you have your new context, you can easily execute a JavaScript script, passed as a string, in it.
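To make this concrete, here is a minimal sketch using the NodeJS `vm` module. It is not the Loops source code; the variable names are ours, and it only shows that a fresh context lacks `setTimeout` until it is injected, and that a script held in a string can be run in a chosen context.

```javascript
const vm = require('node:vm');

// A fresh context only exposes the JavaScript built-ins (Object, Promise, JSON...).
const bare = vm.createContext({});
const bareHasTimer = vm.runInContext('typeof setTimeout', bare);

// Host features such as setTimeout or Buffer must be injected explicitly.
const sandbox = vm.createContext({ setTimeout, Buffer });
const injectedHasTimer = vm.runInContext('typeof setTimeout', sandbox);

// A script held in a string is then executed inside the chosen context.
const source = "Buffer.from('loops').toString('hex')";
const hex = vm.runInContext(source, sandbox);

console.log(bareHasTimer, injectedHasTimer, hex);
// prints: undefined function 6c6f6f7073
```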

Contexts fulfil the most important part of our original list of requirements: isolation. Each V8 context is isolated, so it cannot talk to another context. This means a global variable defined in one context is not available in a different one. You will have to build a bridge between them if you want this to be the case.
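The isolation can be observed with a short sketch (again using the `vm` module, with illustrative names such as `secret` and `shared` that are our own): a global defined in one context is invisible in another, and a "bridge" is simply an explicit copy made by the host.

```javascript
const vm = require('node:vm');

const contextA = vm.createContext({});
const contextB = vm.createContext({});

// Define a global variable in context A only.
vm.runInContext('var secret = 42;', contextA);

const seenFromA = vm.runInContext('typeof secret', contextA); // 'number'
const seenFromB = vm.runInContext('typeof secret', contextB); // 'undefined'

// A bridge: the host explicitly copies the value into context B.
contextB.shared = contextA.secret;
const bridged = vm.runInContext('shared', contextB); // 42

console.log(seenFromA, seenFromB, bridged);
```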

We didn’t want to execute scripts with eval(), since a call to this function executes JS code in the main context, shared with the code calling it. The script can then access the same objects, constants, variables, etc. This security issue was a deal breaker for the new platform.
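The difference can be illustrated with a small sketch. The `apiKey` variable and both helper functions are hypothetical; the point is only that a script run through `eval()` can read the caller's variables, while the same script run in a dedicated context cannot.

```javascript
const vm = require('node:vm');

// eval() runs the script in the caller's scope: local variables are reachable.
function runWithEval(source) {
  const apiKey = 'host-secret'; // hypothetical sensitive value held by the host
  return eval(source);
}

// A dedicated context only contains what the host decided to inject.
function runInVm(source) {
  const apiKey = 'host-secret'; // same value, but never injected into the context
  return vm.runInContext(source, vm.createContext({}));
}

const evalSees = runWithEval('typeof apiKey'); // 'string'
const vmSees = runInVm('typeof apiKey');       // 'undefined'

console.log(evalSees, vmSees);
```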

Now that we know how to execute our scripts, let’s implement some management for them. To be stateless, each Loops worker instance (i.e. a JavaScript engine able to run code in a VM context) must have the latest version of each Loop (a Loop is a script to execute). This means that when a user pushes a new Loop, we have to sync it on each Loops worker. This model fits well with the pub/sub paradigm, and since we already use Kafka as a pub/sub infrastructure, it was just a matter of creating a dedicated topic and consuming it from the workers. In this case, publication involves an API where a user submits their Loops, which produces a Kafka event containing the function body. As each worker has its own Kafka consumer group, they all receive the same messages.

Workers subscribe to Loops updates as Kafka consumers and maintain a Loop store, which is an embedded key/value map (the key is the Loop hash; the value is the function’s current revision). In the API part, Loop hashes are used as URL parameters to identify which Loop to execute. Once called, a Loop is retrieved from the map, then injected in a V8 context, executed, and dropped. This hot code reload mechanism ensures that each Loop can be executed on every worker. We can also leverage our load balancers’ capabilities to distribute the load on the workers. This simple distribution model avoids complex scheduling and eases the maintainability of the overall infrastructure.
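The retrieve, inject, execute and drop cycle could look roughly like the sketch below. Everything here is an assumption made for illustration: the `loopStore` map, the `onLoopUpdate` and `executeLoop` helpers, and the way the hash is derived (the article does not say how Loop hashes are computed).

```javascript
const vm = require('node:vm');
const crypto = require('node:crypto');

// Hypothetical in-memory Loop store: Loop hash -> current function source.
const loopStore = new Map();

// Assumption: the hash is derived from the Loop ID with SHA-256.
function loopHash(id) {
  return crypto.createHash('sha256').update(id).digest('hex').slice(0, 12);
}

// Called for every Loop update received by the worker (e.g. from Kafka).
function onLoopUpdate(id, source) {
  const hash = loopHash(id);
  loopStore.set(hash, source); // the newest revision replaces the previous one
  return hash;
}

// Retrieve the Loop, run it in a throwaway context, and return a Promise.
function executeLoop(hash, event) {
  const source = loopStore.get(hash);
  if (source === undefined) {
    return Promise.reject(new Error(`Unknown Loop: ${hash}`));
  }
  const context = vm.createContext({});
  // Parentheses make the source evaluate as a function expression.
  const fn = vm.runInContext(`(${source})`, context);
  // The executor also turns a synchronous throw into a rejection.
  return new Promise((resolve) => resolve(fn(event)));
}

const reportHash = onLoopUpdate(
  'daily-report',
  "function dailyReport(event) { return Promise.resolve('Today, everything is fine !') }"
);
executeLoop(reportHash, {}).then((message) => console.log(message));
```

The context is created per call and never stored, so it becomes eligible for garbage collection once the Promise settles.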

In order to be reboot-proof, we make use of Kafka’s very handy log compaction feature. Log compaction allows Kafka to keep the last version of each keyed message. When a user creates a new Loop, it will be given a unique ID, which is used as a Kafka message key. When a user updates a Loop, this new message will be forwarded to all consumers, but since the key already exists, only the last revision will be kept by Kafka. When a worker restarts, it will consume all messages to rebuild its internal KV, so the previous state will be restored. Kafka is used here as a persistent store.
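The effect of log compaction on a restart can be simulated without Kafka. In the sketch below, the topic is a plain array of `{ key, value }` messages, and `compact` and `rebuildStore` are hypothetical helpers. Treating a `null` value as a deletion mirrors Kafka's tombstone convention, but whether Loops deletes Loops this way is our assumption.

```javascript
// Keep only the last message for each key, as log compaction eventually does.
function compact(log) {
  const lastIndex = new Map();
  log.forEach((message, index) => lastIndex.set(message.key, index));
  return log.filter((message, index) => lastIndex.get(message.key) === index);
}

// Replay messages in order to rebuild the worker's key/value store.
function rebuildStore(messages) {
  const store = new Map();
  for (const { key, value } of messages) {
    if (value === null) {
      store.delete(key); // tombstone: the Loop was deleted
    } else {
      store.set(key, value);
    }
  }
  return store;
}

const topic = [
  { key: 'loop-1', value: 'revision 1' },
  { key: 'loop-2', value: 'revision 1' },
  { key: 'loop-1', value: 'revision 2' },
];

// Replaying the full log or the compacted log yields the same state.
const fromFullLog = rebuildStore(topic);
const fromCompactedLog = rebuildStore(compact(topic));
console.log(fromFullLog.get('loop-1'), fromCompactedLog.get('loop-1'));
// prints: revision 2 revision 2
```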

Creating, editing and deleting loops

Loops runtimes

Even if the underlying engine is able to run native JavaScript, as stated above, we wanted it to run more idiomatic Time Series queries like TSL or WarpScript™. To achieve this, we created a Loops Runtime abstraction that wraps not only JavaScript, but also TSL and WarpScript™ queries into JavaScript code. Users have to declare a Loop with its runtime, after which it’s just a matter of wrappers doing the work. For example, executing a WarpScript™ Loop involves taking the plain WarpScript™ and sending it through a node-request HTTP call.
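One way to picture the runtime abstraction is a table of wrappers, each turning the user's source into a JavaScript function that returns a Promise. This is only a sketch under our own assumptions: `makeRuntimes`, `wrapLoop`, the runtime keys and the endpoint URL are hypothetical, and the HTTP client is passed in as a `post(url, body)` function rather than tied to a specific library.

```javascript
const vm = require('node:vm');

// Hypothetical endpoint; the real one depends on the WarpScript deployment.
const EXEC_URL = 'https://warp10.example.com/api/v0/exec';

// Build the wrappers. `post(url, body)` must return a Promise.
function makeRuntimes(post) {
  return {
    // JavaScript: the source already is a function, evaluated in its own context.
    js: (source) => vm.runInContext(`(${source})`, vm.createContext({})),
    // WarpScript: the generated function just ships the script over HTTP.
    ws: (source) => () => post(EXEC_URL, source),
  };
}

// Turn a declared Loop ({ runtime, source }) into an executable function.
function wrapLoop(runtimes, loop) {
  const wrap = runtimes[loop.runtime];
  if (!wrap) {
    throw new Error(`Unsupported runtime: ${loop.runtime}`);
  }
  return wrap(loop.source);
}

// Usage with a fake HTTP client that only records what would be sent.
const sent = [];
const fakePost = (url, body) => {
  sent.push({ url, body });
  return Promise.resolve('[]');
};
const loopFn = wrapLoop(makeRuntimes(fakePost), { runtime: 'ws', source: 'NOW' });
loopFn().then((response) => console.log(sent[0].url, response));
```

A TSL wrapper would follow the same shape as the `ws` one, with a different endpoint.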

Running a Loop

Loops feedback

Executing code safely is a start, but when it comes to executing arbitrary code, it’s also useful to get some feedback on the execution state. Was it successful or not? Is there an error in the function? If a Loop is in a failure state, the user should be notified straight away.

This leads us to one special condition: a user’s scripts must be able to tell if everything is OK or not. There are two ways to do that in the underlying JavaScript engine: callbacks and Promises.
We chose to go with Promises, which offer better asynchronous management. Every Loop returns a Promise at the end of the script. A rejected Promise will produce an HTTP 500 error status, while a resolved one will produce an HTTP 200 status.
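The mapping from Promise outcome to HTTP status can be sketched in a few lines. The `runAndReport` helper and the shape of its result are our own invention for illustration, not the platform's actual API.

```javascript
// Run a Loop function and translate its Promise into an HTTP-like response.
function runAndReport(loopFn, event) {
  // The executor form also converts a synchronous throw into a rejection.
  return new Promise((resolve) => resolve(loopFn(event)))
    .then((result) => ({ status: 200, body: String(result) }))
    .catch((error) => {
      const message = error instanceof Error ? error.message : String(error);
      return { status: 500, body: message };
    });
}

const succeeding = () => Promise.resolve('Today, everything is fine !');
const failing = () => Promise.reject(new Error('query timed out'));

runAndReport(succeeding, {}).then((response) => console.log(response.status)); // 200
runAndReport(failing, {}).then((response) => console.log(response.status));    // 500
```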

Loops scheduling

When publishing Loops, you can declare several triggers, in a similar way to Cron. Each trigger will perform an HTTP call to your Loop, with optional parameters.

Based on these semantics, to generate multiple reports, we can register a single function that is scheduled with different contexts, defined by various parameters (region, rate, etc.). See the example below:

functions:
  warp_apps_by_cells:
    handler: apps-by-cells.mc2
    runtime: ws
    timeout: 30
    environment:
    events:
      - agra:
          rate: R/2018-01-01T00:00:00Z/PT5M/ET1M
          params:
            cell: ovh-a-gra
      - abhs:
          rate: R/2018-01-01T00:00:00Z/PT5M/ET1M
          params:
            cell: ovh-a-bhs

The scheduling is based on Metronome, which is an open-source event scheduler, with a specific focus on scheduling rather than execution. It’s a perfect fit for Loops, since Loops handle the execution, while relying on Metronome to drive execution calls.
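Since each trigger boils down to an HTTP call with parameters, the two events of the configuration above could translate into calls like the ones built below. The URL layout (`/loops/<hash>`), the host name and the `triggerUrl` helper are hypothetical; only the event names, rates and `cell` parameters come from the example.

```javascript
// The two events from the configuration above, as plain objects.
const events = [
  { name: 'agra', rate: 'R/2018-01-01T00:00:00Z/PT5M/ET1M', params: { cell: 'ovh-a-gra' } },
  { name: 'abhs', rate: 'R/2018-01-01T00:00:00Z/PT5M/ET1M', params: { cell: 'ovh-a-bhs' } },
];

// Build the URL a trigger would call (hypothetical layout: /loops/<hash>?<params>).
function triggerUrl(baseUrl, loopHash, params) {
  const url = new URL(`/loops/${encodeURIComponent(loopHash)}`, baseUrl);
  url.search = new URLSearchParams(params).toString();
  return url.toString();
}

// One registered function, called with a different context per event.
const calls = events.map((event) =>
  triggerUrl('https://loops.example.com', 'abc123', event.params)
);

console.log(calls.join('\n'));
// prints:
// https://loops.example.com/loops/abc123?cell=ovh-a-gra
// https://loops.example.com/loops/abc123?cell=ovh-a-bhs
```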

Loops pipelines

A Loops project can have several Loops. One of our customers’ common use cases was to use Loops as a data platform, in a data flow fashion. Data flow is a way to describe a pipeline of execution steps. In a Loops context, there is a global `Loop` object, which allows the script to execute another Loop by name. You can then chain Loop executions that will act as step functions.
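The article does not document the methods of the global `Loop` object, so the sketch below uses a stand-in with hypothetical `register` and `exec` methods. It only illustrates the idea of chaining Loops as steps, each one feeding its result to the next.

```javascript
// Hypothetical stand-in for the global `Loop` object available to scripts.
const registry = new Map();
const Loop = {
  register(name, fn) {
    registry.set(name, fn);
  },
  // Execute another Loop by name; always returns a Promise.
  exec(name, input) {
    const fn = registry.get(name);
    if (!fn) {
      return Promise.reject(new Error(`Unknown Loop: ${name}`));
    }
    return Promise.resolve().then(() => fn(input));
  },
};

// Three small steps...
Loop.register('fetch', () => [3, 1, 2]);
Loop.register('aggregate', (points) => points.reduce((sum, p) => sum + p, 0));
Loop.register('report', (total) => `total=${total}`);

// ...and one Loop that chains them as a pipeline.
Loop.register('pipeline', () =>
  Loop.exec('fetch')
    .then((points) => Loop.exec('aggregate', points))
    .then((total) => Loop.exec('report', total))
);

Loop.exec('pipeline').then((result) => console.log(result)); // total=6
```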

Pain points: scaling a NodeJS application

Loops workers are NodeJS applications. Most NodeJS developers know that NodeJS uses a single-threaded event loop. If you don’t take care of the threading model of your NodeJS application, you will likely suffer from a lack of performance, since only one host thread will be used.

NodeJS also has a cluster module available, which allows an app to use multiple threads by forking worker processes. That’s why, in a Loops worker, we start N-1 threads for handling API calls, where N is the total number of threads available, which leaves one dedicated to the master thread.

The master thread is in charge of consuming Kafka topics and maintaining the Loops store, while each worker thread starts an API server. For every requested Loop execution, the worker asks the master for the script content, and executes it in a dedicated thread.
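A rough sketch of this split with the NodeJS `cluster` module follows. The function names, the message format and the decision to keep at least one worker on a single-core host are all our assumptions. Nothing is started automatically: the master would call `startMaster(store)`, and a worker would call `requestSource(hash)` before executing a Loop.

```javascript
const cluster = require('node:cluster');
const os = require('node:os');

// N-1 workers serve API calls; one slot is left for the master.
// Assumption: keep at least one worker when only one thread is available.
function workerCount(totalThreads) {
  return Math.max(1, totalThreads - 1);
}

// Master side: build the answer to a worker asking for a Loop's source.
function answerRequest(store, message) {
  return { id: message.id, source: store.get(message.hash) ?? null };
}

// Master side: fork the workers and answer their requests over IPC.
function startMaster(store) {
  const count = workerCount(os.cpus().length);
  for (let i = 0; i < count; i += 1) {
    const worker = cluster.fork();
    worker.on('message', (message) => worker.send(answerRequest(store, message)));
  }
}

// Worker side: ask the master for a script and wait for the matching answer.
function requestSource(hash) {
  return new Promise((resolve) => {
    const id = `${process.pid}-${Date.now()}-${Math.random()}`;
    const onMessage = (message) => {
      if (message.id !== id) return; // answer to another pending request
      process.off('message', onMessage);
      resolve(message.source);
    };
    process.on('message', onMessage);
    process.send({ id, hash });
  });
}
```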

With this setup, one NodeJS application with one Kafka consumer is started per server, which makes it very easy to scale out the infrastructure, just by adding additional servers or cloud instances.

Conclusion

In this post, we previewed Loops, a scalable, metrics-oriented FaaS with native JavaScript support, and extended WarpScript™ and TSL support.

We still have a few things to enhance, like ES5-style dependency imports and metrics previews for our customers’ Loops projects. We also plan to add more runtimes, especially WASM, which would allow many other languages that can target it, like Go, Rust or Python, to suit most developer preferences.

The Loops platform was part of a requirement to build higher-level features around OVH Observability products. It’s a first step towards offering more automated services, like metrics rollups, aggregation pipelines, or logs-to-metrics extractors.

This tool was built as part of the Observability products suite with a higher abstraction level in mind, but you might also want direct access to the API, in order to implement your own automated logic for your metrics. Would you be interested in such a feature? Visit our Gitter channel to discuss it with us!
