python Archives - OVHcloud Blog

Deploy a custom Docker image for Data Science project – A spam classifier with FastAPI (Part 3)

Eléa Petton — Fri, 30 Dec 2022 10:39:54 +0000

A guide to deploy a custom Docker image for an API with FastAPI and AI Deploy.

Welcome to the third article concerning custom Docker image deployment. If you haven’t read the previous ones, you can check them out:

– Gradio sketch recognition app
– Streamlit app for EDA and interactive prediction

When creating code for a Data Science project, you probably want it to be as portable as possible. In other words, it can be run as many times as you like, even on different machines.

Unfortunately, Data Science code often works fine locally on one machine but raises errors at runtime on another. This can be due to different versions of the libraries installed on the host machine.

To deal with this problem, you can use Docker.

The article is organized as follows:

Objectives
Concepts
Define a model for spam classification
Build the FastAPI app with Python
Containerize your app with Docker
Launch the app with AI Deploy

All the code for this blogpost is available in our dedicated GitHub repository. You can test it with the OVHcloud AI Deploy tool. Please refer to the documentation to boot it up.

Objectives

In this article, you will learn how to develop a FastAPI API for spam classification.

Once your app is up and running locally, it will be a matter of containerizing it, then deploying the custom Docker image with AI Deploy.

Concepts

In Artificial Intelligence, you have probably heard of Natural Language Processing (NLP). NLP gathers several tasks related to language processing such as text classification.

This technique is ideal for distinguishing spam from other messages.

Spam Ham Collection Dataset

The SMS Spam Collection is a public set of SMS labeled messages that have been collected for mobile phone spam research.

The dataset contains 5,574 messages in English. The SMS are tagged as follows:

HAM if the message is legitimate
SPAM if it is not

The collection is a text file, where each line has the correct class followed by the raw message.
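
As an illustration of this layout, here is a minimal sketch that parses a few lines in the same "label, tab, message" form with the Python standard library. The two sample messages are invented for the example and are not taken from the dataset.

```python
import csv
import io

# Made-up sample in the same layout as the dataset: label, tab, raw message.
SAMPLE = "ham\tSee you at lunch?\nspam\tWIN a free prize now\n"


def parse_messages(text):
    """Return a list of (label, message) tuples from tab-separated text."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    # Keep only well-formed rows made of exactly one label and one message.
    return [(row[0], row[1]) for row in reader if len(row) == 2]


rows = parse_messages(SAMPLE)
print(rows)
```

Quoting is disabled here on the assumption that a raw SMS may contain quote characters that should be kept as plain text.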

Logistic regression

What is a Logistic Regression?

Logistic regression is a statistical model. It is used to study the relationship between a set of explanatory variables (Xi) and a qualitative variable (Y).

It is a generalized linear model whose link function is the logit, the inverse of the logistic function.

A logistic regression model can also predict the probability of an event occurring (value close to 1) or not (value close to 0) from the optimization of the regression coefficients. This result always varies between 0 and 1.

For the spam classification use case, words are inputs and class (spam or ham) is output.
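
To make the "value between 0 and 1" idea concrete, here is a minimal sketch of the logistic function applied to a weighted sum of inputs. The weights, bias and feature values are hypothetical and only serve the illustration; they are not the coefficients of the model trained in this article.

```python
import math


def logistic(z):
    """Map a real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, features):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


# Hypothetical coefficients for two binary word features.
probability = predict_probability(weights=[2.0, -1.0], bias=-0.5, features=[1, 0])
print(round(probability, 3))
```

A score of zero gives exactly 0.5, and larger positive scores push the probability towards 1.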

FastAPI

What is FastAPI?

FastAPI is a web framework for building RESTful APIs with Python.

FastAPI is based on Pydantic and Python type hints to validate, serialize and deserialize data, and to automatically generate OpenAPI documents.

Docker

The Docker platform allows you to build, run and manage isolated applications. The principle is to build an application that contains not only the written code but also all the context needed to run it, such as the libraries and their versions.

When you wrap your application with all its context, you build a Docker image, which can be saved in your local repository or in the Docker Hub.

To get started with Docker, please, check this documentation.

To build a Docker image, you will define 2 elements:

the application code (FastAPI app)
the Dockerfile

In the next steps, you will see how to develop the Python code for your app, but also how to write the Dockerfile.

Finally, you will see how to deploy your custom docker image with OVHcloud AI Deploy tool.

AI Deploy

AI Deploy enables AI models and managed applications to be started via Docker containers.

To know more about AI Deploy, please refer to this documentation.

Define a model for spam classification

❗ To develop an API that uses a Machine Learning model, you have to load the model in the correct format. For this tutorial, a Logistic Regression is used and the Python file model.py is used to define it.

To better understand the model.py code, refer to the notebook which details all the steps.

First of all, you have to import the Python libraries needed to create the Logistic Regression in the model.py file.

import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

Now, you can create the Logistic Regression based on the Spam Ham Collection Dataset. The Python framework named Scikit-Learn is used to define this model.

Firstly, you can load the dataset and transform your input file into a dataframe.

You will also be able to define the input and the output of the model.

def load_data():

    PATH = 'SMSSpamCollection'
    df = pd.read_csv(PATH, delimiter = "\t", names=["classe", "message"])

    X = df['message']
    y = df['classe']

    return X, y

In a second step, you split the data into a training set and a test set.

To separate the dataset fairly and to have a test_size between 0 and 1, you can calculate ntest as follows.

def split_data(X, y):

    ntest = 2000/(3572+2000)

    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=ntest, random_state=0)

    return X_train, y_train
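
The ratio used for test_size can be checked quickly. The sketch below only reproduces the arithmetic of split_data, with the same two row counts (3572 and 2000); the helper name holdout_fraction is introduced for the illustration.

```python
N_TRAIN = 3572  # rows kept for training, as in split_data
N_TEST = 2000   # rows kept for testing, as in split_data


def holdout_fraction(n_train, n_test):
    """Fraction of the whole dataset reserved for the test set."""
    return n_test / (n_train + n_test)


ntest = holdout_fraction(N_TRAIN, N_TEST)
print(f"test_size = {ntest:.3f}")
```

The value is a float between 0 and 1, which is the form train_test_split expects when test_size is given as a proportion.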

Now you can concentrate on creating the Machine Learning model. To do this, create a spam_classifier_model function.

To fully understand the code, refer to Steps 6 to 9 of this notebook. In these steps you will learn how to:

create the model using Logistic Regression
evaluate on the test set
do dimension reduction with stop words and term frequency
do dimension reduction as a post-processing of the model

def spam_classifier_model(Xtrain, ytrain):

    model_logistic_regression = LogisticRegression()
    model_logistic_regression = model_logistic_regression.fit(Xtrain, ytrain)

    coeff = model_logistic_regression.coef_
    coef_abs = np.abs(coeff)

    quantiles = np.quantile(coef_abs,[0, 0.25, 0.5, 0.75, 0.9, 1])

    index = np.where(coeff[0] > quantiles[1])
    newXtrain = Xtrain[:, index[0]]

    model_logistic_regression = LogisticRegression()
    model_logistic_regression.fit(newXtrain, ytrain)

    return model_logistic_regression, index
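
The selection step in the middle of this function can be sketched on a toy example with the standard library. The coefficients below are invented. The sketch mirrors the comparison used above, where the signed coefficients are compared with the first quartile of their absolute values, and it assumes that statistics.quantiles with method="inclusive" interpolates like the default of np.quantile.

```python
import statistics


def select_features(coefficients):
    """Indices of coefficients above the first quartile of the absolute values."""
    abs_values = [abs(c) for c in coefficients]
    first_quartile = statistics.quantiles(abs_values, n=4, method="inclusive")[0]
    # Same comparison as in spam_classifier_model: signed value vs. quartile.
    return [i for i, c in enumerate(coefficients) if c > first_quartile]


# Invented coefficients, one per word of a five-word vocabulary.
toy_coefficients = [0.1, -0.5, 0.8, 0.05, 1.2]
kept = select_features(toy_coefficients)
print(kept)
```

Note that with this comparison a strongly negative coefficient such as -0.5 is dropped, even though its absolute value is above the quartile.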

Once these Python functions are defined, you can call and apply them as follows.

Firstly, extract input and output data with load_data():

data_input, data_output = load_data()

Secondly, split the data using the split_data(data_input, data_output):

X_train, ytrain = split_data(data_input, data_output)

❗ Here, there is no need to use the test set. Indeed, the evaluation of the final model has already been done in Step 9 - Dimensionality reduction: post processing of the model of the notebook.

Thirdly, fit and transform the training set. To prepare the data, you can use CountVectorizer from Scikit-Learn to remove stop words, then fit_transform to learn the vocabulary and encode the inputs.

vectorizer = CountVectorizer(stop_words='english', binary=True, min_df=10)
Xtrain = vectorizer.fit_transform(X_train.tolist())
Xtrain = Xtrain.toarray()
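
To see what a binary bag-of-words encoding produces, here is a simplified pure-Python stand-in. It is not CountVectorizer itself: the stop-word list is a tiny made-up one, the tokenizer only splits on spaces, and the messages are invented.

```python
STOP_WORDS = {"a", "the", "for", "you", "to", "at"}  # tiny illustrative list


def tokenize(message):
    """Lowercase, split on whitespace and drop stop words."""
    return [word for word in message.lower().split() if word not in STOP_WORDS]


def fit_vocabulary(messages, min_df=1):
    """Keep the words that appear in at least min_df messages."""
    doc_freq = {}
    for message in messages:
        for word in set(tokenize(message)):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(word for word, count in doc_freq.items() if count >= min_df)


def transform(messages, vocabulary):
    """Encode each message as 0/1 flags, one per vocabulary word."""
    rows = []
    for message in messages:
        present = set(tokenize(message))
        rows.append([1 if word in present else 0 for word in vocabulary])
    return rows


train = ["free prize for you", "see you at lunch", "free lunch"]
vocabulary = fit_vocabulary(train, min_df=2)
encoded = transform(["A free service for you"], vocabulary)
print(vocabulary, encoded)
```

The min_df argument plays the same role as in the call above: rare words are left out of the vocabulary, which keeps the input matrix small.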

Fourthly, use the model and index for prediction by calling spam_classifier_model function.

model_logistic_regression, index = spam_classifier_model(Xtrain, ytrain)

Find out the full Python code here.

Have you successfully defined your model? Good job 🥳 !

Let’s go for the creation of the API!

Build the FastAPI app with Python

❗ All the code below is available in the app.py file. You can find the complete Python code of the app.py file here.

To begin, you can import the dependencies for the FastAPI app.

uvicorn
fastapi
pydantic

import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from model import model_logistic_regression, index, vectorizer

In the first place, you can initialize an instance of FastAPI.

app = FastAPI()

Next, you can define the data format by creating the Python class named request_body. Here, the string (str) format is required.

class request_body(BaseModel):
    message : str

Now, you can create the process function in order to prepare the sent message to be used by the model.

def process_message(message):

    desc = vectorizer.transform(message)
    dense_desc = desc.toarray()
    dense_select = dense_desc[:, index[0]]

    return dense_select

When this function returns, the message no longer contains stop words: the transform call has encoded it in the format expected by the model, and the result is a dense array restricted to the selected features.

Now that the function for processing the input data is defined, you can move on to the GET and POST methods.

First, let’s go for the GET method!

@app.get('/')
def root():
    return {'message': 'Welcome to the SPAM classifier API'}

Here you can see the welcome message when you arrive on your API.

{"message":"Welcome to the SPAM classifier API"}

Now it’s the turn of the POST method. In this part of the code, you will be able to:

define the message format
check if a message has been sent or not
process the message to fit with the model
extract the probabilities
return the results

@app.post('/spam_detection_path')
def classify_message(data : request_body):

    # Reject empty or whitespace-only messages before calling the model
    if not data.message.strip():
        raise HTTPException(status_code=400, detail="Please Provide a valid text message")

    # The vectorizer expects an iterable of documents, hence the one-element list
    message = [
        data.message
    ]

    dense_select = process_message(message)

    label = model_logistic_regression.predict(dense_select)
    proba = model_logistic_regression.predict_proba(dense_select)

    # predict_proba columns follow the class order: 'ham' first, then 'spam'
    if label[0]=='ham':
        label_proba = proba[0][0]
    else:
        label_proba = proba[0][1]

    return {'label': label[0], 'label_probability': label_proba}
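
The if/else above relies on the order of the probability columns. Scikit-Learn classifiers expose that order in their classes_ attribute, so the lookup can also be written generically. The sketch below uses plain lists with invented probabilities in place of the real model output; the helper name probability_of_label is hypothetical.

```python
def probability_of_label(label, classes, proba_row):
    """Pick the probability that matches the predicted label."""
    return proba_row[list(classes).index(label)]


# Stand-ins for model.classes_ and one row of model.predict_proba(...).
classes = ["ham", "spam"]
proba_row = [0.93, 0.07]

print(probability_of_label("ham", classes, proba_row))
print(probability_of_label("spam", classes, proba_row))
```

This form keeps working if more classes are added later, at the cost of one extra lookup.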

❗ Again, you can find the full code here.

Before deploying your API, you can test it locally using the following command:

uvicorn app:app --reload

Then, you can test your app locally at the following address: http://localhost:8000/

You will arrive on the following page:

How to interact with your API?

You can add /docs at the end of the url of your app: http://localhost:8000/docs

A new page opens to you. It provides a complete dashboard for interacting with the API!

To be able to send a message for classification, select /spam_detection_path in the green box. Click on Try it out and type the message of your choice in the dedicated zone. It must be in the form of a string.

Example: "A new free service for you only"

To get the result of the prediction, click on the Execute button.

Finally, you obtain the result of the prediction with the label and the confidence score.
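
The same request can also be sent from a script instead of the /docs page. Here is a minimal sketch with the Python standard library, assuming the API is running locally on port 8000 as above. The helper names build_request and classify are introduced for the example; only build_request is executed here, so the snippet runs even when the server is stopped.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/spam_detection_path"


def build_request(message, url=API_URL):
    """Prepare a POST request carrying the JSON body expected by the API."""
    payload = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(message, url=API_URL):
    """Send the message to the running API and return the decoded JSON answer."""
    with urllib.request.urlopen(build_request(message, url)) as response:
        return json.loads(response.read().decode("utf-8"))


req = build_request("A new free service for you only")
print(req.get_method(), req.full_url)
```

With the server started, calling classify("A new free service for you only") should return a dictionary with the label and label_probability keys defined in the POST method.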

Your app works locally? Congratulations 🎉 !

Now it’s time to move on to containerization!

Containerize your app with Docker

First of all, you have to build the file that will contain the different Python modules to be installed with their corresponding version.

Create the requirements.txt file

The requirements.txt file will allow us to write all the modules needed to make our application work.

fastapi==0.87.0
pydantic==1.10.2
uvicorn==0.20.0
pandas==1.5.1
scikit-learn==1.1.3

This file will be useful when writing the Dockerfile.

Write the Dockerfile

Your Dockerfile should start with the FROM instruction indicating the parent image to use. In our case, we choose to start from a classic Python image.

For this FastAPI app, you can use version 3.8 of Python.

FROM python:3.8

Next, you have to set the working directory and add all the files to it.

❗ Here you must be in the /workspace directory. This is the base directory for launching an OVHcloud AI Deploy app.

WORKDIR /workspace
ADD . /workspace

Install the requirements.txt file which contains your needed Python modules using a pip install… command.

RUN pip install -r requirements.txt

Set the listening port of the container. For FastAPI, you can use the port 8000.

EXPOSE 8000

Then, you have to define the entrypoint and the default launching command to start the application.

ENTRYPOINT ["uvicorn"]
CMD [ "app:app", "--host", "0.0.0.0", "--port", "8000" ]

Finally, you can give correct access rights to OVHcloud user (42420:42420).

RUN chown -R 42420:42420 /workspace
ENV HOME=/workspace

Once your Dockerfile is defined, you will be able to build your custom docker image.

Build the Docker image from the Dockerfile

First, you can launch the following command from the Dockerfile directory to build your application image.

docker build . -t fastapi-spam-classification:latest

⚠️ The dot . argument indicates that your build context (place of the Dockerfile and other needed files) is the current directory.

⚠️ The -t argument allows you to choose the identifier to give to your image. Usually image identifiers are composed of a name and a version tag, written <name>:<version>. For this example we chose fastapi-spam-classification:latest.

Test it locally

Now, you can run the following Docker command to launch your application locally on your computer.

docker run --rm -it -p 8000:8000 --user=42420:42420 fastapi-spam-classification:latest

⚠️ The -p 8000:8000 argument indicates that you want to execute a port redirection from the port 8000 of your local machine into the port 8000 of the Docker container.

⚠️ Don't forget the --user=42420:42420 argument if you want to simulate the exact same behaviour that will occur on AI Deploy. It executes the Docker container as the specific OVHcloud user (user 42420:42420).

Once started, your application should be available on http://localhost:8000.

Your Docker image seems to work? Good job 👍 !

It’s time to push it and deploy it!

Push the image into the shared registry

❗ The shared registry of AI Deploy should only be used for testing purposes. Please consider attaching your own Docker registry. More information about this can be found here.

First, you have to find the address of your shared registry by launching this command.

ovhai registry list

Next, log in on the shared registry with your usual OpenStack credentials.

docker login -u <user> -p <password> <shared-registry-address>

To finish, you need to push the created image into the shared registry.

docker tag fastapi-spam-classification:latest <shared-registry-address>/fastapi-spam-classification:latest

docker push <shared-registry-address>/fastapi-spam-classification:latest

Once you have pushed your custom Docker image into the shared registry, you are ready to launch your app 🚀 !

Launch the AI Deploy app

The following command starts a new AI Deploy app running your FastAPI application.

ovhai app run \
      --default-http-port 8000 \
      --cpu 4 \
      <shared-registry-address>/fastapi-spam-classification:latest

Choose the compute resources

First, you can choose the number of either GPUs or CPUs for your app.

--cpu 4 indicates that we request 4 CPUs for that app.

Make the app public

Finally, if you want your app to be reachable without any authentication, consider adding the --unsecure-http attribute to the command.

Conclusion

Well done 🎉 ! You have learned how to build your own Docker image for a dedicated spam classification API!

You have also been able to deploy this app thanks to OVHcloud’s AI Deploy tool.

Want to find out more?

Notebook

You want to access the notebook? Refer to the GitHub repository.

App

You want to access the full code to create the FastAPI API? Refer to the GitHub repository.

To launch and test this app with AI Deploy, please refer to our documentation.


Deploy a custom Docker image for Data Science project – Streamlit app for EDA and interactive prediction (Part 2)

Eléa Petton — Tue, 11 Oct 2022 07:38:35 +0000

A guide to deploy a custom Docker image for a Streamlit app with AI Deploy.

Welcome to the second article concerning custom Docker image deployment. If you haven’t read the previous one, you can read it on the following link. It was about Gradio and sketch recognition.

When creating code for a Data Science project, you probably want it to be as portable as possible. In other words, it can be run as many times as you like, even on different machines.

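Unfortunately, code that works fine on one machine can fail on another, for example because different versions of the libraries are installed. To deal with this problem, you can use Docker.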

The article is organized as follows:

Objectives
Concepts
Load the trained PyTorch model
Build the Streamlit app with Python
Containerize your app with Docker
Launch the app with AI Deploy

All the code for this blogpost is available in our dedicated GitHub repository. You can test it with the OVHcloud AI Deploy tool. Please refer to the documentation to boot it up.

Objectives

In this article, you will learn how to develop a Streamlit app for two Data Science tasks: Exploratory Data Analysis (EDA) and prediction based on an ML model.

Once your app is up and running locally, it will be a matter of containerizing it, then deploying the custom Docker image with AI Deploy.

Concepts

In Artificial Intelligence, you have probably heard about the famous use case of the Iris dataset. How about learning more about it?

Iris dataset

The Iris Flower Dataset is considered the Hello World of Data Science. It contains four features (length and width of sepals and petals) for 50 samples of each of the three Iris species:

Iris setosa
Iris virginica
Iris versicolor

The dataset is in csv format and you can also find it directly as a dataframe. It contains five columns namely:

Petal length
Petal width
Sepal length
Sepal width
Species type

The objective of the models based on this dataset is to classify the three Iris species. The measurements of petals and sepals are used to create, for example, a linear discriminant model to classify species.

❗ A model to classify Iris species was trained in a previous tutorial, in notebook form, which you can find and test here.

This model is stored in an OVHcloud Object Storage container.

In this article, the first objective is to create an app for Exploratory Data Analysis (EDA). Then you will see how to obtain interactive prediction.

EDA

What is EDA in Data Science?

Exploratory Data Analysis (EDA) is an approach that analyzes data mainly through visualization. It gives you a detailed statistical summary of the data.

In addition, EDA helps you deal with duplicate values and outliers, and reveals trends or patterns present in the dataset.

For the Iris dataset, the aim is to observe the source data on visual graphs using the Streamlit tool.

Streamlit

What is Streamlit?

Streamlit allows you to turn data scripts into shareable web applications quickly, using only Python. Moreover, this framework does not require front-end skills.

This is a time saver for data scientists who want to deploy a data app!

To make this app accessible, you need to containerize it using Docker.

Docker

When you wrap your application with all its context, you build a Docker image, which can be saved in your local repository or in the Docker Hub.

To get started with Docker, please, check this documentation.

To build a Docker image, you will define 2 elements:

the application code (Streamlit app)
the Dockerfile

In the next steps, you will see how to develop the Python code for your app, but also how to write the Dockerfile.

Finally, you will see how to deploy your custom docker image with OVHcloud AI Deploy tool.

AI Deploy

AI Deploy enables AI models and managed applications to be started via Docker containers.

To know more about AI Deploy, please refer to this documentation.

Load the trained PyTorch model

❗ To develop an app that uses a machine learning model, you must first load the model in the correct format. For this tutorial, a PyTorch model is used and the Python file utils.py is used to load it.

The first step is to import the Python libraries needed to load a PyTorch model in the utils.py file.

import torch
import torch.nn as nn
import torch.nn.functional as F

To load your PyTorch model, it is first necessary to define its model architecture by using the Model class defined previously in the part “Step 2 – Define the neural network model” of the notebook.

class Model(nn.Module):
    def __init__(self):

        super().__init__()
        self.layer1 = nn.Linear(in_features=4, out_features=16)
        self.layer2 = nn.Linear(in_features=16, out_features=12)
        self.output = nn.Linear(in_features=12, out_features=3)

    def forward(self, x):

        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        x = self.output(x)

        return x

In a second step, you fill in the access path to the model. To save this model in pth format, refer to the part “Step 6 – Save the model for future inference” of the notebook.

path = "model_iris_classification.pth"

Then, the load_checkpoint function is used to load the model’s checkpoint.

def load_checkpoint(path):

    model = Model()
    print("Model display: ", model)
    model.load_state_dict(torch.load(path))
    model.eval()

    return model

Finally, the function load_model is used to load the model and to use it to obtain the result of the prediction.

def load_model(X_tensor):

    model = load_checkpoint(path)
    predict_out = model(X_tensor)
    _, predict_y = torch.max(predict_out, 1)

    return predict_out.squeeze().detach().numpy(), predict_y.item()
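
The torch.max(predict_out, 1) call returns the highest score and its index, and this index is the predicted class. Here is a minimal pure-Python sketch of the same idea, without PyTorch. The scores are invented, and the species order (0 setosa, 1 versicolor, 2 virginica) is the one used in the rest of this article.

```python
SPECIES = ["setosa", "versicolor", "virginica"]


def predicted_species(scores):
    """Return the index of the highest score and the matching species name."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index, SPECIES[best_index]


# Invented raw outputs of the network for one flower.
index, name = predicted_species([-1.2, 3.4, 0.7])
print(index, name)
```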

Find out the full Python code here.

Have you successfully loaded your model? Good job 🥳 !

Let’s go for the creation of the Streamlit app!

Build the Streamlit app with Python

❗ All the code below is available in the app.py file. The key functions are explained in this article. However, the "main" part of the app.py file is not described. You can find the complete Python code of the app.py file here.

To begin, you can import the dependencies for the Streamlit app.

Numpy
Pandas
Seaborn
load_model function from utils.py
Torch
Streamlit
Scikit-Learn
Plotly
PIL

import numpy as np
import pandas as pd
import seaborn as sns
from utils import load_model
import torch
import streamlit as st
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import plotly.graph_objects as go
import plotly.express as px
from PIL import Image

Then, you must load the source dataset of Iris flowers to be able to extract the characteristics and thus visualize the data. Scikit-Learn lets you load this dataset without having to download it!

Next, you can separate the dataset into an input dataframe and an output dataframe.

Finally, this load_data function is cached so that you don’t have to load the dataset again.

@st.cache
def load_data():
    dataset_iris = load_iris()
    df_inputs = pd.DataFrame(dataset_iris.data, columns=dataset_iris.feature_names)
    df_output = pd.DataFrame(dataset_iris.target, columns=['variety'])

    return df_inputs, df_output
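
The effect of caching can be illustrated with the standard library. The sketch below uses functools.lru_cache, which is not Streamlit's cache but follows the same idea: the function body runs once and later calls reuse the stored result. The function load_numbers and its counter are hypothetical.

```python
from functools import lru_cache

call_count = 0  # counts how many times the function body really runs


@lru_cache(maxsize=None)
def load_numbers():
    """Stand-in for an expensive loading step."""
    global call_count
    call_count += 1
    return (1, 2, 3)


first = load_numbers()
second = load_numbers()
print(call_count)
```

The second call returns the very same object without running the body again, which is what saves time when the app reruns.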

The creation of this Streamlit app is separated into two parts.

Firstly, you can look into the creation of the EDA part. Then you will see how to create an interactive prediction tool using the PyTorch model.

EDA on Iris Dataset

As a first step, you can look at the source dataset by displaying different graphs using the Python Seaborn library.

Seaborn Pairplot allows you to get the relationship between each variable present in Pandas dataframe.

sns.pairplot plots the graph in pairs of several features in a grid format.

@st.cache(allow_output_mutation=True)
def data_visualization(df_inputs, df_output):

    df = pd.concat([df_inputs, df_output['variety']], axis=1)
    eda = sns.pairplot(data=df, hue="variety", palette=['#0D0888', '#CB4779', '#F0F922'])

    return eda

Later, this function will display the following graph thanks to a call in the “main” of app.py file.

Here it can be seen that the setosa 0 variety is easily separated from the other two (versicolor 1 and virginica 2).

Were you able to display your graph? Well done 🎉 !

So, let’s go to the interactive prediction tool 🔜 !

Create an interactive prediction tool

To create an interactive prediction tool, you will need several elements:

Firstly, you need four sliders to play with the input parameters
Secondly, you have to create a function to display the Principal Component Analysis (PCA) graph to visualize the point corresponding to the output of the model
Thirdly, you can build a histogram representing the result of the prediction
Fourthly, you will have a function to display the image of the predicted Iris species

Ready to go? Let’s start creating sliders!

Create a sidebar with sliders for input data

In order to facilitate the visual reading of the Streamlit app, sliders are added in a sidebar.

In this sidebar, four sliders are added so that users can choose the length and width of petals and sepals.

How to create a slider? Well, nothing could be easier than with Streamlit!

You need to call the function st.sidebar.slider() to add a slider to the sidebar. Then you can specify arguments such as the minimum and maximum values, or the default value, which here is the average. Finally, you can specify the step of your slider.

❗ Here you can see the example for a single slider. Find the complete code of the other sliders on the GitHub repo here.

def create_slider(df_inputs):

    sepal_length = st.sidebar.slider(
        label='Sepal Length',
        min_value=float(df_inputs['sepal length (cm)'].min()),
        max_value=float(df_inputs['sepal length (cm)'].max()),
        value=float(round(df_inputs['sepal length (cm)'].mean(), 1)),
        step=0.1)

    sepal_width = st.sidebar.slider(
        ...
        )

    petal_length = st.sidebar.slider(
        ...
        )

    petal_width = st.sidebar.slider(
        ...
        )

    return sepal_length, sepal_width, petal_length, petal_width
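
The arguments passed to each slider are simple statistics of one column. Here is a minimal sketch that computes them for a plain list of values, without Streamlit or Pandas. The helper name slider_settings and the three sample values are made up for the illustration.

```python
from statistics import mean


def slider_settings(values, step=0.1):
    """Compute the bounds, default value and step for one slider."""
    return {
        "min_value": float(min(values)),
        "max_value": float(max(values)),
        "value": float(round(mean(values), 1)),  # rounded mean as default
        "step": step,
    }


# Made-up measurements in centimetres.
settings = slider_settings([4.3, 5.8, 7.9])
print(settings)
```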

Later, this function will be called in the “main” of the app.py file. Afterwards, you will see the following interface:

Thanks to these sliders, you can now obtain the result of the prediction in an interactive way by playing on one or more parameters.

Display PCA graph

Once your sliders are up and running, you can create a function to display the graph of the Principal Component Analysis (PCA).

PCA is a technique that transforms high-dimensional data into lower dimensions while retaining as much information as possible.

What about the Iris dataset? The aim is to be able to display the point resulting from the model prediction on a two-dimensional graph.

The run_pca function below builds the two-dimensional representation of the irises of the source dataset.

@st.cache
def run_pca():

    pca = PCA(2)
    X = df_inputs.iloc[:, :4]
    X_pca = pca.fit_transform(X)
    df_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
    df_pca = pd.concat([df_pca, df_output['variety']], axis=1)

    return pca, df_pca

Thereafter, the black point corresponding to the result of the prediction is placed on the same graph in the “main” of the Python app.py file.

With this method you were able to visualize your point in space. However, the numerical result of the prediction is not displayed.

Therefore, you can also display the results as a histogram.

Return predictions histogram

At the output of the neural network, the results can be positive or negative and the highest value corresponds to the iris species predicted by the model.

To create a histogram, negative values can be removed. To do this, the predictions with positive values are extracted and sent to a list before being transformed into a dataframe.

The negative values are all replaced by the null value.

To summarize, the extract_positive_value function can be translated into the following mathematical formula:
f(prediction) = max(0, prediction)

def extract_positive_value(prediction):

    prediction_positive = []
    for p in prediction:
        if p < 0:
            p = 0
        prediction_positive.append(p)

    return pd.DataFrame({'Species': ['Setosa', 'Versicolor', 'Virginica'], 'Confidence': prediction_positive})
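
The clipping step alone, f(prediction) = max(0, prediction), can be sketched without Pandas as a list comprehension. The scores below are invented for the example.

```python
def clip_negative(prediction):
    """Replace every negative score with zero and keep the others unchanged."""
    return [max(0.0, p) for p in prediction]


# Invented raw scores for setosa, versicolor and virginica.
clipped = clip_negative([-1.2, 3.4, 0.7])
print(clipped)
```

The order of the scores is preserved, so each value still lines up with its species when the bar chart is built.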

This function is then called to build the histogram in the “main” of the Python app.py file. The plotly library lets you build this bar chart as follows.

fig = px.bar(extract_positive_value(prediction), x='Species', y='Confidence', width=400, height=400, color='Species', color_discrete_sequence=['#0D0888', '#CB4779', '#F0F922'])

Show Iris species image

The final step is to display the predicted iris image using a Streamlit button. Therefore, you can define the display_img function to select the correct image based on the prediction.

def display_img(species):

    list_img = ['setosa.png', 'versicolor.png', 'virginica.png']

    return Image.open(list_img[species])

Finally, in the main Python code app.py, st.image() displays the image when the user requests it by pressing the “Show flower image” button.

if st.button('Show flower image'):
    st.image(display_img(species), width=300)
    st.write(df_pred.iloc[species, 0])

❗ Again, you can find the full code here.

Before deploying your Streamlit app, you can test it locally using the following command:

streamlit run app.py

Then, you can test your app locally at the following address: http://localhost:8501/

Your app works locally? Congratulations 🎉 !

Now it’s time to move on to containerization!

Containerize your app with Docker

First of all, you have to build the file that will contain the different Python modules to be installed with their corresponding version.

Create the requirements.txt file

The requirements.txt file will allow us to write all the modules needed to make our application work.

pandas==1.4.4
numpy==1.23.2
torch==1.12.1
streamlit==1.12.2
scikit-learn==1.1.2
plotly==5.10.0
Pillow==9.2.0
seaborn==0.12.0

This file will be useful when writing the Dockerfile.

Write the Dockerfile

Your Dockerfile should start with the FROM instruction indicating the parent image to use. In our case, we choose to start from a classic Python image.

For this Streamlit app, you can use version 3.8 of Python.

FROM python:3.8

Next, you have to set the working directory and add all the files to it.

❗ Here you must be in the /workspace directory. This is the base directory for launching an OVHcloud AI Deploy app.

WORKDIR /workspace
ADD . /workspace

Install the requirements.txt file which contains your needed Python modules using a pip install… command:

RUN pip install -r requirements.txt

Then, you can give correct access rights to OVHcloud user (42420:42420).

RUN chown -R 42420:42420 /workspace
ENV HOME=/workspace

Finally, you have to define your default launching command to start the application.

CMD [ "streamlit", "run", "/workspace/app.py", "--server.address=0.0.0.0" ]

Once your Dockerfile is defined, you will be able to build your custom docker image.

Build the Docker image from the Dockerfile

First, you can launch the following command from the Dockerfile directory to build your application image.

docker build . -t streamlit-eda-iris:latest

⚠️ The dot . argument indicates that your build context (place of the Dockerfile and other needed files) is the current directory.

⚠️ The -t argument allows you to choose the identifier to give to your image. Usually image identifiers are composed of a name and a version tag, written <name>:<version>. For this example we chose streamlit-eda-iris:latest.

Test it locally

Now, you can run the following Docker command to launch your application locally on your computer.

docker run --rm -it -p 8501:8501 --user=42420:42420 streamlit-eda-iris:latest

⚠️ The -p 8501:8501 argument indicates that you want to execute a port redirection from the port 8501 of your local machine into the port 8501 of the Docker container.

Once started, your application should be available on http://localhost:8501.

Your Docker image seems to work? Good job 👍 !

It’s time to push it and deploy it!

Push the image into the shared registry

❗ The shared registry of AI Deploy should only be used for testing purposes. Please consider attaching your own Docker registry. More information about this can be found here.

First, you have to find the address of your shared registry by launching this command.

ovhai registry list

Next, log in on the shared registry with your usual OpenStack credentials.

docker login -u <user> -p <password> <shared-registry-address>

To finish, you need to push the created image into the shared registry.

docker tag streamlit-eda-iris:latest <shared-registry-address>/streamlit-eda-iris:latest
docker push <shared-registry-address>/streamlit-eda-iris:latest

Once you have pushed your custom docker image into the shared registry, you are ready to launch your app 🚀 !

Launch the AI Deploy app

The following command starts a new job running your Streamlit application.

ovhai app run \
      --default-http-port 8501 \
      --cpu 12 \
      <shared-registry-address>/streamlit-eda-iris:latest

Choose the compute resources

First, you can either choose the number of GPUs or CPUs for your app.

--cpu 12 indicates that we request 12 CPUs for that app.

If you want, you can also launch this app with one or more GPUs.

Make the app public

Finally, if you want your app to be reachable without any authentication, add the --unsecure-http attribute to the launch command.

Conclusion

Well done 🎉 ! You have learned how to build your own Docker image for a dedicated EDA and interactive prediction app!

You have also been able to deploy this app thanks to OVHcloud’s AI Deploy tool.

In a third article, you will see how it is possible to deploy a Data Science project with an API for Spam classification.

Want to find out more?

Notebook

You want to access the notebook? Refer to the GitHub repository.

App

You want to access the full code to create the Streamlit app? Refer to the GitHub repository.

To launch and test this app with AI Deploy, please refer to our documentation.

OpenAPI with Python — a state of the art and our latest contribution

François Magimel — Fri, 05 Feb 2021 16:20:04 +0000

At OVHcloud, we love using and building APIs. And to build good software, the first thing you need to do is look at the state of the art in your domain. As a matter of fact, there are more and more tools available and it’s often hard to make a choice without comparisons. Maybe you’ll be tempted to build a new module instead of contributing to an existing one.

We have just open-sourced a Python module, apispec-fromfile, to simplify the usage of OpenAPI in Python by importing OpenAPI specifications from files. And to better explain why we do it, here you have a state of the art about OpenAPI with Python (and more specifically with Flask).

What is OpenAPI?

As you can read on the official website, the OpenAPI Specification is “a broadly adopted industry standard for describing modern APIs”. This standard, formerly named Swagger, is used to describe, produce, consume, and visualize APIs in a vendor neutral format. This format is based on JSON Schema and specification files can be either in YAML or JSON format.

Nowadays, there are two versions in the wild: 2 and 3. Version 2 is the Swagger specification and is quite common thanks to the many tools available. Version 3 is the latest one, the first one from the OpenAPI Initiative (OAI).

openapi: 3.0.3
info:
  title: My Cutie Marks Catalog
  description: This is a sample server for a cutie marks catalog.
  termsOfService: http://example.com/terms/
  contact:
    name: API Support
    url: http://www.example.com/support
    email: support@example.com
  license:
    name: Apache 2.0
    url: https://www.apache.org/licenses/LICENSE-2.0.html
  version: 1.0.1
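
Since a specification file can be written in YAML or in JSON, the same metadata can also be produced as JSON. Here is a minimal sketch using only the Python standard library. The dictionary mirrors a subset of the YAML above, and the empty paths object is an addition made here because an OpenAPI 3.0 document expects one:

```python
import json

# Same metadata as the YAML example above, expressed as a Python dictionary.
spec = {
    "openapi": "3.0.3",
    "info": {
        "title": "My Cutie Marks Catalog",
        "description": "This is a sample server for a cutie marks catalog.",
        "version": "1.0.1",
        "license": {
            "name": "Apache 2.0",
            "url": "https://www.apache.org/licenses/LICENSE-2.0.html",
        },
    },
    # Added for this sketch: OpenAPI 3.0 expects a "paths" object, left empty here.
    "paths": {},
}

# Serialise the dictionary to a JSON specification document.
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```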

Why would you use it?

Well, there are two main purposes:

to describe your API, with nice documentation (and potentially try your endpoints with it directly)
to generate your API.

Moreover, you will want to use the latest version of the OpenAPI Specification, version 3, to take advantage of its new features.

Scenarios

When you want to use OpenAPI Specification, you will fall into one of these three scenarios:

Contract-first driven API: when you start from the specification and get an API as a result
Server-first driven API: when you start from an existing or a new API and get the specification as a result
Legacy API: when you already have an API and want the OpenAPI Specification.

In the first case, you can choose the technology you will build your API with. Many tools allow you to generate your API quickly and easily, like Swagger tools (codegen, editor). The Swagger Editor even encourages you to use specific technologies to start your project with, like Connexion in Python. This is the most convenient way of generating an API with the OpenAPI Specification.

In the second case, you have two possibilities:

starting from scratch: you can choose a technology that can generate the OpenAPI Specification
using a legacy API: you may need to adapt your code to generate the OpenAPI Specification.

In the third case: no generation, no choice of technology, you just have to write your specification manually. 🎈Easy peasy, puddin’ in the freezy 🎈 🎊 🧁 🐊.

So, two things need to be clarified when you are using Python and Flask for your API:

which technology to build your new API with, in order to generate your OpenAPI Specification
which technology to generate your OpenAPI Specification with, from a legacy API.

Here is a graph to help us answer those questions:

Python code to specification file

To get the specification file from your code, you would probably want to use docstrings. Then, one solution is to use the apispec library with its ecosystem. It is a pluggable API specification generator with built-in support for marshmallow. If you are using Voluptuous, you can use that too. And it supports OAS v2 and v3!

from apispec import APISpec
from apispec_fromfile import FromFilePlugin
 
 
# Create an APISpec
spec = APISpec(
    title="My Cutie Marks Catalog",
    version="1.0.1",
    openapi_version="3.0.3",
    plugins=[FromFilePlugin("resource")],
)
print(spec.to_yaml())

You can use it directly, with its web frameworks plugin and its other plugins. Or you can use Flask extensions.
Some frameworks are based on this library to do more things easily (and therefore support OAS v2 and v3):

flask-smorest: based on apispec and marshmallow, uses decorators a lot
flask-apispec: same as flask-smorest, inspired by Flask-RESTful.

Other frameworks are not (totally) based on it, and they often support the OpenAPI Specification v2:

flask-swagger (OAS v2): as said in its description, it is “a Swagger 2.0 specification extractor for Flask”, compatible with Flask-RESTful
flasgger (OAS v2 & v3): a complete fork of flask-swagger, compatible with apispec, with experimental support for OpenAPI v3.

At OVHcloud, some of our APIs are using apispec with our new plugin apispec-fromfile, to avoid putting YAML into docstrings.

from apispec_fromfile import from_file
from extensions import spec
 
 
# Create an endpoint
@from_file("my/spec/file.yml")
def hello():
    return {"hello"}
 
# Register entities and paths
spec.path(resource=hello)
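
For comparison, the docstring convention that this plugin helps to avoid places the YAML after a --- line inside the function's docstring. The sketch below only illustrates that convention with the standard library. The helper extract_spec_text is hypothetical (it is not part of apispec), and it returns the raw YAML text without parsing it:

```python
import inspect


def hello():
    """Say hello.

    ---
    get:
      responses:
        200:
          description: A greeting
    """
    return {"message": "hello"}


def extract_spec_text(func):
    """Return the text that follows the first '---' line of func's docstring."""
    doc = inspect.getdoc(func) or ""
    lines = doc.splitlines()
    for index, line in enumerate(lines):
        if line.strip() == "---":
            return "\n".join(lines[index + 1:])
    return ""  # no embedded specification found


spec_text = extract_spec_text(hello)
print(spec_text)
```

With many endpoints, these embedded blocks quickly make the Python modules hard to read, which is the motivation for moving them into separate files.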

Specification file to documentation

Now that you have your specification file, in JSON or YAML format, you want to use it to describe your API. One way is to use Sphinx to generate a documentation in Python. Those extensions will help you to do that:

sphinxcontrib-openapi: this is using sphinxcontrib-httpdomain to generate a static page
sphinxcontrib-redoc: this is using ReDoc to generate a more dynamic page.

Specification file to Swagger UI

Another cool thing you can do with your specification file is to expose your API over the Swagger UI. This can even be used as a quick dynamic documentation. To do that, you can either spawn a Swagger UI in a container (you may need to expose your specification file) or if you are using one of the following frameworks, it is already embedded:

Python code to Swagger UI

If you choose to start with a framework, some of them can do all the graph traversal and expose an endpoint to a Swagger UI:

Which tool can I use?

After all this reading, you may wonder which tool to use. Let’s complete the three scenarios, with some suggestions:

Contract-first driven API: you can use Connexion (Swagger Editor is using it to generate Python code) or another framework
Server-first driven API: you can either start with a framework (for example flask-smorest) or complete your code with apispec or flasgger
Legacy API: you can complete your code with apispec or flasgger.

And for the documentation, you can use Swagger UI and / or ReDoc.

One thing you need to pay attention to is the version of the OpenAPI Specification. It could be a good thing to start with the latest version (v3).

What is our Python module for: apispec-fromfile?

For one of our APIs, we were in the “server-first driven API” scenario, with an existing API based on Voluptuous and Flask. We wanted to generate the OpenAPI Specification version 3 and documentation from the code, even if we needed to adapt the code a bit.

Flasgger was a good starting point, with its decorator, but its support of apispec is still experimental and we are not using Marshmallow. The flask-swagger library uses a keyword in docstrings to import specification files for each endpoint, but it only supports OpenAPI v2.

Therefore, we kept the idea of using a decorator instead of putting YAML into docstrings, and we built an apispec plugin, which supports OpenAPI v2 and v3: https://github.com/ovh/python-apispec-fromfile ✨ 🍰 🎉. Then we just have to write small specification files gradually, and add a decorator to our functions.

Doing BIG automation with Celery

Bartosz Rabiega — Fri, 06 Mar 2020 16:14:18 +0000

Intro

TL;DR: You might want to skip the intro and jump right into “Celery – Distributed Task Queue”.

Hello! I’m Bartosz Rabiega, and I’m part of the R&D/DevOps teams at OVHcloud. As part of our daily work, we’re developing and maintaining the Ceph-as-a-Service project, in order to provide highly available, solid, distributed storage for various applications. We’re dealing with 60PB+ of data, across 10 regions, so as you might imagine, we’ve got quite a lot of work ahead in terms of replacing broken hardware, handling natural growth, provisioning new regions and datacentres, evaluating new hardware, optimising software and hardware configurations, researching new storage solutions, and much more!

Because of the wide scope of our work, we need to offload as many repetitive tasks as possible. And we do that through automation.

Automating your work

To some extent, every manual process can be described as a set of actions and conditions. If we somehow managed to force something to automatically perform the actions and check the conditions, we would be able to automate the process, resulting in an automated workflow. Take a look at the example below, which shows some generic steps for manually replacing hardware in our project.

Hmm… What could help us do this automatically? Doesn’t a computer sound like a perfect fit? 🙂 There are many ways to force computers to process automated workflows, but first we need to define some building blocks (let’s call them tasks) and get them to run sequentially or in parallel (i.e. a workflow). Fortunately, there are software solutions that can help with that, among which is Celery.

Celery – Distributed Task Queue

Celery is a well-known and widely adopted piece of software that allows us to process tasks asynchronously. The description of the project on its main page (http://www.celeryproject.org/) may sound a little bit enigmatic, but we can narrow down its basic functionality to something like this:

Such machinery is perfectly suited to tasks like sending emails asynchronously (i.e. ‘fire and forget’), but it can also be used for different purposes. So what other tasks could it handle? Basically, any tasks you can implement in Python (the main Celery language)! I won’t go too much into the details, as they are available in the Celery documentation. What matters is that since we can implement any task we want, we can use that to create the building blocks for our automation.

There is one more important thing… Celery natively supports combining such tasks into workflows (Celery primitives: chains, groups, chords, etc.). So let’s get through some examples…

We’ll use the following task definitions – single task, printing args and kwargs:

@celery_app.task
def noop(*args, **kwargs):
    # Task accepts any arguments and does nothing
    print(args, kwargs)
    return True

Now we can execute the task asynchronously, using the following code:

task = noop.s(777)
task.apply_async()

The elementary tasks can be parametrised and combined into a complex workflow using celery methods, i.e. “chain”, “group”, and “chord”. See the examples below. In each of them, the left side shows a visual representation of a workflow, while the right side shows the code snippet that generates it. The green box is the starting point, after which the workflow execution progresses vertically.

Chain – a set of tasks processed sequentially

workflow = (
    chain([noop.s(i) for i in range(3)])
)

Group – a set of tasks processed in parallel

workflow = (
    group([noop.s(i) for i in range(5)])
)

Chord – a group of tasks chained to the following task

workflow = chord(
        [noop.s(i) for i in range(5)],
        noop.s()  # callback, runs once the whole group has finished
)

# Equivalent:
workflow = chain([
        group([noop.s(i) for i in range(5)]),
        noop.s()
])
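
To make the chord semantics concrete without a broker, here is a plain-Python analogy (it is not Celery code): the header functions run in parallel in a thread pool, and the callback runs once with the list of their results. The names square, total and run_chord are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def total(results):
    return sum(results)


def run_chord(header, callback):
    """Run every callable in header in parallel, then pass all results to callback."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(func) for func in header]
        results = [future.result() for future in futures]  # keeps header order
    return callback(results)


# Five parallel "tasks" followed by a single callback.
header = [lambda i=i: square(i) for i in range(5)]
print(run_chord(header, total))  # 0 + 1 + 4 + 9 + 16 = 30
```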

An important point: the execution of a workflow will always stop in the event of a failed task. As a result, a chain won’t be continued if some task fails in the middle of it. This gives us quite a powerful framework for implementing some neat automation, and that’s exactly what we’re using for Ceph-as-a-Service at OVHcloud! We’ve implemented lots of small, flexible, parameterisable tasks, which we combine together to reach a common goal. Here are some real-life examples of elementary tasks, used for the automatic removal of old hardware:

Change weight of Ceph node (used to increase/decrease the amount of data on node. Triggers data rebalance)
Set service downtime (data rebalance triggers monitoring probes, but this is expected, so set downtime for this particular monitoring entry)
Wait until Ceph is healthy (wait until the data rebalance is complete – repeating task)
Remove Ceph node from a cluster (node is empty so it can simply be uninstalled)
Send info to technicians in DC (hardware is ready to be replaced)
Add new Ceph node to a cluster (install new empty node)

We parametrise these tasks and tie them together, using Celery chains, groups and chords to create the desired workflow. Celery then does the rest by asynchronously executing the workflow.
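
The stop-on-failure behaviour of a chain can be sketched in plain Python, without Celery. The step names below loosely follow the list above, and run_chain is a hypothetical helper written for this illustration only:

```python
def run_chain(tasks):
    """Run (name, function) pairs in order and stop at the first one that raises."""
    completed = []
    for name, func in tasks:
        try:
            func()
        except Exception as exc:
            print(f"{name} failed ({exc}); stopping the workflow")
            break
        completed.append(name)
    return completed


def ok():
    return True


def unhealthy():
    raise RuntimeError("cluster not healthy")


steps = [
    ("change_weight", ok),
    ("set_downtime", ok),
    ("wait_until_healthy", unhealthy),
    ("remove_node", ok),  # never reached, because the previous step fails
]
completed = run_chain(steps)
print(completed)
```

Stopping early is what makes such a workflow safe here: a node is never removed from the cluster unless every step before it has succeeded.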

Big workflows and Celery

As our infrastructure grows, so do our automated workflows, with more tasks per workflow and higher complexity. What do we mean by a big workflow? A workflow consisting of 1,000-10,000 tasks. To visualise it, take a look at the following examples:

A few chords chained together (57 tasks in total)

workflow = chain([
    noop.s(0),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    noop.s()
])

More complex graph structure built from chains and groups (23 tasks in total)

# | is ‘chain’ operator in celery
workflow = (
    group(
        group(
            group([noop.s() for i in range(5)]),
            chain([noop.s() for i in range(5)])
        ) |
        noop.s() |
        group([noop.s() for i in range(5)]) |
        noop.s(),
        chain([noop.s() for i in range(5)])
    ) |
    noop.s()
)

As you can probably imagine, visualisations get quite big and messy when 1,000 tasks are involved! Celery is a powerful tool, and has lots of features that are well-suited for automation, but it still struggles when it comes to processing big, complex, long-running workflows. Orchestrating the execution of 10,000 tasks, with a variety of dependencies, is no trivial thing. There are several issues we encountered when our automation grew too big:

Memory issues during workflow building (client side)
Serialisation issues (client -> Celery backend transfer)
Nondeterministic, broken execution of workflows
Memory issues in Celery workers (Celery backend)
Disappearing tasks
And more…

Take a look at some GitHub tickets:

Using Celery for our particular use case became difficult and unreliable. Celery’s native support for workflows doesn’t seem to be the right choice for handling 100/1,000/10,000 tasks. In its current state, it’s just not enough. So here we stand, in front of a solid, concrete wall… Either we somehow fix Celery, or we rewrite our automation using a different framework.

Celery – to fix… or to fix?

Rewriting all of our automation would be possible, although relatively painful. Since I’m a rather lazy person, perhaps attempting to fix Celery wasn’t an entirely bad idea? So I took some time to dig through Celery’s code, and managed to find the parts responsible for building workflows, and executing chains and chords. It was still a little bit difficult for me to understand all the different code paths handling the wide range of use cases, but I realised it would be possible to implement a clean, straightforward orchestration that would handle all the tasks and their combinations in the same way. What’s more, I had a glimpse that it wouldn’t take too much effort to integrate it into our automation (let’s not forget the main goal!).

Unfortunately, introducing new orchestration into the Celery project would probably be quite hard, and would most likely break some backwards compatibility. So I decided to take a different approach – writing an extension or a plugin that wouldn’t require changes in Celery. Something pluggable, and as non-invasive as possible. That’s how Celery Dyrygent emerged…

Celery Dyrygent

https://github.com/ovh/celery-dyrygent

How to represent a workflow

You can think of a workflow as a directed acyclic graph (DAG), where each task is a separate graph node. When it comes to acyclic graphs, it is relatively easy to store and resolve dependencies between nodes, which leads to straightforward orchestration. Celery Dyrygent was implemented based on these features. Each task in the workflow has a unique identifier (Celery already assigns task IDs when a task is pushed for execution) and each one of them is wrapped into a workflow node. Each workflow node consists of a task signature (a plain Celery signature) and a list of IDs for the tasks it depends on. See the example below:
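
As a rough sketch of this representation (these are not Celery Dyrygent's actual classes), a node can be modelled as an ID, a signature and a list of dependency IDs. WorkflowNode and ready_nodes are hypothetical names, and the signature is a plain string standing in for a Celery signature:

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowNode:
    task_id: str
    signature: str  # stands in for a Celery signature
    dependencies: list = field(default_factory=list)


def ready_nodes(nodes, finished):
    """Return IDs of unfinished nodes whose dependencies have all finished."""
    return [
        node.task_id
        for node in nodes
        if node.task_id not in finished
        and all(dep in finished for dep in node.dependencies)
    ]


# A small chord: tasks "a" and "b" can run in parallel, then "c" runs.
nodes = [
    WorkflowNode("a", "noop(0)"),
    WorkflowNode("b", "noop(1)"),
    WorkflowNode("c", "noop()", dependencies=["a", "b"]),
]
print(ready_nodes(nodes, finished=set()))
print(ready_nodes(nodes, finished={"a", "b"}))
```

Because the graph is acyclic, resolving what can run next only requires checking each node's dependency list against the set of finished tasks.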

How to process a workflow

So we know how to store a workflow in a clean and easy way. Now we just need to execute it. How about using… Celery? Why not? For this, Celery Dyrygent introduces a workflow processor task (an ordinary Celery task). This task wraps a whole workflow and schedules an execution of primitive tasks, according to their dependencies. Once the scheduling part is over, the task repeats itself (it ‘ticks’ with some delay).

Throughout the whole processing cycle, workflow processor retains the state of the entire workflow internally. As a result, it updates the state with each repetition. You can see an orchestration example below:

Most notably, workflow processor stops its execution in two cases:

Once the whole workflow finishes, with all tasks successfully completed
When it can’t proceed any further, due to a failed task
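
The processing cycle can be approximated with a few lines of plain Python. This is a simplified sketch under stated assumptions, not the library's implementation: tasks run synchronously inside the tick instead of being scheduled on workers, the workflow is assumed to be acyclic, and the names tick, dependencies and states are hypothetical.

```python
def tick(dependencies, states, run_task):
    """One scheduling pass. Returns True while the workflow should keep ticking."""
    if any(state == "FAILURE" for state in states.values()):
        return False  # cannot proceed any further
    if all(state == "SUCCESS" for state in states.values()):
        return False  # the whole workflow has finished
    # Pick the pending tasks whose dependencies have all succeeded.
    ready = [
        task_id
        for task_id, deps in dependencies.items()
        if states[task_id] == "PENDING"
        and all(states[dep] == "SUCCESS" for dep in deps)
    ]
    for task_id in ready:
        states[task_id] = run_task(task_id)
    return True


# "c" depends on "a" and "b"; every task succeeds in this run.
dependencies = {"a": [], "b": [], "c": ["a", "b"]}
states = {task_id: "PENDING" for task_id in dependencies}

ticks = 0
while tick(dependencies, states, run_task=lambda task_id: "SUCCESS"):
    ticks += 1
print(ticks, states)
```

The state dictionary plays the role of the workflow state that the processor carries from one repetition to the next.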

How to integrate

So how do we use this? Fortunately, I was able to find a way to use Celery Dyrygent quite easily. First of all, you need to inject the workflow processor task definition into your Celery application:

from celery import Celery
from celery_dyrygent.tasks import register_workflow_processor

app = Celery()  # your Celery application instance
workflow_processor = register_workflow_processor(app)

Next, you need to convert your Celery defined workflow into a Celery Dyrygent workflow:

from celery import chain, chord
from celery_dyrygent.workflows import Workflow

celery_workflow = chain([
    noop.s(0),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    noop.s()
])

workflow = Workflow()
workflow.add_celery_canvas(celery_workflow)

Finally, simply execute the workflow, just as you would an ordinary Celery task:

workflow.apply_async()

That’s it! You can always go back if you wish, as the small changes are very easy to undo.

Give it a try!

Celery Dyrygent is free to use, and its source code is available on GitHub (https://github.com/ovh/celery-dyrygent). Feel free to use it, improve it, request features, and report any bugs! It has a few additional features not described here, so I’d encourage you to take a look at the project’s readme file. For our automation requirements, it’s already a solid, battle-tested solution. We’ve been using it since the end of 2018, and it has processed thousands of workflows, consisting of hundreds of thousands of tasks. Here are some production stats, from June 2019 to February 2020:

936,248 elementary tasks executed
11,170 workflows processed
4,098 tasks in the biggest workflow so far
~84 tasks per workflow, on average

Automation is always a good idea!

Introducing Director – a tool to build your Celery workflows

Nicolas Crocfer — Wed, 26 Feb 2020 12:38:57 +0000

As developers, we often need to execute tasks in the background. Fortunately, some tools already exist for this. In the Python ecosystem, for instance, the most well-known library is Celery. If you have already used it, you know how great it is! But you will also have probably discovered how complicated it can be to follow the state of a complex workflow.

Celery Director is a tool we created at OVHcloud to fix this problem. The code is now open-sourced and is available on Github.

Following the talk we did during FOSDEM 2020, this post aims to present the tool. We’ll take a close look at what Celery is, why we created Director, and how to use it.

What is Celery?

Here is the official description of Celery:

Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.

The important words here are “task queue”. This is a mechanism used to distribute work across a pool of machines or threads.

The queue, in the middle of the above diagram, stores messages sent by the producers (APIs, for instance). On the other side, consumers are constantly reading the queue to pick up new messages and execute tasks.

In Celery, a message sent by the producer is the signature of a Python function: send_email("john.doe"), for example.

The queue (named broker in Celery) stores this signature until a worker reads it and actually executes the function with the given parameter.
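
The producer/queue/consumer mechanism can be imitated in a single process with the standard library. This is only an analogy, not how Celery is implemented: an in-memory queue stands in for the broker, a thread stands in for the worker, and send_email is a placeholder function.

```python
import queue
import threading

broker = queue.Queue()  # stands in for the broker
results = []


def send_email(recipient):
    return f"email sent to {recipient}"


def worker():
    """Consumer: read messages from the queue and execute the matching function."""
    while True:
        message = broker.get()
        if message is None:  # sentinel used to stop the worker
            break
        func, args = message
        results.append(func(*args))


consumer = threading.Thread(target=worker)
consumer.start()

# Producer: enqueue the function and its arguments instead of calling it.
broker.put((send_email, ("john.doe",)))
broker.put(None)
consumer.join()
print(results)
```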

But why execute a Python function somewhere else? The main reason is to quickly return a response in cases of long-running functions. Indeed, it’s not an option to keep users waiting for a response for several seconds or minutes.

We can also imagine producers without enough resources: a CPU-bound task could then be handed over to a more robust worker for execution.

How to use Celery

So Celery is a library used to execute Python code somewhere else, but how does it do that? In fact, it’s really simple! To illustrate this, we’ll use some of the available methods to send tasks to the broker, then we’ll start a worker to consume them.

Here is the code to create a Celery task:

# tasks.py
from celery import Celery

app = Celery("tasks", broker="redis://127.0.0.1:6379/0")

@app.task
def add(x, y):
    return x + y

As you can see, a Celery task is just a Python function transformed to be sent to a broker. Note that we passed the Redis connection to the Celery application (named app) to tell it which broker will store the messages.

This means it’s now possible to send a task to the broker:

>>> from tasks import add
>>> add.delay(2, 3)

That’s all! We used the .delay() method, so our producer didn’t execute the Python code but instead sent the task signature to the broker.

Now it’s time to consume it with a Celery worker:

$ celery worker -A tasks --loglevel=INFO
[...]
[2020-02-14 17:13:38,947: INFO/MainProcess] Received task: tasks.add[0e9b6ff2-7aec-46c3-b810-b62a32188000]
[2020-02-14 17:13:38,954: INFO/ForkPoolWorker-2] Task tasks.add[0e9b6ff2-7aec-46c3-b810-b62a32188000] succeeded in 0.0024250600254163146s: 5

It’s even possible to combine the Celery tasks with some primitives (the full list is here):

Chain: will execute tasks one after the other.
Group: will execute tasks in parallel by routing them to multiple workers.

For example, the following code will make two additions in parallel, then sum the results:

from celery import chain, group

# Create the canvas
canvas = chain(
    group(
        add.si(1, 2),
        add.si(3, 4)
    ),
    sum_numbers.s()
)

# Execute it
canvas.delay()

You probably noted that we didn’t call the .delay() method on the individual tasks here. Instead we created a canvas, used to combine a selection of tasks, and called .delay() on the canvas itself.

The .si() method is used to create an immutable signature (i.e. one that does not receive data from a previous task), while .s() relies on the data returned by the two previous tasks.
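
The difference between the two kinds of signature can be illustrated with a plain-Python analogy (this is not Celery code). The helpers immutable and mutable are hypothetical, and sum_numbers, which is not defined in the snippet above, is assumed here to sum a list of numbers:

```python
def add(x, y):
    return x + y


def sum_numbers(numbers):
    return sum(numbers)


def immutable(func, *args):
    """Like .si(): ignore whatever the previous step returned."""
    return lambda _previous=None: func(*args)


def mutable(func, *args):
    """Like .s(): the previous step's result becomes the first argument."""
    return lambda previous: func(previous, *args)


# The "group": two independent additions that take no input from a previous step.
group_results = [step() for step in (immutable(add, 1, 2), immutable(add, 3, 4))]

# The final step receives the list of results produced by the group.
print(mutable(sum_numbers)(group_results))  # (1 + 2) + (3 + 4) = 10
```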

This introduction to Celery has just covered its very basic usage. If you’re keen to find out more, I invite you to read the documentation, where you’ll discover all the powerful features, including rate limits, task retries, or even periodic tasks.

As a developer, I want…

I’m part of a team whose goal is to deploy and monitor internal infrastructures. As part of this, we needed to launch some background tasks, and as Python developers our natural choice was to use Celery. But, out of the box, Celery didn’t support certain specific requirements for our projects:

Tracking the tasks’ evolution and their dependencies in a WebUI.
Executing the workflows using API calls, or simply with a CLI.
Combining tasks to create workflows in YAML format.
Periodically executing a whole workflow.

Some other cool tools exist for this, like Flower, but this only allows us to track each task individually, not a whole workflow and its component tasks.

And as we really needed these features, we decided to create Celery Director.

How to use Director

The installation can be done using the pip command:

$ pip install celery-director

Director provides a simple command to create a new workspace folder:

$ director init workflows
[*] Project created in /home/ncrocfer/workflows
[*] Do not forget to initialize the database
You can now export the DIRECTOR_HOME environment variable

A new tasks folder and a workflow example have been created for you:

$ tree -a workflows/
├── .env
├── tasks
│   └── etl.py
└── workflows.yml

The tasks/*.py files will contain your Celery tasks, while the workflows.yml file will combine them:

$ cat workflows.yml
---
ovh.SIMPLE_ETL:
  tasks:
    - EXTRACT
    - TRANSFORM
    - LOAD

This example, named ovh.SIMPLE_ETL, will execute three tasks, one after the other. You can find more examples in the documentation.

After exporting the DIRECTOR_HOME variable and initialising the database with director db upgrade, you can execute this workflow:

$ director workflow list
+----------------+----------+-----------+
| Workflows (1)  | Periodic | Tasks     |
+----------------+----------+-----------+
| ovh.SIMPLE_ETL |    --    | EXTRACT   |
|                |          | TRANSFORM |
|                |          | LOAD      |
+----------------+----------+-----------+
$ director workflow run ovh.SIMPLE_ETL

The broker has received the tasks, so now you can launch the Celery worker to execute them:

$ director celery worker --loglevel=INFO

And then display the results using the webserver command (director webserver):

This is just the beginning, as Director provides other features, allowing you to parametrise a workflow or periodically execute it, for example. You will find more details on these features in the documentation.

Conclusion

Our teams use Director regularly to launch our workflows. No more boilerplate, and no more need for advanced Celery knowledge… A new colleague can easily create their tasks in Python and combine them in YAML, without using the Celery primitives discussed earlier.

Sometimes we need to execute a workflow periodically (to populate a cache, for instance), and sometimes we need to manually call it from another web service (note that a workflow can also be executed through an API call). This is now possible using our single Director instance.

We invite you to try Director for yourself, and give us your feedback via Github, so we can continue to enhance it. The source code can be found in Github, and the 2020 FOSDEM presentation is available here.