Simplify your research experiments with Kubernetes

Abstract

As a researcher I need to conduct experiments to validate my hypotheses. When the field of Computer Science is involved, it is well known that practitioners tend to drive experiments on different environments (at the hardware level: x86/arm/…, CPU frequency, available memory, or at the software level: operating system, versions of libraries). The problem with these different environments is the difficulty of accurately reproducing an experiment as it has been presented in a research article.

In this post we provide a way of conducting experiments that can be reproduced by using Kubernetes-as-a-service, a managed platform to perform distributed computations along with other tools (Argo, MinIO) that take the advantages of the platform into consideration.

Simplify your research experiments with Kubernetes

The article is organised as follow, we first recall the context and the problem faced by a researcher who needs to conduct experiments. Then we explain how to solve the problem with Kubernetes and why we did not choose other solutions (e.g., HPC software). Finally, we give some tips on improving setup.

Introduction

When I started my PhD, I read a bunch of articles related to the field I’m working on, i.e. AutoML. From this research, I realised how important it is to conduct experiments well in order to make them credible and verifiable. I started asking my colleagues how they carried out their experiments, and there was a common pattern: develop your solution, look at other(s) solution(s) that are related to the same problem, run each solution 30 times if it is stochastic with equivalent resources and compare your results to the other(s) solution(s) with statistical tests: Wilcoxon-Mann-Whitney when comparing two algorithms, or else Friedman test. As it is not the main topic of this article, I will not discuss statistical tests in detail.

As an experienced DevOps, I had one question about automation: How do I find out how to reproduce an experiment, especially of another solution? Guess the answer? Meticulously read a paper, or find a repository with all the information.

Either you are lucky and a source code is available, or else a pseudo-code is provided in the publication. In this case you need to re-implement the solution to be able to test it and compare it. Even if you are lucky and there is a source code available, often the whole environment is missing (e.g., exact version of the packages, python version itself, JDK version, etc…). Not having the right information impacts performance and may potentially bias experiments. For example, new versions of packages, languages, and so on, usually have better optimisations that your implementation can use. Sometimes it is hard to find the versions that have been used by practitioners.

The other problem, is the complexity of setting up a cluster with HPC software (e.g., Slurm, Torque). Indeed, it requires technical knowledge to manage such a solution: configuration of the network, verifying that each node has the dependencies required by the runs installed, checking that nodes have the same versions of libraries, etc… These technical steps consume time for researchers, thus take them away from their main job. Moreover, to extract the results, researchers usually do it manually, they retrieve the different files (through FTP or NFS), and then perform statistical tests that they save by hand. Consequently, the workflow to perform an experiment is relatively costly and precarious.

In my point of view, it raise one big problem: that an experiment can not really be reproduced in the field of Computer Science.

Solution

OVH offers Kubernetes-as-a-service, a managed cluster platform where you do not have to worry about how to configure the cluster (add node, configure network, and so on), so I started to investigate how I could perform experiments similarly to the HPC solutions. Argo Workflows, came out of the box. This tool allows you to define a workflow of steps that you can perform on your Kubernetes cluster within each step is confined in a container, loosely called image. A container allows you to run a program under a specific environment software (language version, libraries, third-parties), additionally to limiting the resources (CPU time, memory) used by the program.

The solution is linked to our big problem: make sure you can reproduce an experiment that is equivalent to run a workflow composed of steps under a specific environment.

Simplify your research experiments with Kubernetes: architecture

Use case: Evaluate an AutoML solution

The use case that we use in our research will be related to measuring the convergence of a Bayesian Optimization (SMAC) on the problem of the AutoML

For this use case, we stated the Argo workflow in the following yaml file

Set up the infrastructure

First we will to setup a Kubernetes cluster, secondly we will install the services on our cluster and lastly we will run an experiment.

Kubernetes cluster

Installing a Kubernetes cluster with OVH is child’s play. Connect to the OVH Control Panel, go to Public Cloud > Managed Kubernetes Service, then Create a Kubernetes cluster and follow the steps depending on your needs.

Once the cluster is created:

Take into consideration the change upgrade policy. If you are a researcher, and your experiment takes some time to run, you want to avoid an update that would shutdown your infrastructure with your runs. To avoid this situation, it is better to choose “Minimum unavailability” or “Do not update”.
Download the kubeconfig file, it will serve later with kubectl to connect on our cluster.
Add at least one node on your cluster.

Once installed, you will need kubectl, a tool that allows you to manage your cluster.

If everything has been properly set up, you should get something like this:

kubectl top nodes
NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node01   64m          3%     594Mi           11%

Installation of Argo

As we mentioned before, Argo allows us to run a workflow composed of steps. To install the client and the service on the cluster, we were inspired by this tutorial.

First we download and install Argo (client):

curl -sSL -o /usr/local/bin/argo https://github.com/argoproj/argo/releases/download/v2.3.0/argo-linux-amd64
chmod +x /usr/local/bin/argo

Then the controller and UI on our cluster:

kubectl create ns argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/v2.3.0/manifests/install.yaml

Configure the service account:

kubectl create rolebinding default-admin --clusterrole=admin --serviceaccount=default:default

Then, with the client try a simple hello-world workflow to confirm the stack is working (Status: Succeeded):

argo submit --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml
Name:                hello-world-2lx9d
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Tue Aug 13 16:51:32 +0200 (24 seconds ago)
Started:             Tue Aug 13 16:51:32 +0200 (24 seconds ago)
Finished:            Tue Aug 13 16:51:56 +0200 (now)
Duration:            24 seconds

STEP                  PODNAME            DURATION  MESSAGE
 ✔ hello-world-2lx9d  hello-world-2lx9d  23s

You can also access the UI dashboard through http://localhost:8001:

kubectl port-forward -n argo service/argo-ui 8001:80

Configure an Artifact repository (MinIO)

Artifact is a term used by Argo, it represents an archive containing files returned by a step. In our case we will use this feature to return final results, and to share intermediate results between steps.

In order to get Artifact working, we need an object storage. If you already have one you can pass the installation part but still need to configure it.

As specified in the tutorial, we used MinIO, here is the manifest to install it (minio-argo-artifact.install.yml):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # This name uniquely identifies the PVC. Will be used in deployment below.
  name: minio-pv-claim
  labels:
    app: minio-storage-claim
spec:
  # Read more about access modes here: https://kubernetes.io/docs/user-guide/persistent-volumes/#access-modes
  accessModes:
    - ReadWriteOnce
  resources:
    # This is the request for storage. Should be available in the cluster.
    requests:
      storage: 10
  # Uncomment and add storageClass specific to your requirements below. Read more https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class-1
  #storageClassName:
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  # This name uniquely identifies the Deployment
  name: minio-deployment
spec:
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        # Label is used as selector in the service.
        app: minio
    spec:
      # Refer to the PVC created earlier
      volumes:
      - name: storage
        persistentVolumeClaim:
          # Name of the PVC created earlier
          claimName: minio-pv-claim
      containers:
      - name: minio
        # Pulls the default MinIO image from Docker Hub
        image: minio/minio
        args:
        - server
        - /storage
        env:
        # MinIO access key and secret key
        - name: MINIO_ACCESS_KEY
          value: "TemporaryAccessKey"
        - name: MINIO_SECRET_KEY
          value: "TemporarySecretKey"
        ports:
        - containerPort: 9000
        # Mount the volume into the pod
        volumeMounts:
        - name: storage # must match the volume name, above
          mountPath: "/storage"
---
apiVersion: v1
kind: Service
metadata:
  name: minio-service
spec:
  ports:
    - port: 9000
      targetPort: 9000
      protocol: TCP
  selector:
    app: minio

Note: Please edit the following key/values:

spec > resources > requests > storage > 10 correspond to 10 GB storage requested by MinIO to the cluster
TemporaryAccessKey
TemporarySecretKey

kubectl create ns minio
kubectl apply -n minio -f minio-argo-artifact.install.yml

Note: alternatively, you can install MinIO with Helm.

Now we need to configure Argo in order to use our object storage MinIO:

kubectl edit cm -n argo workflow-controller-configmap
...
data:
  config: |
    artifactRepository:
      s3:
        bucket: my-bucket
        endpoint: minio-service.minio:9000
        insecure: true
        # accessKeySecret and secretKeySecret are secret selectors.
        # It references the k8s secret named 'argo-artifacts'
        # which was created during the minio helm install. The keys,
        # 'accesskey' and 'secretkey', inside that secret are where the
        # actual minio credentials are stored.
        accessKeySecret:
          name: argo-artifacts
          key: accesskey
        secretKeySecret:
          name: argo-artifacts
          key: secretkey

Add credentials:

kubectl create secret generic argo-artifacts --from-literal=accesskey="TemporaryAccessKey" --from-literal=secretkey="TemporarySecretKey"

Note: Use the correct credentials you specified above

Create the bucket my-bucket with the rights Read and write by connecting to the interface http://localhost:9000:

kubectl port-forward -n minio service/minio-service 9000

Check that Argo is able to use Artifact with the object storage:

argo submit --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/artifact-passing.yaml
Name:                artifact-passing-qzgxj
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Wed Aug 14 15:36:03 +0200 (13 seconds ago)
Started:             Wed Aug 14 15:36:03 +0200 (13 seconds ago)
Finished:            Wed Aug 14 15:36:16 +0200 (now)
Duration:            13 seconds

STEP                       PODNAME                            DURATION  MESSAGE
 ✔ artifact-passing-qzgxj
 ├---✔ generate-artifact   artifact-passing-qzgxj-4183565942  5s
 └---✔ consume-artifact    artifact-passing-qzgxj-3706021078  7s

Note: In case you are stuck with a message ContainerCreating, there is a lot of chance that Argo is not able to access MinIO, e.g., bad credentials.

Install a private registry

Now that we have a way to run a workflow, we want each step to represent a specific software environment (i.e., an image). We defined this environment in a Dockerfile.

Because each step can run on different nodes in our cluster, the image needs to be stored somewhere, in the case of Docker we require a private registry.

You can get a private registry in different ways:

Docker Hub
Gitlab.com
OVH – tutorial
Harbor: allows you to have your own registry on your Kubernetes cluster

In our case we used OVH private registry.

# First we clone the repository
git clone git@gitlab.com:automl/automl-smac-vanilla.git
cd automl-smac-vanilla

# We build the image locally
docker build -t asv-environment:latest .

# We push the image to our private registry
docker login REGISTRY_SERVER -u REGISTRY_USERNAME
docker tag asv-environment:latest REGISTRY_IMAGE_PATH:latest
docker push REGISTRY_IMAGE_PATH:latest

Allow our cluster to pull images from the registry:

kubectl create secret docker-registry docker-credentials --docker-server=REGISTRY_SERVER --docker-username=REGISTRY_USERNAME --docker-password=REGISTRY_PWD

Try our experiment on the infrastructure

git clone git@gitlab.com:automl/automl-smac-vanilla.git
cd automl-smac-vanilla

argo submit --watch misc/workflow-argo -p image=REGISTRY_IMAGE_PATH:latest -p git_ref=master -p dataset=iris
Name:                automl-benchmark-xlbbg
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Tue Aug 20 12:25:40 +0000 (13 minutes ago)
Started:             Tue Aug 20 12:25:40 +0000 (13 minutes ago)
Finished:            Tue Aug 20 12:39:29 +0000 (now)
Duration:            13 minutes 49 seconds
Parameters:
  image:             m1uuklj3.gra5.container-registry.ovh.net/automl/asv-environment:latest
  dataset:           iris
  git_ref:           master
  cutoff_time:       300
  number_of_evaluations: 100
  train_size_ratio:  0.75
  number_of_candidates_per_group: 10

STEP                       PODNAME                            DURATION  MESSAGE
 ✔ automl-benchmark-xlbbg
 ├---✔ pre-run             automl-benchmark-xlbbg-692822110   2m
 ├-·-✔ run(0:42)           automl-benchmark-xlbbg-1485809288  11m
 | └-✔ run(1:24)           automl-benchmark-xlbbg-2740257143  9m
 ├---✔ merge               automl-benchmark-xlbbg-232293281   9s
 └---✔ plot                automl-benchmark-xlbbg-1373531915  10s

Note:

Here we only have 2 parallel runs, you can have much more by adding them to the list withItems. In our case, the list correspond to the seeds.
run(1:24) correspond to the run 1 with the seed 24
We limit the resources per run by using requests and limits, see also Managing Compute Resources.

Then we just retrieve the results through the MinIO web user interface http://localhost:9000 (you can also do that with the client).

The results are located in a directory with the same name as the argo workflow name, in our example it is my-bucket > automl-benchmark-xlbbg.

Limitation to our solution

The solution is not able to run the parallel steps on multiple nodes. This limitation is due to the way we are merging our results from the parallel steps to the merge step. We are using volumeClaimTemplates, i.e., we are mounting a volume, and this can’t be done between different nodes. The problem can be solved by two manners:

Using parallel artifacts and aggregate them, however it is an ongoing issue with Argo
Directly implement in the code of your run, a way to store the result on an accessible storage (MinIO SDK for example)

The first manner is preferred, it means you don’t have to change and customize the code for a specific storage file system.

Hints to improve the solution

In case you are interested in going further with your setup, you should take a look on the following topics:

Controlling the access: in order to confine the users in different spaces (for security reasons, or to control the resources).
Exploring Argo selector and Kubernetes selector: in case you have a cluster composed of nodes that have different hardware and that you require an experiment using a specific hardware (e.g., specific cpu, gpu).
Configure a distributed MinIO: it ensures that your data are replicated on multiple nodes and stay available in case of a node fail.
Monitoring your cluster.

Conclusion

Without needing in-depth technical knowledge, we have shown that we can easily set up a complex cluster to perform research experiments and make sure they can be reproduced.