<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Docker Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/docker/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/docker/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Tue, 05 Sep 2023 09:22:00 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Docker Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/docker/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Understanding Image Generation: A Beginner&#8217;s Guide to Generative Adversarial Networks</title>
		<link>https://blog.ovhcloud.com/understanding-image-generation-beginner-guide-generative-adversarial-networks-gan/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Tue, 05 Sep 2023 09:21:57 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[Streamlit]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=25664</guid>

					<description><![CDATA[How do you train a generative adversarial network (GAN) to generate images? How do you train a DCGAN? How do GANs and DCGANs work?]]></description>
										<content:encoded><![CDATA[
<p><em>All the code related to this article is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/computer-vision/image-generation/miniconda/dcgan-image-generation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>. You can reproduce all the experiments with</em> <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebooks</a>.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img fetchpriority="high" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/dcgan_evolution.gif" alt="" class="wp-image-25680" style="width:549px;height:549px" width="549" height="549"/><figcaption class="wp-element-caption"><em>Fake samples generated by the model during training</em></figcaption></figure>



<p>Have you ever been amazed by what generative artificial intelligence could do, and wondered how it can generate realistic images 🤯🎨?</p>



<p>In this tutorial, we will embark on an exciting journey into the world of <strong>Generative Adversarial Networks (GANs)</strong>, a revolutionary concept in generative AI. No prior experience is necessary to follow along. We will walk you through every step, starting with the basic concepts and gradually building up to the implementation of <strong>Deep Convolutional GANs (DCGANs)</strong>.</p>



<p><em><strong>By the end of this tutorial, you will be able to generate your own images!</strong></em></p>



<h3 class="wp-block-heading">Introduction</h3>



<p>GANs were introduced by Ian Goodfellow et al. in 2014 in the paper <a href="https://arxiv.org/abs/1406.2661" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Generative Adversarial Nets</em></a>. They have become very popular in recent years, allowing us, for example, to:</p>



<ul class="wp-block-list">
<li>Generate high-resolution images (avatars, objects and scenes)</li>



<li>Augment our data (generating synthetic samples to enlarge limited datasets)</li>



<li>Enhance the resolution of low-resolution images (upscaling images)</li>



<li>Transfer the style of one image to another (e.g. black and white to color)</li>



<li>Predict facial appearances at different ages (Face Aging)</li>
</ul>



<h4 class="wp-block-heading">What is a GAN and how does it work?</h4>



<p>A GAN is composed of two main components: a <strong>generator <em>G</em></strong> and a <strong>discriminator <em>D</em></strong>.</p>



<p>Each component is a neural network, but their roles are different:</p>



<ul class="wp-block-list">
<li>The purpose of the generator <em>G</em> is to <strong>reproduce the data distribution of the training data <em>𝑥</em></strong> in order to <strong>generate synthetic samples</strong> from that same distribution. These data are often images, but can also be audio or text.</li>
</ul>



<ul class="wp-block-list">
<li>On the other hand, the discriminator <em>D</em> is a kind of judge who will <strong>estimate whether a sample <strong><em>𝑥</em></strong> is real or fake</strong> (has been generated). It is in fact a <strong>classifier</strong> that will say if a sample comes from the real data distribution or the generator.</li>
</ul>



<figure class="wp-block-image size-large is-resized"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/GAN_Illustration-1024x311.png" alt="" class="wp-image-25667" style="width:1201px;height:365px" width="1201" height="365" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/GAN_Illustration-1024x311.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/GAN_Illustration-300x91.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/GAN_Illustration-768x233.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/GAN_Illustration-1536x467.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/GAN_Illustration.png 1882w" sizes="(max-width: 1201px) 100vw, 1201px" /></figure>



<p class="has-text-align-center"><em>Illustration of GAN training</em></p>



<p>During training, the generator starts with a <strong>vector of random noise</strong> (z) as input and produces synthetic samples G(z).</p>



<p>As training progresses, it refines its output, making the generated data G(z) more and more similar to the real data. The goal of the generator is to <strong>outsmart</strong> the discriminator into classifying its generated samples as real.</p>



<p>Meanwhile, the discriminator is presented with both real samples from the training data and fake samples from the generator. As it learns to discriminate between the two, it <strong>provides feedback</strong> to the generator about the quality of its generated samples. This is why the term <em>&#8220;<strong>adversarial</strong>&#8221;</em> is used here.</p>



<h4 class="wp-block-heading">Mathematical approach</h4>



<p>In fact, GANs come from game theory, where <em>D</em> and <em>G</em> are playing a two-player <em>minimax</em> game with the following value function:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/eq-1024x59.png" alt="value-objective-function" class="wp-image-25668" style="width:802px;height:46px" width="802" height="46" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/eq-1024x59.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/eq-300x17.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/eq-768x45.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/eq.png 1447w" sizes="(max-width: 802px) 100vw, 802px" /></figure>



<p></p>



<p>As we can observe, the <strong>discriminator aims to maximize the V function</strong>. To do this, it must maximize each of the two parts of the equation that are added together. This means maximizing <em>log(D(x))</em>, and therefore <em>D(x)</em>, in the first part (probability of real data), and minimizing <em>D(G(z))</em> in the second part (probability of fake data).</p>



<p>Simultaneously, the <strong>generator tries to minimize the function</strong>. It only comes into play in the second part of the function, where it tries to obtain the highest value of <em>D(G(z))</em> in order to fool the discriminator.</p>



<p>This constant confrontation between the generator and the discriminator creates an <strong>iterative learning process</strong>, where the generator gradually improves to produce increasingly realistic G(z) samples, and the discriminator becomes increasingly accurate in its distinction of the data presented to it.</p>



<p>In an <strong>ideal scenario</strong>, this iterative process would reach an <strong>equilibrium point</strong>, where the generator produces data that is indistinguishable from real data, and the discriminator&#8217;s performance is 50% (random guessing).</p>



<p>GANs may not always reach this equilibrium because the training process is sensitive to many factors (architecture, hyperparameters, dataset complexity). The generator and discriminator may reach a dead end, oscillating between solutions or facing <strong>mode collapse</strong>, resulting in limited sample diversity. It is also important that the discriminator does not start off too strong, otherwise the generator will not get any information on how to improve itself, since it does not know what the real data looks like, as shown in the illustration above.</p>



<h4 class="wp-block-heading">DCGAN (Deep Convolutional GANs)</h4>



<p><strong>DCGAN</strong> was introduced in 2016 by Alec Radford et al. in the paper<em> <a href="https://arxiv.org/abs/1511.06434" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks</a>.</em></p>



<p>Its new convolutional architecture has considerably improved the quality and stability of image synthesis compared to classical GANs. Here are the major changes:</p>



<ul class="wp-block-list">
<li>Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator), making them exceptionally well-suited for image generation tasks.</li>



<li>Use batchnorm in both the generator and the discriminator.</li>



<li>Remove fully connected hidden layers for deeper architectures.</li>



<li>Use ReLU activation in the generator for all layers except for the output, which uses tanh.</li>



<li>Use LeakyReLU activation in the discriminator for all layers.</li>
</ul>



<p><em>The operation principles</em> <em>of these layers will not be explained in this tutorial.</em></p>



<h3 class="wp-block-heading">Use case &amp; Objective</h3>



<p>Now that we know the concept of image generation, let’s try to put it into practice!</p>



<p>In this tutorial, we will <strong>implement a</strong> <strong>DCGAN</strong> architecture and <strong>train it on a medical dataset</strong> to generate new images. This dataset is the <a href="https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Chest X-Ray Pneumonia</em></a>. All the code explained here will run on <strong>a single GPU</strong>, linked to <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebooks</a>, and is given in our <em><a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/computer-vision/image-generation/miniconda/dcgan-image-generation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.</em></p>



<h3 class="wp-block-heading">1 &#8211; Explore dataset and prepare it for training</h3>



<p><em>The </em><a href="https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Chest X-Ray Pneumonia</em> dataset</a> contains <strong>5,863 X-Ray images</strong>. This may not be sufficient for training a robust DCGAN, but we are going to try! Indeed, the DCGAN research paper conducted its study on a dataset of over 60,000 images.</p>



<p>Additionally, it is important to consider that the dataset contains two classes (Pneumonia/Normal). While we will not separate the classes, in order to keep as much data as possible, doing so could improve our network&#8217;s performance. Furthermore, it is advisable to verify that the classes are well balanced.</p>



<p>Only the training subset will be used here (5,221 images). Let&#8217;s take a look at our images:</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/training_images.png" alt="chest-x-ray-pneumonia-dataset-images" class="wp-image-25669" style="width:366px;height:366px" width="366" height="366" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/training_images.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/training_images-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/training_images-70x70.png 70w" sizes="auto, (max-width: 366px) 100vw, 366px" /><figcaption class="wp-element-caption"><em>Chest X-Ray Pneumonia dataset real samples</em></figcaption></figure>



<p>We notice that we have quite similar images. The <strong>backgrounds are identical</strong>, and the chests are <strong>often centered in the same way</strong>, which should help the network learn.</p>



<h4 class="wp-block-heading">Preprocessing</h4>



<p><strong>Data pre-processing</strong> is a crucial step when you want to facilitate and accelerate model convergence and obtain high-quality results. This pre-processing can be broken down into various generic operations that are commonly applied.</p>



<p>Each image in the dataset will be <strong>transformed</strong>. The images are then assembled in packets of 128, which we call <strong>batches</strong>. This avoids loading the whole dataset at once, which could use up a lot of memory, and makes the most of <strong>GPU parallelism</strong>.</p>



<p>The applied <strong>transformation</strong> will:</p>



<ul class="wp-block-list">
<li><strong>Resize</strong> <strong>images</strong> to (64x64xchannels), the dimensions expected by our DCGAN. This avoids keeping the original dimensions of the images, which are all different. It also reduces the number of pixels, which accelerates model training (computation cost).</li>



<li><strong>Convert images to tensors</strong> (format expected by models).</li>



<li><strong>Standardize &amp; Normalize the image&#8217;s pixel values</strong>, which improves training performance in AI.</li>
</ul>



<p><em>If original images are smaller than the desired size, transformation will pad the images to reach the specified size.</em></p>
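<p><em>As a sketch of these operations (the exact pipeline is given in the GitHub repository; the center-crop step and the 0.5 mean/std normalization values are assumptions here), the transformation could be written with torchvision as follows. Note that </em><code>CenterCrop</code><em> pads smaller images, as mentioned above:</em></p>

<pre class="wp-block-code"><code lang="python" class="language-python">import torch
from PIL import Image
from torchvision import transforms

image_size = 64

# Resize the shorter side to 64, center-crop (or pad) to 64x64, convert to a
# tensor, then map pixel values from [0, 1] to [-1, 1]
transform = transforms.Compose([
    transforms.Resize(image_size),
    transforms.CenterCrop(image_size),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# Quick check on a dummy image; in the real pipeline the transform is passed
# to a dataset / DataLoader pair with a batch size of 128
img = Image.new("RGB", (1024, 768), color="gray")
x = transform(img)
print(x.shape)  # torch.Size([3, 64, 64])</code></pre>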



<p><em>We won&#8217;t show you the code for these transformations here, but as mentioned earlier, you can find it in its entirety on our <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/computer-vision/image-generation/miniconda/dcgan-image-generation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>. You can reproduce all the experiments with <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebooks</a>.</em></p>



<h3 class="wp-block-heading">Step 2 &#8211; Define the models</h3>



<p>Now that the images are ready, we can define our DCGAN:</p>



<h4 class="wp-block-heading">Generator implementation</h4>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/09/Generator-Frame1-1.svg" alt="" class="wp-image-25742" style="width:1200px;height:319px" width="1200" height="319"/></figure>



<p>As shown in the image above, the generator architecture is designed to take a random noise vector z as input and transform it into a (3x64x64) image, which is the same size as the images in the training dataset.</p>



<p>To do this, it uses <strong>transposed convolutions</strong> (sometimes incorrectly called deconvolutions) to progressively upsample the noise vector <em>z</em> until it reaches the desired output image size. These transposed convolutions help the generator capture complex patterns and generate realistic images during the training process.</p>



<p>The final <em>Tanh()</em> activation function ensures that the pixel values of the generated images are in the range <em>[-1, 1]</em>, which also corresponds to our transformed training images (we had normalized them).</p>
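<p><em>As an illustration (this is a sketch following the paper&#8217;s guidelines, not the verbatim code from the repository; the latent size nz = 100 and the feature map base ngf = 64 are the usual assumptions), such a generator could look like this in PyTorch:</em></p>

<pre class="wp-block-code"><code lang="python" class="language-python">import torch
import torch.nn as nn

nz, ngf, nc = 100, 64, 3  # latent size, feature map base, image channels (assumed)

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.main = nn.Sequential(
            # z: (nz, 1, 1) -> (ngf*8, 4, 4)
            nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            # -> (ngf*4, 8, 8)
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            # -> (ngf*2, 16, 16)
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            # -> (ngf, 32, 32)
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            # -> (nc, 64, 64), with outputs in [-1, 1] to match the data
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        return self.main(z)

G = Generator()
fake = G(torch.randn(2, nz, 1, 1))
print(fake.shape)  # torch.Size([2, 3, 64, 64])</code></pre>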



<p><em>The code for implementing this generator is given in its entirety on our <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/computer-vision/image-generation/miniconda/dcgan-image-generation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.</em></p>



<h4 class="wp-block-heading">Discriminator implementation</h4>



<p>As a reminder, the discriminator acts as a sample classifier. Its aim is to distinguish the data generated by the generator from the real data in the training dataset.</p>



<figure class="wp-block-image size-full"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/09/Discriminator-Frame.svg" alt="DCGAN architecture discriminator" class="wp-image-25743"/></figure>



<p></p>



<p>As shown in the image above, the discriminator takes an input image of size (3x64x64) and <strong>outputs a probability</strong>, indicating if the input image is real (1) or fake (0).</p>



<p>To do this, it uses convolutional layers, batch normalization layers, and LeakyReLU functions, which are presented in the paper as architecture guidelines to follow. Each convolutional block is designed to capture features of the input images, moving from low-level features such as edges and textures for the first blocks, to more abstract and complex features such as shapes and objects for the last.</p>



<p>The probability is obtained using the sigmoid activation, which squashes the output to the range <em>[0, 1]</em>.</p>
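<p><em>Following the same architecture guidelines (again a sketch, with an assumed feature map base ndf = 64), the discriminator could be written as:</em></p>

<pre class="wp-block-code"><code lang="python" class="language-python">import torch
import torch.nn as nn

ndf, nc = 64, 3  # feature map base, image channels (assumed)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.main = nn.Sequential(
            # (nc, 64, 64) -> (ndf, 32, 32); strided convs replace pooling
            nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # (ndf*8, 4, 4) -> a single probability per image
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.main(x).view(-1)

D = Discriminator()
p = D(torch.randn(2, 3, 64, 64))
print(p.shape)  # torch.Size([2])</code></pre>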



<p><em>The code for implementing this discriminator is given in its entirety on our <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/computer-vision/image-generation/miniconda/dcgan-image-generation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.</em></p>



<h4 class="wp-block-heading">Define loss function and labels</h4>



<p>Now that we have our adversarial networks, we need to define the <strong>loss function</strong>. </p>



<p>The adversarial loss <em>V(D, G)</em> can be approximated using the<strong> <em>Binary Cross Entropy (BCE)</em></strong> loss function, which is commonly used for GANs because it measures the binary cross-entropy between the discriminator&#8217;s output (probability) and the ground truth labels during training (here we fix real=1 or fake=0). It will calculate the loss for both the generator and the discriminator during <strong>backpropagation</strong>.</p>



<p><em>BCE Loss</em> is computed with the following equation, where <em>target</em> is the ground truth label (1 or 0), and <em>ŷ</em> is the discriminator&#8217;s probability output:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/09/bce_eq-1024x102.png" alt="Binary Cross Entropy loss function" class="wp-image-25744" style="width:616px;height:61px" width="616" height="61" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/09/bce_eq-1024x102.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/09/bce_eq-300x30.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/09/bce_eq-768x76.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/09/bce_eq.png 1246w" sizes="auto, (max-width: 616px) 100vw, 616px" /></figure>



<p>If we compare this equation to our previous <em>V(D, G)</em> objective, we can see that BCE loss term for real data samples corresponds to the first term in <em>V(D, G)</em>, <em>log(D(x))</em>, and the BCE loss term for fake data samples corresponds to the second term in V(D, G), log(1 &#8211; D(G(z))).</p>



<p>In this binary case, the BCE can be represented by two distinct curves, which describe how the loss varies as a function of the predictions ŷ of the model. The first shows the loss as a function of the calculated probability, for a synthetic sample (label y = 0). The second describes the loss for a real sample (label y = 1).</p>



<figure class="wp-block-image aligncenter size-full"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/09/BCE-LOSS1.svg" alt="" class="wp-image-25747"/><figcaption class="wp-element-caption"><em>Variations of BCE loss over the interval ]0;1[ for different targeted labels (y =0 and y = 1)</em></figcaption></figure>



<p>We can see that<strong> the further the prediction ŷ is from the actual label assigned (target), the greater the loss</strong>. On the other hand, a prediction that is close to the truth will generate a loss very close to zero, which will not impact the model since it appears to classify the samples successfully.</p>
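<p><em>We can check this behaviour numerically with PyTorch&#8217;s </em><code>BCELoss</code><em>, using a single confident prediction against both possible labels:</em></p>

<pre class="wp-block-code"><code lang="python" class="language-python">import torch
import torch.nn as nn

criterion = nn.BCELoss()

# A confident, correct prediction on a real sample (target y = 1) gives a
# loss close to zero...
p = torch.tensor([0.99])
loss_correct = criterion(p, torch.tensor([1.0]))
print(round(loss_correct.item(), 3))  # 0.01

# ...while the same prediction against a fake label (target y = 0) is
# heavily penalised, as the curves above show
loss_wrong = criterion(p, torch.tensor([0.0]))
print(round(loss_wrong.item(), 3))  # 4.605</code></pre>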



<p>During training, <strong>the goal is to minimize the BCE loss</strong>. This way, the discriminator will learn to correctly classify real and generated samples, while the generator will learn to generate samples that can &#8220;fool&#8221; the discriminator into classifying them as real.</p>



<h4 class="wp-block-heading">Hyperparameters</h4>



<p>Hyperparameters were chosen according to the indications given in the <a href="https://arxiv.org/abs/1511.06434" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">DCGAN paper</a>.</p>
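<p><em>Concretely, this means using Adam with a learning rate of 0.0002 and &#946;1 = 0.5, the values recommended in the paper. In this sketch, </em><code>nn.Linear</code><em> modules stand in for the real networks:</em></p>

<pre class="wp-block-code"><code lang="python" class="language-python">import torch
import torch.nn as nn

# Hyperparameters recommended in the DCGAN paper
lr = 0.0002        # learning rate for Adam
beta1 = 0.5        # Adam momentum term (0.9 was found to be too unstable)
batch_size = 128
nz = 100           # length of the latent noise vector z

# Placeholder models standing in for the generator and discriminator
netD, netG = nn.Linear(2, 1), nn.Linear(2, 1)
optimizerD = torch.optim.Adam(netD.parameters(), lr=lr, betas=(beta1, 0.999))
optimizerG = torch.optim.Adam(netG.parameters(), lr=lr, betas=(beta1, 0.999))</code></pre>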



<h3 class="wp-block-heading">Step 3 &#8211; Train the model</h3>



<p>We are now ready to train our DCGAN!</p>



<p>To monitor the generator&#8217;s learning progress, we will create a <strong>constant noise vector</strong>, denoted as <code><strong>fixed_noise</strong></code>. </p>



<p>During the training loop, we will regularly feed this <code><strong>fixed_noise</strong></code> into the generator. Using the same constant vector makes it possible to generate comparable images each time, and to observe the evolution of the samples produced by the generator over the training cycles.</p>



<pre class="wp-block-code"><code lang="python" class="language-python">fixed_noise = torch.randn(64, nz, 1, 1, device=device)</code></pre>



<p>Also, we will compute the <strong>BCE Loss</strong> of the Discriminator and the Generator separately. This will enable them to improve over the training cycles. For each batch, these losses will be calculated and saved into lists, enabling us to plot the losses after training for each training iteration.</p>
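<p><em>To make the structure of the loop concrete, here is a condensed sketch of one training iteration (the complete loop is in the GitHub repository; the function name and signature here are ours, and the models, loss and optimisers are assumed to be defined as above):</em></p>

<pre class="wp-block-code"><code lang="python" class="language-python">import torch

def train_step(netD, netG, criterion, optimizerD, optimizerG, real, nz, device):
    """One training iteration over a single batch of real images."""
    b = real.size(0)

    # (1) Update D: maximise log(D(x)) + log(1 - D(G(z)))
    netD.zero_grad()
    label = torch.full((b,), 1.0, device=device)           # real labels
    errD_real = criterion(netD(real).view(-1), label)
    errD_real.backward()

    noise = torch.randn(b, nz, 1, 1, device=device)
    fake = netG(noise)
    label.fill_(0.0)                                       # fake labels
    errD_fake = criterion(netD(fake.detach()).view(-1), label)
    errD_fake.backward()
    optimizerD.step()

    # (2) Update G: maximise log(D(G(z))) -- the generator wants its
    # fakes to be classified as real
    netG.zero_grad()
    label.fill_(1.0)
    errG = criterion(netD(fake).view(-1), label)
    errG.backward()
    optimizerG.step()

    # Return both losses so they can be appended to the tracking lists
    return (errD_real + errD_fake).item(), errG.item()</code></pre>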



<h4 class="wp-block-heading">Training Process</h4>



<p>Thanks to our fixed noise vector, we were able to capture the evolution of the generated images, providing an overview of how the model learned to reproduce the distribution of training data over time.</p>



<p>Here are the samples generated by our model during training, when fed with a fixed noise, over 100 epochs. For visualization, a grid of 9 generated images was chosen:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="381" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/Results-Chests-1024x381.png" alt="generated-samples-epoch" class="wp-image-25678" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/Results-Chests-1024x381.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Results-Chests-300x112.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Results-Chests-768x286.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Results-Chests-1536x572.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Results-Chests-2048x762.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Evolution of the synthetic samples produced by the generator over time, from a constant random vector of noise z</em></figcaption></figure>



<p>At the start of the training process (epoch 1), the images generated show the characteristics of the random noise vector. </p>



<p>As the training progresses, the <strong>weights</strong> of the discriminator and generator <strong>are updated</strong>. Noticeable changes occur in the generated images. Epochs 5, 10 and 20 show quick and subtle evolution of the model, which begins to capture more distinct shapes and structures.</p>



<p>Next epochs show an improvement in edges and details. Generated samples become sharper and more identifiable, and by epoch 100 the images are quite realistic despite the limited data available (5,221 images).</p>



<p><em>Do not hesitate to play with the hyperparameters to try and vary your results! You can also check out the <a href="https://github.com/soumith/ganhacks" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GAN hacks repo</a>, which shares many tips dedicated to training GANs. Training time will vary according to your resources and the number of images.</em></p>



<h3 class="wp-block-heading">Step 4 &#8211; Results &amp; Inference</h3>



<p>Once the generator has been trained over 100 epochs, we are free to generate unlimited new images, each based on a new random noise vector.</p>



<p>In order to retain only relevant samples, a data <strong>post-processing</strong> step was set up to assess the quality of the images generated. All generated images were sent to the trained discriminator. Its job is to evaluate the probability of the generated samples, and keep only those which have obtained a probability greater than a fixed threshold (0.8 for example).</p>
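<p><em>Such a filtering step could be sketched as follows (the function name and defaults are ours, not the repository&#8217;s; </em><code>netG</code><em> and </em><code>netD</code><em> are the trained generator and discriminator):</em></p>

<pre class="wp-block-code"><code lang="python" class="language-python">import torch

@torch.no_grad()
def generate_filtered(netG, netD, nz, n=64, threshold=0.8, device="cpu"):
    """Generate n candidate images and keep only those the trained
    discriminator scores above the threshold (0.8 here, as an example)."""
    noise = torch.randn(n, nz, 1, 1, device=device)
    fake = netG(noise)
    scores = netD(fake).view(-1)   # probability of being real, per image
    return fake[scores > threshold]</code></pre>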



<p>This way, we obtained the following images, compared to the original ones. We can see that despite the small number of images in our dataset, the model was able to identify and learn the distribution of the real image data and reproduce it in a realistic way:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="498" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/results-1024x498.png" alt="real-images-vs-generated" class="wp-image-25679" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/results-1024x498.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/results-300x146.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/results-768x373.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/results-1536x747.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/results.png 1888w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="has-text-align-center"><em>Original dataset images (left), compared with images selected from generated samples (right)</em></p>



<h3 class="wp-block-heading">Step 5 &#8211; Evaluate the model</h3>



<p>A DCGAN model (and GANs in general) can be evaluated in several ways. A <a href="https://arxiv.org/abs/1802.03446" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">research paper</a> has been published on this subject.</p>



<h4 class="wp-block-heading">Quantitative measures</h4>



<p>On the <strong>quantitative</strong> side, the <strong>evolution of the BCE loss</strong> of the generator and the discriminator provide indications of the quality of the model during training.</p>



<p>The evolution of these losses is illustrated in the figure below, where the discriminator losses are shown in orange and the generator losses in blue, over a total of 4,100 iterations. Each epoch corresponds to a complete pass over the dataset, which is split into 41 batches of 128 images, so each iteration processes one batch. Since the model has been trained over 100 epochs, loss tracking is available over 4,100 iterations (41*100).</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/loss-1024x568.png" alt="generator-discriminator-loss" class="wp-image-25677" style="width:706px;height:392px" width="706" height="392" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/loss-1024x568.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/loss-300x167.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/loss-768x426.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/loss-1536x853.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/loss.png 1668w" sizes="auto, (max-width: 706px) 100vw, 706px" /></figure>



<p>At the start of training, both curves show high loss values, indicating an <strong>unstable start</strong> of the DCGAN. This results in very <strong>unrealistic images being generated</strong>, where the nature of the <strong>random noise is still too present </strong>(see epoch 1 on the previous image). The discriminator is therefore too powerful for the moment.</p>



<p>A few iterations later, the losses converge towards lower values, demonstrating the improvement in the model&#8217;s performance.</p>



<p>However, from epoch 10, a trend emerges. The discriminator loss begins to decrease very slightly, indicating an improvement in its ability to determine which samples are genuine and which are synthetic. On the other hand, the generator&#8217;s loss shows a slight increase, suggesting that it needs to improve in order to generate images capable of deceiving its adversary.</p>



<p>More generally, fluctuations are observed throughout training due to the competitive nature of the network, where the generator and discriminator are constantly adjusting relative to each other. These moments of fluctuation may reflect attempts to adjust the two networks. Unfortunately, they do not ultimately appear to lead to an overall reduction in network loss.</p>



<h4 class="wp-block-heading">Qualitative measures</h4>



<p>Losses are not the only performance indicator. They are often insufficient to assess the visual quality of the images generated.</p>



<p>This is confirmed by an analysis of the previous graphs, where we notice that the images generated at epoch 10 are clearly not the most realistic, even though the loss is approximately the same as that obtained at epoch 100.</p>



<p>One commonly used method is <strong>human visual</strong> assessment. However, this manual assessment has several limitations: it is subjective, does not fully reflect the capabilities of the models, cannot be reproduced, and is <strong>expensive</strong>.</p>



<p>Research is therefore focusing on finding new, more reliable and less costly methods. This is particularly the case with <strong>CAPTCHAs</strong>, tests designed to check whether a user is a human or a robot before accessing content. These tests sometimes present pairs of generated and real images where the user has to indicate which of the two seems more authentic. This ultimately amounts to training a discriminator and a generator manually.</p>



<p class="has-text-align-center"><em>All the code related to this article is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/computer-vision/image-generation/miniconda/dcgan-image-generation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>. You can reproduce all the experiments with</em> <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebooks</a>.</p>



<h3 class="wp-block-heading">Conclusion</h3>



<p>I hope you have enjoyed this post!</p>



<p>You are now more comfortable with image generation and the concept of Generative Adversarial Networks! You also know how to generate images from your own dataset, even if it&#8217;s not very large!</p>



<p>You can train your own network on your dataset and generate images of faces, objects and landscapes. Happy GANning! 🎨🚀</p>



<p>You can check our other computer vision articles to learn how to:</p>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/image-segmentation-train-a-u-net-model-to-segment-brain-tumors/" data-wpel-link="internal">Perform Brain tumor segmentation using U-Net</a></li>



<li><a href="https://blog.ovhcloud.com/object-detection-train-yolov5-on-a-custom-dataset/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">Object detection: Train YOLOv5 on a custom dataset</a></li>
</ul>



<p><strong>Paper references</strong></p>



<ul class="wp-block-list">
<li><a href="https://arxiv.org/abs/1406.2661" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Generative Adversarial Nets, Ian Goodfellow, 2014</a></li>



<li><a href="https://arxiv.org/abs/1511.06434" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, Alec Radford et al., 2016</a></li>



<li><a href="https://arxiv.org/abs/1802.03446" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Pros and Cons of GAN Evaluation Measures, Ali Borji, 2018</a></li>
</ul>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Image segmentation: Train a U-Net model to segment brain tumors</title>
		<link>https://blog.ovhcloud.com/image-segmentation-train-a-u-net-model-to-segment-brain-tumors/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Wed, 19 Apr 2023 12:03:29 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[Streamlit]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24637</guid>

					<description><![CDATA[brain tumor segmentation tutorial with BraTS2020 dataset and U-Net]]></description>
										<content:encoded><![CDATA[
<p><em>A guide to discover image segmentation and train a convolutional neural network on medical images to segment brain tumors</em></p>



<p>All the code related to this article is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/computer-vision/image-segmentation/tensorflow/brain-tumor-segmentation-unet/notebook_image_segmentation_unet.ipynb" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>. You can reproduce all the experiments with <strong><a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Notebooks</a></strong>.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/segmentations_compare.gif" alt="Graphical comparison of the original brain tumor segmentation, ground truth, and prediction for the BraTS2020 dataset" class="wp-image-25024" width="1188" height="397"/><figcaption class="wp-element-caption"><em>Comparison of the original and predicted segmentation, with non-enhancing tumors in blue, edema in green and enhancing tumors in yellow.</em></figcaption></figure>



<p>Over the past few years, the field of <strong>computer vision</strong> has experienced significant growth. It encompasses a wide range of methods for acquiring, processing, analyzing and understanding digital images.</p>



<p>Among these methods, one is called <strong>image segmentation</strong>.</p>



<h3 class="wp-block-heading"><strong>What is Image Segmentation?</strong> 🤔</h3>



<p>Image segmentation is a technique used to <strong>separate an image into multiple segments or regions</strong>, each of which corresponds to a different object or part of the image.</p>



<p>The goal is to simplify the image and make it easier to analyze, so that a computer can better understand and interpret the content of the image, which can be really useful!</p>



<p><strong>Application fields</strong></p>



<p>Indeed, image segmentation has many application fields, such as <strong>object detection &amp; recognition, medical imaging, and self-driving systems</strong>. In all these cases, it is essential for the computer to understand the content of the image.</p>



<p><strong>Example</strong></p>



<p>In an image of a street with cars, the segmentation algorithm would be able to divide the image into different regions, with one for the cars, one for the road, another for the sky, one for the trees and so on.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/02/Image_segmentation.png" alt="illustration of semantic image segmentation" class="wp-image-24755" width="470" height="354" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/02/Image_segmentation.png 512w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/Image_segmentation-300x226.png 300w" sizes="auto, (max-width: 470px) 100vw, 470px" /></figure>



<p class="has-text-align-center"><em>Semantic image segmentation from <a href="https://commons.wikimedia.org/wiki/File:Image_segmentation.png" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Wikipedia Creative Commons</a></em></p>



<h5 class="wp-block-heading"><strong>Different types of segmentation</strong></h5>



<p>There are two main types of image segmentation: <strong>semantic segmentation</strong> and <strong>instance segmentation</strong>.</p>



<ul class="wp-block-list">
<li><strong>Semantic segmentation</strong> is the task of assigning a class label to each pixel in an image. For example, in an image of a city, the task of semantic segmentation would be to label each pixel as belonging to a certain class, such as &#8220;building&#8221;, &#8220;road&#8221;, &#8220;sky&#8221;, &#8230;, as shown in the image above.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Instance segmentation</strong> not only assigns a class label to each pixel, but also differentiates instances of the same class within an image. In the previous example, the task would be to not only label each pixel as belonging to a certain class, such as &#8220;building&#8221;, &#8220;road&#8221;, &#8230;, but also to distinguish different instances of the same class, such as different buildings in the image. Each building will then be represented by a different color.</li>
</ul>
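<p>To make the distinction concrete, here is a toy example (not from the article&#8217;s code) contrasting a semantic mask with an instance mask on a tiny grid:</p>

```python
import numpy as np

# Tiny 4x6 "street scene": 0 = road, 1 = building.
semantic = np.array([
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
])

# Instance segmentation additionally separates the two buildings (ids 1 and 2).
instance = np.array([
    [1, 1, 0, 0, 2, 2],
    [1, 1, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
])

# Semantic: one label per class; instance: one id per object of that class.
n_classes = len(np.unique(semantic))        # 2 classes (road, building)
n_buildings = len(np.unique(instance)) - 1  # 2 buildings (ignoring background 0)
```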



<h3 class="wp-block-heading"><strong>Use case &amp; Objective</strong></h3>



<p>Now that we know the concept of image segmentation, let&#8217;s try to put it into practice!</p>



<p>In this article, we will focus on <strong>medical imaging</strong>. Our goal will be to <strong>segment brain tumors</strong>. To do this, we will use the <strong><a href="https://www.kaggle.com/datasets/awsaf49/brats20-dataset-training-validation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">BraTS2020 Dataset</a></strong>.</p>



<h3 class="wp-block-heading">1 &#8211; <strong>BraTS2020 dataset exploration</strong></h3>



<p>This dataset <strong>contains magnetic resonance imaging (MRI) scans of brain tumors</strong>.</p>



<p>To be more specific, each patient in this dataset is represented through <strong>four different MRI scans / modalities, named T1, T1CE, T2 and FLAIR</strong>. These 4 images come with the ground truth segmentation of the tumoral and non-tumoral regions of the brain, which has been produced manually by experts.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/02/braTS2020_dataset_overview-1024x212.png" alt="Display of 4 MRI images from the BraTS2020 dataset, and a tumor segmentation" class="wp-image-24644" width="1195" height="248" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/02/braTS2020_dataset_overview-1024x212.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/braTS2020_dataset_overview-300x62.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/braTS2020_dataset_overview-768x159.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/braTS2020_dataset_overview-1536x318.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/braTS2020_dataset_overview.png 1606w" sizes="auto, (max-width: 1195px) 100vw, 1195px" /><figcaption class="wp-element-caption"><em>Display of the 4 modalities of a patient and its segmentation</em></figcaption></figure>



<p><strong>Why 4 modalities ?</strong></p>



<p>As you can see, the four modalities bring out <strong>different aspects</strong> of the same patient. To be more specific, here is what each of them contributes:</p>



<ul class="wp-block-list">
<li><strong>T1:</strong> Shows the structure and composition of different types of tissue.</li>



<li><strong>T1CE:</strong> Similar to T1 images, but acquired after the injection of a contrast agent, which enhances the visibility of abnormalities.</li>



<li><strong>T2:</strong> Shows the fluid content of different types of tissue.</li>



<li><strong>FLAIR:</strong> Suppresses this fluid content, to better identify lesions and tumors that are not clearly visible on T1 or T2 images.</li>
</ul>



<p>For an expert, it can be useful to have these 4 modalities in order to analyze the tumor more precisely, and to confirm its presence or not.</p>



<p>But for our artificial approach, <strong>using only two modalities instead of four is interesting</strong> since it can reduce the amount of manipulated data and therefore the computational and memory requirements of the segmentation task, making it faster and more efficient. </p>



<p>That is why we will <strong>exclude T1</strong>, since we have its improved version, T1CE. <strong>We will also exclude the T2 modality</strong>: the fluids it shows could degrade our predictions. These fluids are removed in the FLAIR version, which highlights the affected regions much better and will therefore be much more useful for our training.</p>



<p><strong>Images format</strong></p>



<p>It is important to understand that all these MRI scans are <strong><em>NIfTI</em> <em>files</em></strong> (<em>.nii format)</em>. A NIfTI image is a digital representation of a 3D object, such as a brain in our case. Indeed, our modalities and our annotations have a 3-dimensional (240, 240, 155) shape.</p>



<p>Along each dimension, the volume can be cut into a series of two-dimensional images, known as <strong>slices</strong>, which all contain the same number of pixels and are stacked together to form the 3D representation. That is why we were able to display 2D images just above: we displayed the <strong>100th slice</strong> of one plane for the 4 modalities and the segmentation.</p>
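<p>Assuming a modality has already been loaded into a numpy array (for example with the <code>nibabel</code> library, commonly used for NIfTI files), extracting a slice in each of the three planes simply means indexing a different axis. A sketch with a placeholder volume:</p>

```python
import numpy as np

# Placeholder with the BraTS2020 shape; in practice it would be loaded with
# something like: volume = nibabel.load("patient_t1.nii").get_fdata()
volume = np.zeros((240, 240, 155))

# The same 3D volume viewed through its three anatomical planes:
sagittal_slice = volume[100, :, :]  # shape (240, 155)
coronal_slice = volume[:, 100, :]   # shape (240, 155)
axial_slice = volume[:, :, 100]     # shape (240, 240) -- the "100th slice"
```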



<p>Here is a quick presentation of these 3 planes:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/body_planes-1024x493.png" alt="illustration of planes of the body" class="wp-image-24957" width="982" height="473" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/body_planes-1024x493.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/body_planes-300x144.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/body_planes-768x370.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/body_planes.png 1188w" sizes="auto, (max-width: 982px) 100vw, 982px" /></figure>



<p class="has-text-align-center"><em>Planes of the body</em> <em>from <a href="https://commons.wikimedia.org/wiki/File:Planes_of_Body.jpg" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Wikipedia Creative Commons</a></em></p>



<p>&#8211; <strong>Sagittal Plane</strong>: Divides the body into left and right sections and is often referred to as a &#8220;front-back&#8221; plane.</p>



<p>&#8211; <strong>Coronal Plane</strong>: Divides the body into front and back sections and is often referred to as a &#8220;side-side&#8221; plane.</p>



<p>&#8211; <strong>Axial or Transverse Plane</strong>: Divides the body into top and bottom sections and is often referred to as a &#8220;head-toe&#8221; plane.</p>



<p>Each modality can then be displayed through its different planes. For example, we will display the 3 axes of the T1 modality:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_planes_slice_resized-1024x360.png" alt="MRI scan viewed in the 3 planes of the human body" class="wp-image-25017" width="1024" height="360" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_planes_slice_resized-1024x360.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_planes_slice_resized-300x106.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_planes_slice_resized-768x270.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_planes_slice_resized.png 1284w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>100th slice of the T1 modality of the first patient, in the 3 planes of the human body</em></figcaption></figure>



<p><strong>Why choose to display the 100th slice?</strong></p>



<p>Now that we know why we have three dimensions, let&#8217;s try to understand why we chose to display a specific slice.</p>



<p>To do this, we will display all the slices of a modality:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_slices_of_a_plane-1024x667.png" alt="all the slices of a BraTS2020 MRI modality" class="wp-image-24959" width="1024" height="667" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_slices_of_a_plane-1024x667.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_slices_of_a_plane-300x195.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_slices_of_a_plane-768x500.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_slices_of_a_plane.png 1227w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Display of all slices of T1 of the first patient in the sagittal plane</em></figcaption></figure>



<p>As you can see, <strong>two black parts are present</strong> on each side of our montage. Yet <strong>these black parts are slices too</strong>, which means that a large proportion of the slices does not contain much information. This is not surprising, since the MRI scanner passes through the brain gradually.</p>



<p>The same analysis holds for all other modalities, all planes, and also for the images segmented by the experts. Indeed, the experts could not segment slices that contain hardly any information.</p>



<p>This is why we can exclude these slices from our analysis, to reduce the number of manipulated images and speed up our training. Indeed, you can see that the <strong>(60:135) slice range is much more interesting</strong>:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/selected_slices_of_a_plane-1024x667.png" alt="some slices of a BraTS2020 MRI modality" class="wp-image-24962" width="1024" height="667" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/selected_slices_of_a_plane-1024x667.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/selected_slices_of_a_plane-300x195.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/selected_slices_of_a_plane-768x500.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/selected_slices_of_a_plane.png 1227w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Display of slices 60 to 135 of T1 of the first patient in the sagittal plane</em></figcaption></figure>
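<p>In numpy terms, keeping this range is a single slicing operation on the axial axis (a sketch, assuming the volume is already loaded as a numpy array):</p>

```python
import numpy as np

volume = np.zeros((240, 240, 155))  # one BraTS2020 modality

# Keep only the informative slices: 75 of the original 155 remain.
cropped = volume[:, :, 60:135]
```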



<p><strong>What about segmentations?</strong></p>



<p>Now, let&#8217;s focus on the segmentations provided by the experts. What information do they give us?</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/random_segmentation-edited.png" alt="segmentation classes from BraTS2020 dataset" class="wp-image-25027" width="555" height="416" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/random_segmentation-edited.png 555w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/random_segmentation-edited-300x225.png 300w" sizes="auto, (max-width: 555px) 100vw, 555px" /><figcaption class="wp-element-caption"><em><em>100th slice of the segmentation modality of the first patient</em></em></figcaption></figure>



<p>Regardless of the plane you are viewing, you will notice that some slices have multiple colors, which means that the experts have assigned multiple values / classes to the segmentation (one color represents one value).</p>



<p>Actually, there are only 4 possible pixel values in this dataset. <strong>These 4 values will form our 4 classes</strong>. Here is what they correspond to:</p>



<figure class="wp-block-table aligncenter"><table><tbody><tr><td class="has-text-align-center" data-align="center"><strong>Class value</strong></td><td class="has-text-align-center" data-align="center"><strong>Class color</strong></td><td class="has-text-align-center" data-align="center"><strong>Class meaning</strong></td></tr><tr><td class="has-text-align-center" data-align="center">0</td><td class="has-text-align-center" data-align="center">Purple</td><td class="has-text-align-center" data-align="center">Not tumor (healthy zone or image background)</td></tr><tr><td class="has-text-align-center" data-align="center">1</td><td class="has-text-align-center" data-align="center">Blue</td><td class="has-text-align-center" data-align="center">Necrotic and non-enhancing tumor</td></tr><tr><td class="has-text-align-center" data-align="center">2</td><td class="has-text-align-center" data-align="center">Green</td><td class="has-text-align-center" data-align="center">Peritumoral Edema</td></tr><tr><td class="has-text-align-center" data-align="center">4</td><td class="has-text-align-center" data-align="center">Yellow</td><td class="has-text-align-center" data-align="center">Enhancing Tumor</td></tr></tbody></table></figure>



<p class="has-text-align-center"><em>Explanation of the BraTS2020 dataset classes</em></p>



<p>As you can see, class 3 does not exist; the values jump straight from 2 to 4. We will therefore correct this &#8220;error&#8221; before sending the data to our model.</p>
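<p>Once the segmentation is loaded as a numpy array, this relabelling is a one-liner; a minimal sketch:</p>

```python
import numpy as np

seg = np.array([[0, 1, 2, 4],
                [4, 2, 1, 0]])  # raw labels: class 3 is missing, 4 is used instead

seg[seg == 4] = 3  # reassign class 4 to 3 so labels are contiguous (0..3)
```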



<p>Our goal is to predict and segment each of these 4 classes for new patients to find out whether or not they have a brain tumor and which areas are affected.</p>



<p><strong>To summarize data exploration:</strong></p>



<ul class="wp-block-list">
<li>We have for each patient 4 different modalities (T1, T1CE, T2 &amp; FLAIR), accompanied by a segmentation that indicates tumor areas.</li>



<li>Modalities <strong>T1CE</strong> and <strong>FLAIR</strong> are the more interesting to keep, since these 2 provide complementary information about the anatomy and tissue contrast of the patient&#8217;s brain.</li>



<li>Each image is 3D, and can therefore be analyzed through 3 different planes that are composed of 2D slices.</li>



<li>Many slices contain little or no information. We will <strong>only</strong> <strong>keep the (60:135)</strong> <strong>slices</strong> range.</li>



<li>A segmentation image contains 1 to 4 classes.</li>



<li>Class number 4 must be reassigned to 3 since value 3 is missing.</li>
</ul>



<p>Now that we know more about our data, it is time to prepare the training of our model.</p>



<h3 class="wp-block-heading">2 &#8211; Training preparation</h3>



<p><strong>Split data into 3 sets</strong></p>



<p>In the world of AI, the quality of a model is determined by its <strong>ability to make accurate predictions on new, unseen data</strong>. To achieve this, it is important to divide our data into three sets: <strong>Training, Validation and Test</strong>.</p>



<p>Reminder of their usefulness:</p>



<ul class="wp-block-list">
<li><strong>Training set</strong> is used to train the model. During training, the model is exposed to the training data and adjusts its parameters to minimize the error between its predictions and the Ground truth (original segmentations).</li>



<li><strong>Validation set</strong> is used to fine-tune the hyperparameters of our model, which are set before training and determine the behavior of our model. The aim is to compare different hyperparameters and select the best configuration for our model.</li>



<li><strong>Test set</strong> is used to evaluate the performance of our model after it has been trained, to see how well it performs on data that was not used during the training of the model.</li>
</ul>



<p>The dataset contains 369 different patients. Here is the distribution chosen for the 3 data sets:</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/BraTS_data_distribution.png" alt="Data distribution for BraTS2020 dataset" class="wp-image-25006" width="430" height="340" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/BraTS_data_distribution.png 398w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/BraTS_data_distribution-300x237.png 300w" sizes="auto, (max-width: 430px) 100vw, 430px" /></figure>
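<p>A patient-level split might be sketched as follows. The 250 training patients match the number given later in the article; the 69/50 validation/test sizes are an assumed split of the remaining 119 patients, for illustration only:</p>

```python
import random

patient_ids = list(range(369))  # the 369 patients of the BraTS2020 dataset
random.seed(42)                 # fixed seed so the split is reproducible
random.shuffle(patient_ids)

# 250 train; the 69/50 validation/test sizes are an assumption.
train_ids = patient_ids[:250]
val_ids = patient_ids[250:319]
test_ids = patient_ids[319:]
```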



<p><strong>Data preprocessing</strong></p>



<p>In order to train a neural network to segment objects in images, it is necessary to feed it with both the raw image data (X) and the ground truth segmentations (y). By combining these two types of data, the neural network can learn to recognize tumor patterns and make accurate predictions about the contents of a patient&#8217;s scan.</p>



<p>Unfortunately, our modality images (X) and our segmentations (y) <strong>cannot be sent directly to the AI model</strong>. Indeed, loading all these 3D images at once would overload the memory of our environment and lead to shape mismatch errors. We therefore have to do some image <strong>preprocessing</strong> first, which will be done using a <strong>Data Generator</strong>, where we will perform every operation we think is necessary when loading the images.</p>



<p>As we have explained, we will, for each sample:</p>



<ul class="wp-block-list">
<li>Retrieve the paths of its 2 selected modalities (T1CE &amp; FLAIR) and of its ground truth (original segmentation)</li>



<li>Load modalities &amp; segmentation</li>



<li>Create an X array (image) that will contain all the selected slices (60-135) of these 2 modalities.</li>



<li>Generate a y array (image) that will contain all the selected slices (60-135) of the ground truth.</li>



<li>Assign the value 3 to every 4 in the y array (in order to correct the missing class 3).</li>
</ul>
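<p>The steps above can be sketched as a single numpy preprocessing function. This is a simplified illustration: the nearest-neighbour resize and the exact array layout are assumptions, not the repository&#8217;s actual code:</p>

```python
import numpy as np

def preprocess_sample(t1ce, flair, seg):
    """Build (X, y) for one patient from raw (240, 240, 155) volumes."""
    idx = np.linspace(0, 239, 128).astype(int)  # nearest-neighbour 240 -> 128

    x_mods = []
    for vol in (t1ce, flair):
        v = vol[:, :, 60:135]      # keep only the informative slices
        v = v[np.ix_(idx, idx)]    # resize each slice to 128 x 128
        x_mods.append(v)
    X = np.stack(x_mods, axis=-1)  # (128, 128, 75, 2): 75 slices, 2 modalities

    y = seg[:, :, 60:135][np.ix_(idx, idx)]
    y[y == 4] = 3                  # fix the missing class 3
    y = np.moveaxis(y, 2, 0)       # (75, 128, 128): slices first
    return X, y

# Demo with placeholder volumes; one stray "class 4" voxel gets remapped.
t1ce = np.zeros((240, 240, 155))
flair = np.zeros((240, 240, 155))
seg = np.zeros((240, 240, 155))
seg[0, 0, 70] = 4
X, y = preprocess_sample(t1ce, flair, seg)
```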



<p>In addition to these preprocessing steps, we will:</p>



<p><strong>Work in the axial plane</strong></p>



<p>The images are square (240&#215;240) in this plane. And since we will manipulate a range of slices, we will still be able to visualize the predictions in the 3 planes, so this choice does not really have an impact.</p>



<p><strong>Apply a One-Hot Encoder to the y array</strong></p>



<p>Since our goal is to segment regions that are represented as different classes (0 to 3), we must use One-Hot Encoding to convert our categorical variables (classes) into a numerical representation that can be used by our neural network (which is based on mathematical equations).</p>



<p>Indeed, from a mathematical point of view, sending the y array as it is would imply that some classes are greater than others, although there is no ordering between them. For example, class 1 would appear inferior to class 4 simply because 1 &lt; 4. One-Hot encoding lets us manipulate only 0s and 1s.</p>



<p>Here is what it consists of, for one slice: </p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/one-hot-encoding-1024x576.png" alt="One-Hot encoding applied to the BraTS2020 dataset" class="wp-image-25058" width="1204" height="677" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/one-hot-encoding-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/one-hot-encoding-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/one-hot-encoding-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/one-hot-encoding.png 1280w" sizes="auto, (max-width: 1204px) 100vw, 1204px" /><figcaption class="wp-element-caption"><em>One-Hot encoding applied to the BraTS2020 dataset</em></figcaption></figure>
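<p>In numpy, this encoding is a one-liner using an identity matrix, shown here on a tiny 2&#215;2 slice:</p>

```python
import numpy as np

y = np.array([[0, 1],
              [2, 3]])  # one tiny slice with the 4 remapped classes

# Row k of the 4x4 identity matrix is the one-hot vector for class k.
one_hot = np.eye(4, dtype=np.uint8)[y]  # shape (2, 2, 4): one channel per class
```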



<p><strong>Resize each slice of our images</strong> from (240&#215;240) to a (128, 128) shape.</p>



<p>Resizing is needed because we want image sizes that are a power of two (2<sup>n</sup>, where n is an integer). This is because we will use pooling layers (MaxPooling2D) in our convolutional neural network (CNN), each of which halves the spatial resolution.</p>



<p>You may wonder why we didn&#8217;t resize the images in a (256, 256) shape, which also is a power of 2 and is closer to 240 than 128 is.</p>



<p>Indeed, resizing images to (256, 256) may preserve more information than resizing to (128, 128), which could lead to better performance. However, this larger size also means that the model will have more parameters, which will increase the training time and memory requirements. This is why we will choose the (128, 128) shape.</p>
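<p>A quick arithmetic check shows why 128 works well here (assuming four pooling stages, as in the classic U-Net):</p>

```python
# Spatial size after each MaxPooling2D halving, starting from 128:
size = 128
sizes = []
for _ in range(4):
    size //= 2
    sizes.append(size)
# 128 -> 64 -> 32 -> 16 -> 8: always even, so pooling never truncates.
# Starting from 240 instead: 240 -> 120 -> 60 -> 30 -> 15,
# and 15 cannot be halved cleanly at the next stage.
```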



<p><strong>To summarize the preprocessing steps:</strong> </p>



<ul class="wp-block-list">
<li>We use a data generator to be able to process and send our data to our neural network (since all our images cannot be stored in memory at once).</li>



<li>For each epoch (single pass of the entire training dataset through a neural network), the model will receive 250 samples (those contained in our training dataset).</li>



<li>For each sample, the model will have to analyze 150 slices (since there are two modalities, and 75 selected slices for both of them), received in a (128, 128) shape, as an X array of a (128, 128, 75, 2) shape. This array will be provided with the ground truth segmentation of the patient, which will be One-Hot encoded and will then have a (75, 128, 128, 4) shape.</li>
</ul>



<h3 class="wp-block-heading">3 &#8211; Define the model</h3>



<p>Now that our data is ready, we can define our segmentation model.</p>



<p><strong>U-Net</strong></p>



<p>We will use the <a href="https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">U-Net architecture</a>. This <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">convolutional neural network (CNN)</a> is designed for biomedical image segmentation, and is particularly well-suited for segmentation tasks where the regions of interest are small and have complex shapes (such as tumors in MRI scans).</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/u-net-architecture-1024x682.png" alt="U-Net architecture" class="wp-image-25056" width="793" height="528" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/u-net-architecture-1024x682.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/u-net-architecture-300x200.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/u-net-architecture-768x512.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/u-net-architecture-1536x1023.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/u-net-architecture.png 1555w" sizes="auto, (max-width: 793px) 100vw, 793px" /><figcaption class="wp-element-caption"><em>U-Net architecture</em></figcaption></figure>



<p><em>This neural network was first introduced in 2015 by Olaf Ronneberger, Philipp Fischer and Thomas Brox in the paper <a href="https://arxiv.org/abs/1505.04597" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">U-Net: Convolutional Networks for Biomedical Image Segmentation</a>.</em></p>



<p><strong>Loss function</strong></p>



<p>When training a CNN, it&#8217;s important to choose a loss function that accurately reflects the performance of the network. This function compares the predicted pixels with those of the ground truth for each patient. At each epoch, the goal is to update the weights of our model in a way that minimizes this loss function, and therefore improves the accuracy of its predictions.</p>



<p>A commonly used loss function for multi-class classification problems is <strong>categorical cross-entropy</strong>, which measures the difference between the predicted probability distribution of each pixel and the real value of the one-hot encoded ground truth. Note that segmentation models sometimes use the <strong>dice loss function</strong> as well.</p>
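

<p>As a rough illustration of both quantities (a minimal NumPy sketch, not the exact implementation used in the notebook), computed from the one-hot ground truth <code>y_true</code> and the predicted probabilities <code>y_pred</code>:</p>



<pre class="wp-block-code"><code class="">import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    # Mean over pixels of the per-pixel cross-entropy
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=-1)))

def dice_coefficient(y_true, y_pred, eps=1e-7):
    # 2 * intersection / (size of truth + size of prediction)
    intersection = np.sum(y_true * y_pred)
    return float((2.0 * intersection + eps) / (np.sum(y_true) + np.sum(y_pred) + eps))

# A perfect prediction gives a loss near 0 and a Dice score near 1
y_true = np.eye(4)[np.random.randint(0, 4, size=(8, 8))]
perfect_loss = categorical_cross_entropy(y_true, y_true)
perfect_dice = dice_coefficient(y_true, y_true)</code></pre>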



<p><strong>Output activation function</strong></p>



<p>To get this probability distribution over the different classes for each pixel, we apply a <strong>softmax</strong> activation function to the output layer of our neural network. </p>
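

<p>For each pixel, softmax turns the raw network outputs (logits) for the 4 classes into probabilities that sum to 1. A minimal NumPy sketch:</p>



<pre class="wp-block-code"><code class="">import numpy as np

def softmax(logits, axis=-1):
    # Subtract the max logit for numerical stability before exponentiating
    z = logits - np.max(logits, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

# One pixel with 4 class logits, turned into a probability distribution
probs = softmax(np.array([2.0, 1.0, 0.1, -1.0]))</code></pre>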



<p>This means that during training, our CNN will adjust its weights to minimize our loss function, which compares predicted probabilities given by the softmax function with those of the ground truth segmentation.</p>



<p><strong>Other metrics</strong></p>



<p>It is also important to monitor the model&#8217;s performance using evaluation metrics. </p>



<p>We will of course use <strong>accuracy</strong>, which is a very popular measure. However, this metric can be misleading when working with imbalanced datasets like BraTS2020, where the Background class is over-represented. To address this issue, we will use other metrics such as the <strong>intersection over union (IoU), the Dice coefficient, precision, sensitivity, and specificity.</strong></p>



<ul class="wp-block-list">
<li><strong>Accuracy</strong>: Measures the overall proportion of correctly classified pixels, including both positive and negative pixels.</li>



<li><strong>IoU: </strong>Measures the overlap between the predicted and ground truth segmentations.</li>



<li><strong>Precision</strong> (positive predictive value): Measures the proportion of predicted positive pixels that are actually positive.</li>



<li><strong>Sensitivity</strong> (true positive rate): Measures the proportion of positive ground truth pixels that were correctly predicted as positive.</li>



<li><strong>Specificity</strong> (true negative rate): Measures the proportion of negative ground truth pixels that were correctly predicted as negative.</li>
</ul>
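

<p>For a single class, all of these metrics can be derived from the confusion counts between the predicted and ground truth binary masks. A NumPy sketch, following the usual definitions:</p>



<pre class="wp-block-code"><code class="">import numpy as np

def segmentation_metrics(y_true, y_pred):
    """y_true, y_pred: boolean masks of the same shape, for one class."""
    tp = np.sum(np.logical_and(y_true, y_pred))    # true positives
    tn = np.sum(np.logical_and(~y_true, ~y_pred))  # true negatives
    fp = np.sum(np.logical_and(~y_true, y_pred))   # false positives
    fn = np.sum(np.logical_and(y_true, ~y_pred))   # false negatives
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "iou":         tp / (tp + fp + fn),
        "precision":   tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

y_true = np.array([[1, 1, 0, 0], [1, 0, 0, 0]], dtype=bool)
y_pred = np.array([[1, 0, 0, 0], [1, 1, 0, 0]], dtype=bool)
metrics = segmentation_metrics(y_true, y_pred)</code></pre>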



<h3 class="wp-block-heading">4 &#8211; <strong>Analysis of training metrics</strong></h3>



<p><em>The model has been trained for 35 epochs.</em></p>



<figure class="wp-block-image aligncenter size-large is-resized"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/metrics_result-edited.png" alt="Training metrics of a segmentation model for the BraTS2020 dataset" class="wp-image-25047" width="1197" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/metrics_result-edited.png 1407w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/metrics_result-edited-300x150.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/metrics_result-edited-1024x512.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/metrics_result-edited-768x384.png 768w" sizes="(max-width: 1407px) 100vw, 1407px" /><figcaption class="wp-element-caption"><em>Graphical display of training metrics over epochs</em></figcaption></figure>



<p>On the accuracy graph, we can see that both training accuracy and validation accuracy increase over the epochs and reach a plateau. This indicates that the model is learning from the data (training set) and generalizing well to new data (validation set). We do not seem to be facing overfitting, since both metrics are improving.</p>



<p>Then, we can see that our model is clearly learning from the training data, since both losses decrease over time on the second graph. We also notice that the best version of our model is reached around epoch 26. This conclusion is reinforced by the third graph, where both dice coefficients are increasing over epochs.</p>



<h3 class="wp-block-heading">5 &#8211; <strong>Segmentation results</strong></h3>



<p>Once the training is completed, we can look at how the model behaves against the<strong> test set </strong>by calling the <em><strong>.evaluate() </strong></em>function:</p>



<figure class="wp-block-table aligncenter"><table><tbody><tr><td class="has-text-align-center" data-align="center">Metric</td><td class="has-text-align-center" data-align="center">Score</td></tr><tr><td class="has-text-align-center" data-align="center">Categorical cross-entropy loss</td><td class="has-text-align-center" data-align="center">0.0206</td></tr><tr><td class="has-text-align-center" data-align="center">Accuracy</td><td class="has-text-align-center" data-align="center">0.9935</td></tr><tr><td class="has-text-align-center" data-align="center">MeanIOU</td><td class="has-text-align-center" data-align="center">0.8176</td></tr><tr><td class="has-text-align-center" data-align="center">Dice coefficient</td><td class="has-text-align-center" data-align="center">0.6008</td></tr><tr><td class="has-text-align-center" data-align="center">Precision</td><td class="has-text-align-center" data-align="center">0.9938</td></tr><tr><td class="has-text-align-center" data-align="center">Sensitivity</td><td class="has-text-align-center" data-align="center">0.9922</td></tr><tr><td class="has-text-align-center" data-align="center">Specificity</td><td class="has-text-align-center" data-align="center">0.9979</td></tr></tbody></table></figure>



<p>We can conclude that the model <strong>performed very well on the test dataset</strong>, achieving a <strong>low test loss </strong>(0.0206), <strong>a correct dice coefficient</strong> (0.6008) for an image segmentation task, and <strong>good scores on other metrics</strong> which indicate that the model has good generalization performance on unseen data.</p>



<p>To understand a little better what is behind these scores, let&#8217;s plot the predicted segmentations of some randomly selected patients:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_wo_processing-1024x640.png" alt="Predicted segmentation vs ground truth segmentation for the BraTS2020 dataset" class="wp-image-25052" width="902" height="564" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_wo_processing-1024x640.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_wo_processing-300x188.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_wo_processing-768x480.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_wo_processing.png 1280w" sizes="auto, (max-width: 902px) 100vw, 902px" /><figcaption class="wp-element-caption"><em>Graphical comparison of original and predicted segmentations for randomly selected patients</em></figcaption></figure>



<p>Predicted segmentations <strong>seem quite accurate</strong>, but we need to do some <strong>post-processing</strong> in order to convert the probabilities given by the softmax function into a single class per pixel, corresponding to the class with the highest probability.</p>



<p>The <em><strong>argmax()</strong></em> function is chosen here. Applying this function will also allow us to <strong>remove some false positive cases</strong>, and to <strong>plot the same colors</strong> for both the original segmentation and the prediction, making them easier to compare than just above.</p>
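

<p>Assuming the model output has a (slices, 128, 128, 4) shape, this post-processing is a single <code>argmax</code> over the class axis (NumPy sketch with reduced shapes):</p>



<pre class="wp-block-code"><code class="">import numpy as np

# Fake softmax output: 2 slices of 4x4 pixels, 4 class probabilities each
probs = np.random.rand(2, 4, 4, 4)
probs = probs / probs.sum(axis=-1, keepdims=True)

# Keep, for each pixel, the class with the highest probability
pred_classes = np.argmax(probs, axis=-1)   # shape (2, 4, 4), values in 0..3</code></pre>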



<p>For the same patients as before, we obtain: </p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_processed1-1024x640.png" alt="Post-processed predicted segmentation vs ground truth segmentation for the BraTS2020 dataset" class="wp-image-25054" width="908" height="568" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_processed1-1024x640.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_processed1-300x188.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_processed1-768x480.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_processed1.png 1280w" sizes="auto, (max-width: 908px) 100vw, 908px" /><figcaption class="wp-element-caption"><em>Graphical comparison of original and post-processed predicted segmentations for randomly selected patients</em></figcaption></figure>



<h3 class="wp-block-heading">Conclusion</h3>



<p>I hope you have enjoyed this tutorial, and that you now feel more comfortable with image segmentation! </p>



<p>Keep in mind that even if our results seem accurate, we have some false positives in our predictions. In a field like medical imaging, it is crucial to evaluate the balance between true positives and false positives, and to assess the risks and benefits of an AI-based approach.</p>



<p>As we have seen, post-processing techniques can be used to solve this problem. However, we must be careful with the results of these methods, since they can lead to a loss of information.</p>



<h3 class="wp-block-heading">Want to find out more?</h3>



<ul class="wp-block-list">
<li><strong>Notebook</strong></li>
</ul>



<p>All the code is available on our <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/computer-vision/image-segmentation/tensorflow/brain-tumor-segmentation-unet/notebook_image_segmentation_unet.ipynb" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.</p>



<ul class="wp-block-list">
<li><strong>App</strong></li>
</ul>



<p>A Streamlit application was created around this use case to predict and observe the predictions generated by the model. Find the <a href="https://github.com/ovh/ai-training-examples/tree/main/apps/streamlit/image-segmentation-brain-tumors" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">segmentation app&#8217;s code here</a>.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fimage-segmentation-train-a-u-net-model-to-segment-brain-tumors%2F&amp;action_name=Image%20segmentation%3A%20Train%20a%20U-Net%20model%20to%20segment%20brain%20tumors&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Deploy a custom Docker image for Data Science project – A spam classifier with FastAPI (Part 3)</title>
		<link>https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-a-spam-classifier-with-fastapi-part-3/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Fri, 30 Dec 2022 10:39:54 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Notebook]]></category>
		<category><![CDATA[AI Solutions]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[Scikit Learn]]></category>
		<category><![CDATA[spam classification]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24202</guid>

					<description><![CDATA[A guide to deploy a custom Docker image for an API with FastAPI and AI Deploy. Welcome to the third article concerning custom Docker image deployment. If you haven&#8217;t read the previous ones, you can check it: &#8211; Gradio sketch recognition app&#8211; Streamlit app for EDA and interactive prediction When creating code for a Data [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdeploy-a-custom-docker-image-for-data-science-project-a-spam-classifier-with-fastapi-part-3%2F&amp;action_name=Deploy%20a%20custom%20Docker%20image%20for%20Data%20Science%20project%20%E2%80%93%20A%20spam%20classifier%20with%20FastAPI%20%28Part%203%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>A guide to deploy a custom Docker image for an API with <a href="https://fastapi.tiangolo.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">FastAPI</a> and <strong>AI Deploy</strong>.</em></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="815" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier-1024x815.jpg" alt="fastapi for spam classification" class="wp-image-24226" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier-1024x815.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier-300x239.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier-768x612.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier-1536x1223.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier.jpg 1620w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><em>Welcome to the third article concerning <strong>custom Docker image deployment</strong>. If you haven&#8217;t read the previous ones, you can check them:</em></p>



<p><em>&#8211; </em><a href="https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-gradio-sketch-recognition-app-part-1/" data-wpel-link="internal">Gradio sketch recognition app</a><br><em>&#8211; </em><a href="https://docs.ovh.com/fr/publiccloud/ai/deploy/tuto-streamlit-eda-iris/" data-wpel-link="exclude">Streamlit app for EDA and interactive prediction</a></p>



<p>When creating code for a <strong>Data Science project</strong>, you probably want it to be as portable as possible. In other words, it can be run as many times as you like, even on different machines.</p>



<p>Unfortunately, it is often the case that Data Science code works fine locally on one machine but fails at runtime on another. This can be due to different versions of the libraries installed on the host machine.</p>



<p>To deal with this problem, you can use <a href="https://www.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker</a>.</p>



<p><strong>The article is organized as follows:</strong></p>



<ul class="wp-block-list">
<li>Objectives</li>



<li>Concepts</li>



<li>Define a model for spam classification</li>



<li>Build the FastAPI app with Python</li>



<li>Containerize your app with Docker</li>



<li>Launch the app with AI Deploy</li>
</ul>



<p><em>All the code for this blogpost is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/tree/main/apps/fastapi/spam-classifier-api" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>. You can test it with OVHcloud <strong>AI Deploy</strong> tool, please refer to the <a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/tuto-fastapi-spam-classifier/" data-wpel-link="exclude">documentation</a> to boot it up.</em></p>



<h2 class="wp-block-heading">Objectives</h2>



<p>In this article, you will learn how to develop a <strong>FastAPI</strong> API for spam classification.</p>



<p>Once your app is up and running locally, it will be a matter of containerizing it, then deploying the custom Docker image with AI Deploy.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="2160" height="1215" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited.jpg" alt="objective of api deployment" class="wp-image-24228" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited.jpg 2160w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited-300x169.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited-1024x576.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited-768x432.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited-1536x864.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited-2048x1152.jpg 2048w" sizes="auto, (max-width: 2160px) 100vw, 2160px" /></figure>



<h2 class="wp-block-heading">Concepts</h2>



<p>In Artificial Intelligence, you have probably heard of <strong>Natural Language Processing</strong> (NLP). <strong>NLP</strong> covers several language processing tasks, such as <strong>text classification</strong>.</p>



<p>This technique is ideal for distinguishing spam from other messages.</p>



<h3 class="wp-block-heading">Spam Ham Collection&nbsp;Dataset</h3>



<p>The <a href="https://archive.ics.uci.edu/ml/datasets/sms+spam+collection" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">SMS Spam Collection</a> is a public set of SMS labeled messages that have been collected for mobile phone spam research.</p>



<p>The dataset contains <strong>5,574 messages</strong> in English. The SMS messages are tagged as follows:</p>



<ul class="wp-block-list">
<li><strong>HAM</strong> if the message is legitimate</li>



<li><strong>SPAM</strong> if it is not</li>
</ul>



<p>The collection is a <strong>text file</strong>, where each line has the correct <strong>class</strong> followed by the raw <strong>message</strong>.</p>
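

<p>Each line can therefore be parsed by splitting on the first tab character (a plain-Python sketch; the model code below uses pandas&#8217; <code>read_csv</code> with a tab delimiter to do the same thing):</p>



<pre class="wp-block-code"><code class="">def parse_line(line):
    """Split one 'label TAB message' line into a (label, message) pair."""
    label, message = line.rstrip("\n").split("\t", 1)
    return label, message

label, message = parse_line("ham\tOk lar... Joking wif u oni...\n")</code></pre>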



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5-1024x576.png" alt="spam ham dataset" class="wp-image-24219" width="773" height="435" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5-1536x864.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5.png 1920w" sizes="auto, (max-width: 773px) 100vw, 773px" /></figure>



<h3 class="wp-block-heading">Logistic regression</h3>



<p><strong>What is a Logistic Regression?</strong></p>



<p><a href="https://fr.wikipedia.org/wiki/R%C3%A9gression_logistique" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Logistic regression</a> is a statistical model. It is used to study the relationships between a set of <code>i</code> <strong>explanatory variables</strong> (<code>Xi</code>) and a <strong>qualitative variable</strong> (<code>Y</code>).</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression-1024x779.jpg" alt="logistic regression" class="wp-image-24229" width="467" height="355" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression-1024x779.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression-300x228.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression-768x584.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression-1536x1168.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression.jpg 1620w" sizes="auto, (max-width: 467px) 100vw, 467px" /></figure>



<p>It is a generalized linear model using a logistic function as a link function.</p>



<p>A logistic regression model can also predict the <strong>probability</strong> of an event occurring (value close to <code><strong>1</strong></code>) or not (value close to <strong><code>0</code></strong>), obtained by optimizing the <strong>regression coefficients</strong>. This probability always lies between <strong><code>0</code></strong> and <strong><code>1</code></strong>.</p>
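

<p>This is the role of the logistic (sigmoid) link function, which maps any real-valued linear combination of the inputs to a value between <code>0</code> and <code>1</code>. A minimal NumPy sketch (the coefficients here are arbitrary, for illustration only):</p>



<pre class="wp-block-code"><code class="">import numpy as np

def predict_proba(x, coef, intercept):
    # P(Y = 1 given x) = 1 / (1 + exp(-(intercept + coef . x)))
    z = intercept + np.dot(coef, x)
    return 1.0 / (1.0 + np.exp(-z))

# With these arbitrary coefficients, z = 0.1 + 0.5 - 0.6 = 0, so p = 0.5
p = predict_proba(np.array([1.0, 2.0]), coef=np.array([0.5, -0.3]), intercept=0.1)</code></pre>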



<p>For the spam classification use case, <strong>words</strong> are inputs and <strong>class</strong> (spam or ham) is output.</p>



<h3 class="wp-block-heading">FastAPI</h3>



<p><strong>What is FastAPI?</strong></p>



<p><a href="https://fastapi.tiangolo.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">FastAPI</a> is a web framework for building <strong>RESTful APIs</strong> with Python.</p>



<p>FastAPI is based on <a href="https://docs.pydantic.dev/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Pydantic</a> and type hints to <em>validate</em>, <em>serialize</em> and <em>deserialize</em> data, and automatically generate OpenAPI documents.</p>



<h3 class="wp-block-heading">Docker</h3>



<p>The <a href="https://www.docker.com/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Docker</a>&nbsp;platform allows you to build, run and manage isolated applications. The principle is to build an application that contains not only the written code but also all the context needed to run it: for example, the libraries and their versions.</p>



<p>When you wrap your application with all its context, you build a Docker image, which can be saved in your local repository or on Docker Hub.</p>



<p>To get started with Docker, please, check this&nbsp;<a href="https://www.docker.com/get-started" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">documentation</a>.</p>



<p>To build a Docker image, you will define 2 elements:</p>



<ul class="wp-block-list">
<li>the application code (<em>FastAPI app</em>)</li>



<li>the&nbsp;<a href="https://docs.docker.com/engine/reference/builder/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Dockerfile</a></li>
</ul>



<p>In the next steps, you will see how to develop the Python code for your app, but also how to write the Dockerfile.</p>



<p>Finally, you will see how to deploy your custom docker image with&nbsp;<strong>OVHcloud AI Deploy</strong>&nbsp;tool.</p>



<h3 class="wp-block-heading">AI Deploy</h3>



<p><strong>AI Deploy</strong>&nbsp;enables AI models and managed applications to be started via Docker containers.</p>



<p>To know more about AI Deploy, please refer to this&nbsp;<a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/getting-started/" data-wpel-link="exclude">documentation</a>.</p>



<h2 class="wp-block-heading">Define a model for spam classification</h2>



<p>❗ <strong><code>To develop an API that uses a Machine Learning model, you have to load the model in the correct format. For this tutorial, a Logistic Regression is used and the Python file model.py is used to define it</code></strong>.<br><br><code><strong>To better understand the model.py code, refer to the <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/text-classification/miniconda/spam-classifier/notebook-spam-classifier.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">notebook</a> which details all the steps</strong></code>.</p>



<p>First of all, you have to import the&nbsp;<strong>Python libraries</strong>&nbsp;needed to create the Logistic Regression in the <code>model.py</code> file.</p>



<pre class="wp-block-code"><code class="">import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression</code></pre>



<p>Now, you can create the Logistic Regression based on the <strong>Spam Ham Collection&nbsp;Dataset</strong>. The <strong>Scikit-Learn</strong> Python framework is used to define this model.</p>



<p>Firstly, you can load the dataset and transform your input file into a <code>dataframe</code>.</p>



<p>You will also be able to define the <code>input</code> and the <code>output</code> of the model.</p>



<pre class="wp-block-code"><code class="">def load_data():

    PATH = 'SMSSpamCollection'
    df = pd.read_csv(PATH, delimiter = "\t", names=["classe", "message"])

    X = df['message']
    y = df['classe']

    return X, y</code></pre>



<p>In a second step, you split the data into a training set and a test set.</p>



<p>To <strong>separate the dataset fairly</strong> and to get a <code>test_size</code> between 0 and 1, you can calculate <code>ntest</code> as follows (keeping 2,000 of the messages for the test set).</p>



<pre class="wp-block-code"><code class="">def split_data(X, y):

    ntest = 2000/(3572+2000)

    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=ntest, random_state=0)

    return X_train, y_train</code></pre>



<p>Now you can concentrate on creating the <strong>Machine Learning model</strong>. To do this, create a <code>spam_classifier_model</code> function.</p>



<p>To fully understand the code, refer to <strong>Steps 6 to 9</strong> of this <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/text-classification/miniconda/spam-classifier/notebook-spam-classifier.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">notebook</a>. In these steps you will learn how to:</p>



<ul class="wp-block-list">
<li>create the model using <strong>Logistic Regression</strong></li>



<li>evaluate on the test set</li>



<li>do <strong>dimension reduction</strong> with stop words and term frequency</li>



<li>do <strong>dimension reduction</strong> to post-processing of the model</li>
</ul>



<pre class="wp-block-code"><code class="">def spam_classifier_model(Xtrain, ytrain):

    model_logistic_regression = LogisticRegression()
    model_logistic_regression = model_logistic_regression.fit(Xtrain, ytrain)

    coeff = model_logistic_regression.coef_
    coef_abs = np.abs(coeff)

    quantiles = np.quantile(coef_abs,[0, 0.25, 0.5, 0.75, 0.9, 1])

    index = np.where(coef_abs[0] &gt; quantiles[1])  # keep coefficients whose magnitude is above the first quartile
    newXtrain = Xtrain[:, index[0]]

    model_logistic_regression = LogisticRegression()
    model_logistic_regression.fit(newXtrain, ytrain)

    return model_logistic_regression, index</code></pre>



<p>Once these Python functions are defined, you can call and apply them as follows.</p>



<p>Firstly, extract input and output data with <code>load_data()</code>:</p>



<pre class="wp-block-code"><code class="">data_input, data_output = load_data()</code></pre>



<p>Secondly, split the data using the <code>split_data(data_input, data_output)</code>:</p>



<pre class="wp-block-code"><code class="">X_train, ytrain = split_data(data_input, data_output)</code></pre>



<p>❗ <code><strong>Here, there is no need to use the test set. Indeed, the evaluation of the final model has already been done in <em>Step 9 - Dimensionality reduction: post processing of the model</em> of the <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/text-classification/miniconda/spam-classifier/notebook-spam-classifier.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">notebook</a>.</strong></code></p>



<p>Thirdly, <strong>transform</strong> and <strong>fit</strong> the training set. In order to prepare the data, you can use <code>CountVectorizer</code> from Scikit-Learn to remove <strong>stop words</strong>, and then <code>fit_transform</code> to fit the inputs.</p>



<pre class="wp-block-code"><code class="">vectorizer = CountVectorizer(stop_words='english', binary=True, min_df=10)
Xtrain = vectorizer.fit_transform(X_train.tolist())
Xtrain = Xtrain.toarray()</code></pre>
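

<p>Conceptually, this binary bag-of-words representation looks like the following simplified pure-Python sketch (the real <code>CountVectorizer</code> also handles tokenization rules, the English stop-word list and the <code>min_df=10</code> threshold):</p>



<pre class="wp-block-code"><code class="">def binary_bag_of_words(documents):
    """Build a sorted vocabulary and a binary presence matrix."""
    vocab = sorted({word for doc in documents for word in doc.lower().split()})
    matrix = [[1 if word in doc.lower().split() else 0 for word in vocab]
              for doc in documents]
    return vocab, matrix

docs = ["Free prize now", "Call me now"]
vocab, X = binary_bag_of_words(docs)
# vocab is ['call', 'free', 'me', 'now', 'prize']</code></pre>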



<p>Fourthly, obtain the model and the index used for prediction by calling the <code>spam_classifier_model</code> function.</p>



<pre class="wp-block-code"><code class="">model_logistic_regression, index = spam_classifier_model(Xtrain, ytrain)</code></pre>



<p>Find out the full Python code <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/fastapi/spam-classifier-api/model.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</p>



<p>Have you successfully defined your model? Good job 🥳 !</p>



<p>Let&#8217;s go for the creation of the API!</p>



<h2 class="wp-block-heading">Build the FastAPI app with Python</h2>



<p>❗ <code><strong>All the codes below are available in the <em>app.py</em> file. You can find the complete Python code of the <em>app.py</em> file <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/fastapi/spam-classifier-api/app.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></code></p>



<p>To begin, you can import dependencies for FastAPI app.</p>



<ul class="wp-block-list">
<li>uvicorn</li>



<li>fastapi</li>



<li>pydantic</li>
</ul>



<pre class="wp-block-code"><code class="">import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from model import model_logistic_regression, index, vectorizer</code></pre>



<p>In the first place, you can initialize an instance of FastAPI.</p>



<pre class="wp-block-code"><code class="">app = FastAPI()</code></pre>



<p>Next, you can define the data format by creating the Python class named <code>request_body</code>. Here, the <strong>string</strong> (<code>str</code>) format is required.</p>



<pre class="wp-block-code"><code class="">class request_body(BaseModel):
    message : str</code></pre>



<p>Now, you can create the process function in order to prepare the sent message to be used by the model.</p>



<pre class="wp-block-code"><code class="">def process_message(message):

    desc = vectorizer.transform(message)
    dense_desc = desc.toarray()
    dense_select = dense_desc[:, index[0]]

    return dense_select</code></pre>



<p>When this function returns, the message no longer contains any <strong>stop words</strong>: it has been put in the right format for the model by <code>transform</code>, and is represented as an <code>array</code>.</p>



<p>Now that the function for processing the input data is defined, you can move on to the <code>GET</code> and <code>POST</code> methods.</p>



<p>First, let&#8217;s go for the <code>GET</code> method!</p>



<pre class="wp-block-code"><code class="">@app.get('/')
def root():
    return {'message': 'Welcome to the SPAM classifier API'}</code></pre>



<p>Here you can see the <em>welcome message</em> when you arrive on your API.</p>



<pre class="wp-block-preformatted"><code><strong>{"message":"Welcome to the SPAM classifier API"}</strong></code></pre>



<p>Now it&#8217;s the turn of the <code>POST</code> method. In this part of the code, you will be able to:</p>



<ul class="wp-block-list">
<li>define the message format</li>



<li>check if a message has been sent or not</li>



<li>process the message to fit with the model</li>



<li>extract the probabilities</li>



<li>return the results</li>
</ul>



<pre class="wp-block-code"><code class="">@app.post('/spam_detection_path')
def classify_message(data : request_body):

    if not data.message:
        raise HTTPException(status_code=400, detail="Please provide a valid text message")

    message = [
        data.message
    ]

    dense_select = process_message(message)

    label = model_logistic_regression.predict(dense_select)
    proba = model_logistic_regression.predict_proba(dense_select)

    if label[0]=='ham':
        label_proba = proba[0][0]
    else:
        label_proba = proba[0][1]

    return {'label': label[0], 'label_probability': label_proba}</code></pre>



<p><code><strong>❗ Again, you can find the full code <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/fastapi/spam-classifier-api/app.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a></strong></code>.</p>
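<p>Once the API is running, you can also query it programmatically instead of going through a browser. For example, assuming the API runs locally on port 8000, a hypothetical <code>curl</code> call could look like this:</p>



<pre class="wp-block-code"><code class=""># send a message to the POST endpoint (assumes the API runs on localhost:8000)
curl -X POST "http://localhost:8000/spam_detection_path" \
     -H "Content-Type: application/json" \
     -d '{"message": "A new free service for you only"}'</code></pre>



<p>The response is the same JSON object you will see in the dashboard below, with the <code>label</code> and <code>label_probability</code> fields.</p>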



<p>Before deploying your API, you can test it locally using the following command:</p>



<pre class="wp-block-code"><code class="">uvicorn app:app --reload</code></pre>



<p>Then, you can test your app locally at the following address:&nbsp;<strong><code>http://localhost:8000/</code></strong></p>



<p>You will arrive on the following page:</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-4.png" alt="" class="wp-image-24217" width="590" height="721" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-4.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-4-245x300.png 245w" sizes="auto, (max-width: 590px) 100vw, 590px" /></figure>



<p><strong>How to interact with your&nbsp;API?</strong></p>



<p>You can add&nbsp;<code>/docs</code>&nbsp;at the end of the URL of your&nbsp;app: <strong><code>http://localhost:8000/</code></strong><code><strong>docs</strong></code></p>



<p>A new page opens. It provides a complete dashboard for interacting with the&nbsp;API!</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image.png" alt="" class="wp-image-24213" width="590" height="722" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-245x300.png 245w" sizes="auto, (max-width: 590px) 100vw, 590px" /></figure>



<p>To be able to send a message for classification, select&nbsp;<code><strong>/spam_detection_path</strong></code>&nbsp;in the green box. Click on<strong>&nbsp;<code>Try</code></strong><code><strong> it out</strong></code>&nbsp;and type the message of your choice in the dedicated&nbsp;zone.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-2.png" alt="" class="wp-image-24215" width="596" height="729" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-2.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-2-245x300.png 245w" sizes="auto, (max-width: 596px) 100vw, 596px" /></figure>



<p>Enter the message of your choice. It must be in the form of a <code><strong>string</strong></code>. </p>



<p><em>Example:</em> <code><strong>"A new free service for you only"</strong></code></p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-1.png" alt="" class="wp-image-24214" width="599" height="733" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-1.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-1-245x300.png 245w" sizes="auto, (max-width: 599px) 100vw, 599px" /></figure>



<p>To get the result of the prediction, click on the&nbsp;<code><strong>Execute</strong></code>&nbsp;button.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-3.png" alt="" class="wp-image-24216" width="611" height="748" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-3.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-3-245x300.png 245w" sizes="auto, (max-width: 611px) 100vw, 611px" /></figure>



<p>Finally, you obtain the result of the prediction with the&nbsp;<strong>label</strong>&nbsp;and the&nbsp;<strong>confidence&nbsp;score</strong>.</p>



<p>Your app works locally? Congratulations&nbsp;🎉 !</p>



<p>Now it’s time to move on to containerization!</p>



<h2 class="wp-block-heading">Containerize your app with Docker</h2>



<p>First of all, you have to create the file that lists the Python modules to be installed, with their corresponding versions.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="574" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker-1024x574.jpg" alt="docker image datascience" class="wp-image-24230" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker-1024x574.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker-300x168.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker-768x430.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker-1536x861.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker.jpg 1620w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Create the requirements.txt file</h3>



<p>The&nbsp;<code><a href="https://github.com/ovh/ai-training-examples/blob/main/apps/fastapi/spam-classifier-api/requirements.txt" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">requirements.txt</a></code>&nbsp;file lists all the modules needed to make our application work.</p>



<pre class="wp-block-code"><code class="">fastapi==0.87.0
pydantic==1.10.2
uvicorn==0.20.0
pandas==1.5.1
scikit-learn==1.1.3</code></pre>



<p>This file will be useful when writing the&nbsp;<code>Dockerfile</code>.</p>



<h3 class="wp-block-heading">Write the Dockerfile</h3>



<p>Your&nbsp;<code><a href="https://github.com/ovh/ai-training-examples/blob/main/apps/fastapi/spam-classifier-api/Dockerfile" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Dockerfile</a></code>&nbsp;should start with the&nbsp;<code>FROM</code>&nbsp;instruction indicating the parent image to use. In our case we choose to start from a classic Python image.</p>



<p>For this FastAPI app, you can use version&nbsp;<strong><code>3.8</code></strong>&nbsp;of Python.</p>



<pre class="wp-block-code"><code class="">FROM python:3.8</code></pre>



<p>Next, you have to specify the working directory and copy all your files into it.</p>



<p><code><strong>❗&nbsp;Here you must be in the /workspace directory. This is the default directory for launching an OVHcloud AI Deploy app.</strong></code></p>



<pre class="wp-block-code"><code class="">WORKDIR /workspace
ADD . /workspace</code></pre>



<p>Install the Python modules listed in the&nbsp;<code>requirements.txt</code>&nbsp;file using a&nbsp;<code>pip install</code>&nbsp;command.</p>



<pre class="wp-block-code"><code class="">RUN pip install -r requirements.txt</code></pre>



<p>Set the listening port of the&nbsp;container. For <strong>FastAPI</strong>, you can use the port <code>8000</code>.</p>



<pre class="wp-block-code"><code class="">EXPOSE 8000</code></pre>



<p>Then, you have to define the <strong>entrypoint</strong> and the <strong>default launching command</strong> to start the application.</p>



<pre class="wp-block-code"><code class="">ENTRYPOINT ["uvicorn"]
CMD [ "app:app", "--host", "0.0.0.0", "--port", "8000" ]</code></pre>



<p>Finally, you can give the correct access rights to the OVHcloud user (<code>42420:42420</code>).</p>



<pre class="wp-block-code"><code class="">RUN chown -R 42420:42420 /workspace
ENV HOME=/workspace</code></pre>
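<p>Putting the previous steps together, the complete <code>Dockerfile</code> looks like the following recap (refer to the linked repository for the reference version):</p>



<pre class="wp-block-code"><code class="">FROM python:3.8

WORKDIR /workspace
ADD . /workspace

RUN pip install -r requirements.txt

EXPOSE 8000

RUN chown -R 42420:42420 /workspace
ENV HOME=/workspace

ENTRYPOINT ["uvicorn"]
CMD [ "app:app", "--host", "0.0.0.0", "--port", "8000" ]</code></pre>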



<p>Once your&nbsp;<code>Dockerfile</code>&nbsp;is defined, you will be able to build your custom docker image.</p>



<h3 class="wp-block-heading">Build the Docker image from the Dockerfile</h3>



<p>First, you can launch the following command from the&nbsp;<code>Dockerfile</code>&nbsp;directory to build your application image.</p>



<pre class="wp-block-code"><code class="">docker build . -t fastapi-spam-classification:latest</code></pre>



<p>⚠️&nbsp;<strong><code>The dot . argument indicates that your build context (the location of the Dockerfile and the other required files) is the current directory.</code></strong></p>



<p>⚠️&nbsp;<code><strong>The -t argument allows you to choose the identifier to give to your image. Usually image identifiers are composed of a name and a version tag &lt;name&gt;:&lt;version&gt;. For this example we chose fastapi-spam-classification:latest.</strong></code></p>



<h3 class="wp-block-heading">Test it locally</h3>



<p>Now, you can run the following&nbsp;<strong>Docker command</strong>&nbsp;to launch your application locally on your computer.</p>



<pre class="wp-block-code"><code class="">docker run --rm -it -p 8000:8000 --user=42420:42420 fastapi-spam-classification:latest</code></pre>



<p>⚠️&nbsp;<code><strong>The -p 8000:8000 argument indicates that you want to execute a port redirection from the port 8000 of your local machine into the port 8000 of the Docker container.</strong></code></p>



<p>⚠️<code><strong>&nbsp;Don't forget the --user=42420:42420 argument if you want to simulate the exact same behaviour that will occur on AI Deploy. It executes the Docker container as the specific OVHcloud user (user 42420:42420).</strong></code></p>



<p>Once started, your application should be available on&nbsp;<strong>http://localhost:8000</strong>.<br><br>Your Docker image seems to work? Good job&nbsp;👍 !<br><br>It’s time to push it and deploy it!</p>



<h3 class="wp-block-heading">Push the image into the shared registry</h3>



<p>❗&nbsp;The shared registry of AI Deploy should only be used for testing purposes. Please consider attaching your own Docker registry. More information about this can be found&nbsp;<a href="https://docs.ovh.com/asia/en/publiccloud/ai/training/add-private-registry/" data-wpel-link="exclude">here</a>.</p>



<p>Then, you have to find the address of your&nbsp;<code>shared registry</code>&nbsp;by launching this command.</p>



<pre class="wp-block-code"><code class="">ovhai registry list</code></pre>



<p>Next, log in to the shared registry with your usual&nbsp;<code>OpenStack</code>&nbsp;credentials.</p>



<pre class="wp-block-code"><code class="">docker login -u &lt;user&gt; -p &lt;password&gt; &lt;shared-registry-address&gt;</code></pre>



<p>To finish, you need to push the created image into the shared registry.</p>



<pre class="wp-block-code"><code class="">docker tag fastapi-spam-classification:latest &lt;shared-registry-address&gt;/fastapi-spam-classification:latest</code></pre>



<pre class="wp-block-code"><code class="">docker push &lt;shared-registry-address&gt;/fastapi-spam-classification:latest</code></pre>



<p>Once you have pushed your custom Docker image into the shared registry, you are ready to launch your app 🚀 !</p>



<h2 class="wp-block-heading">Launch the AI Deploy app</h2>



<p>The following command starts a new job running your <strong>FastAPI</strong> application.</p>



<pre class="wp-block-code"><code class="">ovhai app run \
      --default-http-port 8000 \
      --cpu 4 \
      &lt;shared-registry-address&gt;/fastapi-spam-classification:latest</code></pre>



<h3 class="wp-block-heading">Choose the compute resources</h3>



<p>First, you can choose either the number of GPUs or the number of CPUs for your app.</p>



<p><code><strong>--cpu 4</strong></code>&nbsp;indicates that we request 4 CPUs for that app.</p>



<h3 class="wp-block-heading">Make the app public</h3>



<p>Finally, if you want your app to be reachable without any authentication, add the&nbsp;<code><strong>--unsecure-http</strong></code>&nbsp;attribute when launching it.</p>
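<p>For example, the launch command from the previous section could become the following (a sketch; check the <code>ovhai</code> CLI help for the exact options available in your version):</p>



<pre class="wp-block-code"><code class="">ovhai app run \
      --default-http-port 8000 \
      --cpu 4 \
      --unsecure-http \
      &lt;shared-registry-address&gt;/fastapi-spam-classification:latest</code></pre>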






<h2 class="wp-block-heading">Conclusion</h2>



<p>Well done 🎉&nbsp;! You have learned how to build your&nbsp;<strong>own Docker image</strong>&nbsp;for a dedicated&nbsp;<strong>spam classification API</strong>!</p>



<p>You have also been able to deploy this app thanks to&nbsp;<strong>OVHcloud’s AI Deploy</strong>&nbsp;tool.</p>



<h3 class="wp-block-heading" id="want-to-find-out-more">Want to find out more?</h3>



<h5 class="wp-block-heading"><strong>Notebook</strong></h5>



<p>You want to access the notebook? Refer to the&nbsp;<a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/text-classification/miniconda/spam-classifier/notebook-spam-classifier.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>.</p>



<h5 class="wp-block-heading"><strong>App</strong></h5>



<p>You want to access the full code to create the <strong>FastAPI</strong> API? Refer to the&nbsp;<a href="https://github.com/ovh/ai-training-examples/tree/main/apps/fastapi/spam-classifier-api" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.<br><br>To launch and test this app with&nbsp;<strong>AI Deploy</strong>, please refer to&nbsp;our&nbsp;<a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/tuto-fastapi-spam-classifier/" data-wpel-link="exclude">documentation</a>.</p>



<h2 class="wp-block-heading">References</h2>



<ul class="wp-block-list">
<li><a href="https://towardsdatascience.com/how-to-run-a-data-science-project-in-a-docker-container-2ab1a3baa889" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">How to Run a Data Science Project in a Docker Container</a></li>



<li><a href="https://towardsdatascience.com/step-by-step-approach-to-build-your-machine-learning-api-using-fast-api-21bd32f2bbdb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Step-by-step Approach to Build Your Machine Learning API Using Fast API</a></li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdeploy-a-custom-docker-image-for-data-science-project-a-spam-classifier-with-fastapi-part-3%2F&amp;action_name=Deploy%20a%20custom%20Docker%20image%20for%20Data%20Science%20project%20%E2%80%93%20A%20spam%20classifier%20with%20FastAPI%20%28Part%203%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to build a Speech-To-Text Application with Python (3/3)</title>
		<link>https://blog.ovhcloud.com/how-to-build-a-speech-to-text-application-with-python-3-3/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Mon, 26 Dec 2022 14:22:42 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Apps]]></category>
		<category><![CDATA[AI Solutions]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[Streamlit]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=23823</guid>

					<description><![CDATA[A tutorial to create and build your own Speech-To-Text Application with Python. At the end of this third article, your Speech-To-Text Application will offer many new features such as speaker differentiation, summarization, video subtitles generation, audio trimming, and others! Final code of the app is available in our dedicated GitHub repository. Overview of our final [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-build-a-speech-to-text-application-with-python-3-3%2F&amp;action_name=How%20to%20build%20a%20Speech-To-Text%20Application%20with%20Python%20%283%2F3%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p id="block-972dc647-4202-432e-86f0-434b7dd789f0"><em>A tutorial to create and build your own <strong>Speech-To-Text Application</strong></em> with Python.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-3-1024x576.png" alt="speech to text app image3" class="wp-image-24060" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-3-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-3-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-3-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-3-1536x864.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-3.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p id="block-799b7c8e-3686-469c-b530-a568cfcce605">At the end of this <strong>third article</strong>, your Speech-To-Text Application will offer <strong>many new features</strong> such as speaker differentiation, summarization, video subtitles generation, audio trimming, and others!</p>



<p id="block-b8ed3876-3e4c-42cb-b83e-4e4f3c9b13b3"><em>Final code of the app is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/tree/main/apps/streamlit/speech-to-text" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.</em></p>



<h3 class="wp-block-heading" id="block-da716ed2-9734-494e-be61-539249c19438">Overview of our final Speech to Text Application</h3>



<figure class="wp-block-image aligncenter" id="block-2d8b0805-62db-4814-b015-efba81a8520a"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-1024x575.png" alt="Overview of the speech to text application"/></figure>



<p class="has-text-align-center" id="block-865fd283-6e3d-47f5-96cb-fdf47b60ffd2"><em>Overview of our final Speech-To-Text application</em></p>



<h3 class="wp-block-heading" id="block-1df7dcfc-8052-426f-ac67-ebc251d14185">Objective</h3>



<p id="block-812edfb1-6bba-47c0-98f0-830353e3d5c6">In the <a href="https://blog.ovhcloud.com/how-to-build-a-speech-to-text-application-with-python-2-3/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">previous article</a>, we created a form where the user can select the options they want to interact with. </p>



<p id="block-812edfb1-6bba-47c0-98f0-830353e3d5c6">Now that this form is created, it&#8217;s time to <strong>deploy the features</strong>!</p>



<p id="block-2fa073cc-8250-4740-a707-1a1b67dff94d">This article is organized as follows:</p>



<ul class="wp-block-list">
<li>Trim an audio</li>



<li>Punctuate the transcript</li>



<li>Differentiate speakers with diarization</li>



<li>Display the transcript correctly</li>



<li>Rename speakers</li>



<li>Create subtitles for videos (<em>.SRT</em>)</li>



<li>Update old code</li>
</ul>



<p id="block-d4e59d83-200a-4e2b-b502-d5fe00941adb"><em>⚠️ Since this article uses code already explained in the previous <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/natural-language-processing/speech-to-text/miniconda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">notebook tutorials</a>, we will not re-explain it in detail here. We therefore recommend that you read the notebooks first.</em></p>



<h3 class="wp-block-heading" id="block-1d231b87-2963-4dba-9df6-2f7dc4acb93c">Trim an audio ✂️</h3>



<p>The first option we are going to add is the ability to trim an audio file. </p>



<p>Indeed, if the user&#8217;s audio file is <strong>several minutes long</strong>, it is possible that the user only wants to <strong>transcribe a part of it</strong> to save some time. This is where the <strong>sliders</strong> of our form become useful. They allow the user to <strong>change default start &amp; end values</strong>, which determine which part of the audio file is transcribed.</p>



<p><em>For example, if the user&#8217;s file is 10 minutes long, they can use the sliders to indicate that they only want to transcribe the [00:30 -&gt; 02:30] part, instead of the full audio file.</em></p>
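<p>As a minimal sketch of what this trimming amounts to (assuming <em>pydub</em> is installed, as in the notebooks), an <em>AudioSegment</em> can simply be sliced in milliseconds:</p>



<pre class="wp-block-code"><code class="">from pydub import AudioSegment

# Stand-in for the user's file: 10 minutes of silence
audio = AudioSegment.silent(duration=10 * 60 * 1000)

# pydub slices AudioSegments in milliseconds: keep only the [00:30 -&gt; 02:30] part
trimmed = audio[30 * 1000 : 150 * 1000]

print(len(trimmed))  # duration of the trimmed segment, in ms</code></pre>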



<p id="block-837f49db-342e-4f92-8f7d-ae9b97ee51c2">⚠️ With this functionality, we must <strong>check the values </strong>set by the user! Indeed, imagine that the user selects an <em>end</em> value which is lower than the <em>start</em> one (e.g. the transcript would run from start=40s to end=20s); this would be problematic.</p>



<p id="block-6bcd6e87-53cf-40d4-8b94-9f97c988587a">This is why you need to <strong>add the following function</strong> to your code, to rectify the potential errors:</p>



<pre id="block-00415e75-ed9c-4257-a5d2-0d033f419082" class="wp-block-code"><code class="">def correct_values(start, end, audio_length):
    """
    Start or/and end value(s) can be in conflict, so we check these values
    :param start: int value (s) given by st.slider() (fixed by user)
    :param end: int value (s) given by st.slider() (fixed by user)
    :param audio_length: audio duration (s)
    :return: approved values
    """
    # Start &amp; end Values need to be checked

    if start &gt;= audio_length or start &gt;= end:
        start = 0
        st.write("Start value has been set to 0s because of conflicts with other values")

    if end &gt; audio_length or end == 0:
        end = audio_length
        st.write("End value has been set to maximum value because of conflicts with other values")

    return start, end</code></pre>



<p id="block-a1ab0c76-ddb2-4abd-99ef-15f001529dae">If one of the values has been changed, we immediately <strong>inform the user</strong> with a <em>st.write().</em></p>



<p>We will call this function in the <em>transcription()</em> function, that we will rewrite at the end of this tutorial.</p>



<h3 class="wp-block-heading" id="block-50095355-4c37-4c0a-9ade-0ca44e86fde9">Split a text</h3>



<p id="block-b2616dcc-f867-4200-b95d-46b386d7efe1">If you have read the notebooks, you probably remember that some models (punctuation &amp; summarization) have <strong>input size limitations</strong>.</p>



<p id="block-62890aae-06a4-43e0-a8eb-2a13f2d57fe6">Let&#8217;s <strong>reuse the <em>split_text()</em> function</strong> used in the notebooks, which allows us to send our whole transcript to these models in small text blocks, limited to a <em>max_size</em> number of characters:</p>



<pre id="block-0669f22d-2b45-4053-a806-07c9e897fc8e" class="wp-block-code"><code class="">def split_text(my_text, max_size):
    """
    Split a text
    Maximum sequence length for this model is max_size.
    If the transcript is longer, it needs to be split by the nearest possible value to max_size.
    To avoid cutting words, we will cut on "." characters, and " " if there is not "."

    :return: split text
    """

    cut2 = max_size

    # First, we get indexes of "."
    my_split_text_list = []
    nearest_index = 0
    length = len(my_text)
    # We split the transcript in text blocks of size &lt;= max_size.
    if cut2 &gt;= length:
        my_split_text_list.append(my_text)
    else:
        while cut2 &lt;= length:
            cut1 = nearest_index
            cut2 = nearest_index + max_size
            # Find the best index to split

            dots_indexes = [index for index, char in enumerate(my_text[cut1:cut2]) if
                            char == "."]
            if dots_indexes != []:
                nearest_index = max(dots_indexes) + 1 + cut1
            else:
                spaces_indexes = [index for index, char in enumerate(my_text[cut1:cut2]) if
                                  char == " "]
                if spaces_indexes != []:
                    nearest_index = max(spaces_indexes) + 1 + cut1
                else:
                    nearest_index = cut2
            my_split_text_list.append(my_text[cut1: nearest_index])

        # Append the remaining tail of the text, if any
        if nearest_index &lt; length:
            my_split_text_list.append(my_text[nearest_index:])

    return my_split_text_list</code></pre>



<h3 class="wp-block-heading" id="block-aeff41dd-fe6b-4b31-ad7a-e2916e00657d">Punctuate the transcript</h3>



<p>Now, we need to add the function that allows us to <strong>send a <em>transcript</em> to the punctuation model</strong> in order to punctuate it:</p>



<pre id="block-6316f13a-3018-4a13-8bd6-b35a593c82e0" class="wp-block-code"><code class="">def add_punctuation(t5_model, t5_tokenizer, transcript):
    """
    Punctuate a transcript
    transcript: string limited to 512 characters
    :return: Punctuated and improved (corrected) transcript
    """

    input_text = "fix: { " + transcript + " } &lt;/s&gt;"

    input_ids = t5_tokenizer.encode(input_text, return_tensors="pt", max_length=10000, truncation=True,
                                    add_special_tokens=True)

    outputs = t5_model.generate(
        input_ids=input_ids,
        max_length=256,
        num_beams=4,
        repetition_penalty=1.0,
        length_penalty=1.0,
        early_stopping=True
    )

    transcript = t5_tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

    return transcript</code></pre>



<p>The punctuation feature is now ready; we will call these functions later.<br>For the summarization model, there is nothing more to do at this stage. </p>



<h3 class="wp-block-heading" id="block-ed0fa2ed-80f4-4821-9931-68f055e12490">Differentiate speakers with diarization</h3>



<p id="block-7430afcf-a8eb-45a7-bb0c-ebf0e1c77767">Now, let&#8217;s reuse all the <strong>diarization functions</strong> studied in the notebook tutorials, so we can differentiate speakers during a conversation.</p>



<p id="block-42bf8e95-825c-4e63-bfe9-877000b5f8f2"><strong>Convert <em>mp3/mp4</em> files to </strong><em><strong>.wav</strong> </em></p>



<p id="block-42bf8e95-825c-4e63-bfe9-877000b5f8f2"><em>Remember pyannote&#8217;s diarization only accepts .wav files as input.</em></p>



<pre id="block-71aa8771-0c17-40c7-88f1-a2880f99755b" class="wp-block-code"><code class="">def convert_file_to_wav(aud_seg, filename):
    """
    Convert a mp3/mp4 in a wav format
    Needs to be modified if you want to convert a format which contains less or more than 3 letters

    :param aud_seg: pydub.AudioSegment
    :param filename: name of the file
    :return: name of the converted file
    """
    filename = "../data/my_wav_file_" + filename[:-3] + "wav"
    aud_seg.export(filename, format="wav")

    newaudio = AudioSegment.from_file(filename)

    return newaudio, filename</code></pre>



<p id="block-4bdf29a7-df3e-485c-a8a6-f7be0ef8bd58"><strong>Get diarization of an audio file</strong></p>



<p><em>The following function allows you to diarize an audio file</em>.</p>



<pre id="block-85e83a22-7dbb-41de-b4fd-ceea38b5086e" class="wp-block-code"><code class="">def get_diarization(dia_pipeline, filename):
    """
    Diarize an audio (find numbers of speakers, when they speak, ...)
    :param dia_pipeline: Pyannote's library (diarization pipeline)
    :param filename: name of a wav audio file
    :return: str list containing audio's diarization time intervals
    """
    # Get diarization of the audio
    diarization = dia_pipeline({'audio': filename})
    listmapping = diarization.labels()
    listnewmapping = []

    # Rename default speakers' names (Default is A, B, ...), we want Speaker0, Speaker1, ...
    number_of_speakers = len(listmapping)
    for i in range(number_of_speakers):
        listnewmapping.append("Speaker" + str(i))

    mapping_dict = dict(zip(listmapping, listnewmapping))
    diarization.rename_labels(mapping_dict, copy=False)
    # copy set to False so we don't create a new annotation, we replace the actual one

    return diarization, number_of_speakers</code></pre>



<p id="block-15d919bb-e051-452f-a69e-be1d61908437"><strong>Convert diarization results to timedelta objects</strong></p>



<p><em>This conversion makes it easy to manipulate the results. </em></p>



<pre id="block-32987985-e79c-415f-806a-37e8f1721fa3" class="wp-block-code"><code class="">def convert_str_diarlist_to_timedelta(diarization_result):
    """
    Extract from Diarization result the given speakers with their respective speaking times and transform them in pandas timedelta objects
    :param diarization_result: result of diarization
    :return: list with timedelta intervals and their respective speaker
    """

    # get speaking intervals from diarization
    segments = diarization_result.for_json()["content"]
    diarization_timestamps = []
    for sample in segments:
        # Convert segment in a pd.Timedelta object
        new_seg = [pd.Timedelta(seconds=round(sample["segment"]["start"], 2)),
                   pd.Timedelta(seconds=round(sample["segment"]["end"], 2)), sample["label"]]
        # Start and end = speaking duration
        # label = who is speaking
        diarization_timestamps.append(new_seg)

    return diarization_timestamps</code></pre>



<p id="block-3c5ce61f-c4a2-4505-bd1f-d1844eee8949"><strong>Merge the diarization segments</strong> <strong>that follow each other and that mention the same speaker</strong></p>



<p id="block-3c5ce61f-c4a2-4505-bd1f-d1844eee8949"><em>This will reduce the number of audio segments we need to create, and will produce fewer, longer transcript segments, which are more pleasant for the user to read.</em></p>



<pre id="block-346d48b0-6c06-4208-b86d-d5424f222e32" class="wp-block-code"><code class="">def merge_speaker_times(diarization_timestamps, max_space, srt_token):
    """
    Merge near times for each detected speaker (Same speaker during 1-2s and 3-4s -&gt; Same speaker during 1-4s)
    :param diarization_timestamps: diarization list
    :param max_space: Maximum temporal distance between two silences
    :param srt_token: Enable/Disable generate srt file (choice fixed by user)
    :return: list with timedelta intervals and their respective speaker
    """

    if not srt_token:
        threshold = pd.Timedelta(seconds=max_space/1000)

        index = 0
        length = len(diarization_timestamps) - 1

        while index &lt; length:
            if diarization_timestamps[index + 1][2] == diarization_timestamps[index][2] and \
                    diarization_timestamps[index + 1][0] - threshold &lt;= diarization_timestamps[index][1]:
                diarization_timestamps[index][1] = diarization_timestamps[index + 1][1]
                del diarization_timestamps[index + 1]
                length -= 1
            else:
                index += 1
    return diarization_timestamps</code></pre>



<p id="block-e36e9996-2809-450a-834b-0482eca3299f"><strong>Extend timestamps given by the diarization to avoid word cutting</strong></p>



<p id="block-b943d9c3-e469-4e9d-b7b7-23d5aa1235e7">Imagine we have a segment like [00:01:20 &#8211;&gt; 00:01:25], followed by [00:01:27 &#8211;&gt; 00:01:30].</p>



<p id="block-c7909bb8-7b9e-471e-a928-75d92375fedc">The diarization may not work perfectly, leaving some sound outside the segments (here, the missing sound lies between 00:01:25 and 00:01:27). The transcription model will then have difficulty understanding what is being said in these segments.</p>



<p id="block-c7909bb8-7b9e-471e-a928-75d92375fedc">➡️ The solution consists in setting the end of the first segment and the start of the second one to 00:01:26, the midpoint of these two values.</p>



<pre id="block-df4216a7-0112-425a-b763-9acabdf5d7d9" class="wp-block-code"><code class="">def extending_timestamps(new_diarization_timestamps):
    """
    Extend timestamps between each diarization timestamp if possible, so we avoid word cutting
    :param new_diarization_timestamps: list
    :return: list with merged times
    """

    for i in range(1, len(new_diarization_timestamps)):
        if new_diarization_timestamps[i][0] - new_diarization_timestamps[i - 1][1] &lt;= timedelta(milliseconds=3000) and new_diarization_timestamps[i][0] - new_diarization_timestamps[i - 1][1] &gt;= timedelta(milliseconds=100):
            middle = (new_diarization_timestamps[i][0] - new_diarization_timestamps[i - 1][1]) / 2
            new_diarization_timestamps[i][0] -= middle
            new_diarization_timestamps[i - 1][1] += middle

    # Converting list so we have a milliseconds format
    for elt in new_diarization_timestamps:
        elt[0] = elt[0].total_seconds() * 1000
        elt[1] = elt[1].total_seconds() * 1000

    return new_diarization_timestamps</code></pre>
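<p>Applied to the example from earlier, the midpoint adjustment looks like this (a minimal standalone sketch, assuming <em>[start, end]</em> timedelta pairs):</p>

```python
from datetime import timedelta

# Two segments with a 2s hole between them (00:01:25 -> 00:01:27)
segments = [
    [timedelta(minutes=1, seconds=20), timedelta(minutes=1, seconds=25)],
    [timedelta(minutes=1, seconds=27), timedelta(minutes=1, seconds=30)],
]

gap = segments[1][0] - segments[0][1]
if timedelta(milliseconds=100) <= gap <= timedelta(milliseconds=3000):
    middle = gap / 2
    segments[0][1] += middle  # first segment now ends at 00:01:26
    segments[1][0] -= middle  # second segment now starts at 00:01:26
```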



<p><strong>Create &amp; Optimize the subtitles</strong></p>



<p>Some people naturally speak very quickly, and conversations can sometimes be heated. In both cases, there is a good chance the transcribed text will be very dense and unsuitable for displaying subtitles (too much text on screen prevents viewers from watching the video). </p>



<p>We will therefore define the following function. Its role is to split a speech segment in two when its text is judged too long.</p>



<pre class="wp-block-code"><code class="">def optimize_subtitles(transcription, srt_index, sub_start, sub_end, srt_text):
    """
    Optimize the subtitles (avoid a too long reading when many words are said in a short time)
    :param transcription: transcript generated for an audio chunk
    :param srt_index: Numeric counter that identifies each sequential subtitle
    :param sub_start: beginning of the transcript
    :param sub_end: end of the transcript
    :param srt_text: generated .srt transcript
    """

    transcription_length = len(transcription)

    # Length of the transcript should be limited to about 42 characters per line to avoid this problem
    if transcription_length &gt; 42:
        # Split the timestamp and its transcript in two parts
        # Get the middle timestamp
        diff = (timedelta(milliseconds=sub_end) - timedelta(milliseconds=sub_start)) / 2
        middle_timestamp = str(timedelta(milliseconds=sub_start) + diff).split(".")[0]

        # Get the closest middle index to a space (we don't divide transcription_length/2 to avoid cutting a word)
        space_indexes = [pos for pos, char in enumerate(transcription) if char == " "]
        nearest_index = min(space_indexes, key=lambda x: abs(x - transcription_length / 2))

        # First transcript part
        first_transcript = transcription[:nearest_index]

        # Second transcript part
        second_transcript = transcription[nearest_index + 1:]

        # Add both transcript parts to the srt_text
        srt_text += str(srt_index) + "\n" + str(timedelta(milliseconds=sub_start)).split(".")[0] + " --&gt; " + middle_timestamp + "\n" + first_transcript + "\n\n"
        srt_index += 1
        srt_text += str(srt_index) + "\n" + middle_timestamp + " --&gt; " + str(timedelta(milliseconds=sub_end)).split(".")[0] + "\n" + second_transcript + "\n\n"
        srt_index += 1
    else:
        # Add transcript without operations
        srt_text += str(srt_index) + "\n" + str(timedelta(milliseconds=sub_start)).split(".")[0] + " --&gt; " + str(timedelta(milliseconds=sub_end)).split(".")[0] + "\n" + transcription + "\n\n"

    return srt_text, srt_index</code></pre>
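<p>The word-boundary split at the heart of this function can be demonstrated in isolation (a standalone sketch using the same nearest-space logic as above):</p>

```python
transcription = "This sentence is clearly longer than forty-two characters and needs splitting"
assert len(transcription) > 42

# Find the space closest to the middle so no word is cut in half
space_indexes = [pos for pos, char in enumerate(transcription) if char == " "]
nearest_index = min(space_indexes, key=lambda x: abs(x - len(transcription) / 2))

first_part = transcription[:nearest_index]    # first subtitle line
second_part = transcription[nearest_index + 1:]  # second subtitle line
```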



<p id="block-d0254180-773e-45b3-b19e-c42153f6f102"><strong>Global function which performs the whole diarization</strong> <strong>action</strong></p>



<p><em>This function just calls all the previous diarization functions to perform it</em></p>



<pre id="block-a4e19f59-8947-4736-93d7-89fdcc2ce9f8" class="wp-block-code"><code class="">def diarization_treatment(filename, dia_pipeline, max_space, srt_token):
    """
    Launch the whole diarization process to get speakers time intervals as pandas timedelta objects
    :param filename: name of the audio file
    :param dia_pipeline: Diarization Model (Differentiate speakers)
    :param max_space: Maximum temporal distance between two silences
    :param srt_token: Enable/Disable generate srt file (choice fixed by user)
    :return: speakers time intervals list and number of different detected speakers
    """
    
    # initialization
    diarization_timestamps = []

    # whole diarization process
    diarization, number_of_speakers = get_diarization(dia_pipeline, filename)

    if len(diarization) &gt; 0:
        diarization_timestamps = convert_str_diarlist_to_timedelta(diarization)
        diarization_timestamps = merge_speaker_times(diarization_timestamps, max_space, srt_token)
        diarization_timestamps = extending_timestamps(diarization_timestamps)

    return diarization_timestamps, number_of_speakers</code></pre>



<p id="block-76f3ec1f-9b23-4948-a124-51534febb30c"><strong>Launch diarization mode</strong></p>



<p>Previously, we were systematically running the <em>transcription_non_diarization()</em> function, which is based on the <em>silence detection method</em>.</p>



<p>But now that the user has the option to select the diarization option in the form, it is time to <strong>write our transcription_diarization() function</strong>. </p>



<p>The only difference between the two is that we replace the silence treatment with the treatment of the diarization results.</p>



<pre id="block-e9b9cf00-5ac0-4381-877c-430dc393bc79" class="wp-block-code"><code class="">def transcription_diarization(filename, diarization_timestamps, stt_model, stt_tokenizer, diarization_token, srt_token,
                              summarize_token, timestamps_token, myaudio, start, save_result, txt_text, srt_text):
    """
    Performs transcription with the diarization mode
    :param filename: name of the audio file
    :param diarization_timestamps: timestamps of each audio part (ex 10 to 50 secs)
    :param stt_model: Speech to text model
    :param stt_tokenizer: Speech to text model's tokenizer
    :param diarization_token: Differentiate or not the speakers (choice fixed by user)
    :param srt_token: Enable/Disable generate srt file (choice fixed by user)
    :param summarize_token: Summarize or not the transcript (choice fixed by user)
    :param timestamps_token: Display and save or not the timestamps (choice fixed by user)
    :param myaudio: AudioSegment file
    :param start: int value (s) given by st.slider() (fixed by user)
    :param save_result: whole process
    :param txt_text: generated .txt transcript
    :param srt_text: generated .srt transcript
    :return: results of transcribing action
    """
    # Numeric counter that identifies each sequential subtitle
    srt_index = 1

    # Handle a rare case: with a single segment, we get a flat list instead of a list of lists
    if not isinstance(diarization_timestamps[0], list):
        diarization_timestamps = [diarization_timestamps]

    # Transcribe each audio chunk (from timestamp to timestamp) and display transcript
    for index, elt in enumerate(diarization_timestamps):
        sub_start = elt[0]
        sub_end = elt[1]

        transcription = transcribe_audio_part(filename, stt_model, stt_tokenizer, myaudio, sub_start, sub_end,
                                              index)

        # The initial audio was trimmed with the start &amp; end values, so its timestamps begin at 0s
        # We shift each timestamp by +start*1000 (ms) to realign it with the original audio
        if transcription != "":
            save_result, txt_text, srt_text, srt_index = display_transcription(diarization_token, summarize_token,
                                                                    srt_token, timestamps_token,
                                                                    transcription, save_result, txt_text,
                                                                    srt_text,
                                                                    srt_index, sub_start + start * 1000,
                                                                    sub_end + start * 1000, elt)
    return save_result, txt_text, srt_text</code></pre>



<p>The <em>display_transcription()</em> function currently returns 3 values, contrary to what we have just indicated in the <em>transcription_diarization()</em> function. Don&#8217;t worry, we will fix the <em>display_transcription()</em> function in a few moments.</p>



<p>You will also need the function below. It allows the user to validate his access token for the diarization model and then reach the home page of our app. Indeed, we are going to create another default page, which will invite the user to enter his token if he wishes.</p>



<pre class="wp-block-code"><code class="">def confirm_token_change(hf_token, page_index):
    """
    A function that saves the hugging face token entered by the user.
    It also updates the page index variable so we can indicate we now want to display the home page instead of the token page
    :param hf_token: user's token
    :param page_index: number that represents the home page index (mentioned in the main.py file)
    """
    update_session_state("my_HF_token", hf_token)
    update_session_state("page_index", page_index)</code></pre>



<h3 class="wp-block-heading" id="block-e5d0971f-6a22-467e-ba4a-bc5e19572c83">Display the transcript correctly</h3>



<p id="block-8bf020a4-d6a8-4681-afac-6609255df1b6">Once the transcript is obtained, we must <strong>display</strong> it correctly, <strong>depending on the options</strong> the user has selected.</p>



<p id="block-4938f147-7b32-453e-81d8-09a15fb6d99b">For example, if the user has activated diarization, we need to write the identified speaker before each transcript, like the following result:</p>



<p><em>Speaker1 : &#8220;I would like a cup of tea&#8221;</em></p>



<p id="block-4938f147-7b32-453e-81d8-09a15fb6d99b">This is different from a classic <em>silences detection</em> method, which only writes the transcript, without any names!</p>



<p id="block-c5971940-a2fd-4c7c-9e52-aee8049adce4">The same question arises for the timestamps: we must know whether to display them or not. This gives us <strong>4 different cases</strong>:</p>



<ul class="wp-block-list">
<li>diarization with timestamps, named <strong>DIA_TS</strong></li>



<li>diarization without timestamps, named <strong>DIA</strong></li>



<li>non_diarization with timestamps, named <strong>NODIA_TS</strong></li>



<li>non_diarization without timestamps, named <strong>NODIA</strong></li>
</ul>
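<p>These four modes can be derived from the two user choices, for example with a small helper (a hypothetical sketch; the helper name <em>get_chosen_mode</em> is ours, while the app stores the result in the <em>chosen_mode</em> session state variable):</p>

```python
def get_chosen_mode(diarization_token, timestamps_token):
    """Map the two user options to one of the four mode names."""
    mode = "DIA" if diarization_token else "NODIA"
    if timestamps_token:
        mode += "_TS"  # timestamps requested: append the _TS suffix
    return mode
```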



<p id="block-0620b0c4-2413-4623-90fa-706a1c50c8d6">To display the correct elements according to the chosen mode, let&#8217;s <strong>modify the <em>display_transcription()</em> function</strong>. </p>



<p id="block-0620b0c4-2413-4623-90fa-706a1c50c8d6"><strong>Replace the old one</strong> by the following code:</p>



<pre id="block-e20b214e-af00-4c59-a5e5-6196a14445b1" class="wp-block-code"><code class="">def display_transcription(diarization_token, summarize_token, srt_token, timestamps_token, transcription, save_result, txt_text, srt_text, srt_index, sub_start, sub_end, elt=None):
    """
    Display results
    :param diarization_token: Differentiate or not the speakers (choice fixed by user)
    :param summarize_token: Summarize or not the transcript (choice fixed by user)
    :param srt_token: Enable/Disable generate srt file (choice fixed by user)
    :param timestamps_token: Display and save or not the timestamps (choice fixed by user)
    :param transcription: transcript of the considered audio
    :param save_result: whole process
    :param txt_text: generated .txt transcript
    :param srt_text: generated .srt transcript
    :param srt_index : numeric counter that identifies each sequential subtitle
    :param sub_start: start value (s) of the considered audio part to transcribe
    :param sub_end: end value (s) of the considered audio part to transcribe
    :param elt: timestamp (diarization case only, otherwise elt = None)
    """
    # Display will be different depending on the mode (dia, no dia, dia_ts, nodia_ts)
    
    # diarization mode
    if diarization_token:
        if summarize_token:
            update_session_state("summary", transcription + " ", concatenate_token=True)
        
        if not timestamps_token:
            temp_transcription = elt[2] + " : " + transcription
            st.write(temp_transcription + "\n\n")

            save_result.append([int(elt[2][-1]), elt[2], " : " + transcription])
            
        elif timestamps_token:
            temp_timestamps = str(timedelta(milliseconds=sub_start)).split(".")[0] + " --&gt; " + \
                              str(timedelta(milliseconds=sub_end)).split(".")[0] + "\n"
            temp_transcription = elt[2] + " : " + transcription
            temp_list = [temp_timestamps, int(elt[2][-1]), elt[2], " : " + transcription, int(sub_start / 1000)]
            save_result.append(temp_list)
            st.button(temp_timestamps, on_click=click_timestamp_btn, args=(sub_start,))
            st.write(temp_transcription + "\n\n")
            
            if srt_token:
                srt_text, srt_index = optimize_subtitles(transcription, srt_index, sub_start, sub_end, srt_text)


    # Non diarization case
    else:
        if not timestamps_token:
            save_result.append([transcription])
            st.write(transcription + "\n\n")
            
        else:
            temp_timestamps = str(timedelta(milliseconds=sub_start)).split(".")[0] + " --&gt; " + \
                              str(timedelta(milliseconds=sub_end)).split(".")[0] + "\n"
            temp_list = [temp_timestamps, transcription, int(sub_start / 1000)]
            save_result.append(temp_list)
            st.button(temp_timestamps, on_click=click_timestamp_btn, args=(sub_start,))
            st.write(transcription + "\n\n")
            
            if srt_token:
                srt_text, srt_index = optimize_subtitles(transcription, srt_index, sub_start, sub_end, srt_text)

        txt_text += transcription + " "  # So x seconds sentences are separated

    return save_result, txt_text, srt_text, srt_index</code></pre>



<p id="block-0851dc0b-f347-4738-ab2d-5b9c0227c253">We also need to <strong>add the following function</strong>, which allows us to create our <em>txt_text</em> variable from the <em>st.session.state[&#8216;process&#8217;]</em> variable in a diarization case. This is necessary because, in addition to displaying the spoken sentence (the transcript part), we must display the identity of the speaker, and possibly the timestamps, all of which are stored in this session state variable. </p>



<pre id="block-618a8493-cb76-4fe4-9fce-24287e9eb588" class="wp-block-code"><code class="">def create_txt_text_from_process(punctuation_token=False, t5_model=None, t5_tokenizer=None):
    """
    If we are in a diarization case (differentiate speakers), we create txt_text from st.session.state['process']
    There is a lot of information in the process variable, but we only extract the identity of the speaker and
    the sentence spoken, as in a non-diarization case.
    :param punctuation_token: Punctuate or not the transcript (choice fixed by user)
    :param t5_model: T5 Model (Auto punctuation model)
    :param t5_tokenizer: T5’s Tokenizer (Auto punctuation model's tokenizer)
    :return: Final transcript (without timestamps)
    """
    txt_text = ""
    # The information to be extracted is different according to the chosen mode
    if punctuation_token:
        with st.spinner("Transcription is finished! Let us punctuate your audio"):
            if st.session_state["chosen_mode"] == "DIA":
                for elt in st.session_state["process"]:
                    # [2:] don't want ": text" but only the "text"
                    text_to_punctuate = elt[2][2:]
                    if len(text_to_punctuate) &gt;= 512:
                        text_to_punctutate_list = split_text(text_to_punctuate, 512)
                        punctuated_text = ""
                        for split_text_to_punctuate in text_to_punctutate_list:
                            punctuated_text += add_punctuation(t5_model, t5_tokenizer, split_text_to_punctuate)
                    else:
                        punctuated_text = add_punctuation(t5_model, t5_tokenizer, text_to_punctuate)

                    txt_text += elt[1] + " : " + punctuated_text + '\n\n'

            elif st.session_state["chosen_mode"] == "DIA_TS":
                for elt in st.session_state["process"]:
                    text_to_punctuate = elt[3][2:]
                    if len(text_to_punctuate) &gt;= 512:
                        text_to_punctutate_list = split_text(text_to_punctuate, 512)
                        punctuated_text = ""
                        for split_text_to_punctuate in text_to_punctutate_list:
                            punctuated_text += add_punctuation(t5_model, t5_tokenizer, split_text_to_punctuate)
                    else:
                        punctuated_text = add_punctuation(t5_model, t5_tokenizer, text_to_punctuate)

                    txt_text += elt[2] + " : " + punctuated_text + '\n\n'
    else:
        if st.session_state["chosen_mode"] == "DIA":
            for elt in st.session_state["process"]:
                txt_text += elt[1] + elt[2] + '\n\n'

        elif st.session_state["chosen_mode"] == "DIA_TS":
            for elt in st.session_state["process"]:
                txt_text += elt[2] + elt[3] + '\n\n'

    return txt_text</code></pre>



<p>Also for the purpose of correct display, we need to <strong>update the <em>display_results()</em> function</strong> so that it adapts the display to the selected mode among DIA_TS, DIA, NODIA_TS, NODIA. This will also avoid <em>&#8216;List index out of range&#8217; </em>errors, as the <em>process</em> variable does not contain the same number of elements depending on the mode used.</p>



<pre class="wp-block-code"><code class=""># Update the following function code
def display_results():

    # Add a button to return to the main page
    st.button("Load an other file", on_click=update_session_state, args=("page_index", 0,))

    # Display results
    st.audio(st.session_state['audio_file'], start_time=st.session_state["start_time"])

    # Display results of transcript by steps
    if st.session_state["process"] != []:

        if st.session_state["chosen_mode"] == "NODIA":  # Non diarization, non timestamps case
            for elt in (st.session_state['process']):
                st.write(elt[0])

        elif st.session_state["chosen_mode"] == "DIA":  # Diarization without timestamps case
            for elt in (st.session_state['process']):
                st.write(elt[1] + elt[2])

        elif st.session_state["chosen_mode"] == "NODIA_TS":  # Non diarization with timestamps case
            for elt in (st.session_state['process']):
                st.button(elt[0], on_click=update_session_state, args=("start_time", elt[2],))
                st.write(elt[1])

        elif st.session_state["chosen_mode"] == "DIA_TS":  # Diarization with timestamps case
            for elt in (st.session_state['process']):
                st.button(elt[0], on_click=update_session_state, args=("start_time", elt[4],))
                st.write(elt[2] + elt[3])

    # Display final text
    st.subheader("Final text is")
    st.write(st.session_state["txt_transcript"])

    # Display Summary
    if st.session_state["summary"] != "":
        with st.expander("Summary"):
            st.write(st.session_state["summary"])

    # Display the buttons in a list to avoid having empty columns (explained in the transcription() function)
    col1, col2, col3, col4 = st.columns(4)
    col_list = [col1, col2, col3, col4]
    col_index = 0

    for elt in st.session_state["btn_token_list"]:
        if elt[0]:
            mycol = col_list[col_index]
            if elt[1] == "useless_txt_token":
                # Download your transcription.txt
                with mycol:
                    st.download_button("Download as TXT", st.session_state["txt_transcript"],
                                       file_name="my_transcription.txt")

            elif elt[1] == "srt_token":
                # Download your transcription.srt
                with mycol:
                    st.download_button("Download as SRT", st.session_state["srt_txt"], file_name="my_transcription.srt")
            elif elt[1] == "dia_token":
                with mycol:
                    # Rename the speakers detected in your audio
                    st.button("Rename Speakers", on_click=update_session_state, args=("page_index", 2,))

            elif elt[1] == "summarize_token":
                with mycol:
                    st.download_button("Download Summary", st.session_state["summary"], file_name="my_summary.txt")
            col_index += 1</code></pre>



<p>We then display <strong>4 buttons</strong> that allow you to <strong>interact with the implemented functions</strong> (download the transcript in <em>.txt</em> format or in <em>.srt</em>, download the summary, and rename the speakers).</p>



<p>These buttons are placed in 4 columns, which allows them to be displayed on one line. The problem is that these options are sometimes enabled and sometimes not. If we statically assigned each button to a column, we would risk having an empty column among the four, which would not be aesthetically pleasing.</p>



<p>This is where the <em>btn_token_list</em> comes in! It is a list of lists: each element is itself a list whose first item is the token&#8217;s value and whose second item is its name. For example, the <em>btn_token_list</em> may contain the list <em>[True, &#8220;dia_token&#8221;]</em>, which means that the diarization option has been selected. </p>



<p>From this, we can assign a button to a column only if its token is set to <em>True</em>. If the token is set to <em>False</em>, we reuse that column for the next token. This avoids creating an empty column.</p>
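<p>This packing logic can be sketched independently of Streamlit (a minimal sketch; the token names mirror the <em>[token_value, name]</em> shape used by the app's <em>btn_token_list</em>):</p>

```python
btn_token_list = [
    [True, "useless_txt_token"],
    [False, "srt_token"],
    [True, "dia_token"],
    [False, "summarize_token"],
]

# Only enabled options consume a column, so no column is left empty:
# disabled tokens are skipped and the next enabled one takes their place
columns_used = [name for enabled, name in btn_token_list if enabled]
```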



<h3 class="wp-block-heading" id="block-a695aa41-4c72-4fb6-a38a-634ea54d5710"><strong>Rename Speakers</strong></h3>



<p id="block-d589a3b7-3887-48cb-a7fa-a95648d12309">Of course, it would be interesting to have the possibility to <strong>rename the detected speakers</strong> in the audio file. Indeed, having <em>Speaker0, Speaker1,</em> &#8230; is fine but it could be so much better with <strong>real names</strong>! Guess what? We are going to do this!</p>



<p id="block-9a0918b6-4b78-43e8-b25a-4b8b6f3bbe02">First, we will <strong>create a list</strong> where we will <strong>add each speaker with his &#8216;ID&#8217;</strong> (<em>ex: Speaker1 has 1 as his ID</em>).</p>



<p>Unfortunately, the diarization <strong>does not sort the speakers</strong>. For example, the first one detected might be Speaker3, followed by Speaker0, then Speaker2. This is why it is important to sort this list, for example by placing the lowest ID as the first element. This prevents names from being swapped between speakers.</p>



<p>Once this is done, we need to find a way for the user to interact with this list and modify the names contained in it. </p>



<p>➡️ We are going to create a third page that will be dedicated to this functionality. On this page, we will display each name contained in the list in a <em>st.text_area()</em> widget. The user will be able to see how many people have been detected in his audio and the automatic names (<em>Speaker0, Speaker1,</em> &#8230;) that have been assigned to them, as the screen below shows:</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/11/rename-speakers-page.png" alt="speech to text application speakers differentiation" class="wp-image-24106" width="661" height="406" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/11/rename-speakers-page.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2022/11/rename-speakers-page-300x184.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/11/rename-speakers-page-768x471.png 768w" sizes="auto, (max-width: 661px) 100vw, 661px" /><figcaption class="wp-element-caption">Overview of the <em>Rename Speakers</em> page</figcaption></figure>



<p id="block-942f97d7-aa6c-4931-8e46-efda8d3c22d3">The user is able to modify this text area. Indeed, he can replace each name with the one he wants, but he must <strong>respect the one name per line format</strong>. When he has finished, he can <strong>save his modifications</strong> by <strong>clicking a &#8220;<em>Save changes</em>&#8221; button</strong>, which <strong>calls the callback function <em>click_confirm_rename_btn</em>()</strong> that we will define just after. We also display a <strong>&#8220;<em>Cancel</em>&#8221; button</strong> that will redirect the user to the results page.</p>



<p>All this process is realized by the <em>rename_speakers_window()</em> function. <strong>Add it to your code:</strong></p>



<pre id="block-d94c23c7-cb50-4a25-9472-f1f59f798e94" class="wp-block-code"><code class="">def rename_speakers_window():
    """
    Load a new page which allows the user to rename the different speakers from the diarization process
    For example he can switch from "Speaker1 : "I wouldn't say that"" to "Mat : "I wouldn't say that""
    """

    st.subheader("Here you can rename the speakers as you want")
    number_of_speakers = st.session_state["number_of_speakers"]

    if number_of_speakers &gt; 0:
        # Handle displayed text according to the number_of_speakers
        if number_of_speakers == 1:
            st.write(str(number_of_speakers) + " speaker has been detected in your audio")
        else:
            st.write(str(number_of_speakers) + " speakers have been detected in your audio")

        # Saving the Speaker Name and its ID in a list, example : [1, 'Speaker1']
        list_of_speakers = []
        for elt in st.session_state["process"]:
            if st.session_state["chosen_mode"] == "DIA_TS":
                if [elt[1], elt[2]] not in list_of_speakers:
                    list_of_speakers.append([elt[1], elt[2]])
            elif st.session_state["chosen_mode"] == "DIA":
                if [elt[0], elt[1]] not in list_of_speakers:
                    list_of_speakers.append([elt[0], elt[1]])

        # Sorting (by ID)
        list_of_speakers.sort()  # [[1, 'Speaker1'], [0, 'Speaker0']] =&gt; [[0, 'Speaker0'], [1, 'Speaker1']]

        # Display saved names so the user can modify them
        initial_names = ""
        for elt in list_of_speakers:
            initial_names += elt[1] + "\n"

        names_input = st.text_area("Just replace the names without changing the format (one per line)",
                                   value=initial_names)

        # Display Options (Cancel / Save)
        col1, col2 = st.columns(2)
        with col1:
            # Cancel changes by clicking a button - callback function to return to the results page
            st.button("Cancel", on_click=update_session_state, args=("page_index", 1,))
        with col2:
            # Confirm changes by clicking a button - callback function to apply changes and return to the results page
            st.button("Save changes", on_click=click_confirm_rename_btn, args=(names_input, number_of_speakers, ))

    # Don't have anyone to rename
    else:
        st.error("0 speakers have been detected. It seems there is an issue with the diarization")
        with st.spinner("Redirecting to transcription page"):
            time.sleep(4)
            # return to the results page
            update_session_state("page_index", 1)</code></pre>



<p id="block-8897a8bc-7dd3-43cc-8f83-851e8a9aea8e">Now, <strong>write the callback function</strong> that is called when the <em>&#8220;Save changes&#8221;</em> button is clicked. It allows to <strong>save the new speaker&#8217;s names</strong> in the <em>process</em> session state variable and to <strong>recreate the displayed text with the new names</strong> <strong>given by the user</strong> thanks to the previously defined function <em>create_txt_text_from_process()</em>. Finally, it <strong>redirects the user to the results page</strong>.</p>



<pre id="block-b64adcbc-4c78-4451-aa88-d8fc7de54e53" class="wp-block-code"><code class="">def click_confirm_rename_btn(names_input, number_of_speakers):
    """
    If the user decides to rename speakers and confirms his choices, we apply the modifications to our transcript
    Then we return to the results page of the app
    :param names_input: string
    :param number_of_speakers: Number of detected speakers in the audio file
    """

    try:
        names_input = names_input.split("\n")[:number_of_speakers]

        for elt in st.session_state["process"]:
            elt[2] = names_input[elt[1]]

        txt_text = create_txt_text_from_process()
        update_session_state("txt_transcript", txt_text)
        update_session_state("page_index", 1)

    except TypeError:  # list indices must be integers or slices, not str (happened to me one time when writing non sense names)
        st.error("Please respect the 1 name per line format")
        with st.spinner("We are relaunching the page"):
            time.sleep(3)
            update_session_state("page_index", 1)</code></pre>
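<p>To see how the renaming applies, here is a standalone sketch with a minimal <em>process</em> list in the DIA_TS shape (<em>[timestamp, speaker_id, speaker_name, text, start_s]</em>; the names and sentences are made up for the example):</p>

```python
process = [
    ["0:00:01 --> 0:00:04\n", 0, "Speaker0", " : Hello there", 1],
    ["0:00:05 --> 0:00:08\n", 1, "Speaker1", " : Hi!", 5],
]

# The user typed one name per line in the text area
names_input = "Alice\nBob\n".split("\n")[:2]

# The speaker ID indexes the new name list, exactly like in the callback above
for elt in process:
    elt[2] = names_input[elt[1]]
```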



<h3 class="wp-block-heading" id="block-b681491a-4806-4d35-a6ed-5b6bc70dc839">Create subtitles for videos (.SRT)</h3>



<p>The idea here is very simple; the process is the same as before. In this case, we just have to <strong>shorten the timestamps</strong> by adjusting the <em>min_space</em> and <em>max_space</em> values, so we get a <strong>good video-subtitles synchronization</strong>.</p>



<p id="block-f5aaab26-bda2-468f-8ea0-15bbcc374fda">Indeed, remember that <strong>subtitles must correspond to small time windows</strong> to have <strong>small synchronized transcripts</strong>. Otherwise, there will be too much text. That&#8217;s why we <strong>set the <em>min_space</em> to 1s and the <em>max_space</em> to 8s</strong> instead of the classic min: 25s and max: 45s values.</p>



<pre id="block-6889eddb-3ada-4d53-9d6a-6c132539ff62" class="wp-block-code"><code class="">def silence_mode_init(srt_token):
    """
    Fix min_space and max_space values
    If the user wants a srt file, we need to have tiny timestamps
    :param srt_token: Enable/Disable generate srt file option (choice fixed by user)
    :return: min_space and max_space values
    """

    if srt_token:
        # We need short intervals if we want a short text
        min_space = 1000  # 1 sec
        max_space = 8000  # 8 secs

    else:

        min_space = 25000  # 25 secs
        max_space = 45000  # 45secs
    return min_space, max_space</code></pre>



<h3 class="wp-block-heading">Update old code</h3>



<p id="block-8e4448c3-d843-4e10-a357-7784ef86c67c">As we have a lot of <strong>new parameters</strong> <em>(diarization_token, timestamps_token, summarize_token, &#8230;)</em> in our <em>display_transcription()</em> function, we need to <strong>update our <em>transcription_non_diarization()</em> function</strong> so it can pass these new parameters along and display the transcript correctly.</p>



<pre id="block-4ef2e66a-16f6-4bf8-9777-2cf050df76a6" class="wp-block-code"><code class="">def transcription_non_diarization(filename, myaudio, start, end, diarization_token, timestamps_token, srt_token,
                                  summarize_token, stt_model, stt_tokenizer, min_space, max_space, save_result,
                                  txt_text, srt_text):
    """
    Performs transcribing action with the non-diarization mode
    :param filename: name of the audio file
    :param myaudio: AudioSegment file
    :param start: int value (s) given by st.slider() (fixed by user)
    :param end: int value (s) given by st.slider() (fixed by user)
    :param diarization_token: Differentiate or not the speakers (choice fixed by user)
    :param timestamps_token: Display and save or not the timestamps (choice fixed by user)
    :param srt_token: Enable/Disable generate srt file (choice fixed by user)
    :param summarize_token: Summarize or not the transcript (choice fixed by user)
    :param stt_model: Speech to text model
    :param stt_tokenizer: Speech to text model's tokenizer
    :param min_space: Minimum temporal distance between two silences
    :param max_space: Maximum temporal distance between two silences
    :param save_result: whole process
    :param txt_text: generated .txt transcript
    :param srt_text: generated .srt transcript
    :return: results of transcribing action
    """

    # Numeric counter identifying each sequential subtitle
    srt_index = 1

    # get silences
    silence_list = detect_silences(myaudio)
    if silence_list != []:
        silence_list = get_middle_silence_time(silence_list)
        silence_list = silences_distribution(silence_list, min_space, max_space, start, end, srt_token)
    else:
        silence_list = generate_regular_split_till_end(silence_list, int(end), min_space, max_space)

    # Transcribe each audio chunk (from timestamp to timestamp) and display transcript
    for i in range(0, len(silence_list) - 1):
        sub_start = silence_list[i]
        sub_end = silence_list[i + 1]

        transcription = transcribe_audio_part(filename, stt_model, stt_tokenizer, myaudio, sub_start, sub_end, i)

        # Initial audio has been split with the start &amp; end values
        # It begins at 0 s, so the timestamps need to be shifted by +start*1000 ms to compensate for the offset
        if transcription != "":
            save_result, txt_text, srt_text, srt_index = display_transcription(diarization_token, summarize_token,
                                                                    srt_token, timestamps_token,
                                                                    transcription, save_result,
                                                                    txt_text,
                                                                    srt_text,
                                                                    srt_index, sub_start + start * 1000,
                                                                    sub_end + start * 1000)

    return save_result, txt_text, srt_text</code></pre>
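<p>The function above also relies on silence-detection helpers (<em>detect_silences()</em>, <em>get_middle_silence_time()</em>, <em>silences_distribution()</em>) defined in the earlier notebook tutorials. As a reminder of the idea, a minimal sketch of <em>get_middle_silence_time()</em> (assumed behavior: split the audio in the middle of each detected silence) could look like this:</p>

```python
def get_middle_silence_time(silence_list):
    """
    Sketch of the helper used above (assumed behavior): replace each detected
    (start_ms, end_ms) silence interval by its middle point, so the audio can
    be cut in the middle of each silence rather than at its edges.
    """
    return [(start + end) // 2 for start, end in silence_list]
```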



<p id="block-7f6d7de0-2aca-44d7-9854-9f86022a8fbc"><strong>Also, you need to add these new parameters</strong> to the <em>transcript_from_url()</em> and <em>transcript_from_file()</em> functions.</p>



<pre id="block-0fd5d977-eb66-436f-87a0-a0f564ae3268" class="wp-block-code"><code class="">def transcript_from_url(stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline):
    """
    Displays a text input area, where the user can enter a YouTube URL link. If the link seems correct, we try to
    extract the audio from the video, and then transcribe it.

    :param stt_tokenizer: Speech to text model's tokenizer
    :param stt_model: Speech to text model
    :param t5_tokenizer: Auto punctuation model's tokenizer
    :param t5_model: Auto punctuation model
    :param summarizer: Summarizer model
    :param dia_pipeline: Diarization Model (Differentiate speakers)
    """

    url = st.text_input("Enter the YouTube video URL then press Enter to confirm!")
    # If link seems correct, we try to transcribe
    if "youtu" in url:
        filename = extract_audio_from_yt_video(url)
        if filename is not None:
            transcription(stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline, filename)
        else:
            st.error("We were unable to extract the audio. Please verify your link, retry or choose another video")</code></pre>



<pre id="block-98c50033-3932-4b8c-8baa-647defb92303" class="wp-block-code"><code class="">def transcript_from_file(stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline):
    """
    Displays a file uploader area, where the user can import his own file (mp3, mp4 or wav). If the file format seems
    correct, we transcribe the audio.
    :param stt_tokenizer: Speech to text model's tokenizer
    :param stt_model: Speech to text model
    :param t5_tokenizer: Auto punctuation model's tokenizer
    :param t5_model: Auto punctuation model
    :param summarizer: Summarizer model
    :param dia_pipeline: Diarization Model (Differentiate speakers)
    """

    # File uploader widget with a callback function, so the page reloads if the user uploads a new audio file
    uploaded_file = st.file_uploader("Upload your file! It can be a .mp3, .mp4 or .wav", type=["mp3", "mp4", "wav"],
                                     on_change=update_session_state, args=("page_index", 0,))

    if uploaded_file is not None:
        # get name and launch transcription function
        filename = uploaded_file.name
        transcription(stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline, filename, uploaded_file)</code></pre>



<p id="block-63b4dd8e-151e-4eb6-9588-d71f7ad078f8">Everything is almost ready; you can finally <strong>update the <em>transcription()</em> function</strong> so it <strong>calls all the new methods we have defined</strong>:</p>



<pre id="block-44999f30-2a48-4245-8b77-a177e94ecc97" class="wp-block-code"><code class="">def transcription(stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline, filename,
                  uploaded_file=None):
    """
    Mini-main function
    Display options, transcribe an audio file and save results.
    :param stt_tokenizer: Speech to text model's tokenizer
    :param stt_model: Speech to text model
    :param t5_tokenizer: Auto punctuation model's tokenizer
    :param t5_model: Auto punctuation model
    :param summarizer: Summarizer model
    :param dia_pipeline: Diarization Model (Differentiate speakers)
    :param filename: name of the audio file
    :param uploaded_file: file / name of the audio file which allows the code to reach the file
    """

    # If the audio comes from the Youtube extraction mode, the audio is downloaded so the uploaded_file is
    # the same as the filename. We need to change the uploaded_file which is currently set to None
    if uploaded_file is None:
        uploaded_file = filename

    # Get audio length of the file(s)
    myaudio = AudioSegment.from_file(uploaded_file)
    audio_length = myaudio.duration_seconds

    # Save Audio (so we can display it on another page ("DISPLAY RESULTS"), otherwise it is lost)
    update_session_state("audio_file", uploaded_file)

    # Display audio file
    st.audio(uploaded_file)

    # Is transcription possible
    if audio_length &gt; 0:

        # We display options and user shares his wishes
        transcript_btn, start, end, diarization_token, punctuation_token, timestamps_token, srt_token, summarize_token, choose_better_model = load_options(
            int(audio_length), dia_pipeline)

        # If the end value hasn't been changed, we set it to the max value so we don't cut off some ms of the audio,
        # because the end value is returned by st.slider() as an int (e.g. it returns 12 instead of end=12.9 s)
        if end == int(audio_length):
            end = audio_length

        # Switching model for the better one
        if choose_better_model:
            with st.spinner("We are loading the better model. Please wait..."):

                try:
                    stt_tokenizer = pickle.load(open("models/STT_tokenizer2_wav2vec2-large-960h-lv60-self.sav", 'rb'))
                except FileNotFoundError:
                    stt_tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

                try:
                    stt_model = pickle.load(open("models/STT_model2_wav2vec2-large-960h-lv60-self.sav", 'rb'))
                except FileNotFoundError:
                    stt_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

        # Validate options and launch the transcription process thanks to the form's button
        if transcript_btn:

            # Check if start &amp; end values are correct
            start, end = correct_values(start, end, audio_length)

            # If start a/o end value(s) has/have changed, we trim/cut the audio according to the new start/end values.
            if start != 0 or end != audio_length:
                myaudio = myaudio[start * 1000:end * 1000]  # Works in milliseconds (*1000)

            # Transcribe process is running
            with st.spinner("We are transcribing your audio. Please wait"):

                # Initialize variables
                txt_text, srt_text, save_result = init_transcription(start, int(end))
                min_space, max_space = silence_mode_init(srt_token)

                # Differentiate speakers mode
                if diarization_token:

                    # Save mode chosen by user, to display expected results
                    if not timestamps_token:
                        update_session_state("chosen_mode", "DIA")
                    elif timestamps_token:
                        update_session_state("chosen_mode", "DIA_TS")

                    # Convert mp3/mp4 to wav (Differentiate speakers mode only accepts wav files)
                    if filename.endswith((".mp3", ".mp4")):
                        myaudio, filename = convert_file_to_wav(myaudio, filename)
                    else:
                        filename = "../data/" + filename
                        myaudio.export(filename, format="wav")

                    # Differentiate speakers process
                    diarization_timestamps, number_of_speakers = diarization_treatment(filename, dia_pipeline,
                                                                                       max_space, srt_token)
                    # Saving the number of detected speakers
                    update_session_state("number_of_speakers", number_of_speakers)

                    # Transcribe process with Diarization Mode
                    save_result, txt_text, srt_text = transcription_diarization(filename, diarization_timestamps,
                                                                                stt_model,
                                                                                stt_tokenizer,
                                                                                diarization_token,
                                                                                srt_token, summarize_token,
                                                                                timestamps_token, myaudio, start,
                                                                                save_result,
                                                                                txt_text, srt_text)

                # Non Diarization Mode
                else:
                    # Save mode chosen by user, to display expected results
                    if not timestamps_token:
                        update_session_state("chosen_mode", "NODIA")
                    if timestamps_token:
                        update_session_state("chosen_mode", "NODIA_TS")

                    filename = "../data/" + filename
                    # Transcribe process with non Diarization Mode
                    save_result, txt_text, srt_text = transcription_non_diarization(filename, myaudio, start, end,
                                                                                    diarization_token, timestamps_token,
                                                                                    srt_token, summarize_token,
                                                                                    stt_model, stt_tokenizer,
                                                                                    min_space, max_space,
                                                                                    save_result, txt_text, srt_text)

                # Save results so it is not lost when we interact with a button
                update_session_state("process", save_result)
                update_session_state("srt_txt", srt_text)

                # Get final text (with or without punctuation token)
                # Diarization Mode
                if diarization_token:
                    # Create txt text from the process
                    txt_text = create_txt_text_from_process(punctuation_token, t5_model, t5_tokenizer)

                # Non diarization Mode
                else:

                    if punctuation_token:
                        # Need to split the text by 512 text blocks size since the model has a limited input
                        with st.spinner("Transcription is finished! Let us punctuate your audio"):
                            my_split_text_list = split_text(txt_text, 512)
                            txt_text = ""
                            # punctuate each text block
                            for my_split_text in my_split_text_list:
                                txt_text += add_punctuation(t5_model, t5_tokenizer, my_split_text)

                # Clean folder's files
                clean_directory("../data")

                # Display the final transcript
                if txt_text != "":
                    st.subheader("Final text is")

                    # Save txt_text and display it
                    update_session_state("txt_transcript", txt_text)
                    st.markdown(txt_text, unsafe_allow_html=True)

                    # Summarize the transcript
                    if summarize_token:
                        with st.spinner("We are summarizing your audio"):
                            # Display the summary in a st.expander widget to avoid writing too much text on the page
                            with st.expander("Summary"):
                                # Need to split the text by 1024 text blocks size since the model has a limited input
                                if diarization_token:
                                    # in diarization mode, the text to summarize is contained in the "summary" session state variable
                                    my_split_text_list = split_text(st.session_state["summary"], 1024)
                                else:
                                    # in non-diarization mode, it is contained in the txt_text variable
                                    my_split_text_list = split_text(txt_text, 1024)

                                summary = ""
                                # Summarize each text block
                                for my_split_text in my_split_text_list:
                                    summary += summarizer(my_split_text)[0]['summary_text']

                                # Collapse multiple spaces and remove spaces left before punctuation marks
                                summary = re.sub(' +', ' ', summary)
                                summary = re.sub(r'\s+([?.!"])', r'\1', summary)

                                # Display summary and save it
                                st.write(summary)
                                update_session_state("summary", summary)

                    # Display buttons to interact with results

                    # We have 4 possible buttons depending on the user's choices, but we can't simply assign 4 fixed
                    # columns to 4 buttons. Indeed, if the user displays only 3 buttons, one of columns 1, 2 or 3
                    # could be empty, which would be ugly. We want the activated options to fill the first columns
                    # so that the empty columns are not noticed. To do that, let's create a btn_token_list

                    btn_token_list = [[diarization_token, "dia_token"], [True, "useless_txt_token"],
                                      [srt_token, "srt_token"], [summarize_token, "summarize_token"]]

                    # Save this list to be able to reach it on the other pages of the app
                    update_session_state("btn_token_list", btn_token_list)

                    # Create 4 columns
                    col1, col2, col3, col4 = st.columns(4)

                    # Create a column list
                    col_list = [col1, col2, col3, col4]

                    # Check value of each token, if True, we put the respective button of the token in a column
                    col_index = 0
                    for elt in btn_token_list:
                        if elt[0]:
                            mycol = col_list[col_index]
                            if elt[1] == "useless_txt_token":
                                # Download your transcript.txt
                                with mycol:
                                    st.download_button("Download as TXT", txt_text, file_name="my_transcription.txt",
                                                       on_click=update_session_state, args=("page_index", 1,))
                            elif elt[1] == "srt_token":
                                # Download your transcript.srt
                                with mycol:
                                    update_session_state("srt_token", srt_token)
                                    st.download_button("Download as SRT", srt_text, file_name="my_transcription.srt",
                                                       on_click=update_session_state, args=("page_index", 1,))
                            elif elt[1] == "dia_token":
                                with mycol:
                                    # Rename the speakers detected in your audio
                                    st.button("Rename Speakers", on_click=update_session_state, args=("page_index", 2,))

                            elif elt[1] == "summarize_token":
                                with mycol:
                                    # Download the summary of your transcript.txt
                                    st.download_button("Download Summary", st.session_state["summary"],
                                                       file_name="my_summary.txt",
                                                       on_click=update_session_state, args=("page_index", 1,))
                            col_index += 1

                else:
                    st.write("Transcription impossible, a problem occurred with your audio or your parameters, "
                             "we apologize :(")

    else:
        st.error("Seems your audio is 0 s long, please change your file")
        time.sleep(3)
        st.stop()</code></pre>
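<p>Note that the function above relies on a <em>split_text()</em> helper to cut long transcripts into model-sized chunks (512 characters for punctuation, 1024 for summarization). Its exact implementation lives elsewhere in the project; a minimal sketch of the idea (simple fixed-size character slices, assumed behavior) could look like this:</p>

```python
def split_text(my_text, max_size):
    """
    Sketch of the split_text() helper used above: cut a long transcript into
    chunks of at most max_size characters, so each chunk fits the punctuation
    or summarization model's limited input size.
    """
    return [my_text[i:i + max_size] for i in range(0, len(my_text), max_size)]
```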



<p id="block-c336e842-7a4e-48a6-ad47-c819523e6ef0">Finally, <strong>update the main code</strong> of the Python file, which allows us to <strong>navigate between the different pages of our application</strong> (<em>token</em>, <em>home</em>, <em>results</em> and <em>rename pages</em>):</p>



<pre id="block-d623da5a-36b6-416a-a18c-79d7f2eb82fc" class="wp-block-code"><code class="">from app import *

if __name__ == '__main__':
    config()

    if st.session_state['page_index'] == -1:
        # Specify token page (mandatory to use the diarization option)
        st.warning('You must specify a token to use the diarization model. Otherwise, the app will be launched without this model. You can learn how to create your token here: https://huggingface.co/pyannote/speaker-diarization')
        text_input = st.text_input("Enter your Hugging Face token:", placeholder="ACCESS_TOKEN_GOES_HERE", type="password")

        # Confirm or continue without the option
        col1, col2 = st.columns(2)

        # save changes button
        with col1:
            confirm_btn = st.button("I have changed my token", on_click=confirm_token_change, args=(text_input, 0), disabled=st.session_state["disable"])
            # if text is changed, button is clickable
            if text_input != "ACCESS_TOKEN_GOES_HERE":
                st.session_state["disable"] = False

        # Continue without a token (there will be no diarization option)
        with col2:
            dont_mind_btn = st.button("Continue without this option", on_click=update_session_state, args=("page_index", 0))

    if st.session_state['page_index'] == 0:
        # Home page
        choice = st.radio("Features", ["By a video URL", "By uploading a file"])

        stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline = load_models()

        if choice == "By a video URL":
            transcript_from_url(stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline)

        elif choice == "By uploading a file":
            transcript_from_file(stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline)

    elif st.session_state['page_index'] == 1:
        # Results page
        display_results()

    elif st.session_state['page_index'] == 2:
        # Rename speakers page
        rename_speakers_window()</code></pre>



<p>The idea is the following: </p>



<p>The user arrives at <em>the token page</em> (whose index is -1), where a <em>st.warning()</em> invites him to enter his diarization access token into a <em>text_input()</em> widget. He can enter his token and click the <em>confirm_btn</em>, which becomes clickable once the text has changed. He can also choose not to use this option by clicking the <em>dont_mind</em> button. In both cases, the <em>page_index</em> variable will be updated to 0, and the application will then display the <em>home page</em>, which allows the user to transcribe his files.</p>
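<p>Stripped of the Streamlit widgets, this page navigation boils down to a small state machine. The sketch below is plain Python, with a dict standing in for <em>st.session_state</em> and purely illustrative page names:</p>

```python
# A plain dict stands in for st.session_state in this framework-free sketch
session_state = {"page_index": -1}  # start on the token page

def update_session_state(var, data):
    """Mirror of the app's helper: store a value under the given key."""
    session_state[var] = data

def current_page():
    """Map the page_index session variable to the page it displays."""
    pages = {-1: "token page", 0: "home page", 1: "results page", 2: "rename speakers page"}
    return pages[session_state["page_index"]]
```

Clicking a button in the real app simply calls <em>update_session_state("page_index", &#8230;)</em>, and the main code re-runs and routes to the matching page.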



<p>Following this logic, the session variable <em>page_index</em> must <strong>no longer be initialized to 0</strong> (the index of the <em>home page</em>) but <strong>to -1</strong>, so that the <em>token page</em> loads first. To do so, <strong>modify its initialization in the <em>config()</em> function</strong>:</p>



<pre class="wp-block-code"><code class=""># Modify the page_index initialization in the config() function

def config(): 

    # .... 

    if 'page_index' not in st.session_state:
        st.session_state['page_index'] = -1 </code></pre>



<h3 class="wp-block-heading">Conclusion</h3>



<p>Congratulations! Your Speech to Text application is now full of features. Now it&#8217;s time to have fun with it! </p>



<p>You can transcribe audio files and videos, with or without punctuation. You can also generate synchronized subtitles. And you have discovered how to differentiate speakers thanks to diarization, in order to follow a conversation more easily.</p>



<p>➡️ To <strong>significantly reduce the initialization time</strong> of the app and the <strong>execution time of the transcription</strong>, we recommend that you deploy your speech to text app on powerful GPU resources with <strong>AI Deploy</strong>. To learn how to do it, please refer to&nbsp;<a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/tuto-streamlit-speech-to-text-app/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">this documentation</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to build a Speech-To-Text Application with Python (2/3)</title>
		<link>https://blog.ovhcloud.com/how-to-build-a-speech-to-text-application-with-python-2-3/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Wed, 14 Dec 2022 09:26:39 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Apps]]></category>
		<category><![CDATA[AI Solutions]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[Streamlit]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=23283</guid>

					<description><![CDATA[A tutorial to create and build your own Speech-To-Text Application with Python. At the end of this second article, your Speech-To-Text application will be more interactive and visually better. Indeed, we are going to center our titles and justify our transcript. We will also add some useful buttons (to download the transcript, to play with [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em>A tutorial to create and build your own <strong>Speech-To-Text Application</strong></em> with Python.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-2-1024x576.png" alt="speech to text app image2" class="wp-image-24059" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-2-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-2-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-2-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-2-1536x864.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-2.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>At the end of this second article, your Speech-To-Text application will be <strong>more interactive</strong> and<strong> visually better</strong>. </p>



<p>Indeed, we are going to <strong>center</strong> our titles and <strong>justify</strong> our transcript. We will also add some useful <strong>buttons</strong> (to download the transcript, to play with the timestamps). Finally, we will prepare the application for the next tutorial by displaying <strong>sliders and checkboxes</strong> to interact with the next functionalities (speaker differentiation, summarization, video subtitles generation, &#8230;)</p>



<p><em>The final code of the app is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/tree/main/apps/streamlit/speech-to-text" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.</em></p>



<h3 class="wp-block-heading">Overview of our final app</h3>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="575" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-1024x575.png" alt="speech to text streamlit app" class="wp-image-23277" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-1024x575.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-1536x863.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="has-text-align-center"><em>Overview of our final Speech-To-Text application</em></p>



<h3 class="wp-block-heading">Objective</h3>



<p>In the <a href="https://blog.ovhcloud.com/how-to-build-a-speech-to-text-application-with-python-1-3/" data-wpel-link="internal">previous article</a>, we saw how to build a <strong>basic</strong> <strong>Speech-To-Text application</strong> with <em>Python</em> and <em>Streamlit</em>. In this tutorial, we will <strong>improve </strong>this application by <strong>changing its appearance</strong>, <strong>improving its interactivity</strong> and <strong>preparing the features</strong> used in the <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/natural-language-processing/speech-to-text/miniconda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">notebooks</a> (transcribe a specific audio part, differentiate speakers, generate video subtitles, punctuate and summarize the transcript, &#8230;) that we will implement in the last tutorial! </p>



<p>This article is organized as follows:</p>



<ul class="wp-block-list">
<li>Python libraries</li>



<li>Change appearance with CSS</li>



<li>Improve the app&#8217;s interactivity</li>



<li>Prepare new functionalities</li>
</ul>



<p><em>⚠️ Since this article uses code already explained in the previous <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/natural-language-processing/speech-to-text/miniconda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">notebook tutorials</a>, we will not re-explain it here. We therefore recommend that you read the notebooks first.</em></p>



<h4 class="wp-block-heading">1. Python libraries</h4>



<p>To implement our final features (speaker differentiation, summarization, &#8230;) in our speech to text app, we need to <strong>import</strong> the following libraries into our <em>app.py</em> file. We will use them later.</p>



<pre class="wp-block-code"><code class=""># Models
from pyannote.audio import Pipeline
from transformers import pipeline, HubertForCTC, T5Tokenizer, T5ForConditionalGeneration, Wav2Vec2ForCTC, Wav2Vec2Processor, Wav2Vec2Tokenizer
import pickle

# Others
import pandas as pd
import re</code></pre>



<h4 class="wp-block-heading">2. Change appearance with CSS</h4>



<p>Before adding or modifying anything, let&#8217;s <strong>improve the appearance</strong> of our application!</p>



<p>😕 Indeed, you may have noticed that our <strong>transcript is not justified</strong>, our <strong>titles are not centered</strong>, and there is an <strong>unnecessary space</strong> at the top of the screen.</p>



<p>➡️ To solve this, let&#8217;s use the <em>st.markdown()</em> function to write some <strong>CSS code</strong> inside a &#8220;<em>style</em>&#8221; tag! </p>



<p>Just <strong>add the following lines to the <em>config()</em> function</strong> we have created before, for example after the <em>st.title(&#8220;Speech to Text App 📝&#8221;)</em> line. This will tell <em>Streamlit</em> how it should display the mentioned elements.</p>



<pre class="wp-block-code"><code class="">    st.markdown("""
                    &lt;style&gt;
                    .block-container.css-12oz5g7.egzxvld2{
                        padding: 1%;}
                   
                    .stRadio &gt; label:nth-child(1){
                        font-weight: bold;
                        }
                    .stRadio &gt; div{flex-direction:row;}
                    p, span{ 
                        text-align: justify;
                    }
                    span{ 
                        text-align: center;
                    }
                    &lt;/style&gt;
                    """, unsafe_allow_html=True)</code></pre>



<p>We set the parameter &#8220;<em>unsafe_allow_html</em>&#8221; to &#8220;<em>True</em>&#8221; because HTML tags are escaped by default and therefore treated as pure text. Setting this argument to True turns off this behavior.</p>
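<p>This default escaping is the same idea as Python&#8217;s standard <em>html.escape()</em>: the markup characters are converted to entities, so the tags show up as literal text instead of being rendered. A small stand-alone sketch (an analogy, not <em>Streamlit</em>&#8217;s actual internals):</p>

```python
import html

# Without unsafe_allow_html=True, markup is treated like escaped text:
# the <style> tag becomes literal characters instead of real markup
snippet = "<style>p { text-align: justify; }</style>"
escaped = html.escape(snippet)
print(escaped)
```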



<p class="has-text-align-left">⬇️ Let&#8217;s look at the result:</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:100%">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:100%">
<figure class="wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex">
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1554" height="873" data-id="23292" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/without_css-edited.png" alt="speech to text streamlit application without css" class="wp-image-23292" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/07/without_css-edited.png 1554w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/without_css-edited-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/without_css-edited-1024x575.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/without_css-edited-768x431.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/without_css-edited-1536x863.png 1536w" sizes="auto, (max-width: 1554px) 100vw, 1554px" /><figcaption class="wp-element-caption"><br></figcaption></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1630" height="915" data-id="23291" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/with_css-edited.png" alt="speech to text streamlit application with css" class="wp-image-23291" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/07/with_css-edited.png 1630w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/with_css-edited-300x168.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/with_css-edited-1024x575.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/with_css-edited-768x431.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/with_css-edited-1536x862.png 1536w" sizes="auto, (max-width: 1630px) 100vw, 1630px" /></figure>
</figure>
</div>
</div>
</div>
</div>



<p class="has-text-align-center"><em>App without CSS (on the left) and with (on the right)</em></p>



<p>This is better now, isn&#8217;t it? </p>



<h4 class="wp-block-heading">3. Improve the app&#8217;s interactivity</h4>



<p>Now we will let the user <strong>interact</strong> with our application. It will no longer only generate a transcript.</p>



<p><strong>3.1 Download transcript</strong></p>



<p>We want the users to be able to <strong>download the generated transcript as a text file</strong>. This will save them from having to copy the transcript and paste it into a text file. We can do this easily with a <strong>download button widget</strong><em>.</em></p>



<p>Unfortunately, <em>Streamlit</em> does not make this feature easy. Indeed, <strong>each time you interact with a button</strong> on the page, the entire <em>Streamlit</em> <strong>script is re-run</strong>, which <strong>deletes our displayed transcript</strong>. To observe this problem, <strong>add a download button</strong> to the app, just after the <em>st.write(txt_text)</em> line in the <em>transcription()</em> function, with the following code:</p>



<pre class="wp-block-code"><code class=""># Download transcript button - Add it to the transcription() function, after the st.write(txt_text) line
st.download_button("Download as TXT", txt_text, file_name="my_transcription.txt")</code></pre>



<p>Now, if you transcribe an audio file, you should see a download button at the bottom of the transcript, and if you click it, you will get the transcript in .txt format as expected. But you will notice that the <strong>whole transcript disappears</strong> <strong>for no apparent reason</strong>, which is frustrating for the user, as the video below shows: </p>



<figure class="wp-block-video aligncenter"><video height="720" style="aspect-ratio: 1280 / 720;" width="1280" controls src="https://blog.ovhcloud.com/wp-content/uploads/2022/11/speech_to_text_app_click_button_issue.mp4"></video></figure>



<p class="has-text-align-center"><em>Video illustrating the issue with Streamlit button widgets</em></p>



<p>To solve this, we are going to use <strong><em>Streamlit</em> <em>session state</em></strong> and <strong><em>callback functions</em></strong>. Indeed, session state is a way to share variables between reruns. Since <em>Streamlit</em> reruns the app&#8217;s script when we click a button, this is the perfect solution!</p>



<p>➡️ First, let&#8217;s <strong>initialize four session state variables </strong>respectively called <em>audio_file</em>, <em>process, txt_transcript</em> and <em>page_index</em>. </p>



<p><em>As the session state variables are initialized only once in the code, we can <strong>initialize them all at once</strong> as below</em>, <strong>in the <em>config()</em> function</strong>:</p>



<pre class="wp-block-code"><code class=""># Initialize session state variables
# Should be added to the config() function 
if 'page_index' not in st.session_state:
    st.session_state['audio_file'] = None
    st.session_state["process"] = []
    st.session_state['txt_transcript'] = ""
    st.session_state["page_index"] = 0</code></pre>



<p>The first one allows us to <strong>save the user&#8217;s audio file</strong>. Then<em>, </em>the<em> process</em> variable (which is a list) will <strong>contain each generated transcript part with its associated timestamps</strong>, while the third variable will only contain the concatenated transcripts, which means the <strong>final text</strong>.</p>



<p>The last variable, <em>page_index</em>, will <strong>determine which page of our application will be displayed</strong> according to its value. Indeed, since clicking the download button removes the displayed transcript, we are going to <strong>create a second page</strong>, named <strong>results page</strong>, where we will <strong>display again</strong> the user&#8217;s audio file and the obtained transcript <strong>thanks to the values saved in our session state variables</strong>. We can then <strong>redirect the user</strong> to this second page as soon as the user <strong>clicks the download button</strong>. This will allow the user to always be able to see his transcript, even if he downloads it!</p>
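<p>To make these variables concrete, here is a stand-alone sketch of how they relate to each other. The exact layout of each <em>process</em> entry comes from the <em>save_result</em> variable of the previous article; the values below are hypothetical:</p>

```python
# Hypothetical shape of the "process" entries saved in session state:
# [timestamp label, transcript chunk, chunk start in milliseconds]
process = [
    ["[0s - 10s]", "hello everyone", 0],
    ["[10s - 20s]", " welcome to this tutorial", 10000],
]

# The final text (txt_transcript) is simply the concatenation
# of every transcript chunk
txt_transcript = "".join(elt[1] for elt in process)
print(txt_transcript)
```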



<p>➡️ Once we have initialized the session state variables, we need to <strong>save</strong> the transcript with the associated timestamps and the final text in <strong>these variables</strong> so we do not lose this information when we click a button.</p>



<p>To do that, we need to <strong>define an <em>update_session_state()</em> function</strong> which will allow us to <strong>update our session state variables</strong>, either by <strong>replacing</strong> their content, or by <strong>concatenating</strong> it, which will be interesting for the transcripts since they are obtained step by step. Indeed, <strong>concatenating each transcript part will allow us to obtain the final transcript</strong>. Here is the function:</p>



<pre class="wp-block-code"><code class="">def update_session_state(var, data, concatenate_token=False):
    """
    A simple function to update a session state variable
    :param var: variable's name
    :param data: new value of the variable
    :param concatenate_token: do we replace or concatenate
    """

    if concatenate_token:
        st.session_state[var] += data
    else:
        st.session_state[var] = data</code></pre>
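<p>Outside <em>Streamlit</em>, the replace-or-concatenate behaviour of this helper can be sketched with a plain dict standing in for <em>st.session_state</em>:</p>

```python
# A plain dict stands in for st.session_state in this offline sketch
session_state = {"txt_transcript": "", "page_index": 0}

def update_session_state(var, data, concatenate_token=False):
    # Either replace the stored value or append to it (appending is
    # useful for transcripts, which are produced chunk by chunk)
    if concatenate_token:
        session_state[var] += data
    else:
        session_state[var] = data

update_session_state("txt_transcript", "Hello ", concatenate_token=True)
update_session_state("txt_transcript", "world", concatenate_token=True)
update_session_state("page_index", 1)
print(session_state)
```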



<p>This is where we will use the variable<em> save_result</em> from the previous article. Actually, <em>save_result </em>is a list which contains the timestamps and the generated transcript. This corresponds to what we want in the <em>process</em> state variable, which will allow us to retrieve the transcripts and associated timestamps and display them on our results page! </p>



<pre class="wp-block-code"><code class="">### Add this line to the transcription() function, after the transcription_non_diarization() call

# Save results
update_session_state("process", save_result)</code></pre>



<p>Let&#8217;s do the same with the <em>audio_file</em> and <em>txt_text</em> variables, so we can also re-display the audio player and the final text on our results page.</p>



<pre class="wp-block-code"><code class="">### Add this line to the transcription() function, after the st.audio(uploaded_file) line, to save the audio file

# Save Audio so it is not lost when we interact with a button (so we can display it on the results page)
update_session_state("audio_file", uploaded_file)</code></pre>



<pre class="wp-block-code"><code class="">### Add this line to the transcription() function, after the if txt_text != ""

# Save txt_text
update_session_state("txt_transcript", txt_text)</code></pre>



<p>Thanks to the content saved in our session state variables (<em>audio_file</em>, <em>process, txt_transcript</em>), we are ready to create our results page.</p>



<p><strong>3.2 Create the results page</strong> <strong>and switch to it</strong></p>



<p>First, we have to tell <em>Streamlit</em> that <strong>clicking the download button</strong> must <strong>change the <em>page_index</em></strong> <strong>value</strong>. Indeed, remember that its value determines which page of our app is displayed. </p>



<p>If this variable is 0, we will see the home page. If we click a download button, the app&#8217;s script is restarted and the transcript will disappear from the home page. But since the <em>page_index</em> value will now be set to 1 when a button is clicked, we will display the results page instead of the home page and we will no longer have an empty page. </p>



<p>To do this, we simply <strong>add the previous function</strong> to our download button thanks to the <em>on_click</em> parameter, so we can indicate to our app that we want to update the <em>page_index</em> session state variable from 0 to 1 (to go from the home page to the results page) when we click this button.</p>



<pre class="wp-block-code"><code class="">### Modify the code of the download button, in the transcription() function, at the end of the if txt_text != "" statement

st.download_button("Download as TXT", txt_text, file_name="my_transcription.txt", on_click=update_session_state, args=("page_index", 1,))</code></pre>



<p>Now that the <em>page_index</em> value is updated, we need to check its value to know if the displayed page should be the home page or the results page. </p>



<p>We do this value checking into the main code of our app. You can <strong>replace the old main code by the following one</strong>:</p>



<pre class="wp-block-code"><code class="">if __name__ == '__main__':
    config()

    # Default page
    if st.session_state['page_index'] == 0:
        choice = st.radio("Features", ["By a video URL", "By uploading a file"])

        stt_tokenizer, stt_model = load_models()

        if choice == "By a video URL":
            transcript_from_url(stt_tokenizer, stt_model)

        elif choice == "By uploading a file":
            transcript_from_file(stt_tokenizer, stt_model)

    # Results page
    elif st.session_state['page_index'] == 1:
        # Display Results page
        display_results()</code></pre>



<p>Now that we have created this page, all that remains is to display elements on it (titles, buttons, audio file, transcript)! </p>



<pre class="wp-block-code"><code class="">def display_results():

    st.button("Load another file", on_click=update_session_state, args=("page_index", 0,))
    st.audio(st.session_state['audio_file'])

    # Display results of transcription by steps
    if st.session_state["process"] != []:
        for elt in (st.session_state['process']):

            # Timestamp
            st.write(elt[0])

            # Transcript for this timestamp
            st.write(elt[1])

    # Display final text
    st.subheader("Final text is")
    st.write(st.session_state["txt_transcript"])

    
    # Download your transcription.txt
    st.download_button("Download as TXT", st.session_state["txt_transcript"], file_name="my_transcription.txt")</code></pre>



<p>👀 You may have noticed that at the beginning of the previous function, we added a <strong>&#8220;<em>Load another file</em>&#8221; button</strong>. If you look at it, you will see it has a <strong>callback function that updates the <em>page_index</em> to 0</strong>. In other words, this button <strong>allows the user to return to the home page</strong> so he can transcribe another file.</p>



<p>Now let&#8217;s see what happens when we interact with this download button:</p>



<figure class="wp-block-video"><video height="1080" style="aspect-ratio: 1920 / 1080;" width="1920" controls src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/click_button_streamlit_solve-3.mov"></video></figure>



<p class="has-text-align-center"><em>Video illustrating the solved issue with Streamlit button widgets</em></p>



<p>As you can see, <strong>clicking the download button no longer makes the transcript disappear</strong>, thanks to our results page! We still have the <em>st.audio()</em> widget, the <em>process</em> as well as the final text and the download button. We have solved our problem!</p>



<p><strong>3.3 Jump the audio player to each timestamp</strong></p>



<p>Our Speech-To-Text application would be so much better if the timestamps were displayed as buttons so that the user can <strong>click them and listen to the considered audio part</strong> thanks to the audio player widget we placed. Now that we know how to manipulate session state variables and callback functions, there is not much left to do 😁!</p>



<p>First, <strong>define a new session state variable</strong> in the config() function named <em><strong>start_time</strong></em>. It will indicate to our app where the <strong>starting point</strong> of the <em>st.audio() </em>widget should be. For the moment, it is always at 0s.</p>



<pre class="wp-block-code"><code class="">### Add this initialization to the config() function, with the other session state variables

st.session_state["start_time"] = 0</code></pre>



<p>Then, we define a new <strong>callback function</strong> that handles a <strong>timestamp button click</strong>. Just like before, it needs to <strong>redirect us to the results page</strong>, as we do not want the transcript to disappear. But it also needs to <strong>update the <em>start_time</em> variable</strong> to the beginning value of the timestamp button clicked by the user, so the starting point of the audio player can change. </p>



<p>For example, if the timestamp is [10s &#8211; 20s], we will set the starting point of the audio player to 10 seconds so that the user can check on the audio player the generated transcript for this part.</p>



<p>Here is the new callback function:</p>



<pre class="wp-block-code"><code class="">def click_timestamp_btn(sub_start):
    """
    When the user clicks a timestamp button, we go to the results page and st.audio is set to the sub_start value.
    It allows the user to listen to the considered part of the audio
    :param sub_start: Beginning of the considered transcript (ms)
    """

    update_session_state("page_index", 1)
    update_session_state("start_time", int(sub_start / 1000)) # division to convert ms to s</code></pre>
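<p>The millisecond-to-second conversion is worth isolating, since the timestamps are expressed in milliseconds while <em>st.audio()</em>&#8217;s <em>start_time</em> parameter expects seconds. A minimal stand-alone sketch:</p>

```python
def ms_to_seconds(sub_start):
    # Timestamps are expressed in milliseconds, while the start_time
    # parameter of st.audio() expects a whole number of seconds
    return int(sub_start / 1000)

print(ms_to_seconds(10000))
```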



<p>Now, we need to <strong>replace</strong> the timestamp text by a timestamp button, so we can click it.</p>



<p>To do this, just replace the <em>st.write(temp_timestamps)</em> of the <strong><em>display_transcription()</em> </strong>function<strong> by a widget button</strong> that calls our new callback function, with <em>sub_start</em> as an argument, which corresponds to the beginning value of the timestamp. In the previous example, <em>sub_start</em> would be 10,000 ms (the 10-second mark).</p>



<pre class="wp-block-code"><code class="">### Modify the code that displays the temp_timestamps variable, in the display_transcription() function
st.button(temp_timestamps, on_click=click_timestamp_btn, args=(sub_start,))</code></pre>



<p>To make it work, we also need to <strong>modify 3 lines of code in the <em>display_results()</em></strong> function, which manages the results page, because it needs to:</p>



<ul class="wp-block-list">
<li>Make the <em>st.audio()</em> widget starts from the <em>start_time</em> session state value</li>



<li>Display timestamps of the results page as buttons instead of texts</li>



<li>Call the <em>update_session_state()</em> function when we click one of these buttons to update the <em>start_time</em> value, so it changes the starting point of the audio player </li>
</ul>



<pre class="wp-block-code"><code class="">def display_results():
    st.button("Load another file", on_click=update_session_state, args=("page_index", 0,))
    st.audio(st.session_state['audio_file'], start_time=st.session_state["start_time"])

    # Display results of transcription by steps
    if st.session_state["process"] != []:
        for elt in (st.session_state['process']):
            # Timestamp
            st.button(elt[0], on_click=update_session_state, args=("start_time", elt[2],))
            
            #Transcript for this timestamp
            st.write(elt[1])

    # Display final text
    st.subheader("Final text is")
    st.write(st.session_state["txt_transcript"])

    # Download your transcription.txt
    st.download_button("Download as TXT", st.session_state["txt_transcript"], file_name="my_transcription.txt")</code></pre>



<p>When you&#8217;ve done this, each timestamp button (home page and results page) will be able to change the starting point of the audio player, as you can see on this video:</p>



<figure class="wp-block-video"><video height="1080" style="aspect-ratio: 1920 / 1080;" width="1920" controls src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/click_timestamp_btn.mov"></video></figure>



<p class="has-text-align-center"><em>Video illustrating the timestamp button click</em></p>



<p>This feature is really useful to easily check each of the obtained transcripts! </p>



<h4 class="wp-block-heading">4. Preparing new functionalities</h4>



<p>Now that the application is taking shape, it is time to add the many features we studied in the <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/natural-language-processing/speech-to-text/miniconda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">notebooks</a>.</p>



<p>Among these options are the possibility to:</p>



<ul class="wp-block-list">
<li>Trim/Cut an audio, if the user wants to transcribe only a specific part of the audio file</li>



<li>Differentiate speakers (Diarization)</li>



<li>Punctuate the transcript</li>



<li>Summarize the transcript</li>



<li>Generate subtitles for videos</li>



<li>Change the speech-to-text model to a better one (the result will take longer to obtain)</li>



<li>Display or not the timestamps</li>
</ul>



<p><strong>4.1 Let the user enable these functionalities or not</strong></p>



<p>First of all, we need to provide the user with a way to customize his transcript by <strong>choosing the options he wants to activate</strong>.</p>



<p>➡️ To do this, we will use sliders &amp; check boxes, as shown in the screenshot below:</p>



<figure class="wp-block-image aligncenter size-full is-resized is-style-default"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-7.png" alt="sliders checkboxes streamlit app" class="wp-image-23746" width="543" height="264" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-7.png 718w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-7-300x146.png 300w" sizes="auto, (max-width: 543px) 100vw, 543px" /><figcaption class="wp-element-caption">Overview of the displayed options</figcaption></figure>



<p>Add the following function to your code. It will display all the options on our application.</p>



<pre class="wp-block-code"><code class="">def load_options(audio_length, dia_pipeline):
    """
    Display options so the user can customize the result (punctuate, summarize the transcript ? trim the audio? ...)
    User can choose his parameters thanks to sliders &amp; checkboxes, both displayed in a st.form so the page doesn't
    reload when interacting with an element (frustrating if it does because user loses fluidity).
    :return: the chosen parameters
    """
    # Create a st.form()
    with st.form("form"):
        st.markdown("""&lt;h6&gt;
            You can transcript a specific part of your audio by setting start and end values below (in seconds). Then, 
            choose your parameters.&lt;/h6&gt;""", unsafe_allow_html=True)

        # Possibility to trim / cut the audio on a specific part (=&gt; transcribe less seconds will result in saving time)
        # To perform that, user selects his time intervals thanks to sliders, displayed in 2 different columns
        col1, col2 = st.columns(2)
        with col1:
            start = st.slider("Start value (s)", 0, audio_length, value=0)
        with col2:
            end = st.slider("End value (s)", 0, audio_length, value=audio_length)

        # Create 3 new columns to display other options
        col1, col2, col3 = st.columns(3)

        # User selects his preferences with checkboxes
        with col1:
            # Get an automatic punctuation
            punctuation_token = st.checkbox("Punctuate my final text", value=True)

            # Differentiate Speakers
            if dia_pipeline is None:
                st.write("Diarization model unavailable")
                diarization_token = False
            else:
                diarization_token = st.checkbox("Differentiate speakers")

        with col2:
            # Summarize the transcript
            summarize_token = st.checkbox("Generate a summary", value=False)

            # Generate a SRT file instead of a TXT file (shorter timestamps)
            srt_token = st.checkbox("Generate subtitles file", value=False)

        with col3:
            # Display the timestamp of each transcribed part
            timestamps_token = st.checkbox("Show timestamps", value=True)

            # Improve transcript with an other model (better transcript but longer to obtain)
            choose_better_model = st.checkbox("Change STT Model")

        # Srt option requires timestamps so it can match text with time =&gt; Need to correct the following case
        if not timestamps_token and srt_token:
            timestamps_token = True
            st.warning("Srt option requires timestamps. We activated it for you.")

        # Validate choices with a button
        transcript_btn = st.form_submit_button("Transcribe audio!")

    return transcript_btn, start, end, diarization_token, punctuation_token, timestamps_token, srt_token, summarize_token, choose_better_model</code></pre>



<p>This function is very simple to understand:</p>



<p>First of all, we display all options in a <em><strong>st.form()</strong></em>, so the page doesn&#8217;t reload each time the user interacts with an element (a <em>Streamlit</em> behavior which can be frustrating because, in our case, it wastes time). If you are curious, you can try to run your app without the <em>st.form()</em> to observe the problem 😊.</p>



<p>Then, we create some <strong>columns</strong>. They allow us to display the elements one under the other, aligned, to improve the visual appearance. Here too, you can display the elements one after the other without using columns, but it will look different.</p>



<p>We will call this function in the <em>transcription()</em> function, in the next article 😉. But if you want to test it now, you can call this function after the <em>st.audio()</em> widget. Just keep in mind that this will only display the options, but it won&#8217;t change the result since the options are not implemented yet.</p>



<p><strong>4.2 Session states variables</strong></p>



<p>To interact with these features, we need to initialize more session state variables (I swear these are the last ones 🙃):</p>



<pre class="wp-block-code"><code class="">### Add new initialization to our config() function

st.session_state['srt_token'] = 0  # Is subtitles parameter enabled or not
st.session_state['srt_txt'] = ""  # Save the transcript in a subtitles case to display it on the results page
st.session_state["summary"] = ""  # Save the summary of the transcript so we can display it on the results page
st.session_state["number_of_speakers"] = 0  # Save the number of speakers detected in the conversation (diarization)
st.session_state["chosen_mode"] = 0  # Save the mode chosen by the user (Diarization or not, timestamps or not)
st.session_state["btn_token_list"] = []  # List of tokens that indicates what options are activated to adapt the display on results page
st.session_state["my_HF_token"] = "ACCESS_TOKEN_GOES_HERE"  # User's Token that allows the use of the diarization model
st.session_state["disable"] = True  # Default appearance of the button to change your token
</code></pre>



<p>To quickly introduce you to their usefulness:</p>



<ul class="wp-block-list">
<li><em>srt_token: </em>Indicates if the user has activated or not the subtitles option in the form</li>



<li><em>srt_txt</em>: Contains the transcript in subtitles format (.SRT) in order to save it when we click a button</li>



<li><em>summary</em>: Contains the short transcript given by the summarization model, for the same reason</li>



<li><em>number_of_speakers</em>: Number of speakers detected by the diarization algorithm in the audio recording</li>



<li><em>chosen_mode</em>: Indicates what options the user has selected so we know which information should be displayed (timestamps? results of diarization?)</li>



<li><em>btn_token_list</em>: Handle which buttons should be displayed. You will understand why it is needed in the next article</li>



<li><em>my_HF_token</em>: Save the user&#8217;s token that allows the use of the diarization model</li>



<li><em>disable</em>: Boolean that makes the button to change the user&#8217;s token clickable or not (not clickable if the token has not been added)</li>
</ul>



<p>You will also need to <strong>add the following line of code to the <em>init_transcription()</em></strong> function:</p>



<pre class="wp-block-code"><code class=""># Add this line to the init_transcription() function
update_session_state("summary", "")</code></pre>



<p>This will reset the summary for each new audio file transcribed.</p>



<p><strong>4.3 Import the models</strong></p>



<p>Of course, to interact with these functionalities, we need to <strong>load new A.I. models</strong>. </p>



<p>⚠️ Reminder: We have used each of them in the previous <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/natural-language-processing/speech-to-text/miniconda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">notebook tutorials</a>. We will not re-explain their usefulness here.</p>



<p><strong>4.3.1 Create a token to access to the diarization model</strong></p>



<p>Since version 2 of the <em>pyannote.audio</em> library, an <strong>access token</strong> has been implemented and is <strong>mandatory</strong> in order to use the diarization model (which enables speaker differentiation).</p>



<p>To create your access token, you will need to:</p>



<ul class="wp-block-list">
<li><a href="https://huggingface.co/join" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Create a <em>Hugging Face</em> account</a> and <strong>verify your email address</strong></li>



<li>Visit the <em><a href="http://hf.co/pyannote/speaker-diarization" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">speaker-diarization</a></em> page and the <em><a href="http://hf.co/pyannote/segmentation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">segmentation</a></em> page, and <strong>accept user conditions </strong>on both pages (only if requested)</li>



<li>Visit the <a href="http://hf.co/settings/tokens" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">token page</a> to <strong>create</strong> an access token (Read Role)</li>
</ul>



<p><strong>4.3.2 Load the models</strong></p>



<p>Once you have your token, you can <strong>modify the code</strong> of the <em>load_models()</em> function to <strong>add the models</strong>:</p>



<pre class="wp-block-code"><code class="">@st.cache(allow_output_mutation=True)
def load_models():
    """
    Instead of systematically downloading each time the models we use (transcript model, summarizer, speaker differentiation, ...)
    thanks to transformers' pipeline, we first try to directly import them locally to save time when the app is launched.
    This function has a st.cache(), because as the models never change, we want the function to execute only one time
    (also to save time). Otherwise, it would run every time we transcribe a new audio file.
    :return: Loaded models
    """

    # Load facebook-hubert-large-ls960-ft model (English speech to text model)
    with st.spinner("Loading Speech to Text Model"):
        # If models are stored in a folder, we import them. Otherwise, we import the models with their respective library

        try:
            stt_tokenizer = pickle.load(open("models/STT_processor_hubert-large-ls960-ft.sav", 'rb'))
        except FileNotFoundError:
            stt_tokenizer = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")

        try:
            stt_model = pickle.load(open("models/STT_model_hubert-large-ls960-ft.sav", 'rb'))
        except FileNotFoundError:
            stt_model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

    # Load T5 model (Auto punctuation model)
    with st.spinner("Loading Punctuation Model"):
        try:
            t5_tokenizer = torch.load("models/T5_tokenizer.sav")
        except FileNotFoundError:
            t5_tokenizer = T5Tokenizer.from_pretrained("flexudy/t5-small-wav2vec2-grammar-fixer")

        try:
            t5_model = torch.load("models/T5_model.sav")
        except FileNotFoundError:
            t5_model = T5ForConditionalGeneration.from_pretrained("flexudy/t5-small-wav2vec2-grammar-fixer")

    # Load summarizer model
    with st.spinner("Loading Summarization Model"):
        try:
            summarizer = pickle.load(open("models/summarizer.sav", 'rb'))
        except FileNotFoundError:
            summarizer = pipeline("summarization")

    # Load Diarization model (Differentiate speakers)
    with st.spinner("Loading Diarization Model"):
        try:
            dia_pipeline = pickle.load(open("models/dia_pipeline.sav", 'rb'))
        except FileNotFoundError:
            dia_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1", use_auth_token=st.session_state["my_HF_token"])

            #If the token hasn't been modified, dia_pipeline will automatically be set to None. The functionality will then be disabled.

    return stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline</code></pre>



<p>⚠️ <strong>Don&#8217;t forget to replace the value of <em>use_auth_token</em></strong> with your <strong>personal Hugging Face token</strong>. Otherwise, the app will be launched without the diarization functionality.</p>



<p>As the <em>load_models()</em> function now returns 6 variables (instead of 2), we need to <strong>update the line of code that calls this function</strong> to avoid an unpacking error. This line is <strong>in the main code</strong>:</p>



<pre class="wp-block-code"><code class=""># Replace the load_models() code line call in the main code by the following one
stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline = load_models()</code></pre>



<p>As we discussed in the previous article, having <strong>more models makes the initialization </strong>of the speech to text app <strong>longer</strong>. You will notice this if you run the app.</p>



<p>➡️ This is why we now propose <strong>two ways to import the models</strong> in the previous function. </p>



<p>The first one, used by default, consists in looking for the models in a folder where we save all the models used by our app. If a model is found in this folder, we import it from disk instead of downloading it. This way, the app no longer depends on the download speed of the internet connection and is usable as soon as it is launched! </p>



<p>If a model is not found in this folder, we fall back to the second solution, which is the method we have used so far: <strong>downloading the models from their libraries</strong>. The drawback is that downloading all the models takes several minutes before the application can launch, which is quite frustrating.</p>
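<p>As an illustration, this folder-first logic can be factored into a small helper. This is only a sketch with a hypothetical <em>load_or_create()</em> name (and a cheap dictionary standing in for a real model), not the code of the app:</p>

```python
import os
import pickle

def load_or_create(path, factory):
    # Try to load a pickled object from disk; otherwise build it with
    # the factory callable and cache it for the next app start-up
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        obj = factory()
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        return obj

# Cheap stand-in object; in the app, factory would be something like
# lambda: Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
summarizer_cfg = load_or_create("models/summarizer_cfg.sav", lambda: {"task": "summarization"})
```

<p>The factory runs only when the cached file is absent; every later launch reads the pickle from disk.</p>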



<p>➡️ We will show you how you can save these models in a folder in the documentation that will help you to deploy your project on AI Deploy.</p>



<h3 class="wp-block-heading">Conclusion</h3>



<p class="has-text-align-left">Well done 🥳 ! Your application is now visually pleasing and offers more interactivity thanks to the download button and the audio player controls. You also managed to create a form that lets the user indicate which functionalities he wants to use!</p>



<p>➡️ Now it&#8217;s time to create these features and add them to our application! You can discover how in <a href="https://blog.ovhcloud.com/how-to-build-a-speech-to-text-application-with-python-3-3" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">the next article</a> 😉.</p>
]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2022/11/speech_to_text_app_click_button_issue.mp4" length="4867513" type="video/mp4" />
<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2022/07/click_button_streamlit_solve-3.mov" length="6829193" type="video/quicktime" />
<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2022/07/click_timestamp_btn.mov" length="3781979" type="video/quicktime" />

			</item>
		<item>
		<title>How to build a Speech-To-Text application with Python (1/3)</title>
		<link>https://blog.ovhcloud.com/how-to-build-a-speech-to-text-application-with-python-1-3/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Mon, 05 Dec 2022 09:06:15 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Apps]]></category>
		<category><![CDATA[AI Solutions]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[Streamlit]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=23229</guid>

					<description><![CDATA[A tutorial to create and build your own Speech-To-Text application with Python. At the end of this first article, your Speech-To-Text application will be able to receive an audio recording and will generate its transcript! Final code of the app is available in our dedicated GitHub repository. Overview of our final app Overview of our [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p style="font-size:16px"><em>A tutorial to create and build your own <strong>Speech-To-Text application</strong></em> with <em>Python</em>.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-1-1024x576.png" alt="speech to text app image1" class="wp-image-24058" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-1-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-1-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-1-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-1-1536x864.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-1.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>At the end of this first article, your Speech-To-Text application will be able to receive an audio recording and will generate its transcript! </p>



<p><em>Final code of the app is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/tree/main/apps/streamlit/speech-to-text" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.</em></p>



<h3 class="wp-block-heading">Overview of our final app</h3>



<figure class="wp-block-image aligncenter size-large is-style-default"><img loading="lazy" decoding="async" width="1024" height="575" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-1024x575.png" alt="Overview of our final Speech-To-Text application" class="wp-image-23277" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-1024x575.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-1536x863.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="has-text-align-center"><em>Overview of our final Speech-To-Text application</em></p>



<h3 class="wp-block-heading">Objective</h3>



<p style="font-size:16px">In the previous <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/natural-language-processing/speech-to-text/conda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">notebook </a><a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/natural-language-processing/speech-to-text/miniconda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">tutorials</a>, we have seen how to <strong>translate speech into text</strong>, how to <strong>punctuate</strong> the transcript and <strong>summarize</strong> it. We have also seen how to <strong>distinguish speakers</strong> and how to generate <strong>video subtitles</strong>, all the while managing potential<strong> memory problems</strong>.</p>



<p>Now that we know how to do all this, let&#8217;s combine all these features together into a <strong>Speech-To-Text application</strong> using <em>Python</em>!</p>



<p>➡ To create this app, we will use <a href="https://streamlit.io/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external"><em>Streamlit</em></a>, a Python framework that turns scripts into a shareable web application. If you don&#8217;t know this tool, don&#8217;t worry, it is very simple to use.</p>



<p>This article is organized as follows:</p>



<ul class="wp-block-list">
<li>Import code from previous tutorials</li>



<li>Write the Streamlit App</li>



<li>Run your app!</li>
</ul>



<p>In the following articles, we will see how to implement the more advanced features (diarization, summarization, punctuation, …), and we will also learn how to build and use a custom <em>Docker</em> image for a <em>Streamlit</em> application, which will allow us to deploy our app on <a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/getting-started/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">AI Deploy</a> !</p>



<p><em>⚠️ Since this article uses code already explained in the previous <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/natural-language-processing/speech-to-text/miniconda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">notebook tutorials</a>, we will not re-explain it in detail here. We therefore recommend that you read the notebooks first.</em></p>



<h3 class="wp-block-heading">Import code from previous tutorials</h3>



<h4 class="wp-block-heading">1. Set up the environment </h4>



<p>To start, let&#8217;s create our <em>Python</em> environment. To do this, <strong>create</strong> a file named <em>requirements.txt</em> and <strong>add</strong> the following text to it. This will allow us to specify each version of the libraries required by our Speech to text project. </p>



<pre class="wp-block-code"><code class="">librosa==0.9.1
youtube_dl==2021.12.17
streamlit==1.9.0
transformers==4.18.0
httplib2==0.20.2
torch==1.11.0
torchaudio==0.11.0
sentencepiece==0.1.96
tokenizers==0.12.1
pyannote.audio==2.1.1
pyannote.core==4.4
pydub==0.25.1</code></pre>



<p>Then, you can <strong>install all these elements</strong> in only one command. To do so, you just have to <strong>open a terminal</strong> and <strong>enter the following command</strong>:</p>



<pre class="wp-block-code"><code class="">pip install -r requirements.txt</code></pre>



<h4 class="wp-block-heading">2. Import libraries</h4>



<p>Once your environment is ready, <strong>create</strong> a file named <em>app.py</em> and <strong>import the required libraries</strong> we used in the <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/natural-language-processing/speech-to-text/miniconda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">notebooks</a>. </p>



<p>They will allow us to use the artificial intelligence models, manipulate audio files, handle timestamps, and more.</p>



<pre class="wp-block-code"><code class=""># Models
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2Tokenizer, HubertForCTC  # Wav2Vec2Tokenizer is needed by transcribe_audio_part()

# Audio Manipulation
import audioread
import librosa
from pydub import AudioSegment, silence
import youtube_dl
from youtube_dl import DownloadError

# Others
from datetime import timedelta
import os
import streamlit as st
import time</code></pre>



<h4 class="wp-block-heading">3. Functions</h4>



<p>We also need to reuse some previous functions; you will probably recognize some of them. </p>



<p>⚠️ <em>Reminder:</em> <em>All this code has been explained in the <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/natural-language-processing/speech-to-text/miniconda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">notebook tutorials</a>. </em>That&#8217;s why we will not re-explain it here.</p>



<p>To begin, let&#8217;s create the function that allows you to <strong>transcribe an audio chunk</strong>.</p>



<pre class="wp-block-code"><code class="">def transcribe_audio_part(filename, stt_model, stt_tokenizer, myaudio, sub_start, sub_end, index):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    try:
        with torch.no_grad():
            new_audio = myaudio[sub_start:sub_end]  # Works in milliseconds
            path = filename[:-3] + "audio_" + str(index) + ".mp3"
            new_audio.export(path)  # Exports to a mp3 file in the current path

            # Load audio file with librosa, set sound rate to 16000 Hz because the model we use was trained on 16000 Hz data
            input_audio, _ = librosa.load(path, sr=16000)

            # Return a PyTorch tensor instead of a list of Python integers, thanks to return_tensors="pt"
            input_values = stt_tokenizer(input_audio, return_tensors="pt").to(device).input_values

            # Get logits from the data structure containing all the information returned by the model and get our prediction
            logits = stt_model.to(device)(input_values).logits
            prediction = torch.argmax(logits, dim=-1)
           
            # Decode &amp; lower our string (model's output is only uppercase)
            if isinstance(stt_tokenizer, Wav2Vec2Tokenizer):
                transcription = stt_tokenizer.batch_decode(prediction)[0]
            elif isinstance(stt_tokenizer, Wav2Vec2Processor):
                transcription = stt_tokenizer.decode(prediction[0])

            # return transcription
            return transcription.lower()

    except audioread.NoBackendError:
        # Means we have a chunk with a [value1 : value2] case with value1&gt;value2
        st.error("Sorry, seems we have a problem on our side. Please change start &amp; end values.")
        time.sleep(3)
        st.stop()</code></pre>



<p>Then, create the four functions that allow <em>silence detection method</em>, which we have explained in the <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/speech-to-text/miniconda/basics/speech-to-text-basics.ipynb" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">first notebook tutorial</a>.</p>



<p><strong>Get the timestamps of the silences</strong></p>



<pre class="wp-block-code"><code class="">def detect_silences(audio):

    # Get Decibels (dB) so silences detection depends on the audio instead of a fixed value
    dbfs = audio.dBFS

    # Get silences timestamps &gt; 750ms
    silence_list = silence.detect_silence(audio, min_silence_len=750, silence_thresh=dbfs-14)

    return silence_list</code></pre>



<p><strong>Get the middle value of each timestamp</strong></p>



<pre class="wp-block-code"><code class="">def get_middle_silence_time(silence_list):

    length = len(silence_list)
    index = 0
    while index &lt; length:
        diff = (silence_list[index][1] - silence_list[index][0])
        if diff &lt; 3500:
            silence_list[index] = silence_list[index][0] + diff/2
            index += 1
        else:

            adapted_diff = 1500
            silence_list.insert(index+1, silence_list[index][1] - adapted_diff)
            silence_list[index] = silence_list[index][0] + adapted_diff

            length += 1
            index += 2

    return silence_list</code></pre>
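<p>As a quick sanity check, here is what this function produces on a small sample (a standalone sketch that re-declares the function above): a short silence is replaced by its middle point, while a long one is split into two cut points, 1500 ms from each edge:</p>

```python
def get_middle_silence_time(silence_list):
    length = len(silence_list)
    index = 0
    while index < length:
        diff = (silence_list[index][1] - silence_list[index][0])
        if diff < 3500:
            # Short silence: keep only its middle point
            silence_list[index] = silence_list[index][0] + diff / 2
            index += 1
        else:
            # Long silence: keep two points, 1500 ms from each edge
            adapted_diff = 1500
            silence_list.insert(index + 1, silence_list[index][1] - adapted_diff)
            silence_list[index] = silence_list[index][0] + adapted_diff
            length += 1
            index += 2
    return silence_list

# A 1 s silence and a 5 s silence (timestamps in milliseconds)
print(get_middle_silence_time([[1000, 2000], [5000, 10000]]))  # [1500.0, 6500, 8500]
```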



<p><strong>Create a regular distribution</strong>, which merges the timestamps according to a <em>min_space</em> and a <em>max_space</em> value.</p>



<pre class="wp-block-code"><code class="">def silences_distribution(silence_list, min_space, max_space, start, end, srt_token=False):

    # If starts != 0, we need to adjust end value since silences detection is performed on the trimmed/cut audio
    # (and not on the original audio) (ex: trim audio from 20s to 2m will be 0s to 1m40 = 2m-20s)

    # Shift the end according to the start value
    end -= start
    start = 0
    end *= 1000

    # Step 1 - Add start value
    newsilence = [start]

    # Step 2 - Create a regular distribution between start and the first element of silence_list, so we don't get a gap &gt; max_space and run out of memory
    # Example: newsilence = [0] and silence_list starts with 100000 =&gt; it would create a massive gap [0, 100000]

    if silence_list[0] - max_space &gt; newsilence[0]:
        for i in range(int(newsilence[0]), int(silence_list[0]), max_space):  # int bc float can't be in a range loop
            value = i + max_space
            if value &lt; silence_list[0]:
                newsilence.append(value)

    # Step 3 - Create a regular distribution until the last value of the silence_list
    min_desired_value = newsilence[-1]
    max_desired_value = newsilence[-1]
    nb_values = len(silence_list)

    while nb_values != 0:
        max_desired_value += max_space

        # Get a window of the values greater than min_desired_value and lower than max_desired_value
        silence_window = list(filter(lambda x: min_desired_value &lt; x &lt;= max_desired_value, silence_list))

        if silence_window != []:
            # Get the nearest value we can to min_desired_value or max_desired_value depending on srt_token
            if srt_token:
                nearest_value = min(silence_window, key=lambda x: abs(x - min_desired_value))
                nb_values -= silence_window.index(nearest_value) + 1  # (index begins at 0, so we add 1)
            else:
                nearest_value = min(silence_window, key=lambda x: abs(x - max_desired_value))
                # Max value index = len of the list
                nb_values -= len(silence_window)

            # Append the nearest value to our list
            newsilence.append(nearest_value)

        # If silence_window is empty we add the max_space value to the last one to create an automatic cut and avoid multiple audio cutting
        else:
            newsilence.append(newsilence[-1] + max_space)

        min_desired_value = newsilence[-1]
        max_desired_value = newsilence[-1]

    # Step 4 - Add the final value (end)

    if end - newsilence[-1] &gt; min_space:
        # Gap &gt; Min Space
        if end - newsilence[-1] &lt; max_space:
            newsilence.append(end)
        else:
            # Gap too important between the last list value and the end value
            # We need to create automatic max_space cut till the end
            newsilence = generate_regular_split_till_end(newsilence, end, min_space, max_space)
    else:
        # Gap &lt; Min Space &lt;=&gt; Final value and last value of new silence are too close, need to merge
        if len(newsilence) &gt;= 2:
            if end - newsilence[-2] &lt;= max_space:
                # Replace if gap is not too important
                newsilence[-1] = end
            else:
                newsilence.append(end)

        else:
            if end - newsilence[-1] &lt;= max_space:
                # Replace if gap is not too important
                newsilence[-1] = end
            else:
                newsilence.append(end)

    return newsilence</code></pre>



<p><strong>Add automatic &#8220;time cuts&#8221;</strong> to the silence list till end value depending on <em>min_space</em> and <em>max_space</em> values:</p>



<pre class="wp-block-code"><code class="">def generate_regular_split_till_end(time_list, end, min_space, max_space):

    # In range loop can't handle float values so we convert to int
    int_last_value = int(time_list[-1])
    int_end = int(end)

    # Add maxspace to the last list value and add this value to the list
    for i in range(int_last_value, int_end, max_space):
        value = i + max_space
        if value &lt; end:
            time_list.append(value)

    # Fix last automatic cut
    # If small gap (ex: 395 000, with end = 400 000)
    if end - time_list[-1] &lt; min_space:
        time_list[-1] = end
    else:
        # If important gap (ex: 311 000 then 356 000, with end = 400 000, can't replace and then have 311k to 400k)
        time_list.append(end)
    return time_list</code></pre>
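<p>For instance (a standalone sketch re-declaring the function above), splitting from 0 to 100 s with <em>min_space</em> = 25 s and <em>max_space</em> = 45 s adds cuts at 45 s and 90 s, and the last cut is then merged into the end value because the remaining gap is smaller than <em>min_space</em>:</p>

```python
def generate_regular_split_till_end(time_list, end, min_space, max_space):
    # range() can't handle float values, so convert to int
    int_last_value = int(time_list[-1])
    int_end = int(end)

    # Add max_space cuts until we reach the end value
    for i in range(int_last_value, int_end, max_space):
        value = i + max_space
        if value < end:
            time_list.append(value)

    # Merge or append the end value depending on the remaining gap
    if end - time_list[-1] < min_space:
        time_list[-1] = end
    else:
        time_list.append(end)
    return time_list

print(generate_regular_split_till_end([0], 100000, 25000, 45000))  # [0, 45000, 100000]
```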



<p>Create a function to <strong>clean the directory</strong> where we save the sounds and the audio chunks, so we do not keep them after transcribing:</p>



<pre class="wp-block-code"><code class="">def clean_directory(path):

    for file in os.listdir(path):
        os.remove(os.path.join(path, file))</code></pre>



<h3 class="wp-block-heading">Write the Streamlit application code</h3>



<h4 class="wp-block-heading">1. Configuration of the application</h4>



<p>Now that we have the basics, we can create the function that <strong>configures the app</strong>. It will give a <strong>title</strong> and an <strong>icon</strong> to our app, and will create a <em>data</em> <strong>directory</strong> so that the application can <strong>store sound files</strong> in it. Here is the function:</p>



<pre class="wp-block-code"><code class="">def config():

    st.set_page_config(page_title="Speech to Text", page_icon="📝")
    
    # Create a data directory to store our audio files
    # Will not be executed with AI Deploy because it is indicated in the DockerFile of the app
    if not os.path.exists("../data"):
        os.makedirs("../data")
    
    # Display Text and CSS
    st.title("Speech to Text App 📝")

    st.subheader("You want to extract text from an audio/video? You are in the right place!")
</code></pre>



<p>As you can see, this <em>data</em> directory is created in the parent directory (indicated by the <em>../</em> notation). It will <strong>only be created</strong> if the application is launched <strong>locally</strong> on your computer, since <em>AI Deploy</em> has this folder <strong>pre-created</strong>.</p>



<p>➡️ We recommend that you do <strong>not change the location of the <em>data</em> directory (../)</strong>. Indeed, this location makes it easy to juggle between running the application locally or on AI Deploy.</p>



<h4 class="wp-block-heading">2. Load the speech to text model</h4>



<p>Create the function that <strong>loads the speech to text model</strong>.</p>



<p>As we are starting out, we only import the transcription model for the moment. We will implement the other features in the following article 😉.</p>



<p>⚠️ <em>Here, the use case is English speech recognition, but you can do it in another language thanks to one of the many models available on the <a href="https://huggingface.co/models" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Hugging Face website</a></em>. In this case, just keep in mind that you won&#8217;t be able to combine it with some of the models we will use in the next article, since some of them only work on English transcripts.</p>



<pre class="wp-block-code"><code class="">@st.cache(allow_output_mutation=True)
def load_models():

    # Load Wav2Vec2 (Transcriber model)
    stt_model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")
    stt_tokenizer = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")

    return stt_tokenizer, stt_model</code></pre>



<p>We use a <em>@st.cache(allow_output_mutation=True)</em> decorator here. This tells <em>Streamlit</em> to run the function once and <strong>store the results in a local cache</strong>, so the next time the function is called (when the app refreshes), <em>Streamlit</em> knows it can <strong>skip executing it</strong>. Indeed, since the model(s) have already been imported once (at the initialization of the app), we should not waste time reloading them each time we want to transcribe a new file. </p>
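<p>The caching behaviour is similar in spirit to Python&#8217;s <em>functools.lru_cache</em>: the expensive body runs once, and later calls with the same arguments are served from the cache. A minimal sketch (plain Python, not <em>Streamlit</em> code, with a counter standing in for a slow model download):</p>

```python
import functools

calls = {"count": 0}

@functools.lru_cache(maxsize=None)
def load_model(name):
    # Pretend this is a slow download from the Hugging Face hub
    calls["count"] += 1
    return "model:" + name

load_model("hubert-large-ls960-ft")
load_model("hubert-large-ls960-ft")  # served from the cache, body not re-executed
print(calls["count"])  # 1
```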



<p>However, <strong>downloading the model </strong>when initializing the application<strong> takes time</strong> since it depends on certain factors such as our Internet connection. For one model, this is not a problem because the download time is still quite fast. But with all the models we plan to load in the next article, this initialization time may be longer, which would be frustrating 😪. </p>



<p>➡️ That&#8217;s why we will propose a way to <strong>solve this problem</strong> in a <strong>next blog post</strong>.</p>



<h4 class="wp-block-heading">3. Get an audio file</h4>



<p>Once we have loaded the model, we <strong>need an audio file</strong> to use it 🎵!</p>



<p>For this we will implement two features. The first one will allow the user to <strong>import his own audio file</strong>. The second one will allow him to <strong>indicate a video URL</strong> for which he wants to obtain the transcript.</p>



<h4 class="wp-block-heading">3.1. Allow the user to upload a file (mp3/mp4/wav)</h4>



<p>Let the user <strong>upload his own audio file</strong> thanks to a <em>st.file_uploader()</em> widget:</p>



<pre class="wp-block-code"><code class="">def transcript_from_file(stt_tokenizer, stt_model):

    uploaded_file = st.file_uploader("Upload your file! It can be a .mp3, .mp4 or .wav", type=["mp3", "mp4", "wav"])

    if uploaded_file is not None:
        # get name and launch transcription function
        filename = uploaded_file.name
        transcription(stt_tokenizer, stt_model, filename, uploaded_file)</code></pre>



<p>As you can see, if the <em>uploaded_file</em> variable is not None, which means the user has uploaded an audio file, we launch the transcription process by calling the <em>transcription()</em> function that we will create shortly.</p>



<h4 class="wp-block-heading">3.2. Transcribe a video from YouTube</h4>



<p>Create the function that <strong>downloads the audio from a valid YouTube link</strong>:</p>



<pre class="wp-block-code"><code class="">def extract_audio_from_yt_video(url):
    
    filename = "yt_download_" + url[-11:] + ".mp3"
    try:

        ydl_opts = {
            'format': 'bestaudio/best',
            'outtmpl': filename,
            'postprocessors': [{
                'key': 'FFmpegExtractAudio',
                'preferredcodec': 'mp3',
            }],
        }
        with st.spinner("We are extracting the audio from the video"):
            with youtube_dl.YoutubeDL(ydl_opts) as ydl:
                ydl.download([url])

    # Handle DownloadError: ERROR: unable to download video data: HTTP Error 403: Forbidden / happens sometimes
    except DownloadError:
        filename = None

    return filename</code></pre>



<p>⚠️ If you are not the administrator of your computer, this function may not work for local execution.</p>



<p>Then, we need to <strong>display an element that allows the user to indicate the URL</strong> they want to transcribe.</p>



<p>We can do it thanks to the <strong><em>st.text_input()</em> widget</strong>. The user will be able to <strong>type in the URL</strong> of the video that interests him. Then, we make a <strong>quick verification</strong>: if the entered link seems correct (contains the pattern of a YouTube link: &#8220;youtu&#8221;), we try to extract the audio from the URL&#8217;s video, and then transcribe it.</p>



<p>This is what the following function does:</p>



<pre class="wp-block-code"><code class="">def transcript_from_url(stt_tokenizer, stt_model):
    
    url = st.text_input("Enter the YouTube video URL then press Enter to confirm!")
    
    # If link seems correct, we try to transcribe
    if "youtu" in url:
        filename = extract_audio_from_yt_video(url)
        if filename is not None:
            transcription(stt_tokenizer, stt_model, filename)
        else:
            st.error("We were unable to extract the audio. Please verify your link, retry or choose another video")</code></pre>



<h4 class="wp-block-heading">4. Transcribe the audio file</h4>



<p>Now, we have to write the functions that <strong>link together</strong> most of those we have already defined.</p>



<p>To begin, we write the code of the <em><strong>init_transcription()</strong></em> function. It <strong>informs the user</strong> that the <strong>transcription</strong> of the audio file <strong>is starting</strong> and that it will transcribe the audio from <em>start</em> seconds to <em>end</em> seconds. For the moment, these values correspond to the temporal ends of the audio (0s and the audio length). So it is not really interesting, but it will be useful in the next episode 😌! </p>



<p>This function also initializes some variables. Among them, <em>srt_text</em> and <em>save_result</em> are variables that we will also use in the following article. Do not worry about them for now.</p>



<pre class="wp-block-code"><code class="">def init_transcription(start, end):
    
    st.write("Transcription between", start, "and", end, "seconds in progress.\n\n")
    txt_text = ""
    srt_text = ""
    save_result = []
    return txt_text, srt_text, save_result</code></pre>



<p>We have the functions that perform the <strong><em>silences detection</em> method</strong> and that <strong>transcribe an audio file</strong>. But now we need to <strong>link all these functions</strong>. The function <em><strong>transcription_non_diarization() </strong></em>will do it for us:</p>



<pre class="wp-block-code"><code class="">def transcription_non_diarization(filename, myaudio, start, end, srt_token, stt_model, stt_tokenizer, min_space, max_space, save_result, txt_text, srt_text):
    
    # get silences
    silence_list = detect_silences(myaudio)
    if silence_list != []:
        silence_list = get_middle_silence_time(silence_list)
        silence_list = silences_distribution(silence_list, min_space, max_space, start, end, srt_token)
    else:
        silence_list = generate_regular_split_till_end(silence_list, int(end), min_space, max_space)

    # Transcribe each audio chunk (from timestamp to timestamp) and display transcript
    for i in range(0, len(silence_list) - 1):
        sub_start = silence_list[i]
        sub_end = silence_list[i + 1]

        transcription = transcribe_audio_part(filename, stt_model, stt_tokenizer, myaudio, sub_start, sub_end, i)
        
        if transcription != "":
            save_result, txt_text, srt_text = display_transcription(transcription, save_result, txt_text, srt_text, sub_start, sub_end)

    return save_result, txt_text, srt_text</code></pre>



<p>You will notice that this function calls the <strong><em>display_transcription()</em> </strong>function, which displays the right elements according to the <strong>parameters chosen</strong> by the user.</p>



<p>For the moment, the <strong>display</strong> is<strong> basic</strong> since we have not yet added the <strong>user&#8217;s parameters</strong>. This is why we will modify this function in the next article, in order to be able to handle different display cases, depending on the selected parameters.</p>



<p>You can <strong>add it</strong> to your <em>app.py</em> file:</p>



<pre class="wp-block-code"><code class="">def display_transcription(transcription, save_result, txt_text, srt_text, sub_start, sub_end):

    temp_timestamps = str(timedelta(milliseconds=sub_start)).split(".")[0] + " --&gt; " + str(timedelta(milliseconds=sub_end)).split(".")[0] + "\n"        
    temp_list = [temp_timestamps, transcription, int(sub_start / 1000)]
    save_result.append(temp_list)
    st.write(temp_timestamps)    
    st.write(transcription + "\n\n")
    txt_text += transcription + " "  # Add a space so sentences from consecutive chunks are separated

    return save_result, txt_text, srt_text</code></pre>



<p>Once this is done, all you have to do is display all the elements and link them using the <strong><em>transcription()</em> </strong>function:</p>



<pre class="wp-block-code"><code class="">def transcription(stt_tokenizer, stt_model, filename, uploaded_file=None):

    # If the audio comes from the YouTube extracting mode, the audio is downloaded so the uploaded_file is
    # the same as the filename. We need to change the uploaded_file which is currently set to None
    if uploaded_file is None:
        uploaded_file = filename

    # Get audio length of the file(s)
    myaudio = AudioSegment.from_file(uploaded_file)
    audio_length = myaudio.duration_seconds
    
    # Display audio file
    st.audio(uploaded_file)

    # Is transcription possible
    if audio_length &gt; 0:
        
        # display a button so the user can launch the transcribe process
        transcript_btn = st.button("Transcribe")

        # if button is clicked
        if transcript_btn:

            # Transcribe process is running
            with st.spinner("We are transcribing your audio. Please wait"):

                # Init variables
                start = 0
                end = audio_length
                txt_text, srt_text, save_result = init_transcription(start, int(end))
                srt_token = False
                min_space = 25000
                max_space = 45000


                # Non Diarization Mode
                filename = "../data/" + filename
                
                # Transcribe process with Non Diarization Mode
                save_result, txt_text, srt_text = transcription_non_diarization(filename, myaudio, start, end, srt_token, stt_model, stt_tokenizer, min_space, max_space, save_result, txt_text, srt_text)

                # Delete files
                clean_directory("../data")  # clean folder that contains generated files

                # Display the final transcript
                if txt_text != "":
                    st.subheader("Final text is")
                    st.write(txt_text)

                else:
                    st.write("Transcription impossible, a problem occurred with your audio or your parameters, we apologize :(")

    else:
        st.error("Seems your audio is 0 s long, please change your file")
        time.sleep(3)
        st.stop()</code></pre>



<p>This large function acts as our <strong>main block of code</strong>: it gathers <strong>almost all the implemented functionalities</strong>.</p>



<p>First of all, it retrieves the length of the audio file and allows the user to play it with a <em>st.audio()</em>, a widget that displays an <strong>audio player</strong>. Then, if the audio length is greater than 0s and the user clicks on the &#8220;<em>Transcribe</em>&#8221; button, the transcription is launched.</p>



<p>The user knows that the <strong>code is running</strong> since all the script is placed in a <em>st.spinner()</em>, which is displayed as a <strong>loading spinner</strong> on the app. </p>



<p>In this code, we initialize some variables. For the moment, we set the <em>srt_token</em> to <em>False</em>, since we are not going to generate subtitles (we will do this in the next tutorials, as mentioned).</p>



<p>Then, the location of the audio file is indicated (remember, it is in our ../data directory). The transcription process then truly starts, as the <em>transcription_non_diarization()</em> function is called. The audio file is transcribed chunk by chunk, and the transcript is displayed part by part, with the corresponding timestamps.</p>
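<p>As a rough illustration of the chunking idea, here is a hypothetical simplification. The actual <em>generate_regular_split_till_end()</em> function, defined earlier in this tutorial, also takes detected silences and <em>min_space</em> into account:</p>

```python
def regular_split(start_ms: int, end_ms: int, max_space: int) -> list:
    # Hypothetical sketch: cut [start_ms, end_ms] into regular chunks of
    # at most max_space milliseconds, returning the chunk boundaries
    boundaries = list(range(start_ms, end_ms, max_space))
    if boundaries[-1] != end_ms:
        boundaries.append(end_ms)
    return boundaries

print(regular_split(0, 100000, 45000))  # [0, 45000, 90000, 100000]
```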



<p>Once finished, we can <strong>clean up the directory </strong>where all the chunks are located, and the final text is displayed.</p>
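<p>The <em>clean_directory()</em> helper is part of the tutorial&#8217;s utility code; here is a minimal sketch of what it might look like (hypothetical, the real implementation may differ):</p>

```python
import os

def clean_directory(directory: str) -> None:
    # Hypothetical sketch: remove every file in the folder that holds
    # the generated audio chunks, leaving the folder itself in place
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            os.remove(path)
```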



<h4 class="wp-block-heading">5. Main</h4>



<p>All that remains is to <strong>define the main</strong>, global architecture of our application.</p>



<p>We just need to create a <em>st.radio</em>() button widget so the user can <strong>choose</strong> either to transcribe their own file by <strong>importing</strong> it, or an external file by <strong>entering the URL</strong> of a video. Depending on the radio button value, we launch the right function (transcript from URL or from file).</p>



<pre class="wp-block-code"><code class="">if __name__ == '__main__':
    config()
    choice = st.radio("Features", ["By a video URL", "By uploading a file"]) 

    stt_tokenizer, stt_model = load_models()
    if choice == "By a video URL":
        transcript_from_url(stt_tokenizer, stt_model)

    elif choice == "By uploading a file":
        transcript_from_file(stt_tokenizer, stt_model)
</code></pre>



<h3 class="wp-block-heading">Run your app!</h3>



<p>We can already <strong>try our program</strong>! To launch it, enter the following command in your terminal. The <em>Streamlit</em> application will open in a tab of your browser.</p>



<pre class="wp-block-code"><code class="">streamlit run path_of_your_project/app.py</code></pre>






<p>⚠️⚠️ If this is the first time you have manipulated audio files on your computer, you <strong>may get some OSErrors</strong> related to the <em>libsndfile</em>, <em>ffprobe</em> and <em>ffmpeg</em> libraries.</p>



<p>Don&#8217;t worry, you can easily <strong>fix these errors</strong> by installing them. The command will be <strong>different depending on the OS</strong> you are using. For example, on <em>Linux</em>, you can use <em>apt-get</em>:</p>



<pre class="wp-block-code"><code class="">sudo apt-get install libsndfile-dev<br>sudo apt-get install ffmpeg</code></pre>



<p>If you have <a href="https://anaconda.org/anaconda/conda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external"><em>Conda</em></a> or <a href="https://docs.conda.io/en/latest/miniconda.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external"><em>Miniconda</em></a> installed on your OS, you can use:</p>



<pre class="wp-block-code"><code class="">conda install -c main ffmpeg</code></pre>






<p>If the application launches without error, congratulations 👏 ! You are now able to choose a YouTube video or import your own audio file into the application and get its transcript!</p>



<p>😪 Unfortunately, local resources may not be powerful enough to get a transcript in just a few seconds, which is quite frustrating.</p>



<p>➡️ To <strong>save time</strong>, <em>you can run your app on <strong>GPUs</strong> thanks to </em><strong>AI Deploy</strong><em>. To do this, please refer to this <a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/tuto-streamlit-speech-to-text-app/" data-wpel-link="exclude">documentation</a> to boot it up.</em></p>



<p>You can see what we have built on the following video:</p>



<figure class="wp-block-video"><video height="720" style="aspect-ratio: 1280 / 720;" width="1280" controls src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/Speech_To_Text_demo_HD_part_1.mp4"></video></figure>



<p class="has-text-align-center"><em>Quick demonstration of our Speech-To-Text application after completing this first tutorial</em></p>



<h3 class="wp-block-heading">Conclusion</h3>



<p>Well done 🥳 ! You are now able to import your own audio file on the app and get your first transcript!</p>



<p>You could be satisfied with that, but <strong>we can do so much better</strong>!</p>



<p>Indeed, our <em>Speech-To-Text</em> application is still <strong>very basic</strong>. We need to <strong>implement new functions</strong> like <em>speaker differentiation</em>, <em>transcript summarization</em>, or <em>punctuation</em>, as well as other <strong>essential functionalities</strong> like the possibility to <em>trim/cut an audio file</em>, <em>download the transcript</em>, <em>interact with the timestamps</em>, or <em>justify the text</em>.</p>



<p>➡️ If you want to <strong>improve</strong> your <em>Streamlit</em> application, <a href="https://blog.ovhcloud.com/how-to-build-a-speech-to-text-application-with-python-2-3/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">follow the next article</a> 😉.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-build-a-speech-to-text-application-with-python-1-3%2F&amp;action_name=How%20to%20build%20a%20Speech-To-Text%20application%20with%20Python%20%281%2F3%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2022/07/Speech_To_Text_demo_HD_part_1.mp4" length="9372511" type="video/mp4" />

			</item>
		<item>
		<title>Deploy a custom Docker image for Data Science project – Streamlit app for EDA and interactive prediction (Part 2)</title>
		<link>https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-streamlit-app-for-eda-and-interactive-prediction-part-2/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 11 Oct 2022 07:38:35 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Solutions]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[EDA]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[Streamlit]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=23479</guid>

					<description><![CDATA[A guide to deploy a custom Docker image for a Streamlit app with AI Deploy. Welcome to the second article concerning custom Docker image deployment. If you haven&#8217;t read the previous one, you can read it on the following link. It was about Gradio and sketch recognition. When creating code for a Data Science project, [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdeploy-a-custom-docker-image-for-data-science-project-streamlit-app-for-eda-and-interactive-prediction-part-2%2F&amp;action_name=Deploy%20a%20custom%20Docker%20image%20for%20Data%20Science%20project%20%E2%80%93%20Streamlit%20app%20for%20EDA%20and%20interactive%20prediction%20%28Part%202%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>A guide to deploy a custom Docker image for a <a href="https://streamlit.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Streamlit</a> app with <strong>AI Deploy</strong>.</em></p>



<figure class="wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-2 is-layout-flex wp-block-gallery-is-layout-flex">
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="817" data-id="23517" src="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image3-1024x817.jpeg" alt="streamlit app for eda and interactive prediction" class="wp-image-23517" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image3-1024x817.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image3-300x239.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image3-768x613.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image3-1536x1225.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image3.jpeg 1620w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</figure>



<p><em>Welcome to the second article concerning <strong>custom Docker image deployment</strong>. If you haven&#8217;t read the previous one, you can read it on the following <a href="https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-gradio-sketch-recognition-app-part-1/" data-wpel-link="internal">link</a>. It was about <a href="https://gradio.app/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gradio</a> and sketch recognition.</em></p>



<p>When creating code for a <strong>Data Science project</strong>, you probably want it to be as portable as possible. In other words, you want it to run as many times as you like, even on different machines.</p>



<p>Unfortunately, it is often the case that Data Science code works fine locally on one machine but throws errors at runtime on another. This can be due to different versions of the libraries installed on the host machine.</p>



<p>To deal with this problem, you can use <a href="https://www.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker</a>.</p>



<p><strong>The article is organized as follows:</strong></p>



<ul class="wp-block-list">
<li>Objectives</li>



<li>Concepts</li>



<li>Load the trained PyTorch model </li>



<li>Build the Streamlit app with Python</li>



<li>Containerize your app with Docker</li>



<li>Launch the app with AI Deploy</li>
</ul>



<p><em>All the code for this blogpost is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/tree/main/apps/streamlit/eda-classification-iris" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>. You can test it with OVHcloud <strong>AI Deploy</strong> tool, please refer to the <a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/tuto-streamlit-eda-iris/" data-wpel-link="exclude">documentation</a> to boot it up.</em></p>



<h2 class="wp-block-heading">Objectives</h2>



<p>In this article, you will learn how to develop a Streamlit app for two Data Science tasks: Exploratory Data&nbsp;Analysis (<strong>EDA</strong>) and prediction based on a Machine Learning model.</p>



<p>Once your app is up and running locally, it will be a matter of containerizing it, then deploying the custom Docker image with AI Deploy.</p>



<figure class="wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-3 is-layout-flex wp-block-gallery-is-layout-flex">
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="466" data-id="23521" src="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image4-1024x466.jpeg" alt="objective of streamlit app deployment" class="wp-image-23521" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image4-1024x466.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image4-300x137.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image4-768x350.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image4-1536x700.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image4.jpeg 1620w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</figure>



<h2 class="wp-block-heading">Concepts</h2>



<p>In Artificial Intelligence, you probably hear about the famous use case of the <a href="https://archive.ics.uci.edu/ml/datasets/iris" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Iris dataset</a>. <strong>How about learning more about the iris dataset?</strong></p>



<h3 class="wp-block-heading">Iris dataset</h3>



<p><strong>Iris Flower Dataset</strong> is considered the <em>Hello World</em> of Data Science. The <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Iris Flower Dataset</a> contains <strong>four features</strong> (length and width of sepals and petals) for <strong>50 samples</strong> of each of <strong>three species</strong> of Iris:</p>



<ul class="wp-block-list">
<li>Iris setosa</li>



<li>Iris virginica</li>



<li>Iris versicolor</li>
</ul>



<p>The dataset is in <code>csv</code> format and you can also find it directly as a <code>dataframe</code>. It contains five columns, namely:</p>



<ul class="wp-block-list">
<li>Petal length</li>



<li>Petal width</li>



<li>Sepal length</li>



<li>Sepal width</li>



<li>Species type</li>
</ul>



<p>The objective of the models based on this dataset is to classify the three <strong>Iris species</strong>. The measurements of petals and sepals are used to create, for example, a <strong>linear discriminant model</strong> to classify species.</p>
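<p>To make the tabular structure concrete, here is a tiny sketch of the csv layout (only three illustrative rows, one per species; the column labels are simplified compared with the sklearn version):</p>

```python
import io

import pandas as pd

# Three illustrative rows, one per species, in the csv layout described above
sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
    "6.3,3.3,6.0,2.5,virginica\n"
)
df = pd.read_csv(sample)
print(df.shape)  # (3, 5)
```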



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image0-1024x864.jpeg" alt="iris dataset" class="wp-image-23522" width="646" height="545" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image0-1024x864.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image0-300x253.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image0-768x648.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image0-1536x1297.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image0.jpeg 1592w" sizes="auto, (max-width: 646px) 100vw, 646px" /></figure>



<p>❗ <code><strong>A model to classify Iris species was trained in a previous tutorial, in notebook form, which you can find and test <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/getting-started/pytorch/notebook_classification_iris.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></code></p>



<p>This model is registered in an OVHcloud&nbsp;<a href="https://docs.ovh.com/gb/en/publiccloud/ai/cli/data-cli/" data-wpel-link="exclude">Object Storage container</a>.</p>



<p>In this article, the first objective is to create an app for Exploratory Data&nbsp;Analysis (<strong>EDA</strong>). Then you will see how to obtain interactive prediction.</p>



<h3 class="wp-block-heading">EDA</h3>



<p><strong>What is EDA in Data Science?</strong></p>



<p><a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Exploratory Data Analysis</a> (<strong>EDA</strong>) is a technique to analyze data with visual techniques. In this way, you have detailed information about the statistical summary of the data. </p>



<p>In addition, <strong>EDA</strong> allows duplicate values and outliers to be dealt with, and reveals certain trends or patterns present in the dataset.</p>
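<p>For instance, a first numerical pass of EDA with pandas might look like this (toy dataframe, for illustration only):</p>

```python
import pandas as pd

# Toy dataframe: a statistical summary and a duplicate check are typically
# the first EDA steps, before moving on to visual techniques
df = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0, 1.4],
    "petal_width":  [0.2, 1.4, 2.5, 0.2],
})
print(df.describe())          # count, mean, std, min, quartiles, max
print(df.duplicated().sum())  # 1: the last row repeats the first
```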



<p>For Iris dataset, the aim is to observe the source data on visual graphs using the <strong>Streamlit</strong> tool.</p>



<h3 class="wp-block-heading">Streamlit</h3>



<p><strong>What is Streamlit?</strong></p>



<p><a href="https://streamlit.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Streamlit</a>&nbsp;allows you to transform data scripts into quickly shareable web applications using only the <strong>Python</strong> language. Moreover, this framework does not require front-end skills.</p>



<p>This is a time saver for the data scientist who wants to deploy an app around the world of data!</p>



<p>To make this app accessible, you need to containerize it using&nbsp;<strong>Docker</strong>.</p>



<h3 class="wp-block-heading">Docker</h3>



<p><a href="https://www.docker.com/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Docker</a>&nbsp;platform allows you to build, run and manage isolated applications. The principle is to build an application that contains not only the written code but also all the context needed to run the code: libraries and their versions, for example.</p>



<p>When you wrap your application with all its context, you build a Docker image, which can be saved in your local repository or on Docker Hub.</p>



<p>To get started with Docker, please, check this&nbsp;<a href="https://www.docker.com/get-started" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">documentation</a>.</p>



<p>To build a Docker image, you will define two elements:</p>



<ul class="wp-block-list">
<li>the application code (<em>Streamlit app</em>)</li>



<li>the&nbsp;<a href="https://docs.docker.com/engine/reference/builder/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Dockerfile</a></li>
</ul>



<p>In the next steps, you will see how to develop the Python code for your app, but also how to write the Dockerfile.</p>



<p>Finally, you will see how to deploy your custom docker image with&nbsp;<strong>OVHcloud AI Deploy</strong>&nbsp;tool.</p>



<h3 class="wp-block-heading">AI Deploy</h3>



<p><strong>AI Deploy</strong>&nbsp;enables AI models and managed applications to be started via Docker containers.</p>



<p>To know more about AI Deploy, please refer to this&nbsp;<a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/getting-started/" data-wpel-link="exclude">documentation</a>.</p>



<h2 class="wp-block-heading">Load the trained PyTorch model </h2>



<p>❗ <strong><code>To develop an app that uses a machine learning model, you must first load the model in the correct format. For this tutorial, a PyTorch model is used and the Python file utils.py is used to load it</code></strong>.</p>



<p>The first step is to import the&nbsp;<strong>Python libraries</strong>&nbsp;needed to load a PyTorch model in the <code>utils.py</code> file.</p>



<pre class="wp-block-code"><code class="">import torch
import torch.nn as nn
import torch.nn.functional as F</code></pre>



<p>To load your <strong>PyTorch model</strong>, it is first necessary to define its model architecture by using the <code>Model</code> class defined previously in the part &#8220;<em>Step 2 &#8211; Define the neural network model</em>&#8221; of the <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/getting-started/pytorch/notebook_classification_iris.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">notebook</a>.</p>



<pre class="wp-block-code"><code class="">class Model(nn.Module):
    def __init__(self):

        super().__init__()
        self.layer1 = nn.Linear(in_features=4, out_features=16)
        self.layer2 = nn.Linear(in_features=16, out_features=12)
        self.output = nn.Linear(in_features=12, out_features=3)

    def forward(self, x):

        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        x = self.output(x)

        return x</code></pre>



<p>In a second step, you fill in the access path to the model. To save this model in <code>pth</code> format, refer to the part &#8220;<em>Step 6 &#8211; Save the model for future inference</em>&#8221; of the <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/getting-started/pytorch/notebook_classification_iris.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">notebook</a>.</p>



<pre class="wp-block-code"><code class="">path = "model_iris_classification.pth"</code></pre>



<p>Then, the <code>load_checkpoint</code> function is used to load the model&#8217;s checkpoint.</p>



<pre class="wp-block-code"><code class="">def load_checkpoint(path):

    model = Model()
    print("Model display: ", model)
    model.load_state_dict(torch.load(path))
    model.eval()

    return model</code></pre>



<p>Finally, the <code>load_model</code> function loads the model and uses it to obtain the result of the prediction.</p>



<pre class="wp-block-code"><code class="">def load_model(X_tensor):

    model = load_checkpoint(path)
    predict_out = model(X_tensor)
    _, predict_y = torch.max(predict_out, 1)

    return predict_out.squeeze().detach().numpy(), predict_y.item()</code></pre>



<p>Find the full Python code <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/streamlit/eda-classification-iris/utils.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</p>



<p>Have you successfully loaded your model? Good job 🥳 !</p>



<p>Let&#8217;s go for the creation of the Streamlit app!</p>



<h2 class="wp-block-heading">Build the Streamlit app with Python </h2>



<p>❗ <code><strong>All the codes below are available in the <em>app.py</em> file. The key functions are explained in this article.<br>However, the "<em>main</em>" part of the <em>app.py</em> file is not described. You can find the complete Python code of the <em>app.py</em> file <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/streamlit/eda-classification-iris/app.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></code></p>



<p>To begin, you can import the dependencies for the Streamlit app.</p>



<ul class="wp-block-list">
<li>Numpy</li>



<li>Pandas</li>



<li>Seaborn</li>



<li><code>load_model</code> function from utils.py</li>



<li>Torch</li>



<li>Streamlit</li>



<li>Scikit-Learn</li>



<li>Plotly</li>



<li>PIL</li>
</ul>



<pre class="wp-block-code"><code class="">import numpy as np
import pandas as pd
import seaborn as sns
from utils import load_model
import torch
import streamlit as st
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import plotly.graph_objects as go
import plotly.express as px
from PIL import Image</code></pre>



<p>Then, you must load the source dataset of <strong>Iris flowers</strong> to be able to extract its characteristics and thus visualize the data. Scikit-Learn allows you to load this dataset without having to download it!</p>



<p>Next, you can separate the dataset in an <strong>input dataframe</strong> and an <strong>output dataframe</strong>.</p>



<p>Finally, this <code>load_data</code> function is cached so that you don&#8217;t have to download the dataset again.</p>



<pre class="wp-block-code"><code class="">@st.cache
def load_data():
    dataset_iris = load_iris()
    df_inputs = pd.DataFrame(dataset_iris.data, columns=dataset_iris.feature_names)
    df_output = pd.DataFrame(dataset_iris.target, columns=['variety'])

    return df_inputs, df_output</code></pre>
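<p>The <code>@st.cache</code> decorator is Streamlit-specific (it hashes the function&#8217;s inputs and watches the output for mutations), but the underlying idea is plain memoization, comparable to <code>functools.lru_cache</code>:</p>

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=None)
def load_data_cached() -> str:
    # The body runs only on the first call; subsequent calls return
    # the stored result instead of re-downloading the dataset
    calls["count"] += 1
    return "dataset"

load_data_cached()
load_data_cached()
print(calls["count"])  # 1
```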



<p>The creation of this Streamlit app is separated into two parts.</p>



<p>Firstly, you can look into the creation of the EDA part. Then you will see how to create an interactive prediction tool using the PyTorch model.</p>



<h3 class="wp-block-heading">EDA on Iris Dataset</h3>



<p>As a first step, you can look at the source dataset by displaying different graphs using the Python <strong>Seaborn</strong> library.</p>



<p><strong>Seaborn Pairplot</strong> allows you to get the relationship between each pair of variables present in a <strong>Pandas</strong> dataframe.</p>



<p><code>sns.pairplot</code> plots the pairwise relationships between several features in a grid format.</p>



<pre class="wp-block-code"><code class="">@st.cache(allow_output_mutation=True)
def data_visualization(df_inputs, df_output):

    df = pd.concat([df_inputs, df_output['variety']], axis=1)
    eda = sns.pairplot(data=df, hue="variety", palette=['#0D0888', '#CB4779', '#F0F922'])

    return eda</code></pre>



<p>Later, this function will display the following graph thanks to a call in the &#8220;<code><em>main</em></code>&#8221; part of the <code><a href="https://github.com/ovh/ai-training-examples/blob/main/apps/streamlit/eda-classification-iris/app.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">app.py</a></code> file.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-1024x956.png" alt="iris data visualization / eda with sns.pairplot" class="wp-image-23487" width="756" height="706" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-1024x956.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-300x280.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-768x717.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image.png 1460w" sizes="auto, (max-width: 756px) 100vw, 756px" /></figure>



<p>Here it can be seen that the&nbsp;<code><strong>setosa</strong> 0</code>&nbsp;variety is easily separated from the other two (<code><strong>versicolor</strong>&nbsp;1</code>&nbsp;and&nbsp;<code><strong>virginica</strong>&nbsp;2</code>).</p>



<p>Were you able to display your graph? Well done 🎉 !</p>



<p>So, let&#8217;s go to the <strong>interactive prediction</strong> tool 🔜 !</p>



<h3 class="wp-block-heading">Create an interactive prediction tool</h3>



<p>To create an interactive prediction tool, you will need several elements:</p>



<ul class="wp-block-list">
<li>Firstly, you need <strong>four sliders</strong> to play with the input parameters</li>



<li>Secondly, you have to create a function to display the <strong>Principal Component Analysis</strong> (<strong>PCA</strong>) graph to visualize the point corresponding to the output of the model</li>



<li>Thirdly, you can build a <strong>histogram</strong> representing the result of the prediction</li>



<li>Fourthly, you will have a function to <strong>display the image</strong> of the predicted Iris species</li>
</ul>



<p>Ready to go? Let&#8217;s start creating <strong>sliders</strong>!</p>



<h4 class="wp-block-heading">Create a sidebar with sliders for input data</h4>



<p>In order to facilitate the visual reading of the Streamlit app, sliders are added in a <strong>sidebar</strong>.</p>



<p>In this sidebar, four sliders are added so that users can choose the length and width of petals and sepals.</p>



<p><strong>How to create a slider?</strong> Well, nothing could be easier than with Streamlit!</p>



<p>You need to call the <code>st.sidebar.slider()</code> function to <strong>add a slider to the sidebar</strong>. Then you can specify arguments such as the <strong>minimum</strong> and <strong>maximum</strong> values, or the average value, which will be the default value. Finally, you can specify the <strong>step</strong> of your slider.</p>



<p>❗ <code><strong>Here you can see the example for a single slider. Find the complete code of the other sliders on the GitHub repo <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/streamlit/eda-classification-iris/app.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></code></p>



<pre class="wp-block-code"><code class="">def create_slider(df_inputs):

    sepal_length = st.sidebar.slider(
        label='Sepal Length',
        min_value=float(df_inputs['sepal length (cm)'].min()),
        max_value=float(df_inputs['sepal length (cm)'].max()),
        value=float(round(df_inputs['sepal length (cm)'].mean(), 1)),
        step=0.1)

    sepal_width = st.sidebar.slider(
        ...
        )

    petal_length = st.sidebar.slider(
        ...
        )

    petal_width = st.sidebar.slider(
        ...
        )

    return sepal_length, sepal_width, petal_length, petal_width</code></pre>



<p>Later, this function will be called in the &#8220;<code><em>main</em></code>&#8221; of the <code><a href="https://github.com/ovh/ai-training-examples/blob/main/apps/streamlit/eda-classification-iris/app.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">app.py</a></code> file. Afterwards, you will see the following interface:</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-1.png" alt="Streamlit sidebar with sliders" class="wp-image-23493" width="625" height="665"/></figure>



<p>Thanks to these <strong>sliders</strong>, you can now obtain the result of the prediction in an interactive way by playing with <strong>one or more parameters</strong>.</p>



<h4 class="wp-block-heading">Display PCA graph</h4>



<p>Once your sliders are up and running, you can create a function to display the graph of the <strong>Principal Component Analysis</strong> (<strong>PCA</strong>).</p>



<p><strong>PCA</strong> is a technique that transforms <strong>high-dimensional</strong> data into <strong>lower dimensions</strong> while retaining as much information as possible.</p>
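<p>As a quick, self-contained illustration (using scikit-learn's bundled copy of the Iris data as a stand-in for the app's dataframe), two principal components already retain most of the variance of the four measurements:</p>

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# project the four iris measurements down to two principal components
X = load_iris().data                      # shape (150, 4)
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)                   # shape (150, 2)
kept = pca.explained_variance_ratio_.sum()  # fraction of variance retained
print(X_2d.shape, round(kept, 3))
```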



<p><strong>What about the Iris dataset?</strong> The aim is to be able to display the point resulting from the model prediction on a<strong> two-dimensional graph</strong>.</p>



<p>The <code>run_pca</code> function below displays the irises of the source dataset on a <strong>two-dimensional</strong> graph.</p>



<pre class="wp-block-code"><code class="">@st.cache
def run_pca():

    # reduce the four input features to two principal components
    pca = PCA(2)
    X = df_inputs.iloc[:, :4]
    df_pca = pd.DataFrame(pca.fit_transform(X), columns=['PC1', 'PC2'])
    df_pca = pd.concat([df_pca, df_output['variety']], axis=1)

    return pca, df_pca</code></pre>



<p>Thereafter, the black point corresponding to the result of the prediction is placed on the same graph in the &#8220;<code><em>main</em></code>&#8221; of the Python <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/streamlit/eda-classification-iris/app.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code>app.py</code></a> file.</p>
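<p>The projection of that black point works the same way: the four slider values form a single sample, which <code>pca.transform</code> maps into the (PC1, PC2) plane. A minimal sketch, again using scikit-learn's bundled Iris data and made-up slider values:</p>

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# fit PCA on the source dataset, as run_pca does
pca = PCA(2).fit(load_iris().data)

# hypothetical slider values: sepal length/width, petal length/width (cm)
sample = np.array([[5.1, 3.5, 1.4, 0.2]])
point = pca.transform(sample)   # 2-D coordinates of the predicted point
print(point.shape)              # (1, 2)
```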



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="700" height="450" src="https://blog.ovhcloud.com/wp-content/uploads/2022/10/newplot-1-1.png" alt="Principal Component Analysis (PCA) Iris dataset" class="wp-image-23498" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/10/newplot-1-1.png 700w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/newplot-1-1-300x193.png 300w" sizes="auto, (max-width: 700px) 100vw, 700px" /></figure>



<p>With this method you can visualize your point in space. However, the numerical result of the prediction is not displayed.</p>



<p>Therefore, you can also display the results as a histogram.</p>



<h4 class="wp-block-heading">Return predictions histogram</h4>



<p>At the output of the neural network, the results can be <strong>positive or negative</strong> and the highest value corresponds to the iris species predicted by the model.</p>



<p>To create a histogram, negative values can be removed. To do this, the predictions with <strong>positive values</strong> are extracted and sent to a list before being transformed into a dataframe.</p>



<p>The negative values are all replaced by zero.</p>



<p>To summarize, the <code>extract_positive_value</code> function can be translated into the following mathematical formula: <br><code>f(prediction) = max(0, prediction)</code></p>



<pre class="wp-block-code"><code class="">def extract_positive_value(prediction):

    prediction_positive = []
    for p in prediction:
        if p &lt; 0:
            p = 0
        prediction_positive.append(p)

    return pd.DataFrame({'Species': ['Setosa', 'Versicolor', 'Virginica'], 'Confidence': prediction_positive})</code></pre>
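<p>For example, calling the function on a hypothetical raw model output (the values below are made up for illustration):</p>

```python
import pandas as pd

def extract_positive_value(prediction):
    # f(p) = max(0, p): clamp negative scores to zero
    return pd.DataFrame({'Species': ['Setosa', 'Versicolor', 'Virginica'],
                         'Confidence': [max(0, p) for p in prediction]})

df = extract_positive_value([2.3, -1.1, 0.4])
print(df['Confidence'].tolist())   # [2.3, 0.0, 0.4]
```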



<p>This function is then called to build the histogram in the &#8220;<code><em>main</em></code>&#8221; of the Python <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/streamlit/eda-classification-iris/app.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code>app.py</code></a> file. The <code>plotly</code> library is used to build this <strong>bar chart</strong> as follows.</p>



<pre class="wp-block-code"><code class="">fig = px.bar(extract_positive_value(prediction), x='Species', y='Confidence', width=400, height=400, color='Species', color_discrete_sequence=['#0D0888', '#CB4779', '#F0F922'])</code></pre>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/10/newplot-2.png" alt="Histogram prediction iris species" class="wp-image-23499" width="388" height="388" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/10/newplot-2.png 400w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/newplot-2-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/newplot-2-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/newplot-2-70x70.png 70w" sizes="auto, (max-width: 388px) 100vw, 388px" /></figure>



<h4 class="wp-block-heading">Show Iris species image</h4>



<p>The final step is to display the predicted iris image using a <strong>Streamlit button</strong>. Therefore, you can define the <code>display_img</code> function to select the correct image based on the prediction.</p>



<pre class="wp-block-code"><code class="">def display_img(species):

    list_img = ['setosa.png', 'versicolor.png', 'virginica.png']

    return Image.open(list_img[species])</code></pre>
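<p>The <code>species</code> argument is presumably the index of the highest model score (the argmax); a hypothetical sketch of the mapping, with made-up prediction values:</p>

```python
import numpy as np

prediction = [2.3, -1.1, 0.4]             # hypothetical model output
species = int(np.argmax(prediction))      # index of the winning class
list_img = ['setosa.png', 'versicolor.png', 'virginica.png']
print(list_img[species])                  # setosa.png
```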



<p>Finally, in the main Python code <code><a href="https://github.com/ovh/ai-training-examples/blob/main/apps/streamlit/eda-classification-iris/app.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">app.py</a></code>, <code>st.image()</code> displays the image when the user requests it by pressing the &#8220;<code>Show flower image</code>&#8221; button.</p>



<pre class="wp-block-code"><code class="">if st.button('Show flower image'):
    st.image(display_img(species), width=300)
    st.write(df_pred.iloc[species, 0])</code></pre>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="347" height="327" src="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-2.png" alt="Streamlit button and image displayed" class="wp-image-23500" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-2.png 347w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-2-300x283.png 300w" sizes="auto, (max-width: 347px) 100vw, 347px" /></figure>



<p><code><strong>❗ Again, you can find the full code <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/streamlit/eda-classification-iris/app.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a></strong></code>.</p>



<p>Before deploying your Streamlit app, you can test it locally using the following command:</p>



<pre class="wp-block-code"><code class="">streamlit run app.py</code></pre>



<p>Then, you can test your app locally at the following address:&nbsp;<strong>http://localhost:8501/</strong></p>



<p>Your app works locally? Congratulations&nbsp;🎉 !</p>



<p>Now it’s time to move on to containerization!</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="883" height="975" src="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-5.png" alt="overview streamlit app for eda and prediction on iris data" class="wp-image-23508" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-5.png 883w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-5-272x300.png 272w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-5-768x848.png 768w" sizes="auto, (max-width: 883px) 100vw, 883px" /></figure>



<h2 class="wp-block-heading">Containerize your app with Docker</h2>



<p>First of all, you have to create the file that lists the Python modules to be installed, with their corresponding versions.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="545" src="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image5-1024x545.jpeg" alt="docker image data science" class="wp-image-23518" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image5-1024x545.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image5-300x160.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image5-768x409.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image5-1536x818.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image5.jpeg 1591w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Create the requirements.txt file</h3>



<p>The&nbsp;<code><a href="https://github.com/ovh/ai-training-examples/blob/main/apps/streamlit/eda-classification-iris/requirements.txt" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">requirements.txt</a></code>&nbsp;file lists all the modules needed to make our application work.</p>



<pre class="wp-block-code"><code class="">pandas==1.4.4
numpy==1.23.2
torch==1.12.1
streamlit==1.12.2
scikit-learn==1.1.2
plotly==5.10.0
Pillow==9.2.0
seaborn==0.12.0</code></pre>



<p>This file will be useful when writing the&nbsp;<code>Dockerfile</code>.</p>



<h3 class="wp-block-heading">Write the Dockerfile</h3>



<p>Your&nbsp;<code><a href="https://github.com/ovh/ai-training-examples/blob/main/apps/streamlit/eda-classification-iris/Dockerfile" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Dockerfile</a></code>&nbsp;should start with the&nbsp;<code>FROM</code>&nbsp;instruction indicating the parent image to use. In our case, we choose to start from a classic Python image.</p>



<p>For this Streamlit app, you can use version&nbsp;<strong>3.8</strong>&nbsp;of Python.</p>



<pre class="wp-block-code"><code class="">FROM python:3.8</code></pre>



<p>Next, you have to specify the working directory and copy all the files into it.</p>



<p><code><strong>❗&nbsp;Here you must be in the /workspace directory. This is the base directory for launching an OVHcloud AI Deploy app.</strong></code></p>



<pre class="wp-block-code"><code class="">WORKDIR /workspace
ADD . /workspace</code></pre>



<p>Install the Python modules listed in your&nbsp;<code>requirements.txt</code>&nbsp;file with a&nbsp;<code>pip install</code>&nbsp;command:</p>



<pre class="wp-block-code"><code class="">RUN pip install -r requirements.txt</code></pre>



<p>Then, you can give the correct access rights to the OVHcloud user (<code>42420:42420</code>).</p>



<pre class="wp-block-code"><code class="">RUN chown -R 42420:42420 /workspace
ENV HOME=/workspace</code></pre>



<p>Finally, you have to define your default launching command to start the application.</p>



<pre class="wp-block-code"><code class="">CMD [ "streamlit", "run", "/workspace/app.py", "--server.address=0.0.0.0" ]</code></pre>



<p>Once your&nbsp;<code>Dockerfile</code>&nbsp;is defined, you will be able to build your custom docker image.</p>



<h3 class="wp-block-heading">Build the Docker image from the Dockerfile</h3>



<p>First, you can launch the following command from the&nbsp;<code>Dockerfile</code>&nbsp;directory to build your application image.</p>



<pre class="wp-block-code"><code class="">docker build . -t streamlit-eda-iris:latest</code></pre>



<p>⚠️&nbsp;<strong><code>The dot . argument indicates that your build context (the location of the Dockerfile and other needed files) is the current directory.</code></strong></p>



<p>⚠️&nbsp;<code><strong>The -t argument allows you to choose the identifier to give to your image. Usually image identifiers are composed of a name and a version tag &lt;name&gt;:&lt;version&gt;. For this example we chose streamlit-eda-iris:latest.</strong></code></p>



<h3 class="wp-block-heading">Test it locally</h3>



<p>Now, you can run the following&nbsp;<strong>Docker command</strong>&nbsp;to launch your application locally on your computer.</p>



<pre class="wp-block-code"><code class="">docker run --rm -it -p 8501:8501 --user=42420:42420 streamlit-eda-iris:latest</code></pre>



<p>⚠️&nbsp;<code><strong>The -p 8501:8501 argument indicates that you want to execute a port redirection from the port 8501 of your local machine into the port 8501 of the Docker container.</strong></code></p>



<p>⚠️<code><strong>&nbsp;Don't forget the --user=42420:42420 argument if you want to simulate the exact same behaviour that will occur on AI Deploy. It executes the Docker container as the specific OVHcloud user (user 42420:42420).</strong></code></p>



<p>Once started, your application should be available on&nbsp;<strong>http://localhost:8501</strong>.<br><br>Your Docker image seems to work? Good job&nbsp;👍 !<br><br>It’s time to push it and deploy it!</p>



<h3 class="wp-block-heading">Push the image into the shared registry</h3>



<p>❗&nbsp;The shared registry of AI Deploy should only be used for testing purposes. Please consider attaching your own Docker registry. More information about this can be found&nbsp;<a href="https://docs.ovh.com/asia/en/publiccloud/ai/training/add-private-registry/" data-wpel-link="exclude">here</a>.</p>



<p>Then, you have to find the address of your&nbsp;<code>shared registry</code>&nbsp;by launching this command.</p>



<pre class="wp-block-code"><code class="">ovhai registry list</code></pre>



<p>Next, log in on the shared registry with your usual&nbsp;<code>OpenStack</code>&nbsp;credentials.</p>



<pre class="wp-block-code"><code class="">docker login -u &lt;user&gt; -p &lt;password&gt; &lt;shared-registry-address&gt;</code></pre>



<p>To finish, you need to push the created image into the shared registry.</p>



<pre class="wp-block-code"><code class="">docker tag streamlit-eda-iris:latest &lt;shared-registry-address&gt;/streamlit-eda-iris:latest
docker push &lt;shared-registry-address&gt;/streamlit-eda-iris:latest</code></pre>



<p>Once you have pushed your custom docker image into the shared registry, you are ready to launch your app 🚀 !</p>



<h2 class="wp-block-heading">Launch the AI Deploy app</h2>



<p>The following command starts a new job running your Streamlit application.</p>



<pre class="wp-block-code"><code class="">ovhai app run \
      --default-http-port 8501 \
      --cpu 12 \
      &lt;shared-registry-address&gt;/streamlit-eda-iris:latest</code></pre>



<h3 class="wp-block-heading">Choose the compute resources</h3>



<p>First, you can choose either the number of GPUs or the number of CPUs for your app.</p>



<p><code><strong>--cpu 12</strong></code>&nbsp;indicates that we request 12 CPUs for that app.</p>



<p>If you want, you can also launch this app with one or more&nbsp;<strong>GPUs</strong>.</p>



<h3 class="wp-block-heading">Make the app public</h3>



<p>Finally, if you want your app to be reachable without any authentication, add the&nbsp;<code><strong>--unsecure-http</strong></code>&nbsp;attribute.</p>



<figure class="wp-block-video"><video height="998" style="aspect-ratio: 1917 / 998;" width="1917" controls src="https://blog.ovhcloud.com/wp-content/uploads/2022/10/Enregistrement-de-lécran-2022-10-05-à-11.52.19-1.mov"></video></figure>



<h2 class="wp-block-heading">Conclusion</h2>



<p>Well done 🎉&nbsp;! You have learned how to build your&nbsp;<strong>own Docker image</strong>&nbsp;for a dedicated&nbsp;<strong>EDA and interactive prediction app</strong>!</p>



<p>You have also been able to deploy this app thanks to&nbsp;<strong>OVHcloud’s AI Deploy</strong>&nbsp;tool.</p>



<p><em>In a third article, you will see how it is possible to deploy a Data Science project with an API for&nbsp;Spam classification.</em></p>



<h3 class="wp-block-heading" id="want-to-find-out-more">Want to find out more?</h3>



<h5 class="wp-block-heading"><strong>Notebook</strong></h5>



<p>You want to access the notebook? Refer to the&nbsp;<a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/getting-started/pytorch/notebook_classification_iris.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>.</p>



<h5 class="wp-block-heading"><strong>App</strong></h5>



<p>You want to access the full code to create the Streamlit app? Refer to the&nbsp;<a href="https://github.com/ovh/ai-training-examples/tree/main/apps/streamlit/eda-classification-iris" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.<br><br>To launch and test this app with&nbsp;<strong>AI Deploy</strong>, please refer to&nbsp;our&nbsp;<a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/tuto-streamlit-eda-iris/" data-wpel-link="exclude">documentation</a>.</p>



<h2 class="wp-block-heading">References</h2>



<ul class="wp-block-list">
<li><a href="https://towardsdatascience.com/how-to-run-a-data-science-project-in-a-docker-container-2ab1a3baa889" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">How to Run a Data Science Project in a Docker Container</a></li>



<li><a href="https://medium.com/geekculture/create-a-machine-learning-web-app-with-streamlit-f28c75f9f40f" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Create a Machine Learning Web App with Streamlit</a></li>
</ul>
]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2022/10/Enregistrement-de-lécran-2022-10-05-à-11.52.19-1.mov" length="6370587" type="video/quicktime" />

			</item>
		<item>
		<title>Deploy a custom Docker image for Data Science project &#8211; Gradio sketch recognition app (Part 1)</title>
		<link>https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-gradio-sketch-recognition-app-part-1/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Tue, 20 Sep 2022 14:30:11 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Apps]]></category>
		<category><![CDATA[AI Solutions]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[gradio]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[tensorflow]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=23056</guid>

<description><![CDATA[A guide to deploy a custom Docker image for a Gradio app with AI Deploy. When creating code for a Data Science project, you probably want it to be as portable as possible. In other words, it can be run as many times as you like, even on different machines. Unfortunately, it is often the [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em>A guide to deploy a custom Docker image for a <a href="https://gradio.app/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gradio</a><strong> </strong>app with <strong>AI Deploy</strong>. </em></p>



<figure class="wp-block-image alignfull size-large"><img loading="lazy" decoding="async" width="1024" height="814" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/gradio-blog-2-1024x814.jpeg" alt="Deploy a custom Docker image for Data Science project - Gradio sketch recognition app (Part 1)" class="wp-image-23192" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/07/gradio-blog-2-1024x814.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/gradio-blog-2-300x238.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/gradio-blog-2-768x610.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/gradio-blog-2-1536x1221.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/gradio-blog-2.jpeg 1573w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>When creating code for a <strong>Data Science project</strong>, you probably want it to be as portable as possible. In other words, it can be run as many times as you like, even on different machines.</p>



<p>Unfortunately, it is often the case that Data Science code works fine locally on one machine but fails at runtime on another. This can be due to different versions of the libraries installed on the host machine.</p>



<p>To deal with this problem, you can use <a href="https://www.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker</a>.</p>



<p><strong>The article is organized as follows:</strong></p>



<ul class="wp-block-list"><li>Objectives</li><li>Concepts</li><li>Build the Gradio app with Python</li><li>Containerize your app with Docker</li><li>Launch the app with AI Deploy</li></ul>



<p><em>All the code for this blogpost is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/tree/main/apps/gradio/sketch-recognition" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>. You can test it with OVHcloud <strong>AI Deploy</strong> tool, please refer to the <a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/tuto-gradio-sketch-recognition/" data-wpel-link="exclude">documentation</a> to boot it up.</em></p>



<h2 class="wp-block-heading">Objectives</h2>



<p>In this article, you will learn how to develop your first Gradio sketch recognition app based on an existing ML model.</p>



<p>Once your app is up and running locally, it will be a matter of containerizing it, then deploying the custom Docker image with AI Deploy.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="513" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/gradio-objectives-1024x513.jpeg" alt="Objectives to create a Gradio app" class="wp-image-23189" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/07/gradio-objectives-1024x513.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/gradio-objectives-300x150.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/gradio-objectives-768x385.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/gradio-objectives-1536x770.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/gradio-objectives.jpeg 1620w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">Concepts</h2>



<p><strong>In Artificial Intelligence, you probably hear about Computer Vision, but do you know what it is?</strong></p>



<p>Computer Vision is a branch of AI that aims to enable computers to interpret visual data (images for example) to extract information.</p>



<p>There are different tasks in computer vision:</p>



<ul class="wp-block-list"><li>Image classification</li><li>Object detection</li><li>Instance Segmentation</li></ul>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-computer-vision-1024x576.jpeg" alt="Computer vision principle" class="wp-image-23126" width="848" height="477" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-computer-vision-1024x576.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-computer-vision-300x169.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-computer-vision-768x432.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-computer-vision-1536x864.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-computer-vision.jpeg 1620w" sizes="auto, (max-width: 848px) 100vw, 848px" /></figure>



<p>Today we are interested in <strong>image recognition</strong> and more specifically in <strong>sketch recognition</strong> using a dataset of handwritten digits.</p>



<h3 class="wp-block-heading">MNIST dataset</h3>



<p>MNIST is a dataset developed by <a href="https://en.wikipedia.org/wiki/Yann_LeCun" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Yann LeCun</a>, <a href="https://en.wikipedia.org/wiki/Corinna_Cortes" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Corinna Cortes</a> and <a href="https://chrisburges.net/bio/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Christopher Burges</a> to evaluate <strong>Machine Learning</strong> models for <strong>handwritten digits</strong> classification.</p>



<p>The dataset was constructed from a number of digitized document datasets available from the <strong>National Institute of Standards and Technology </strong>(NIST).</p>



<p>The images of numbers are <strong>digitized</strong>, <strong>normalized</strong> and <strong>centered</strong>. This allows the developer to focus on machine learning with very little data cleaning.</p>



<p>Each image is a square of <strong>28</strong> by <strong>28</strong> pixels. The dataset is split in two with <strong>60,000 images</strong> for model training and <strong>10,000 images</strong> for testing it.</p>
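<p>As a hedged illustration of the preprocessing mentioned above (using synthetic pixels rather than the real MNIST files), normalization typically scales the 0&#8211;255 grayscale values of each 28&#215;28 image into the [0, 1] range before training:</p>

```python
import numpy as np

# synthetic 28x28 grayscale image with pixel values in 0..255
img = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)
normalized = img.astype("float32") / 255.0   # scale into [0, 1] before training
print(normalized.shape, float(normalized.min()) >= 0.0, float(normalized.max()) <= 1.0)
```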



<p>This is a digit recognition task to recognize <strong>10 digits</strong>, from 0 to 9.</p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="266" height="264" src="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-mnist.jpeg" alt="MNIST dataset" class="wp-image-23125" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-mnist.jpeg 266w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-mnist-150x150.jpeg 150w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-mnist-70x70.jpeg 70w" sizes="auto, (max-width: 266px) 100vw, 266px" /></figure>



<p>❗<code><strong>A model to classify images of handwritten figures was trained in a previous tutorial, in notebook form, which you can find and test <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/computer-vision/image-classification/tensorflow/weights-and-biases/notebook_Weights_and_Biases_MNIST.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></code></p>



<p>This model is stored in an OVHcloud <a href="https://docs.ovh.com/gb/en/publiccloud/ai/cli/data-cli/" data-wpel-link="exclude">Object Storage container</a>. </p>



<h3 class="wp-block-heading">Sketch recognition</h3>



<p><strong>Have you ever heard of sketch recognition in AI?</strong></p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p><strong>Sketch recognition</strong>&nbsp;is the automated recognition of <strong>hand-drawn&nbsp;diagrams</strong>&nbsp;by a&nbsp;computer.&nbsp;Research in sketch recognition lies at the crossroads of&nbsp;<strong><a href="https://en.wikipedia.org/wiki/Artificial_intelligence" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">artificial intelligence</a></strong>&nbsp;and&nbsp;<strong>human–computer</strong> interaction. Recognition algorithms usually are&nbsp;gesture-based, appearance-based,&nbsp;geometry-based, or a combination thereof.</p><cite>Wikipedia</cite></blockquote>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-sketch-recognition-1-1024x673.jpeg" alt="AI for Sketch Recognition " class="wp-image-23138" width="648" height="424"/></figure>



<p>In this article, <strong>Gradio</strong> will allow you to create your first sketch recognition app.</p>



<h3 class="wp-block-heading">Gradio</h3>



<p><strong>What is Gradio?</strong></p>



<p><a href="https://gradio.app/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gradio</a> allows you to create and share <strong>Machine Learning apps</strong>.</p>



<p>It&#8217;s a quick way to demonstrate your Machine Learning model with a user-friendly web interface so that anyone can use it.</p>



<p>Gradio offers the ability to quickly create a <strong>sketch recognition interface</strong> by specifying &#8220;<code>sketchpad</code>&#8221; as an entry.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="303" src="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-interface-1024x303.png" alt="Gradio app drawing space " class="wp-image-23114" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-interface-1024x303.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-interface-300x89.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-interface-768x227.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-interface.png 1118w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To make this app accessible, you need to containerize it using <strong>Docker</strong>.</p>



<h3 class="wp-block-heading">Docker</h3>



<p>The <a href="https://www.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker</a> platform allows you to build, run and manage isolated applications. The principle is to package an application with not only the written code but also all the context needed to run it: libraries and their versions, for example.</p>



<p>When you wrap your application with all its context, you build a Docker image, which can be saved in your local repository or in the Docker Hub.</p>



<p>To get started with Docker, please, check this <a href="https://www.docker.com/get-started" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a>.</p>



<p>To build a Docker image, you will define 2 elements:</p>



<ul class="wp-block-list"><li>the application code (<em>Gradio sketch recognition app</em>)</li><li>the <a href="https://docs.docker.com/engine/reference/builder/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Dockerfile</a></li></ul>



<p>In the next steps, you will see how to develop the Python code for your app, but also how to write the Dockerfile.</p>



<p>Finally, you will see how to deploy your custom docker image with <strong>OVHcloud AI Deploy</strong> tool.</p>



<h3 class="wp-block-heading">AI Deploy</h3>



<p><strong>AI Deploy</strong> enables AI models and managed applications to be started via Docker containers. </p>



<p>To know more about AI Deploy, please refer to this <a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/getting-started/" data-wpel-link="exclude">documentation</a>.</p>



<h2 class="wp-block-heading">Build the Gradio app with Python</h2>



<h3 class="wp-block-heading">Import Python dependencies </h3>



<p>The first step is to import the <strong>Python libraries</strong> needed to run the Gradio app.</p>



<ul class="wp-block-list"><li>Gradio</li><li>TensorFlow</li><li>OpenCV</li></ul>



<pre class="wp-block-code"><code class="">import gradio as gr
import tensorflow as tf
import cv2</code></pre>



<h3 class="wp-block-heading">Define fixed elements of the app</h3>



<p>With <strong>Gradio</strong>, it is possible to add a title to your app to give information on its purpose.</p>



<pre class="wp-block-code"><code class="">title = "Welcome on your first sketch recognition app!"</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="71" src="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-title-1024x71.png" alt="Gradio app title" class="wp-image-23109" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-title-1024x71.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-title-300x21.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-title-768x53.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-title.png 1118w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then, you can describe your app by adding an image and a &#8220;<strong>description</strong>&#8221;.<br><br>To display and centre an image or some text, an <strong>HTML tag</strong> is ideal 💡!</p>



<pre class="wp-block-code"><code class="">head = (
  "&lt;center&gt;"
  "&lt;img src='file/mnist-classes.png' width=400&gt;"
  "The robot was trained to classify numbers (from 0 to 9). To test it, write your number in the space provided."
  "&lt;/center&gt;"
)</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="235" src="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-description-1024x235.png" alt="Gradio app description" class="wp-image-23111" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-description-1024x235.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-description-300x69.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-description-768x177.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-description.png 1118w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>It is also possible to share a useful link (source code, documentation, …). You can do it with the Gradio attribute named &#8220;<strong>article</strong>&#8221;.</p>



<pre class="wp-block-code"><code class="">ref = "Find the whole code [here](https://github.com/ovh/ai-training-examples/tree/main/apps/gradio/sketch-recognition)."</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="54" src="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-reference-1024x54.png" alt="Gradio app references" class="wp-image-23112" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-reference-1024x54.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-reference-300x16.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-reference-768x41.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-reference.png 1118w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>For this application, you have to set some variables.</p>



<ul class="wp-block-list"><li><strong>The image size</strong></li></ul>



<p>The image size is set to <strong>28</strong>, since the model expects a <strong>28&#215;28</strong> input image.</p>



<ul class="wp-block-list"><li><strong>The classes list</strong></li></ul>



<p>The classes list is composed of ten strings corresponding to the numbers 0 to 9 written in full.</p>



<pre class="wp-block-code"><code class="">img_size = 28
labels = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]</code></pre>



<p>Once the image size has been set and the list of classes defined, the next step is to load the AI model.</p>
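<p>As a quick sanity check, the input shape the model expects can be illustrated with a short NumPy sketch (NumPy is used here purely for illustration; the app itself resizes the drawing with OpenCV):</p>

```python
import numpy as np

# A sketchpad drawing arrives as a 2-D grayscale array, e.g. 200x200 pixels
drawing = np.zeros((200, 200))

# The model expects a batch of one 28x28 single-channel image: (1, 28, 28, 1)
img_size = 28
resized = np.zeros((img_size, img_size))  # stand-in for the cv2.resize step
batch = resized.reshape(1, img_size, img_size, 1)

print(batch.shape)  # (1, 28, 28, 1)
```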



<h3 class="wp-block-heading">Load TensorFlow model</h3>



<p>This is a <strong>TensorFlow model</strong> that was saved and exported beforehand as a <code>model.h5</code> file.</p>



<p>Indeed, Keras provides a basic saving format using the <strong>HDF5</strong> standard.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p><strong>Hierarchical Data Format</strong>&nbsp;(HDF) is a set of&nbsp;file formats&nbsp;(HDF4,&nbsp;<strong>HDF5</strong>) designed to store and organize large amounts of data.</p><cite>Wikipedia</cite></blockquote>



<p>In a previous notebook, you exported the model to an <a href="https://www.ovhcloud.com/fr/public-cloud/object-storage/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Object Storage</a> container. If you want to test the notebook, please refer to the <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/computer-vision/image-classification/tensorflow/weights-and-biases/notebook_Weights_and_Biases_MNIST.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>.</p>



<p><code>model.save('model/sketch_recognition_numbers_model.h5')</code></p>



<p>To load this model again and use it for inference, without having to re-train it, you have to use the <code>load_model</code> function from Keras.</p>



<pre class="wp-block-code"><code class="">model = tf.keras.models.load_model("model/sketch_recognition_numbers_model.h5")</code></pre>



<p>After defining the different parameters and loading the model, you can define the function that will predict what you have drawn.</p>



<h3 class="wp-block-heading">Define the prediction function</h3>



<p>This function consists of several steps.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-predict-2-1024x365.jpeg" alt="Gradio app return the results top 3" class="wp-image-23139" width="892" height="318" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-predict-2-1024x365.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-predict-2-300x107.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-predict-2-768x274.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-predict-2-1536x548.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-predict-2.jpeg 1620w" sizes="auto, (max-width: 892px) 100vw, 892px" /></figure>



<pre class="wp-block-code"><code class="">def predict(img):

  # Resize the drawing to the 28x28 size expected by the model
  img = cv2.resize(img, (img_size, img_size))
  # Add the batch and channel dimensions: (1, 28, 28, 1)
  img = img.reshape(1, img_size, img_size, 1)

  # Run inference and keep the first (and only) element of the batch
  preds = model.predict(img)[0]

  # Map each class label to its predicted probability
  return {label: float(pred) for label, pred in zip(labels, preds)}


# Display only the 3 most probable classes in the app output
label = gr.outputs.Label(num_top_classes=3)</code></pre>
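<p>The dictionary returned by <code>predict</code> simply pairs each class name with its probability, and Gradio's <code>Label</code> component keeps the highest entries. With dummy probabilities (made up here for illustration), the mapping looks like this:</p>

```python
labels = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

# Dummy probabilities standing in for model.predict(img)[0]
preds = [0.01, 0.02, 0.05, 0.70, 0.10, 0.04, 0.03, 0.02, 0.02, 0.01]

# Same comprehension as in the app: one entry per class
result = {label: float(pred) for label, pred in zip(labels, preds)}

# The three classes Gradio would display with num_top_classes=3
top3 = sorted(result, key=result.get, reverse=True)[:3]
print(top3)  # ['three', 'four', 'two']
```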



<h3 class="wp-block-heading">Launch Gradio interface</h3>



<p>Now you need to build the interface using a Python class, named <a href="https://gradio.app/docs/#interface" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Interface</a>, previously defined by Gradio.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>The <strong>Interface class</strong> is a high-level abstraction that allows you to create a web-based demo around a machine learning model or arbitrary Python function by specifying: <br>(1) the function <br>(2) the desired input components<br>(3) desired output components.</p><cite>Gradio</cite></blockquote>



<pre class="wp-block-code"><code class="">interface = gr.Interface(fn=predict, inputs="sketchpad", outputs=label, title=title, description=head, article=ref)</code></pre>



<p>Finally, you have to launch the Gradio app with the &#8220;<strong>launch</strong>&#8221; method, which starts a simple web server that serves the demo.</p>



<pre class="wp-block-code"><code class="">interface.launch(server_name="0.0.0.0", server_port=8080)</code></pre>



<p>Then, you can test your app locally at the following address: <strong>http://localhost:8080/</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="663" src="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-overview-1024x663.png" alt="Global overview of Gradio app" class="wp-image-23113" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-overview-1024x663.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-overview-300x194.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-overview-768x497.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-overview.png 1118w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Your app works locally? Congratulations 🎉!<br><br>Now it&#8217;s time to move on to containerization!</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-docker-1024x615.jpeg" alt="Docker - Gradio sketch recognition app
" class="wp-image-23120" width="671" height="403" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-docker-1024x615.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-docker-300x180.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-docker-768x461.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/06/gradio-docker.jpeg 1168w" sizes="auto, (max-width: 671px) 100vw, 671px" /></figure>



<h2 class="wp-block-heading">Containerize your app with Docker</h2>



<p>First of all, you have to create the file that lists the different Python modules to be installed, with their corresponding versions.</p>



<h3 class="wp-block-heading">Create the requirements.txt file</h3>



<p>The <code>requirements.txt</code> file lists all the modules needed to make the application work.</p>



<pre class="wp-block-code"><code class="">gradio==3.0.10
tensorflow==2.9.1
opencv-python-headless==4.6.0.66</code></pre>



<p>This file will be useful when writing the <code>Dockerfile</code>.</p>



<h3 class="wp-block-heading">Write the Dockerfile</h3>



<p>Your <code>Dockerfile</code> should start with the <code>FROM</code> instruction indicating the parent image to use. In our case, we choose to start from a classic Python image.<br><br>For this Gradio app, you can use version <strong>3.7</strong> of Python.</p>



<pre class="wp-block-code"><code class="">FROM python:3.7</code></pre>



<p>Next, you have to set the working directory and add the <code>requirements.txt</code> file.</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:100%">
<div class="inherit-container-width wp-block-group is-layout-constrained wp-block-group-is-layout-constrained"><div class="wp-block-group__inner-container">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:100%">
<div class="inherit-container-width wp-block-group is-layout-constrained wp-block-group-is-layout-constrained"><div class="wp-block-group__inner-container">
<p><code><strong>❗ Here you must be in the /workspace directory. This is the base directory when launching an OVHcloud AI Deploy app.</strong></code></p>
</div></div>
</div>
</div>
</div></div>
</div>
</div>



<pre class="wp-block-code"><code class="">WORKDIR /workspace
ADD requirements.txt /workspace/requirements.txt</code></pre>



<p>Install the Python modules listed in <code>requirements.txt</code> using a <code>pip install</code> command:</p>



<pre class="wp-block-code"><code class="">RUN pip install -r requirements.txt</code></pre>



<p>Now, you have to add your Python file (<code>app.py</code>), as well as the image used in your app&#8217;s description, to the <code>workspace</code>.</p>



<pre class="wp-block-code"><code class="">ADD app.py mnist-classes.png /workspace/</code></pre>



<p>Then, you can give the correct access rights to the OVHcloud user (<code>42420:42420</code>).</p>



<pre class="wp-block-code"><code class="">RUN chown -R 42420:42420 /workspace
ENV HOME=/workspace</code></pre>



<p>Finally, you have to define your default launching command to start the application.</p>



<pre class="wp-block-code"><code class="">CMD [ "python3" , "/workspace/app.py" ]</code></pre>



<p>Once your <code>Dockerfile</code> is defined, you will be able to build your custom Docker image.</p>
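<p>Putting all the instructions above together, the complete <code>Dockerfile</code> reads:</p>

```
FROM python:3.7

WORKDIR /workspace
ADD requirements.txt /workspace/requirements.txt

RUN pip install -r requirements.txt

ADD app.py mnist-classes.png /workspace/

RUN chown -R 42420:42420 /workspace
ENV HOME=/workspace

CMD [ "python3" , "/workspace/app.py" ]
```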



<h3 class="wp-block-heading">Build the Docker image from the Dockerfile</h3>



<p>First, you can launch the following command from the <code>Dockerfile</code> directory to build your application image.</p>



<pre class="wp-block-code"><code class="">docker build . -t gradio_app:latest</code></pre>



<p><code>⚠️</code> <strong><code>The dot . argument indicates that your build context (place of the Dockerfile and other needed files) is the current directory.</code></strong></p>



<p><code>⚠️</code> <code><strong>The -t argument allows you to choose the identifier to give to your image. Usually image identifiers are composed of a name and a version tag &lt;name&gt;:&lt;version&gt;. For this example we chose gradio_app:latest.</strong></code></p>



<h3 class="wp-block-heading">Test it locally</h3>



<p><code><strong>❗ If you are testing your app locally, you can download your model (sketch_recognition_numbers_model.h5), then add it to the /workspace</strong></code></p>



<p>You can do it via the Dockerfile with the following line:</p>



<p><code><strong>ADD sketch_recognition_numbers_model.h5 /workspace/</strong></code></p>



<p>Now, you can run the following <strong>Docker command</strong> to launch your application locally on your computer.</p>



<pre class="wp-block-code"><code class="">docker run --rm -it -p 8080:8080 --user=42420:42420 gradio_app:latest</code></pre>



<p><code>⚠️</code> <code><strong>The -p 8080:8080 argument indicates that you want to execute a port redirection from the port 8080 of your local machine into the port 8080 of the Docker container.</strong></code></p>



<p><code><strong>⚠️ Don't forget the --user=42420:42420 argument if you want to simulate the exact same behaviour that will occur on AI Deploy. It executes the Docker container as the specific OVHcloud user (user 42420:42420).</strong></code></p>



<p>Once started, your application should be available on <strong>http://localhost:8080</strong>.<br><br>Your Docker image seems to work? Good job 👍!<br><br>It&#8217;s time to push it and deploy it!</p>



<h3 class="wp-block-heading">Push the image into the shared registry</h3>



<p>❗ The shared registry of AI Deploy should only be used for testing purposes. Please consider attaching your own Docker registry. More information about this can be found <a href="https://docs.ovh.com/asia/en/publiccloud/ai/training/add-private-registry/" data-wpel-link="exclude">here</a>.</p>



<p>Then, you have to find the address of your <code>shared registry</code> by launching this command.</p>



<pre class="wp-block-code"><code class="">ovhai registry list</code></pre>



<p>Next, log in on the shared registry with your usual <code>OpenStack</code> credentials.</p>



<pre class="wp-block-code"><code class="">docker login -u &lt;user&gt; -p &lt;password&gt; &lt;shared-registry-address&gt;</code></pre>



<p>To finish, you need to push the created image into the shared registry.</p>



<pre class="wp-block-code"><code class="">docker tag gradio_app:latest &lt;shared-registry-address&gt;/gradio_app:latest
docker push &lt;shared-registry-address&gt;/gradio_app:latest</code></pre>



<p>Once you have pushed your custom Docker image into the shared registry, you are ready to launch your app 🚀!</p>



<h2 class="wp-block-heading">Launch the app with AI Deploy</h2>



<p>The following command starts a new job running your Gradio application.</p>



<pre class="wp-block-code"><code class="">ovhai app run \
      --cpu 1 \
      --volume &lt;my_saved_model&gt;@&lt;region&gt;/:/workspace/model:RO \
      &lt;shared-registry-address&gt;/gradio_app:latest</code></pre>



<h3 class="wp-block-heading">Choose the compute resources</h3>



<p>First, you can choose either the number of GPUs or the number of CPUs for your app.</p>



<p><code><strong>--cpu 1</strong></code> indicates that we request 1 CPU for that app.</p>



<p>If you want, you can also launch this app with one or more <strong>GPUs</strong>.</p>
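<p>For instance, a GPU variant of the run command could look like the following sketch (the <code>--gpu</code> flag name is an assumption here; check <code>ovhai app run --help</code> for the exact syntax of your CLI version):</p>

```
ovhai app run \
      --gpu 1 \
      --volume &lt;my_saved_model&gt;@&lt;region&gt;/:/workspace/model:RO \
      &lt;shared-registry-address&gt;/gradio_app:latest
```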



<h3 class="wp-block-heading">Attach Object Storage container</h3>



<p>Then, you need to attach <strong>1 volume</strong> to this app. It contains the model that you trained before in part <em>&#8220;Save and export the model for future inference&#8221;</em> of the <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/computer-vision/image-classification/tensorflow/weights-and-biases/notebook_Weights_and_Biases_MNIST.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">notebook</a>.</p>



<p><code><strong>--volume &lt;my_saved_model&gt;@&lt;region&gt;/:/workspace/model:RO</strong></code>&nbsp;is the volume attached for using your&nbsp;<strong>pretrained model</strong>.</p>



<p>This volume is read-only (<code>RO</code>) because you just need to use the model and not make any changes to this Object Storage container.</p>



<h3 class="wp-block-heading">Make the app public</h3>



<p>Finally, if you want your app to be reachable without any authentication, add the&nbsp;<code><strong>--unsecure-http</strong></code>&nbsp;flag to the run command.</p>
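<p>Combined with the run command shown earlier, a publicly reachable deployment would then be launched like this (the registry address and volume are the same placeholders as before):</p>

```
ovhai app run \
      --cpu 1 \
      --unsecure-http \
      --volume &lt;my_saved_model&gt;@&lt;region&gt;/:/workspace/model:RO \
      &lt;shared-registry-address&gt;/gradio_app:latest
```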



<figure class="wp-block-video aligncenter"><video height="970" style="aspect-ratio: 1914 / 970;" width="1914" controls src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/gradio-video-final-app.mov"></video></figure>



<h2 class="wp-block-heading">Conclusion</h2>



<p>Well done 🎉! You have learned how to build your <strong>own Docker image</strong> for a dedicated <strong>sketch recognition app</strong>! </p>



<p>You have also been able to deploy this app thanks to <strong>OVHcloud&#8217;s AI Deploy</strong> tool.</p>



<p><em>In a second article, you will see how it is possible to deploy a Data Science project for <strong>interactive data visualization&nbsp;and prediction</strong>.</em></p>



<h3 class="wp-block-heading" id="want-to-find-out-more">Want to find out more?</h3>



<h5 class="wp-block-heading"><strong>Notebook</strong></h5>



<p>You want to access the notebook? Refer to the <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/computer-vision/image-classification/tensorflow/weights-and-biases/notebook_Weights_and_Biases_MNIST.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>.<br><br>To launch and test this notebook with&nbsp;<strong>AI Notebooks</strong>, please refer to our <a href="https://docs.ovh.com/gb/en/publiccloud/ai/notebooks/tuto-weights-and-biases/" data-wpel-link="exclude">documentation</a>.</p>



<h5 class="wp-block-heading"><strong>App</strong></h5>



<p>You want to access the full code to create the Gradio app? Refer to the&nbsp;<a href="https://github.com/ovh/ai-training-examples/tree/main/apps/gradio/sketch-recognition" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.<br><br>To launch and test this app with&nbsp;<strong>AI Deploy</strong>, please refer to&nbsp;our <a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/tuto-gradio-sketch-recognition/" data-wpel-link="exclude">documentation</a>.</p>



<h2 class="wp-block-heading">References</h2>



<ul class="wp-block-list"><li><a href="https://towardsdatascience.com/how-to-run-a-data-science-project-in-a-docker-container-2ab1a3baa889" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">How to Run a Data Science Project in a Docker Container</a></li><li><a href="https://github.com/gradio-app/hub-sketch-recognition" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Sketch Recognition on Gradio</a></li></ul>
]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2022/07/gradio-video-final-app.mov" length="2582723" type="video/quicktime" />

			</item>
		<item>
		<title>Managing Harbor at cloud scale : The story behind Harbor Kubernetes Operator</title>
		<link>https://blog.ovhcloud.com/managing-harbor-at-cloud-scale-the-story-behind-harbor-kubernetes-operator/</link>
		
		<dc:creator><![CDATA[Maxime Hurtrel]]></dc:creator>
		<pubDate>Tue, 17 Mar 2020 15:18:11 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[CNCF]]></category>
		<category><![CDATA[containers]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[harbor]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<category><![CDATA[registry]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=17509</guid>

					<description><![CDATA[Recently, our container platforms team made our &#8220;Private Managed Registry&#8221; service generally available. In this blog post, we will explain why OVHcloud chose to base this service on the Harbor project, built a Kubernetes operator for it, and open sourced it under the CNCF goharbor project. The need for a S.M.A.R.T private registry After our [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmanaging-harbor-at-cloud-scale-the-story-behind-harbor-kubernetes-operator%2F&amp;action_name=Managing%20Harbor%20at%20cloud%20scale%20%3A%20The%20story%20behind%20Harbor%20Kubernetes%20Operator&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>Recently, our container platforms team made our <a href="https://www.ovhcloud.com/en-ie/public-cloud/managed-private-registry/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"> &#8220;Private Managed Registry&#8221; service </a> generally available. In this blog post, we will explain why OVHcloud chose to base this service on the Harbor project, built a Kubernetes operator for it, and open sourced it under the CNCF goharbor project.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/7E235649-EEE8-4D3A-ABF7-0A1D6D93942F-1024x537.png" alt="" class="wp-image-17604" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/7E235649-EEE8-4D3A-ABF7-0A1D6D93942F-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/7E235649-EEE8-4D3A-ABF7-0A1D6D93942F-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/7E235649-EEE8-4D3A-ABF7-0A1D6D93942F-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/7E235649-EEE8-4D3A-ABF7-0A1D6D93942F.png 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h2 class="wp-block-heading"><strong>The need for a</strong><a href="https://www.ovhcloud.com/en-ie/about-us/who-are/#text-media-4-2" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong> </strong><strong>S.M.A.R.T</strong></a><strong> </strong><strong>private registry</strong></h2>



<p>After our <a href="https://www.ovhcloud.com/en-ie/public-cloud/kubernetes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Managed Kubernetes Service release</a>, we received many requests for a fully managed private container registry.</p>



<p>Though a container registry for hosting images may sound quite trivial to deploy, our users told us that a production-grade registry solution is a critical part of the software delivery supply chain, and actually quite difficult to maintain.</p>



<p>Our customers were asking for an enterprise-grade solution, offering advanced role-based-access-control and security by design, as concerns around vulnerabilities within the publicly available images increased and requirements for content-trust became a necessity.</p>



<p>Users regularly praised the user interface of services such as the Docker Hub, but at the same time requested a highly available service backed by an SLA.</p>



<h2 class="wp-block-heading"><strong>The perfect mix of open source and enterprise-grade feature set</strong></h2>



<p>After surveying prospects to fine-tune our feature set and pricing model, we searched for the best existing technologies to back it and landed on the <a href="http://goharbor.io" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">CNCF incubating project Harbor</a> (donated to the CNCF by VMware). Harbor is one of the few projects to have reached the CNCF incubation state, confirming the strong commitment of its community, and it has become a key part of several commercial enterprise containerization solutions. We also appreciated Harbor&#8217;s approach of not re-inventing the wheel, instead gluing together best-of-breed technologies for components such as vulnerability scanning and content trust. It leverages the CNCF&#8217;s strong network of open source projects and ensures a great level of UX quality.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="537" height="188" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/B2CA67EE-44B7-4B1A-BA6E-EB3D328F96B2.png" alt="" class="wp-image-17601" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/B2CA67EE-44B7-4B1A-BA6E-EB3D328F96B2.png 537w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/B2CA67EE-44B7-4B1A-BA6E-EB3D328F96B2-300x105.png 300w" sizes="auto, (max-width: 537px) 100vw, 537px" /></figure></div>



<p>It was now time to take this 10k-GitHub-stars technology and adapt it to our specific case: managing tens of thousands of registries for our users, each with its own volume of container images and usage patterns.</p>



<p>High availability (our customers&#8217; software integration and deployment pipelines rely on this service) and data durability were, of course, non-negotiable for us.</p>



<p>In addition, Kubernetes (to ensure the high availability of stateless services) and object storage (based on OpenStack Swift and <a href="https://www.ovh.com/blog/ovhcloud-object-storage-clusters-support-s3-api/" data-wpel-link="exclude">compatible with the S3 API</a>) were evident choices to meet those requirements.</p>



<h2 class="wp-block-heading"><strong>Addressing operational challenges at the cloud-provider scale</strong></h2>



<p>Within a few weeks, we opened the service in public beta, quickly attracting hundreds of active users. But with this surge in traffic, we naturally hit our first bottlenecks and performance challenges.</p>



<p>We approached the Harbor user group and team who kindly pointed us to potential solutions, and after some small but key changes to how Harbor handles database connections our issues were resolved. This reinforced our beliefs that the Harbor community is strong and committed to the health of the project and the requirements of its users.</p>



<p>As our service flourished, there was no real tooling available to easily manage the life-cycle of Harbor instances. Our commitment to the Kubernetes ecosystem made the concept of a Harbor operator for Kubernetes an interesting approach.</p>



<p>We discussed our idea with the Harbor maintainers, who warmly welcomed it, and we developed and open-sourced the result as the official Harbor Kubernetes Operator. OVHcloud is very proud to have the project now available in the <a href="https://goharbor.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">goharbor</a> GitHub project under Apache 2 licensing. This project is another example of our strong commitment towards open source and our willingness to contribute our efforts back to the projects that we love.</p>



<h2 class="wp-block-heading"><strong>A versatile operator designed to accommodate any Harbor deployment</strong></h2>



<p>Readers familiar with the Harbor project may wonder what value this operator brings to the current catalogue of deployments including the Helm Chart version maintained by the project.</p>



<p>The operator design pattern is quickly catching on: it mimics an application-centric controller that extends Kubernetes to manage more complex, stateful apps. Simply put, it addresses different use-cases than those of Helm. Whereas the Helm chart offers an all-in-one installer that also deploys the different dependencies of Harbor (database, cache, etc.) from open source Docker images, other enterprises, service operators and cloud providers like us will want to pick and choose the service or technology behind each of those components.</p>



<p>We also aim to extend the current v0.5 operator to manage the full life-cycle of Harbor, from deployment to deletion, including scaling, updates, upgrades, and backup management.</p>



<p>This will help production users reach their target SLO and benefit from managed solutions, or from existing database clusters they already maintain, for example.</p>



<p>We designed the operator (leveraging the OperatorSDK framework) so that both the optional Harbor modules (Helm chart store, vulnerability scanner, etc.) and its dependencies (registry storage backend, relational and non-relational databases, etc.) can easily match your specific use case.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="887" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/69A12D7F-A2B3-45B3-87DB-3A942BC529E4-1024x887.png" alt="" class="wp-image-17611" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/69A12D7F-A2B3-45B3-87DB-3A942BC529E4-1024x887.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/69A12D7F-A2B3-45B3-87DB-3A942BC529E4-300x260.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/69A12D7F-A2B3-45B3-87DB-3A942BC529E4-768x665.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/69A12D7F-A2B3-45B3-87DB-3A942BC529E4.png 1495w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption> Simplified architecture behind OVHcloud&#8217;d Managed Private Registry service </figcaption></figure></div>



<h2 class="wp-block-heading"><strong>Contributing to Harbor and the operator project</strong></h2>



<p>We already have a roadmap planned with the Harbor maintainers to further enrich the operator to accommodate more than the deployment and destruction phases (for example making Harbor version upgrades more elegant). We look forward to being an integral part of the project and will continue investing in Harbor.</p>



<p>To that end, Jérémie Monsinjon and Pierre Peronnet have also been invited to become maintainers of the Harbor project, focusing on <a href="https://github.com/goharbor/harbor-operator" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">goharbor/operator</a>.</p>



<p>In addition to regular contributions to the multiple projects we use within OVHcloud, the container-platform team is also working on other major open-source releases, such as an official OVHcloud cloud controller for self-managed Kubernetes that we plan to deliver in late 2020.</p>






<p>Download Harbor or the Harbor Operator: <a href="http://www.github.com/goharbor" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Official Harbor GitHub repo</a></p>



<p>Learn more about Harbor: <a href="http://goharbor.io" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Official Harbor website</a></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmanaging-harbor-at-cloud-scale-the-story-behind-harbor-kubernetes-operator%2F&amp;action_name=Managing%20Harbor%20at%20cloud%20scale%20%3A%20The%20story%20behind%20Harbor%20Kubernetes%20Operator&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>MyBinder and OVH partnership</title>
		<link>https://blog.ovhcloud.com/mybinder-and-ovh-partnership/</link>
		
		<dc:creator><![CDATA[Mael Le Gal]]></dc:creator>
		<pubDate>Mon, 24 Jun 2019 12:16:55 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Jupyter]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OVHcloud Managed Kubernetes]]></category>
		<category><![CDATA[Public Cloud]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=15606</guid>

					<description><![CDATA[Last month, OVH and Binder team partnered together in order to support the growth of the BinderHub ecosystem around the world. With approximately 100,000 weekly users of the mybinder.org public deployment and 3,000 unique git repositories hosting Binder badges, the need for more resources and computing time was felt. Today, we are thrilled to announce [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmybinder-and-ovh-partnership%2F&amp;action_name=MyBinder%20and%20OVH%20partnership&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p class="part">Last month, the <strong>OVH</strong> and <strong>Binder</strong> teams partnered in order to support the growth of the <strong>BinderHub</strong> ecosystem around the world.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="999" height="493" src="/blog/wp-content/uploads/2019/06/IMG_0301.png" alt="OVH loves Binder and the Jupyter project" class="wp-image-15666" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/06/IMG_0301.png 999w, https://blog.ovhcloud.com/wp-content/uploads/2019/06/IMG_0301-300x148.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/06/IMG_0301-768x379.png 768w" sizes="auto, (max-width: 999px) 100vw, 999px" /></figure></div>



<p class="part">With approximately 100,000 weekly users of the <a href="http://mybinder.org/" target="_blank" rel="noopener noreferrer nofollow external" data-wpel-link="external">mybinder.org</a> public deployment and 3,000 unique git repositories hosting Binder badges, the need for more resources and computing time was felt.</p>



<p class="part">Today, we are thrilled to announce that <strong>OVH</strong> is now part of the world-wide federation of BinderHubs powering <a rel="noopener noreferrer nofollow external" href="http://mybinder.org/" target="_blank" data-wpel-link="external">mybinder.org</a>. All traffic to <a rel="noopener noreferrer nofollow external" href="http://mybinder.org/" target="_blank" data-wpel-link="external">mybinder.org</a> is now split between two BinderHubs &#8211; one run by the <strong>Binder team</strong>, and another run on <strong>OVH</strong> infrastructure.</p>



<p class="part">So for those who don’t already know <a href="http://mybinder.org/" target="_blank" rel="noopener noreferrer nofollow external" data-wpel-link="external">mybinder.org</a>, here&#8217;s a summary.</p>



<h2 class="part wp-block-heading" id="What-is-Jupyter">What is Jupyter?</h2>



<p class="part"><a href="https://jupyter.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Jupyter</a> is an awesome open-source project that allows users to create, visualise and edit interactive notebooks. It supports many popular programming languages, such as <strong>Python</strong>, <strong>R</strong> and <strong>Scala</strong>, as well as presentation features such as Markdown, code snippets and chart visualisation…</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="300" height="223" src="/blog/wp-content/uploads/2019/06/jupyer_notebook-300x223.png" alt="" class="wp-image-15609" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/06/jupyer_notebook-300x223.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/06/jupyer_notebook-768x571.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/06/jupyer_notebook-1024x762.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/06/jupyer_notebook-1200x893.png 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/06/jupyer_notebook.png 1214w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p class="part"><em>Example of a local Jupyter Notebook reading a notebook inside the OVH GitHub repository <a href="https://github.com/ovh/prescience-client" target="_blank" rel="noopener noreferrer nofollow external" data-wpel-link="external">prescience client</a>.</em></p>



<p>The main use case is the ability to share your work with tons of people, who can try, use and edit it directly from their web browsers.</p>



<p>Many researchers and professors are now able to work remotely on the same projects, without any infrastructure or environment issues. It&#8217;s a major improvement for communities.</p>



<p>Here, for example, is a notebook (<a href="https://github.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/blob/master/example-data-science-notebook/Example%20Machine%20Learning%20Notebook.ipynb" target="_blank" rel="noopener noreferrer nofollow external" data-wpel-link="external">GitHub project</a>) that walks you through machine learning, from dataset ingestion to classification:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1157" height="777" src="/blog/wp-content/uploads/2019/06/jupyer.machine.learning.png" alt="jupyter machine learning notebook example" class="wp-image-15642" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/06/jupyer.machine.learning.png 1157w, https://blog.ovhcloud.com/wp-content/uploads/2019/06/jupyer.machine.learning-300x201.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/06/jupyer.machine.learning-768x516.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/06/jupyer.machine.learning-1024x688.png 1024w" sizes="auto, (max-width: 1157px) 100vw, 1157px" /></figure></div>



<p><em>Example of a Machine Learning Jupyter Notebook<br></em></p>



<h2 class="part wp-block-heading" id="What-is-JupyterHub">What is JupyterHub?</h2>



<p class="part"><a href="https://jupyter.org/hub" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">JupyterHub</a> is an even more awesome open-source project that brings multi-user support to <strong>Jupyter</strong> notebooks. With several pluggable authentication mechanisms (e.g. PAM, OAuth), it allows <strong>Jupyter</strong> notebooks to be spawned on the fly from a centralised infrastructure. Users can then easily share their notebooks and access rights with each other. That makes <strong>JupyterHub</strong> perfect for companies, classrooms and research labs.</p>



<h2 class="part wp-block-heading" id="What-is-BinderHub">What is BinderHub?</h2>



<p class="part">Finally, <a href="https://binderhub.readthedocs.io/en/latest/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">BinderHub</a> is the cherry on the cake: it allows users to turn any Git repository (such as GitHub) into a collection of interactive <strong>Jupyter</strong> notebooks with only one click.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="300" height="276" src="/blog/wp-content/uploads/2019/06/binder-300x276.png" alt="" class="wp-image-15611" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/06/binder-300x276.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/06/binder-768x707.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/06/binder.png 962w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p><em>Landing page of the binder project</em></p>



<p class="part">The <strong>Binder</strong> instance deployed by OVH can be accessed <a rel="noopener noreferrer nofollow external" href="https://ovh.mybinder.org/" target="_blank" data-wpel-link="external">here</a>.</p>



<ul class="part wp-block-list"><li>Just choose a publicly accessible Git repository (better if it already contains some <strong>Jupyter</strong> notebooks).</li><li>Copy the URL of the chosen repository into the Binder repository field.</li><li>Click the launch button.</li><li>If this is the first time Binder has seen the repository you provide, you will see compilation logs appear: your repository is being analysed and prepared for the start of a matching <strong>Jupyter</strong> notebook.</li><li>Once the compilation is complete, you will be automatically redirected to your dedicated instance.</li><li>You can then start interacting and hacking inside the notebook.</li><li>On the initial Binder page you will see a link to share your repository with others.</li></ul>
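<p>Under the hood, the launch button simply takes you to a URL following Binder&#8217;s <code>/v2/gh/&lt;owner&gt;/&lt;repo&gt;/&lt;ref&gt;</code> scheme, so a shareable link can also be built by hand. A small sketch (the repository and branch below are just an example):</p>

```shell
# Build a shareable Binder launch URL by hand.
# Binder resolves /v2/gh/<owner>/<repo>/<ref> to a built image of that
# repository; the values below are illustrative.
owner="ovh"
repo="prescience-client"
ref="master"
echo "https://ovh.mybinder.org/v2/gh/${owner}/${repo}/${ref}"
```

<p>Opening the printed URL in a browser triggers exactly the build-and-launch flow described above.</p>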



<h2 class="part wp-block-heading" id="How-it-works">How does it work?</h2>



<h3 class="part wp-block-heading" id="Tools-used-by-BinderHub">Tools used by BinderHub</h3>



<p class="part">BinderHub connects several services together to provide on-the-fly creation and registry of Docker images. It uses the following tools:</p>



<ul class="part wp-block-list"><li class="" data-startline="49" data-endline="49">A cloud provider such as OVH.</li><li class="" data-startline="50" data-endline="50">Kubernetes to manage resources on the cloud.</li><li class="" data-startline="51" data-endline="51">Helm to configure and control Kubernetes.</li><li class="" data-startline="52" data-endline="52">Docker to use containers that standardise computing environments.</li><li class="" data-startline="53" data-endline="53">A BinderHub UI that users can access to specify Git repos they want built.</li><li class="" data-startline="54" data-endline="54">BinderHub to generate Docker images using the URL of a Git repository.</li><li class="" data-startline="55" data-endline="55">A Docker registry that hosts container images.</li><li class="" data-startline="56" data-endline="57">JupyterHub to deploy temporary containers for users.</li></ul>



<h3 class="part wp-block-heading" id="What-happens-when-a-user-clicks-a-Binder-link">What happens when a user clicks a Binder link?</h3>



<p class="part">After a user clicks a Binder link, the following chain of events happens:</p>



<ol class="part wp-block-list"><li class="" data-startline="62" data-endline="62">BinderHub resolves the link to the repository.</li><li class="" data-startline="63" data-endline="63">BinderHub determines whether a Docker image already exists for the repository at the latest reference (git commit hash, branch, or tag).</li><li class="" data-startline="64" data-endline="67">If the image doesn’t exist, BinderHub creates a build pod that uses repo2docker to:
<ul>
<li class="" data-startline="65" data-endline="65">Fetch the repository associated with the link.</li>
<li class="" data-startline="66" data-endline="66">Build a Docker container image containing the environment specified in configuration files in the repository.</li>
<li class="" data-startline="67" data-endline="67">Push that image to a Docker registry, and send the registry information to the BinderHub for future reference.</li>
</ul>
</li><li class="" data-startline="68" data-endline="68">BinderHub sends the Docker image registry to JupyterHub.</li><li class="" data-startline="69" data-endline="69">JupyterHub creates a Kubernetes pod for the user that serves the built Docker image for the repository.</li><li class="" data-startline="70" data-endline="71">JupyterHub monitors the user’s pod for activity, and destroys it after a short period of inactivity.</li></ol>
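<p>The image-building step in the middle of this chain can also be tried on your own machine: the <code>repo2docker</code> CLI fetches a repository, builds a Docker image from its configuration files and launches a notebook server from it. A minimal sketch, assuming a local Docker daemon is running (the repository URL is just an example):</p>

```shell
# Reproduce BinderHub's build step locally.
# Requires Docker; the repository URL below is illustrative.
pip install jupyter-repo2docker

# Fetch the repo, build an image from its environment files
# (requirements.txt, environment.yml, Dockerfile, ...) and start Jupyter.
repo2docker https://github.com/ovh/prescience-client
```

<p>This is the same tool BinderHub runs inside its build pods, which makes it handy for debugging a repository&#8217;s environment before sharing its Binder link.</p>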



<h3 class="wp-block-heading">A diagram of the BinderHub architecture</h3>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1024" height="818" src="https://www.ovh.com/blog/wp-content/uploads/2019/06/IMG_0300-1024x818.png" alt="MyBinder Architecture" class="wp-image-15663" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/06/IMG_0300-1024x818.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/06/IMG_0300-300x240.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/06/IMG_0300-768x614.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/06/IMG_0300-1200x959.png 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/06/IMG_0300.png 1468w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h2 class="part wp-block-heading">How did we deploy it?</h2>



<h3 class="wp-block-heading">Powered by OVH Kubernetes</h3>



<p class="part">One great thing about the <strong>Binder</strong> project is that it is completely cloud agnostic: you just need a <strong>Kubernetes</strong> cluster to deploy it on.</p>



<p class="part">Kubernetes is one of the best choices when it comes to scaling a micro-services architecture. Our managed Kubernetes solution is powered by OVH&#8217;s Public Cloud instances. With OVH Load Balancers and integrated additional disks, you can host all types of workloads, with total reversibility.</p>



<p class="part">To this end, we used 2 services in the OVH Public Cloud:</p>



<ul class="part wp-block-list"><li class="" data-startline="85" data-endline="85">A <a href="https://www.ovh.co.uk/public-cloud/kubernetes/" target="_blank" rel="noopener noreferrer nofollow external" data-wpel-link="external">Kubernetes Cluster</a> today consuming 6 nodes of <code>C2-15</code> VM instances (it will grow in the future)</li><li class="" data-startline="86" data-endline="87">A <a href="https://labs.ovh.com/private-registry" target="_blank" rel="noopener noreferrer" data-wpel-link="exclude">Docker Registry</a></li></ul>



<p class="part">We also ordered a specific domain name so that our binder stack could be publicly accessible from anywhere.</p>



<h3 class="part wp-block-heading" id="Installation-of-HELM-on-our-new-cluster">Installation of Helm on our new cluster</h3>



<p class="part">Once the automatic installation of our Kubernetes cluster was complete, we downloaded the administration YAML file (kubeconfig) that lets us manage the cluster and run <code>kubectl</code> commands against it.</p>



<p class="part"><code>kubectl</code> is the official and most popular tool used to administer a Kubernetes cluster. More information about how to install it can be found <a rel="noopener noreferrer nofollow external" href="https://kubernetes.io/docs/tasks/tools/install-kubectl/" target="_blank" data-wpel-link="external">here</a>.</p>
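<p>As a quick sketch (the file path below is illustrative), pointing <code>kubectl</code> at the downloaded admin file and listing the nodes is enough to confirm that the cluster answers:</p>

```shell
# Point kubectl at the admin file downloaded from the OVH control panel
# (the path is illustrative).
export KUBECONFIG="$HOME/Downloads/kubeconfig.yml"

# Sanity checks: the API server responds and the nodes are Ready.
kubectl cluster-info
kubectl get nodes
```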



<p class="part">The automatic deployment of the full Binder stack is already prepared in the form of a Helm package. Helm is a package manager for Kubernetes, and it needs a client part (<code>helm</code>) and a server part (<code>tiller</code>) to work.</p>



<p class="part">All the information about installing <code>helm</code> and <code>tiller</code> can be found <a href="https://helm.sh/docs/using_helm/#installing-helm" target="_blank" rel="noopener noreferrer nofollow external" data-wpel-link="external">here</a>.</p>
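<p>For reference, the Helm 2-style setup (current at the time of writing; later Helm versions drop <code>tiller</code> entirely) boils down to a few commands: give <code>tiller</code> a service account with sufficient rights, then install it in the cluster:</p>

```shell
# Helm 2 setup: create a service account for tiller, grant it
# cluster-admin rights, then install tiller into the cluster.
kubectl --namespace kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller \
    --clusterrole=cluster-admin \
    --serviceaccount=kube-system:tiller
helm init --service-account tiller

# Check that both the client and the server report a version.
helm version
```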



<h3 class="part wp-block-heading" id="Configuration-of-our-HELM-deployment">Configuration of our Helm deployment</h3>



<p class="part">With <code>tiller</code> installed on our cluster, everything was ready to automate the deployment of Binder on our OVH infrastructure.</p>



<p class="part">The configuration of the <code>helm</code> deployment is pretty straightforward and all the steps have been described by the Binder team <a rel="noopener noreferrer nofollow external" href="https://binderhub.readthedocs.io/en/latest/setup-binderhub.html" target="_blank" data-wpel-link="external">here</a>.</p>
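<p>Concretely, the deployment itself is a single <code>helm install</code> against the BinderHub chart, fed with two configuration files. A sketch with illustrative names (the contents of <code>secret.yaml</code> and <code>config.yaml</code>, such as registry credentials and hub URLs, are described in the guide linked above):</p>

```shell
# Add the JupyterHub chart repository, which hosts the BinderHub chart.
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart
helm repo update

# Deploy BinderHub (Helm 2 syntax); release name, namespace and the two
# values files are illustrative.
helm install jupyterhub/binderhub \
    --name=binder \
    --namespace=binder \
    -f secret.yaml -f config.yaml
```

<p>Re-running the same command with <code>helm upgrade</code> is then enough to roll out configuration changes.</p>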



<h3 class="part wp-block-heading" id="Integration-into-the-binderhub-CDCI-process">Integration into the binderhub CD/CI process</h3>



<p class="part">The Binder team already had an existing Travis workflow to automate their test and deployment processes. Everything is transparent, and they expose all their configuration (except secrets) on <a rel="noopener noreferrer nofollow external" href="https://github.com/jupyterhub/mybinder.org-deploy" target="_blank" data-wpel-link="external">their</a> GitHub project. We just had to integrate with their current workflow and push our specific configuration to their repository.</p>



<p class="part">We then waited for the next run of their Travis workflow, and it worked.</p>



<p class="part">From this moment onward, the OVH stack for Binder was running and accessible to anyone, from anywhere, at this address: <a href="https://ovh.mybinder.org/" target="_blank" rel="noopener noreferrer nofollow external" data-wpel-link="external">https://ovh.mybinder.org/</a>.</p>



<h2 class="part wp-block-heading" id="What-comes-next">What comes next?</h2>



<p class="part"><strong>OVH</strong> will continue engaging with the open-source data community, and keep building a strong relationship with the <strong>Jupyter</strong> foundation and, more generally, the Python community.</p>



<p class="part">This first collaborative experience with such a data-driven open-source organisation helped us to establish the best team organisation and management to ensure that both <strong>OVH</strong> and the community achieve their goals in the best way possible.</p>



<p class="part">Working with open source is very different from working in industry, as it requires a different mindset: very human-centric, where everyone has different objectives, priorities, timelines and points of view that should all be considered.</p>



<h2 class="part wp-block-heading" id="Special-Thanks">Special Thanks</h2>



<p>We are grateful to the Binder, Jupyter, and QuantStack teams for their help, to the OVH K8s team for OVH Managed Kubernetes and the OVH Managed Private Registry, and to the OVH MLS team for their support. You rock, people!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmybinder-and-ovh-partnership%2F&amp;action_name=MyBinder%20and%20OVH%20partnership&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
