Build a powerful Audio Virtual Assistant in less than 100 lines of code with AI Endpoints!

Raise your hands off the keyboard and chat with your LLM by voice with this Audio Virtual Assistant!

*An audio robot assistant talking to a human about the recipe for apple pie*

Nowadays, the creation of virtual assistants has become more accessible than ever, thanks to advances in AI (Artificial Intelligence), particularly in the field of SpeechAI and the GenAI models.

We will explore how OVHcloud AI Endpoints can be leveraged to design and develop an Audio Virtual Assistant capable of processing and understanding verbal questions, providing accurate answers, and returning answers verbally through speech synthesis.

In this step-by-step tutorial, we will take a look at how to send audio through the microphone to the LLM (Large Language Models) via a written transcript of the ASR (Automatic Speech Recognition). The response is then formulated orally by a TTS (Text To Speech) model.

Objectives

Whatever your level in AI, whether you’re a beginner or an expert, this tutorial will enable you to create your own powerful Audio Virtual Assistant in just a few lines of code.

How to?

By connecting your AI Endpoints like puzzles!

Retrieve the written transcript of your oral question with ASR endpoint
Get the answer to your question with an LLM endpoint
Take advantage of the TTS endpoint with the oral reply

👀 But first of all, a few definitions are needed to fully understand the technical implementation that follows.

Concept

To better understand the technologies that revolve around the Audio Virtual Assistant, let’s start by examining the models and notions of ASR, LLM, TTS…

AI Endpoints in a few words

AI Endpoints is a new serverless platform powered by OVHcloud and designed for developers.

The aim of AI Endpoints is to enable developers to enhance their applications with AI APIs, whatever their level and without the need for AI expertise.

It offers a curated catalog of world-renowned AI models and Nvidia’s optimized models, with a commitment to privacy as data is not stored or shared during or after model use.

AI Endpoints provides access to advanced AI models, including Large Language Models (LLMs), natural language processing, translation, speech recognition, image recognition, and more.

To know more about AI Endpoints, refer to this website.

AI Endpoints proposes several ASR APIs in different languages… But what means ASR?

Questioning with ASR

Automatic Speech Recognition (ASR) technology, also known as Speech-To-Text, is the process of converting spoken language into written text.

This process consists of several stages, including preparing the speech signal, extracting features, creating acoustic models, developing language models, and utilizing speech recognition engines.

With AI Endpoints, we simplify the use of ASR technology through our ready-to-use inference APIs. Learn how to use our APIs by following this link.

These APIs can be used to transcribe a recorded audio question into text, which can then be sent to a Large Language Model (LLM) for an answer.

Answering using LLM

LLMs, or Large Language Models, are known for producing text that is similar to how humans write.

They use complex algorithms to predict patterns in human language, understand context, and provide relevant responses. With LLM, virtual assistants can engage in meaningful and dynamic conversations with users.

If you want to learn more, the best way is to try it out for yourself! You can do so by following this link.

In this particular application, the LLM will be configured to answer the user question based on the results of the ASR (Automatic Speech Recognition) endpoint.

🤯 Would you like a verbal response? Don’t worry, that’s what TTS is for.

Expressing orally through TTS

TTS stands for Text-To-Speech, which is a type of technology that converts written text into spoken words.

This technology uses Artificial Intelligence algorithms to interpret and generate human-like speech from text input.

It is commonly used in various applications such as voice assistants, audiobooks, language learning platforms, and accessibility tools for individuals with visual or reading impairments.

With AI Endpoints, TTS is easy to use thanks to the turnkey inference APIs. Test it for free here.

🤖 Are you ready to start coding the Audio Virtual Assistant? Here we go: 3, 2, 1, begin!

Technical implementation of the Audio Virtual Assistant

This technical section covers the following points:

the use of the ASR endpoint inside Python code to transcribe audio request
the implementation of the TTS function to convertLLM response into spoken words
the creation of a Chatbot app with Mixtral8x22B LLM and Streamlit

➡️ Access the full code here.

*Working principle of the web app resulting from technical implementation*

To build the Audio Virtual Assistant, start by setting up the environment.

Set up the environment

In order to use AI Endpoints APIs easily, create a .env file to store environment variables.

ASR_GRPC_ENDPOINT=nvr-asr-en-us.endpoints-grpc.kepler.ai.cloud.ovh.net:443
TTS_GRPC_ENDPOINT=nvr-tts-en-us.endpoints-grpc.kepler.ai.cloud.ovh.net:443
LLM_AI_ENDPOINT=https://mixtral-8x22b-instruct-v01.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1 OVH_AI_ENDPOINTS_ACCESS_TOKEN=<ai-endpoints-api-token>

⚠️ Test AI Endpoints and get your free token here

In the next step, install the needed Python dependencies.

Create the requirements.txt file with the following libraries and launch the installation.

⚠️The environnement workspace is based on Python 3.11

openai==1.13.3
streamlit==1.36.0 streamlit-mic-recorder==1.16.0
nvidia-riva-client==2.15.1 python-dotenv==1.0.1

pip install -r requirements.txt

Once this is done, you can create a Python file named audio-virtual-assistant-app.py.

Then, import Python librairies as follow:

import os
import numpy as np
from openai import OpenAI
import riva.client
from dotenv import load_dotenv
import streamlit as st
from streamlit_mic_recorder import mic_recorder

💡 You are now ready to start coding your web app!

Transcribe input question with ASR

First, create the Automatic Speech Recognition (ASR) function in order to transcribe microphone input into text.

How it works?

The audio input is sent from microphone recording
A call is made to the ASR AI Endpoint named nvr-asr-en-us
The full response is stored in resp variable and returned by the function

def asr_transcription(question):

    asr_service = riva.client.ASRService(
                    riva.client.Auth(uri=os.environ.get('ASR_GRPC_ENDPOINT'), use_ssl=True, metadata_args=[["authorization", f"bearer {ai_endpoint_token}"]])
                )
    
    # set up config
    asr_config = riva.client.RecognitionConfig(
        language_code="en-US",   # languages: en-US
        max_alternatives=1,
        enable_automatic_punctuation=True,
        audio_channel_count = 1,
    )  

    # get asr model response
    response = asr_service.offline_recognize(question, asr_config)

    return response.results[0].alternatives[0].transcript

🎉 Congratulations! Your ASR function is ready to use.

⏳ Almost there! All that remains is to implement the TTS to transform the LLM response into spoken words…

Return the response using TTS

Then, build the Text To Speech (TTS) function in order to transform the written answer into oral reply.

In practice?

The LLM response is retrieved
A call is made to the TTS AI Endpoint named nvr-tts-en-us
The audio sample and the sample rate are returned to play the audio automatically

def tts_synthesis(response):
    
    tts_service = riva.client.SpeechSynthesisService(
                    riva.client.Auth(uri=os.environ.get('TTS_GRPC_ENDPOINT'), use_ssl=True, metadata_args=[["authorization", f"bearer {ai_endpoint_token}"]])
                )
    
    # set up config
    sample_rate_hz = 48000
    req = {
            "language_code"  : "en-US",                                 # languages: en-US
            "encoding"       : riva.client.AudioEncoding.LINEAR_PCM ,
            "sample_rate_hz" : sample_rate_hz,                          # sample rate: 48KHz audio
            "voice_name"     : "English-US.Female-1"                    # voices: `English-US.Female-1`, `English-US.Male-1`
    }
    
    # return response
    req["text"] = response
    response = tts_service.synthesize(**req)

    return np.frombuffer(response.audio, dtype=np.int16), sample_rate_hz

⚡️ You’re almost there! Now all you have to do is build your Chatbot app.

Build the LLM chat app with Streamlit

In this last step, create the Chatbot app using Mixtral8x22B endpoint and Streamlit.

Tell me more about Mixtral8x22B endpoint!

The Mixtral-8x22B-Instruct is a Large Language Model (LLM) trained by Mistral AI. It is available on OVHcloud AI Endpoints catalog.

What about Streamlit?

Streamlit is an Open-Source Python library that makes it easy to create and share beautiful, custom web apps for Machine Learning and Data Science in minutes. With Streamlit, you can turn your data scripts into shareable web apps by writing a few lines of code.

What to do?

Create a first Streamlit container to put the title using st.container() and st.title()
Add a second container for bot and user messages thanks to the following components: st.container() ; st.session_state() ; st.chat_message()
Use a third container for the microphone recording, the usage of the ASR, LLM, TTS, and the automatic audio player.

# streamlit interface
with st.container():
    st.title("💬 Audio Virtual Assistant Chatbot")
    
with st.container(height=600):
    messages = st.container()
    
    if "messages" not in st.session_state:
        st.session_state["messages"] = [{"role": "system", "content": "Hello, I'm AVA!", "avatar":"🤖"}]
    
    for msg in st.session_state.messages:
        messages.chat_message(msg["role"], avatar=msg["avatar"]).write(msg["content"])

with st.container():

    placeholder = st.empty()
    _, recording = placeholder.empty(), mic_recorder(
            start_prompt="START RECORDING YOUR QUESTION ⏺️", 
            stop_prompt="STOP ⏹️", 
            format="wav",
            use_container_width=True,
            key='recorder'
        )
    
    if recording:  
        user_question = asr_transcription(recording['bytes'])
        
        if prompt := user_question:
            client = OpenAI(base_url=os.getenv("LLM_AI_ENDPOINT"), api_key=ai_endpoint_token)
            st.session_state.messages.append({"role": "user", "content": prompt, "avatar":"👤"})
            messages.chat_message("user", avatar="👤").write(prompt)
            response = client.chat.completions.create(
                model="Mixtral-8x22B-Instruct-v0.1", 
                messages=st.session_state.messages,
                temperature=0,
                max_tokens=1024,
            )
            msg = response.choices[0].message.content
            st.session_state.messages.append({"role": "system", "content": msg, "avatar": "🤖"})
            messages.chat_message("system", avatar="🤖").write(msg)

            if msg is not None:
                audio_samples, sample_rate_hz = tts_synthesis(msg)
                placeholder.audio(audio_samples, sample_rate=sample_rate_hz, autoplay=True)

Now, the Audio Virtual Assistant is ready to use!

🚀 That’s it! Now get the most out of your tool by launching it locally.

Launch Streamlit chatbot app locally

Finally, you can start this Streamlit app locally by launching the following command:

streamlit run audio-virtual-assistant.py

Benefit from the full power of your tool as follow!

☁️ It’s also possible to make your interface accessible to everyone…

Go further

If you want to go further and deploy your web app in the cloud, refer to the following articles and tutorials.

Conclusion of the Audio Virtual Assistant

Well done 🎉! You have learned how to build your own Audio Virtual Assistant in a few lines of code.

You’ve also seen how easy it is to use AI Endpoints to create innovative turnkey solutions.

➡️ Access the full code here.

🚀 What’s next? Implement RAG chatbot o specialize this Audio Virtual Assistant on your data!

References

Eléa Petton

+ posts

Machine Learning Engineer