<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Artificial Intelligence Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/artificial-intelligence/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/artificial-intelligence/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Mon, 15 Dec 2025 09:22:55 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Artificial Intelligence Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/artificial-intelligence/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Industrial Excellence meets Artificial Intelligence: Behind the Scenes with Smart Datacenter</title>
		<link>https://blog.ovhcloud.com/industrial-excellence-meets-artificial-intelligence-behind-the-scenes-with-smart-datacenter/</link>
		
		<dc:creator><![CDATA[Ali Chehade, Julien Jay and Christian Sharp]]></dc:creator>
		<pubDate>Fri, 12 Dec 2025 14:35:42 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[cooling]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30107</guid>

					<description><![CDATA[At OVHcloud, we are constantly looking for ways to improve our operations and reduce our impact on the environment. This has been a defining part of the company since 1999 and is a key part of our organisational DNA and our commercial model. We are very proud to present the new Smart Datacenter cooling system, [&#8230;]]]></description>
										<content:encoded><![CDATA[



<p>At OVHcloud, we are constantly looking for ways to improve our operations and reduce our impact on the environment. This has been a defining part of the company since 1999 and is a key part of our organisational DNA and our commercial model.</p>



<p>We are very proud to present the new Smart Datacenter cooling system, which significantly improves energy and water efficiency while delivering a marked reduction in carbon impact across the entire cooling chain, from manufacturing and transport to daily operations.</p>



<p>The system is a new way of building and deploying datacenter infrastructure, changing how we manage and monitor water supply and demand, using a combination of industrial design, IoT sensors and AI innovation, specifically in our smart racks, advanced cooling distribution units (CDUs) and intelligent dry coolers.</p>



<p>Smart Datacenter delivers a reduction in power consumption of up to 50% across the entire cooling loop, from server water blocks to dry coolers, and consumes 30% less water compared to OVHcloud’s earliest design, driving major sustainability benefits. The system also uses complex mathematical models capturing detailed rack-level and environmental data to optimize cooling performance in real time. Furthermore, all operational data is fed into a centralized data lake, enabling cutting-edge artificial intelligence to predict, adapt, and enhance system efficiency and reliability.</p>



<h2 class="wp-block-heading">Let’s get into the detail.</h2>



<p>The system has three main components:</p>



<ol start="1" class="wp-block-list">
<li><strong>Smart Racks: </strong>These are designed with an innovative hydraulic “pull” architecture, where each rack autonomously draws exactly the water flow, pressure, and temperature it needs, dynamically adapting to server load and performance.</li>



<li><strong>Advanced Cooling Distribution Unit (CDU): </strong>This is a compact, next-generation primary loop unit that autonomously balances flow and pressure across all racks without manual intervention or any electrical communication. It uses only hydraulic signals (pressure, flow and temperature of water) to “understand” rack demands and continuously optimizes operation for lowest power consumption and extended pump lifespan.</li>



<li><strong>Intelligent Dry Cooler: </strong>This is operated seamlessly by the CDU, eliminating the need for separate control systems (“brains”) on both the dry cooler and the CDU. This unified control architecture ensures optimized, coordinated performance across the entire cooling infrastructure.</li>
</ol>



<p>OVHcloud’s new Single-Circuit System (SCS) replaces the previous Dual-Circuit System cooling architecture (DCS), which consisted of a primary facility loop and a secondary in-rack loop separated by an in-rack Coolant Distribution Unit (CDU), installed inline directly after the rear door heat exchangers (RDHX), as shown in Figure 1. The CDU housed multiple pumps, several plate heat exchangers (PHEX), and a network of valves and sensors.</p>



<figure class="wp-block-video aligncenter"><video height="1080" style="aspect-ratio: 1920 / 1080;" width="1920" controls src="https://blog.ovhcloud.com/wp-content/uploads/2025/12/OVH-cooling-loop.mp4"></video></figure>



<p>Figure 1. Dual-Circuit System cooling architecture (DCS) vs Single-Circuit system (SCS).</p>



<p>That previous design maintained turbulent flow through water blocks (WBs) using the in-rack CDU to regulate flow and temperature differences, ensuring performance despite OVHcloud’s ΔT of 20 K on the primary loop (far higher than the typical market value around 5 K).</p>
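<p>To see why the 20 K ΔT matters, consider the energy balance P = ṁ · cp · ΔT: for a fixed heat load, the required coolant flow is inversely proportional to the loop's temperature difference. A short sketch of that arithmetic (the 40 kW rack load is an illustrative figure, not an OVHcloud specification):</p>

```python
# Required coolant mass flow from the energy balance P = m_dot * cp * dT.
# The 40 kW rack load is an illustrative figure, not an OVHcloud spec.

CP_WATER = 4186.0  # specific heat of water, J/(kg*K)

def required_flow(power_w: float, delta_t_k: float) -> float:
    """Mass flow (kg/s) needed to remove power_w at a given loop delta-T."""
    return power_w / (CP_WATER * delta_t_k)

rack_power = 40_000.0  # W, hypothetical rack load

flow_20k = required_flow(rack_power, 20.0)  # OVHcloud's primary-loop delta-T
flow_5k = required_flow(rack_power, 5.0)    # typical market value

print(f"flow at dT=20 K: {flow_20k:.3f} kg/s")
print(f"flow at dT=5 K:  {flow_5k:.3f} kg/s")
print(f"ratio: {flow_5k / flow_20k:.1f}x")  # 4x less water moved at dT=20 K
```

<p>A four-times-higher ΔT means a four-times-lower flow rate for the same heat removed, which is exactly why the 20 K gradient translates into smaller pumps and lower pumping energy.</p>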



<p>Removing the in-rack CDU — replaced by a Pressure Independent Control Valve (PICV), a flow meter, and two temperature sensors on each rack — simplifies the system to a single closed-loop, where the flow rate through servers is dictated directly by the primary loop, adapting dynamically to rack load density. On the rack side, the system adapts the exact flow the rack requires by analyzing the water behavior and performing iterative, predictive thermal optimization considering IT components and the supplied water temperature and flow. This results in lower inlet water temperatures at the server level due to the elimination of the in-rack CDU’s approach temperature difference, and reduces electrical consumption, CAPEX, carbon footprint, and rack footprint.</p>



<p>To prevent laminar flow and maintain heat transfer efficiency at low flow rates, OVHcloud introduced a passive hydraulic innovation by arranging servers into clusters connected in series with servers inside each cluster connected in parallel, rather than all servers in parallel. This ensures higher water flow through individual servers even when the rack density is low. While this increases system pressure drops depending on cluster configuration, it results in better thermal performance and all servers receive water at temperatures equal to or lower than in the previous DCS design.</p>
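<p>The benefit of the cluster arrangement is plain arithmetic: with k clusters in series, the full rack flow passes through every cluster instead of splitting across all servers at once. A minimal sketch, using invented flow and server counts:</p>

```python
# Per-server flow for two rack plumbing layouts, at the same total rack flow.
# The numbers are illustrative, not OVHcloud figures.

def per_server_flow_parallel(total_flow: float, n_servers: int) -> float:
    """All servers in parallel: the rack flow splits across every server."""
    return total_flow / n_servers

def per_server_flow_clustered(total_flow: float, n_servers: int,
                              n_clusters: int) -> float:
    """Clusters in series, servers in parallel inside each cluster: the full
    rack flow passes through each cluster, splitting only across the servers
    of that cluster."""
    servers_per_cluster = n_servers // n_clusters
    return total_flow / servers_per_cluster

total = 0.48   # kg/s total rack flow (hypothetical)
n = 24         # servers in the rack (hypothetical)

parallel = per_server_flow_parallel(total, n)        # ~0.02 kg/s per server
clustered = per_server_flow_clustered(total, n, 4)   # ~0.08 kg/s per server
print(parallel, clustered)
```

<p>With four series clusters, each server sees four times the flow it would get in an all-parallel layout, keeping the flow turbulent even at low rack density.</p>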



<p>The racks operate on a novel hydraulic “pull” principle — where each rack draws exactly the hydraulic power it requires, rather than being pushed by the system. The CDU then dynamically adapts the overall hydraulic performance of the primary loop, balancing flow and pressure in real time to match the actual demand of the entire data center.</p>



<p>A key breakthrough is the CDU’s communication-free operation: it requires no cables, radio waves, or other electronic communication with racks. Instead, it analyzes hydraulic signals — pressure, flow, and temperature fluctuations within the water itself — to understand each rack’s cooling needs and adapt accordingly. This eliminates complex telemetry infrastructure, reduces operational risks, and enhances system reliability. To ensure water quality and system longevity, water supplied to the data center is filtered at 25 microns, and multiple sophisticated high-precision sensors continuously monitor water quality in real time.</p>



<p>The CDU is 50% smaller than the previous generation and manages the entire thermal path — from chip-level water blocks, through the racks and CDU, to the dry coolers.</p>



<p>The newly designed dry cooler is also 50% smaller than the previous model and features one of the lowest density footprints worldwide. Thanks to years of thermal studies on heat exchangers by the OVHcloud R&amp;D team, it has 50% fewer fans, resulting in very low energy consumption, while also reducing noise. Its compact size means that we can also transport more units in the same truck! This design achieves a 30% reduction in water consumption compared to OVHcloud’s earliest dry cooler design. A key innovation in the dry cooler is its advanced adiabatic cooling pad system, which cools incoming hot air before it passes through the heat exchangers. This high-precision water injection system is the first of its kind and adjusts water application based on multiple sensors and extensive iterative calculations, including data center load, ambient temperature, and humidity levels.</p>



<p>Unlike traditional adiabatic systems, the pad system does not use a conventional recirculation loop. Instead, water is injected when needed onto the pads via a simple setup consisting of a solenoid valve and a flow meter, eliminating complex hydraulics such as pumps, filters, storage tanks, level sensors, and conductivity sensors. The system maintains water quality and physical/chemical properties through careful design, drastically simplifying operation and reducing maintenance needs.</p>



<p>The CDU continuously analyzes data from up to 36 sensors distributed across the CDU itself and the associated dry cooler. It also collects operational data from solenoid valves, pumps, and dry cooler fans across the infrastructure loop. All components are monitored and managed by the system’s central intelligence—the CDU’s control panel—providing a comprehensive understanding of the entire system’s behavior, from the data center interior to the external ambient environment, ensuring real-time performance oversight and precise thermal regulation.</p>



<p>Through this iterative and precise control of water injection, the system optimizes cooling performance and Water Usage Effectiveness (WUE), ensuring minimal water consumption without sacrificing thermal effectiveness.</p>
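<p>Water Usage Effectiveness, as defined by The Green Grid, is simply the site water consumed divided by the IT equipment energy delivered. A quick sketch with invented figures (not OVHcloud measurements) shows how a 30% water saving moves the metric:</p>

```python
# Water Usage Effectiveness (WUE): site water consumed (liters) divided by
# IT equipment energy (kWh). The figures below are illustrative only.

def wue(water_liters: float, it_energy_kwh: float) -> float:
    return water_liters / it_energy_kwh

# hypothetical annual figures for one facility
baseline = wue(water_liters=2_600_000, it_energy_kwh=10_000_000)
smart = wue(water_liters=2_600_000 * 0.70, it_energy_kwh=10_000_000)  # 30% less water

print(f"baseline WUE: {baseline:.3f} L/kWh")
print(f"smart WUE:    {smart:.3f} L/kWh")
```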



<h2 class="wp-block-heading"><strong>Advanced System Analytics, Learning &amp; AI Integration</strong></h2>



<p>The entire system is designed to continuously analyze the thermal, hydraulic, and aerodynamic behaviors of the various fluids along the cooling path. It uses daily operational data to learn and adapt its performance dynamically, optimizing cooling efficiency and reliability over time.</p>



<p>The CDU’s brain—the control panel—aggregates data from 36 sensors distributed across the CDU and dry cooler, as well as operational data from solenoid valves, pumps, and dry cooler fans within the infrastructure loop. It also collects critical rack-level information, including flow rates, temperatures, and IPMI data that reflect IT equipment behavior and performance. All this operational data is pushed to a centralized data lake for parallel analysis, which forms the foundation for the next step: integrating cutting-edge artificial intelligence (AI). This AI will leverage the continuously gathered data and learning processes to enhance predictive capabilities, optimize future operating points, and enable fully autonomous decision-making.</p>



<p>This combination of real-time learning and AI-powered analytics will provide advanced diagnostics, predictive maintenance, and proactive management — maximizing uptime, reducing costs, and driving ever-greater sustainability.</p>



<h2 class="wp-block-heading"><strong>Iterative Control System Innovation</strong></h2>



<p>The iterative control system manages all aspects in real time, hands-free, continuously learning from sensor data and operational feedback. It applies algorithms to the pump speed on the CDU, the fans on the dry cooler and the solenoid valve controlling water injection on the adiabatic pads.</p>
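<p>OVHcloud’s actual algorithms are iterative and predictive; as a rough intuition for what “applying an algorithm to an actuator” means, here is a deliberately simple proportional controller with invented gains and setpoints, driving one actuator (say, CDU pump speed) from a measured temperature error:</p>

```python
# A deliberately simple proportional controller for one actuator (e.g. CDU
# pump speed) driven by a measured water temperature. OVHcloud's real control
# is iterative and predictive; the gain and setpoint here are invented.

def p_controller(setpoint: float, measured: float, gain: float,
                 output: float, out_min: float = 0.0,
                 out_max: float = 100.0) -> float:
    """Return a new actuator command (% speed), clamped to its range."""
    error = measured - setpoint  # positive when too hot -> speed up
    return min(out_max, max(out_min, output + gain * error))

speed = 40.0  # % pump speed, arbitrary starting point
for water_temp in [31.0, 32.5, 30.2, 30.0]:  # simulated return temps, deg C
    speed = p_controller(setpoint=30.0, measured=water_temp,
                         gain=8.0, output=speed)

print(f"final pump speed: {speed:.1f}%")
```

<p>The same pattern, with far richer models and predictive terms, applies to the dry cooler fans and the adiabatic injection valve.</p>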



<p>On the rack side, the system uses a PICV valve, flow meter, and two temperature sensors to adapt the exact hydraulic flow needed by each rack, considering IT load and incoming water conditions, iteratively optimizing thermal performance and energy efficiency.</p>



<p>On the CDU, the system analyzes combined hydraulic signals from all racks alongside ambient data center conditions, dynamically balancing flow and pressure across the entire data center infrastructure without human intervention.</p>



<p>Furthermore, OVHcloud’s cooling system integrates intelligent communication between cooling line-ups to enhance failure detection and simplify maintenance. This is achieved through embedded freeze-guard and resilience-switch mechanisms that ensure continuous operation and system resilience. The freeze-guard system is designed to protect the dry coolers in sub-zero ambient conditions by keeping water circulating through their heat exchangers. If the overall loop flow drops below a predefined threshold, the system automatically opens a normally closed bypass valve to maintain circulation—preventing freezing despite the use of pure water (without glycol) as the cooling medium. The resilience-switch system maintains redundancy by hydraulically linking multiple cooling lines. In the event of failure or overload on one line, normally open solenoid valves isolate the affected line, while bypass valves on neighboring lines open to redistribute water flow and maintain cooling performance. This dynamic and autonomous valve management ensures uninterrupted service and rapid fault response.</p>



<p>Drawing inspiration from autonomous control methodologies in leading-edge industries, the system predicts future behavior based on iterative calculations, dynamically adapting pump speed, fan speed and solenoid valve openings to converge rapidly on optimal operating points. It also adjusts performance based on external constraints such as noise limits, water availability, or energy costs — for example, consuming more energy to save water in water-stressed regions or balancing noise restrictions in urban deployments.</p>



<p>This unique, self-optimizing end-to-end control system maximizes energy efficiency, sustainability, and operational simplicity, extending pump life cycles and ensuring the most environmentally responsible data center cooling solution available today.</p>



<p>This vertically integrated, autonomous system — including smart racks, the advanced CDU, and the intelligent dry cooler — represents a world-first in end-to-end, intelligent, sustainable, communication-free, and data-driven data center cooling.</p>



<h2 class="wp-block-heading"><strong>Why is this important?</strong></h2>



<p>This innovation is critical because it marks a decisive step toward radically more sustainable, efficient, and autonomous data center cooling — addressing the growing demands of digital infrastructure while reducing its environmental footprint.</p>



<p>By using fewer, smaller components, we are saving power, cutting transport costs and reducing carbon impact. Using fewer fans on the dry cooler means up to 50% lower energy consumption on the cooling cycle – and the new pad system means 30% lower water consumption in the cooling system. The system is fully autonomous, avoiding human error. A temperature gradient of 20K on the primary loop – four times higher than the industry average – means that flow rates can be lower and water efficiency is higher. The system doesn’t rely on Wi-Fi or cabling, and the predictive control constantly adapts to external conditions or situational goals, feeding into a data lake to help continuously optimize performance.</p>



<p>Today’s world is built on technology, and datacenters are a key part of that technology, but there is a pressing need to ensure we can maintain human progress without incurring a significant carbon footprint. Power and water efficiency is a key part of this equation in the datacenter industry, and our innovation in the Smart Datacenter continues our trajectory of supporting today’s needs without compromising the world of tomorrow.</p>



<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="575" src="https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-6-1024x575.png" alt="" class="wp-image-30116" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-6-1024x575.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-6-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-6-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-6.png 1502w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2025/12/OVH-cooling-loop.mp4" length="4050958" type="video/mp4" />

			</item>
		<item>
		<title>PostgreSQL and AI: The pragmatic path to smarter data</title>
		<link>https://blog.ovhcloud.com/postgresql-and-ai-the-pragmatic-path-to-smarter-data/</link>
		
		<dc:creator><![CDATA[Jonathan Clarke]]></dc:creator>
		<pubDate>Thu, 11 Dec 2025 15:11:00 +0000</pubDate>
				<category><![CDATA[Accelerating with OVHcloud]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[aiven]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Managed Database]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=30100</guid>

					<description><![CDATA[Beyond the buzz: Building AI on solid foundations Artificial intelligence has quickly become the cornerstone of digital innovation. From text generation to image recognition and intelligent automation, AI is redefining how organisations extract value from data. At OVHcloud, we believe this transformation shouldn’t only belong to the tech elite &#8211; it should be open, accessible, [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading"><strong>Beyond the buzz: Building AI on solid foundations</strong></h2>



<p><strong>Artificial intelligence</strong> has quickly become the cornerstone of digital innovation. From text generation to image recognition and intelligent automation, AI is redefining how organisations extract value from data.</p>



<p>At OVHcloud, we believe this transformation shouldn’t only belong to the tech elite &#8211; it should be open, accessible, and built on trusted, sovereign infrastructure.</p>



<p>This vision drives everything from our <a href="https://www.ovhcloud.com/en-ie/public-cloud/ai-endpoints/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>AI Endpoints</strong></a> and <a href="https://www.ovhcloud.com/en-ie/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>AI Deploy</strong></a> solutions to our <a href="https://huggingface.co/blog/OVHcloud/inference-providers-ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Hugging Face partnership</strong></a>, which empowers developers to run open <strong>inference</strong> models directly in the cloud. But beyond those flagship initiatives, AI also lives in the everyday – in the data that powers recommendations, insights and smarter user experiences.</p>



<p>And that’s where <strong>PostgreSQL + Vector capabilities</strong> come in.</p>



<figure class="wp-block-image size-full"><img decoding="async" width="841" height="561" src="https://blog.ovhcloud.com/wp-content/uploads/2025/02/Image1.png" alt="" class="wp-image-28243" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/02/Image1.png 841w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/Image1-300x200.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/02/Image1-768x512.png 768w" sizes="(max-width: 841px) 100vw, 841px" /></figure>



<h2 class="wp-block-heading"><strong>Vectors: Where data meets understanding</strong></h2>



<p>At their core, AI systems function by modelling relationships between words, images or user behaviours. To do that, machine learning models translate these entities <strong>into vectors </strong>— mathematical representations that capture meaning and similarity.</p>



<p>A vector representation allows a system to measure how close two pieces of data are. It is the foundation of <strong>semantic search</strong>, <strong>recommendation engines</strong>, <strong>facial recognition</strong> and <strong>anomaly detection systems</strong>.</p>
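<p>The usual way to measure that closeness is cosine similarity: 1.0 means the vectors point in the same direction (semantically close), 0.0 means they are orthogonal. A minimal sketch with hand-made toy vectors (real embedding models produce hundreds or thousands of dimensions):</p>

```python
# Cosine similarity between embedding vectors. The 4-dimensional vectors
# below are hand-made toy values, not real model embeddings.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.85, 0.15, 0.05, 0.25]
invoice = [0.0, 0.1, 0.95, 0.1]

print(cosine_similarity(cat, kitten))   # close to 1: similar meaning
print(cosine_similarity(cat, invoice))  # much lower: unrelated
```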



<p>Traditionally, companies needed to move their datasets from transactional databases into specialised “vector databases.” While vector databases are effective for purely vector-centric workloads, this approach often comes with <strong>higher complexity</strong>, data duplication, and <strong>integration</strong> <strong>overhead</strong>. These challenges are not ideal for production-grade systems that demand reliability and compliance.</p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="373" height="355" src="https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-2.png" alt="" class="wp-image-30101" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-2.png 373w, https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-2-300x286.png 300w" sizes="auto, (max-width: 373px) 100vw, 373px" /></figure>



<div style="height:20px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading"><strong>PostgreSQL + pgvector: AI where your data already lives</strong></h2>



<p>Instead of creating yet another database to maintain, <a href="https://www.ovhcloud.com/en-ie/public-cloud/postgresql/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>PostgreSQL</strong></a> offers an elegant solution: the <strong><em>pgvector </em>extension</strong>. With pgvector, organisations can store, query and compare vectorised data alongside traditional relational data, using the same SQL syntax they already know. pgvector also allows you to build full or partial indexes to speed up similarity search.</p>
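<p>As a sketch of what this looks like in SQL: the statements below use pgvector’s real operators (<code>&lt;-&gt;</code> for L2 distance, <code>&lt;=&gt;</code> for cosine distance) and index types, but the table and column names are invented, and running them requires a live PostgreSQL server with the extension installed, so they are held here as strings for reference:</p>

```python
# The SQL a pgvector-backed similarity search boils down to. Table and
# column names are invented for illustration; executing these requires a
# PostgreSQL server with the pgvector extension, so they are kept as
# strings here rather than run.

CREATE_STATEMENTS = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
    id bigserial PRIMARY KEY,
    description text,
    embedding vector(384)          -- dimension must match your model
);

-- optional approximate index to speed up similarity search
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);
"""

# '<->' is pgvector's L2 distance operator ('<=>' is cosine distance):
# nearest neighbours of a query embedding, in plain SQL.
SIMILARITY_QUERY = """
SELECT id, description
FROM items
ORDER BY embedding <-> %(query_embedding)s
LIMIT 5;
"""

print(SIMILARITY_QUERY.strip())
```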



<p>In other words, PostgreSQL becomes not just your source of truth, but also your foundation for AI experimentation and delivery.</p>



<p>Here’s what this means in practice:</p>



<ul class="wp-block-list">
<li><strong>Simplified architecture</strong>: Keep data in one place. No ETL pipelines or synchronisation risks.</li>



<li><strong>Familiar SQL workflow</strong>: Run similarity searches directly in SQL, with ACID guarantees intact.</li>



<li><strong>Faster time to value</strong>: Build and iterate AI use cases faster, without learning a new database technology.</li>
</ul>



<p>This is AI grounded in operational reality — a pragmatic path for enterprises to explore machine learning use cases <strong>safely</strong> and <strong>efficiently</strong>.</p>



<h2 class="wp-block-heading"><strong>A practical use case: Real-time product recommendations</strong></h2>



<p>Imagine an e-commerce company managing both product and customer data in <a href="https://www.ovhcloud.com/en-ie/public-cloud/postgresql/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Managed PostgreSQL</strong></a> at OVHcloud, ensuring access to the latest, most performant features.</p>



<p>By combining <em>pgvector</em> with embeddings generated from an open-source model, the team can:</p>



<ol class="wp-block-list">
<li>Convert product descriptions and user preferences into vector representations.</li>



<li>Store these vectors in PostgreSQL columns alongside stock levels, pricing and metadata.</li>



<li>Run a <strong>similarity search</strong> that finds relevant products instantly: for example, recommending ‘eco-friendly alternatives’ or ‘similar styles’ while ensuring only in-stock items are shown.</li>
</ol>
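<p>The three steps above can be sketched in miniature: fake embeddings, a nearest-neighbour search, and an in-stock filter. In production the same logic is a single SQL query over pgvector columns; the products and embedding values here are invented toy data:</p>

```python
# A toy, in-memory version of the three steps above: hand-made "embeddings",
# an L2 nearest-neighbour search, and an in-stock filter. All product data
# and vector values are invented for illustration.
import math

products = [
    # (name, embedding, in_stock)
    ("organic cotton t-shirt", [0.9, 0.8, 0.1], True),
    ("recycled polyester tee", [0.85, 0.75, 0.2], False),
    ("bamboo fibre t-shirt",   [0.88, 0.82, 0.15], True),
    ("leather jacket",         [0.1, 0.2, 0.9], True),
]

def l2(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommend(query_embedding: list[float], top_k: int = 2) -> list[str]:
    """Nearest in-stock products to the query embedding."""
    in_stock = [(name, emb) for name, emb, stock in products if stock]
    ranked = sorted(in_stock, key=lambda p: l2(p[1], query_embedding))
    return [name for name, _ in ranked[:top_k]]

# a shopper looking at an "eco-friendly t-shirt"-like item
print(recommend([0.9, 0.8, 0.12]))
```

<p>Note that the out-of-stock item never appears, mirroring the ‘only in-stock items are shown’ constraint from step 3.</p>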



<p>The entire process happens within PostgreSQL — <strong>no need for external vector databases or data duplication</strong>.</p>



<p><strong>The result</strong>: real-time, AI-enhanced customer experiences powered by trusted, open technology.</p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="364" height="262" src="https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-3.png" alt="" class="wp-image-30102" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-3.png 364w, https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-3-300x216.png 300w" sizes="auto, (max-width: 364px) 100vw, 364px" /></figure>



<div style="height:20px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading"><strong>The enterprise reality: AI without reinventing the wheel</strong></h2>



<p>In the rush to ‘go AI’, many <strong>organisations risk overcomplicating</strong> their architectures by chasing the latest dedicated vector databases. While those solutions have their place, <strong>PostgreSQL</strong>’s maturity, ecosystem and extensibility make it uniquely suited for the vast majority of enterprise AI workloads.</p>



<p>For most companies exploring AI, starting with what they already know (PostgreSQL) means <strong>solid foundations</strong>, <strong>less risk</strong>, <strong>faster learning</strong> and <strong>lower cost</strong>.</p>



<h2 class="wp-block-heading"><strong>The OVHcloud advantage: Open, managed, secure</strong></h2>



<p>OVHcloud’s partnership with <a href="https://aiven.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Aiven</strong></a>, which brings proven expertise in managing PostgreSQL at scale, ensures our customers benefit from the latest capabilities as soon as they are production-ready, without operational difficulties. Let your teams <strong>focus on their product</strong> rather than worry about database resources and infrastructure.</p>



<p>Additionally, OVHcloud customers can benefit from a service-level agreement (SLA) of up to 99.99% via its <a href="https://help.ovhcloud.com/csm/en-public-cloud-databases-migrate-1az-to-3az?id=kb_article_view&amp;sysparm_article=KB0072137" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Multi-Availability Zone</a> (3-AZ) regions. These regions feature geographically separated zones with independent power, cooling and network systems, providing true fault isolation.</p>



<figure class="wp-block-image aligncenter size-full"><img loading="lazy" decoding="async" width="345" height="345" src="https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-4.png" alt="" class="wp-image-30103" srcset="https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-4.png 345w, https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-4-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-4-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2025/12/image-4-70x70.png 70w" sizes="auto, (max-width: 345px) 100vw, 345px" /></figure>



<div style="height:20px" aria-hidden="true" class="wp-block-spacer"></div>



<p>At <strong>OVHcloud</strong>, we see PostgreSQL as more than a database. It’s a bridge between today’s workloads and tomorrow’s intelligent systems. And as AI workloads evolve, we’ll continue to integrate the technologies that matter: from <strong>vector search</strong> and <strong>AI embeddings</strong> to seamless connections with <strong>AI Endpoints</strong> and <strong>Hugging Face models</strong>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to serve LLMs with vLLM and OVHcloud AI Deploy</title>
		<link>https://blog.ovhcloud.com/how-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Wed, 29 May 2024 12:22:26 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Endpoints]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LLaMA]]></category>
		<category><![CDATA[LLaMA 3]]></category>
		<category><![CDATA[LLM Serving]]></category>
		<category><![CDATA[Mistral]]></category>
		<category><![CDATA[Mixtral]]></category>
		<category><![CDATA[vLLM]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=26762</guid>

					<description><![CDATA[In this tutorial, we will learn how to serve Large Language Models (LLMs) using vLLM and the OVHcloud AI Products.]]></description>
										<content:encoded><![CDATA[
<p><em>In this tutorial, we will walk you through the process of serving large language models (LLMs), providing step-by-step instructions</em>.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="345" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png" alt="" class="wp-image-25615" style="width:750px;height:auto" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1024x345.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-300x101.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-768x259.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-1536x518.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/LLaMA2_finetuning_OVHcloud_resized-2048x690.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p></p>



<h3 class="wp-block-heading">Introduction</h3>



<p>In recent years, <strong>large language models</strong> (LLMs) have become increasingly <strong>popular</strong>, with <strong>open-source</strong> models like <em>Mistral</em> and <em>LLaMA</em> gaining widespread attention. In particular, the <em>LLaMA 3</em> model, released on <em>April 18, 2024</em>, is one of today&#8217;s most powerful open-source LLMs.</p>



<p>However, <strong>serving these LLMs can be challenging</strong>, particularly on hardware with limited resources. Indeed, even on expensive hardware, LLMs can be surprisingly slow, with high VRAM utilization and throughput limitations.</p>



<p>This is where<strong><em> </em></strong><em><a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>vLLM</strong></a></em> comes in. <em><strong>vLLM</strong></em> is an <strong>open-source project</strong> that enables <strong>fast and easy-to-use LLM inference and serving</strong>. Designed for optimal performance and resource utilization, <em>vLLM</em> supports a range of <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLM architectures</a> and offers <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">flexible customization options</a>. That&#8217;s why we are going to use it to efficiently deploy and scale our LLMs.</p>



<h3 class="wp-block-heading">Objective</h3>



<p>In this guide, you will discover how to deploy an LLM with <a href="https://github.com/vllm-project/vllm" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em></a> and the <em>OVHcloud</em> <strong><em>AI Deploy</em></strong> solution. This will enable you to benefit from <em>vLLM</em>&#8216;s optimisations and <em>OVHcloud</em>&#8216;s GPU computing resources. Your LLM will then be exposed through a secured API.</p>



<p>🎁 And for those who do not want to bother with the deployment process, <strong>a surprise awaits you at the <a href="#AI-ENDPOINTS">end of the article</a></strong>. We are going to introduce you to our new solution for using LLMs, called <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>AI Endpoints</strong></a>. This product makes it easy to integrate AI capabilities into your applications with a simple API call, without the need for deep AI expertise or infrastructure management. And while it&#8217;s in alpha, it&#8217;s <strong>free</strong>!</p>



<h3 class="wp-block-heading">Requirements</h3>



<p>To deploy your <em>vLLM</em> server, you need:</p>



<ul class="wp-block-list">
<li>An <em>OVHcloud</em> account to access the <a href="https://www.ovh.com/auth/?action=gotomanager&amp;from=https://www.ovh.co.uk/&amp;ovhSubsidiary=GB" data-wpel-link="exclude"><em>OVHcloud Control Panel</em></a></li>



<li>A <em>Public Cloud</em> project</li>



<li>A <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">user for the AI Products</a>, related to this <em>Public Cloud</em> project</li>



<li><a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">The <em>OVHcloud AI CLI</em></a> installed on your local computer (to interact with the AI products by running commands). </li>



<li><a href="https://www.docker.com/get-started" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker</a> installed on your local computer, <strong>or</strong> access to a Debian Docker Instance, which is available on the <a href="https://www.ovh.com/manager/public-cloud/" data-wpel-link="exclude"><em>Public Cloud</em></a></li>
</ul>



<p>Once these conditions have been met, you are ready to serve your LLMs.</p>



<h3 class="wp-block-heading">Building a Docker image</h3>



<p>Since the <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>OVHcloud AI Deploy</em></a> solution is based on <a href="https://www.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Docker</em></a> images, we will be using a <em>Docker</em> image to deploy our <em>vLLM</em> inference server. </p>



<p>As a reminder, <em>Docker</em> is a platform that allows you to create, deploy, and run applications in containers. <em>Docker</em> containers are standalone and executable packages that include everything needed to run an application (code, libraries, system tools).</p>



<p>To create this <em>Docker</em> image, we will need to write the following <em><strong>Dockerfile</strong></em> into a new folder:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">mkdir my_vllm_image
cd my_vllm_image
nano Dockerfile</code></pre>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># 🐳 Base image
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

# 👱 Set the working directory inside the container
WORKDIR /workspace

# 📚 Install missing system packages (git) so we can clone the vLLM project repository
RUN apt-get update &amp;&amp; apt-get install -y git
RUN git clone https://github.com/vllm-project/vllm/

# 📚 Install the Python dependencies
RUN pip3 install --upgrade pip
RUN pip3 install vllm 

# 🔑 Give correct access rights to the OVHcloud user
ENV HOME=/workspace
RUN chown -R 42420:42420 /workspace</code></pre>



<p>Let&#8217;s take a closer look at this <em>Dockerfile</em> to understand it:</p>



<ul class="wp-block-list">
<li><strong>FROM</strong>: Specifies the base image for our <em>Docker</em> image. We choose the <em>PyTorch</em> image since it comes with <em>CUDA</em>, <em>cuDNN</em> and <em>torch</em>, which are needed by <em>vLLM</em>.</li>



<li><strong>WORKDIR /workspace</strong>: We set the working directory for the <em>Docker</em> container to <em>/workspace</em>, which is the default folder when we use <em>AI Deploy</em>.</li>



<li><strong>RUN</strong>: Allows us to upgrade <em>pip</em> to the latest version, making sure we have access to the latest libraries and dependencies. We also install the <em>vLLM</em> library and <em>git</em>, which enables us to clone the <a href="https://github.com/vllm-project/vllm/tree/main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>vLLM</em> repository</a> into the <em>/workspace</em> directory.</li>



<li><strong>ENV</strong> HOME=/workspace: This sets the <em>HOME</em> environment variable to <em>/workspace</em>. This is a requirement to use the <em>OVHcloud</em> AI Products.</li>



<li><strong>RUN chown -R 42420:42420 /workspace</strong>: This changes the owner of the <em>/workspace</em> directory to the user and group with IDs of <em>42420</em> (<em>OVHcloud</em> user). This is also a requirement to use the <em>OVHcloud</em> AI Products.</li>
</ul>



<p>This <em>Dockerfile</em> does not contain a <strong>CMD</strong> instruction and therefore does not launch our <em>vLLM</em> server. Do not worry about that: we will do it directly from <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Deploy</a>&nbsp;to have more flexibility.</p>



<p>Once your Dockerfile is written, launch the following command to build your image:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker build . -t vllm_image:latest</code></pre>



<h3 class="wp-block-heading">Push the image into the shared registry</h3>



<p>Once you have built the Docker image, you will need to push it to a <strong>registry</strong> to make it accessible from <em>AI Deploy</em>. A <strong>registry</strong> is a service that allows you to store and distribute <em>Docker</em> images, making it easy to deploy them in different environments.</p>



<p>Several registries can be used (<em><a href="https://www.ovhcloud.com/en-gb/public-cloud/managed-private-registry/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Managed Private Registry</a>, <a href="https://hub.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker Hub</a>, <a href="https://github.com/features/packages" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub packages</a>, &#8230;</em>). In this tutorial, we will use the <strong><em>OVHcloud</em> <em>shared registry</em></strong>. More information is available in the <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-manage-registries?id=kb_article_view&amp;sysparm_article=KB0057949" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Registries documentation</a>.</p>



<p>To find the address of your shared registry, use the following command (<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-cli-install-client?id=kb_article_view&amp;sysparm_article=KB0047844" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>ovhai CLI</em></a> needs to be installed on your computer):</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">ovhai registry list</code></pre>



<p>Then, log in on your <em>shared registry</em> with your usual <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-users?id=kb_article_view&amp;sysparm_article=KB0048170" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Platform user</em></a> credentials:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker login -u &lt;user&gt; -p &lt;password&gt; &lt;shared-registry-address&gt;</code></pre>



<p>Once you are logged in to the registry, tag the built image and push it to your shared registry:</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">docker tag vllm_image:latest &lt;shared-registry-address&gt;/vllm_image:latest
docker push &lt;shared-registry-address&gt;/vllm_image:latest</code></pre>



<h3 class="wp-block-heading">vLLM inference server deployment</h3>



<p>Once your image has been pushed, it can be used with <em>AI Deploy</em>, using either the <em>ovhai CLI</em> or the <em>OVHcloud Control Panel (UI)</em>.</p>



<h5 class="wp-block-heading">Creating an access token </h5>



<p>Tokens are used as unique authenticators to securely access the <em>AI Deploy</em> apps. By creating a token, you can ensure that only authorized requests are allowed to interact with the <em>vLLM</em> endpoint. You can create this token by using the <em>OVHcloud Control Panel (UI)</em> or by running the following command:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai token create vllm --role operator --label-selector name=vllm</code></pre>



<p>This will give you a token that you will need to keep safe.</p>



<h5 class="wp-block-heading">Creating a Hugging Face token (optional)</h5>



<p>Note that some models, such as <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 3</a>, require you to accept their license. In that case, you need to create a <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face account</a>, accept the model&#8217;s license, and generate a <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">token</a> from your account settings; this token will allow you to access the model.</p>



<p>For example, when visiting the Hugging Face <a href="https://huggingface.co/google/gemma-2b" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gemma model page</a>, you&#8217;ll see this (if you are logged in):</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="716" height="312" src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png" alt="accept_model_conditions_hugging_face" class="wp-image-26768" srcset="https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21.png 716w, https://blog.ovhcloud.com/wp-content/uploads/2024/05/Screenshot-2024-05-22-at-14.15.21-300x131.png 300w" sizes="auto, (max-width: 716px) 100vw, 716px" /></figure>



<p>If you want to use this model, you will have to acknowledge the license, and then make sure to create a token in the <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">tokens section</a>.</p>



<p>In the next step, we will set this token as an environment variable (named <code>HF_TOKEN</code>). Doing this will enable us to use any LLM whose conditions of use we have accepted.</p>



<h5 class="wp-block-heading">Run the AI Deploy application</h5>



<p>Run the following command to deploy your <em>vLLM</em> server by running your customized <em>Docker</em> image:</p>



<pre class="wp-block-code"><code lang="" class="">ovhai app run &lt;shared-registry-address&gt;/vllm_image:latest \
  --name vllm_app \
  --flavor h100-1-gpu \
  --gpu 1 \
  --env HF_TOKEN="&lt;YOUR_HUGGING_FACE_TOKEN&gt;" \
  --label name=vllm \
  --default-http-port 8080 \
  -- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt; --dtype half</code></pre>



<p><em>You just need to change the registry address to the one you used, and the name of the LLM you want to deploy. Also check the image name, its tag, and the label selector if you have not used the same ones as in this tutorial.</em></p>



<p><strong>Parameters explanation</strong></p>



<ul class="wp-block-list">
<li><code>&lt;shared-registry-address&gt;/vllm_image:latest</code> is the image on which the app is based.</li>



<li><code>--name vllm_app</code> is an optional argument that allows you to give your app a custom name, making it easier to manage all your apps.</li>



<li><code>--flavor h100-1-gpu</code> indicates that we want to run our app on H100 GPU(s). You can access the full list of available GPUs by running <code>ovhai capabilities flavor list</code></li>



<li><code>--gpu 1</code> indicates that we request 1 GPU for that app.</li>



<li><code>--env HF_TOKEN</code> is an optional argument that allows us to set our Hugging Face token as an environment variable. This gives us access to models for which we have accepted the conditions.</li>



<li><code>--label name=vllm</code> restricts access to our LLM to requests carrying the token associated with the label selector <code>name=vllm</code>.</li>



<li><code>--default-http-port 8080</code> indicates that the port exposed on the app URL is <code>8080</code>.</li>



<li><code>-- python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8080 --model &lt;model&gt;</code> starts the vLLM API server. The specified &lt;model&gt; will be downloaded from Hugging Face. Here is a list of those that are <a href="https://docs.vllm.ai/en/latest/models/supported_models.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">supported by vLLM</a>. <a href="https://docs.vllm.ai/en/latest/models/engine_args.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Many arguments</a> can be used to optimize your inference.</li>
</ul>



<p>When this <code>ovhai app run</code> command is executed, several pieces of information will appear in your terminal. Get the ID of your application, and open the Info URL in a new tab. Wait a few minutes for your application to launch. When it is <strong>RUNNING</strong>, you can stream its logs by executing:</p>



<pre class="wp-block-code"><code class="">ovhai app logs -f &lt;APP_ID&gt;</code></pre>



<p>This will allow you to track the server launch, the model download, and any errors you may encounter if you have used a model whose license you have not accepted.</p>



<p>If all goes well, you should see the following output, which means that your server is up and running:</p>



<pre class="wp-block-code"><code class="">Started server process [11]
Waiting for application startup.
Application startup complete.
Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)</code></pre>



<h3 class="wp-block-heading">Interacting with your LLM</h3>



<p>Once the server is up and running, we can interact with our LLM by hitting the <code>/generate</code> endpoint.</p>



<p><strong>Using cURL</strong></p>



<p><em>Make sure you change the ID to that of your application so that you target the right endpoint. In order for the request to be accepted, also specify the token that you generated previously by executing</em> <code>ovhai token create</code>. Feel free to adapt the parameters of the request (<em>prompt</em>, <em>max_tokens</em>, <em>temperature</em>, &#8230;)</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">curl --request POST \
  --url https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net/generate \
  --header 'Authorization: Bearer &lt;AI_TOKEN_generated_with_CLI&gt;' \
  --header 'Content-Type: application/json' \
  --data '{
        "prompt": "&lt;YOUR_PROMPT&gt;",
        "max_tokens": 50,
        "n": 1,
        "stream": false
}'</code></pre>



<p><strong>Using Python</strong></p>



<p><em>Here too, you need to add your personal token and the correct link for your application.</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import requests
import json

# change for your host
APP_URL = "https://&lt;APP_ID&gt;.app.gra.ai.cloud.ovh.net"
TOKEN = "AI_TOKEN_generated_with_CLI"

url = f"{APP_URL}/generate"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {TOKEN}"
}
data = {
    "prompt": "What is an LLM in AI?",
    "max_tokens": 100,
    "temperature": 0
}

response = requests.post(url, headers=headers, data=json.dumps(data))

print(response.json()["text"][0])</code></pre>
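

<p><em>If you call the endpoint from several places, it can be convenient to factor the request building into a small helper. The following sketch builds the same call as above; the app URL and token shown are hypothetical placeholders that you must replace with your own values.</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import json

def build_generate_request(app_url, token, prompt, max_tokens=100, temperature=0.0):
    """Assemble the URL, headers and JSON body for the vLLM /generate endpoint."""
    url = app_url.rstrip("/") + "/generate"
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + token,
    }
    body = json.dumps({
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    })
    return url, headers, body

# Placeholder values: use your own app URL and AI token here
url, headers, body = build_generate_request(
    "https://my-app-id.app.gra.ai.cloud.ovh.net",
    "my-ai-token",
    "What is an LLM in AI?",
)

# With the requests library installed (as above), send it with:
# response = requests.post(url, headers=headers, data=body)
# print(response.json()["text"][0])</code></pre>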



<h3 class="wp-block-heading" id="AI-ENDPOINTS">OVHcloud AI Endpoints</h3>



<p>If you are not interested in building your own image and deploying your own LLM inference server, you can use OVHcloud&#8217;s new <em><strong><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AI Endpoints</a></strong> </em>product, which will definitely make your life easier!</p>



<p><a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is a serverless solution that provides AI APIs, enabling you to easily use pre-trained and optimized AI models in your applications. </p>



<figure class="wp-block-video"><video height="1400" style="aspect-ratio: 2560 / 1400;" width="2560" controls src="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4"></video></figure>



<p class="has-text-align-center"><em>Overview of AI Endpoints</em></p>



<p>You can use LLMs as a Service, choosing the desired model (such as <em>LLaMA</em>, <em>Mistral</em>, or <em>Mixtral</em>) and making an API call to use it in your application. This allows you to interact with these models without even having to deploy them!</p>



<p>In addition to LLM capabilities, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> also offers a range of other AI models, including speech-to-text, translation, summarization, embeddings and computer vision. </p>



<p>Best of all, <a href="https://endpoints.ai.cloud.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>AI Endpoints</em></a> is currently in alpha phase and is <strong>free to use</strong>, making it an accessible and affordable solution for developers seeking to explore the possibilities of AI. Check <a href="https://blog.ovhcloud.com/enhance-your-applications-with-ai-endpoints/" data-wpel-link="internal">this article</a> and try it out today to discover the power of AI!</p>



<p>Join our <a href="https://discord.gg/ovhcloud" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Discord server</a> to interact with the community and send us your feedback (#<em>ai-endpoints</em> channel)!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-serve-llms-with-vllm-and-ovhcloud-ai-deploy%2F&amp;action_name=How%20to%20serve%20LLMs%20with%20vLLM%20and%20OVHcloud%20AI%20Deploy&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2024/05/demo-ai-endpoints.mp4" length="14424826" type="video/mp4" />

			</item>
		<item>
		<title>Understanding Image Generation: A Beginner&#8217;s Guide to Generative Adversarial Networks</title>
		<link>https://blog.ovhcloud.com/understanding-image-generation-beginner-guide-generative-adversarial-networks-gan/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Tue, 05 Sep 2023 09:21:57 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[Streamlit]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=25664</guid>

					<description><![CDATA[How do you train a generative adversarial network (GAN) to generate images?
How do you train a DCGAN?
How do GANs and DCGANs work?<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Funderstanding-image-generation-beginner-guide-generative-adversarial-networks-gan%2F&amp;action_name=Understanding%20Image%20Generation%3A%20A%20Beginner%26%238217%3Bs%20Guide%20to%20Generative%20Adversarial%20Networks&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>All the code related to this article is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/computer-vision/image-generation/miniconda/dcgan-image-generation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>. You can reproduce all the experiments with</em> <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebooks</a>.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/dcgan_evolution.gif" alt="" class="wp-image-25680" style="width:549px;height:549px" width="549" height="549"/><figcaption class="wp-element-caption"><em>Fake samples generated by the model during training</em></figcaption></figure>



<p>Have you ever been amazed by what generative artificial intelligence could do, and wondered how it can generate realistic images 🤯🎨?</p>



<p>In this tutorial, we will embark on an exciting journey into the world of <strong>Generative Adversarial Networks (GANs)</strong>, a revolutionary concept in generative AI. No prior experience is necessary to follow along. We will walk you through every step, starting with the basic concepts and gradually building up to the implementation of <strong>Deep Convolutional GANs (DCGANs)</strong>.</p>



<p><em><strong>By the end of this tutorial, you will be able to generate your own images!</strong></em></p>



<h3 class="wp-block-heading">Introduction</h3>



<p>GANs were introduced by Ian Goodfellow et al. in 2014 in the paper <a href="https://arxiv.org/abs/1406.2661" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Generative Adversarial Nets</em></a>. GANs have become very popular in recent years, allowing us, for example, to:</p>



<ul class="wp-block-list">
<li>Generate high-resolution images (avatars, objects and scenes)</li>



<li>Augment our data (generating synthetic (fake) data samples for limited datasets)</li>



<li>Enhance the resolution of low-resolution images (upscaling images)</li>



<li>Transfer the style of one image to another (black and white to colour)</li>



<li>Predict facial appearances at different ages (Face Aging)</li>
</ul>



<h4 class="wp-block-heading">What is a GAN and how does it work?</h4>



<p>A GAN is composed of two main components: a <strong>generator <em>G</em></strong> and a <strong>discriminator <em>D</em></strong>.</p>



<p>Each component is a neural network, but their roles are different:</p>



<ul class="wp-block-list">
<li>The purpose of the generator <em>G</em> is to <strong>reproduce the data distribution of the training data <em>𝑥</em></strong>, to <strong>generate synthetic samples</strong> for the same data distribution. These data are often images, but can also be audio or text.</li>
</ul>



<ul class="wp-block-list">
<li>On the other hand, the discriminator <em>D</em> is a kind of judge who will <strong>estimate whether a sample <strong><em>𝑥</em></strong> is real or fake</strong> (i.e. has been generated). It is in fact a <strong>classifier</strong> that will say whether a sample comes from the real data distribution or from the generator.</li>
</ul>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/GAN_Illustration-1024x311.png" alt="" class="wp-image-25667" style="width:1201px;height:365px" width="1201" height="365" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/GAN_Illustration-1024x311.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/GAN_Illustration-300x91.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/GAN_Illustration-768x233.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/GAN_Illustration-1536x467.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/GAN_Illustration.png 1882w" sizes="auto, (max-width: 1201px) 100vw, 1201px" /></figure>



<p class="has-text-align-center"><em>Illustration of GAN training</em></p>



<p>During training, the generator starts with a <strong>vector of random noise</strong> (z) as input and produces synthetic samples G(z).</p>



<p>As training progresses, it refines its output, making the generated data G(z) more and more similar to the real data. The goal of the generator is to <strong>outsmart</strong> the discriminator into classifying its generated samples as real.</p>



<p>Meanwhile, the discriminator is presented with both real samples from the training data and fake samples from the generator. As it learns to discriminate between the two, it <strong>provides feedback</strong> to the generator about the quality of its generated samples. This is why the term <em>&#8220;<strong>adversarial</strong>&#8220;</em> is used here.</p>
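

<p><em>To make this adversarial loop concrete, here is a deliberately tiny, self-contained sketch: a one-dimensional &#8220;GAN&#8221; written in plain Python with hand-derived gradients. It is only an illustration of the alternating updates described above, not a practical implementation (real GANs use neural networks and a framework such as PyTorch).</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

# Toy setup: the "real data" is the constant 3.0.
# Generator: g(z) = theta, a single learnable scalar (the noise z is ignored).
# Discriminator: D(x) = sigmoid(a * x + b), a tiny one-feature classifier.
real, theta, a, b, lr = 3.0, 0.0, 0.0, 0.0, 0.05

for _ in range(5000):
    d_real, d_fake = sigmoid(a * real + b), sigmoid(a * theta + b)

    # Discriminator step: gradient ascent on log D(x) + log(1 - D(g)),
    # pushing D(real) towards 1 and D(fake) towards 0.
    a += lr * ((1.0 - d_real) * real - d_fake * theta)
    b += lr * ((1.0 - d_real) - d_fake)

    # Generator step: gradient ascent on log D(g) (non-saturating loss),
    # moving theta in the direction that fools the discriminator.
    d_fake = sigmoid(a * theta + b)
    theta += lr * (1.0 - d_fake) * a

# After training, theta has drifted from 0.0 towards the real value 3.0.</code></pre>



<p><em>The generator never sees the real data directly: the only signal it receives is the discriminator&#8217;s gradient, which is exactly the feedback mechanism described above.</em></p>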



<h4 class="wp-block-heading">Mathematical approach</h4>



<p>In fact, GANs come from game theory: <em>D</em> and <em>G</em> play a two-player <em>minimax</em> game with the following value function:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/eq-1024x59.png" alt="value-objective-function" class="wp-image-25668" style="width:802px;height:46px" width="802" height="46" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/eq-1024x59.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/eq-300x17.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/eq-768x45.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/eq.png 1447w" sizes="auto, (max-width: 802px) 100vw, 802px" /></figure>



<p></p>



<p>As we can observe, the <strong>discriminator aims to maximize the V function</strong>. To do so, it must maximize both terms of the sum: <em>log(D(x))</em>, and therefore <em>D(x)</em>, the probability assigned to real data, in the first term; and <em>log(1 &#8211; D(G(z)))</em>, which means minimizing <em>D(G(z))</em>, the probability assigned to fake data, in the second term.</p>



<p>Simultaneously, the <strong>generator tries to minimize the function</strong>. It only comes into play in the second part of the function, where it tries to obtain the highest value of <em>D(G(z))</em> in order to fool the discriminator.</p>
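

<p><em>As a quick numerical illustration of this value function (with made-up discriminator outputs), we can evaluate V on a single real/fake pair:</em></p>



<pre class="wp-block-code"><code lang="python" class="language-python">import math

def value_fn(d_real, d_fake):
    """V(D, G) evaluated on one real sample x and one fake sample G(z):
    log D(x) + log(1 - D(G(z)))."""
    return math.log(d_real) + math.log(1.0 - d_fake)

# A confident, correct discriminator: D(x) near 1, D(G(z)) near 0.
# This is the situation the discriminator is pushing towards (V near 0).
v_strong = value_fn(0.99, 0.01)

# The generator has partly fooled the discriminator: D(G(z)) = 0.9,
# which drags V down, exactly what the generator wants.
v_fooled = value_fn(0.99, 0.9)

# The equilibrium case: the discriminator answers 0.5 everywhere,
# giving V = 2 * log(0.5), reached when G perfectly mimics the data.
v_equilibrium = value_fn(0.5, 0.5)</code></pre>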



<p>This constant confrontation between the generator and the discriminator creates an <strong>iterative learning process</strong>, where the generator gradually improves to produce increasingly realistic G(z) samples, and the discriminator becomes increasingly accurate in its distinction of the data presented to it.</p>



<p>In an <strong>ideal scenario</strong>, this iterative process would reach an <strong>equilibrium point</strong>, where the generator produces data that is indistinguishable from real data, and the discriminator&#8217;s performance is 50% (random guessing).</p>



<p>GANs may not always reach this equilibrium because the training process is sensitive to many factors (architecture, hyperparameters, dataset complexity). The generator and discriminator may reach a dead end, oscillating between solutions or facing <strong>mode collapse</strong>, which results in limited sample diversity. It is also important that the discriminator does not start off too strong; otherwise, the generator will not get any information on how to improve itself, since it does not know what the real data looks like, as shown in the illustration above.</p>



<h4 class="wp-block-heading">DCGAN (Deep Convolutional GANs)</h4>



<p><strong>DCGAN</strong><em> </em>was introduced in 2016 by Alec Radford et al. in the paper<em> <a href="https://arxiv.org/abs/1511.06434" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks</a>.</em></p>



<p>Its new convolutional architecture has considerably improved the quality and stability of image synthesis compared to classical GANs. Here are the major changes:</p>



<ul class="wp-block-list">
<li>Replace all pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator), making the networks exceptionally well-suited to image generation tasks.</li>



<li>Use batchnorm in both the generator and the discriminator.</li>



<li>Remove fully connected hidden layers for deeper architectures.</li>



<li>Use ReLU activation in the generator for all layers except the output, which uses tanh.</li>



<li>Use LeakyReLU activation in the discriminator for all layers.</li>
</ul>



<p><em>The operating principles of these layers will not be explained in this tutorial.</em></p>



<h3 class="wp-block-heading">Use case &amp; Objective</h3>



<p>Now that we know the concept of image generation, let’s try to put it into practice!</p>



<p>In this tutorial, we will <strong>implement a</strong> <strong>DCGAN</strong> architecture and <strong>train it on a medical dataset</strong> to generate new images. This dataset is the <a href="https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Chest X-Ray Pneumonia</em></a>. All the code explained here will run on <strong>a single GPU</strong>, linked to <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebooks</a>, and is given in our <em><a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/computer-vision/image-generation/miniconda/dcgan-image-generation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.</em></p>



<h3 class="wp-block-heading">Step 1 &#8211; Explore the dataset and prepare it for training</h3>



<p><em>The </em><a href="https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><em>Chest X-Ray Pneumonia</em> dataset</a> contains <strong>5,863 X-Ray images</strong>. This may not be sufficient for training a robust DCGAN, but we are going to try! For comparison, the DCGAN research paper conducted its study on a dataset of over 60,000 images.</p>



<p>Additionally, it is important to consider that the dataset contains two classes (Pneumonia/Normal). We will not separate the classes, so as to keep as much data as possible, although doing so could improve our network&#8217;s performance. Furthermore, it is advisable to verify that the classes are well-balanced.</p>



<p>Only the training subset will be used here (5,221 images). Let&#8217;s take a look at our images:</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/training_images.png" alt="chest-x-ray-pneumonia-dataset-images" class="wp-image-25669" style="width:366px;height:366px" width="366" height="366" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/training_images.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/training_images-150x150.png 150w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/training_images-70x70.png 70w" sizes="auto, (max-width: 366px) 100vw, 366px" /><figcaption class="wp-element-caption"><em>Chest X-Ray Pneumonia dataset real samples</em></figcaption></figure>



<p>We notice that we have quite similar images. The <strong>backgrounds are identical</strong>, and the chests are <strong>often centered in the same way</strong>, which should help the network learn.</p>



<h4 class="wp-block-heading">Preprocessing</h4>



<p><strong>Data pre-processing</strong> is a crucial step when you want to facilitate and accelerate model convergence and obtain high-quality results. This pre-processing can be broken down into various generic operations that are commonly applied.</p>



<p>Each image in the dataset will be <strong>transformed</strong>. The images are then grouped into <strong>batches</strong> of 128. This avoids loading the whole dataset at once, which could use up a lot of memory, and makes the most of <strong>GPU parallelism</strong>.</p>
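<p><em>As an illustration, splitting the 5,221 training images into batches of 128 can be sketched in plain Python (in the actual notebook, a PyTorch DataLoader does this for us; the function below is a simplified stand-in):</em></p>

```python
def make_batches(n_items, batch_size):
    """Split n_items indices into consecutive batches of batch_size.
    The last batch may be smaller when n_items is not a multiple of batch_size."""
    indices = list(range(n_items))
    return [indices[i:i + batch_size] for i in range(0, n_items, batch_size)]

batches = make_batches(5221, 128)
print(len(batches))      # 41 batches per epoch
print(len(batches[-1]))  # 101 images in the last, partial batch
```

<p><em>These 41 batches per epoch are also what make the loss curves later in this article span 4,100 iterations over 100 epochs.</em></p>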



<p>The applied <strong>transformation</strong> will:</p>



<ul class="wp-block-list">
<li><strong>Resize</strong> <strong>images</strong> to (64x64xchannels), the dimensions expected by our DCGAN. This avoids keeping the original dimensions of the images, which all differ, and reduces the number of pixels, which speeds up model training (lower computation cost).</li>



<li><strong>Convert images to tensors</strong> (format expected by models).</li>



<li><strong>Standardize &amp; Normalize the image&#8217;s pixel values</strong>, which speeds up convergence and improves training stability.</li>
</ul>



<p><em>If original images are smaller than the desired size, transformation will pad the images to reach the specified size.</em></p>
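<p><em>To illustrate the normalization step: assuming the usual per-channel mean and standard deviation of 0.5 (a common choice in DCGAN tutorials; the exact values used are in the notebook), pixel values are mapped from [0, 1] to [-1, 1], which matches the range of the generator&#8217;s final Tanh():</em></p>

```python
def normalize_pixel(p, mean=0.5, std=0.5):
    """Map a pixel value from [0, 1] to [-1, 1], as Normalize((0.5,), (0.5,)) would."""
    return (p - mean) / std

print(normalize_pixel(0.0))  # -1.0 (black)
print(normalize_pixel(0.5))  #  0.0 (mid-grey)
print(normalize_pixel(1.0))  #  1.0 (white)
```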



<p><em>We won&#8217;t show you the code for these transformations here, but as mentioned earlier, you can find it in its entirety on our <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/computer-vision/image-generation/miniconda/dcgan-image-generation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>. You can reproduce all the experiments with <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebooks</a>.</em></p>



<h3 class="wp-block-heading">Step 2 &#8211; Define the models</h3>



<p>Now that the images are ready, we can define our DCGAN:</p>



<h4 class="wp-block-heading">Generator implementation</h4>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/09/Generator-Frame1-1.svg" alt="" class="wp-image-25742" style="width:1200px;height:319px" width="1200" height="319"/></figure>



<p>As shown in the image above, the generator architecture is designed to take a random noise vector z as input and transform it into a (3x64x64) image, which is the same size as the images in the training dataset.</p>



<p>To do this, it uses <strong>transposed convolutions</strong> (often, somewhat inaccurately, called deconvolutions) to progressively upsample the noise vector <em>z</em> until it reaches the desired output image size. These transposed convolutions help the generator capture complex patterns and generate realistic images during the training process.</p>
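<p><em>The upsampling can be checked with the standard transposed-convolution output-size formula, out = (in - 1) * stride - 2 * padding + kernel. Assuming the typical DCGAN settings (a first block projecting the noise to a 4x4 feature map, then kernel 4, stride 2, padding 1 blocks &#8211; a common choice, not necessarily the exact values of our notebook), each block doubles the resolution up to 64x64:</em></p>

```python
def tconv_out(size, kernel=4, stride=2, padding=1):
    """Output spatial size of a transposed convolution (no output_padding):
    out = (in - 1) * stride - 2 * padding + kernel."""
    return (size - 1) * stride - 2 * padding + kernel

# First block: project the 1x1 noise vector to a 4x4 feature map
size = tconv_out(1, kernel=4, stride=1, padding=0)
steps = [size]
# Four stride-2 blocks then double the resolution at every step
for _ in range(4):
    size = tconv_out(size)
    steps.append(size)
print(steps)  # [4, 8, 16, 32, 64]
```

<p><em>This is why four stride-2 blocks are enough to go from the 4x4 projection to the (3x64x64) output shown in the diagram.</em></p>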



<p>The final <em>Tanh()</em> activation function ensures that the pixel values of the generated images are in the range <em>[-1, 1]</em>, which also corresponds to our transformed training images (we had normalized them).</p>



<p><em>The code for implementing this generator is given in its entirety on our <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/computer-vision/image-generation/miniconda/dcgan-image-generation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.</em></p>



<h4 class="wp-block-heading">Discriminator implementation</h4>



<p>As a reminder, the discriminator acts as a sample classifier. Its aim is to distinguish the data generated by the generator from the real data in the training dataset.</p>



<figure class="wp-block-image size-full"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/09/Discriminator-Frame.svg" alt="DCGAN architecture discriminator" class="wp-image-25743"/></figure>






<p>As shown in the image above, the discriminator takes an input image of size (3x64x64) and <strong>outputs a probability</strong>, indicating if the input image is real (1) or fake (0).</p>



<p>To do this, it uses convolutional layers, batch normalization layers, and LeakyReLU functions, which are presented in the paper as architecture guidelines to follow. Each convolutional block is designed to capture features of the input images, moving from low-level features such as edges and textures for the first blocks, to more abstract and complex features such as shapes and objects for the last.</p>



<p>The probability is obtained with a sigmoid activation, which squashes the output to the range <em>[0, 1]</em>.</p>
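<p><em>The sigmoid mentioned above is simply 1 / (1 + exp(-x)); it squashes any real-valued logit into the open interval ]0, 1[ so it can be read as a probability:</em></p>

```python
import math

def sigmoid(x):
    """Squash a real-valued logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))    # 0.5: the discriminator is undecided
print(sigmoid(10.0))   # close to 1: confidently "real"
print(sigmoid(-10.0))  # close to 0: confidently "fake"
```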



<p><em>The code for implementing this discriminator is given in its entirety on our <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/computer-vision/image-generation/miniconda/dcgan-image-generation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.</em></p>



<h4 class="wp-block-heading">Define loss function and labels</h4>



<p>Now that we have our adversarial networks, we need to define the <strong>loss function</strong>. </p>



<p>The adversarial loss <em>V(D, G)</em> can be approximated using the<strong> <em>Binary Cross Entropy (BCE)</em></strong> loss function, which is commonly used for GANs because it measures the binary cross-entropy between the discriminator&#8217;s output (probability) and the ground truth labels during training (here we fix real=1 or fake=0). It will calculate the loss for both the generator and the discriminator during <strong>backpropagation</strong>.</p>



<p><em>BCE Loss</em> is computed with the following equation, where <em>target</em> is the ground truth label (1 or 0), and <em>ŷ</em> is the discriminator&#8217;s probability output:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/09/bce_eq-1024x102.png" alt="Binary Cross Entropy loss function" class="wp-image-25744" style="width:616px;height:61px" width="616" height="61" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/09/bce_eq-1024x102.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/09/bce_eq-300x30.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/09/bce_eq-768x76.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/09/bce_eq.png 1246w" sizes="auto, (max-width: 616px) 100vw, 616px" /></figure>



<p>If we compare this equation to our previous <em>V(D, G)</em> objective, we can see that BCE loss term for real data samples corresponds to the first term in <em>V(D, G)</em>, <em>log(D(x))</em>, and the BCE loss term for fake data samples corresponds to the second term in V(D, G), log(1 &#8211; D(G(z))).</p>
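<p><em>To make the link concrete, the BCE for a single sample reduces to -log(ŷ) when the target is 1, and to -log(1 - ŷ) when the target is 0 &#8211; exactly the two terms of V(D, G). A minimal sketch:</em></p>

```python
import math

def bce(y_hat, target):
    """Binary cross-entropy for one sample: -[t*log(y) + (1-t)*log(1-y)]."""
    return -(target * math.log(y_hat) + (1 - target) * math.log(1 - y_hat))

# Real sample (target=1): low loss when the discriminator outputs a high probability
print(round(bce(0.9, 1), 3))  # 0.105, i.e. -log(0.9)
# Fake sample (target=0): high loss when the discriminator is fooled (y_hat close to 1)
print(round(bce(0.9, 0), 3))  # 2.303, i.e. -log(0.1)
```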



<p>In this binary case, the BCE can be represented by two distinct curves, which describe how the loss varies as a function of the predictions ŷ of the model. The first shows the loss as a function of the calculated probability, for a synthetic sample (label y = 0). The second describes the loss for a real sample (label y = 1).</p>



<figure class="wp-block-image aligncenter size-full"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/09/BCE-LOSS1.svg" alt="" class="wp-image-25747"/><figcaption class="wp-element-caption"><em>Variations of BCE loss over the interval ]0;1[ for different targeted labels (y =0 and y = 1)</em></figcaption></figure>



<p>We can see that<strong> the further the prediction ŷ is from the actual label assigned (target), the greater the loss</strong>. On the other hand, a prediction that is close to the truth will generate a loss very close to zero, which will not impact the model since it appears to classify the samples successfully.</p>



<p>During training, <strong>the goal is to minimize the BCE loss</strong>. This way, the discriminator will learn to correctly classify real and generated samples, while the generator will learn to generate samples that can &#8220;fool&#8221; the discriminator into classifying them as real.</p>



<h4 class="wp-block-heading">Hyperparameters</h4>



<p>Hyperparameters were chosen according to the indications given in the <a href="https://arxiv.org/abs/1511.06434" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">DCGAN paper</a>.</p>



<h3 class="wp-block-heading">Step 3 &#8211; Train the model</h3>



<p>We are now ready to train our DCGAN!</p>



<p>To monitor the generator&#8217;s learning progress, we will create a <strong>constant noise vector</strong>, denoted as <code><strong>fixed_noise</strong></code>. </p>



<p>During the training loop, we will regularly feed this <code><strong>fixed_noise</strong></code> into the generator. Using the same constant vector makes it possible to generate comparable images each time, and to observe how the samples produced by the generator evolve over the training cycles.</p>



<pre class="wp-block-code"><code lang="python" class="language-python">fixed_noise = torch.randn(64, nz, 1, 1, device=device)</code></pre>



<p>Also, we will compute the <strong>BCE Loss</strong> of the Discriminator and the Generator separately, so that each network can improve over the training cycles. For each batch, these losses will be calculated and saved into lists, enabling us to plot them for each training iteration once training is complete.</p>



<h4 class="wp-block-heading">Training Process</h4>



<p>Thanks to our fixed noise vector, we were able to capture the evolution of the generated images, providing an overview of how the model learned to reproduce the distribution of training data over time.</p>



<p>Here are the samples generated by our model during training, when fed with a fixed noise, over 100 epochs. For visualization, a grid of 9 generated images was chosen:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="381" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/Results-Chests-1024x381.png" alt="generated-samples-epoch" class="wp-image-25678" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/Results-Chests-1024x381.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Results-Chests-300x112.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Results-Chests-768x286.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Results-Chests-1536x572.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Results-Chests-2048x762.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Evolution of the synthetic samples produced by the generator over time, from a constant random vector of noise z</em></figcaption></figure>



<p>At the start of the training process (epoch 1), the images generated show the characteristics of the random noise vector. </p>



<p>As the training progresses, the <strong>weights</strong> of the discriminator and generator <strong>are updated</strong>. Noticeable changes occur in the generated images. Epochs 5, 10 and 20 show quick and subtle evolution of the model, which begins to capture more distinct shapes and structures.</p>



<p>The following epochs show an improvement in edges and details. Generated samples become sharper and more identifiable, and by epoch 100 the images are quite realistic despite the limited data available (5,221 images).</p>



<p><em>Do not hesitate to play with the hyperparameters to try and vary your results! You can also check out the <a href="https://github.com/soumith/ganhacks" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GAN hacks repo</a>, which shares many tips dedicated to training GANs. Training time will vary according to your resources and the number of images.</em></p>



<h3 class="wp-block-heading">Step 4 &#8211; Results &amp; Inference</h3>



<p>Once the generator has been trained over 100 epochs, we are free to generate unlimited new images, based on a new random noise vector each time.</p>



<p>In order to retain only relevant samples, a data <strong>post-processing</strong> step was set up to assess the quality of the generated images. All generated images were sent to the trained discriminator, whose job is to score each generated sample; only those with a probability above a fixed threshold (0.8, for example) are kept.</p>
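<p><em>The post-processing step described above amounts to a simple filter: the trained discriminator scores each generated sample, and only those above the threshold are kept. In this sketch, the sample names and scores are hypothetical placeholders for the generated images and D&#8217;s outputs:</em></p>

```python
def filter_samples(samples, scores, threshold=0.8):
    """Keep only the generated samples whose discriminator probability exceeds threshold."""
    return [s for s, score in zip(samples, scores) if score > threshold]

samples = ["img_0", "img_1", "img_2", "img_3"]
scores = [0.95, 0.42, 0.81, 0.60]       # hypothetical discriminator outputs
print(filter_samples(samples, scores))  # ['img_0', 'img_2']
```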



<p>This way, we have obtained the following images, compared to the original ones. We can see that despite the small number of images in our dataset, the model was able to identify and learn the distribution of the real image data and reproduce it in a realistic way:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="498" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/results-1024x498.png" alt="real-images-vs-generated" class="wp-image-25679" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/results-1024x498.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/results-300x146.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/results-768x373.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/results-1536x747.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/results.png 1888w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="has-text-align-center"><em>Original dataset images (left), compared with images selected from generated samples (right)</em></p>



<h3 class="wp-block-heading">Step 5 &#8211; Evaluate the model</h3>



<p>A DCGAN model (and GANs in general) can be evaluated in several ways. A <a href="https://arxiv.org/abs/1802.03446" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">research paper</a> has been published on this subject.</p>



<h4 class="wp-block-heading">Quantitative measures</h4>



<p>On the <strong>quantitative</strong> side, the <strong>evolution of the BCE loss</strong> of the generator and the discriminator provides an indication of the quality of the model during training.</p>



<p>The evolution of these losses is illustrated in the figure below, where the discriminator losses are shown in orange and the generator losses in blue, over a total of 4,100 iterations. Each iteration corresponds to one batch; the dataset is split into 41 batches of 128 images, so each epoch counts 41 iterations. Since the model has been trained over 100 epochs, loss tracking is available over 4,100 iterations (41 x 100).</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/loss-1024x568.png" alt="generator-discriminator-loss" class="wp-image-25677" style="width:706px;height:392px" width="706" height="392" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/loss-1024x568.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/loss-300x167.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/loss-768x426.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/loss-1536x853.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/loss.png 1668w" sizes="auto, (max-width: 706px) 100vw, 706px" /></figure>



<p>At the start of training, both curves show high loss values, indicating an <strong>unstable start</strong> of the DCGAN. This results in very <strong>unrealistic images being generated</strong>, where the nature of the <strong>random noise is still too present </strong>(see epoch 1 on the previous image). The discriminator is therefore too powerful for the moment.</p>



<p>A few iterations later, the losses converge towards lower values, demonstrating the improvement in the model&#8217;s performance.</p>



<p>However, from epoch 10, a trend emerges. The discriminator loss begins to decrease very slightly, indicating an improvement in its ability to determine which samples are genuine and which are synthetic. On the other hand, the generator&#8217;s loss shows a slight increase, suggesting that it needs to improve in order to generate images capable of deceiving its adversary.</p>



<p>More generally, fluctuations are observed throughout training due to the competitive nature of the network, where the generator and discriminator are constantly adjusting relative to each other. These moments of fluctuation may reflect attempts to adjust the two networks. Unfortunately, they do not ultimately appear to lead to an overall reduction in network loss.</p>



<h4 class="wp-block-heading">Qualitative measures</h4>



<p>Losses are not the only performance indicator. They are often insufficient to assess the visual quality of the images generated.</p>



<p>This is confirmed by an analysis of the previous graphs, where we inevitably notice that the images generated at epoch 10 are not the most realistic, while the loss is approximately the same as that obtained at epoch 100.</p>



<p>One commonly used method is <strong>human visual</strong> assessment. However, this manual assessment has a number of limitations. It is subjective, does not fully reflect the capabilities of the models, cannot be reproduced and is <strong>expensive</strong>.</p>



<p>Research is therefore focusing on finding new, more reliable and less costly methods. This is particularly the case with <strong>CAPTCHAs</strong>, tests designed to check whether a user is a human or a robot before accessing content. These tests sometimes present pairs of generated and real images where the user has to indicate which of the two seems more authentic. This ultimately amounts to training a discriminator and a generator manually.</p>



<p class="has-text-align-center"><em>All the code related to this article is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/computer-vision/image-generation/miniconda/dcgan-image-generation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>. You can reproduce all the experiments with</em> <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebooks</a>.</p>



<h3 class="wp-block-heading">Conclusion</h3>



<p>I hope you have enjoyed this post!</p>



<p>You are now more comfortable with image generation and the concept of Generative Adversarial Networks! Now you know how to generate images from your own dataset, even if it&#8217;s not very large!</p>



<p>You can train your own network on your dataset and generate images of faces, objects and landscapes. Happy GANning! 🎨🚀</p>



<p>You can check our other computer vision articles to learn how to:</p>



<ul class="wp-block-list">
<li><a href="https://blog.ovhcloud.com/image-segmentation-train-a-u-net-model-to-segment-brain-tumors/" data-wpel-link="internal">Perform Brain tumor segmentation using U-Net</a></li>



<li><a href="https://blog.ovhcloud.com/object-detection-train-yolov5-on-a-custom-dataset/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">Object detection: train YOLOv5 on a custom dataset</a></li>
</ul>



<h3 class="wp-block-heading">Paper references</h3>



<ul class="wp-block-list">
<li><a href="https://arxiv.org/abs/1406.2661" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Generative Adversarial Nets, Ian Goodfellow, 2014</a></li>



<li><a href="https://arxiv.org/abs/1511.06434" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, Alec Radford et al., 2016</a></li>



<li><a href="https://arxiv.org/abs/1802.03446" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Pros and Cons of GAN Evaluation Measures, Ali Borji, 2018</a></li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Funderstanding-image-generation-beginner-guide-generative-adversarial-networks-gan%2F&amp;action_name=Understanding%20Image%20Generation%3A%20A%20Beginner%26%238217%3Bs%20Guide%20to%20Generative%20Adversarial%20Networks&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Create your solution for Sign Language recognition with OVHcloud AI tools</title>
		<link>https://blog.ovhcloud.com/create-your-solution-for-sign-language-recognition-with-ovhcloud-ai-tools/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Fri, 01 Sep 2023 09:27:49 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Notebooks]]></category>
		<category><![CDATA[AI Training]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=25709</guid>

					<description><![CDATA[A guide to build a solution for sign language interpretation based on a Computer Vision algorithm: YOLOv7. Introduction In the field of Artificial Intelligence, we often talk about Computer Vision and Object Detection, but what role do these AI techniques play in the vast field of healthcare? We&#8217;ll see that data plays a key role [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcreate-your-solution-for-sign-language-recognition-with-ovhcloud-ai-tools%2F&amp;action_name=Create%20your%20solution%20for%20Sign%20Language%20recognition%20with%20OVHcloud%20AI%20tools&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>A guide to build a solution for sign language interpretation based on a <strong>Computer Vision</strong> algorithm: <a href="https://github.com/WongKinYiu/yolov7" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">YOLOv7</a>.</em></p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="617" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.52-1024x617.png" alt="" class="wp-image-25717" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.52-1024x617.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.52-300x181.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.52-768x463.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.52-1536x925.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.52.png 1738w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Sign Language recognition with OVHcloud AI tools</em></figcaption></figure>



<h2 class="wp-block-heading">Introduction</h2>



<p>In the field of Artificial Intelligence, we often talk about <strong>Computer Vision</strong> and <strong>Object Detection</strong>, but what role do these AI techniques play in the vast field of healthcare? We&#8217;ll see that data plays a key role in AI applications for the medical-social sector. </p>



<p><strong>Have you ever wondered if AI could be the solution to understand sign language?</strong></p>



<p>Through this article, you will see that it is possible to use an AI model to detect signed letters. How? Thanks to the power of <strong>Computer Vision</strong> and <strong>Transfer Learning</strong>!</p>



<p><strong>The article is organized as follows:</strong></p>



<ul class="wp-block-list">
<li>Objectives</li>



<li>American Sign Language Dataset</li>



<li>Fine-Tune YOLOv7 model for Sign Language detection</li>



<li>Deploy custom YOLOv7 model for real time detection</li>
</ul>



<p><em>All the code for this blogpost is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/computer-vision/object-detection/miniconda/yolov7/notebook_object_detection_yolov7_asl.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>. You can <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-notebooks-yolov7-sign-language?id=kb_article_view&amp;sysparm_article=KB0057517" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Fine-Tune YOLOv7</a> to detect signs with <strong>AI Notebooks</strong> tool and <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-streamlit-yolov7-sign-language?id=kb_article_view&amp;sysparm_article=KB0057491" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">deploy the custom model</a> for real-time detection with <strong>AI Deploy</strong>.</em></p>



<h2 class="wp-block-heading">Objectives</h2>



<p>The purpose of this article is to show how it is possible to deploy a solution for <strong>Sign Language recognition</strong> thanks to AI. </p>



<p>An <strong>Object Detection</strong> algorithm will be used to detect the various signs and categorize them. Closely related to image classification, <strong>Object Detection</strong> additionally localizes each object, performing <strong>Image Classification</strong> on a more precise scale.</p>



<p>In this article, you will learn how to <strong><a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/computer-vision/object-detection/miniconda/yolov7/notebook_object_detection_yolov7_asl.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Fine-Tune YOLOv7</a></strong> model for <strong>Sign Language</strong> detection.</p>



<p>Once the model has been trained, what do you think of deploying a web app? <strong>Streamlit</strong> is the answer to your needs! At the end, AI will enable you to understand Sign Language, with <strong>real-time detection</strong> and written transcription.</p>



<h2 class="wp-block-heading">American Sign Language Dataset</h2>



<p>First of all, let&#8217;s talk data!</p>



<p><a href="https://public.roboflow.com/object-detection/american-sign-language-letters/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">American Sign Language Letters Dataset v1</a> is a public set of alphabet images and their labels created by <strong>David Lee</strong>.</p>



<p>This dataset is composed of <strong>1728 images</strong> and <strong>26 classes</strong> with the alphabet letters from A to Z.</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:100%">
<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.00-1.png" alt="" class="wp-image-25725" style="width:377px;height:390px" width="377" height="390" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.00-1.png 935w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.00-1-290x300.png 290w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.00-1-768x794.png 768w" sizes="auto, (max-width: 377px) 100vw, 377px" /><figcaption class="wp-element-caption"><em>ASL dataset</em></figcaption></figure>
</div>
</div>



<p>The dataset pairs each <strong>image</strong> with a <strong>label</strong> file in <strong>txt</strong> format, which gives the location of the object through the <em>x</em>, <em>y</em> coordinates of the bounding box centre, as well as the <em>height</em> and <em>width</em> of the box.</p>
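<p><em>To make this format concrete, here is a minimal sketch (with made-up label values) of how such a line can be converted back to pixel coordinates:</em></p>

```python
# Parse a YOLO-format label line: "class x_center y_center width height",
# where the coordinates are normalised to [0, 1] (the values below are made up).
def yolo_to_pixels(line, img_w, img_h):
    cls, xc, yc, w, h = line.split()
    xc, yc = float(xc) * img_w, float(yc) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    # convert centre/size to top-left corner + size
    return int(cls), (xc - w / 2, yc - h / 2, w, h)

cls_id, box = yolo_to_pixels("0 0.5 0.5 0.25 0.5", 416, 416)
print(cls_id, box)  # 0 (156.0, 104.0, 104.0, 208.0)
```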



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/mug_annotation-1024x635.png" alt="" class="wp-image-21645" style="width:1024px;height:635px" width="1024" height="635" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/mug_annotation-1024x635.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/mug_annotation-300x186.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/mug_annotation-768x477.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/mug_annotation-1536x953.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/mug_annotation.png 1713w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Label components of the ASL dataset for YOLOv7 usage</em></figcaption></figure>



<p>This data format is ideal for training a <strong>YOLO</strong> type Object Detection model.</p>



<h2 class="wp-block-heading">Fine-Tune YOLOv7 model for Sign Language recognition</h2>



<p>How can the YOLOv7 model be trained to recognize American Sign Language letters?</p>



<h6 class="wp-block-heading"><strong>Object Detection with YOLOv7 </strong></h6>



<p><strong>YOLOv7</strong> belongs to the &#8220;YOLO family&#8221; of algorithms, whose name stands for &#8220;<em>You Only Look Once</em>.&#8221; Unlike many detection algorithms, YOLO evaluates the position and class of identified objects with a <strong>single end-to-end network</strong>, which predicts classes using a fully connected layer.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-14.02.47-1024x991.png" alt="" class="wp-image-25722" style="width:533px;height:515px" width="533" height="515" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-14.02.47-1024x991.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-14.02.47-300x290.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-14.02.47-768x743.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-14.02.47.png 1059w" sizes="auto, (max-width: 533px) 100vw, 533px" /><figcaption class="wp-element-caption"><em>Object Detection</em></figcaption></figure>



<p>Therefore, YOLO models pass only once on each image to detect the objects. This Object Detection model is particularly known for its <strong>speed</strong> and <strong>accuracy</strong> and allows <strong>real-time recognition</strong>.</p>



<p>But how is this done in practice? Follow the next steps and let the magic work!</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="266" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.27-1024x266.png" alt="" class="wp-image-25719" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.27-1024x266.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.27-300x78.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.27-768x200.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.27-1536x400.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-11.23.27.png 1841w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Fine-Tuning of YOLOv7</em></figcaption></figure>



<p><em>The full notebook is available on the following <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/computer-vision/object-detection/miniconda/yolov7/notebook_object_detection_yolov7_asl.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>.</em></p>



<h4 class="wp-block-heading">Import dependencies</h4>



<p>Firstly, import the dependencies you need.</p>



<pre class="wp-block-code"><code class="">import torch
import os
import yaml
import torchvision
from IPython.display import Image, clear_output</code></pre>



<h4 class="wp-block-heading">Check GPU availability</h4>



<p>Then, check the GPU availability. Training a model like YOLOv7 requires a <strong>GPU</strong>; in this case, a Tesla V100S is used.</p>



<pre class="wp-block-code"><code class="">print('Setup complete. Using torch %s %s' % (torch.__version__, torch.cuda.get_device_properties(0) if torch.cuda.is_available() else 'CPU'))</code></pre>



<p><code>Setup complete. Using torch 1.12.1+cu102 _CudaDeviceProperties(name='Tesla V100S-PCIE-32GB', major=7, minor=0, total_memory=32510MB, multi_processor_count=80)</code></p>



<h4 class="wp-block-heading">Extract the dataset information</h4>



<p>Next, you can access the <code>data.yaml</code> file.</p>



<p>This file contains vital information about the dataset, especially the number of classes. Here we have 26 classes, corresponding to the letters A to Z.</p>
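<p><em>For reference, a <code>data.yaml</code> for this dataset looks roughly like the following (a sketch; the exact paths depend on how you exported the dataset):</em></p>

```yaml
train: ../train/images   # path to training images
val: ../valid/images     # path to validation images

nc: 26                   # number of classes
names: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
        'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
```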



<pre class="wp-block-code"><code class=""># go to the directory where the data.yaml file is located to extract the number of classes
%cd /workspace/data
with open("data.yaml", 'r') as stream:
    num_classes = str(yaml.safe_load(stream)['nc'])</code></pre>



<p>Now, it&#8217;s time to train YOLOv7 model!</p>



<h4 class="wp-block-heading">Recover YOLOv7 weights</h4>



<p>In this tutorial, you can use the&nbsp;<strong>Transfer Learning</strong>&nbsp;method by using YOLOv7 weights pre-trained on the&nbsp;<a href="https://cocodataset.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">COCO dataset</a>.</p>



<p><strong>How to define Transfer Learning?</strong></p>



<p>For both humans and machines, learning something new takes time and practice. However, it is easier to carry out tasks similar to those already learned. Like humans, AI can identify patterns from previous knowledge and apply them to new learning.</p>



<p>If a model is trained on a database, there is no need to re-train the model from scratch to fit a new set of similar data.</p>



<p><strong>Main advantages of Transfer Learning:</strong></p>



<ul class="wp-block-list">
<li>saving resources</li>



<li>improving efficiency</li>



<li>facilitating model training</li>



<li>saving time</li>
</ul>



<p>You can now download the pre-trained weights:</p>



<pre class="wp-block-code"><code class=""># YOLOv7 path
%cd /workspace/yolov7
!wget https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7_training.pt</code></pre>



<p><code>Saving to: ‘yolov7_training.pt’<br>yolov7_training.pt 100%[===================&gt;] 72.12M 12.0MB/s in 5.5s</code></p>



<h4 class="wp-block-heading">Run YOLOv7 training on ASL Letters Dataset</h4>



<p>You can therefore set the following parameters.</p>



<ul class="wp-block-list">
<li><em><strong>workers:</strong></em> maximum number of dataloader workers.</li>



<li><em><strong>device:</strong></em> cuda device.</li>



<li><strong><em>batch-size:</em></strong> refers to the batch size (number of training examples used in one iteration).</li>



<li><strong><em>data:</em></strong> refers to the path to the yaml file.</li>



<li><strong><em>img:</em></strong> refers to the input image size.</li>



<li><strong><em>cfg:</em></strong> defines the model configuration.</li>



<li><strong><em>weights:</em></strong> initial weights path.</li>



<li><strong><em>name:</em></strong> save to project/name.</li>



<li><strong><em>hyp:</em></strong> hyperparameters path.</li>



<li><strong><em>epochs:</em></strong> refers to the number of training epochs. An epoch corresponds to one cycle through the full training dataset.</li>
</ul>
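<p><em>As a quick sanity check of the training scale before launching the run (simple arithmetic, assuming for illustration that all 1728 images land in the training split):</em></p>

```python
import math

# Rough number of optimiser iterations for this run
# (assuming, for illustration, that all 1728 images are used for training).
images, batch_size, epochs = 1728, 8, 100
iters_per_epoch = math.ceil(images / batch_size)
total_iters = iters_per_epoch * epochs
print(iters_per_epoch, total_iters)  # 216 21600
```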



<pre class="wp-block-code"><code class=""># time the performance
%time

# train yolov7 on custom data for 100 epochs
!python /workspace/yolov7/train.py \
          --workers 8 \
          --device 0 \
          --batch-size 8 \
          --data '/workspace/data/data.yaml' \
          --img 416 416 \
          --cfg '/workspace/yolov7/cfg/training/yolov7.yaml' \
          --weights '/workspace/yolov7/yolov7_training.pt' \
          --name yolov7-asl \
          --hyp '/workspace/yolov7/data/hyp.scratch.custom.yaml' \
          --epochs 100</code></pre>



<h4 class="wp-block-heading">Display results of YOLOv7 training on ASL Letters dataset</h4>



<p>Then you can display the results of the training and check the evolution of the metrics.</p>



<pre class="wp-block-code"><code class=""># display images
Image(filename='/workspace/yolov7/runs/train/yolov7-asl/results.png', width=1000)  # view results</code></pre>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="512" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/image-1024x512.png" alt="" class="wp-image-25713" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/image-1024x512.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/image-300x150.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/image-768x384.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/image-1536x768.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/image-2048x1024.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>YOLOv7 training overview</em></figcaption></figure>



<h4 class="wp-block-heading">Export new weights for future inference</h4>



<p>Finally, you can extract the <strong>new weights</strong> coming from YOLOv7 training on ASL Alphabet dataset. The goal is to save the model weights in a bucket in the cloud for reuse in a dedicated application.</p>



<p>Firstly, rename the PyTorch model with the name you want.</p>



<pre class="wp-block-code"><code class="">%cd /workspace/yolov7/runs/train/yolov7-asl/weights/
os.rename("best.pt","yolov7.pt")</code></pre>



<p><code>/workspace/yolov7/runs/train/yolov7-asl/weights</code></p>



<p>Secondly, copy it to a new folder where you can keep all the weights generated during your training runs.</p>



<pre class="wp-block-code"><code class="">%cp /workspace/yolov7/runs/train/yolov7-asl/weights/yolov7.pt /workspace/asl-volov7-model/yolov7.pt</code></pre>



<p><strong>Is your model ready?</strong> It&#8217;s now time to deploy a web app to use it and benefit from real-time detection 🎉 !</p>



<h2 class="wp-block-heading">Deploy custom YOLOv7 model for real time detection</h2>



<p>Once this <strong>YOLOv7 model</strong> is trained, it can be used for inference. If you want to quickly build an app to serve your AI model, the <strong>Streamlit</strong> framework may be right for you.</p>



<h6 class="wp-block-heading"><strong>What is Streamlit?</strong></h6>



<p>Now, it&#8217;s time to discuss the framework used to create the web app: <strong>Streamlit</strong>!</p>



<p><a href="https://streamlit.io/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Streamlit</a>&nbsp;allows you to transform data scripts into quickly shareable web applications using only the&nbsp;<strong>Python</strong>&nbsp;language. Moreover, this framework does not require front-end skills.</p>



<p>This is a time-saver for data scientists who want to deploy data-driven apps!</p>



<p>To make this app accessible, you need to containerize it using&nbsp;<strong>Docker</strong>.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-14.00.50-1024x960.png" alt="" class="wp-image-25723" style="width:601px;height:564px" width="601" height="564" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-14.00.50-1024x960.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-14.00.50-300x281.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-14.00.50-768x720.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/Capture-decran-2023-08-28-a-14.00.50.png 1098w" sizes="auto, (max-width: 601px) 100vw, 601px" /><figcaption class="wp-element-caption"><em>Streamlit web app</em></figcaption></figure>



<p>By creating an app, you will enable anyone to <strong>understand Sign Language</strong>, with Real-Time detection and written transcription.</p>



<p>Let&#8217;s go for the implementation!</p>



<h4 class="wp-block-heading">Create the interface with Streamlit</h4>



<p>First of all, we must build the <strong>web interface</strong> to take a photo, along with the various functions that analyze the signs present in this image.</p>



<ul class="wp-block-list">
<li><code>load_model</code>: this function is cached so that the model only has to be loaded once</li>
</ul>



<pre class="wp-block-code"><code class="">import streamlit as st
import torch

@st.cache
def load_model():

    custom_yolov7_model = torch.hub.load("WongKinYiu/yolov7", 'custom', '/workspace/asl-volov7-model/yolov7.pt')

    return custom_yolov7_model</code></pre>



<ul class="wp-block-list">
<li><code>get_prediction</code>: the model analyzes the image and returns the result of the prediction</li>
</ul>



<pre class="wp-block-code"><code class="">import io
from PIL import Image

def get_prediction(img_bytes, model):

    img = Image.open(io.BytesIO(img_bytes))
    results = model(img, size=640)

    return results</code></pre>



<ul class="wp-block-list">
<li><code>analyse_image</code>: the image is processed before and after the model analysis</li>
</ul>



<pre class="wp-block-code"><code class="">import cv2
import numpy as np
import streamlit as st
from PIL import Image

def analyse_image(image, model):

    if image is not None:

        img = Image.open(image)

        bytes_data = image.getvalue()
        img_bytes = np.asarray(bytearray(bytes_data), dtype=np.uint8)
        result = get_prediction(img_bytes, model)
        result.render()

        for img in result.imgs:
            RGB_img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            im_arr = cv2.imencode('.jpg', RGB_img)[1]
            st.image(im_arr.tobytes())

        result_list = list((result.pandas().xyxy[0])["name"])

    else:
        st.write("no asl letters were detected!")
        result_list = []

    return result_list</code></pre>



<ul class="wp-block-list">
<li><code>display_letters</code>: the letters are recovered and displayed to form the final word</li>
</ul>



<pre class="wp-block-code"><code class="">def display_letters(letters_list):

    word = ''.join(letters_list)
    path_file = "/workspace/word_file.txt"
    with open(path_file, "a") as f:
        f.write(word)

    return path_file</code></pre>
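<p><em>In practice, real-time detection tends to return the same letter for many consecutive frames. A refinement you might add (a hypothetical helper, not part of the app above) is collapsing consecutive duplicates before joining the letters:</em></p>

```python
def collapse_repeats(letters):
    # Keep a letter only when it differs from the previous detection,
    # so ["H", "H", "H", "I"] becomes "HI".
    out = []
    for letter in letters:
        if not out or out[-1] != letter:
            out.append(letter)
    return ''.join(out)

print(collapse_repeats(["H", "H", "H", "I"]))  # HI
```

<p><em>Note that genuine double letters would then need extra handling, for example a pause threshold between two signs.</em></p>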



<p><em>To access the full code of the app, refer to this <a href="https://github.com/ovh/ai-training-examples/tree/main/apps/streamlit/sign-language-recognition-yolov7-app" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub</a><a href="https://github.com/ovh/ai-training-examples/blob/main/apps/streamlit/sign-language-recognition-yolov7-app/main.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"> repository</a>.</em></p>



<h4 class="wp-block-heading">Containerize your app with Docker</h4>



<p>Once the app code has been created, it&#8217;s time to containerize it!</p>



<p>Containerization is based on building a Docker image, and several steps must be completed before this image is usable.</p>



<p><strong>What are the containerization steps 🐳 ?</strong></p>



<p><em>The following steps refer to this <a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-streamlit-yolov7-sign-language?id=kb_article_view&amp;sysparm_article=KB0057491" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a> where you can find detailed information.</em></p>



<ul class="wp-block-list">
<li>Write the <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/streamlit/sign-language-recognition-yolov7-app/requirements.txt" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">requirements.txt</a> file</li>



<li>Create the <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/streamlit/sign-language-recognition-yolov7-app/Dockerfile" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Dockerfile</a></li>



<li><a href="https://help.ovhcloud.com/csm/fr-public-cloud-ai-deploy-streamlit-yolov7-sign-language?id=kb_article_view&amp;sysparm_article=KB0057495#build-the-docker-image-from-the-dockerfile" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Build the Docker image</a></li>



<li><a href="https://help.ovhcloud.com/csm/fr-public-cloud-ai-deploy-streamlit-yolov7-sign-language?id=kb_article_view&amp;sysparm_article=KB0057495#push-the-image-into-the-shared-registry" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Tag and push the Docker image on a registry</a></li>
</ul>



<p>Has your Docker image been created successfully? You are ready to launch your app 🚀 !</p>



<h4 class="wp-block-heading">Deploy your app and make it accessible</h4>



<p>The following command starts a new AI Deploy app running your Streamlit web interface.</p>



<pre class="wp-block-code"><code class="">ovhai app run \
       --gpu 1 \
       --default-http-port 8501 \
       --volume asl-volov7-model@GRA/:/workspace/asl-volov7-model:RO \
       &lt;shared-registry-address&gt;/yolov7-streamlit-asl-recognition:latest</code></pre>



<p>In this command line, you can set up several parameters:</p>



<ul class="wp-block-list">
<li><code>resources</code>: choose between CPUs or GPUs</li>



<li><code>default HTTP port</code>: specify the Streamlit default port &#8211; 8501</li>



<li><code>data</code>: link the bucket containing your model</li>



<li><code>docker image</code>: add your docker image address</li>
</ul>



<p>When your app is up and running, you can access the following page:</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="648" height="1024" src="https://blog.ovhcloud.com/wp-content/uploads/2023/08/overview-streamlit-yolov7-asl-648x1024.png" alt="" class="wp-image-25720" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/08/overview-streamlit-yolov7-asl-648x1024.png 648w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/overview-streamlit-yolov7-asl-190x300.png 190w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/overview-streamlit-yolov7-asl-768x1214.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/overview-streamlit-yolov7-asl-972x1536.png 972w, https://blog.ovhcloud.com/wp-content/uploads/2023/08/overview-streamlit-yolov7-asl.png 988w" sizes="auto, (max-width: 648px) 100vw, 648px" /><figcaption class="wp-element-caption"><em>Resulting Streamlit app</em></figcaption></figure>



<h2 class="wp-block-heading">Conclusion</h2>



<p>Well done 🎉&nbsp;! You have learned how to create <strong>your own solution for Sign Language recognition</strong> with OVHcloud AI tools.</p>



<p>You have been able to <strong>Fine-Tune YOLOv7 model</strong> thanks to <em>AI Notebooks</em> and <strong>deploy a Real-Time recognition app</strong> with <em>AI Deploy</em>.</p>



<h4 class="wp-block-heading" id="want-to-find-out-more">Want to find out more?</h4>



<h6 class="wp-block-heading"><strong>Notebook</strong></h6>



<p>You want to access the notebook? Refer to the&nbsp;<a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/computer-vision/object-detection/miniconda/yolov7/notebook_object_detection_yolov7_asl.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>.</p>



<p>To launch this notebook with&nbsp;<strong>AI Notebook</strong>, please refer to&nbsp;our&nbsp;<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-notebooks-yolov7-sign-language?id=kb_article_view&amp;sysparm_article=KB0057517" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a>.</p>



<h6 class="wp-block-heading"><strong>App</strong></h6>



<p>Want to access the full code to create the Streamlit app? Refer to the&nbsp;<a href="https://github.com/ovh/ai-training-examples/tree/main/apps/streamlit/sign-language-recognition-yolov7-app" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.<br><br>To deploy this app with&nbsp;<strong>AI Deploy</strong>, please refer to&nbsp;our&nbsp;<a href="https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-deploy-streamlit-yolov7-sign-language?id=kb_article_view&amp;sysparm_article=KB0057491" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a>.</p>



<h2 class="wp-block-heading">References</h2>



<ul class="wp-block-list">
<li><a href="https://public.roboflow.com/object-detection/american-sign-language-letters" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ASL Alphabet Dataset V1</a></li>



<li><a href="https://github.com/WongKinYiu/yolov7" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">YOLOv7 GitHub repository</a></li>



<li><a href="https://blog.ovhcloud.com/object-detection-train-yolov5-on-a-custom-dataset/" data-wpel-link="internal">Object detection: train YOLOv5 on a custom dataset</a></li>



<li><a href="https://medium.com/@prishanga1/yolov7-training-on-custom-data-c6d8ec030e13" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">YoloV7 Training on Custom Data</a></li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcreate-your-solution-for-sign-language-recognition-with-ovhcloud-ai-tools%2F&amp;action_name=Create%20your%20solution%20for%20Sign%20Language%20recognition%20with%20OVHcloud%20AI%20tools&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Fine-Tuning LLaMA 2 Models using a single GPU, QLoRA and AI Notebooks</title>
		<link>https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Fri, 21 Jul 2023 15:04:00 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Notebooks]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[Fine-tuning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LLaMa 2]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[QLoRA]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=25613</guid>

					<description><![CDATA[In this tutorial, we will walk you through the process of fine-tuning LLaMA 2 models, providing step-by-step instructions. All the code related to this article is available in our dedicated GitHub repository. You can reproduce all the experiments with OVHcloud AI Notebooks. Introduction On July 18, 2023, Meta released LLaMA 2, the latest version of [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ffine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks%2F&amp;action_name=Fine-Tuning%20LLaMA%202%20Models%20using%20a%20single%20GPU%2C%20QLoRA%20and%20AI%20Notebooks&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>In this tutorial, we will walk you through the process of fine-tuning <a href="https://ai.meta.com/llama/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 2</a> models, providing step-by-step instructions.</em> </p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-1024x538.jpg" alt="Fine-Tuning LLaMA 2 Models with a single GPU and OVHcloud" class="wp-image-25629" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564.jpg 1199w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p class="has-text-align-center"><em>All the code related to this article is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/llm/miniconda/llama2-fine-tuning/llama_2_finetuning.ipynb" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>. You can reproduce all the experiments with</em> <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebooks</a>.</p>



<h3 class="wp-block-heading">Introduction</h3>



<p>On July 18, 2023, <a href="https://about.meta.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Meta</a> released <a href="https://ai.meta.com/llama/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 2</a>, the latest version of their <strong>Large Language Model </strong>(LLM).</p>



<p>Trained between January 2023 and July 2023 on 2 trillion tokens, these new models outperform other LLMs on many benchmarks, including reasoning, coding, proficiency, and knowledge tests. This release comes in different flavors, with parameter sizes of <strong>7B</strong>, <strong>13B</strong>, and a mind-blowing <strong>70B</strong>. The models are free for both commercial and research use in English.</p>



<p>To suit every text generation need and fine-tune these models, we will use <a href="https://arxiv.org/abs/2305.14314" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">QLoRA (Efficient Finetuning of Quantized LLMs)</a>, a highly efficient fine-tuning technique that involves quantizing a pretrained LLM to just 4 bits and adding small &#8220;Low-Rank Adapters&#8221;. This unique approach allows for fine-tuning LLMs <strong>using just a single GPU</strong>! This technique is supported by the <a href="https://huggingface.co/docs/peft/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PEFT library</a>.</p>
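<p><em>To see why these adapters are &#8220;small&#8221;, here is a back-of-envelope computation (illustrative numbers: rank-16 adapters on a 4096&#215;4096 projection matrix, 4096 being the hidden size of the 7B model):</em></p>

```python
# A LoRA adapter leaves the d_out x d_in weight matrix frozen and learns
# two low-rank factors, B (d_out x r) and A (r x d_in), instead.
d_in, d_out, r = 4096, 4096, 16
full_params = d_in * d_out
lora_params = r * (d_in + d_out)
print(full_params, lora_params, full_params // lora_params)  # 16777216 131072 128
```

<p><em>Per matrix, the adapter is roughly two orders of magnitude smaller than the frozen weights it adapts, which is what makes single-GPU fine-tuning feasible.</em></p>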



<p>To fine-tune our model, we will create an <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebook</a> with only 1 GPU.</p>



<h3 class="wp-block-heading">Mandatory requirements</h3>



<p>To successfully fine-tune LLaMA 2 models, you will need the following:</p>



<ul class="wp-block-list">
<li>Fill in Meta&#8217;s form to <a href="https://ai.meta.com/resources/models-and-libraries/llama-downloads/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">request access to the next version of Llama</a>. Indeed, the use of Llama 2 is governed by the Meta license, which you must accept in order to download the model weights and tokenizer.<br></li>



<li>Have a <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face</a> account (with the same email address you entered in Meta&#8217;s form).<br></li>



<li>Have a <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face token</a>.<br></li>



<li>Visit the page of one of the LLaMA 2 available models (version <a href="https://huggingface.co/meta-llama/Llama-2-7b-hf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">7B</a>, <a href="https://huggingface.co/meta-llama/Llama-2-13b-hf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">13B</a> or <a href="https://huggingface.co/meta-llama/Llama-2-70b-hf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">70B</a>), and accept Hugging Face&#8217;s license terms and acceptable use policy.<br></li>



<li>Log in to the Hugging Face model Hub from your notebook&#8217;s terminal by running the <code>huggingface-cli login</code> command, and enter your token. You will not need to add your token as git credential.<br></li>



<li>Powerful Computing Resources: Fine-tuning the Llama 2 model requires substantial computational power. Ensure you are running code on GPU(s) when using <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Notebooks</a> or <a href="https://www.ovhcloud.com/en/public-cloud/ai-training/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Training</a>.</li>
</ul>



<h3 class="wp-block-heading">Set up your Python environment</h3>



<p>Create the following <code>requirements.txt</code> file:</p>



<pre class="wp-block-code"><code lang="" class="">torch
accelerate @ git+https://github.com/huggingface/accelerate.git
bitsandbytes
datasets==2.13.1
transformers @ git+https://github.com/huggingface/transformers.git
peft @ git+https://github.com/huggingface/peft.git
trl @ git+https://github.com/lvwerra/trl.git
scipy</code></pre>



<p>Then install the libraries and import them:</p>



<pre class="wp-block-code"><code class="">pip install -r requirements.txt</code></pre>



<pre class="wp-block-code"><code lang="python" class="language-python">import argparse
import bitsandbytes as bnb
from datasets import load_dataset
from functools import partial
import os
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed, Trainer, TrainingArguments, BitsAndBytesConfig, \
    DataCollatorForLanguageModeling</code></pre>



<h3 class="wp-block-heading">Download LLaMA 2 model</h3>



<p>As mentioned before, LLaMA 2 models come in three sizes: 7B, 13B, and 70B parameters. Your choice will mainly be driven by your computational resources, since larger models require more memory, processing power, and training time.</p>



<p>To download the model you have been granted access to, <strong>make sure you are logged in to the Hugging Face model hub</strong>. As mentioned in the requirements step, you need to use the <code>huggingface-cli login</code> command.</p>



<p>The following function will help us to download the model and its tokenizer. It requires a bitsandbytes configuration that we will define later.</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def load_model(model_name, bnb_config):
    n_gpus = torch.cuda.device_count()
    max_memory = '40960MB'  # per-GPU memory cap, adjust to your hardware

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto", # efficiently dispatch the model on the available resources
        max_memory = {i: max_memory for i in range(n_gpus)},
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)

    # Needed for LLaMA tokenizer
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer</code></pre>



<h3 class="wp-block-heading">Download a Dataset</h3>



<p>There are many datasets that can help you fine-tune your model. You can even use your own dataset!</p>



<p>In this tutorial, we are going to download and use the <a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Databricks Dolly 15k dataset</a>, which contains <strong>15,000 prompt/response pairs</strong>. It was crafted by over 5,000 <a href="https://www.databricks.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Databricks</a> employees during March and April of 2023.</p>



<p>This dataset is designed specifically for fine-tuning large language models. Released under the <a href="https://creativecommons.org/licenses/by-sa/3.0/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">CC BY-SA 3.0 license</a>, it can be used, modified, and extended by any individual or company, even for commercial applications. So it&#8217;s a perfect fit for our use case!</p>



<p>However, like most datasets, this one has <strong>its limitations</strong>. Pay attention to the following points:</p>



<ul class="wp-block-list">
<li>It consists of content collected from the public internet, which means it may contain objectionable, incorrect or biased content, as well as typos, all of which could influence the behavior of models fine-tuned on this dataset.<br></li>



<li>Since the dataset was created by Databricks&#8217; own employees, it reflects their interests and semantic choices, which may not be representative of the global population at large.<br></li>



<li>We only have access to the <code>train</code> split of the dataset, which is its largest subset.</li>
</ul>



<pre class="wp-block-code"><code lang="python" class="language-python"># Load the databricks dataset from Hugging Face
from datasets import load_dataset

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")</code></pre>



<h3 class="wp-block-heading">Explore dataset</h3>



<p>Once the dataset is downloaded, we can take a look at it to understand what it contains:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">print(f'Number of prompts: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')

*** OUTPUT ***
Number of prompts: 15011
Column names are: ['instruction', 'context', 'response', 'category']</code></pre>



<p>As we can see, each sample is a dictionary that contains:</p>



<ul class="wp-block-list">
<li><strong>An instruction</strong>: What the user could enter, such as a question</li>



<li><strong>A context</strong>: Additional information that helps interpret the instruction</li>



<li><strong>A response</strong>: The answer to the instruction</li>



<li><strong>A category</strong>: Classifies the sample as open Q&amp;A, closed Q&amp;A, information extraction from Wikipedia, summarization of Wikipedia information, brainstorming, classification, or creative writing</li>
</ul>
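<p>For instance, here is what one record looks like. The snippet below uses a hand-written sample with the same structure (the field values are invented for illustration, not taken from the dataset); a real record is accessed the same way, e.g. <code>dataset[0]</code>:</p>

```python
# Illustrative entry mimicking the structure of a databricks-dolly-15k
# sample; the values are made up for the example
sample = {
    "instruction": "What is the capital of France?",
    "context": "",
    "response": "The capital of France is Paris.",
    "category": "open_qa",
}

# A real record exposes the same fields: dataset[0]["instruction"], etc.
for field in ("instruction", "context", "response", "category"):
    print(f"{field}: {sample[field]!r}")
```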



<h3 class="wp-block-heading">Pre-processing dataset</h3>



<p><strong>Instruction fine-tuning</strong> is a common technique used to fine-tune a base LLM for a specific downstream use-case.</p>



<p>It will help us to format our prompts as follows: </p>



<pre class="wp-block-code"><code class="">Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Sea or Mountain

### Response:
I believe Mountain are more attractive but Ocean has it's own beauty and this tropical weather definitely turn you on! SO 50% 50%

### End</code></pre>



<p>To format each prompt this way, delimiting its parts with <code>###</code> markers, we can use the following function:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def create_prompt_formats(sample):
    """
    Format various fields of the sample ('instruction', 'context', 'response')
    Then concatenate them using two newline characters 
    :param sample: Sample dictionary
    """

    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruction:"
    INPUT_KEY = "Input:"
    RESPONSE_KEY = "### Response:"
    END_KEY = "### End"
    
    blurb = f"{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}\n{sample['instruction']}"
    input_context = f"{INPUT_KEY}\n{sample['context']}" if sample["context"] else None
    response = f"{RESPONSE_KEY}\n{sample['response']}"
    end = f"{END_KEY}"
    
    parts = [part for part in [blurb, instruction, input_context, response, end] if part]

    formatted_prompt = "\n\n".join(parts)
    
    sample["text"] = formatted_prompt

    return sample</code></pre>



<p>Now, we will use our <strong>model tokenizer to process these prompts into tokenized ones</strong>. </p>



<p>The goal is to create input sequences of uniform length, which maximizes efficiency and minimizes computational overhead during fine-tuning, while not exceeding the model&#8217;s maximum token limit.</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max length: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )


# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed, dataset):
    """Format &amp; tokenize the dataset so it is ready for training
    :param tokenizer (AutoTokenizer): Model tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    :param seed (int): Seed used to shuffle the dataset
    :param dataset (Dataset): Instruction dataset to preprocess
    """
    
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)#, batched=True)
    
    # Apply preprocessing to each batch of the dataset and remove 'instruction', 'context', 'response', 'category' fields
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=["instruction", "context", "response", "text", "category"],
    )

    # Filter out samples that have input_ids exceeding max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) &lt; max_length)
    
    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset</code></pre>



<p>With these functions, our dataset will be ready for fine-tuning!</p>



<h3 class="wp-block-heading">Create a bitsandbytes configuration</h3>



<p>This will allow us to load our LLM in 4 bits. This way, we can divide the used memory by 4 and import the model on smaller devices. We choose to apply bfloat16 compute data type and nested quantization for memory-saving purposes.</p>
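<p>A quick back-of-the-envelope calculation shows where this saving comes from, counting the weights alone (activations, gradients and optimizer state excluded):</p>

```python
# Approximate weight memory for a 7B-parameter model
params = 7_000_000_000
fp16_gib = params * 2 / 1024**3    # float16: 2 bytes per weight
int4_gib = params * 0.5 / 1024**3  # 4-bit: half a byte per weight
print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
```

<p>This is why the 4-bit model can fit on a single mid-range GPU, whereas the half-precision version already needs a large one just to hold the weights.</p>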



<pre class="wp-block-code"><code lang="python" class="language-python">def create_bnb_config():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    return bnb_config</code></pre>



<p>To leverage the LoRA method, we need to wrap the model as a PeftModel.</p>



<p>To do this, we need to implement a <a href="https://huggingface.co/docs/peft/conceptual_guides/lora" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LoRA configuration</a>:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def create_peft_config(modules):
    """
    Create Parameter-Efficient Fine-Tuning config for your model
    :param modules: Names of the modules to apply LoRA to
    """
    config = LoraConfig(
        r=16,  # dimension of the updated matrices
        lora_alpha=64,  # parameter for scaling
        target_modules=modules,
        lora_dropout=0.1,  # dropout probability for layers
        bias="none",
        task_type="CAUSAL_LM",
    )

    return config</code></pre>



<p>The previous function needs the <strong>target modules</strong> on which the LoRA update matrices will be applied. The following function retrieves them for our model:</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)</code></pre>



<p>Once everything is set up and the base model is prepared, we can use the <em>print_trainable_parameters()</em> helper function to see how many trainable parameters are in the model. </p>



<pre class="wp-block-code"><code lang="python" class="language-python">def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        trainable_params //= 2
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
    )</code></pre>



<p>We expect the LoRA model to have far fewer trainable parameters than the original one, since only the small adapter matrices are trained while the base weights stay frozen.</p>
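<p>To see why, we can count the parameters LoRA adds to a single linear layer. Taking a hypothetical 4096&#215;4096 projection with the rank <code>r=16</code> set in <code>create_peft_config</code>:</p>

```python
# Parameters added by LoRA on one hypothetical 4096x4096 linear layer
d_in, d_out, r = 4096, 4096, 16
full = d_in * d_out        # frozen weights of the original projection
lora = r * (d_in + d_out)  # the two low-rank matrices A (r x d_in) and B (d_out x r)
print(f"frozen: {full:,} | trainable: {lora:,} | ratio: {lora / full:.4%}")
```

<p>Under one percent of this layer&#8217;s weights are trained, which is what makes fine-tuning tractable on a single GPU.</p>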



<h3 class="wp-block-heading">Train</h3>



<p>Now that everything is ready, we can pre-process our dataset and load our model using the set configurations: </p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Load model from HF with user's token and with bitsandbytes config

model_name = "meta-llama/Llama-2-7b-hf" 

bnb_config = create_bnb_config()

model, tokenizer = load_model(model_name, bnb_config)</code></pre>



<pre class="wp-block-code"><code lang="python" class="language-python">## Preprocess dataset

max_length = get_max_length(model)

seed = 42  # any fixed seed makes the shuffle reproducible

dataset = preprocess_dataset(tokenizer, max_length, seed, dataset)</code></pre>



<p>Then, we can run our fine-tuning process: </p>



<pre class="wp-block-code"><code lang="python" class="language-python">def train(model, tokenizer, dataset, output_dir):
    # Apply preprocessing to the model to prepare it by
    # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable()

    # 2 - Using the prepare_model_for_kbit_training method from PEFT
    model = prepare_model_for_kbit_training(model)

    # Get LoRA module names
    modules = find_all_linear_names(model)

    # Create PEFT config for these modules and wrap the model to PEFT
    peft_config = create_peft_config(modules)
    model = get_peft_model(model, peft_config)
    
    # Print information about the percentage of trainable parameters
    print_trainable_parameters(model)
    
    # Training parameters
    trainer = Trainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            max_steps=20,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=1,
            output_dir="outputs",
            optim="paged_adamw_8bit",
        ),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    )
    
    model.config.use_cache = False  # disable the KV cache during training; re-enable it for inference to speed up predictions
    
    ### SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
    # Verifying the datatypes before training
    
    dtypes = {}
    for _, p in model.named_parameters():
        dtype = p.dtype
        if dtype not in dtypes: dtypes[dtype] = 0
        dtypes[dtype] += p.numel()
    total = 0
    for k, v in dtypes.items(): total+= v
    for k, v in dtypes.items():
        print(k, v, v/total)
     
    do_train = True
    
    # Launch training
    print("Training...")
    
    if do_train:
        train_result = trainer.train()
        metrics = train_result.metrics
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()
        print(metrics)    
    
    ###
    
    # Saving model
    print("Saving last checkpoint of the model...")
    os.makedirs(output_dir, exist_ok=True)
    trainer.model.save_pretrained(output_dir)
    
    # Free memory for merging weights
    del model
    del trainer
    torch.cuda.empty_cache()
    
    
output_dir = "results/llama2/final_checkpoint"
train(model, tokenizer, dataset, output_dir)</code></pre>



<p><em>If you prefer to specify a number of epochs (full passes of the training dataset through the model) instead of a number of training steps (one forward and backward pass with a single batch of data), you can replace the <code>max_steps</code> argument with <code>num_train_epochs</code>.</em></p>
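<p>As a sketch, here are the training arguments from above with <code>max_steps</code> swapped for <code>num_train_epochs</code> (all other values kept only for illustration):</p>

```python
from transformers import TrainingArguments

# Epoch-based variant of the arguments used in train()
args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    num_train_epochs=3,  # replaces max_steps=20
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    output_dir="outputs",
    optim="paged_adamw_8bit",
)
```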



<p>To later load and use the model for inference, we have used the <code>trainer.model.save_pretrained(output_dir)</code> function, which saves the fine-tuned adapter weights and their configuration.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-1024x498.png" alt="" class="wp-image-25619" width="870" height="422" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-1024x498.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-300x146.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-768x374.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results.png 1320w" sizes="auto, (max-width: 870px) 100vw, 870px" /></figure>



<p class="has-text-align-center">Fine-tuning llama2 results on <a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">databricks-dolly-15k</a> dataset</p>



<p>Unfortunately, the latest weights are not necessarily the best ones. To solve this problem, you can add an <code>EarlyStoppingCallback</code>, from transformers, to your fine-tuning. This will enable you to regularly test your model on a validation set, if you have one, and keep only the best weights.</p>
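<p>A minimal sketch of such a setup, assuming you have split off a hypothetical <code>eval_dataset</code>: early stopping needs periodic evaluation, and <code>load_best_model_at_end</code> ensures the best checkpoint, not the last one, is restored.</p>

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

# Sketch only: model, dataset and eval_dataset (a hypothetical
# held-out split) are assumed to be defined as in the steps above
trainer = Trainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=eval_dataset,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=20,
        learning_rate=2e-4,
        fp16=True,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        evaluation_strategy="steps",  # evaluate regularly on eval_dataset
        eval_steps=5,
        save_strategy="steps",
        save_steps=5,
        load_best_model_at_end=True,  # restore the best checkpoint at the end
    ),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
```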



<h3 class="wp-block-heading">Merge weights</h3>



<p>Once we have our fine-tuned weights, we can build our fine-tuned model and save it to a new directory, with its associated tokenizer. By performing these steps, we can have a memory-efficient fine-tuned model and tokenizer ready for inference!</p>



<pre class="wp-block-code"><code lang="python" class="language-python">model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
model = model.merge_and_unload()

output_merged_dir = "results/llama2/final_merged_checkpoint"
os.makedirs(output_merged_dir, exist_ok=True)
model.save_pretrained(output_merged_dir, safe_serialization=True)

# save tokenizer for easy inference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_merged_dir)</code></pre>
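<p>The merged model then behaves like any Hugging Face causal language model. A minimal inference sketch, reusing the prompt template from the pre-processing step (the instruction text and generation parameters here are just examples):</p>

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged model and tokenizer saved in the previous step
output_merged_dir = "results/llama2/final_merged_checkpoint"
model = AutoModelForCausalLM.from_pretrained(
    output_merged_dir, device_map="auto", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(output_merged_dir)

# Build a prompt with the same layout used during fine-tuning
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nGive three tips for staying healthy.\n\n"
    "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

<p>You can also pass sampling parameters such as <code>temperature</code> or <code>top_p</code> to <code>generate</code> to vary the outputs.</p>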



<h3 class="wp-block-heading">Conclusion</h3>



<p>We hope you have enjoyed this article!</p>



<p>You are now able to fine-tune LLaMA 2 models on your own datasets!</p>



<p>In our next tutorial, you will discover how to <strong>Deploy your Fine-tuned LLM on <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Deploy</a> for inference</strong>!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ffine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks%2F&amp;action_name=Fine-Tuning%20LLaMA%202%20Models%20using%20a%20single%20GPU%2C%20QLoRA%20and%20AI%20Notebooks&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Image segmentation: Train a U-Net model to segment brain tumors</title>
		<link>https://blog.ovhcloud.com/image-segmentation-train-a-u-net-model-to-segment-brain-tumors/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Wed, 19 Apr 2023 12:03:29 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[Streamlit]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24637</guid>

					<description><![CDATA[brain tumor segmentation tutorial with BraTS2020 dataset and U-Net<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fimage-segmentation-train-a-u-net-model-to-segment-brain-tumors%2F&amp;action_name=Image%20segmentation%3A%20Train%20a%20U-Net%20model%20to%20segment%20brain%20tumors&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>A guide to discover image segmentation and train a convolutional neural network on medical images to segment brain tumors</em></p>



<p>All the code related to this article is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/computer-vision/image-segmentation/tensorflow/brain-tumor-segmentation-unet/notebook_image_segmentation_unet.ipynb" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>. You can reproduce all the experiments with <strong><a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Notebooks</a></strong>.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/segmentations_compare.gif" alt="Graphical comparison of the original brain tumor segmentation, ground truth, and prediction for the BraTS2020 dataset" class="wp-image-25024" width="1188" height="397"/><figcaption class="wp-element-caption"><em>Comparison of the original and predicted segmentation, with non-enhancing tumors in blue, edema in green and enhancing tumors in yellow.</em></figcaption></figure>



<p>Over the past few years, the field of <strong>computer vision</strong> has experienced significant growth. It encompasses a wide range of methods for acquiring, processing, analyzing and understanding digital images.</p>



<p>Among these methods, one is called <strong>image segmentation</strong>.</p>



<h3 class="wp-block-heading"><strong>What is Image Segmentation?</strong> 🤔</h3>



<p>Image segmentation is a technique used to <strong>separate an image into multiple segments or regions</strong>, each of which corresponds to a different object or part of the image.</p>



<p>The goal is to simplify the image and make it easier to analyze, so that a computer can better understand and interpret the content of the image, which can be really useful!</p>



<p><strong>Application fields</strong></p>



<p>Indeed, image segmentation has a lot of application fields such as <strong>object detection &amp; recognition, medical imaging, and self-driving systems</strong>. In all these cases, the understanding of the image content by the computer is essential.</p>



<p><strong>Example</strong></p>



<p>In an image of a street with cars, the segmentation algorithm would be able to divide the image into different regions, with one for the cars, one for the road, another for the sky, one for the trees and so on.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/02/Image_segmentation.png" alt="illustration of semantic image segmentation" class="wp-image-24755" width="470" height="354" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/02/Image_segmentation.png 512w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/Image_segmentation-300x226.png 300w" sizes="auto, (max-width: 470px) 100vw, 470px" /></figure>



<p class="has-text-align-center"><em>Semantic image segmentation from <a href="https://commons.wikimedia.org/wiki/File:Image_segmentation.png" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Wikipedia Creative Commons</a></em></p>



<h5 class="wp-block-heading"><strong>Different types of segmentation</strong></h5>



<p>There are two main types of image segmentation: <strong>semantic segmentation</strong> and <strong>instance segmentation</strong>.</p>



<ul class="wp-block-list">
<li><strong>Semantic segmentation</strong> is the task of assigning a class label to each pixel in an image. For example, in an image of a city, the task of semantic segmentation would be to label each pixel as belonging to a certain class, such as &#8220;building&#8221;, &#8220;road&#8221;, &#8220;sky&#8221;, &#8230;, as shown in the image above.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Instance segmentation</strong> not only assigns a class label to each pixel, but also differentiates instances of the same class within an image. In the previous example, the task would be to not only label each pixel as belonging to a certain class, such as &#8220;building&#8221;, &#8220;road&#8221;, &#8230;, but also to distinguish different instances of the same class, such as different buildings in the image. Each building will then be represented by a different color.</li>
</ul>



<h3 class="wp-block-heading"><strong>Use case &amp; Objective</strong></h3>



<p>Now that we know the concept of image segmentation, let&#8217;s try to put it into practice!</p>



<p>In this article, we will focus on <strong>medical imaging</strong>. Our goal will be to <strong>segment brain tumors</strong>. To do this, we will use the <strong><a href="https://www.kaggle.com/datasets/awsaf49/brats20-dataset-training-validation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">BraTS2020 Dataset</a></strong>.</p>



<h3 class="wp-block-heading">1 &#8211; <strong>BraTS2020 dataset exploration</strong></h3>



<p>This dataset <strong>contains magnetic resonance imaging (MRI) scans of brain tumors</strong>.</p>



<p>To be more specific, each patient in this dataset is represented through <strong>four different MRI scans / modalities, named T1, T1CE, T2 and FLAIR</strong>. These 4 images come with the ground truth segmentation of the tumoral and non-tumoral regions of the brain, which was realized manually by experts.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/02/braTS2020_dataset_overview-1024x212.png" alt="Display of 4 MRI images from the BraTS2020 dataset, and a tumor segmentation" class="wp-image-24644" width="1195" height="248" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/02/braTS2020_dataset_overview-1024x212.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/braTS2020_dataset_overview-300x62.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/braTS2020_dataset_overview-768x159.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/braTS2020_dataset_overview-1536x318.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/braTS2020_dataset_overview.png 1606w" sizes="auto, (max-width: 1195px) 100vw, 1195px" /><figcaption class="wp-element-caption"><em>Display of the 4 modalities of a patient and its segmentation</em></figcaption></figure>



<p><strong>Why 4 modalities?</strong></p>



<p>As you can see, the four modalities bring out <strong>different aspects</strong> of the same patient. To be more specific, here is what each one highlights:</p>



<ul class="wp-block-list">
<li><strong>T1:</strong> Shows the structure and composition of different types of tissue.</li>



<li><strong>T1CE:</strong> Similar to T1 images but with the injection of a contrast agent, which will enhance the visibility of abnormalities.</li>



<li><strong>T2:</strong> Shows the fluid content of different types of tissue.</li>



<li><strong>FLAIR:</strong> Used to suppress this fluid content, to better identify lesions and tumors that are not clearly visible on T1 or T2 images.</li>
</ul>



<p>For an expert, it can be useful to have these 4 modalities in order to analyze the tumor more precisely, and to confirm its presence or not.</p>



<p>But for our artificial approach, <strong>using only two modalities instead of four is interesting</strong> since it can reduce the amount of manipulated data and therefore the computational and memory requirements of the segmentation task, making it faster and more efficient. </p>



<p>That is why we will <strong>exclude T1</strong>, since we have its enhanced version, T1CE. <strong>We will also exclude the T2 modality</strong>: the fluids it highlights could degrade our predictions. These fluids are suppressed in the FLAIR version, which highlights the affected regions much better and will therefore be much more useful for our training.</p>



<p><strong>Images format</strong></p>



<p>It is important to understand that all these MRI scans are <strong><em>NIfTI</em> <em>files</em></strong> (<em>.nii format)</em>. A NIfTI image is a digital representation of a 3D object, such as a brain in our case. Indeed, our modalities and our annotations have a 3-dimensional (240, 240, 155) shape.</p>



<p>Each axis is composed of a series of two-dimensional images, known as <strong>slices</strong>, which all contain the same number of pixels and are stacked together to create a 3D representation. That is why we were able to display 2D images just above. Indeed, we displayed the <strong>100th slice</strong> along one axis for the 4 modalities and the segmentation.</p>



<p>Here is a quick presentation of these 3 planes:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/body_planes-1024x493.png" alt="illustration of planes of the body" class="wp-image-24957" width="982" height="473" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/body_planes-1024x493.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/body_planes-300x144.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/body_planes-768x370.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/body_planes.png 1188w" sizes="auto, (max-width: 982px) 100vw, 982px" /></figure>



<p class="has-text-align-center"><em>Planes of the body</em> <em>from <a href="https://commons.wikimedia.org/wiki/File:Planes_of_Body.jpg" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Wikipedia Creative Commons</a></em></p>



<p>&#8211; <strong>Sagittal Plane</strong>: Divides the body into left and right sections and is often referred to as a &#8220;front-back&#8221; plane.</p>



<p>&#8211; <strong>Coronal Plane</strong>: Divides the body into front and back sections and is often referred to as a &#8220;side-side&#8221; plane.</p>



<p>&#8211; <strong>Axial or Transverse Plane</strong>: Divides the body into top and bottom sections and is often referred to as a &#8220;head-toe&#8221; plane.</p>



<p>Each modality can then be displayed through its different planes. For example, we will display the 3 axes of the T1 modality:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_planes_slice_resized-1024x360.png" alt="MRI scan viewed in the 3 planes of the human body" class="wp-image-25017" width="1024" height="360" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_planes_slice_resized-1024x360.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_planes_slice_resized-300x106.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_planes_slice_resized-768x270.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_planes_slice_resized.png 1284w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>100th slice of the T1 modality of the first patient, in the 3 planes of the human body</em></figcaption></figure>
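<p>In code, extracting one slice per plane comes down to indexing a different axis of the 3D array. A minimal NumPy sketch (the axis-to-plane mapping assumed here depends in practice on the NIfTI file&#8217;s orientation; the real array can be obtained with nibabel&#8217;s <code>nib.load(path).get_fdata()</code>):</p>

```python
import numpy as np

# Stand-in volume with the BraTS2020 shape (240, 240, 155)
volume = np.zeros((240, 240, 155))

# One slice per plane: index a different axis each time
sagittal = volume[100, :, :]  # shape (240, 155)
coronal = volume[:, 100, :]   # shape (240, 155)
axial = volume[:, :, 100]     # shape (240, 240)
print(sagittal.shape, coronal.shape, axial.shape)
```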



<p><strong>Why choose to display the 100th slice?</strong></p>



<p>Now that we know why we have three dimensions, let&#8217;s try to understand why we chose to display a specific slice.</p>



<p>To do this, we will display all the slices of a modality:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_slices_of_a_plane-1024x667.png" alt="all the slices of a BraTS2020 MRI modality" class="wp-image-24959" width="1024" height="667" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_slices_of_a_plane-1024x667.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_slices_of_a_plane-300x195.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_slices_of_a_plane-768x500.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/all_slices_of_a_plane.png 1227w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Display of all slices of T1 of the first patient in the sagittal plane</em></figcaption></figure>



<p>As you can see, <strong>two black parts are present</strong> on each side of our montage. However, <strong>these black parts correspond to slices too</strong>, which means that a large proportion of the slices contain very little information. This is not surprising, since the MRI scanner moves through the brain gradually.</p>



<p>The same analysis holds for all the other modalities, all the planes, and also for the images segmented by the experts. Indeed, the experts could not segment slices that contain hardly any information.</p>



<p>This is why we can exclude these slices from our analysis, in order to reduce the number of manipulated images and speed up our training. Indeed, you can see that the <strong>(60:135) slice range is much more informative</strong>:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/selected_slices_of_a_plane-1024x667.png" alt="some slices of a BraTS2020 MRI modality" class="wp-image-24962" width="1024" height="667" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/selected_slices_of_a_plane-1024x667.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/selected_slices_of_a_plane-300x195.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/selected_slices_of_a_plane-768x500.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/selected_slices_of_a_plane.png 1227w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em>Display of slices 60 to 135 of T1 of the first patient in the sagittal plane</em></figcaption></figure>
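<p>The idea behind this slice selection can be sketched in a few lines of NumPy. A synthetic volume stands in for a real NIfTI file here (the real pipeline would load it with <code>nibabel</code>), and the 1% threshold is an illustrative choice:</p>

```python
import numpy as np

# Synthetic stand-in for a (240, 240, 155) modality volume: zeros except
# a "brain" occupying the central slices (a real volume would be loaded
# with nibabel: nib.load(path).get_fdata()).
volume = np.zeros((240, 240, 155))
volume[60:180, 60:180, 40:120] = 1.0

# Fraction of non-zero voxels in each slice along the last axis
nonzero_fraction = (volume > 0).mean(axis=(0, 1))

# Keep only slices carrying a minimum amount of information (threshold
# of 1% chosen for illustration)
informative = np.where(nonzero_fraction > 0.01)[0]
print(informative.min(), informative.max())  # 40 119 -> bounds of the useful range
```

On the real BraTS2020 volumes, this kind of check motivates keeping the (60:135) range.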



<p><strong>What about segmentations?</strong></p>



<p>Now, let&#8217;s focus on the segmentations provided by the experts. What information do they give us?</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/random_segmentation-edited.png" alt="segmentation classes from BraTS2020 dataset" class="wp-image-25027" width="555" height="416" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/random_segmentation-edited.png 555w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/random_segmentation-edited-300x225.png 300w" sizes="auto, (max-width: 555px) 100vw, 555px" /><figcaption class="wp-element-caption"><em><em>100th slice of the segmentation modality of the first patient</em></em></figcaption></figure>



<p>Regardless of the plane you are viewing, you will notice that some slices have multiple colors, which means that the experts have assigned multiple values / classes to the segmentation (one color represents one value).</p>



<p>Actually, we only have 4 possible pixel values in this dataset. <strong>These 4 values will form our 4 classes</strong>. Here is what they correspond to:</p>



<figure class="wp-block-table aligncenter"><table><tbody><tr><td class="has-text-align-center" data-align="center"><strong>Class value</strong></td><td class="has-text-align-center" data-align="center"><strong>Class color</strong></td><td class="has-text-align-center" data-align="center"><strong>Class meaning</strong></td></tr><tr><td class="has-text-align-center" data-align="center">0</td><td class="has-text-align-center" data-align="center">Purple</td><td class="has-text-align-center" data-align="center">Not tumor (healthy zone or image background)</td></tr><tr><td class="has-text-align-center" data-align="center">1</td><td class="has-text-align-center" data-align="center">Blue</td><td class="has-text-align-center" data-align="center">Necrotic and non-enhancing tumor</td></tr><tr><td class="has-text-align-center" data-align="center">2</td><td class="has-text-align-center" data-align="center">Green</td><td class="has-text-align-center" data-align="center">Peritumoral Edema</td></tr><tr><td class="has-text-align-center" data-align="center">4</td><td class="has-text-align-center" data-align="center">Yellow</td><td class="has-text-align-center" data-align="center">Enhancing Tumor</td></tr></tbody></table></figure>



<p class="has-text-align-center"><em>Explanation of the BraTS2020 dataset classes</em></p>



<p>As you can see, class 3 does not exist: the values jump directly from 2 to 4. We will therefore correct this &#8220;error&#8221; before sending the data to our model.</p>



<p>Our goal is to predict and segment each of these 4 classes for new patients to find out whether or not they have a brain tumor and which areas are affected.</p>



<p><strong>To summarize data exploration:</strong></p>



<ul class="wp-block-list">
<li>We have for each patient 4 different modalities (T1, T1CE, T2 &amp; FLAIR), accompanied by a segmentation that indicates tumor areas.</li>



<li>Modalities <strong>T1CE</strong> and <strong>FLAIR</strong> are the most interesting ones to keep, since these 2 provide complementary information about the anatomy and tissue contrast of the patient&#8217;s brain.</li>



<li>Each image is 3D, and can therefore be analyzed through 3 different planes that are composed of 2D slices.</li>



<li>Many slices contain little or no information. We will <strong>only</strong> <strong>keep the (60:135)</strong> <strong>slices</strong> range.</li>



<li>A segmentation image contains 1 to 4 classes.</li>



<li>Class number 4 must be reassigned to 3 since value 3 is missing.</li>
</ul>



<p>Now that we know more about our data, it is time to prepare the training of our model.</p>



<h3 class="wp-block-heading">2 &#8211; Training preparation</h3>



<p><strong>Split data into 3 sets</strong></p>



<p>In the world of AI, the quality of a model is determined by its <strong>ability to make accurate predictions on new, unseen data</strong>. To achieve this, it is important to divide our data into three sets: <strong>Training, Validation and Test</strong>.</p>



<p>Reminder of their usefulness:</p>



<ul class="wp-block-list">
<li><strong>Training set</strong> is used to train the model. During training, the model is exposed to the training data and adjusts its parameters to minimize the error between its predictions and the Ground truth (original segmentations).</li>



<li><strong>Validation set</strong> is used to fine-tune the hyperparameters of our model, which are set before training and determine the behavior of our model. The aim is to compare different hyperparameters and select the best configuration for our model.</li>



<li><strong>Test set</strong> is used to evaluate the performance of our model after it has been trained, to see how well it performs on data that was not used during the training of the model.</li>
</ul>



<p>The dataset contains 369 different patients. Here is the distribution chosen for the 3 data sets:</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/BraTS_data_distribution.png" alt="Data distribution for BraTS2020 dataset" class="wp-image-25006" width="430" height="340" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/BraTS_data_distribution.png 398w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/BraTS_data_distribution-300x237.png 300w" sizes="auto, (max-width: 430px) 100vw, 430px" /></figure>
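<p>Such a split can be done in a few lines of plain Python. The 250 training patients match what is used later in this post; the validation/test counts below are placeholders for the remaining 119 patients, not the exact figures of the distribution chart:</p>

```python
import random

# 369 patient identifiers (names are illustrative)
patients = [f"BraTS20_Training_{i:03d}" for i in range(1, 370)]
random.seed(42)          # reproducible split
random.shuffle(patients)

# 250 for training (as used later in the post); the remaining 119
# are split arbitrarily here between validation and test.
train = patients[:250]
val = patients[250:310]
test = patients[310:]
print(len(train), len(val), len(test))  # 250 60 59
```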



<p><strong>Data preprocessing</strong></p>



<p>In order to train a neural network to segment objects in images, it is necessary to feed it with both the raw image data (X) and the ground truth segmentations (y). By combining these two types of data, the neural network can learn to recognize tumor patterns and make accurate predictions about the contents of a patient&#8217;s scan.</p>



<p>Unfortunately, our modality images (X) and our segmentations (y) <strong>cannot be sent directly to the AI model</strong>. Loading all these 3D images at once would overload the memory of our environment and lead to shape-mismatch errors. We first have to do some image <strong>preprocessing</strong>, which will be done by using a <strong>Data Generator</strong>, where we will perform any operation we consider necessary when loading the images.</p>



<p>As we have explained, we will, for each sample:</p>



<ul class="wp-block-list">
<li>Retrieve the paths of its 2 selected modalities (T1CE &amp; FLAIR) and of its ground truth (original segmentation)</li>



<li>Load modalities &amp; segmentation</li>



<li>Create an X array (image) that will contain all the selected slices (60-135) of these 2 modalities.</li>



<li>Generate a y array (image) that will contain all the selected slices (60-135) of the ground truth.</li>



<li>Replace every 4 in the y array with the value 3 (in order to correct the missing class 3).</li>
</ul>
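<p>The per-sample steps above can be sketched as follows. Synthetic arrays replace the <code>nibabel</code> loading, and the function name is hypothetical, not the one used in the notebook:</p>

```python
import numpy as np

SLICE_RANGE = slice(60, 135)   # the 75 informative slices

def preprocess_sample(t1ce, flair, seg):
    """Per-sample preprocessing: slice selection, modality stacking,
    label correction. Inputs are (240, 240, 155) arrays (loaded with
    nibabel in the real pipeline)."""
    X = np.stack([t1ce[:, :, SLICE_RANGE],
                  flair[:, :, SLICE_RANGE]], axis=-1)  # (240, 240, 75, 2)
    y = seg[:, :, SLICE_RANGE].copy()
    y[y == 4] = 3            # reassign class 4 to the missing value 3
    return X, y

# Synthetic volumes standing in for real NIfTI files
rng = np.random.default_rng(0)
t1ce = rng.random((240, 240, 155))
flair = rng.random((240, 240, 155))
seg = rng.choice([0, 1, 2, 4], size=(240, 240, 155))

X, y = preprocess_sample(t1ce, flair, seg)
print(X.shape, np.unique(y))  # (240, 240, 75, 2) [0 1 2 3]
```

In the real generator, resizing to (128, 128) and One-Hot encoding (described below) would follow these steps.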



<p>In addition to these preprocessing steps, we will:</p>



<p><strong>Work in the axial plane</strong></p>



<p>The images are square (240&#215;240) in this plane. Since we will manipulate a range of slices, we will still be able to visualize the predictions in the 3 planes, so this choice has little practical impact.</p>



<p><strong>Apply a One-Hot Encoder to the y array</strong></p>



<p>Since our goal is to segment regions that are represented as different classes (0 to 3), we must use One-Hot Encoding to convert our categorical variables (classes) into a numerical representation that can be used by our neural network (since they are based on mathematical equations).</p>



<p>Indeed, from a mathematical point of view, sending the y array as it is would mean that some classes are superior to others, while there is no superiority link between them. For example, class 1 is inferior to class 4 since 1 &lt; 4. A One-Hot encoder will allow us to manipulate only 0 and 1.</p>



<p>Here is what it consists of, for one slice: </p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/one-hot-encoding-1024x576.png" alt="One-Hot encoding applied to the BraTS2020 dataset" class="wp-image-25058" width="1204" height="677" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/one-hot-encoding-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/one-hot-encoding-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/one-hot-encoding-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/one-hot-encoding.png 1280w" sizes="auto, (max-width: 1204px) 100vw, 1204px" /><figcaption class="wp-element-caption"><em>One-Hot encoding applied to the BraTS2020 dataset</em></figcaption></figure>
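<p>The encoding illustrated above can be reproduced in one line with NumPy (Keras users could rely on <code>tf.keras.utils.to_categorical</code> instead):</p>

```python
import numpy as np

def one_hot(seg, n_classes=4):
    """Turn a (H, W) slice of class indices into a (H, W, n_classes)
    binary array: channel c is 1 where the pixel belongs to class c."""
    return np.eye(n_classes, dtype=np.uint8)[seg]

slice_2d = np.array([[0, 1],
                     [2, 3]])
encoded = one_hot(slice_2d)
print(encoded.shape)        # (2, 2, 4)
print(encoded[1, 1])        # [0 0 0 1] -> pixel of class 3
```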



<p><strong>Resize each slice of our images</strong> from (240&#215;240) to a (128, 128) shape.</p>



<p>Resizing is needed because we need image dimensions that are a power of two (2<sup>n</sup>, where n is an integer). This is due to the fact that we will use pooling layers (MaxPooling2D) in our convolutional neural network (CNN), each of which divides the spatial resolution by 2.</p>



<p>You may wonder why we didn&#8217;t resize the images to a (256, 256) shape, which is also a power of 2 and closer to 240 than 128 is.</p>



<p>Indeed, resizing images to (256, 256) would preserve more information than resizing to (128, 128), which could lead to better performance. However, this larger size also means that the model will have more parameters, which increases the training time and memory requirements. This is why we choose the (128, 128) shape.</p>
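<p>As a quick sanity check, a few lines of arithmetic show that 128 halves cleanly through repeated 2&#215;2 pooling, and that a (256, 256) input would quadruple the number of pixels per slice:</p>

```python
def pool_sequence(size, n_pools):
    """Spatial size after each successive 2x2 max-pooling layer."""
    sizes = [size]
    for _ in range(n_pools):
        sizes.append(sizes[-1] // 2)
    return sizes

print(pool_sequence(128, 4))       # [128, 64, 32, 16, 8]

# A (256, 256) input carries 4x more pixels (hence activations) per slice:
print(256 * 256 // (128 * 128))    # 4
```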



<p><strong>To summarize the preprocessing steps:</strong> </p>



<ul class="wp-block-list">
<li>We use a data generator to be able to process and send our data to our neural network (since all our images cannot be stored in memory at once).</li>



<li>For each epoch (single pass of the entire training dataset through a neural network), the model will receive 250 samples (those contained in our training dataset).</li>



<li>For each sample, the model will have to analyze 150 slices (two modalities &#215; 75 selected slices each), received in a (128, 128) shape, as an X array of a (128, 128, 75, 2) shape. This array will be provided together with the ground truth segmentation of the patient, which will be One-Hot encoded and will then have a (75, 128, 128, 4) shape.</li>
</ul>



<h3 class="wp-block-heading">3 &#8211; Define the model</h3>



<p>Now that our data is ready, we can define our segmentation model.</p>



<p><strong>U-Net</strong></p>



<p>We will use the <a href="https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">U-Net architecture</a>. This <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">convolutional neural network (CNN)</a> is designed for biomedical image segmentation, and is particularly well-suited for segmentation tasks where the regions of interest are small and have complex shapes (such as tumors in MRI scans).</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/u-net-architecture-1024x682.png" alt="U-Net architecture" class="wp-image-25056" width="793" height="528" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/u-net-architecture-1024x682.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/u-net-architecture-300x200.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/u-net-architecture-768x512.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/u-net-architecture-1536x1023.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/u-net-architecture.png 1555w" sizes="auto, (max-width: 793px) 100vw, 793px" /><figcaption class="wp-element-caption"><em>U-Net architecture</em></figcaption></figure>



<p><em>This neural network was first introduced in 2015 by Olaf Ronneberger, Philipp Fischer, Thomas Brox and reported in the paper <a href="https://arxiv.org/abs/1505.04597" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">U-Net: Convolutional Networks for Biomedical Image Segmentation</a>.</em></p>
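<p>To see how a (128, 128) input flows through the contracting path shown above, here is a small shape-tracing helper. The 64 to 1024 filter progression follows the original paper; the actual model may use fewer filters:</p>

```python
def encoder_shapes(hw=128, depth=4, filters=64):
    """Feature-map shapes down the U-Net contracting path: each level
    keeps the spatial size through its conv block ('same' padding),
    then halves it with 2x2 max-pooling while doubling the filters."""
    shapes = []
    for _ in range(depth):
        shapes.append((hw, hw, filters))
        hw, filters = hw // 2, filters * 2
    shapes.append((hw, hw, filters))   # bottleneck
    return shapes

print(encoder_shapes())
# [(128, 128, 64), (64, 64, 128), (32, 32, 256), (16, 16, 512), (8, 8, 1024)]
```

The expansive path mirrors this sequence in reverse, upsampling back to (128, 128) before the final per-pixel classification layer.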



<p><strong>Loss function</strong></p>



<p>When training a CNN, it&#8217;s important to choose a loss function that accurately reflects the performance of the network. Indeed, this function allows us to compare the predicted pixels to those of the ground truth for each patient. At each epoch, the goal is to update the weights of our model in a way that minimizes this loss function, and therefore improves the accuracy of its predictions.</p>



<p>A commonly used loss function for multi-class classification problems is <strong>categorical cross-entropy</strong>, which measures the difference between the predicted probability distribution of each pixel and the real value of the one-hot encoded ground truth. Note that segmentation models sometimes use the <strong>dice loss function</strong> as well.</p>
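<p>A minimal NumPy sketch of both quantities (the real training loop would use the Keras implementations):</p>

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over pixels of -sum_c y_true[c] * log(y_pred[c])."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=-1))

def dice_coefficient(y_true, y_pred, smooth=1.0):
    """Overlap measure in [0, 1]; the dice loss is 1 - dice_coefficient."""
    intersection = np.sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)

# Sanity check on a perfect prediction: near-zero loss, dice close to 1
y_true = np.eye(4)[[0, 1, 2, 3]]   # 4 one-hot encoded pixels
assert categorical_crossentropy(y_true, y_true) < 1e-5
assert abs(dice_coefficient(y_true, y_true) - 1.0) < 1e-6
```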



<p><strong>Output activation function</strong></p>



<p>To get this probability distribution over the different classes for each pixel, we apply a <strong>softmax</strong> activation function to the output layer of our neural network. </p>



<p>This means that during training, our CNN will adjust its weights to minimize our loss function, which compares predicted probabilities given by the softmax function with those of the ground truth segmentation.</p>
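<p>For a single pixel, the softmax step can be sketched in NumPy as follows:</p>

```python
import numpy as np

def softmax(logits, axis=-1):
    """Map raw scores to a probability distribution over classes."""
    shifted = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

pixel_logits = np.array([2.0, 1.0, 0.1, -1.0])  # one pixel, 4 classes
probs = softmax(pixel_logits)
print(round(probs.sum(), 6))  # 1.0
print(probs.argmax())         # 0 -> most probable class for this pixel
```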



<p><strong>Other metrics</strong></p>



<p>It is also important to monitor the model&#8217;s performance using evaluation metrics. </p>



<p>We will of course use <strong>accuracy</strong>, which is a very popular measure. However, this metric can be misleading when working with imbalanced datasets like BraTS2020, where the background class is over-represented. To address this issue, we will use other metrics such as the <strong>intersection over union (IoU), the Dice coefficient, precision, sensitivity, and specificity.</strong></p>



<ul class="wp-block-list">
<li><strong>Accuracy</strong>: Measures the overall proportion of correctly classified pixels, including both positive and negative pixels.</li>



<li><strong>IoU: </strong>Measures the overlap between the predicted and ground truth segmentations.</li>



<li><strong>Precision</strong> (positive predictive value): Measures the proportion of predicted positive pixels that are actually positive.</li>



<li><strong>Sensitivity</strong> (true positive rate): Measures the proportion of positive ground truth pixels that were correctly predicted as positive.</li>



<li><strong>Specificity</strong> (true negative rate): Measures the proportion of negative ground truth pixels that were correctly predicted as negative.</li>
</ul>
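<p>Treating one class as positive, all of these metrics can be derived from the four pixel-wise confusion counts (TP, TN, FP, FN). A NumPy sketch on flattened binary masks:</p>

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Pixel-wise metrics for one class treated as 'positive' (1)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "iou":         tp / (tp + fp + fn),
        "precision":   tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Toy example: 3 positive pixels, 5 negative, with one miss and one false alarm
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])
m = binary_metrics(y_true, y_pred)
print(m["accuracy"], m["iou"])  # 0.75 0.5
```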



<h3 class="wp-block-heading">4 &#8211; <strong>Analysis of training metrics</strong></h3>



<p><em>The model has been trained for 35 epochs.</em></p>



<figure class="wp-block-image aligncenter size-large is-resized"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/metrics_result-edited.png" alt="Training metrics of a segmentation model for the BraTS2020 dataset" class="wp-image-25047" width="1197" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/metrics_result-edited.png 1407w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/metrics_result-edited-300x150.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/metrics_result-edited-1024x512.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/metrics_result-edited-768x384.png 768w" sizes="(max-width: 1407px) 100vw, 1407px" /><figcaption class="wp-element-caption"><em>Graphical display of training metrics over epochs</em></figcaption></figure>



<p>On the accuracy graph, we can see that both training accuracy and validation accuracy increase over the epochs and reach a plateau. This indicates that the model is learning from the data (training set) and generalizing well to new data (validation set). We do not seem to be facing overfitting, since both metrics are improving.</p>



<p>Then, we can see that our model is clearly learning from the training data, since both losses decrease over time on the second graph. We also notice that the best version of our model is reached around epoch 26. This conclusion is reinforced by the third graph, where both Dice coefficients increase over the epochs.</p>



<h3 class="wp-block-heading">5 &#8211; <strong>Segmentation results</strong></h3>



<p>Once the training is completed, we can look at how the model behaves against the<strong> test set </strong>by calling the <em><strong>.evaluate() </strong></em>function:</p>



<figure class="wp-block-table aligncenter"><table><tbody><tr><td class="has-text-align-center" data-align="center">Metric</td><td class="has-text-align-center" data-align="center">Score</td></tr><tr><td class="has-text-align-center" data-align="center">Categorical cross-entropy loss</td><td class="has-text-align-center" data-align="center">0.0206</td></tr><tr><td class="has-text-align-center" data-align="center">Accuracy</td><td class="has-text-align-center" data-align="center">0.9935</td></tr><tr><td class="has-text-align-center" data-align="center">MeanIOU</td><td class="has-text-align-center" data-align="center">0.8176</td></tr><tr><td class="has-text-align-center" data-align="center">Dice coefficient</td><td class="has-text-align-center" data-align="center">0.6008</td></tr><tr><td class="has-text-align-center" data-align="center">Precision</td><td class="has-text-align-center" data-align="center">0.9938</td></tr><tr><td class="has-text-align-center" data-align="center">Sensitivity</td><td class="has-text-align-center" data-align="center">0.9922</td></tr><tr><td class="has-text-align-center" data-align="center">Specificity</td><td class="has-text-align-center" data-align="center">0.9979</td></tr></tbody></table></figure>



<p>We can conclude that the model <strong>performed very well on the test dataset</strong>, achieving a <strong>low test loss </strong>(0.0206), <strong>a decent Dice coefficient</strong> (0.6008) for an image segmentation task, and <strong>good scores on the other metrics</strong>, which indicate that the model generalizes well to unseen data.</p>



<p>To understand a little better what lies behind these scores, let&#8217;s plot the predicted segmentations of some randomly selected patients:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_wo_processing-1024x640.png" alt="Predicted segmentation vs ground truth segmentation for the BraTS2020 dataset" class="wp-image-25052" width="902" height="564" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_wo_processing-1024x640.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_wo_processing-300x188.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_wo_processing-768x480.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_wo_processing.png 1280w" sizes="auto, (max-width: 902px) 100vw, 902px" /><figcaption class="wp-element-caption"><em>Graphical comparison of original and predicted segmentations for randomly selected patients</em></figcaption></figure>



<p>Predicted segmentations <strong>seem quite accurate</strong>, but we need to do some <strong>post-processing</strong> in order to convert the probabilities given by the softmax function into a single class per pixel, corresponding to the class that obtained the highest probability.</p>



<p>The <em><strong>argmax()</strong></em> function is chosen here. Applying this function will also allow us to <strong>remove some false positive cases</strong>, and to <strong>plot matching colors</strong> between the original segmentation and the prediction, which makes them easier to compare than just above.</p>
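<p>This post-processing step is a single NumPy call; here it is applied to a hypothetical 2&#215;2 model output:</p>

```python
import numpy as np

# Hypothetical softmax output for a 2x2 slice: one probability per class
pred_probs = np.array([[[0.7, 0.1, 0.1, 0.1],    # pixel -> class 0
                        [0.1, 0.6, 0.2, 0.1]],   # pixel -> class 1
                       [[0.2, 0.2, 0.5, 0.1],    # pixel -> class 2
                        [0.1, 0.1, 0.2, 0.6]]])  # pixel -> class 3

# Keep, for each pixel, the class with the highest probability
pred_classes = np.argmax(pred_probs, axis=-1)
print(pred_classes)
# [[0 1]
#  [2 3]]
```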



<p>For the same patients as before, we obtain: </p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_processed1-1024x640.png" alt="Post-processed predicted segmentation vs ground truth segmentation for the BraTS2020 dataset" class="wp-image-25054" width="908" height="568" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_processed1-1024x640.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_processed1-300x188.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_processed1-768x480.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/truth_vs_pred_processed1.png 1280w" sizes="auto, (max-width: 908px) 100vw, 908px" /><figcaption class="wp-element-caption"><em>Graphical comparison of original and post-processed predicted segmentations for randomly selected patients</em></figcaption></figure>



<h3 class="wp-block-heading">Conclusion</h3>



<p>I hope you have enjoyed this tutorial and that you now feel more comfortable with image segmentation!</p>



<p>Keep in mind that even if our results seem accurate, we have some false positives in our predictions. In a field like medical imaging, it is crucial to evaluate the balance between true positives and false positives and to assess the risks and benefits of an AI-based approach.</p>



<p>As we have seen, post-processing techniques can be used to solve this problem. However, we must be careful with the results of these methods, since they can lead to a loss of information.</p>



<h3 class="wp-block-heading">Want to find out more?</h3>



<ul class="wp-block-list">
<li><strong>Notebook</strong></li>
</ul>



<p>All the code is available on our <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/computer-vision/image-segmentation/tensorflow/brain-tumor-segmentation-unet/notebook_image_segmentation_unet.ipynb" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.</p>



<ul class="wp-block-list">
<li><strong>App</strong></li>
</ul>



<p>A Streamlit application was created around this use case to predict and observe the predictions generated by the model. Find the <a href="https://github.com/ovh/ai-training-examples/tree/main/apps/streamlit/image-segmentation-brain-tumors" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">segmentation app&#8217;s code here</a>.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fimage-segmentation-train-a-u-net-model-to-segment-brain-tumors%2F&amp;action_name=Image%20segmentation%3A%20Train%20a%20U-Net%20model%20to%20segment%20brain%20tumors&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Deploy a custom Docker image for Data Science project – A spam classifier with FastAPI (Part 3)</title>
		<link>https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-a-spam-classifier-with-fastapi-part-3/</link>
		
		<dc:creator><![CDATA[Eléa Petton]]></dc:creator>
		<pubDate>Fri, 30 Dec 2022 10:39:54 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Deploy]]></category>
		<category><![CDATA[AI Notebook]]></category>
		<category><![CDATA[AI Solutions]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[Scikit Learn]]></category>
		<category><![CDATA[spam classification]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24202</guid>

					<description><![CDATA[A guide to deploy a custom Docker image for an API with FastAPI and AI Deploy. Welcome to the third article concerning custom Docker image deployment. If you haven&#8217;t read the previous ones, you can check it: &#8211; Gradio sketch recognition app&#8211; Streamlit app for EDA and interactive prediction When creating code for a Data [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdeploy-a-custom-docker-image-for-data-science-project-a-spam-classifier-with-fastapi-part-3%2F&amp;action_name=Deploy%20a%20custom%20Docker%20image%20for%20Data%20Science%20project%20%E2%80%93%20A%20spam%20classifier%20with%20FastAPI%20%28Part%203%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>A guide to deploy a custom Docker image for an API with <a href="https://fastapi.tiangolo.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">FastAPI</a> and <strong>AI Deploy</strong>.</em></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="815" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier-1024x815.jpg" alt="fastapi for spam classification" class="wp-image-24226" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier-1024x815.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier-300x239.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier-768x612.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier-1536x1223.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-spam-classifier.jpg 1620w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><em>Welcome to the third article concerning <strong>custom Docker image deployment</strong>. If you haven&#8217;t read the previous ones, you can check them out:</em></p>



<p><em>&#8211; </em><a href="https://blog.ovhcloud.com/deploy-a-custom-docker-image-for-data-science-project-gradio-sketch-recognition-app-part-1/" data-wpel-link="internal">Gradio sketch recognition app</a><br><em>&#8211; </em><a href="https://docs.ovh.com/fr/publiccloud/ai/deploy/tuto-streamlit-eda-iris/" data-wpel-link="exclude">Streamlit app for EDA and interactive prediction</a></p>



<p>When creating code for a <strong>Data Science project</strong>, you probably want it to be as portable as possible. In other words, it can be run as many times as you like, even on different machines.</p>



<p>Unfortunately, it is often the case that Data Science code works fine locally on one machine but throws errors at runtime on another. This can be due to different versions of the libraries installed on the host machine.</p>



<p>To deal with this problem, you can use <a href="https://www.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker</a>.</p>



<p><strong>The article is organized as follows:</strong></p>



<ul class="wp-block-list">
<li>Objectives</li>



<li>Concepts</li>



<li>Define a model for spam classification</li>



<li>Build the FastAPI app with Python</li>



<li>Containerize your app with Docker</li>



<li>Launch the app with AI Deploy</li>
</ul>



<p><em>All the code for this blogpost is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/tree/main/apps/fastapi/spam-classifier-api" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>. You can test it with OVHcloud <strong>AI Deploy</strong> tool, please refer to the <a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/tuto-fastapi-spam-classifier/" data-wpel-link="exclude">documentation</a> to boot it up.</em></p>



<h2 class="wp-block-heading">Objectives</h2>



<p>In this article, you will learn how to develop a <strong>FastAPI</strong> API for spam classification.</p>



<p>Once your app is up and running locally, it will be a matter of containerizing it, then deploying the custom Docker image with AI Deploy.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="2160" height="1215" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited.jpg" alt="objective of api deployment" class="wp-image-24228" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited.jpg 2160w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited-300x169.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited-1024x576.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited-768x432.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited-1536x864.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-objective-edited-2048x1152.jpg 2048w" sizes="auto, (max-width: 2160px) 100vw, 2160px" /></figure>



<h2 class="wp-block-heading">Concepts</h2>



<p>In Artificial Intelligence, you have probably heard of <strong>Natural Language Processing</strong> (NLP). <strong>NLP</strong> gathers several tasks related to language processing such as <strong>text classification</strong>.</p>



<p>This technique is ideal for distinguishing spam from other messages.</p>



<h3 class="wp-block-heading">Spam Ham Collection&nbsp;Dataset</h3>



<p>The <a href="https://archive.ics.uci.edu/ml/datasets/sms+spam+collection" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">SMS Spam Collection</a> is a public set of labeled SMS messages collected for mobile phone spam research.</p>



<p>The dataset contains <strong>5,574 messages</strong> in English. The SMS messages are tagged as follows:</p>



<ul class="wp-block-list">
<li><strong>HAM</strong> if the message is legitimate</li>



<li><strong>SPAM</strong> if it is not</li>
</ul>



<p>The collection is a <strong>text file</strong>, where each line has the correct <strong>class</strong> followed by the raw <strong>message</strong>.</p>
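<p><em>For illustration, each line can be split into its class and message with nothing more than a tab split (a minimal sketch; the helper name is ours, not part of the dataset):</em></p>

```python
def parse_line(line: str):
    # Each line is "<class><TAB><raw message>", e.g. "ham\tSee you at lunch?"
    label, message = line.rstrip("\n").split("\t", 1)
    return label, message

print(parse_line("spam\tWINNER!! You have won a free prize"))
```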



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5-1024x576.png" alt="spam ham dataset" class="wp-image-24219" width="773" height="435" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5-1536x864.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-5.png 1920w" sizes="auto, (max-width: 773px) 100vw, 773px" /></figure>



<h3 class="wp-block-heading">Logistic regression</h3>



<p><strong>What is a Logistic Regression?</strong></p>



<p><a href="https://fr.wikipedia.org/wiki/R%C3%A9gression_logistique" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Logistic regression</a> is a statistical model. It lets you study the relationships between a set of <code>i</code> <strong>explanatory variables</strong> (<code>Xi</code>) and a <strong>qualitative variable</strong> (<code>Y</code>).</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression-1024x779.jpg" alt="logistic regression" class="wp-image-24229" width="467" height="355" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression-1024x779.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression-300x228.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression-768x584.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression-1536x1168.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-logistic-regression.jpg 1620w" sizes="auto, (max-width: 467px) 100vw, 467px" /></figure>



<p>It is a generalized linear model that uses the logistic function to map a linear combination of the inputs to a probability.</p>



<p>A logistic regression model can also predict the <strong>probability</strong> of an event occurring (value close to <code><strong>1</strong></code>) or not (value close to <strong><code>0</code></strong>) from the optimization of the <strong>regression coefficients</strong>. This result always varies between <strong><code>0</code></strong> and <strong><code>1</code></strong>.</p>



<p>For the spam classification use case, <strong>words</strong> are inputs and <strong>class</strong> (spam or ham) is output.</p>
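<p><em>As a quick illustration, the logistic function that maps a linear combination of inputs to a probability can be sketched in pure Python (the function name is ours):</em></p>

```python
import math

def sigmoid(z: float) -> float:
    # Logistic function: maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))   # 0.5, the decision boundary
print(sigmoid(4.0))   # close to 1: the event occurs
print(sigmoid(-4.0))  # close to 0: the event does not occur
```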



<h3 class="wp-block-heading">FastAPI</h3>



<p><strong>What is FastAPI?</strong></p>



<p><a href="https://fastapi.tiangolo.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">FastAPI</a> is a web framework for building <strong>RESTful APIs</strong> with Python.</p>



<p>FastAPI is based on <a href="https://docs.pydantic.dev/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Pydantic</a> and Python type hints to <em>validate</em>, <em>serialize</em> and <em>deserialize</em> data, and to automatically generate OpenAPI documentation.</p>



<h3 class="wp-block-heading">Docker</h3>



<p>The <a href="https://www.docker.com/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Docker</a>&nbsp;platform allows you to build, run and manage isolated applications. The principle is to package an application with not only its code, but also everything it needs to run: the libraries and their exact versions, for example.</p>



<p>When you wrap your application with all its context, you build a Docker image, which can be saved in your local repository or in the Docker Hub.</p>



<p>To get started with Docker, please, check this&nbsp;<a href="https://www.docker.com/get-started" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">documentation</a>.</p>



<p>To build a Docker image, you will define 2 elements:</p>



<ul class="wp-block-list">
<li>the application code (<em>FastAPI app</em>)</li>



<li>the&nbsp;<a href="https://docs.docker.com/engine/reference/builder/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Dockerfile</a></li>
</ul>



<p>In the next steps, you will see how to develop the Python code for your app, but also how to write the Dockerfile.</p>



<p>Finally, you will see how to deploy your custom docker image with&nbsp;<strong>OVHcloud AI Deploy</strong>&nbsp;tool.</p>



<h3 class="wp-block-heading">AI Deploy</h3>



<p><strong>AI Deploy</strong>&nbsp;lets you serve AI models and managed applications via Docker containers.</p>



<p>To know more about AI Deploy, please refer to this&nbsp;<a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/getting-started/" data-wpel-link="exclude">documentation</a>.</p>



<h2 class="wp-block-heading">Define a model for spam classification</h2>



<p>❗ <strong><code>To develop an API that uses a Machine Learning model, you have to load the model in the correct format. For this tutorial, a Logistic Regression is used, defined in the Python file model.py</code></strong>.<br><br><code><strong>To better understand the model.py code, refer to the <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/text-classification/miniconda/spam-classifier/notebook-spam-classifier.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">notebook</a> which details all the steps</strong></code>.</p>



<p>First of all, you have to import the&nbsp;<strong>Python libraries</strong>&nbsp;needed to create the Logistic Regression in the <code>model.py</code> file.</p>



<pre class="wp-block-code"><code class="">import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression</code></pre>



<p>Now, you can create the Logistic Regression based on the <strong>Spam Ham Collection&nbsp;Dataset</strong>. The Python framework named <strong>Scikit-Learn</strong> is used to define this model.</p>



<p>Firstly, you can load the dataset and transform your input file into a <code>dataframe</code>.</p>



<p>You will also be able to define the <code>input</code> and the <code>output</code> of the model.</p>



<pre class="wp-block-code"><code class="">def load_data():

    PATH = 'SMSSpamCollection'
    df = pd.read_csv(PATH, delimiter = "\t", names=["classe", "message"])

    X = df['message']
    y = df['classe']

    return X, y</code></pre>



<p>In a second step, you split the data into a training set and a test set.</p>



<p>To <strong>separate the dataset fairly</strong>, hold out about 2,000 messages for testing and express <code>test_size</code> as a fraction between 0 and 1 by calculating <code>ntest</code> as follows.</p>



<pre class="wp-block-code"><code class="">def split_data(X, y):

    # Fraction corresponding to ~2,000 test messages out of the full dataset
    ntest = 2000/(3572+2000)

    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=ntest, random_state=0)

    # Only the training split is kept: the model was already evaluated in the notebook
    return X_train, y_train</code></pre>



<p>Now you can concentrate on creating the <strong>Machine Learning model</strong>. To do this, create a <code>spam_classifier_model</code> function.</p>



<p>To fully understand the code, refer to <strong>Steps 6 to 9</strong> of this <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/text-classification/miniconda/spam-classifier/notebook-spam-classifier.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">notebook</a>. In these steps you will learn how to:</p>



<ul class="wp-block-list">
<li>create the model using <strong>Logistic Regression</strong></li>



<li>evaluate on the test set</li>



<li>do <strong>dimension reduction</strong> with stop words and term frequency</li>



<li>do <strong>dimension reduction</strong> as post-processing of the model</li>
</ul>



<pre class="wp-block-code"><code class="">def spam_classifier_model(Xtrain, ytrain):

    # First fit on the full vocabulary
    model_logistic_regression = LogisticRegression()
    model_logistic_regression = model_logistic_regression.fit(Xtrain, ytrain)

    coeff = model_logistic_regression.coef_
    coef_abs = np.abs(coeff)

    quantiles = np.quantile(coef_abs,[0, 0.25, 0.5, 0.75, 0.9, 1])

    # Keep only the features whose absolute coefficient exceeds the first quartile
    index = np.where(coef_abs[0] &gt; quantiles[1])
    newXtrain = Xtrain[:, index[0]]

    # Refit the model on the reduced feature set
    model_logistic_regression = LogisticRegression()
    model_logistic_regression.fit(newXtrain, ytrain)

    return model_logistic_regression, index</code></pre>



<p>Once these Python functions are defined, you can call and apply them as follows.</p>



<p>Firstly, extract input and output data with <code>load_data()</code>:</p>



<pre class="wp-block-code"><code class="">data_input, data_output = load_data()</code></pre>



<p>Secondly, split the data using the <code>split_data(data_input, data_output)</code>:</p>



<pre class="wp-block-code"><code class="">X_train, ytrain = split_data(data_input, data_output)</code></pre>



<p>❗ <code><strong>Here, there is no need to use the test set. Indeed, the evaluation of the final model has already been done in <em>Step 9 - Dimensionality reduction: post processing of the model</em> of the <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/text-classification/miniconda/spam-classifier/notebook-spam-classifier.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">notebook</a>.</strong></code></p>



<p>Thirdly, <strong>transform</strong> and <strong>fit</strong> the training set. To prepare the data, use <code>CountVectorizer</code> from Scikit-Learn to remove <strong>stop words</strong>, then <code>fit_transform</code> to learn the vocabulary and encode the messages.</p>



<pre class="wp-block-code"><code class="">vectorizer = CountVectorizer(stop_words='english', binary=True, min_df=10)
Xtrain = vectorizer.fit_transform(X_train.tolist())
Xtrain = Xtrain.toarray()</code></pre>



<p>Fourthly, use the model and index for prediction by calling <code>spam_classifier_model</code> function.</p>



<pre class="wp-block-code"><code class="">model_logistic_regression, index = spam_classifier_model(Xtrain, ytrain)</code></pre>



<p>Find out the full Python code <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/fastapi/spam-classifier-api/model.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</p>



<p>Have you successfully defined your model? Good job 🥳 !</p>



<p>Let&#8217;s go for the creation of the API!</p>



<h2 class="wp-block-heading">Build the FastAPI app with Python</h2>



<p>❗ <code><strong>All the codes below are available in the <em>app.py</em> file. You can find the complete Python code of the <em>app.py</em> file <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/fastapi/spam-classifier-api/app.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></code></p>



<p>To begin, you can import dependencies for FastAPI app.</p>



<ul class="wp-block-list">
<li>uvicorn</li>



<li>fastapi</li>



<li>pydantic</li>
</ul>



<pre class="wp-block-code"><code class="">import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from model import model_logistic_regression, index, vectorizer</code></pre>



<p>In the first place, you can initialize an instance of FastAPI.</p>



<pre class="wp-block-code"><code class="">app = FastAPI()</code></pre>



<p>Next, you can define the data format by creating the Python class named <code>request_body</code>. Here, the <strong>string</strong> (<code>str</code>) format is required.</p>



<pre class="wp-block-code"><code class="">class request_body(BaseModel):
    message : str</code></pre>



<p>Now, you can create the process function in order to prepare the sent message to be used by the model.</p>



<pre class="wp-block-code"><code class="">def process_message(message):

    desc = vectorizer.transform(message)
    dense_desc = desc.toarray()
    dense_select = dense_desc[:, index[0]]

    return dense_select</code></pre>



<p>On exit from this function, the message no longer contains any <strong>stop words</strong>; it is put in the right format for the model thanks to <code>transform</code> and is then represented as an <code>array</code>.</p>



<p>Now that the function for processing the input data is defined, you can pass the <code>GET</code> and <code>POST</code> methods.</p>



<p>First, let&#8217;s go for the <code>GET</code> method!</p>



<pre class="wp-block-code"><code class="">@app.get('/')
def root():
    return {'message': 'Welcome to the SPAM classifier API'}</code></pre>



<p>Here you can see the <em>welcome message</em> when you arrive on your API.</p>



<pre class="wp-block-preformatted"><code><strong>{"message":"Welcome to the SPAM classifier API"}</strong></code></pre>



<p>Now it&#8217;s the turn of the <code>POST</code> method. In this part of the code, you will be able to:</p>



<ul class="wp-block-list">
<li>define the message format</li>



<li>check if a message has been sent or not</li>



<li>process the message to fit with the model</li>



<li>extract the probabilities</li>



<li>return the results</li>
</ul>



<pre class="wp-block-code"><code class="">@app.post('/spam_detection_path')
def classify_message(data : request_body):

    # Reject empty input before running the model
    if not data.message:
        raise HTTPException(status_code=400, detail="Please provide a valid text message")

    message = [
        data.message
    ]

    dense_select = process_message(message)

    label = model_logistic_regression.predict(dense_select)
    proba = model_logistic_regression.predict_proba(dense_select)

    # predict_proba returns the probabilities in class order: [ham, spam]
    if label[0]=='ham':
        label_proba = proba[0][0]
    else:
        label_proba = proba[0][1]

    return {'label': label[0], 'label_probability': label_proba}</code></pre>



<p><code><strong>❗ Again, you can find the full code <a href="https://github.com/ovh/ai-training-examples/blob/main/apps/fastapi/spam-classifier-api/app.py" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a></strong></code>.</p>



<p>Before deploying your API, you can test it locally using the following command:</p>



<pre class="wp-block-code"><code class="">uvicorn app:app --reload</code></pre>



<p>Then, you can test your app locally at the following address:&nbsp;<strong><code>http://localhost:8000/</code></strong></p>



<p>You will arrive on the following page:</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-4.png" alt="" class="wp-image-24217" width="590" height="721" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-4.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-4-245x300.png 245w" sizes="auto, (max-width: 590px) 100vw, 590px" /></figure>



<p><strong>How to interact with your&nbsp;API?</strong></p>



<p>You can add&nbsp;<code>/docs</code>&nbsp;at the end of the url of your&nbsp;app: <strong><code>http://localhost:8000/</code></strong><code><strong>docs</strong></code></p>



<p>A new page opens to you. It provides a complete dashboard for interacting with the&nbsp;API!</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image.png" alt="" class="wp-image-24213" width="590" height="722" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-245x300.png 245w" sizes="auto, (max-width: 590px) 100vw, 590px" /></figure>



<p>To be able to send a message for classification, select&nbsp;<code><strong>/spam_detection_path</strong></code>&nbsp;in the green box. Click on<strong>&nbsp;<code>Try</code></strong><code><strong> it out</strong></code>&nbsp;and type the message of your choice in the dedicated&nbsp;zone.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-2.png" alt="" class="wp-image-24215" width="596" height="729" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-2.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-2-245x300.png 245w" sizes="auto, (max-width: 596px) 100vw, 596px" /></figure>



<p>Enter the message of your choice. It must be in the form of a <code><strong>string</strong></code>. </p>



<p><em>Example:</em> <code><strong>"A new free service for you only"</strong></code></p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-1.png" alt="" class="wp-image-24214" width="599" height="733" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-1.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-1-245x300.png 245w" sizes="auto, (max-width: 599px) 100vw, 599px" /></figure>



<p>To get the result of the prediction, click on the&nbsp;<code><strong>Execute</strong></code>&nbsp;button.</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-3.png" alt="" class="wp-image-24216" width="611" height="748" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-3.png 760w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/image-3-245x300.png 245w" sizes="auto, (max-width: 611px) 100vw, 611px" /></figure>



<p>Finally, you obtain the result of the prediction with the&nbsp;<strong>label</strong>&nbsp;and the&nbsp;<strong>confidence&nbsp;score</strong>.</p>



<p>Your app works locally? Congratulations&nbsp;🎉 !</p>



<p>Now it’s time to move on to containerization!</p>



<h2 class="wp-block-heading">Containerize your app with Docker</h2>



<p>First of all, you have to build the file that lists the different Python modules to be installed with their corresponding versions.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="574" src="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker-1024x574.jpg" alt="docker image datascience" class="wp-image-24230" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker-1024x574.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker-300x168.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker-768x430.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker-1536x861.jpg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/12/draw-docker.jpg 1620w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Create the requirements.txt file</h3>



<p>The&nbsp;<code><a href="https://github.com/ovh/ai-training-examples/blob/main/apps/fastapi/spam-classifier-api/requirements.txt" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">requirements.txt</a></code>&nbsp;file lists all the modules needed to make our application work.</p>



<pre class="wp-block-code"><code class="">fastapi==0.87.0
pydantic==1.10.2
uvicorn==0.20.0
pandas==1.5.1
scikit-learn==1.1.3</code></pre>



<p>This file will be useful when writing the&nbsp;<code>Dockerfile</code>.</p>



<h3 class="wp-block-heading">Write the Dockerfile</h3>



<p>Your&nbsp;<code><a href="https://github.com/ovh/ai-training-examples/blob/main/apps/fastapi/spam-classifier-api/Dockerfile" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Dockerfile</a></code>&nbsp;should start with the&nbsp;<code>FROM</code>&nbsp;instruction indicating the parent image to use. In our case, we choose to start from a classic Python image.</p>



<p>For this FastAPI app, you can use version&nbsp;<strong><code>3.8</code></strong>&nbsp;of Python.</p>



<pre class="wp-block-code"><code class="">FROM python:3.8</code></pre>



<p>Next, you have to set the working directory and copy all your&nbsp;files into it.</p>



<p><code><strong>❗&nbsp;Here you must be in the /workspace directory. This is the base directory for launching an OVHcloud AI Deploy app.</strong></code></p>



<pre class="wp-block-code"><code class="">WORKDIR /workspace
ADD . /workspace</code></pre>



<p>Install the&nbsp;<code>requirements.txt</code>&nbsp;file which contains your needed Python modules using a&nbsp;<code>pip install…</code>&nbsp;command.</p>



<pre class="wp-block-code"><code class="">RUN pip install -r requirements.txt</code></pre>



<p>Set the listening port of the&nbsp;container. For <strong>FastAPI</strong>, you can use the port <code>8000</code>.</p>



<pre class="wp-block-code"><code class="">EXPOSE 8000</code></pre>



<p>Then, you have to define the <strong>entrypoint</strong> and the <strong>default launching command</strong> to start the application.</p>



<pre class="wp-block-code"><code class="">ENTRYPOINT ["uvicorn"]
CMD [ "app:app", "--host", "0.0.0.0", "--port", "8000" ]</code></pre>



<p>Finally, you can give correct access rights to OVHcloud user (<code>42420:42420</code>).</p>



<pre class="wp-block-code"><code class="">RUN chown -R 42420:42420 /workspace
ENV HOME=/workspace</code></pre>
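<p><em>Assembled from the fragments above, the complete file could look like this (a sketch; the repository's <code>Dockerfile</code> linked earlier is the authoritative version):</em></p>

```dockerfile
FROM python:3.8

WORKDIR /workspace
ADD . /workspace

RUN pip install -r requirements.txt

RUN chown -R 42420:42420 /workspace
ENV HOME=/workspace

EXPOSE 8000

ENTRYPOINT ["uvicorn"]
CMD [ "app:app", "--host", "0.0.0.0", "--port", "8000" ]
```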



<p>Once your&nbsp;<code>Dockerfile</code>&nbsp;is defined, you will be able to build your custom docker image.</p>



<h3 class="wp-block-heading">Build the Docker image from the Dockerfile</h3>



<p>First, you can launch the following command from the&nbsp;<code>Dockerfile</code>&nbsp;directory to build your application image.</p>



<pre class="wp-block-code"><code class="">docker build . -t fastapi-spam-classification:latest</code></pre>



<p>⚠️&nbsp;<strong><code>The dot . argument indicates that your build context (place of the Dockerfile and other needed files) is the current directory.</code></strong></p>



<p>⚠️&nbsp;<code><strong>The -t argument allows you to choose the identifier to give to your image. Usually image identifiers are composed of a name and a version tag &lt;name&gt;:&lt;version&gt;. For this example we chose fastapi-spam-classification:latest.</strong></code></p>



<h3 class="wp-block-heading">Test it locally</h3>



<p>Now, you can run the following&nbsp;<strong>Docker command</strong>&nbsp;to launch your application locally on your computer.</p>



<pre class="wp-block-code"><code class="">docker run --rm -it -p 8000:8000 --user=42420:42420 fastapi-spam-classification:latest</code></pre>



<p>⚠️&nbsp;<code><strong>The -p 8000:8000 argument indicates that you want to execute a port redirection from the port 8000 of your local machine into the port 8000 of the Docker container.</strong></code></p>



<p>⚠️<code><strong>&nbsp;Don't forget the --user=42420:42420 argument if you want to simulate the exact same behaviour that will occur on AI Deploy. It executes the Docker container as the specific OVHcloud user (user 42420:42420).</strong></code></p>



<p>Once started, your application should be available on&nbsp;<strong>http://localhost:8000</strong>.<br><br>Your Docker image seems to work? Good job&nbsp;👍 !<br><br>It’s time to push it and deploy it!</p>



<h3 class="wp-block-heading">Push the image into the shared registry</h3>



<p>❗&nbsp;The shared registry of AI Deploy should only be used for testing purposes. Please consider attaching your own Docker registry. More information about this can be found&nbsp;<a href="https://docs.ovh.com/asia/en/publiccloud/ai/training/add-private-registry/" data-wpel-link="exclude">here</a>.</p>



<p>Then, you have to find the address of your&nbsp;<code>shared registry</code>&nbsp;by launching this command.</p>



<pre class="wp-block-code"><code class="">ovhai registry list</code></pre>



<p>Next, log in on the shared registry with your usual&nbsp;<code>OpenStack</code>&nbsp;credentials.</p>



<pre class="wp-block-code"><code class="">docker login -u &lt;user&gt; -p &lt;password&gt; &lt;shared-registry-address&gt;</code></pre>



<p>To finish, you need to push the created image into the shared registry.</p>



<pre class="wp-block-code"><code class="">docker tag fastapi-spam-classification:latest &lt;shared-registry-address&gt;/fastapi-spam-classification:latest</code></pre>



<pre class="wp-block-code"><code class="">docker push &lt;shared-registry-address&gt;/fastapi-spam-classification:latest</code></pre>



<p>Once you have pushed your custom Docker image into the shared registry, you are ready to launch your app 🚀 !</p>



<h2 class="wp-block-heading">Launch the AI Deploy app</h2>



<p>The following command starts a new job running your <strong>FastAPI</strong> application.</p>



<pre class="wp-block-code"><code class="">ovhai app run \
      --default-http-port 8000 \
      --cpu 4 \
      &lt;shared-registry-address&gt;/fastapi-spam-classification:latest</code></pre>



<h3 class="wp-block-heading">Choose the compute resources</h3>



<p>First, you can choose either the number of GPUs or the number of CPUs for your app.</p>



<p><code><strong>--cpu 4</strong></code>&nbsp;indicates that we request 4 CPUs for that app.</p>



<h3 class="wp-block-heading">Make the app public</h3>



<p>Finally, add the&nbsp;<code><strong>--unsecure-http</strong></code>&nbsp;flag if you want your application to be reachable without any authentication.</p>






<h2 class="wp-block-heading">Conclusion</h2>



<p>Well done 🎉&nbsp;! You have learned how to build your&nbsp;<strong>own Docker image</strong>&nbsp;for a dedicated&nbsp;<strong>spam classification API</strong>!</p>



<p>You have also been able to deploy this app thanks to&nbsp;<strong>OVHcloud’s AI Deploy</strong>&nbsp;tool.</p>



<h3 class="wp-block-heading" id="want-to-find-out-more">Want to find out more?</h3>



<h5 class="wp-block-heading"><strong>Notebook</strong></h5>



<p>You want to access the notebook? Refer to the&nbsp;<a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/text-classification/miniconda/spam-classifier/notebook-spam-classifier.ipynb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub repository</a>.</p>



<h5 class="wp-block-heading"><strong>App</strong></h5>



<p>You want to access the full code to create the <strong>FastAPI</strong> API? Refer to the&nbsp;<a href="https://github.com/ovh/ai-training-examples/tree/main/apps/fastapi/spam-classifier-api" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.<br><br>To launch and test this app with&nbsp;<strong>AI Deploy</strong>, please refer to&nbsp;our&nbsp;<a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/tuto-fastapi-spam-classifier/" data-wpel-link="exclude">documentation</a>.</p>



<h2 class="wp-block-heading">References</h2>



<ul class="wp-block-list">
<li><a href="https://towardsdatascience.com/how-to-run-a-data-science-project-in-a-docker-container-2ab1a3baa889" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">How to Run a Data Science Project in a Docker Container</a></li>



<li><a href="https://towardsdatascience.com/step-by-step-approach-to-build-your-machine-learning-api-using-fast-api-21bd32f2bbdb" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Step-by-step Approach to Build Your Machine Learning API Using Fast API</a></li>
</ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdeploy-a-custom-docker-image-for-data-science-project-a-spam-classifier-with-fastapi-part-3%2F&amp;action_name=Deploy%20a%20custom%20Docker%20image%20for%20Data%20Science%20project%20%E2%80%93%20A%20spam%20classifier%20with%20FastAPI%20%28Part%203%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to build a Speech-To-Text Application with Python (3/3)</title>
		<link>https://blog.ovhcloud.com/how-to-build-a-speech-to-text-application-with-python-3-3/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Mon, 26 Dec 2022 14:22:42 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Apps]]></category>
		<category><![CDATA[AI Solutions]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[Streamlit]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=23823</guid>

					<description><![CDATA[A tutorial to create and build your own Speech-To-Text Application with Python. At the end of this third article, your Speech-To-Text Application will offer many new features such as speaker differentiation, summarization, video subtitles generation, audio trimming, and others! Final code of the app is available in our dedicated GitHub repository. Overview of our final [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-build-a-speech-to-text-application-with-python-3-3%2F&amp;action_name=How%20to%20build%20a%20Speech-To-Text%20Application%20with%20Python%20%283%2F3%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p id="block-972dc647-4202-432e-86f0-434b7dd789f0"><em>A tutorial to create and build your own <strong>Speech-To-Text Application</strong></em> with Python.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-3-1024x576.png" alt="speech to text app image3" class="wp-image-24060" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-3-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-3-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-3-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-3-1536x864.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-3.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p id="block-799b7c8e-3686-469c-b530-a568cfcce605">At the end of this <strong>third article</strong>, your Speech-To-Text Application will offer <strong>many new features</strong> such as speaker differentiation, summarization, video subtitles generation, audio trimming, and others!</p>



<p id="block-b8ed3876-3e4c-42cb-b83e-4e4f3c9b13b3"><em>Final code of the app is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/tree/main/apps/streamlit/speech-to-text" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.</em></p>



<h3 class="wp-block-heading" id="block-da716ed2-9734-494e-be61-539249c19438">Overview of our final Speech to Text Application</h3>



<figure class="wp-block-image aligncenter" id="block-2d8b0805-62db-4814-b015-efba81a8520a"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-1024x575.png" alt="Overview of the speech to text application"/></figure>



<p class="has-text-align-center" id="block-865fd283-6e3d-47f5-96cb-fdf47b60ffd2"><em>Overview of our final Speech-To-Text application</em></p>



<h3 class="wp-block-heading" id="block-1df7dcfc-8052-426f-ac67-ebc251d14185">Objective</h3>



<p id="block-812edfb1-6bba-47c0-98f0-830353e3d5c6">In the <a href="https://blog.ovhcloud.com/how-to-build-a-speech-to-text-application-with-python-2-3/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">previous article</a>, we have created a form where the user can select the options he wants to interact with. </p>



<p id="block-812edfb1-6bba-47c0-98f0-830353e3d5c6">Now that this form is created, it&#8217;s time to <strong>deploy the features</strong>!</p>



<p id="block-2fa073cc-8250-4740-a707-1a1b67dff94d">This article is organized as follows:</p>



<ul class="wp-block-list">
<li>Trim an audio</li>



<li>Punctuate the transcript</li>



<li>Differentiate speakers with diarization</li>



<li>Display the transcript correctly</li>



<li>Rename speakers</li>



<li>Create subtitles for videos (<em>.SRT</em>)</li>



<li>Update old code</li>
</ul>



<p id="block-d4e59d83-200a-4e2b-b502-d5fe00941adb"><em>⚠️ Since this article uses code already explained in the previous <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/natural-language-processing/speech-to-text/miniconda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">notebook tutorials</a>, w</em>e will not <em>re-explained its usefulness</em> here<em>. We therefore recommend that you read the notebooks first.</em></p>



<h3 class="wp-block-heading" id="block-1d231b87-2963-4dba-9df6-2f7dc4acb93c">Trim an audio ✂️</h3>



<p>The first option we are going to add is the ability to trim an audio file. </p>



<p>Indeed, if the user&#8217;s audio file is <strong>several minutes long</strong>, it is possible that the user only wants to <strong>transcribe a part of it</strong> to save some time. This is where the <strong>sliders</strong> of our form become useful. They allow the user to <strong>change default start &amp; end values</strong>, which determine which part of the audio file is transcribed.</p>



<p><em>For example, if the user&#8217;s file is 10 minutes long, they can use the sliders to indicate that they only want to transcribe the [00:30 -&gt; 02:30] part, instead of the full audio file.</em></p>
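<p>Concretely, the slider values (in seconds) must be turned into millisecond indices before slicing the pydub <em>AudioSegment</em> (<em>myaudio[start_ms:end_ms]</em>). A minimal standalone sketch of that conversion, combined with the bounds check described below (the helper name is ours, not part of the app):</p>

```python
def slider_values_to_ms(start_s, end_s, audio_length_s):
    """Sanitize the slider bounds, then convert them to the millisecond
    indices pydub uses for slicing: myaudio[start_ms:end_ms]."""
    if start_s >= audio_length_s or start_s >= end_s:
        start_s = 0  # conflicting start value: reset to the beginning
    if end_s > audio_length_s or end_s == 0:
        end_s = audio_length_s  # conflicting end value: reset to the maximum
    return start_s * 1000, end_s * 1000

# A 10-minute file (600 s), trimming the [00:30 -> 02:30] part:
print(slider_values_to_ms(30, 150, 600))  # (30000, 150000)
```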



<p id="block-837f49db-342e-4f92-8f7d-ae9b97ee51c2">⚠️ With this functionality, we must <strong>check the values </strong>set by the user! Indeed, imagine that the user selects an <em>end</em> value which is lower than the <em>start</em> one (ex : transcript would starts from start=40s to end=20s), this would be problematic.</p>



<p id="block-6bcd6e87-53cf-40d4-8b94-9f97c988587a">This is why you need to <strong>add the following function</strong> to your code, to rectify the potential errors:</p>



<pre id="block-00415e75-ed9c-4257-a5d2-0d033f419082" class="wp-block-code"><code class="">def correct_values(start, end, audio_length):
    """
    Start or/and end value(s) can be in conflict, so we check these values
    :param start: int value (s) given by st.slider() (fixed by user)
    :param end: int value (s) given by st.slider() (fixed by user)
    :param audio_length: audio duration (s)
    :return: approved values
    """
    # Start &amp; end Values need to be checked

    if start &gt;= audio_length or start &gt;= end:
        start = 0
        st.write("Start value has been set to 0s because of conflicts with other values")

    if end &gt; audio_length or end == 0:
        end = audio_length
        st.write("End value has been set to maximum value because of conflicts with other values")

    return start, end</code></pre>



<p id="block-a1ab0c76-ddb2-4abd-99ef-15f001529dae">If one of the values has been changed, we immediately <strong>inform the user</strong> with a <em>st.write().</em></p>



<p>We will call this function in the <em>transcription()</em> function, which we will rewrite at the end of this tutorial.</p>



<h3 class="wp-block-heading" id="block-50095355-4c37-4c0a-9ade-0ca44e86fde9">Split a text</h3>



<p id="block-b2616dcc-f867-4200-b95d-46b386d7efe1">If you have read the notebooks, you probably remember that some models (punctuation &amp; summarization) have <strong>input size limitations</strong>.</p>



<p id="block-62890aae-06a4-43e0-a8eb-2a13f2d57fe6">Let&#8217;s <strong>reuse the <em>split_text()</em> function</strong>, used in the notebooks, which will allow to send our whole transcript to these models by small text blocks, limited to a <em>max_size</em> number of characters:</p>



<pre id="block-0669f22d-2b45-4053-a806-07c9e897fc8e" class="wp-block-code"><code class="">def split_text(my_text, max_size):
    """
    Split a text
    Maximum sequence length for this model is max_size.
    If the transcript is longer, it needs to be split by the nearest possible value to max_size.
    To avoid cutting words, we will cut on "." characters, and " " if there is not "."

    :return: split text
    """

    cut2 = max_size

    # First, we get indexes of "."
    my_split_text_list = []
    nearest_index = 0
    length = len(my_text)
    # We split the transcript in text blocks of size &lt;= max_size.
    if cut2 == length:
        my_split_text_list.append(my_text)
    else:
        while cut2 &lt;= length:
            cut1 = nearest_index
            cut2 = nearest_index + max_size
            # Find the best index to split

            dots_indexes = [index for index, char in enumerate(my_text[cut1:cut2]) if
                            char == "."]
            if dots_indexes != []:
                nearest_index = max(dots_indexes) + 1 + cut1
            else:
                spaces_indexes = [index for index, char in enumerate(my_text[cut1:cut2]) if
                                  char == " "]
                if spaces_indexes != []:
                    nearest_index = max(spaces_indexes) + 1 + cut1
                else:
                    nearest_index = cut2 + cut1
            my_split_text_list.append(my_text[cut1: nearest_index])

    return my_split_text_list</code></pre>
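<p>To illustrate the idea on a small string, here is an equivalent standalone splitter (a simplified sketch written for this article, not the app&#8217;s function): it cuts at the last &#8220;.&#8221; inside each window, falls back to the last space, and keeps every chunk at most <em>max_size</em> characters long:</p>

```python
def split_on_boundaries(text, max_size):
    """Split text into chunks of at most max_size characters,
    preferring to cut after a "." and falling back to a space."""
    chunks = []
    start = 0
    while len(text) - start > max_size:
        window = text[start:start + max_size]
        cut = window.rfind(".")       # prefer cutting after a sentence
        if cut == -1:
            cut = window.rfind(" ")   # fall back to a word boundary
        if cut == -1:
            cut = max_size - 1        # no boundary found: hard cut
        chunks.append(text[start:start + cut + 1])
        start += cut + 1
    chunks.append(text[start:])       # remaining tail
    return chunks

print(split_on_boundaries("One. Two. Six.", 7))  # ['One.', ' Two.', ' Six.']
```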



<h3 class="wp-block-heading" id="block-aeff41dd-fe6b-4b31-ad7a-e2916e00657d">Punctuate the transcript</h3>



<p>Now, we need to add the function that allows us to <strong>send a <em>transcript</em> to the punctuation model</strong> in order to punctuate it:</p>



<pre id="block-6316f13a-3018-4a13-8bd6-b35a593c82e0" class="wp-block-code"><code class="">def add_punctuation(t5_model, t5_tokenizer, transcript):
    """
    Punctuate a transcript
    transcript: string limited to 512 characters
    :return: Punctuated and improved (corrected) transcript
    """

    input_text = "fix: { " + transcript + " } &lt;/s&gt;"

    input_ids = t5_tokenizer.encode(input_text, return_tensors="pt", max_length=10000, truncation=True,
                                    add_special_tokens=True)

    outputs = t5_model.generate(
        input_ids=input_ids,
        max_length=256,
        num_beams=4,
        repetition_penalty=1.0,
        length_penalty=1.0,
        early_stopping=True
    )

    transcript = t5_tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

    return transcript</code></pre>



<p>The punctuation feature is now ready; we will call these functions later.<br>For the summarization model, there is nothing more to add at this stage. </p>



<h3 class="wp-block-heading" id="block-ed0fa2ed-80f4-4821-9931-68f055e12490">Differentiate speakers with diarization</h3>



<p id="block-7430afcf-a8eb-45a7-bb0c-ebf0e1c77767">Now, let&#8217;s reuse all the <strong>diarization functions</strong> studied in the notebook tutorials, so we can differentiate speakers during a conversation.</p>



<p id="block-42bf8e95-825c-4e63-bfe9-877000b5f8f2"><strong>Convert <em>mp3/mp4</em> files to </strong><em><strong>.wav</strong> </em></p>



<p id="block-42bf8e95-825c-4e63-bfe9-877000b5f8f2"><em>Remember pyannote&#8217;s diarization only accepts .wav files as input.</em></p>



<pre id="block-71aa8771-0c17-40c7-88f1-a2880f99755b" class="wp-block-code"><code class="">def convert_file_to_wav(aud_seg, filename):
    """
    Convert a mp3/mp4 in a wav format
    Needs to be modified if you want to convert a format which contains less or more than 3 letters

    :param aud_seg: pydub.AudioSegment
    :param filename: name of the file
    :return: name of the converted file
    """
    filename = "../data/my_wav_file_" + filename[:-3] + "wav"
    aud_seg.export(filename, format="wav")

    newaudio = AudioSegment.from_file(filename)

    return newaudio, filename</code></pre>
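<p>Note that the <em>filename[:-3] + "wav"</em> slice assumes a three-letter extension, as the docstring warns. A more robust variant (our suggestion, not the tutorial&#8217;s code) relies on <em>os.path.splitext</em>, which handles any extension length:</p>

```python
import os

def wav_name(filename):
    """Build the target .wav path whatever the length of the source extension."""
    base, _ext = os.path.splitext(os.path.basename(filename))
    return "../data/my_wav_file_" + base + ".wav"

print(wav_name("interview.mp3"))   # ../data/my_wav_file_interview.wav
print(wav_name("interview.webm"))  # 4-letter extensions work too
```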



<p id="block-4bdf29a7-df3e-485c-a8a6-f7be0ef8bd58"><strong>Get diarization of an audio file</strong></p>



<p><em>The following function allows you to diarize an audio file.</em></p>



<pre id="block-85e83a22-7dbb-41de-b4fd-ceea38b5086e" class="wp-block-code"><code class="">def get_diarization(dia_pipeline, filename):
    """
    Diarize an audio (find numbers of speakers, when they speak, ...)
    :param dia_pipeline: Pyannote's library (diarization pipeline)
    :param filename: name of a wav audio file
    :return: str list containing audio's diarization time intervals
    """
    # Get diarization of the audio
    diarization = dia_pipeline({'audio': filename})
    listmapping = diarization.labels()
    listnewmapping = []

    # Rename default speakers' names (Default is A, B, ...), we want Speaker0, Speaker1, ...
    number_of_speakers = len(listmapping)
    for i in range(number_of_speakers):
        listnewmapping.append("Speaker" + str(i))

    mapping_dict = dict(zip(listmapping, listnewmapping))
    diarization.rename_labels(mapping_dict, copy=False)
    # copy set to False so we don't create a new annotation, we replace the actual one

    return diarization, number_of_speakers</code></pre>



<p id="block-15d919bb-e051-452f-a69e-be1d61908437"><strong>Convert diarization results to timedelta objects</strong></p>



<p><em>This conversion makes it easy to manipulate the results. </em></p>



<pre id="block-32987985-e79c-415f-806a-37e8f1721fa3" class="wp-block-code"><code class="">def convert_str_diarlist_to_timedelta(diarization_result):
    """
    Extract from Diarization result the given speakers with their respective speaking times and transform them in pandas timedelta objects
    :param diarization_result: result of diarization
    :return: list with timedelta intervals and their respective speaker
    """

    # get speaking intervals from diarization
    segments = diarization_result.for_json()["content"]
    diarization_timestamps = []
    for sample in segments:
        # Convert segment in a pd.Timedelta object
        new_seg = [pd.Timedelta(seconds=round(sample["segment"]["start"], 2)),
                   pd.Timedelta(seconds=round(sample["segment"]["end"], 2)), sample["label"]]
        # Start and end = speaking duration
        # label = who is speaking
        diarization_timestamps.append(new_seg)

    return diarization_timestamps</code></pre>
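<p>The conversion can be tried on a hand-made segment list. The sketch below reproduces the same logic with the standard library&#8217;s <em>datetime.timedelta</em> instead of pandas, which is enough to see the shape of the result:</p>

```python
from datetime import timedelta

def segments_to_timedeltas(segments):
    """Turn pyannote-style JSON segments into [start, end, speaker] triples."""
    return [[timedelta(seconds=round(seg["segment"]["start"], 2)),
             timedelta(seconds=round(seg["segment"]["end"], 2)),
             seg["label"]]
            for seg in segments]

segments = [{"segment": {"start": 0.497, "end": 3.203}, "label": "Speaker0"}]
print(segments_to_timedeltas(segments))
```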



<p id="block-3c5ce61f-c4a2-4505-bd1f-d1844eee8949"><strong>Merge the diarization segments</strong> <strong>that follow each other and that mention the same speaker</strong></p>



<p id="block-3c5ce61f-c4a2-4505-bd1f-d1844eee8949"><em>This will reduce the number of audio segments we need to create, and will give less sequenced, less small transcripts, which will be more pleasant for the user.</em></p>



<pre id="block-346d48b0-6c06-4208-b86d-d5424f222e32" class="wp-block-code"><code class="">def merge_speaker_times(diarization_timestamps, max_space, srt_token):
    """
    Merge near times for each detected speaker (Same speaker during 1-2s and 3-4s -&gt; Same speaker during 1-4s)
    :param diarization_timestamps: diarization list
    :param max_space: Maximum temporal distance between two silences
    :param srt_token: Enable/Disable generate srt file (choice fixed by user)
    :return: list with timedelta intervals and their respective speaker
    """

    if not srt_token:
        threshold = pd.Timedelta(seconds=max_space/1000)

        index = 0
        length = len(diarization_timestamps) - 1

        while index &lt; length:
            if diarization_timestamps[index + 1][2] == diarization_timestamps[index][2] and \
                    diarization_timestamps[index + 1][1] - threshold &lt;= diarization_timestamps[index][0]:
                diarization_timestamps[index][1] = diarization_timestamps[index + 1][1]
                del diarization_timestamps[index + 1]
                length -= 1
            else:
                index += 1
    return diarization_timestamps</code></pre>
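<p>The merging step can be illustrated with plain millisecond integers. This standalone sketch (ours, simplified from the function above) fuses consecutive entries that share a speaker and are separated by at most <em>max_space</em> milliseconds:</p>

```python
def merge_same_speaker(segments, max_space):
    """segments: [start_ms, end_ms, speaker] lists, sorted by time."""
    merged = [segments[0][:]]
    for start, end, speaker in segments[1:]:
        if speaker == merged[-1][2] and start - merged[-1][1] <= max_space:
            merged[-1][1] = end          # extend the previous segment
        else:
            merged.append([start, end, speaker])
    return merged

# Same speaker during 1-2s and 3-4s -> same speaker during 1-4s
print(merge_same_speaker(
    [[1000, 2000, "Speaker0"], [3000, 4000, "Speaker0"], [5000, 6000, "Speaker1"]],
    max_space=2000))
# [[1000, 4000, 'Speaker0'], [5000, 6000, 'Speaker1']]
```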



<p id="block-e36e9996-2809-450a-834b-0482eca3299f"><strong>Extend timestamps given by the diarization to avoid word cutting</strong></p>



<p id="block-b943d9c3-e469-4e9d-b7b7-23d5aa1235e7">Imagine we have a segment like [00:01:20 &#8211;&gt; 00:01:25], followed by [00:01:27 &#8211;&gt; 00:01:30].</p>



<p id="block-c7909bb8-7b9e-471e-a928-75d92375fedc">Maybe diarization is not working fine and there is some sound missing in the segments (means missing sound is between 00:01:25 and 00:01:27). The transcription model will then have difficulty understanding what is being said in these segments.</p>



<p id="block-c7909bb8-7b9e-471e-a928-75d92375fedc">➡️ Solution consists in fixing the end of the first segment and the start of the second one to 00:01:26, the middle of these values.</p>



<pre id="block-df4216a7-0112-425a-b763-9acabdf5d7d9" class="wp-block-code"><code class="">def extending_timestamps(new_diarization_timestamps):
    """
    Extend timestamps between each diarization timestamp if possible, so we avoid word cutting
    :param new_diarization_timestamps: list
    :return: list with merged times
    """

    for i in range(1, len(new_diarization_timestamps)):
        if new_diarization_timestamps[i][0] - new_diarization_timestamps[i - 1][1] &lt;= timedelta(milliseconds=3000) and new_diarization_timestamps[i][0] - new_diarization_timestamps[i - 1][1] &gt;= timedelta(milliseconds=100):
            middle = (new_diarization_timestamps[i][0] - new_diarization_timestamps[i - 1][1]) / 2
            new_diarization_timestamps[i][0] -= middle
            new_diarization_timestamps[i - 1][1] += middle

    # Converting list so we have a milliseconds format
    for elt in new_diarization_timestamps:
        elt[0] = elt[0].total_seconds() * 1000
        elt[1] = elt[1].total_seconds() * 1000

    return new_diarization_timestamps</code></pre>
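<p>On plain millisecond values, the midpoint extension looks as follows (a standalone sketch of the same idea, which only closes gaps between 100 ms and 3000 ms, like the function above):</p>

```python
def close_small_gaps(segments):
    """segments: [start_ms, end_ms, speaker]; meet in the middle of small gaps."""
    for i in range(1, len(segments)):
        gap = segments[i][0] - segments[i - 1][1]
        if 100 <= gap <= 3000:
            middle = gap / 2
            segments[i][0] -= middle       # pull the next segment back
            segments[i - 1][1] += middle   # push the previous segment forward
    return segments

# [00:01:20 -> 00:01:25] and [00:01:27 -> 00:01:30] now meet at 00:01:26
print(close_small_gaps([[80000, 85000, "Speaker0"], [87000, 90000, "Speaker1"]]))
# [[80000, 86000.0, 'Speaker0'], [86000.0, 90000, 'Speaker1']]
```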



<p><strong>Create &amp; Optimize the subtitles</strong></p>



<p>Some people naturally speak very quickly, and conversations can sometimes be heated. In both cases, there is a good chance that the transcribed text is very dense and unsuitable for subtitles (too much text on screen hides the video). </p>



<p>We will therefore define the following function. Its role is to split a speech segment in two when its text is judged too long.</p>



<pre class="wp-block-code"><code class="">def optimize_subtitles(transcription, srt_index, sub_start, sub_end, srt_text):
    """
    Optimize the subtitles (avoid a too long reading when many words are said in a short time)
    :param transcription: transcript generated for an audio chunk
    :param srt_index: Numeric counter that identifies each sequential subtitle
    :param sub_start: beginning of the transcript
    :param sub_end: end of the transcript
    :param srt_text: generated .srt transcript
    """

    transcription_length = len(transcription)

    # Length of the transcript should be limited to about 42 characters per line to avoid this problem
    if transcription_length &gt; 42:
        # Split the timestamp and its transcript in two parts
        # Get the middle timestamp
        diff = (timedelta(milliseconds=sub_end) - timedelta(milliseconds=sub_start)) / 2
        middle_timestamp = str(timedelta(milliseconds=sub_start) + diff).split(".")[0]

        # Get the closest middle index to a space (we don't divide transcription_length/2 to avoid cutting a word)
        space_indexes = [pos for pos, char in enumerate(transcription) if char == " "]
        nearest_index = min(space_indexes, key=lambda x: abs(x - transcription_length / 2))

        # First transcript part
        first_transcript = transcription[:nearest_index]

        # Second transcript part
        second_transcript = transcription[nearest_index + 1:]

        # Add both transcript parts to the srt_text
        srt_text += str(srt_index) + "\n" + str(timedelta(milliseconds=sub_start)).split(".")[0] + " --&gt; " + middle_timestamp + "\n" + first_transcript + "\n\n"
        srt_index += 1
        srt_text += str(srt_index) + "\n" + middle_timestamp + " --&gt; " + str(timedelta(milliseconds=sub_end)).split(".")[0] + "\n" + second_transcript + "\n\n"
        srt_index += 1
    else:
        # Add transcript without operations
        srt_text += str(srt_index) + "\n" + str(timedelta(milliseconds=sub_start)).split(".")[0] + " --&gt; " + str(timedelta(milliseconds=sub_end)).split(".")[0] + "\n" + transcription + "\n\n"

    return srt_text, srt_index</code></pre>
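<p>The timestamp strings used in the subtitles come from <em>str(timedelta(...))</em> with the fractional part stripped, exactly as in the code above. A minimal sketch of building a single entry that way (the helper is ours, not the app&#8217;s):</p>

```python
from datetime import timedelta

def srt_entry(index, start_ms, end_ms, text):
    """Format one subtitle block: counter, time range, text, blank line."""
    fmt = lambda ms: str(timedelta(milliseconds=ms)).split(".")[0]
    return str(index) + "\n" + fmt(start_ms) + " --> " + fmt(end_ms) + "\n" + text + "\n\n"

print(srt_entry(1, 0, 2500, "Hello everyone"))
# 1
# 0:00:00 --> 0:00:02
# Hello everyone
```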



<p id="block-d0254180-773e-45b3-b19e-c42153f6f102"><strong>Global function which performs the whole diarization</strong> <strong>action</strong></p>



<p><em>This function simply chains all the previous diarization functions to perform the whole process.</em></p>



<pre id="block-a4e19f59-8947-4736-93d7-89fdcc2ce9f8" class="wp-block-code"><code class="">def diarization_treatment(filename, dia_pipeline, max_space, srt_token):
    """
    Launch the whole diarization process to get speakers time intervals as pandas timedelta objects
    :param filename: name of the audio file
    :param dia_pipeline: Diarization Model (Differentiate speakers)
    :param max_space: Maximum temporal distance between two silences
    :param srt_token: Enable/Disable generate srt file (choice fixed by user)
    :return: speakers time intervals list and number of different detected speakers
    """
    
    # initialization
    diarization_timestamps = []

    # whole diarization process
    diarization, number_of_speakers = get_diarization(dia_pipeline, filename)

    if len(diarization) &gt; 0:
        diarization_timestamps = convert_str_diarlist_to_timedelta(diarization)
        diarization_timestamps = merge_speaker_times(diarization_timestamps, max_space, srt_token)
        diarization_timestamps = extending_timestamps(diarization_timestamps)

    return diarization_timestamps, number_of_speakers</code></pre>



<p id="block-76f3ec1f-9b23-4948-a124-51534febb30c"><strong>Launch diarization mode</strong></p>



<p>Previously, we were systematically running the <em>transcription_non_diarization()</em> function, which is based on the <em>silence detection method</em>.</p>



<p>But now that the user has the option to select the diarization option in the form, it is time to <strong>write our transcription_diarization() function</strong>. </p>



<p>The only difference between the two is that the silence-detection processing is replaced by the processing of the diarization results.</p>



<pre id="block-e9b9cf00-5ac0-4381-877c-430dc393bc79" class="wp-block-code"><code class="">def transcription_diarization(filename, diarization_timestamps, stt_model, stt_tokenizer, diarization_token, srt_token,
                              summarize_token, timestamps_token, myaudio, start, save_result, txt_text, srt_text):
    """
    Performs transcription with the diarization mode
    :param filename: name of the audio file
    :param diarization_timestamps: timestamps of each audio part (ex 10 to 50 secs)
    :param stt_model: Speech to text model
    :param stt_tokenizer: Speech to text model's tokenizer
    :param diarization_token: Differentiate or not the speakers (choice fixed by user)
    :param srt_token: Enable/Disable generate srt file (choice fixed by user)
    :param summarize_token: Summarize or not the transcript (choice fixed by user)
    :param timestamps_token: Display and save or not the timestamps (choice fixed by user)
    :param myaudio: AudioSegment file
    :param start: int value (s) given by st.slider() (fixed by user)
    :param save_result: whole process
    :param txt_text: generated .txt transcript
    :param srt_text: generated .srt transcript
    :return: results of transcribing action
    """
    # Numeric counter that identifies each sequential subtitle
    srt_index = 1

    # Handle a rare case: if there is only one segment, diarization_timestamps is a flat list, not a list of lists
    if not isinstance(diarization_timestamps[0], list):
        diarization_timestamps = [diarization_timestamps]

    # Transcribe each audio chunk (from timestamp to timestamp) and display transcript
    for index, elt in enumerate(diarization_timestamps):
        sub_start = elt[0]
        sub_end = elt[1]

        transcription = transcribe_audio_part(filename, stt_model, stt_tokenizer, myaudio, sub_start, sub_end,
                                              index)

        # Initial audio has been split with start &amp; end values
        # It begins at 0s, so the timestamps need to be adjusted by +start*1000 to compensate for the offset
        if transcription != "":
            save_result, txt_text, srt_text, srt_index = display_transcription(diarization_token, summarize_token,
                                                                    srt_token, timestamps_token,
                                                                    transcription, save_result, txt_text,
                                                                    srt_text,
                                                                    srt_index, sub_start + start * 1000,
                                                                    sub_end + start * 1000, elt)
    return save_result, txt_text, srt_text</code></pre>



<p>The <em>display_transcription()</em> function returns 3 values for the moment, contrary to what we have just indicated in the <em>transcription_diarization()</em> function. Don&#8217;t worry, we will fix the <em>display_transcription()</em> function in a few moments.</p>



<p>You will also need the function below. It allows the user to validate their access token for the diarization model and reach the home page of our app. Indeed, we are going to create another page, displayed by default, which invites the user to enter their token if they wish.</p>



<pre class="wp-block-code"><code class="">def confirm_token_change(hf_token, page_index):
    """
    A function that saves the hugging face token entered by the user.
    It also updates the page index variable so we can indicate we now want to display the home page instead of the token page
    :param hf_token: user's token
    :param page_index: number that represents the home page index (mentioned in the main.py file)
    """
    update_session_state("my_HF_token", hf_token)
    update_session_state("page_index", page_index)</code></pre>



<h3 class="wp-block-heading" id="block-e5d0971f-6a22-467e-ba4a-bc5e19572c83">Display the transcript correctly</h3>



<p id="block-8bf020a4-d6a8-4681-afac-6609255df1b6">Once the transcript is obtained, we must <strong>display</strong> it correctly, <strong>depending on the options</strong> the user has selected.</p>



<p id="block-4938f147-7b32-453e-81d8-09a15fb6d99b">For example, if the user has activated diarization, we need to write the identified speaker before each transcript, like the following result:</p>



<p><em>Speaker1 : &#8220;I would like a cup of tea&#8221;</em></p>



<p id="block-4938f147-7b32-453e-81d8-09a15fb6d99b">This is different from a classic <em>silences detection</em> method, which only writes the transcript, without any names!</p>



<p id="block-c5971940-a2fd-4c7c-9e52-aee8049adce4">There is the same case with the timestamps. We must know if we need to display them or not. We then have <strong>4 different cases</strong>:</p>



<ul class="wp-block-list">
<li>diarization with timestamps, named <strong>DIA_TS</strong></li>



<li>diarization without timestamps, named <strong>DIA</strong></li>



<li>non_diarization with timestamps, named <strong>NODIA_TS</strong></li>



<li>non_diarization without timestamps, named <strong>NODIA</strong></li>
</ul>
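<p>These four mode names can be derived from the two user choices. A possible sketch (the exact way the app builds its <em>chosen_mode</em> value may differ):</p>

```python
def build_chosen_mode(diarization_token, timestamps_token):
    """Map the two booleans chosen in the form to one of the 4 mode names."""
    mode = "DIA" if diarization_token else "NODIA"
    if timestamps_token:
        mode += "_TS"
    return mode

print(build_chosen_mode(True, True))    # DIA_TS
print(build_chosen_mode(False, False))  # NODIA
```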



<p id="block-0620b0c4-2413-4623-90fa-706a1c50c8d6">To display the correct elements according to the chosen mode, let&#8217;s <strong>modify the <em>display_transcription()</em> function</strong>. </p>



<p id="block-0620b0c4-2413-4623-90fa-706a1c50c8d6"><strong>Replace the old one</strong> by the following code:</p>



<pre id="block-e20b214e-af00-4c59-a5e5-6196a14445b1" class="wp-block-code"><code class="">def display_transcription(diarization_token, summarize_token, srt_token, timestamps_token, transcription, save_result, txt_text, srt_text, srt_index, sub_start, sub_end, elt=None):
    """
    Display results
    :param diarization_token: Differentiate or not the speakers (choice fixed by user)
    :param summarize_token: Summarize or not the transcript (choice fixed by user)
    :param srt_token: Enable/Disable generate srt file (choice fixed by user)
    :param timestamps_token: Display and save or not the timestamps (choice fixed by user)
    :param transcription: transcript of the considered audio
    :param save_result: whole process
    :param txt_text: generated .txt transcript
    :param srt_text: generated .srt transcript
    :param srt_index : numeric counter that identifies each sequential subtitle
    :param sub_start: start value (s) of the considered audio part to transcribe
    :param sub_end: end value (s) of the considered audio part to transcribe
    :param elt: timestamp (diarization case only, otherwise elt = None)
    """
    # Display will be different depending on the mode (dia, no dia, dia_ts, nodia_ts)
    
    # diarization mode
    if diarization_token:
        if summarize_token:
            update_session_state("summary", transcription + " ", concatenate_token=True)
        
        if not timestamps_token:
            temp_transcription = elt[2] + " : " + transcription
            st.write(temp_transcription + "\n\n")

            save_result.append([int(elt[2][-1]), elt[2], " : " + transcription])
            
        elif timestamps_token:
            temp_timestamps = str(timedelta(milliseconds=sub_start)).split(".")[0] + " --&gt; " + \
                              str(timedelta(milliseconds=sub_end)).split(".")[0] + "\n"
            temp_transcription = elt[2] + " : " + transcription
            temp_list = [temp_timestamps, int(elt[2][-1]), elt[2], " : " + transcription, int(sub_start / 1000)]
            save_result.append(temp_list)
            st.button(temp_timestamps, on_click=click_timestamp_btn, args=(sub_start,))
            st.write(temp_transcription + "\n\n")
            
            if srt_token:
                srt_text, srt_index = optimize_subtitles(transcription, srt_index, sub_start, sub_end, srt_text)


    # Non diarization case
    else:
        if not timestamps_token:
            save_result.append([transcription])
            st.write(transcription + "\n\n")
            
        else:
            temp_timestamps = str(timedelta(milliseconds=sub_start)).split(".")[0] + " --&gt; " + \
                              str(timedelta(milliseconds=sub_end)).split(".")[0] + "\n"
            temp_list = [temp_timestamps, transcription, int(sub_start / 1000)]
            save_result.append(temp_list)
            st.button(temp_timestamps, on_click=click_timestamp_btn, args=(sub_start,))
            st.write(transcription + "\n\n")
            
            if srt_token:
                srt_text, srt_index = optimize_subtitles(transcription, srt_index, sub_start, sub_end, srt_text)

        txt_text += transcription + " "  # So x seconds sentences are separated

    return save_result, txt_text, srt_text, srt_index</code></pre>



<p id="block-0851dc0b-f347-4738-ab2d-5b9c0227c253">We also need to <strong>add the following function</strong> which allow us to create our <em>txt_text</em> variable from the <em>st.session.state[&#8216;process&#8217;]</em> variable in a diarization case. This is necessary because, in addition to displaying the spoken sentence which means the transcript part, we must display the identity of the speaker, and eventually the timestamps, which are all stored in the session state variable. </p>



<pre id="block-618a8493-cb76-4fe4-9fce-24287e9eb588" class="wp-block-code"><code class="">def create_txt_text_from_process(punctuation_token=False, t5_model=None, t5_tokenizer=None):
    """
    If we are in a diarization case (differentiate speakers), we create txt_text from st.session.state['process']
    There is a lot of information in the process variable, but we only extract the identity of the speaker and
    the sentence spoken, as in a non-diarization case.
    :param punctuation_token: Punctuate or not the transcript (choice fixed by user)
    :param t5_model: T5 Model (Auto punctuation model)
    :param t5_tokenizer: T5’s Tokenizer (Auto punctuation model's tokenizer)
    :return: Final transcript (without timestamps)
    """
    txt_text = ""
    # The information to be extracted is different according to the chosen mode
    if punctuation_token:
        with st.spinner("Transcription is finished! Let us punctuate your audio"):
            if st.session_state["chosen_mode"] == "DIA":
                for elt in st.session_state["process"]:
                    # [2:] don't want ": text" but only the "text"
                    text_to_punctuate = elt[2][2:]
                    if len(text_to_punctuate) &gt;= 512:
                        text_to_punctutate_list = split_text(text_to_punctuate, 512)
                        punctuated_text = ""
                        for split_text_to_punctuate in text_to_punctutate_list:
                            punctuated_text += add_punctuation(t5_model, t5_tokenizer, split_text_to_punctuate)
                    else:
                        punctuated_text = add_punctuation(t5_model, t5_tokenizer, text_to_punctuate)

                    txt_text += elt[1] + " : " + punctuated_text + '\n\n'

            elif st.session_state["chosen_mode"] == "DIA_TS":
                for elt in st.session_state["process"]:
                    text_to_punctuate = elt[3][2:]
                    if len(text_to_punctuate) &gt;= 512:
                        text_to_punctutate_list = split_text(text_to_punctuate, 512)
                        punctuated_text = ""
                        for split_text_to_punctuate in text_to_punctutate_list:
                            punctuated_text += add_punctuation(t5_model, t5_tokenizer, split_text_to_punctuate)
                    else:
                        punctuated_text = add_punctuation(t5_model, t5_tokenizer, text_to_punctuate)

                    txt_text += elt[2] + " : " + punctuated_text + '\n\n'
    else:
        if st.session_state["chosen_mode"] == "DIA":
            for elt in st.session_state["process"]:
                txt_text += elt[1] + elt[2] + '\n\n'

        elif st.session_state["chosen_mode"] == "DIA_TS":
            for elt in st.session_state["process"]:
                txt_text += elt[2] + elt[3] + '\n\n'

    return txt_text</code></pre>



<p>To display the results correctly, we also need to <strong>update the <em>display_results()</em> function</strong> so that it adapts the display to the selected mode among DIA_TS, DIA, NODIA_TS and NODIA. This also avoids <em>&#8216;List index out of range&#8217;</em> errors, as the <em>process</em> variable does not contain the same number of elements depending on the mode used.</p>



<pre class="wp-block-code"><code class=""># Update the following function code
def display_results():

    # Add a button to return to the main page
    st.button("Load another file", on_click=update_session_state, args=("page_index", 0,))

    # Display results
    st.audio(st.session_state['audio_file'], start_time=st.session_state["start_time"])

    # Display results of transcript by steps
    if st.session_state["process"] != []:

        if st.session_state["chosen_mode"] == "NODIA":  # Non diarization, non timestamps case
            for elt in (st.session_state['process']):
                st.write(elt[0])

        elif st.session_state["chosen_mode"] == "DIA":  # Diarization without timestamps case
            for elt in (st.session_state['process']):
                st.write(elt[1] + elt[2])

        elif st.session_state["chosen_mode"] == "NODIA_TS":  # Non diarization with timestamps case
            for elt in (st.session_state['process']):
                st.button(elt[0], on_click=update_session_state, args=("start_time", elt[2],))
                st.write(elt[1])

        elif st.session_state["chosen_mode"] == "DIA_TS":  # Diarization with timestamps case
            for elt in (st.session_state['process']):
                st.button(elt[0], on_click=update_session_state, args=("start_time", elt[4],))
                st.write(elt[2] + elt[3])

    # Display final text
    st.subheader("Final text is")
    st.write(st.session_state["txt_transcript"])

    # Display Summary
    if st.session_state["summary"] != "":
        with st.expander("Summary"):
            st.write(st.session_state["summary"])

    # Display the buttons in a list to avoid having empty columns (explained in the transcription() function)
    col1, col2, col3, col4 = st.columns(4)
    col_list = [col1, col2, col3, col4]
    col_index = 0

    for elt in st.session_state["btn_token_list"]:
        if elt[0]:
            mycol = col_list[col_index]
            if elt[1] == "useless_txt_token":
                # Download your transcription.txt
                with mycol:
                    st.download_button("Download as TXT", st.session_state["txt_transcript"],
                                       file_name="my_transcription.txt")

            elif elt[1] == "srt_token":
                # Download your transcription.srt
                with mycol:
                    st.download_button("Download as SRT", st.session_state["srt_txt"], file_name="my_transcription.srt")
            elif elt[1] == "dia_token":
                with mycol:
                    # Rename the speakers detected in your audio
                    st.button("Rename Speakers", on_click=update_session_state, args=("page_index", 2,))

            elif elt[1] == "summarize_token":
                with mycol:
                    st.download_button("Download Summary", st.session_state["summary"], file_name="my_summary.txt")
            col_index += 1</code></pre>



<p>We then display <strong>4 buttons</strong> that allow you to <strong>interact with the implemented functions</strong> (download the transcript in <em>.txt</em> or <em>.srt</em> format, download the summary, and rename the speakers).</p>



<p>These buttons are placed in 4 columns, which allows them to be displayed on one line. The problem is that these options are sometimes enabled and sometimes not. If we statically assign each button to a column, we risk ending up with an empty column among the four, which would not be aesthetically pleasing.</p>



<p>This is where the <em>token_list</em> comes in! It is a list of lists: each element is a pair whose first item is the token&#8217;s value and whose second item is its name. For example, the <em>token_list</em> may contain the pair <em>[True, &#8220;dia_token&#8221;]</em>, which means that the diarization option has been selected.</p>



<p>From this, we assign a button to a column only if its token is set to <em>True</em>. If the token is set to <em>False</em>, we reuse the same column for the next token. This avoids creating an empty column.</p>
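


<p>To make the idea concrete, here is a minimal standalone sketch of this column-filling logic (plain Python, with the Streamlit columns replaced by a simple list so it can run anywhere; the <em>assign_buttons_to_columns</em> name is ours, for illustration only):</p>



<pre class="wp-block-code"><code class=""># Sketch: pack the enabled buttons into the leftmost columns,
# skipping tokens set to False so no column is left empty in between
def assign_buttons_to_columns(token_list, number_of_columns=4):
    columns = [None] * number_of_columns
    col_index = 0
    for enabled, name in token_list:
        if enabled:
            columns[col_index] = name
            col_index += 1
    return columns

token_list = [[True, "dia_token"], [True, "useless_txt_token"],
              [False, "srt_token"], [True, "summarize_token"]]
print(assign_buttons_to_columns(token_list))
# Output: ['dia_token', 'useless_txt_token', 'summarize_token', None]</code></pre>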



<h3 class="wp-block-heading" id="block-a695aa41-4c72-4fb6-a38a-634ea54d5710"><strong>Rename Speakers</strong></h3>



<p id="block-d589a3b7-3887-48cb-a7fa-a95648d12309">Of course, it would be interesting to have the possibility to <strong>rename the detected speakers</strong> in the audio file. Indeed, having <em>Speaker0, Speaker1,</em> &#8230; is fine but it could be so much better with <strong>real names</strong>! Guess what? We are going to do this!</p>



<p id="block-9a0918b6-4b78-43e8-b25a-4b8b6f3bbe02">First, we will <strong>create a list</strong> where we will <strong>add each speaker with its &#8216;ID&#8217;</strong> (<em>ex: Speaker1 has 1 as its ID</em>).</p>



<p>Unfortunately, the diarization <strong>does not sort the detected speakers</strong>. For example, the first one detected might be Speaker3, followed by Speaker0, then Speaker2. This is why it is important to sort this list by ID, placing the lowest ID as the first element of our list. This prevents names from being swapped between speakers.</p>
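


<p>As a quick standalone illustration (independent of the app, with made-up detection data), building the unique speaker list and then sorting it by ID keeps each name attached to the right speaker:</p>



<pre class="wp-block-code"><code class=""># Speakers as detected, in order of first appearance (not sorted by ID)
detected = [[3, "Speaker3"], [0, "Speaker0"], [2, "Speaker2"], [0, "Speaker0"]]

# Keep only unique [ID, name] pairs
list_of_speakers = []
for speaker in detected:
    if speaker not in list_of_speakers:
        list_of_speakers.append(speaker)

# Sorting on the first element (the ID) fixes the order once and for all
list_of_speakers.sort()
print(list_of_speakers)
# Output: [[0, 'Speaker0'], [2, 'Speaker2'], [3, 'Speaker3']]</code></pre>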



<p>Once this is done, we need to find a way for the user to interact with this list and modify the names contained in it. </p>



<p>➡️ We are going to create a third page that will be dedicated to this functionality. On this page, we will display each name contained in the list in a <em>st.text_area()</em> widget. The user will be able to see how many people have been detected in his audio and the automatic names (<em>Speaker0, Speaker1,</em> &#8230;) that have been assigned to them, as the screen below shows:</p>



<figure class="wp-block-image aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/11/rename-speakers-page.png" alt="speech to text application speakers differentiation" class="wp-image-24106" width="661" height="406" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/11/rename-speakers-page.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2022/11/rename-speakers-page-300x184.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/11/rename-speakers-page-768x471.png 768w" sizes="auto, (max-width: 661px) 100vw, 661px" /><figcaption class="wp-element-caption">Overview of the <em>Rename Speakers</em> page</figcaption></figure>



<p id="block-942f97d7-aa6c-4931-8e46-efda8d3c22d3">The user is able to modify this text area. Indeed, he can replace each name with the one he wants, but he must <strong>respect the one name per line format</strong>. When he has finished, he can <strong>save his modifications</strong> by <strong>clicking a &#8220;<em>Save changes</em>&#8221; button</strong>, which <strong>calls the callback function <em>click_confirm_rename_btn</em>()</strong> that we will define just after. We also display a <strong>&#8220;<em>Cancel</em>&#8221; button</strong> that will redirect the user to the results page.</p>



<p>This whole process is handled by the <em>rename_speakers_window()</em> function. <strong>Add it to your code:</strong></p>



<pre id="block-d94c23c7-cb50-4a25-9472-f1f59f798e94" class="wp-block-code"><code class="">def rename_speakers_window():
    """
    Load a new page which allows the user to rename the different speakers from the diarization process
    For example he can switch from "Speaker1 : "I wouldn't say that"" to "Mat : "I wouldn't say that""
    """

    st.subheader("Here you can rename the speakers as you want")
    number_of_speakers = st.session_state["number_of_speakers"]

    if number_of_speakers &gt; 0:
        # Handle displayed text according to the number_of_speakers
        if number_of_speakers == 1:
            st.write(str(number_of_speakers) + " speaker has been detected in your audio")
        else:
            st.write(str(number_of_speakers) + " speakers have been detected in your audio")

        # Saving the Speaker Name and its ID in a list, example : [1, 'Speaker1']
        list_of_speakers = []
        for elt in st.session_state["process"]:
            if st.session_state["chosen_mode"] == "DIA_TS":
                if [elt[1], elt[2]] not in list_of_speakers:
                    list_of_speakers.append([elt[1], elt[2]])
            elif st.session_state["chosen_mode"] == "DIA":
                if [elt[0], elt[1]] not in list_of_speakers:
                    list_of_speakers.append([elt[0], elt[1]])

        # Sorting (by ID)
        list_of_speakers.sort()  # [[1, 'Speaker1'], [0, 'Speaker0']] =&gt; [[0, 'Speaker0'], [1, 'Speaker1']]

        # Display saved names so the user can modify them
        initial_names = ""
        for elt in list_of_speakers:
            initial_names += elt[1] + "\n"

        names_input = st.text_area("Just replace the names without changing the format (one per line)",
                                   value=initial_names)

        # Display Options (Cancel / Save)
        col1, col2 = st.columns(2)
        with col1:
            # Cancel changes by clicking a button - callback function to return to the results page
            st.button("Cancel", on_click=update_session_state, args=("page_index", 1,))
        with col2:
            # Confirm changes by clicking a button - callback function to apply changes and return to the results page
            st.button("Save changes", on_click=click_confirm_rename_btn, args=(names_input, number_of_speakers, ))

    # Don't have anyone to rename
    else:
        st.error("0 speakers have been detected. Seems there is an issue with the diarization")
        with st.spinner("Redirecting to transcription page"):
            time.sleep(4)
            # return to the results page
            update_session_state("page_index", 1)</code></pre>



<p id="block-8897a8bc-7dd3-43cc-8f83-851e8a9aea8e">Now, <strong>write the callback function</strong> that is called when the <em>&#8220;Save changes&#8221;</em> button is clicked. It <strong>saves the new speakers&#8217; names</strong> in the <em>process</em> session state variable and <strong>recreates the displayed text with the new names given by the user</strong>, thanks to the previously defined <em>create_txt_text_from_process()</em> function. Finally, it <strong>redirects the user to the results page</strong>.</p>



<pre id="block-b64adcbc-4c78-4451-aa88-d8fc7de54e53" class="wp-block-code"><code class="">def click_confirm_rename_btn(names_input, number_of_speakers):
    """
    If the user decides to rename speakers and confirms his choices, we apply the modifications to our transcript
    Then we return to the results page of the app
    :param names_input: string containing the new names, one per line
    :param number_of_speakers: Number of detected speakers in the audio file
    """

    try:
        names_input = names_input.split("\n")[:number_of_speakers]

        for elt in st.session_state["process"]:
            elt[2] = names_input[elt[1]]

        txt_text = create_txt_text_from_process()
        update_session_state("txt_transcript", txt_text)
        update_session_state("page_index", 1)

    except TypeError:  # list indices must be integers or slices, not str (happened to me once when writing nonsense names)
        st.error("Please respect the 1 name per line format")
        with st.spinner("We are relaunching the page"):
            time.sleep(3)
            update_session_state("page_index", 1)</code></pre>



<h3 class="wp-block-heading" id="block-b681491a-4806-4d35-a6ed-5b6bc70dc839">Create subtitles for videos (.SRT)</h3>



<p>The idea here is very simple and the process is the same as before. We just have to <strong>shorten the timestamps</strong> by adjusting the <em>min_space</em> and <em>max_space</em> values, so we get a <strong>good video-subtitles synchronization</strong>.</p>



<p id="block-f5aaab26-bda2-468f-8ea0-15bbcc374fda">Indeed, remember that <strong>subtitles must correspond to small time windows</strong> to have <strong>small synchronized transcripts</strong>. Otherwise, there will be too much text. That&#8217;s why we <strong>set the <em>min_space</em> to 1s and the <em>max_space</em> to 8s</strong> instead of the classic min: 25s and max: 45s values.</p>



<pre id="block-6889eddb-3ada-4d53-9d6a-6c132539ff62" class="wp-block-code"><code class="">def silence_mode_init(srt_token):
    """
    Fix min_space and max_space values
    If the user wants a srt file, we need to have tiny timestamps
    :param srt_token: Enable/Disable generate srt file option (choice fixed by user)
    :return: min_space and max_space values
    """

    if srt_token:
        # We need short intervals if we want a short text
        min_space = 1000  # 1 sec
        max_space = 8000  # 8 secs

    else:

        min_space = 25000  # 25 secs
        max_space = 45000  # 45 secs
    return min_space, max_space</code></pre>



<h3 class="wp-block-heading">Update old code</h3>



<p id="block-8e4448c3-d843-4e10-a357-7784ef86c67c">As we have <strong>a lot of new parameters</strong> <em>(diarization_token, timestamps_token, summarize_token, &#8230;)</em> in our <em>display_transcription()</em> function, we need to <strong>update our <em>transcription_non_diarization()</em> function</strong> so it can interact with these new parameters and display the transcript correctly.</p>



<pre id="block-4ef2e66a-16f6-4bf8-9777-2cf050df76a6" class="wp-block-code"><code class="">def transcription_non_diarization(filename, myaudio, start, end, diarization_token, timestamps_token, srt_token,
                                  summarize_token, stt_model, stt_tokenizer, min_space, max_space, save_result,
                                  txt_text, srt_text):
    """
    Performs transcribing action with the non-diarization mode
    :param filename: name of the audio file
    :param myaudio: AudioSegment file
    :param start: int value (s) given by st.slider() (fixed by user)
    :param end: int value (s) given by st.slider() (fixed by user)
    :param diarization_token: Differentiate or not the speakers (choice fixed by user)
    :param timestamps_token: Display and save or not the timestamps (choice fixed by user)
    :param srt_token: Enable/Disable generate srt file (choice fixed by user)
    :param summarize_token: Summarize or not the transcript (choice fixed by user)
    :param stt_model: Speech to text model
    :param stt_tokenizer: Speech to text model's tokenizer
    :param min_space: Minimum temporal distance between two silences
    :param max_space: Maximum temporal distance between two silences
    :param save_result: list storing the whole transcription process
    :param txt_text: generated .txt transcript
    :param srt_text: generated .srt transcript
    :return: results of transcribing action
    """

    # Numeric counter identifying each sequential subtitle
    srt_index = 1

    # get silences
    silence_list = detect_silences(myaudio)
    if silence_list != []:
        silence_list = get_middle_silence_time(silence_list)
        silence_list = silences_distribution(silence_list, min_space, max_space, start, end, srt_token)
    else:
        silence_list = generate_regular_split_till_end(silence_list, int(end), min_space, max_space)

    # Transcribe each audio chunk (from timestamp to timestamp) and display transcript
    for i in range(0, len(silence_list) - 1):
        sub_start = silence_list[i]
        sub_end = silence_list[i + 1]

        transcription = transcribe_audio_part(filename, stt_model, stt_tokenizer, myaudio, sub_start, sub_end, i)

        # Initial audio has been split with start &amp; end values
        # It begins at 0 s, but the timestamps need to be adjusted by +start*1000 to compensate for the offset
        if transcription != "":
            save_result, txt_text, srt_text, srt_index = display_transcription(diarization_token, summarize_token,
                                                                    srt_token, timestamps_token,
                                                                    transcription, save_result,
                                                                    txt_text,
                                                                    srt_text,
                                                                    srt_index, sub_start + start * 1000,
                                                                    sub_end + start * 1000)

    return save_result, txt_text, srt_text</code></pre>



<p id="block-7f6d7de0-2aca-44d7-9854-9f86022a8fbc"><strong>Also, you need to add these new parameters</strong> to the <em>transcript_from_url()</em> and <em>transcript_from_file()</em> functions.</p>



<pre id="block-0fd5d977-eb66-436f-87a0-a0f564ae3268" class="wp-block-code"><code class="">def transcript_from_url(stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline):
    """
    Displays a text input area, where the user can enter a YouTube URL link. If the link seems correct, we try to
    extract the audio from the video, and then transcribe it.

    :param stt_tokenizer: Speech to text model's tokenizer
    :param stt_model: Speech to text model
    :param t5_tokenizer: Auto punctuation model's tokenizer
    :param t5_model: Auto punctuation model
    :param summarizer: Summarizer model
    :param dia_pipeline: Diarization Model (Differentiate speakers)
    """

    url = st.text_input("Enter the YouTube video URL then press Enter to confirm!")
    # If link seems correct, we try to transcribe
    if "youtu" in url:
        filename = extract_audio_from_yt_video(url)
        if filename is not None:
            transcription(stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline, filename)
        else:
            st.error("We were unable to extract the audio. Please verify your link, retry or choose another video")</code></pre>



<pre id="block-98c50033-3932-4b8c-8baa-647defb92303" class="wp-block-code"><code class="">def transcript_from_file(stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline):
    """
    Displays a file uploader area, where the user can import his own file (mp3, mp4 or wav). If the file format seems
    correct, we transcribe the audio.
    """

    # File uploader widget with a callback function, so the page reloads if the user uploads a new audio file
    uploaded_file = st.file_uploader("Upload your file! It can be a .mp3, .mp4 or .wav", type=["mp3", "mp4", "wav"],
                                     on_change=update_session_state, args=("page_index", 0,))

    if uploaded_file is not None:
        # get name and launch transcription function
        filename = uploaded_file.name
        transcription(stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline, filename, uploaded_file)</code></pre>



<p id="block-63b4dd8e-151e-4eb6-9588-d71f7ad078f8">Everything is almost ready, you can finally <strong>update the <em>transcription() </em>function</strong> so it can <strong>call all the new methods we have defined</strong>:</p>



<pre id="block-44999f30-2a48-4245-8b77-a177e94ecc97" class="wp-block-code"><code class="">def transcription(stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline, filename,
                  uploaded_file=None):
    """
    Mini-main function
    Display options, transcribe an audio file and save results.
    :param stt_tokenizer: Speech to text model's tokenizer
    :param stt_model: Speech to text model
    :param t5_tokenizer: Auto punctuation model's tokenizer
    :param t5_model: Auto punctuation model
    :param summarizer: Summarizer model
    :param dia_pipeline: Diarization Model (Differentiate speakers)
    :param filename: name of the audio file
    :param uploaded_file: file / name of the audio file which allows the code to reach the file
    """

    # If the audio comes from the Youtube extraction mode, the audio is downloaded so the uploaded_file is
    # the same as the filename. We need to change the uploaded_file which is currently set to None
    if uploaded_file is None:
        uploaded_file = filename

    # Get audio length of the file(s)
    myaudio = AudioSegment.from_file(uploaded_file)
    audio_length = myaudio.duration_seconds

    # Save Audio (so we can display it on another page ("DISPLAY RESULTS"), otherwise it is lost)
    update_session_state("audio_file", uploaded_file)

    # Display audio file
    st.audio(uploaded_file)

    # Is transcription possible
    if audio_length &gt; 0:

        # We display options and user shares his wishes
        transcript_btn, start, end, diarization_token, punctuation_token, timestamps_token, srt_token, summarize_token, choose_better_model = load_options(
            int(audio_length), dia_pipeline)

        # If the end value hasn't been changed, we set it to the max value so we don't cut off some ms of the audio,
        # because st.slider() returns the end value as an int (ex: it returns 12 s instead of end=12.9 s)
        if end == int(audio_length):
            end = audio_length

        # Switching to the better model
        if choose_better_model:
            with st.spinner("We are loading the better model. Please wait..."):

                try:
                    stt_tokenizer = pickle.load(open("models/STT_tokenizer2_wav2vec2-large-960h-lv60-self.sav", 'rb'))
                except FileNotFoundError:
                    stt_tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

                try:
                    stt_model = pickle.load(open("models/STT_model2_wav2vec2-large-960h-lv60-self.sav", 'rb'))
                except FileNotFoundError:
                    stt_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

        # Validate options and launch the transcription process thanks to the form's button
        if transcript_btn:

            # Check if start &amp; end values are correct
            start, end = correct_values(start, end, audio_length)

            # If start a/o end value(s) has/have changed, we trim/cut the audio according to the new start/end values.
            if start != 0 or end != audio_length:
                myaudio = myaudio[start * 1000:end * 1000]  # Works in milliseconds (*1000)

            # Transcribe process is running
            with st.spinner("We are transcribing your audio. Please wait"):

                # Initialize variables
                txt_text, srt_text, save_result = init_transcription(start, int(end))
                min_space, max_space = silence_mode_init(srt_token)

                # Differentiate speakers mode
                if diarization_token:

                    # Save mode chosen by user, to display expected results
                    if not timestamps_token:
                        update_session_state("chosen_mode", "DIA")
                    elif timestamps_token:
                        update_session_state("chosen_mode", "DIA_TS")

                    # Convert mp3/mp4 to wav (Differentiate speakers mode only accepts wav files)
                    if filename.endswith((".mp3", ".mp4")):
                        myaudio, filename = convert_file_to_wav(myaudio, filename)
                    else:
                        filename = "../data/" + filename
                        myaudio.export(filename, format="wav")

                    # Differentiate speakers process
                    diarization_timestamps, number_of_speakers = diarization_treatment(filename, dia_pipeline,
                                                                                       max_space, srt_token)
                    # Saving the number of detected speakers
                    update_session_state("number_of_speakers", number_of_speakers)

                    # Transcribe process with Diarization Mode
                    save_result, txt_text, srt_text = transcription_diarization(filename, diarization_timestamps,
                                                                                stt_model,
                                                                                stt_tokenizer,
                                                                                diarization_token,
                                                                                srt_token, summarize_token,
                                                                                timestamps_token, myaudio, start,
                                                                                save_result,
                                                                                txt_text, srt_text)

                # Non Diarization Mode
                else:
                    # Save mode chosen by user, to display expected results
                    if not timestamps_token:
                        update_session_state("chosen_mode", "NODIA")
                    elif timestamps_token:
                        update_session_state("chosen_mode", "NODIA_TS")

                    filename = "../data/" + filename
                    # Transcribe process with non Diarization Mode
                    save_result, txt_text, srt_text = transcription_non_diarization(filename, myaudio, start, end,
                                                                                    diarization_token, timestamps_token,
                                                                                    srt_token, summarize_token,
                                                                                    stt_model, stt_tokenizer,
                                                                                    min_space, max_space,
                                                                                    save_result, txt_text, srt_text)

                # Save results so it is not lost when we interact with a button
                update_session_state("process", save_result)
                update_session_state("srt_txt", srt_text)

                # Get final text (with or without punctuation token)
                # Diarization Mode
                if diarization_token:
                    # Create txt text from the process
                    txt_text = create_txt_text_from_process(punctuation_token, t5_model, t5_tokenizer)

                # Non diarization Mode
                else:

                    if punctuation_token:
                        # Need to split the text by 512 text blocks size since the model has a limited input
                        with st.spinner("Transcription is finished! Let us punctuate your audio"):
                            my_split_text_list = split_text(txt_text, 512)
                            txt_text = ""
                            # punctuate each text block
                            for my_split_text in my_split_text_list:
                                txt_text += add_punctuation(t5_model, t5_tokenizer, my_split_text)

                # Clean folder's files
                clean_directory("../data")

                # Display the final transcript
                if txt_text != "":
                    st.subheader("Final text is")

                    # Save txt_text and display it
                    update_session_state("txt_transcript", txt_text)
                    st.markdown(txt_text, unsafe_allow_html=True)

                    # Summarize the transcript
                    if summarize_token:
                        with st.spinner("We are summarizing your audio"):
                            # Display summary in a st.expander widget so we don't write too much text on the page
                            with st.expander("Summary"):
                                # Need to split the text by 1024 text blocks size since the model has a limited input
                                if diarization_token:
                                    # in diarization mode, the text to summarize is contained in the "summary" session state variable
                                    my_split_text_list = split_text(st.session_state["summary"], 1024)
                                else:
                                    # in non-diarization mode, it is contained in the txt_text variable
                                    my_split_text_list = split_text(txt_text, 1024)

                                summary = ""
                                # Summarize each text block
                                for my_split_text in my_split_text_list:
                                    summary += summarizer(my_split_text)[0]['summary_text']

                                # Removing multiple spaces and double spaces around punctuation mark " . "
                                summary = re.sub(' +', ' ', summary)
                                summary = re.sub(r'\s+([?.!"])', r'\1', summary)

                                # Display summary and save it
                                st.write(summary)
                                update_session_state("summary", summary)

                    # Display buttons to interact with results

                    # We have 4 possible buttons depending on the user's choices. But we can't set 4 columns for 4
                    # buttons. Indeed, if the user displays only 3 buttons, it is possible that one of the column
                    # 1, 2 or 3 is empty which would be ugly. We want the activated options to be in the first columns
                    # so that the empty columns are not noticed. To do that, let's create a btn_token_list

                    btn_token_list = [[diarization_token, "dia_token"], [True, "useless_txt_token"],
                                      [srt_token, "srt_token"], [summarize_token, "summarize_token"]]

                    # Save this list to be able to reach it on the other pages of the app
                    update_session_state("btn_token_list", btn_token_list)

                    # Create 4 columns
                    col1, col2, col3, col4 = st.columns(4)

                    # Create a column list
                    col_list = [col1, col2, col3, col4]

                    # Check value of each token, if True, we put the respective button of the token in a column
                    col_index = 0
                    for elt in btn_token_list:
                        if elt[0]:
                            mycol = col_list[col_index]
                            if elt[1] == "useless_txt_token":
                                # Download your transcript.txt
                                with mycol:
                                    st.download_button("Download as TXT", txt_text, file_name="my_transcription.txt",
                                                       on_click=update_session_state, args=("page_index", 1,))
                            elif elt[1] == "srt_token":
                                # Download your transcript.srt
                                with mycol:
                                    update_session_state("srt_token", srt_token)
                                    st.download_button("Download as SRT", srt_text, file_name="my_transcription.srt",
                                                       on_click=update_session_state, args=("page_index", 1,))
                            elif elt[1] == "dia_token":
                                with mycol:
                                    # Rename the speakers detected in your audio
                                    st.button("Rename Speakers", on_click=update_session_state, args=("page_index", 2,))

                            elif elt[1] == "summarize_token":
                                with mycol:
                                    # Download the summary of your transcript.txt
                                    st.download_button("Download Summary", st.session_state["summary"],
                                                       file_name="my_summary.txt",
                                                       on_click=update_session_state, args=("page_index", 1,))
                            col_index += 1

                else:
                    st.write("Transcription impossible, a problem occurred with your audio or your parameters, "
                             "we apologize :(")

    else:
        st.error("Seems your audio is 0 s long, please change your file")
        time.sleep(3)
        st.stop()</code></pre>



<p id="block-c336e842-7a4e-48a6-ad47-c819523e6ef0">Finally, <strong>update the main code</strong> of the Python file, which lets us <strong>navigate between the different pages of our application</strong> (<em>token,</em> <em>home</em>, <em>results</em> and <em>rename pages</em>):</p>



<pre id="block-d623da5a-36b6-416a-a18c-79d7f2eb82fc" class="wp-block-code"><code class="">from app import *

if __name__ == '__main__':
    config()

    if st.session_state['page_index'] == -1:
        # Specify token page (mandatory to use the diarization option)
        st.warning('You must specify a token to use the diarization model. Otherwise, the app will be launched without this model. You can learn how to create your token here: https://huggingface.co/pyannote/speaker-diarization')
        text_input = st.text_input("Enter your Hugging Face token:", placeholder="ACCESS_TOKEN_GOES_HERE", type="password")

        # Confirm or continue without the option
        col1, col2 = st.columns(2)

        # save changes button
        with col1:
            confirm_btn = st.button("I have changed my token", on_click=confirm_token_change, args=(text_input, 0), disabled=st.session_state["disable"])
            # if text is changed, button is clickable
            if text_input != "ACCESS_TOKEN_GOES_HERE":
                st.session_state["disable"] = False

        # Continue without a token (there will be no diarization option)
        with col2:
            dont_mind_btn = st.button("Continue without this option", on_click=update_session_state, args=("page_index", 0))

    if st.session_state['page_index'] == 0:
        # Home page
        choice = st.radio("Features", ["By a video URL", "By uploading a file"])

        stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline = load_models()

        if choice == "By a video URL":
            transcript_from_url(stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline)

        elif choice == "By uploading a file":
            transcript_from_file(stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline)

    elif st.session_state['page_index'] == 1:
        # Results page
        display_results()

    elif st.session_state['page_index'] == 2:
        # Rename speakers page
        rename_speakers_window()</code></pre>



<p>The idea is the following: </p>



<p>The user arrives at <em>the token page</em> (whose index is -1), where a <em>st.warning()</em> invites him to enter his diarization access token into a <em>text_input()</em> widget. He can then enter his token and click the <em>confirm_btn</em>, which becomes clickable once the text has been changed. He can also choose not to use this option by clicking the <em>dont_mind</em> button. In both cases, the variable <em>page_index</em> will be updated to 0, and the application will then display the <em>home page</em> that allows the user to transcribe his files.</p>



<p>In this logic, you will understand that the session variable <em>page_index</em> must <strong>no longer be initialized to 0 </strong>(index of the <em>home page</em>), but <strong>to -1</strong>, in order to load the <em>token page</em>. For that, <strong>modify its initialization in the <em>config()</em> function</strong>:</p>



<pre class="wp-block-code"><code class=""># Modify the page_index initialization in the config() function

def config(): 

    # .... 

    if 'page_index' not in st.session_state:
        st.session_state['page_index'] = -1 </code></pre>



<h3 class="wp-block-heading">Conclusion</h3>



<p>Congratulations! Your Speech to Text Application is now full of features. Now it&#8217;s time to have fun with it! </p>



<p>You can transcribe audio files and videos, with or without punctuation. You can also generate synchronized subtitles. Finally, you have discovered how to differentiate speakers thanks to diarization, in order to follow a conversation more easily.</p>



<p>➡️ To <strong>significantly reduce the initialization time</strong> of the app and the <strong>execution time of the transcription</strong>, we recommend that you deploy your speech to text app on powerful GPU resources with <strong>AI Deploy</strong>. To learn how to do it, please refer to&nbsp;<a href="https://docs.ovh.com/gb/en/publiccloud/ai/deploy/tuto-streamlit-speech-to-text-app/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">this documentation</a>.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-build-a-speech-to-text-application-with-python-3-3%2F&amp;action_name=How%20to%20build%20a%20Speech-To-Text%20Application%20with%20Python%20%283%2F3%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to build a Speech-To-Text Application with Python (2/3)</title>
		<link>https://blog.ovhcloud.com/how-to-build-a-speech-to-text-application-with-python-2-3/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Wed, 14 Dec 2022 09:26:39 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Apps]]></category>
		<category><![CDATA[AI Solutions]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[Streamlit]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=23283</guid>

					<description><![CDATA[A tutorial to create and build your own Speech-To-Text Application with Python. At the end of this second article, your Speech-To-Text application will be more interactive and visually better. Indeed, we are going to center our titles and justify our transcript. We will also add some useful buttons (to download the transcript, to play with [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-build-a-speech-to-text-application-with-python-2-3%2F&amp;action_name=How%20to%20build%20a%20Speech-To-Text%20Application%20with%20Python%20%282%2F3%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>A tutorial to create and build your own <strong>Speech-To-Text Application</strong></em> with Python.</p>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-2-1024x576.png" alt="speech to text app image2" class="wp-image-24059" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-2-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-2-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-2-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-2-1536x864.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/speech-to-text-app-2.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>At the end of this second article, your Speech-To-Text application will be <strong>more interactive</strong> and<strong> visually better</strong>. </p>



<p>Indeed, we are going to <strong>center</strong> our titles and <strong>justify</strong> our transcript. We will also add some useful <strong>buttons</strong> (to download the transcript, to play with the timestamps). Finally, we will prepare the application for the next tutorial by displaying <strong>sliders and checkboxes</strong> to interact with the next functionalities (speaker differentiation, summarization, video subtitles generation, &#8230;)</p>



<p><em>Final code of the app is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/tree/main/apps/streamlit/speech-to-text" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>.</em></p>



<h3 class="wp-block-heading">Overview of our final app</h3>



<figure class="wp-block-image aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="575" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-1024x575.png" alt="speech to text streamlit app" class="wp-image-23277" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-1024x575.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview-1536x863.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/App_Overview.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="has-text-align-center"><em>Overview of our final Speech-To-Text application</em></p>



<h3 class="wp-block-heading">Objective</h3>



<p>In the <a href="https://blog.ovhcloud.com/how-to-build-a-speech-to-text-application-with-python-1-3/" data-wpel-link="internal">previous article</a>, we saw how to build a <strong>basic</strong> <strong>Speech-To-Text application</strong> with <em>Python</em> and <em>Streamlit</em>. In this tutorial, we will <strong>improve </strong>this application by <strong>changing its appearance</strong>, <strong>improving its interactivity</strong> and <strong>preparing features</strong> used in the <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/natural-language-processing/speech-to-text/miniconda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">notebooks</a> (transcribe a specific audio part, differentiate speakers, generate video subtitles, punctuate and summarize the transcript, &#8230;) that we will implement in the last tutorial! </p>



<p>This article is organized as follows:</p>



<ul class="wp-block-list">
<li>Python libraries</li>



<li>Change appearance with CSS</li>



<li>Improve the app&#8217;s interactivity</li>



<li>Prepare new functionalities</li>
</ul>



<p><em>⚠️ Since this article uses code already explained in the previous <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/natural-language-processing/speech-to-text/miniconda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">notebook tutorials</a>, we will not re-explain it here. We therefore recommend that you read the notebooks first.</em></p>



<h4 class="wp-block-heading">1. Python libraries</h4>



<p>To implement our final features (speakers differentiation, summarization, &#8230;) to our speech to text app, we need to <strong>import</strong> the following libraries into our <em>app.py</em> file. We will use them afterwards.</p>



<pre class="wp-block-code"><code class=""># Models
from pyannote.audio import Pipeline
from transformers import pipeline, HubertForCTC, T5Tokenizer, T5ForConditionalGeneration, Wav2Vec2ForCTC, Wav2Vec2Processor, Wav2Vec2Tokenizer
import pickle

# Others
import pandas as pd
import re</code></pre>



<h4 class="wp-block-heading">2. Change appearance with CSS</h4>



<p>Before adding or modifying anything, let&#8217;s <strong>improve the appearance</strong> of our application!</p>



<p>😕 Indeed, you may have noticed our <strong>transcript is not justified</strong>, <strong>titles are not centered</strong> and there is an <strong>unnecessary space</strong> at the top of the screen.</p>



<p>➡️ To solve this, let&#8217;s use the <em>st.markdown()</em> function to write some <strong>CSS code</strong> thanks to the &#8220;<em>style</em>&#8221; attribute! </p>



<p>Just <strong>add the following lines to the <em>config()</em> function</strong> we have created before, for example after the <em>st.title(&#8220;Speech to Text App 📝&#8221;)</em> line. This will tell <em>Streamlit</em> how it should display the mentioned elements.</p>



<pre class="wp-block-code"><code class="">    st.markdown("""
                    &lt;style&gt;
                    .block-container.css-12oz5g7.egzxvld2{
                        padding: 1%;}
                   
                    .stRadio &gt; label:nth-child(1){
                        font-weight: bold;
                        }
                    .stRadio &gt; div{flex-direction:row;}
                    p, span{ 
                        text-align: justify;
                    }
                    span{ 
                        text-align: center;
                    }
                    &lt;/style&gt;
                    """, unsafe_allow_html=True)</code></pre>



<p>We set the parameter &#8220;<em>unsafe_allow_html</em>&#8221; to &#8220;<em>True</em>&#8221; because HTML tags are escaped by default and therefore treated as pure text. Setting this argument to True turns off this behavior.</p>
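<p>To see what &#8220;escaped&#8221; means in practice, here is a quick illustration using Python&#8217;s standard <em>html</em> module (this only demonstrates the escaping concept; it is not Streamlit code):</p>

```python
# Illustration of HTML escaping: by default, Streamlit escapes HTML
# the same way, so tags are displayed as plain text instead of being
# interpreted by the browser. unsafe_allow_html=True skips this step.
from html import escape

raw = "<style>p { text-align: justify; }</style>"
escaped = escape(raw)
print(escaped)  # &lt;style&gt;p { text-align: justify; }&lt;/style&gt;
```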



<p class="has-text-align-left">⬇️ Let&#8217;s look at the result:</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:100%">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:100%">
<figure class="wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex">
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1554" height="873" data-id="23292" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/without_css-edited.png" alt="speech to text streamlit application without css" class="wp-image-23292" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/07/without_css-edited.png 1554w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/without_css-edited-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/without_css-edited-1024x575.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/without_css-edited-768x431.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/without_css-edited-1536x863.png 1536w" sizes="auto, (max-width: 1554px) 100vw, 1554px" /><figcaption class="wp-element-caption"><br></figcaption></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1630" height="915" data-id="23291" src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/with_css-edited.png" alt="speech to text streamlit application with css" class="wp-image-23291" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/07/with_css-edited.png 1630w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/with_css-edited-300x168.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/with_css-edited-1024x575.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/with_css-edited-768x431.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/07/with_css-edited-1536x862.png 1536w" sizes="auto, (max-width: 1630px) 100vw, 1630px" /></figure>
</figure>
</div>
</div>
</div>
</div>



<p class="has-text-align-center"><em>App without CSS (on the left) and with (on the right)</em></p>



<p>This is better now, isn&#8217;t it? </p>



<h4 class="wp-block-heading">3. Improve the app&#8217;s interactivity</h4>



<p>Now we will let the user <strong>interact</strong> with our application. It will no longer only generate a transcript.</p>



<p><strong>3.1 Download transcript</strong></p>



<p>We want the users to be able to <strong>download the generated transcript as a text file</strong>. This will save them from having to copy the transcript and paste it into a text file. We can do this easily with a <strong>download button widget</strong><em>.</em></p>



<p>Unfortunately, <em>Streamlit</em> does not help us with this feature. Indeed, <strong>each time you interact with a button</strong> on the page, the entire <em>Streamlit</em> <strong>script will be re-run</strong> so it will <strong>delete our displayed transcript</strong>. To observe this problem, <strong>add a download button </strong>to the app<strong>,</strong> just after the <em>st.write(txt_text)</em> in the <em>transcription()</em> function, thanks to the following code:</p>



<pre class="wp-block-code"><code class=""># Download transcript button - Add it to the transcription() function, after the st.write(txt_text) line
st.download_button("Download as TXT", txt_text, file_name="my_transcription.txt")</code></pre>



<p>Now, if you transcribe an audio file, you should see a download button at the bottom of the transcript, and clicking it will give you the transcript in .txt format as expected. But you will notice that the <strong>whole transcript disappears for no apparent reason</strong>, which is frustrating for the user, as the video below shows: </p>



<figure class="wp-block-video aligncenter"><video height="720" style="aspect-ratio: 1280 / 720;" width="1280" controls src="https://blog.ovhcloud.com/wp-content/uploads/2022/11/speech_to_text_app_click_button_issue.mp4"></video></figure>



<p class="has-text-align-center"><em>Video illustrating the issue with Streamlit button widgets</em></p>



<p>To solve this, we are going to use <strong><em>Streamlit</em> <em>session state</em></strong> and <strong><em>callback functions</em></strong>. Indeed, session state is a way to share variables between reruns. Since <em>Streamlit</em> reruns the app&#8217;s script when we click a button, this is the perfect solution!</p>
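<p>To make the rerun problem concrete, here is a small plain-Python sketch in which a dict stands in for <em>st.session_state</em> (an illustration of the mechanism, not actual Streamlit code):</p>

```python
# Plain-Python sketch of Streamlit's rerun behaviour: each button click
# re-runs the whole script, so local variables are rebuilt from scratch,
# while session state (modelled here by a persistent dict) survives.
session_state = {}  # stands in for st.session_state

def run_script():
    transcript = ""  # local variable: reset on every rerun
    if "txt_transcript" not in session_state:
        session_state["txt_transcript"] = ""  # initialized only once
    return transcript

run_script()
session_state["txt_transcript"] = "my transcript"  # saved during the first run

# A button click triggers a rerun: the local variable is empty again,
# but the value stored in session state is still there.
local_after_rerun = run_script()
print(local_after_rerun)                # ""
print(session_state["txt_transcript"])  # "my transcript"
```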



<p>➡️ First, let&#8217;s <strong>initialize four session state variables </strong>respectively called <em>audio_file</em>, <em>process, txt_transcript</em> and <em>page_index</em>. </p>



<p><em>As session state variables are initialized only once in the code, we can <strong>initialize them all at once</strong> as below</em>, <strong>in the <em>config()</em> function</strong>:</p>



<pre class="wp-block-code"><code class=""># Initialize session state variables
# Should be added to the config() function 
if 'page_index' not in st.session_state:
    st.session_state['audio_file'] = None
    st.session_state["process"] = []
    st.session_state['txt_transcript'] = ""
    st.session_state["page_index"] = 0</code></pre>



<p>The first one allows us to <strong>save the audio file</strong> of the user. Then, the <em>process</em> variable (which is a list) will <strong>contain each generated transcript part with its associated timestamps</strong>, while the third variable will only contain the concatenated transcripts, that is, the <strong>final text</strong>.</p>



<p>The last variable, <em>page_index</em>, will <strong>determine which page of our application will be displayed</strong> according to its value. Indeed, since clicking the download button removes the displayed transcript, we are going to <strong>create a second page</strong>, named <strong>results page</strong>, where we will <strong>display again</strong> the user&#8217;s audio file and the obtained transcript <strong>thanks to the values saved in our session state variables</strong>. We can then <strong>redirect the user</strong> to this second page as soon as the user <strong>clicks the download button</strong>. This will allow the user to always be able to see his transcript, even if he downloads it!</p>



<p>➡️ Once we have initialized the session state variables, we need to <strong>save</strong> the transcript with the associated timestamps and the final text in <strong>these variables</strong> so we do not lose this information when we click a button.</p>



<p>To do that, we need to <strong>define an <em>update_session_state()</em> function</strong> which will allow us to <strong>update our session state variables</strong>, either by <strong>replacing</strong> their content, or by <strong>concatenating</strong> it, which will be interesting for the transcripts since they are obtained step by step. Indeed, <strong>concatenating each transcript part will allow us to obtain the final transcript</strong>. Here is the function:</p>



<pre class="wp-block-code"><code class="">def update_session_state(var, data, concatenate_token=False):
    """
    A simple function to update a session state variable
    :param var: variable's name
    :param data: new value of the variable
    :param concatenate_token: do we replace or concatenate
    """

    if concatenate_token:
        st.session_state[var] += data
    else:
        st.session_state[var] = data</code></pre>
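<p>The replace / concatenate behaviour can be checked outside Streamlit with a dict standing in for <em>st.session_state</em> (an illustrative sketch, not part of the app itself):</p>

```python
# Standalone check of the update logic of update_session_state(),
# with a plain dict in place of st.session_state.
session_state = {"txt_transcript": "", "process": []}

def update_session_state(var, data, concatenate_token=False):
    if concatenate_token:
        session_state[var] += data  # concatenate (transcripts are built step by step)
    else:
        session_state[var] = data   # replace the whole value

# Transcript parts arrive one after the other and are concatenated
update_session_state("txt_transcript", "Hello ", concatenate_token=True)
update_session_state("txt_transcript", "world.", concatenate_token=True)

# The per-step results are saved in one go (replacement)
update_session_state("process", [["0:00:00", "Hello "], ["0:00:10", "world."]])

print(session_state["txt_transcript"])  # "Hello world."
```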



<p>This is where we will use the variable<em> save_result</em> from the previous article. Actually, <em>save_result </em>is a list which contains the timestamps and the generated transcript. This corresponds to what we want in the <em>process</em> state variable, which will allow us to retrieve the transcripts and associated timestamps and display them on our results page! </p>



<pre class="wp-block-code"><code class="">### Add this line to the transcription() function, after the transcription_non_diarization() call

# Save results
update_session_state("process", save_result)</code></pre>



<p>Let&#8217;s do the same with the <em>audio_file</em> and <em>txt_text</em> variables, so we can also re-display the audio player and the final text on our results page.</p>



<pre class="wp-block-code"><code class="">### Add this line to the transcription() function, after the st.audio(uploaded_file) line, to save the audio file

# Save Audio so it is not lost when we interact with a button (so we can display it on the results page)
update_session_state("audio_file", uploaded_file)</code></pre>



<pre class="wp-block-code"><code class="">### Add this line to the transcription() function, after the if txt_text != ""

# Save txt_text
update_session_state("txt_transcript", txt_text)</code></pre>



<p>Thanks to the content saved in our session state variables (<em>audio_file</em>, <em>process, txt_transcript</em>), we are ready to create our results page.</p>



<p><strong>3.2 Create the results page</strong> <strong>and switch to it</strong></p>



<p>First, we have to tell <em>Streamlit</em> that <strong>clicking the download button</strong> must <strong>change the <em>page_index</em></strong> <strong>value</strong>. Indeed, remember that its value determines which page of our app is displayed. </p>



<p>If this variable is 0, we will see the home page. If we click a download button, the app&#8217;s script is restarted and the transcript will disappear from the home page. But since the <em>page_index</em> value will now be set to 1 when a button is clicked, we will display the results page instead of the home page and we will no longer have an empty page. </p>



<p>To do this, we simply <strong>add the previous function</strong> to our download button thanks to the <em>on_click</em> parameter, so we can indicate to our app that we want to update the <em>page_index</em> session state variable from 0 to 1 (to go from the home page to the results page) when we click this button.</p>



<pre class="wp-block-code"><code class="">### Modify the code of the download button, in the transcription() function, at the end of the if txt_text != "" statement

st.download_button("Download as TXT", txt_text, file_name="my_transcription.txt", on_click=update_session_state, args=("page_index", 1,))</code></pre>



<p>Now that the <em>page_index</em> value is updated, we need to check its value to know if the displayed page should be the home page or the results page. </p>



<p>We do this value checking into the main code of our app. You can <strong>replace the old main code by the following one</strong>:</p>



<pre class="wp-block-code"><code class="">if __name__ == '__main__':
    config()

    # Default page
    if st.session_state['page_index'] == 0:
        choice = st.radio("Features", ["By a video URL", "By uploading a file"])

        stt_tokenizer, stt_model = load_models()

        if choice == "By a video URL":
            transcript_from_url(stt_tokenizer, stt_model)

        elif choice == "By uploading a file":
            transcript_from_file(stt_tokenizer, stt_model)

    # Results page
    elif st.session_state['page_index'] == 1:
        # Display Results page
        display_results()</code></pre>



<p>Now that we have created this page, all that remains is to display elements on it (titles, buttons, audio file, transcript)! </p>



<pre class="wp-block-code"><code class="">def display_results():

    st.button("Load another file", on_click=update_session_state, args=("page_index", 0,))
    st.audio(st.session_state['audio_file'])

    # Display results of transcription by steps
    if st.session_state["process"] != []:
        for elt in (st.session_state['process']):

            # Timestamp
            st.write(elt[0])

            # Transcript for this timestamp
            st.write(elt[1])

    # Display final text
    st.subheader("Final text is")
    st.write(st.session_state["txt_transcript"])

    
    # Download your transcription.txt
    st.download_button("Download as TXT", st.session_state["txt_transcript"], file_name="my_transcription.txt")</code></pre>



<p>👀 You may have noticed that at the beginning of the previous function, we added a <strong>&#8220;<em>Load another file</em>&#8221; button</strong>. If you look at it, you will see it has a <strong>callback function that updates the <em>page_index</em> to 0</strong>. In other words, this button <strong>allows the user to return to the home page </strong>so he can transcribe another file.</p>



<p>Now let&#8217;s see what happens when we interact with this download button:</p>



<figure class="wp-block-video"><video height="1080" style="aspect-ratio: 1920 / 1080;" width="1920" controls src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/click_button_streamlit_solve-3.mov"></video></figure>



<p class="has-text-align-center"><em>Video illustrating the solved issue with Streamlit button widgets</em></p>



<p>As you can see, <strong>clicking the download button no longer makes the transcript disappear</strong>, thanks to our results page! We still have the <em>st.audio()</em> widget, the <em>process</em> as well as the final text and the download button. We have solved our problem!</p>



<p><strong>3.3 Jump the audio player to each timestamp</strong></p>



<p>Our Speech-To-Text application would be so much better if the timestamps were displayed as buttons so that the user can <strong>click them and listen to the considered audio part</strong> thanks to the audio player widget we placed. Now that we know how to manipulate session state variables and callback functions, there is not much left to do 😁!</p>



<p>First, <strong>define a new session state variable</strong> in the config() function named <em><strong>start_time</strong></em>. It will indicate to our app where the <strong>starting point</strong> of the <em>st.audio() </em>widget should be. For the moment, it is always at 0s.</p>



<pre class="wp-block-code"><code class="">### Add this initialization to the config() function, with the other session state variables

st.session_state["start_time"] = 0</code></pre>



<p>Then, we define a new <strong>callback function</strong> that handles a <strong>timestamp button click</strong>. Just like before, it needs to <strong>redirect us to the results page</strong>, as we do not want the transcript to disappear. But it also needs to <strong>update the <em>start_time</em> variable</strong> to the beginning value of the timestamp button clicked by the user, so the starting point of the audio player can change. </p>



<p>For example, if the timestamp is [10s &#8211; 20s], we will set the starting point of the audio player to 10 seconds so that the user can check on the audio player the generated transcript for this part.</p>



<p>Here is the new callback function:</p>



<pre class="wp-block-code"><code class="">def click_timestamp_btn(sub_start):
    """
    When the user clicks a Timestamp button, we go to the results page and st.audio is set to the sub_start value.
    It allows the user to listen to the considered part of the audio
    :param sub_start: Beginning of the considered transcript (ms)
    """

    update_session_state("page_index", 1)
    update_session_state("start_time", int(sub_start / 1000)) # division to convert ms to s</code></pre>



<p>Now, we need to <strong>replace</strong> the timestamp text by a timestamp button, so we can click it.</p>



<p>To do this, just replace the <em>st.write(temp_timestamps)</em> of the <strong><em>display_transcription()</em> </strong>function<strong> with a button widget</strong> that calls our new callback function, with <em>sub_start</em> as an argument, which corresponds to the beginning value of the timestamp. In the previous example, <em>sub_start</em> would be 10s.</p>



<pre class="wp-block-code"><code class="">### Modify the code that displays the temp_timestamps variable, in the display_transcription() function
st.button(temp_timestamps, on_click=click_timestamp_btn, args=(sub_start,))</code></pre>



<p>To make it work, we also need to <strong>modify 3 lines of code in the <em>display_results()</em></strong> function, which manages the results page, because it needs to:</p>



<ul class="wp-block-list">
<li>Make the <em>st.audio()</em> widget start from the <em>start_time</em> session state value</li>



<li>Display timestamps of the results page as buttons instead of texts</li>



<li>Call the <em>update_session_state()</em> function when we click one of these buttons to update the <em>start_time</em> value, so it changes the starting point of the audio player </li>
</ul>



<pre class="wp-block-code"><code class="">def display_results():
    st.button("Load another file", on_click=update_session_state, args=("page_index", 0,))
    st.audio(st.session_state['audio_file'], start_time=st.session_state["start_time"])

    # Display results of transcription by steps
    if st.session_state["process"] != []:
        for elt in (st.session_state['process']):
            # Timestamp
            st.button(elt[0], on_click=update_session_state, args=("start_time", elt[2],))
            
            # Transcript for this timestamp
            st.write(elt[1])

    # Display final text
    st.subheader("Final text is")
    st.write(st.session_state["txt_transcript"])

    # Download your transcription.txt
    st.download_button("Download as TXT", st.session_state["txt_transcript"], file_name="my_transcription.txt")</code></pre>



<p>When you&#8217;ve done this, each timestamp button (home page and results page) will be able to change the starting point of the audio player, as you can see on this video:</p>



<figure class="wp-block-video"><video height="1080" style="aspect-ratio: 1920 / 1080;" width="1920" controls src="https://blog.ovhcloud.com/wp-content/uploads/2022/07/click_timestamp_btn.mov"></video></figure>



<p class="has-text-align-center"><em>Video illustrating the timestamp button click</em></p>



<p>This feature is really useful to easily check each of the obtained transcripts! </p>



<h4 class="wp-block-heading">4. Preparing new functionalities</h4>



<p>Now that the application is taking shape, it is time to add the many features we studied in the <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/natural-language-processing/speech-to-text/miniconda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">notebooks</a>.</p>



<p>Among these options are the possibility to:</p>



<ul class="wp-block-list">
<li>Trim/Cut an audio, if the user wants to transcribe only a specific part of the audio file</li>



<li>Differentiate speakers (Diarization)</li>



<li>Punctuate the transcript</li>



<li>Summarize the transcript</li>



<li>Generate subtitles for videos</li>



<li>Change the speech-to-text model to a better one (the transcription will take longer)</li>



<li>Choose whether or not to display timestamps</li>
</ul>



<p><strong>4.1 Let the user enable these functionalities or not</strong></p>



<p>First of all, we need to provide the user with a way to customize the transcript by <strong>choosing the options they want to activate</strong>.</p>



<p>➡️ To do this, we will use sliders &amp; check boxes, as shown in the screenshot below:</p>



<figure class="wp-block-image aligncenter size-full is-resized is-style-default"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-7.png" alt="sliders checkboxes streamlit app" class="wp-image-23746" width="543" height="264" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-7.png 718w, https://blog.ovhcloud.com/wp-content/uploads/2022/10/image-7-300x146.png 300w" sizes="auto, (max-width: 543px) 100vw, 543px" /><figcaption class="wp-element-caption">Overview of the displayed options</figcaption></figure>



<p>Add the following function to your code. It will display all the options on our application.</p>



<pre class="wp-block-code"><code class="">def load_options(audio_length, dia_pipeline):
    """
    Display options so the user can customize the result (punctuate or summarize the transcript? trim the audio? ...).
    User can choose his parameters thanks to sliders &amp; checkboxes, both displayed in a st.form so the page doesn't
    reload when interacting with an element (frustrating if it does because user loses fluidity).
    :return: the chosen parameters
    """
    # Create a st.form()
    with st.form("form"):
        st.markdown("""&lt;h6&gt;
            You can transcribe a specific part of your audio by setting start and end values below (in seconds). Then, 
            choose your parameters.&lt;/h6&gt;""", unsafe_allow_html=True)

        # Possibility to trim / cut the audio to a specific part (=&gt; transcribing fewer seconds saves time)
        # To perform that, user selects his time intervals thanks to sliders, displayed in 2 different columns
        col1, col2 = st.columns(2)
        with col1:
            start = st.slider("Start value (s)", 0, audio_length, value=0)
        with col2:
            end = st.slider("End value (s)", 0, audio_length, value=audio_length)

        # Create 3 new columns to display the other options
        col1, col2, col3 = st.columns(3)

        # User selects his preferences with checkboxes
        with col1:
            # Get an automatic punctuation
            punctuation_token = st.checkbox("Punctuate my final text", value=True)

            # Differentiate Speakers
            if dia_pipeline is None:
                st.write("Diarization model unavailable")
                diarization_token = False
            else:
                diarization_token = st.checkbox("Differentiate speakers")

        with col2:
            # Summarize the transcript
            summarize_token = st.checkbox("Generate a summary", value=False)

            # Generate a SRT file instead of a TXT file (shorter timestamps)
            srt_token = st.checkbox("Generate subtitles file", value=False)

        with col3:
            # Display the timestamp of each transcribed part
            timestamps_token = st.checkbox("Show timestamps", value=True)

            # Improve transcript with another model (better transcript but longer to obtain)
            choose_better_model = st.checkbox("Change STT Model")

        # The SRT option requires timestamps so it can match text with time =&gt; need to correct the following case
        if not timestamps_token and srt_token:
            timestamps_token = True
            st.warning("Srt option requires timestamps. We activated it for you.")

        # Validate choices with a button
        transcript_btn = st.form_submit_button("Transcribe audio!")

    return transcript_btn, start, end, diarization_token, punctuation_token, timestamps_token, srt_token, summarize_token, choose_better_model</code></pre>



<p>This function is very simple to understand:</p>



<p>First of all, we display all options in a <em><strong>st.form()</strong></em>, so the page doesn&#8217;t reload each time the user interacts with an element (a <em>Streamlit</em> behaviour which can be frustrating because, in our case, it wastes time). If you are curious, you can try to run your app without the <em>st.form()</em> to observe the problem 😊.</p>



<p>Then, we create some <strong>columns</strong>. They allow us to display the elements one under the other, aligned, to improve the visual appearance. Here too, you can display the elements one after the other without using columns, but it will look different.</p>



<p>We will call this function in the <em>transcription()</em> function, in the next article 😉. But if you want to test it now, you can call this function after the <em>st.audio()</em> widget. Just keep in mind that this will only display the options, but it won&#8217;t change the result since the options are not implemented yet.</p>



<p><strong>4.2 Session states variables</strong></p>



<p>To interact with these features, we need to initialize more session state variables (I swear these are the last ones 🙃):</p>



<pre class="wp-block-code"><code class="">### Add new initialization to our config() function

st.session_state['srt_token'] = 0  # Is subtitles parameter enabled or not
st.session_state['srt_txt'] = ""  # Save the transcript in a subtitles case to display it on the results page
st.session_state["summary"] = ""  # Save the summary of the transcript so we can display it on the results page
st.session_state["number_of_speakers"] = 0  # Save the number of speakers detected in the conversation (diarization)
st.session_state["chosen_mode"] = 0  # Save the mode chosen by the user (Diarization or not, timestamps or not)
st.session_state["btn_token_list"] = []  # List of tokens that indicates what options are activated to adapt the display on results page
st.session_state["my_HF_token"] = "ACCESS_TOKEN_GOES_HERE"  # User's Token that allows the use of the diarization model
st.session_state["disable"] = True  # Default appearance of the button to change your token
</code></pre>



<p>To quickly introduce you to their usefulness:</p>



<ul class="wp-block-list">
<li><em>srt_token: </em>Indicates whether the user has activated the subtitles option in the form</li>



<li><em>srt_txt</em>: Contains the transcript in subtitles format (.SRT) in order to save it when we click a button</li>



<li><em>summary</em>: Contains the short transcript given by the summarization model, for the same reason</li>



<li><em>number_of_speakers</em>: Number of speakers detected by the diarization algorithm in the audio recording</li>



<li><em>chosen_mode</em>: Indicates what options the user has selected so we know which information should be displayed (timestamps? results of diarization?)</li>



<li><em>btn_token_list</em>: Handles which buttons should be displayed. You will understand why it is needed in the next article</li>



<li><em>my_HF_token</em>: Saves the user&#8217;s token that allows the use of the diarization model</li>



<li><em>disable</em>: Boolean that makes the &#8216;change token&#8217; button clickable or not (not clickable if no token has been added)</li>
</ul>
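<p>As a side note on the <em>srt_txt</em> variable: the SubRip (.SRT) format numbers each subtitle and pairs it with a <em>HH:MM:SS,mmm</em> time range. The helpers below are a minimal sketch of how one entry could be built from millisecond values; they are illustrative and not part of the tutorial&#8217;s code.</p>

```python
def format_srt_timestamp(ms: int) -> str:
    """Convert milliseconds to the SRT time format HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def format_srt_entry(index: int, start_ms: int, end_ms: int, text: str) -> str:
    """Build one numbered SRT block: index, time range, transcript."""
    return (f"{index}\n"
            f"{format_srt_timestamp(start_ms)} --> {format_srt_timestamp(end_ms)}\n"
            f"{text}\n")


# Example: a subtitle spanning 10 s to 12.5 s
print(format_srt_entry(1, 10_000, 12_500, "Hello world"))
# prints:
# 1
# 00:00:10,000 --> 00:00:12,500
# Hello world
```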



<p>You will also need to <strong>add the following line of code to the <em>init_transcription()</em></strong> function:</p>



<pre class="wp-block-code"><code class=""># Add this line to the init_transcription() function
update_session_state("summary", "")</code></pre>



<p>This will reset the summary for each new audio file transcribed.</p>



<p><strong>4.3 Import the models</strong></p>



<p>Of course, to interact with these functionalities, we need to <strong>load new A.I. models</strong>. </p>



<p>⚠️ Reminder: We have used each of them in the previous <a href="https://github.com/ovh/ai-training-examples/tree/main/notebooks/natural-language-processing/speech-to-text/miniconda" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">notebook tutorials</a>. We will not re-explain their usefulness here.</p>



<p><strong>4.3.1 Create a token to access to the diarization model</strong></p>



<p>Since version 2 of the <em>pyannote.audio</em> library, an <strong>access token</strong> has been implemented and is <strong>mandatory</strong> in order to use the diarization model (which allows speaker differentiation).</p>



<p>To create your access token, you will need to:</p>



<ul class="wp-block-list">
<li><a href="https://huggingface.co/join" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Create a <em>Hugging Face</em> account</a> and <strong>verify your email address</strong></li>



<li>Visit the <em><a href="http://hf.co/pyannote/speaker-diarization" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">speaker-diarization</a></em> page and the <em><a href="http://hf.co/pyannote/segmentation" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">segmentation</a></em> page, and <strong>accept user conditions </strong>on both pages (only if requested)</li>



<li>Visit the <a href="http://hf.co/settings/tokens" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">token page</a> to <strong>create</strong> an access token (Read Role)</li>
</ul>
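<p>Rather than hard-coding the token in the source file, an alternative (our own suggestion, not the method used in this tutorial, which stores it in the <em>my_HF_token</em> session state variable) is to read it from an environment variable. The <em>HF_TOKEN</em> name and the <em>get_hf_token</em> helper below are assumptions for this sketch.</p>

```python
import os


def get_hf_token(default: str = "ACCESS_TOKEN_GOES_HERE") -> str:
    """Return the Hugging Face access token, preferring the HF_TOKEN
    environment variable over the hard-coded placeholder."""
    return os.environ.get("HF_TOKEN", default)


token = get_hf_token()
if token == "ACCESS_TOKEN_GOES_HERE":
    # Same behaviour as in the app: without a real token, diarization is disabled
    print("No token configured: diarization will be disabled.")
```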



<p><strong>4.3.2 Load the models</strong></p>



<p>Once you have your token, you can <strong>modify the code</strong> of the <em>load_models()</em> function to <strong>add the models</strong>:</p>



<pre class="wp-block-code"><code class="">@st.cache(allow_output_mutation=True)
def load_models():
    """
    Instead of systematically downloading the models we use (transcription model, summarizer, speaker differentiation, ...)
    through the transformers pipelines on each launch, we first try to import them locally to save time when the app starts.
    This function has a st.cache() decorator because, as the models never change, we want it to execute only once
    (also to save time). Otherwise, it would run every time we transcribe a new audio file.
    :return: Loaded models
    """

    # Load facebook-hubert-large-ls960-ft model (English speech to text model)
    with st.spinner("Loading Speech to Text Model"):
        # If models are stored in a folder, we import them. Otherwise, we import the models with their respective library

        try:
            stt_tokenizer = pickle.load(open("models/STT_processor_hubert-large-ls960-ft.sav", 'rb'))
        except FileNotFoundError:
            stt_tokenizer = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")

        try:
            stt_model = pickle.load(open("models/STT_model_hubert-large-ls960-ft.sav", 'rb'))
        except FileNotFoundError:
            stt_model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

    # Load T5 model (Auto punctuation model)
    with st.spinner("Loading Punctuation Model"):
        try:
            t5_tokenizer = torch.load("models/T5_tokenizer.sav")
        except FileNotFoundError:
            t5_tokenizer = T5Tokenizer.from_pretrained("flexudy/t5-small-wav2vec2-grammar-fixer")

        try:
            t5_model = torch.load("models/T5_model.sav")
        except FileNotFoundError:
            t5_model = T5ForConditionalGeneration.from_pretrained("flexudy/t5-small-wav2vec2-grammar-fixer")

    # Load summarizer model
    with st.spinner("Loading Summarization Model"):
        try:
            summarizer = pickle.load(open("models/summarizer.sav", 'rb'))
        except FileNotFoundError:
            summarizer = pipeline("summarization")

    # Load Diarization model (Differentiate speakers)
    with st.spinner("Loading Diarization Model"):
        try:
            dia_pipeline = pickle.load(open("models/dia_pipeline.sav", 'rb'))
        except FileNotFoundError:
            dia_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1", use_auth_token=st.session_state["my_HF_token"])

            #If the token hasn't been modified, dia_pipeline will automatically be set to None. The functionality will then be disabled.

    return stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline</code></pre>



<p>⚠️ <strong>Don&#8217;t forget to replace the <em>&#8220;ACCESS_TOKEN_GOES_HERE&#8221;</em> value of the <em>my_HF_token</em> session state variable with your personal token</strong>. Otherwise, the app will be launched without the diarization functionality.</p>



<p>As the <em>load_models()</em> function now returns 6 variables (instead of 2), we need to <strong>change the line of code that calls this function</strong> to avoid an error. This call is <strong>in the main code</strong>:</p>



<pre class="wp-block-code"><code class=""># Replace the load_models() code line call in the main code by the following one
stt_tokenizer, stt_model, t5_tokenizer, t5_model, summarizer, dia_pipeline = load_models()</code></pre>



<p>As we discussed in the previous article, having <strong>more models makes the initialization </strong>of the speech to text app <strong>longer</strong>. You will notice this if you run the app.</p>



<p>➡️ This is why we now propose <strong>two ways to import the models</strong> in the previous function. </p>



<p>The first one, used by default, consists of searching for the models in a folder where we save all the models used by our app. If we find a model in this folder, we import it instead of downloading it. This avoids depending on the download speed of an internet connection and makes the application usable as soon as it is launched! </p>



<p>If we don&#8217;t find the model in this folder, we fall back to the second solution, the one we have used so far: <strong>downloading the models from their libraries</strong>. The problem is that it takes several minutes to download all the models and then launch the application, which is quite frustrating.</p>



<p>➡️ We will show you how you can save these models in a folder in the documentation that will help you to deploy your project on AI Deploy.</p>
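<p>In the meantime, here is a rough sketch of that load-or-download pattern in isolation. The <em>load_or_build()</em> helper and the file name are hypothetical (the app uses separate try/except blocks per model, and the real saving procedure is covered in the deployment documentation); a cheap stand-in dictionary plays the role of a model object.</p>

```python
import os
import pickle
import tempfile


def load_or_build(path, build_fn):
    """Load a pickled object from `path` if it exists; otherwise build it
    with `build_fn` (e.g. a from_pretrained(...) call), save it for the
    next launch, and return it."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        obj = build_fn()
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        return obj


# Hypothetical usage with a stand-in object instead of a real model:
demo_path = os.path.join(tempfile.gettempdir(), "demo_model.sav")
cfg = load_or_build(demo_path, lambda: {"name": "demo-model"})
```

<p>The first call downloads (here: builds) and caches; every later call loads from disk, which is exactly why the app starts faster once the <em>models/</em> folder is populated.</p>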



<h3 class="wp-block-heading">Conclusion</h3>



<p class="has-text-align-left">Well done 🥳 ! Your application is now visually pleasing and offers more interactivity, thanks to the download button and the ones that let you control the audio player. You also managed to create a form that lets the user indicate which functionalities they want to use!</p>



<p>➡️ Now it&#8217;s time to create these features and add them to our application! You can discover how in <a href="https://blog.ovhcloud.com/how-to-build-a-speech-to-text-application-with-python-3-3" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">the next article</a> 😉.</p>
]]></content:encoded>
					
		
		<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2022/11/speech_to_text_app_click_button_issue.mp4" length="4867513" type="video/mp4" />
<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2022/07/click_button_streamlit_solve-3.mov" length="6829193" type="video/quicktime" />
<enclosure url="https://blog.ovhcloud.com/wp-content/uploads/2022/07/click_timestamp_btn.mov" length="3781979" type="video/quicktime" />

			</item>
	</channel>
</rss>
