<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>OVHcloud Data Processing Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/ovhcloud-data-processing/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/ovhcloud-data-processing/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Thu, 03 Jun 2021 08:27:48 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>OVHcloud Data Processing Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/ovhcloud-data-processing/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Improving the quality of data with Apache Spark</title>
		<link>https://blog.ovhcloud.com/improving-the-quality-of-data-with-apache-spark/</link>
		
		<dc:creator><![CDATA[Hubert Stefani]]></dc:creator>
		<pubDate>Tue, 15 Sep 2020 15:34:26 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[Data Processing]]></category>
		<category><![CDATA[OVHcloud Data Processing]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=18676</guid>

					<description><![CDATA[Today we bring you a guest post by Hubert Stefani, Chief Innovation Officer and Cofounder of Novagen Conseil. As expert data consultants and heavy Apache Spark users, we felt honoured to become early adopters of OVHcloud Data Processing. As a first use case to test this offering, we chose our quality assessment process. As a [&#8230;]]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<p><em>Today we bring you a guest post by</em> Hubert Stefani, Chief Innovation Officer and Cofounder of <a href="http://www.novagen.tech/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Novagen Conseil</a>.</p>
</div></div>



<figure class="wp-block-image size-large is-resized"><img fetchpriority="high" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0269-1024x537.png" alt="Improving the quality of data with Apache Spark" class="wp-image-19307" width="768" height="403" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0269-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0269-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0269-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0269.png 1200w" sizes="(max-width: 768px) 100vw, 768px" /></figure>



<p>As expert data consultants and heavy Apache Spark users, we felt honoured to become early adopters of <a href="https://www.ovhcloud.com/en-ie/public-cloud/data-processing/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Data Processing</a>. As a first use case to test this offering, we chose our quality assessment process.</p>



<p>As a data consultancy based in Paris, we build complete and innovative data strategies for our large corporate and public customers: top-tier banks, public authorities, retailers, the fashion industry, transportation leaders, etc. We offer them large-scale BI, data lake creation and management, and business innovation through data science. Within our Data Lab, we select best-in-class technology and create what we call ‘boosters’, i.e. ready-to-deploy or customized data assets.</p>



<p>When it comes to selecting a new technology solution, we use the following checklist:</p>



<ul class="wp-block-list"><li><strong>Innovation and evolvability</strong>: depth of functionality, added value and usability</li><li><strong>Performance and cost-effectiveness</strong>: intrinsic performance, but also technical architectures that adapt to customer needs</li><li><strong>Open standards and governance</strong>: to support our customers’ cloud or multi-cloud strategies, we choose to rely on open standards to deploy on different targets and preserve reversibility.</li></ul>




<h3 class="wp-block-heading">Apache Spark, our Swiss Army knife</h3>



<p>About a month ago, OVHcloud’s Data and AI Product Manager, Bastien Verdebout, approached us to test its new product, OVHcloud Data Processing, built on top of Apache Spark. The answer was of course yes!</p>



<p>One of the reasons we were so eager to discover this data-processing-as-a-service solution is that we use Apache Spark extensively; it’s our Swiss Army knife for processing data.</p>



<ul class="wp-block-list"><li>It handles data at extremely large scale,</li><li>It meets the needs of both data engineering and data science,</li><li>It processes both data at rest and streaming data,</li><li>It’s the de facto standard for data workloads on-premises and in the cloud,</li><li>It offers built-in APIs for Python, Scala, Java and R.</li></ul>



<p>We have progressively developed software assets on top of Apache Spark to address recurring challenges such as:</p>



<ul class="wp-block-list"><li>ETL processing in data lake environments,</li><li>Quality KPIs on top of data lake sources,</li><li>Machine Learning algorithms for Natural Language Processing, Time Series predictions&#8230;</li></ul>



<h3 class="wp-block-heading">The ideal use case: data quality assessment</h3>



<p>We considered the following characteristics of <strong>OVHcloud Data Processing</strong>:</p>



<ul class="wp-block-list"><li>Processing engine built on top of <strong>Apache Spark 2.4.3</strong></li><li>Jobs start after <strong>a few seconds</strong> (vs minutes to launch a cluster)</li><li>Ability to <strong>adjust the power dedicated to different Spark jobs</strong>: from low power (1 driver and 1 executor with 4 cores and 8 GB of memory) up to large-scale processing (potentially hundreds of cores and gigabytes of memory)</li><li>Full <strong>compute/storage separation</strong> aligned with <strong>standard cloud architectures</strong>, including S3 APIs to access data stored in the Object Storage layer</li><li>Job execution and monitoring through a <strong>Command Line Interface</strong> and an <strong>API</strong></li></ul>



<p>These characteristics led us to choose our quality assessment process as an ideal use case, since it requires both interactivity and adjustable compute resources to deliver quality KPIs through Spark processes.</p>
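<p>To make the idea of a quality KPI concrete, here is a minimal sketch in plain Python. It is an illustration only: the two KPIs (completeness and uniqueness), the function names and the sample records are assumptions and do not describe the actual Novagen quality process, where such measures would typically be computed as Spark aggregations over data lake sources.</p>

```python
def completeness(rows, column):
    """Share of rows where `column` is present and non-empty (hypothetical KPI)."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


def uniqueness(rows, column):
    """Share of distinct values among the non-empty values of `column` (hypothetical KPI)."""
    values = [row[column] for row in rows if row.get(column) not in (None, "")]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Made-up sample records, used only to exercise the two functions.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "a@example.com"},
    {"id": 4, "email": None},
]

kpis = {
    "email_completeness": completeness(customers, "email"),
    "email_uniqueness": uniqueness(customers, "email"),
}
print(kpis)
```

<p>Each KPI is a ratio between 0 and 1, which makes it easy to track over time or to compare across sources.</p>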



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="752" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0268-1024x752.png" alt="Why we need Spark as a Service" class="wp-image-19299" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0268-1024x752.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0268-300x220.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0268-768x564.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0268.png 1497w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">OVHcloud Data Processing at work</h3>



<figure class="wp-block-image size-large"><img decoding="async" width="960" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/image-blog-post-Novagen-2.png" alt="" class="wp-image-18981" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-2.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-2-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-2-768x432.png 768w" sizes="(max-width: 960px) 100vw, 960px" /></figure>



<p>The corresponding command generated by our software is:</p>



<pre class="wp-block-code"><code class="">./ovh-spark-submit --projectid ec7d2cb6da084055a0501b2d8d8d62a1 \
  --class tech.novagen.spark.Launcher --driver-cores 4 --driver-memory 8G \
  --executor-cores 4 --executor-memory 8G --num-executors 5 \
  swift://sparkjars/QualitySparkExecutor-1.0-spark.jar --apiServer=5.1.1.2:80</code></pre>



<p>This command is quite similar to a usual spark-submit, except for the jar path: the binary must be stored in an Object Storage bucket, which we reference with a swift:// URL. (NB: the same job could have been created with a call to the OVHcloud Data Processing API.)</p>
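<p>As an illustration of how software might generate such a command, the following Python sketch assembles the argument list from a job description. It uses only the flags visible in the command above; the <code>SparkJob</code> class, its field names and the <code>build_command</code> helper are hypothetical and are not part of any OVHcloud tool or of the Novagen software.</p>

```python
import shlex
from dataclasses import dataclass


@dataclass
class SparkJob:
    """Parameters of one job; field names are illustrative, not an OVHcloud schema."""
    project_id: str
    main_class: str
    jar_url: str                 # swift:// URL of the jar stored in Object Storage
    driver_cores: int = 4
    driver_memory: str = "8G"
    executor_cores: int = 4
    executor_memory: str = "8G"
    num_executors: int = 1
    app_args: tuple = ()         # arguments passed through to the Spark application


def build_command(job: SparkJob) -> list:
    """Return the argument list, using only the flags shown in the command above."""
    return [
        "./ovh-spark-submit",
        "--projectid", job.project_id,
        "--class", job.main_class,
        "--driver-cores", str(job.driver_cores),
        "--driver-memory", job.driver_memory,
        "--executor-cores", str(job.executor_cores),
        "--executor-memory", job.executor_memory,
        "--num-executors", str(job.num_executors),
        job.jar_url,
        *job.app_args,
    ]


job = SparkJob(
    project_id="ec7d2cb6da084055a0501b2d8d8d62a1",
    main_class="tech.novagen.spark.Launcher",
    jar_url="swift://sparkjars/QualitySparkExecutor-1.0-spark.jar",
    num_executors=5,
    app_args=("--apiServer=5.1.1.2:80",),
)

# shlex.join quotes each argument where needed, so the result is safe to paste into a shell.
command_line = shlex.join(build_command(job))
print(command_line)
```

<p>Keeping the arguments as a list until the last moment avoids quoting mistakes and line-continuation errors when the command is written out by hand.</p>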



<p>From this point, we can fine-tune our process portfolio and experiment with different resource allocations with few limits (other than the quotas assigned to any Public Cloud project).</p>
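<p>When experimenting with allocations, it helps to know the total footprint a job requests. The sketch below adds up cores and memory for the driver and executors of the command shown earlier and compares the result with a quota. The helper names and the quota figures are hypothetical, and the calculation ignores any memory overhead that Spark may add on top of the requested values.</p>

```python
import re


def memory_gb(spec):
    """Convert a memory string such as '8G' into whole gigabytes."""
    match = re.fullmatch(r"(\d+)[gG]", spec)
    if match is None:
        raise ValueError(f"unsupported memory spec: {spec!r}")
    return int(match.group(1))


def job_footprint(driver_cores, driver_memory, executor_cores, executor_memory, num_executors):
    """Return (total cores, total memory in GB) requested by one job."""
    total_cores = driver_cores + executor_cores * num_executors
    total_memory = memory_gb(driver_memory) + memory_gb(executor_memory) * num_executors
    return total_cores, total_memory


def fits_quota(cores, memory, max_cores, max_memory_gb):
    """True when the requested resources stay within the given limits."""
    return cores <= max_cores and memory <= max_memory_gb


# Values taken from the command above: 1 driver and 5 executors, each with 4 cores and 8G.
cores, memory = job_footprint(4, "8G", 4, "8G", 5)
print(cores, memory)

# Hypothetical quota of 32 cores and 64 GB, chosen only for the example.
print(fits_quota(cores, memory, max_cores=32, max_memory_gb=64))
```

<p>For the command above this gives 24 cores and 48 GB, which fits within the example quota.</p>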



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="960" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/image-blog-post-Novagen-3.png" alt="" class="wp-image-18982" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-3.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-3-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-3-768x432.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure>



<h2 class="wp-block-heading">Real-time display of job logs</h2>



<p>Finally, for tuning and post-mortem job analysis, we can rely on the saved log files. OVHcloud Data Processing also offers a real-time display of job logs, which is very convenient, and provides complementary supervision through Grafana dashboards.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="960" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/image-blog-post-Novagen-4.png" alt="" class="wp-image-18983" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-4.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-4-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-4-768x432.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure>



<p>This is a first yet significant test of <a href="https://www.ovhcloud.com/en-ie/public-cloud/data-processing/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Data Processing</a>; so far, it has proved an excellent match for the Novagen quality process use case and allowed us to validate several crucial points when it comes to testing a new data solution:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="960" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/image-blog-post-Novagen-5.png" alt="" class="wp-image-18984" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-5.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-5-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-5-768x432.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>This is the beginning of this product, and we will have a close look at the upcoming functionalities. The OVHcloud team unveiled part of its roadmap, and it looks really promising!</p><cite>Hubert Stefani, Chief Innovation Officer of Novagen Conseil</cite></blockquote>



]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
