<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>A journey into the wondrous land of Machine Learning Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/a-journey-into-the-wondrous-land-of-machine-learning/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/a-journey-into-the-wondrous-land-of-machine-learning/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Fri, 03 Mar 2023 16:41:31 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>A journey into the wondrous land of Machine Learning Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/a-journey-into-the-wondrous-land-of-machine-learning/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>A journey into the wondrous land of Machine Learning, or “Cleaning data is funnier than cleaning my flat!” (Part 3)</title>
		<link>https://blog.ovhcloud.com/a-journey-into-the-wondrous-land-of-machine-learning-or-cleaning-data-is-funnier-than-cleaning-my-flat-part-3/</link>
		
		<dc:creator><![CDATA[Guillaume Ruty]]></dc:creator>
		<pubDate>Tue, 12 Apr 2022 08:13:52 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[A journey into the wondrous land of Machine Learning]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=21754</guid>

					<description><![CDATA[What am I doing here? The story so far&#8230; As you might know if you have read our blog for more than a year, a few years ago, I bought a flat in Paris. If you don&#8217;t know, the real estate market in Paris is expensive but despite that, it is so tight that a [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h3 class="wp-block-heading">What am I doing here?</h3>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0896-1024x537.jpeg" alt="A journey into the wondrous land of Machine Learning, or “Cleaning data is funnier than cleaning my flat!” (Part 3)" class="wp-image-22797" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0896-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0896-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0896-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/03/IMG_0896.jpeg 1200w" sizes="(max-width: 512px) 100vw, 512px" /></figure></div>



<h4 class="wp-block-heading">The story so far&#8230;</h4>



<p>As you might know if you have read our blog for more than a year, a few years ago I bought a flat in Paris. If you don&#8217;t know, the real estate market in Paris is expensive, yet so tight that a good flat at a fair price can sell in less than a day.</p>



<p>Obviously, you have to make a decision quite fast, and considering the prices, you have to trust that decision. Of course, to trust your decision, you have to take your time, study the market, make some visits etc… This process can be quite long (in my case it took a year between the moment I decided I wanted to buy a flat and the moment I actually committed to buying my current flat), and even spending a lot of time will never give you a perfect understanding of the market. What if there was a way to do that very quickly, and with better accuracy than the standard process?</p>



<div class="wp-block-image"><figure class="alignright size-large is-resized"><a href="https://blog.ovhcloud.com/a-journey-into-the-wondrous-land-of-machine-learning-or-did-i-get-ripped-off-part-1/" data-wpel-link="internal"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2020/04/949613E6-BB64-45A6-9A57-9C20A4808B20-1024x537.png" alt="A journey into the wondrous land of Machine Learning, or “Did I get ripped off?” (Part 1)" class="wp-image-17908" width="400" height="210" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/949613E6-BB64-45A6-9A57-9C20A4808B20-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/949613E6-BB64-45A6-9A57-9C20A4808B20-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/949613E6-BB64-45A6-9A57-9C20A4808B20-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/949613E6-BB64-45A6-9A57-9C20A4808B20.png 1200w" sizes="(max-width: 400px) 100vw, 400px" /></a></figure></div>



<p>As you might also know if you are one of our regular readers, I tried to solve this problem with Machine Learning, using end-to-end software called Dataiku. In a <a href="https://blog.ovhcloud.com/a-journey-into-the-wondrous-land-of-machine-learning-or-did-i-get-ripped-off-part-1/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">first blog post</a>, we learned how to make basic use of Dataiku, and discovered that just knowing how to click on a few buttons wasn&#8217;t quite enough: you had to bring some sense to your data and to the training algorithm, or you would get absurd results.</p>



<div class="wp-block-image"><figure class="alignright size-large is-resized"><a href="https://blog.ovhcloud.com/a-journey-through-the-wondrous-land-of-machine-learning-or-can-i-really-buy-a-palace-in-paris-for-100000e-part-2/" data-wpel-link="internal"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2020/08/12B8273B-DCF4-4705-9E08-5FE05760102F-1024x537.png" alt="A journey through the wondrous land of Machine Learning or &quot;Can I really buy a palace in Paris for 100,000€?&quot; (Part 2)" class="wp-image-19147" width="400" height="211" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/08/12B8273B-DCF4-4705-9E08-5FE05760102F-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/12B8273B-DCF4-4705-9E08-5FE05760102F-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/12B8273B-DCF4-4705-9E08-5FE05760102F-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/12B8273B-DCF4-4705-9E08-5FE05760102F.png 1200w" sizes="(max-width: 400px) 100vw, 400px" /></a></figure></div>



<p>In a <a href="https://blog.ovhcloud.com/a-journey-through-the-wondrous-land-of-machine-learning-or-can-i-really-buy-a-palace-in-paris-for-100000e-part-2/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">second entry</a>, we studied the data a bit more, tweaked a few parameters and values in Dataiku&#8217;s algorithms and trained a new model. This yielded a much better result, and this new model was &#8211; if not accurate &#8211; at least relevant: the same flat had a higher predicted price when it was bigger or supposedly in a better neighbourhood. However, it was far from perfect and really lacked accuracy for several reasons, some of them out of our control.</p>



<p>However, all of this was done on one instance of Dataiku &#8211; licensed software &#8211; on a single VM. There are multiple reasons that could push you to do things differently:</p>



<ul class="wp-block-list"><li>Maybe you want to only use open-source frameworks to make sure you have full control over exactly what runs and how it works</li><li>Maybe you don&#8217;t want to pay a license to make an extensive use of the software</li><li>Maybe you have a vast amount of data to process and transform and even a big VM is not enough</li><li>Maybe you want to only use &#8220;universal&#8221; tools and languages to make sure you will always find people with the appropriate skills so that your pipelines are maintaineable</li></ul>



<h4 class="wp-block-heading">But seriously, what am I doing?</h4>



<p>What we did very intuitively (and somewhat naively) with Dataiku was actually quite a complex pipeline, often called <em>ELT</em>, for <em>Extract, Load and Transform</em>.</p>



<ul class="wp-block-list"><li>We <em>extracted</em> the data when we downloaded it from the French open data website</li><li>We <em>loaded</em> it when we uploaded it to our Dataiku Instance</li><li>We <em>transformed</em> it when we used Dataiku&#8217;s data processing capabilities to make it digestible for a machine learning algorithm</li></ul>



<p>And obviously, after this <em>ELT</em> process, we added a step to train a model on the transformed data.</p>



<p>So what are we going to do to redo all of that without Dataiku&#8217;s help?</p>



<ul class="wp-block-list"><li>For the <em>extract</em> phase, we are actually going to cheat a little bit and do things the exact same way: we will download the data directly from the website on a public cloud VM. Obviously this is not very scalable, but the <em>extract</em> phase is the most dependant on the source format: in that case, it&#8217;s csv files from a webserver.</li><li>For the <em>load</em> phase, we are going to push these csv files to an object store &#8211; in that case <a href="https://www.ovhcloud.com/en/public-cloud/object-storage/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud PCS</a> (Public Cloud Storage) &#8211; using the standard S3 CLI. An object store is ideal to store this kind of data because:<ul><li>It is infinitely scalable so no matter how much data I push, I won&#8217;t have to handle storage provisioning etc…</li><li>It works best with large objects (&gt; 1MB. In our case, the csv files are around ~500MB)</li><li>It works with standard APIs that most cloud providers provide and that you can also deploy on-premise if you wish</li></ul></li><li>For the <em>transform</em> phase, we are going to use the most well-known distributed processing framework: Apache Spark. Spark is a framework that transparently distribute the operations on data on several worker nodes. Thus, it allows you to manipulate much bigger datasets than what you could do on a single VM. However, setting up a Spark cluster to launch Spark jobs is not an easy task. Therefore, we are going to use OVHcloud <a href="https://www.ovhcloud.com/en-gb/public-cloud/data-processing/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">data processing</a>. With data processing, we can just upload our code in an object storage bucket and launch a job in a few clicks/a CLI call, not handling any infrastructure in the process. This job will read raw data from a S3 bucket and write the processed data in another S3 bucket.</li><li>After ELT has been done, we still need to <em>train</em> the model! An interactive and user-friendly way to develop the code to train machine learning models is through the use of <em>Jupyter notebooks</em>. A jupyter notebook is a web-based UI allowing you to write and execute python code in blocks in an interactive way, allowing you to explore data and try different things in a seamless fashion. As it turns out, OVHcloud has a product called <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebook/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud AI notebook</a> that allows you to spin up a notebook on multiple GPUs in just a few clicks or a CLI call. Even better, you can attach S3 buckets to your notebook, transparently pushing the data contained in the bucket to high-speed storage near the notebooks to make sure you make full use of your GPUs. As you might have guessed already, we will use this product to spin up a notebook, work on the clean data, train a model and evaluate its results.</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/04/1-4-phases-1024x576.png" alt="" class="wp-image-22888" width="768" height="432" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/04/1-4-phases-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/04/1-4-phases-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/04/1-4-phases-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/04/1-4-phases.png 1280w" sizes="auto, (max-width: 768px) 100vw, 768px" /><figcaption><em>When ELT becomes ELTT</em></figcaption></figure></div>



<p>Now that we know what we are going to do, let us proceed!</p>



<h3 class="wp-block-heading">Preparing your setup</h3>



<p>Before beginning, we have to properly set up our environment to be able to launch the different tools and products. Throughout this tutorial, we will show you how to do everything with CLIs. However, all these manipulations can also be done on OVHcloud&#8217;s manager (GUI), in which case you won&#8217;t have to configure these tools.</p>



<h4 class="wp-block-heading">Setting up your environment</h4>



<p>For all the manipulations described in the next phase of this article, we will use a Virtual Machine deployed in OVHcloud&#8217;s Public Cloud. It will serve both as the extraction agent that downloads the raw data from the web and pushes it to S3, and as a CLI machine to launch data processing and notebook jobs. It is a <code>d2-4</code> flavor with 4GB of RAM, 2 vCores and 50 GB of local storage running Debian 10, deployed in the Gravelines datacenter. During this tutorial, I run a few UNIX commands, but you should easily be able to adapt them to whatever OS you use if needed. All the CLI tools specific to OVHcloud&#8217;s products are available on multiple OSs.</p>



<p>You will also need an OVHcloud NIC (user account) as well as a Public Cloud Project created for this account, with a quota high enough to deploy a GPU (if that is not the case, you will still be able to deploy a notebook on CPU rather than GPU; the training phase will just take more time). To create a Public Cloud project, you can follow <a href="https://docs.ovh.com/gb/en/public-cloud/create_a_public_cloud_project/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">these steps</a>.</p>



<h4 class="wp-block-heading">Installing and configuring your tools</h4>



<p>Here is a list of the CLI tools and other utilities we will use during this tutorial, and why:</p>



<ul class="wp-block-list"><li><code>wget</code> to retrieve the data from a public data website.</li><li><code>gunzip</code> to unzip it.</li><li><code>aws</code> CLI, that we will use to interact with OVHcloud&#8217;s Public Cloud Storage, a product allowing you to manipulate object storage buckets either through the Swift or the S3 API. Alternatively, you could use the <code>Swift</code> CLI. You will find how to configure your <code>aws</code> CLI <a href="https://docs.ovh.com/gb/en/public-cloud/getting_started_with_the_swift_S3_API/" data-wpel-link="exclude">h</a><a href="https://docs.ovh.com/gb/en/public-cloud/getting_started_with_the_swift_S3_API/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">e</a><a href="https://docs.ovh.com/gb/en/public-cloud/getting_started_with_the_swift_S3_API/" data-wpel-link="exclude">re</a>.</li><li><code>ovh-spark-submit</code> CLI to launch OVHcloud data processing (Spark-as-a-Service) jobs. You will find the instructions to install and configure the CLI <a href="https://docs.ovh.com/gb/en/data-processing/submit-cli/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">here</a>.</li><li><code>ovhai</code> CLI to launch OVHcloud AI notebook or AI training jobs. You will find out how to install and configure this CLI <a href="https://docs.ovh.com/gb/en/ai-training/install-client/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">here</a>.</li></ul>



<p>Additionally, you will find commented code samples for the processing and training steps in <a href="https://github.com/Le-Gruty/real-estate-open-data" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">this GitHub repository</a>.</p>



<h4 class="wp-block-heading">Creating your object storage buckets</h4>



<p>In this tutorial, we will use several object storage buckets. Since we will use the S3 API, we will call them S3 buckets but, as mentioned above, if you use OVHcloud&#8217;s standard Public Cloud Storage you could also use the Swift API. However, you are restricted to the S3 API only if you use our new <a href="https://labs.ovh.com/high-perf-object-storage" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">high-performance object storage</a> offer, currently in Beta.</p>



<p>For this tutorial, we are going to create and use the following S3 buckets:</p>



<ul class="wp-block-list"><li><code>transactions-ecoex-raw</code> to store the raw data before it is processed</li><li><code>transactions-ecoex-processing</code> to store the Spark code and environment files used to process the raw data</li><li><code>transactions-ecoex-clean</code>to store the data once it has been cleaned</li><li><code>transactions-ecoex-model</code>to store the weights of the trained model once it has been trained</li></ul>



<p>To create these buckets, use the following commands after having configured your aws CLI as explained above:</p>



<pre class="wp-block-code"><code class="">aws s3 mb s3://transactions-ecoex-raw
aws s3 mb s3://transactions-ecoex-processing
aws s3 mb s3://transactions-ecoex-clean
aws s3 mb s3://transactions-ecoex-model</code></pre>



<p>Now that you have your environment set up and your S3 buckets ready, we can begin the tutorial!</p>



<h3 class="wp-block-heading">Tutorial</h3>



<h4 class="wp-block-heading">Extracting/Ingesting the Data</h4>



<p>First, let us download the data files directly from Etalab&#8217;s website and unzip them:</p>



<pre class="wp-block-code"><code class="">wget -r -l2 -P data -np "https://files.data.gouv.fr/geo-dvf/latest/csv/" -A "*full.csv.gz" --reject-regex="/communes/|/departements/"
cd data
for FILE in `find files.data.gouv.fr -name "*.csv.gz"`
do
      NEWFILE=`echo $FILE | tr '/' '-'`
      mv $FILE $NEWFILE
done
rm -rf files.data.gouv.fr/*
rm -r files.data.gouv.fr
gunzip *.csv.gz</code></pre>



<p>You should now have the following files in your directory, each one corresponding to the French real estate transactions of a specific year:</p>



<pre class="wp-block-code"><code class="">debian@d2-4-gra5:~/data$ ls
files.data.gouv.fr-geo-dvf-latest-csv-2016-full.csv  
files.data.gouv.fr-geo-dvf-latest-csv-2019-full.csv
files.data.gouv.fr-geo-dvf-latest-csv-2017-full.csv  
files.data.gouv.fr-geo-dvf-latest-csv-2020-full.csv
files.data.gouv.fr-geo-dvf-latest-csv-2018-full.csv</code></pre>



<p>Now, use the S3 CLI to push these files to the relevant S3 bucket:</p>



<pre class="wp-block-code"><code class="">find * -type f -name "*.csv" -exec aws s3 cp {} s3://transactions-ecoex-raw/{} \;</code></pre>



<p>You should now have those 5 files in your S3 bucket:</p>



<pre class="wp-block-code"><code class="">debian@d2-4-gra5:~/data$ aws s3 ls transactions-ecoex-raw
2021-09-21 17:43:32  511804467 files.data.gouv.fr-geo-dvf-latest-csv-2016-full.csv
2021-09-21 17:43:54  590281357 files.data.gouv.fr-geo-dvf-latest-csv-2017-full.csv
2021-09-21 17:44:14  580600597 files.data.gouv.fr-geo-dvf-latest-csv-2018-full.csv
2021-09-21 17:44:36  614788794 files.data.gouv.fr-geo-dvf-latest-csv-2019-full.csv
2021-09-21 17:44:52  426715300 files.data.gouv.fr-geo-dvf-latest-csv-2020-full.csv</code></pre>



<p>What we just did with a small VM was <strong><em>ingesting</em></strong> data into an S3 bucket. In real-life use cases with more data, we would probably use dedicated tools to ingest the data. However, in our example, with just a few GB of data coming from a public website, this does the trick.</p>



<h4 class="wp-block-heading">Processing the data</h4>



<p>Now that you have your raw data in place to be processed, you just have to upload the code necessary to run your data processing job. Our data processing product allows you to run Spark code written in Java, Scala or Python. In our case, we used PySpark, Spark&#8217;s Python API. Your code should consist of 3 files:</p>



<ul class="wp-block-list"><li>A .py file containing your code. In my case, this file is <code>real_estate_processing.py</code></li><li>A <code>environment.yml</code> file containing your dependencies.</li><li>If &#8211; like me &#8211; you don&#8217;t wish to put your S3 credentials in your .py file, a separate <code>.env</code> file containing them. You will need to use a library like <code>python-dotenv</code> and source it in the <code>environment.yml</code> file to handle that.</li></ul>



<p>Once you have your code files, go to the folder containing them and push them to the appropriate S3 bucket:</p>



<pre class="wp-block-code"><code class="">cd ~/EcoEx_Tech_Masterclass/Processing/
aws s3 cp real_estate_processing.py s3://transactions-ecoex-processing/processing.py
aws s3 cp environment.yml s3://transactions-ecoex-processing/environment.yml
aws s3 cp .env s3://transactions-ecoex-processing/.env</code></pre>



<p>Your bucket should now look like this:</p>



<pre class="wp-block-code"><code class="">debian@d2-4-gra5:~/EcoEx_Tech_Masterclass/Processing$ aws s3 ls transactions-ecoex-processing
2021-10-04 10:14:52         94 .env
2021-10-04 10:14:09         99 environment.yml
2021-10-04 10:20:08       4425 processing.py</code></pre>



<p>You are now ready to launch your data processing job. The following command will allow you to launch this job on 10 executors, each with 4 vCores and 15 GB of RAM.</p>



<pre class="wp-block-code"><code class="">ovh-spark-submit \
    --projectid $OS_PROJECT_ID \
    --jobname transactions-ecoex-processing \
    --class org.apache.spark.examples.SparkPi \
    --driver-cores 4 \
    --driver-memory 15G \
    --executor-cores 4 \
    --executor-memory 15G \
    --num-executors 10 \
    swift://transactions-ecoex-processing/processing.py 1000</code></pre>



<p>Note that the data processing product uses the Swift API to retrieve the code files. This is totally transparent to the user, and the fact that we used the S3 CLI to create the bucket has absolutely no impact. When the job is over, you should see the following in your <code>transactions-ecoex-clean</code> bucket:</p>



<pre class="wp-block-code"><code class="">debian@d2-4-gra5:~$ aws s3 ls transactions-ecoex-clean
                             PRE data_clean.parquet/
debian@d2-4-gra5:~$ aws s3 ls transactions-ecoex-clean/data_clean.parquet/
2021-10-04 16:35:48          0 _SUCCESS
2021-10-04 16:27:13      50769 part-00000-ac3acfc2-c5b3-430e-91b4-7f5e50b537a6-c000.snappy.parquet
2021-10-04 16:27:16      52253 part-00001-ac3acfc2-c5b3-430e-91b4-7f5e50b537a6-c000.snappy.parquet
2021-10-04 16:27:18      51412 part-00002-ac3acfc2-c5b3-430e-91b4-7f5e50b537a6-c000.snappy.parquet
2021-10-04 16:27:21      46962 part-00003-ac3acfc2-c5b3-430e-91b4-7f5e50b537a6-c000.snappy.parquet
2021-10-04 16:27:23      49130 part-00004-ac3acfc2-c5b3-430e-91b4-7f5e50b537a6-c000.snappy.parquet
2021-10-04 16:27:26      50046 part-00005-ac3acfc2-c5b3-430e-91b4-7f5e50b537a6-c000.snappy.parquet
.......</code></pre>



<p>Before going further, let us look at the size of the data before and after cleaning:</p>



<pre class="wp-block-code"><code class="">debian@d2-4-gra5:~$ aws s3 ls s3://transactions-ecoex-raw --human-readable --summarize
2021-10-18 18:59:21  488.1 MiB full.csv.2016
2021-10-18 18:59:21  562.9 MiB full.csv.2017
2021-10-18 18:59:21  553.7 MiB full.csv.2018
2021-10-18 18:59:21  586.3 MiB full.csv.2019
2021-10-18 18:59:21  406.9 MiB full.csv.2020

Total Objects: 5
   Total Size: 2.5 GiB

debian@d2-4-gra5:~$ aws s3 ls s3://transactions-ecoex-clean --recursive --human-readable --summarize
2021-10-19 09:20:23    0 Bytes data_clean.parquet/_SUCCESS
2021-10-19 09:19:09   49.5 KiB data_clean.parquet/part-00000-48156d39-f3fb-495b-b829-edee3777701f-c000.snappy.parquet
2021-10-19 09:19:10   51.1 KiB data_clean.parquet/part-00001-48156d39-f3fb-495b-b829-edee3777701f-c000.snappy.parquet
2021-10-19 09:19:09   50.3 KiB data_clean.parquet/part-00002-48156d39-f3fb-495b-b829-edee3777701f-c000.snappy.parquet
...
...
2021-10-19 09:20:11   49.1 KiB data_clean.parquet/part-00199-48156d39-f3fb-495b-b829-edee3777701f-c000.snappy.parquet

Total Objects: 197
   Total Size: 9.4 MiB</code></pre>



<p>As you can see, from ~2.5 GB of raw data we extracted only ~10 MB of actually useful data (about 0.4%)! What is noteworthy here is that you can easily imagine use cases where you need a large-scale infrastructure to ingest and process the raw data, but where one or a few VMs are enough to work on the clean data. Obviously, this is more often the case when working with text/structured data than with raw sound/images/videos. </p>



<p>Before we start training a model, take a look at these two screenshots from OVHcloud&#8217;s data processing UI to erase any doubt you have about the power of distributed computing:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="421" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/job_slow-1024x421.png" alt="" class="wp-image-21769" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/job_slow-1024x421.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/job_slow-300x123.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/job_slow-768x316.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/job_slow.png 1505w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="399" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/job_fast-1024x399.png" alt="" class="wp-image-21771" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/job_fast-1024x399.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/job_fast-300x117.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/job_fast-768x299.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/job_fast.png 1494w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>In the first picture, you can see the time taken for this job when launching only 1 executor: 8 minutes 35 seconds. This duration drops to only 2 minutes 56 seconds when launching the same job (same code, etc.) on 4 executors: almost 3 times faster. And since you pay as you go, this will only cost you ~33% more in that case (4 times the executors for roughly a third of the time) for the same operation done 3 times faster &#8211; without any modification to your code, just one argument changed in the CLI call. Let us now use this data to train a model.</p>



<h4 class="wp-block-heading">Training the model</h4>



<p>To train the model, you are going to use OVHcloud AI notebook to deploy a … notebook! With the following command, you will:</p>



<ul class="wp-block-list"><li>Deploy a notebook running a Jupyterlab, …</li><li>… pre-configured to run on a VM with 1 GPU, …</li><li>… with several important libraries such as Tensorflow or Pytorch installed, …</li><li>… with your S3 bucket <code>transactions-ecoex-clean</code> synchronized on a high-performance storage cluster that is mounted on <code>/workspace/data</code>in your notebook, …</li><li>… and ready to write the model when it has been trained on the <code>/workspace/model</code> path that will be synchronized with your <code>transactions-ecoex-model</code> when the job is over.</li></ul>



<pre class="wp-block-code"><code class="">ovhai notebook run one-for-all jupyterlab \
    --name transactions-ecoex-training \
    --framework-version v98-ovh.beta.1 \
    --flavor ai1-1-gpu \
    --gpu 1 \
    --volume transactions-ecoex-clean@GRA/:/workspace/data:RO \
    --volume transactions-ecoex-model@GRA/:/workspace/model:RW</code></pre>



<p>In our case, we launch a notebook with only 1 GPU, because the code samples we provide would not leverage several GPUs for a single job. I could adapt my code to parallelize the training phase on multiple GPUs, in which case I could launch a job with up to 4 parallel GPUs.</p>



<p>Once you&#8217;re done, just get the URL of your notebook with the following command and connect to it with your browser:</p>



<pre class="wp-block-code"><code class="">debian@d2-4-gra5:~$ ovhai notebook list
ID                                   STATE   AGE FRAMEWORK   VERSION          EDITOR     URL
525bcb3e-d57b-4111-91ff-4a8759061f75 RUNNING 20m one-for-all v98-ovh.beta.1   jupyterlab https://XXXXXXXXXX.notebook.gra.training.ai.cloud.ovh.net</code></pre>



<p>You can now import the <code>real-estate-training.ipynb</code> file to the notebook with just a few clicks. If you don&#8217;t want to import it from the computer you use to access the notebook (for example if, like me, you work on a VM and have cloned the git repo on this VM and not on your computer), you can push the <code>.ipynb</code> file to your <code>transactions-ecoex-clean</code> or <code>transactions-ecoex-model</code> bucket and re-synchronize the bucket to your notebook while it runs, using the <code>ovhai notebook pull-data</code> command. You will then find the notebook file in the corresponding directory.</p>



<p>Once you have imported the notebook file to your notebook instance, just open it and follow the instructions. If you are interested in the result but don&#8217;t want to run it yourself, here is a summary of what the notebook does:</p>



<ul class="wp-block-list"><li>It reads the clean data from the folder where it has been synchronized from the S3 bucket and plots a few figures.</li><li>In these two plots, you can see that the most represented price range is between 400 and 500k€, and that the price per m<sup>2</sup> is approximatively a gaussian curve centered on ~10k€. If you know Paris&#8217; real estate market, this shouldn&#8217;t surprise you.</li></ul>



<div class="inherit-container-width wp-block-group is-layout-constrained wp-block-group-is-layout-constrained"><div class="wp-block-group__inner-container">
<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="437" height="464" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/1-distrib_valeur-1.png" alt="" class="wp-image-21757" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/1-distrib_valeur-1.png 437w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/1-distrib_valeur-1-283x300.png 283w" sizes="auto, (max-width: 437px) 100vw, 437px" /></figure>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/2-distrib_m2-1.png" alt="" class="wp-image-21758" width="439" height="466" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/2-distrib_m2-1.png 878w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/2-distrib_m2-1-283x300.png 283w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/2-distrib_m2-1-768x815.png 768w" sizes="auto, (max-width: 439px) 100vw, 439px" /></figure>
</div></div>
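<p>For reference, the kind of code that produces these first plots is quite short. The sketch below, using pandas and Matplotlib, is an illustration only (the column names <code>valeur_fonciere</code> and <code>surface_reelle_bati</code> come from the DVF open data schema; the actual notebook in the repository may differ):</p>

<pre class="wp-block-code"><code class=""># Illustrative sketch: load the clean parquet data and plot the two distributions
import pandas as pd
import matplotlib.pyplot as plt

# /workspace/data is where the transactions-ecoex-clean bucket is mounted in the notebook
df = pd.read_parquet("/workspace/data/data_clean.parquet")
df["prix_m2"] = df["valeur_fonciere"] / df["surface_reelle_bati"]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
df["valeur_fonciere"].plot.hist(bins=50, ax=axes[0], title="Transaction price (EUR)")
df["prix_m2"].plot.hist(bins=50, ax=axes[1], title="Price per square metre (EUR)")
plt.show()</code></pre>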



<ul class="wp-block-list"><li>The next two plots show you the geographical repartition of the prices &#8211; absolute and per m<sup>2</sup>. They tell us an interesting story: if you look at the absolute prices, you can see that the West of Paris looks more expensive than the East. However, the price per m<sup>2</sup> shows that for en equal surface, it is the center of Paris that is more expensive. That means that the center of Paris is relatively more expensive but has smaller flats to offer, so probably less families there!</li></ul>



<div class="inherit-container-width wp-block-group is-layout-constrained wp-block-group-is-layout-constrained"><div class="wp-block-group__inner-container">
<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/3-heatmap_valeur-1.png" alt="" class="wp-image-21761" width="564" height="403" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/3-heatmap_valeur-1.png 800w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/3-heatmap_valeur-1-300x215.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/3-heatmap_valeur-1-768x549.png 768w" sizes="auto, (max-width: 564px) 100vw, 564px" /></figure>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/4-heatmap_m2-1.png" alt="" class="wp-image-21762" width="577" height="411" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/4-heatmap_m2-1.png 806w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/4-heatmap_m2-1-300x214.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/4-heatmap_m2-1-768x547.png 768w" sizes="auto, (max-width: 577px) 100vw, 577px" /></figure>



<p></p>
</div></div>



<ul class="wp-block-list"><li>Finally, a last visualization plot shows you a correlation table of all numerical values contained in the data. Here are a few noteworthy facts:<ul><li>As expected, there is a very high correlation between the absolute price and the surface of a flat. This is common sense for whoever has tried to buy a flat in a city.</li><li>There is a positive correlation between the price per m<sup>2</sup> and the date of the transaction: again this is not a surprise as Paris&#8217; real estate steadily increases in value over the years.</li><li>There is however only a very slightly negative correlation between the surface of a flat and its price per m<sup>2</sup>. This means that the bigger a flat is, the lesser its surface is taken into account in the price. However, this negative correlation is very small so it doesn&#8217;t have that much effect.</li></ul></li></ul>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="748" height="697" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/5-correlation.png" alt="" class="wp-image-21763" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/5-correlation.png 748w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/5-correlation-300x280.png 300w" sizes="auto, (max-width: 748px) 100vw, 748px" /></figure>



<ul class="wp-block-list"><li>After that, the data is transformed a bit to be digestible for the neural network:<ul><li>It is divided in numerical data and categorical data (mainly the street and postal codes);</li><li>A statistical algorithm is applied to the numerical values to drop outliers that were not dropped in our processing phase by empiric methods;</li><li>The numerical values are normalized;</li><li>The categorical data is encoded in discreet numerical values;</li><li>Finally, the two datasets are merged again.</li></ul></li><li>Once this is done, we create a neural network and finally train a model on a subset of our data! After this first model has been trained, we test it on the remainder of the data. In the next plot, you can see a mapping between the prices predicted by this model and the actual prices. As you can see, our model fares quite poorly and severely undervalues flats most flats.</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/6-model1.png" alt="" class="wp-image-21764" width="811" height="791" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/6-model1.png 811w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/6-model1-300x293.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/6-model1-768x749.png 768w" sizes="auto, (max-width: 811px) 100vw, 811px" /><figcaption><em>Use the models built in this tutorial at your own risk&#8230;</em></figcaption></figure></div>



<ul class="wp-block-list"><li>After this disappointing result, we create a new neural network with more neurons per layer and train it again. The following plot shows its result on our testing set. While it remains quite unprecise and still slightly undervalues flats in average, the results are much better than with the previous model.</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="813" height="784" src="https://blog.ovhcloud.com/wp-content/uploads/2022/01/7-model2.png" alt="" class="wp-image-21765" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/01/7-model2.png 813w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/7-model2-300x289.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/01/7-model2-768x741.png 768w" sizes="auto, (max-width: 813px) 100vw, 813px" /></figure></div>



<ul class="wp-block-list"><li>Before stopping the notebook, we save both models to a local directory that will be synchronized to our <code>transactions-ecoex-model</code> S3 bucket.</li></ul>



<p>So, what can we conclude from all of this? First, even if the second model is obviously better than the first, it is still very noisy: while its predictions are not far from correct on average, there is still a huge variance. Where does this variance come from?</p>



<p>Well, it is not easy to say. To paraphrase the finishing part of my last article:</p>



<ul class="wp-block-list"><li>We trained this model on only two combinations of basic parameters (the number of neurons per layer). If we were to do that seriously, we would have launched tens or hundreds of trainings in parallel with other parameters and selected the best. There are almost certainly some sets of parameters that would have converged towards a better model.</li><li>Even with the best set of parameters the data is not perfect. How can it differentiate between two flats at the exact same address and with the same surface, one on the ground floor in a narrow street and in front of a very noisy bar, the other on the last floor, higher than the neighbouring buildings, South-oriented towards a calm public garden with a direct view on the Eiffel Tower? The answer is no, as these informations are not contained in the data. And yet, this second flat would probably be more expensive than the first one. Even a very well-trained model would almost certainly overvalue the first one and undervalue the second one.</li></ul>



<h3 class="wp-block-heading">Conclusion</h3>



<p>In this article, I tried to give you a glimpse of the tools that Data Scientists commonly use to manipulate data and train models at scale, in the Cloud or on their own infrastructure:</p>



<ul class="wp-block-list"><li>Object storage to store the raw and/or clean data;</li><li>Spark or any other distributed processing framework to clean the data &#8211; obviously alternatives exist for specific usecase, for example labelling tools to label images when the goal is to train an object detection in images model;</li><li>Notebooks running Machine-Learning frameworks such as Tensorflow, PyTorch, Scikit-Learn etc… to prototype and code algorithms.</li></ul>



<p>Hopefully, you now have a better understanding of how Machine Learning algorithms work, what their limitations are, and how Data Scientists work on data to create models.</p>



<p>As explained earlier, all the code used to obtain these results can be found <a href="https://github.com/Le-Gruty/real-estate-open-data" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">here</a>. Please don&#8217;t hesitate to replicate what I did or adapt it to other usecases!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fa-journey-into-the-wondrous-land-of-machine-learning-or-cleaning-data-is-funnier-than-cleaning-my-flat-part-3%2F&amp;action_name=A%20journey%20into%20the%20wondrous%20land%20of%20Machine%20Learning%2C%20or%20%E2%80%9CCleaning%20data%20is%20funnier%20than%20cleaning%20my%20flat%21%E2%80%9D%20%28Part%203%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>A journey through the wondrous land of Machine Learning or &#8220;Can I really buy a palace in Paris for 100,000€?&#8221; (Part 2)</title>
		<link>https://blog.ovhcloud.com/a-journey-through-the-wondrous-land-of-machine-learning-or-can-i-really-buy-a-palace-in-paris-for-100000e-part-2/</link>
		
		<dc:creator><![CDATA[Guillaume Ruty]]></dc:creator>
		<pubDate>Thu, 03 Sep 2020 14:52:09 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[A journey into the wondrous land of Machine Learning]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=19078</guid>

					<description><![CDATA[Spoiler alert, no you can&#8217;t. A few months ago, I explained how to use Dataiku &#8211; a well-known interactive AI studio &#8211; and how to use data, made available by the French government, to build a Machine Learning model predicting the market value of real estate. Sadly, it failed miserably: when I tried it on [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Spoiler alert, no you can&#8217;t.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/08/12B8273B-DCF4-4705-9E08-5FE05760102F-1024x537.png" alt="A journey through the wondrous land of Machine Learning or &quot;Can I really buy a palace in Paris for 100,000€?&quot; (Part 2)" class="wp-image-19147" width="768" height="403" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/08/12B8273B-DCF4-4705-9E08-5FE05760102F-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/12B8273B-DCF4-4705-9E08-5FE05760102F-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/12B8273B-DCF4-4705-9E08-5FE05760102F-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/12B8273B-DCF4-4705-9E08-5FE05760102F.png 1200w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure>



<p><a href="https://www.ovh.com/blog/a-journey-into-the-wondrous-land-of-machine-learning-or-did-i-get-ripped-off-part-1/" target="_blank" rel="noreferrer noopener" data-wpel-link="exclude">A few months ago</a>, I explained how to use Dataiku &#8211; a well-known interactive AI studio &#8211; and how to use data, made available by the French government, to build a Machine Learning model predicting the market value of real estate. Sadly, it failed miserably: when I tried it on the transactions made in my street, the same year I bought my flat, the model predicted that all of them had the same market value. </p>



<p>In this blog post, I will point out several reasons why our experiment failed, and then I will try to train a new model, taking into account what we will have learned.</p>



<h3 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-Whyourmodelfailed">Why our model failed</h3>



<p>There are several reasons why our model failed. Three of them stand out:</p>



<ul class="wp-block-list"><li>The open data format layout</li><li>The data variety</li><li>Dataiku&#8217;s default model parameters</li></ul>



<h4 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-OpenDataFormatLayout">Open Data Format Layout</h4>



<p>You can find a description of the data layout on the dedicated <a href="https://www.data.gouv.fr/fr/datasets/5cc1b94a634f4165e96436c1/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">webpage</a>. I will not list all the columns of the schema (there are 40 of them), but the most important one is the first one: id_mutation. This information is a unique transaction number, and not an unusual column to find. </p>



<p>However, if you look at the dataset itself, you will see that some transactions are spread over multiple lines. They correspond to transactions that group several parcels together. In the example of my own transaction, there are two lines: one for the flat itself, and one for a separate basement under the building. </p>



<p>The problem is that the full price is found on every such line. From the point of view of my AI studio, which only sees a set of lines it interprets as data points, it looks like my basement and my flat are two different properties that cost an equal amount! This gets worse for properties that have land and several buildings attached to them. How can we expect our algorithm to learn appropriately under these conditions?</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/08/8AD89D9B-B3AE-4D6E-B92D-9601A091B47B-1024x766.png" alt="Open Data Format Layout" class="wp-image-19145" width="768" height="575" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/08/8AD89D9B-B3AE-4D6E-B92D-9601A091B47B-1024x766.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/8AD89D9B-B3AE-4D6E-B92D-9601A091B47B-300x225.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/8AD89D9B-B3AE-4D6E-B92D-9601A091B47B-768x575.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/8AD89D9B-B3AE-4D6E-B92D-9601A091B47B.png 1248w" sizes="auto, (max-width: 768px) 100vw, 768px" /><figcaption>Dataiku doesn&#8217;t naturally understand that a data point can consist of multiple lines!</figcaption></figure></div>



<h4 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-DataVariety">Data Variety</h4>



<p>In this case, we are trying to predict the price of a flat in Paris. However, the data we gave the algorithm covers every real estate transaction made in France over the last few years. While you might think that more data is always better, this is not necessarily the case. </p>



<p>The real estate market changes according to where you are, and Paris is a very specific case in France, with prices being much higher than in other big cities and the rest of France. Of course, this can be seen in the data, but the training algorithm does not know that in advance, and it is very hard for it to learn how to price a small flat in Paris and a farm with acres of land in Lozère at the same time.&nbsp;</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/08/D673B22D-30EB-43F8-BC7B-D8EE0D25EEED-1024x563.png" alt="Data variety" class="wp-image-19141" width="768" height="422" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/08/D673B22D-30EB-43F8-BC7B-D8EE0D25EEED-1024x563.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/D673B22D-30EB-43F8-BC7B-D8EE0D25EEED-300x165.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/D673B22D-30EB-43F8-BC7B-D8EE0D25EEED-768x422.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/D673B22D-30EB-43F8-BC7B-D8EE0D25EEED.png 1253w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure>



<h4 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-Modeltrainingparameters">Model training parameters</h4>



<p>In the last blog post, you saw how easy it is to use Dataiku. But this comes at a price: the default settings work for very simple use cases, but they are not suited to complex tasks &#8211; like predicting real estate prices. I do not have much experience with Dataiku myself. However, by digging deeper into the details, I was able to correct a few obvious mistakes:</p>



<ul class="wp-block-list"><li>Data types: A lot of the columns in the dataset are specific types: integers, geographic coordinates, dates etc. Most of them are correctly identified by Dataiku, but some of them &#8211; such as geographic coordinates, or dates &#8211; are not.<br></li><li>Data analysis: If you remember the previous post, at one point we were looking at different models trained by the algorithm. We didn&#8217;t take the time to look at the design automated by the model. This section allows us to tweak several elements; such as the types of algorithms we run, the learning parameters, the choice of the dataset etc&#8230; <br><br>With so many features present in the dataset, Dataiku tried to reduce the number of features it would analyze, in order to simplify the learning algorithm. But it made poor choices. For example, it considers the street number but not the street itself. Even worse, it doesn&#8217;t even look at the date, or the parcels&#8217; surface area (but it does consider land surface when present&#8230;), which is by far the most important factor in most cities!</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0255-1024x418.png" alt="Auto Machine Learning did bad choices" class="wp-image-19234" width="768" height="314" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0255-1024x418.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0255-300x122.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0255-768x313.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0255.png 1476w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-Howtofixallofthat">How to fix all of that</h3>



<p>Fortunately, there are ways to solve these issues. Dataiku integrates tools to transform and filter your datasets before running your algorithms. It also allows you to change training parameters. Rather than walking you through all the steps, I&#8217;m going to summarize what I did for each of the issues we identified earlier:</p>



<h4 class="wp-block-heading">Data layout</h4>



<ul class="wp-block-list"><li>First, I grouped the lines that corresponded with the same transactions. Depending on the fields, I either summed them up (when it was a living area surface, for example), kept one of them (address), or concatenated them (when it was the identifier for an outbuilding, for example).</li><li>Second, I removed several unnecessary or redundant fields that add noise to the algorithm; such as street name (there are already per-city-unique street codes), street number suffix (&#8220;Bis&#8221; or &#8220;Ter&#8221; commonly found in an address after a street number) or other administration-related information.</li><li>Finally, some transactions contain not only several parcels (on several lines) but also several subparcels per parcel, each with its own surface and subparcel number. This subdivision is mostly administrative, and subparcels are often previously adjoining flats that have been reunited. To simplify the data, I cut the subparcel numbers and summed their respective surfaces, before regrouping the lines.</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0254-935x1024.png" alt="Cleaning data layout" class="wp-image-19232" width="701" height="768" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0254-935x1024.png 935w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0254-274x300.png 274w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0254-768x841.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0254.png 1248w" sizes="auto, (max-width: 701px) 100vw, 701px" /></figure></div>



<h4 class="wp-block-heading">Data variety</h4>



<ul class="wp-block-list"><li>First, as we are trying to train a model to estimate the price of Parisian flats, I filtered out all the transactions that didn&#8217;t happen in Paris (which as you can expect is most of it).<br></li><li>Second, I removed all the transactions that had incomplete data for important fields (such as surface or address).<br></li><li>Finally, I removed outliers: transactions corresponding to properties that don&#8217;t correspond to standard flats; such as houses, commercial land, very high-end flats etc&#8230;</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/08/674C0C3E-33DE-4AF1-B682-83654DD42F3B-1024x701.png" alt="Tackling data variety" class="wp-image-19146" width="768" height="526" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/08/674C0C3E-33DE-4AF1-B682-83654DD42F3B-1024x701.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/674C0C3E-33DE-4AF1-B682-83654DD42F3B-300x205.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/674C0C3E-33DE-4AF1-B682-83654DD42F3B-768x526.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/674C0C3E-33DE-4AF1-B682-83654DD42F3B.png 1527w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h4 class="wp-block-heading">Model training parameters</h4>



<ul class="wp-block-list"><li>Model training parameters:<ul><li>First, I made sure that the model considered all the features. Note: rather than removing unnecessary fields from the dataset, I could have just told the algorithm to ignore the corresponding features. However, my preference is to increase the readability of the dataset to make it easier to explore. Moreover, Dataiku loads data in RAM to process it, so making it run on a clean dataset makes it more RAM-efficient.<br></li><li>Second, I trained the algorithm on different sets of features: in some cases I kept the district but not the street. As there are a lot of different streets in Paris this is a categorical feature with high cardinality (lots of different possibilities that can&#8217;t be numerized).<br></li><li>Finally, I tried different families of Machine Learning algorithms: Random Forest &#8211; basically building a decision tree; XGBoost &#8211; gradient boosting; SVM (Support Vector Machine)- a generalization of linear classifiers; and KNN (K-Nearest-Neighbours) &#8211; which tries to categorize data points by looking at its neighbors according to different metrics.</li></ul></li></ul>



<h1 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-Diditwork?">Did it work?</h1>



<p>So, after all that, how did we fare? Well, first off, let us look at the R<sup>2</sup> score of our models. Depending on the training session, our best models have an R<sup>2</sup> score between 0.8 and 0.85. As a reminder, an R<sup>2</sup> score of 1 would mean that the model perfectly predicts the price of every data point used in the evaluation phase of training. The best models in our previous tries had an R<sup>2</sup> score between 0.1 and 0.2, so we are already clearly doing better here. Let us now look at a few predictions from this model.</p>



<p>First, I re-checked all the transactions from my street. This time, the prediction for my flat is ~16% lower than the price I paid. But unlike last time, every flat has a different estimate and these estimates are all in the correct order of magnitude. Most values have less than 20% error when compared to the real price, and the worst estimates have ~50% error. Obviously, this margin of error is unacceptable when investing in a flat. However, when compared to the previous iteration of our model &#8211; that returned the same estimate for all the flats in my street &#8211; we are making significant progress.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0250-1024x893.png" alt="A journey through the wondrous land of Machine Learning  - Did it work?" class="wp-image-19228" width="768" height="670" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0250-1024x893.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0250-300x262.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0250-768x669.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0250.png 1224w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>So, now that we at least have the correct order of magnitude, let&#8217;s try and tweak some values in our input dataset to see if the model reacts predictably. To do this, I took the data point of my own transaction and created new data points, each time by changing one of the features of the original data point:</p>



<ol class="wp-block-list"><li>the surface to reduce it</li><li>the coordinates (street name, street code, geographic coordinates, etc) to put it in a cheaper district</li><li>the date of transaction to year 2015 (3 years prior to the real date)</li></ol>



<p>With each of these modifications, we would expect the new estimate to be lower than the original one (real estate prices in Paris have been rising steadily). Let us look at the results:</p>
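


<p>Outside of Dataiku, this kind of sanity check is easy to script. The sketch below duplicates a single transaction and tweaks one feature at a time before asking the model for new estimates; the row selector, the column names and the <code>prepare_features</code> helper are all hypothetical placeholders for whatever your own pipeline uses.</p>



<pre class="wp-block-code"><code># Sanity-check sketch: perturb one feature at a time and compare the estimates.
# "model" is assumed to be an already-trained regressor, and prepare_features()
# is a hypothetical helper that applies the same encoding used for training.
import pandas as pd

original = df[df["id_mutation"] == "my-flat-transaction"].copy()  # hypothetical selector

smaller   = original.assign(surface_reelle_bati=original["surface_reelle_bati"] * 0.6)
elsewhere = original.assign(code_postal="75019")         # move it to a cheaper district
older     = original.assign(date_mutation="2015-06-01")  # pretend it was sold in 2015

variants = pd.concat([original, smaller, elsewhere, older])
print(model.predict(prepare_features(variants)))</code></pre>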



<figure class="wp-block-table"><table><thead><tr></tr></thead><tbody><tr><td class="has-text-align-center" data-align="center"><strong>Real Price</strong></td><td class="has-text-align-center" data-align="center"><strong>Original Estimate</strong></td><td class="has-text-align-center" data-align="center"><strong>Reduced Surface Estimate</strong></td><td class="has-text-align-center" data-align="center"><strong>Other District Estimate</strong></td><td class="has-text-align-center" data-align="center"><strong>Older Estimate</strong></td></tr><tr><td class="has-text-align-center" data-align="center">100%</td><td class="has-text-align-center" data-align="center">84%</td><td class="has-text-align-center" data-align="center">45%</td><td class="has-text-align-center" data-align="center">61%</td><td class="has-text-align-center" data-align="center">76%</td></tr></tbody></table></figure>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0251-1024x562.png" alt="Tweaking values in the data-set" class="wp-image-19230" width="768" height="422" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0251-1024x562.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0251-300x165.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0251-768x422.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0251-1536x843.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0251.png 1554w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>At least the model behaves in an appropriate way, qualitatively speaking.</p>



<h1 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-Howcouldwedobetter?">How could we do better?</h1>



<p>At this point, we have used common sense to significantly improve our previous results and build a model that gives predictions in the right order of magnitude and that behaves as we expect when we tweak the features of data points. However, the remaining margin of error makes it unsuitable for real-world applications. But why, and what could we do to keep improving our model? Well, there are several reasons:</p>



<p>Data complexity: I am going to contradict myself a little here. While complex data is harder for a Machine Learning algorithm to digest, it is necessary to preserve this complexity when it reflects a real complexity in what we are trying to predict. In this case, we might not only have oversimplified the data; the original data itself may also lack a lot of relevant information.</p>



<p>We trained our algorithm on general location and surface, which admittedly are the most important criteria, but our dataset lacks very important information such as floor, exposure, construction year, insulation diagnostics, accessibility, view, or the general condition of the flat.</p>



<p>There are private datasets built by notarial offices that are more complete than our open dataset, but while those might have features such as floor or construction year, they would probably still lack more subjective information, such as general condition or view.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/08/A72E1C7E-A66E-410A-A587-73AF9D538C95-1024x951.png" alt="Data complexity" class="wp-image-19143" width="768" height="713" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/08/A72E1C7E-A66E-410A-A587-73AF9D538C95-1024x951.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/A72E1C7E-A66E-410A-A587-73AF9D538C95-300x279.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/A72E1C7E-A66E-410A-A587-73AF9D538C95-768x713.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/08/A72E1C7E-A66E-410A-A587-73AF9D538C95.png 1395w" sizes="auto, (max-width: 768px) 100vw, 768px" /><figcaption>The dataset lacks very important information about the flats.</figcaption></figure>



<ul class="wp-block-list"><li>Data amount: Even if we had very complete data, we would need a vast amount of it. The more features we include in our training, the more data we need. And for such a complex task, the ~150K transactions per year we have in Paris are probably not enough. A solution could be to create artificial data points: flats that don&#8217;t really exist, but that human experts would still be able to evaluate. <br><br>But there are three issues with that: first, any bias in the experts would inevitably be passed on the model. Second, we would have to generate a huge number of artificial, but realistic data points and then would need the help of multiple human experts to label it. Finally, the aforementioned experts would label this artificial data based on their current experience. It would be very hard for them to remember the market prices from a few years ago. This means to have a homogeneous dataset over the years, we would have to create this artificial data over time and at the same pace as the real transactions happen.<br></li><li>Skills: Finally, being a data scientist is a full-time job that requires experience and skill. A real data scientist would probably be able to reach better results than what I obtained by adjusting the learning parameters and choosing the most appropriate algorithms.<br><br>Furthermore, even good data scientists would have to know their way around real estate and its pricing. It&#8217;s very hard to build advanced Machine Learning models without having a good comprehension of the topic at hand.</li></ul>



<h1 class="wp-block-heading" id="AjourneyinthewondrouslandofMachineLearning,or&quot;DoesaPalaceinParisreallycost100000€?&quot;(Part2)-Summary">Summary</h1>



<p>In this blog post, we discussed why our previous attempt at training a model to predict the price of flats in Paris failed. The data we used was not cleaned enough and we used Dataiku&#8217;s default training parameters rather than verifying that they made sense. </p>



<p>After that, we corrected our mistakes, cleaned the data and tweaked the training parameters. This improved the result of our model a lot, but not enough to use it realistically. There are ways to improve the model further, but the available datasets lack some information and the amount of data itself may not be sufficient to build a robust model. </p>



<p>Fortunately, the intent of this series was never to predict the price of flats in Paris perfectly. If that were possible, there would be no more real estate agencies. Instead, it serves as an illustration of how anyone can take raw data, find a problem related to that data, and train a model to tackle it.</p>



<p>However, the dataset that we used in this example was quite small: only a few gigabytes. Everything happened on a single VM and we had to do everything manually, on a fixed dataset. What would I do if I wanted to handle petabytes of data? If I wanted to handle continuously streaming data? If I wanted to expose my model so that external applications could query it? </p>



<p>That is what we are going to look at next time, in the final blog post of the series.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fa-journey-through-the-wondrous-land-of-machine-learning-or-can-i-really-buy-a-palace-in-paris-for-100000e-part-2%2F&amp;action_name=A%20journey%20through%20the%20wondrous%20land%20of%20Machine%20Learning%20or%20%26%238220%3BCan%20I%20really%20buy%20a%20palace%20in%20Paris%20for%20100%2C000%E2%82%AC%3F%26%238221%3B%20%28Part%202%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>A journey into the wondrous land of Machine Learning, or &#8220;Did I get ripped off?&#8221; (Part 1)</title>
		<link>https://blog.ovhcloud.com/a-journey-into-the-wondrous-land-of-machine-learning-or-did-i-get-ripped-off-part-1/</link>
		
		<dc:creator><![CDATA[Guillaume Ruty]]></dc:creator>
		<pubDate>Fri, 17 Apr 2020 15:31:58 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[A journey into the wondrous land of Machine Learning]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=17870</guid>

<description><![CDATA[Two years ago, I bought an apartment. It took me a year to get a proper idea of the real estate market: make some visits, get disappointed a couple of times, and finally find the flat of my dreams (and that I perceived to be priced appropriately for the market). But how exactly did I [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Two years ago, I bought an apartment. It took me a year to get a proper idea of the real estate market: make some visits, get disappointed a couple of times, and finally find the flat of my dreams (and that I perceived to be priced appropriately for the market). </p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/949613E6-BB64-45A6-9A57-9C20A4808B20-1024x537.png" alt="" class="wp-image-17908" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/949613E6-BB64-45A6-9A57-9C20A4808B20-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/949613E6-BB64-45A6-9A57-9C20A4808B20-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/949613E6-BB64-45A6-9A57-9C20A4808B20-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/949613E6-BB64-45A6-9A57-9C20A4808B20.png 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>But how exactly did I come up with this perceived &#8220;appropriate price&#8221;? Well, I saw a lot of ads for a lot of apartments. I found out that some characteristics were desirable and thus added market value (having a balcony, a good view, being in a safe district, etc.), and step by step I built up my perception based on evidence of how these features impacted the price of an apartment.</p>



<p>In other words, I <em>learned</em> from <em>experience</em>, and this knowledge allowed me to build a mental <em>model</em> of how much an apartment should cost depending on its characteristics.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="835" height="542" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/49984FA3-16CC-4251-93F1-8B3B655DD4B4.png" alt="" class="wp-image-17914" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/49984FA3-16CC-4251-93F1-8B3B655DD4B4.png 835w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/49984FA3-16CC-4251-93F1-8B3B655DD4B4-300x195.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/49984FA3-16CC-4251-93F1-8B3B655DD4B4-768x499.png 768w" sizes="auto, (max-width: 835px) 100vw, 835px" /></figure>



<p>But like every other human being, I am fallible, and thus keep asking myself the same question: &#8220;Did I get ripped off?&#8221;. Well, looking back at the process that led me there, it is evident that there is a way to follow the same process, but with more data, less bias, and hopefully in less time: Machine Learning.</p>



<p>In this series of blog posts, I am going to explore how Machine Learning could have helped me estimate the price of the apartment I bought; what tools you would need to do that, how you would proceed, and what difficulties you might encounter. Now, before we dive head first into this wondrous land, here is a quick disclaimer: the journey we are about to go through together serves as an illustration of how Machine Learning works and why it is not, in fact, magic.</p>



<p>You will not finish this journey with an algorithm that is able to price any apartment with extreme precision. You will not be the next real estate mogul (if it were that simple, I would be selling courses on how to become a millionaire on Facebook for 20€). Hopefully, however, if I have done my job correctly, you will understand why Machine Learning is simpler than you might think, but not as simple as it may appear. </p>



<h3 class="wp-block-heading">What do you need to build a model with Machine Learning?</h3>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="535" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/532C9B7B-0ADB-48DF-8A14-339CD19BD23F-1024x535.png" alt="" class="wp-image-17913" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/532C9B7B-0ADB-48DF-8A14-339CD19BD23F-1024x535.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/532C9B7B-0ADB-48DF-8A14-339CD19BD23F-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/532C9B7B-0ADB-48DF-8A14-339CD19BD23F-768x401.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/532C9B7B-0ADB-48DF-8A14-339CD19BD23F.png 1142w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h4 class="wp-block-heading">Labelled data</h4>



<p>In this example, for predicting property exchange value, we are looking for data that contains the characteristics of houses and apartments as well as the price they were sold for (the price being the label). A few years ago, the French government decided to build an <a href="https://www.data.gouv.fr/fr/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">open data platform</a> providing access to dozens of datasets related to public administration &#8211; such as a public drug database, real-time atmospheric pollutant concentrations and many others. Luckily for us, they included a dataset containing property transactions! </p>



<p>Well, here is <a href="https://www.data.gouv.fr/fr/datasets/demandes-de-valeurs-foncieres/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">our dataset</a>. If you look through the webpage, you will see that there are multiple community contributions, which all enrich the data, making it suitable for a macro-analysis. For our purposes we can use the <a href="https://www.data.gouv.fr/fr/datasets/5cc1b94a634f4165e96436c1/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">dataset provided by Etalab</a>, the public organisation responsible for maintaining the open data platform.</p>
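


<p>If you would rather take a first look at this dataset outside of any tool, a few lines of pandas are enough. This is just a sketch: the file name and most column names are assumptions about the Etalab CSV export, so adjust them to whatever you actually download.</p>



<pre class="wp-block-code"><code># Quick first look at the DVF dataset (file name and most column names are
# assumptions about the Etalab export; "valeur_fonciere" is the sale price).
import pandas as pd

dvf = pd.read_csv("full.csv.gz", low_memory=False)

print(dvf.shape)                       # number of transactions and columns
print(dvf[["date_mutation", "valeur_fonciere",
           "surface_reelle_bati", "code_postal"]].head())</code></pre>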



<h4 class="wp-block-heading">Software</h4>



<p>This is where it gets a bit tricky. If I were a gifted data scientist, I could simply rely on an environment &#8211; such as TensorFlow, PyTorch, or scikit-learn &#8211; to load my data, and find out which algorithm is best to train a model. From there, I could determine the optimal learning parameters and then train it. Unfortunately, I am not a gifted data scientist. But, in my misfortune, I am still lucky: there are tools &#8211; generally grouped under the name &#8220;AI Studios&#8221; &#8211; which are developed exactly for this purpose, yet require none of those skills. In this example, we will use <a href="https://www.dataiku.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Dataiku</a>, one of the most well-known and effective AI Studios.</p>



<h4 class="wp-block-heading">Infrastructure</h4>



<p>Obviously, the software has to run on a computer. I could try to install Dataiku on my own machine, but since we are on the OVHcloud blog, it seems fitting to deploy it in the cloud! As it turns out, you can easily deploy a Virtual Machine with Dataiku pre-installed on our <a href="https://www.ovh.com/fr/cloud-apps/dataiku.xml" data-wpel-link="exclude">Public Cloud</a> (you could also do that on some of our competitors&#8217; public clouds) with a free community license. With the Dataiku tool, I should be able to upload the dataset, train a model, and deploy it. Easy, right?</p>



<h3 class="wp-block-heading">Working with your dataset</h3>



<p>Once you have installed your VM, Dataiku automatically launches a web server accessible on port 11000. To access it, simply go to your browser and type <code>http://&lt;your-VM-IP-address&gt;:11000</code>. You will then be greeted by the welcome screen prompting you to register for a free community license. Once all this is done, create your first project and follow the steps to import your first dataset (you should just have to drag and drop a few files) and save it.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="374" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/dataiku1-1024x374.png" alt="" class="wp-image-17872" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku1-1024x374.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku1-300x109.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku1-768x280.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku1-1536x560.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku1.png 1905w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="351" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/dataiku2-1024x351.png" alt="" class="wp-image-17873" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku2-1024x351.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku2-300x103.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku2-768x263.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku2-1536x526.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku2.png 1893w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>Once this is done, go to the &#8220;Flow&#8221; view of your project. You should see your dataset there. Select it and you will see a wide range of possible actions. </p>



<p>Now, let&#8217;s get straight to the point and train a model. To do that, go to the &#8220;Lab&#8221; action and select the &#8220;Quick Model&#8221; then &#8220;Prediction&#8221; options, with &#8220;valeur_fonciere&#8221; as the target variable (as we are trying to predict the price). </p>



<p>Since we are not experts, let&#8217;s try the Automated Machine Learning proposed by Dataiku with an interpretable model, and train it. Dataiku should automatically train two models, a decision tree and a regression, with one having a better score than the other (the default metric chosen is the R2 score, one of several metrics that can be used to evaluate the performance of a model). Click on the better model and you will be able to deploy it. You are now able to use this model on any dataset following the same scheme.</p>
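


<p>As a point of comparison, here is roughly what those two &#8220;quick models&#8221; look like when written by hand with scikit-learn. This is only a sketch of the idea, not what Dataiku actually runs: the feature columns are illustrative, and <code>dvf</code> is the dataframe loaded in the earlier snippet.</p>



<pre class="wp-block-code"><code># Hand-rolled equivalent of the two "quick models": a decision tree and a linear
# regression, both scored with R2 (feature columns are illustrative assumptions).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

data = dvf.dropna(subset=["valeur_fonciere", "surface_reelle_bati",
                          "nombre_pieces_principales"])
X = data[["surface_reelle_bati", "nombre_pieces_principales"]]
y = data["valeur_fonciere"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for model in (DecisionTreeRegressor(max_depth=8), LinearRegression()):
    model.fit(X_train, y_train)
    score = r2_score(y_test, model.predict(X_test))
    print(type(model).__name__, round(score, 3))</code></pre>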



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="452" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/dataiku3-1024x452.png" alt="" class="wp-image-17874" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku3-1024x452.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku3-300x133.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku3-768x339.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku3-1536x679.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku3.png 1892w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="475" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/dataiku4-1024x475.png" alt="" class="wp-image-17875" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku4-1024x475.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku4-300x139.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku4-768x356.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku4-1536x712.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku4.png 1904w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>Now that the model is trained, let&#8217;s try it and see if it predicts the correct price for the apartment I bought! For very obvious reasons, I will not share with you my address, the price of my apartment, or other such personal information. Therefore, for the remainder of this post, I will pretend that I bought my apartment for 100€ (lucky me!) and normalize every other price the same way.</p>



<p>As I mentioned above, in order to query the model, we need to build a new dataset containing the data we want to test. In my case, I just had to build a new csv file from the header (you have to keep it in the file for Dataiku to understand it) and the line that relates to the actual transaction I made (easy, as it was already there). </p>



<p>If I wanted to do that for a property I intended to buy, I would have to gather as much information as possible to best fill the fields in the scheme (address, surface, postal code, geographic coordinates, etc.) and build that line myself. At this stage, I am also able to build a dataset with multiple lines to query the model on several cases at once. </p>
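


<p>In code, building such a scoring file could look like the sketch below: reuse the exact header of the original dataset, fill in one row per property you want to estimate, and export it as a CSV to upload into Dataiku. The example values and column names are, of course, made up.</p>



<pre class="wp-block-code"><code># Hypothetical sketch: build a one-row CSV with the same header as the training
# data, ready to be uploaded and scored (all values and column names are made up).
import pandas as pd

to_score = pd.DataFrame(
    [{
        "adresse_nom_voie": "RUE IMAGINAIRE",
        "code_postal": "75011",
        "surface_reelle_bati": 45,
        "nombre_pieces_principales": 2,
    }],
    columns=dvf.columns,  # keep every column of the original dataset, even if empty
)

to_score.to_csv("to_score.csv", index=False)  # then import this file into Dataiku</code></pre>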



<p>Once this new dataset is built, just upload it like the first time and go back to the flow view. Your new dataset should appear there. Click on the model on the right of the flow and then on the &#8220;Score&#8221; action. Select your sample dataset, go with the default options and run the job. You should now have a new dataset appearing in your flow. If you explore this dataset, you will see that there is a new column at the end, containing the predictions!</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="415" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/dataiku5-1024x415.png" alt="" class="wp-image-17876" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku5-1024x415.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku5-300x122.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku5-768x311.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku5-1536x623.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku5.png 1907w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="421" src="https://www.ovh.com/blog/wp-content/uploads/2020/04/dataiku6-1024x421.png" alt="" class="wp-image-17878" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku6-1024x421.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku6-300x123.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku6-768x316.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku6-1536x631.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/04/dataiku6.png 1900w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>In my case, for my 100€ apartment, the price predicted is 105€. That can mean two things:</p>



<ul class="wp-block-list"><li>The model we trained is quite good, and the property was a good deal! </li></ul>



<p>Or&#8230;</p>



<ul class="wp-block-list"><li>The model made a lucky guess.</li></ul>



<p>Let&#8217;s give it a go on the other transactions that took place on the same street in the same year. Well, this time the result is underwhelming: apparently, every apartment bought on my street in 2018 is worth exactly 105€, regardless of its size or features. If I had known that, I would probably have bought a bigger apartment! It would appear that our model is not as smart as we initially thought, and we still have work to do&#8230;</p>



<p>In this post, we explored where machine learning might come in handy, which data would be helpful, what software we need to make use of it, and what infrastructure is required to run that software. We found that everyone can give machine learning a go &#8211; it is not magic &#8211; but we also found that we would have to work a bit harder to get good results. Indeed, rushing into the wondrous journey as we did, we did not take the time to look more closely at the data itself and how to make it easier for the model to exploit &#8211; nor did we look at what the model was actually observing when predicting a price. If we had, we would have realized that it didn&#8217;t make sense. In other words, we did not <em>qualify</em> our problem enough. But let that not diminish our enthusiasm on our journey, for this is merely a small setback!</p>



<p>In the next post, we will go further on our journey and understand why the model we trained was ultimately useless: we will look more precisely at the data we have at our disposal and will find multiple ways to make it more suitable for the machine learning algorithms &#8211; following simple, common sense guidelines. Hopefully, with better data, we will then be able to greatly improve our model.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fa-journey-into-the-wondrous-land-of-machine-learning-or-did-i-get-ripped-off-part-1%2F&amp;action_name=A%20journey%20into%20the%20wondrous%20land%20of%20Machine%20Learning%2C%20or%20%26%238220%3BDid%20I%20get%20ripped%20off%3F%26%238221%3B%20%28Part%201%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
