Spoiler alert: no, you can’t.
A few months ago, I explained how to use Dataiku, a well-known interactive AI studio, together with open data made available by the French government, to build a Machine Learning model predicting the market value of real estate. Sadly, it failed miserably: when I tried it on the transactions made in my street during the year I bought my flat, the model predicted that all of them had the same market value.
In this blog post, I will point out several reasons why our experiment failed, and then I will train a new model that takes into account what we have learned.
Why our model failed
There are several reasons why our model failed. Three of them stand out:
- The open data format layout
- The data variety
- Dataiku’s default model parameters
Open Data Format Layout
You can find a description of the data layout on the dedicated webpage. I will not list all 40 columns of the schema, but the most important one is the first: id_mutation. It is a unique transaction identifier, which is not an unusual column to find.
However, if you look at the dataset itself, you will see that some transactions are spread over multiple lines. These correspond to transactions that group several parcels together. In the case of my own transaction, there are two lines: one for the flat itself, and one for a separate basement under the building.
The problem is that the full price appears on every one of these lines. From the point of view of my AI studio, which only sees a set of lines it interprets as data points, it looks like my basement and my flat are two different properties that each cost the same amount! This gets worse for properties that have land and several buildings attached to them. How can we expect our algorithm to learn appropriately under these conditions?
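To make the issue concrete, here is a toy illustration in pandas (this is not part of the Dataiku flow; the column names follow the published DVF schema, and the values are made up):

```python
import pandas as pd

# Toy excerpt mimicking the DVF layout: one transaction (id_mutation)
# split over two lines, each carrying the *full* sale price.
df = pd.DataFrame({
    "id_mutation": ["2018-12345", "2018-12345"],
    "valeur_fonciere": [450_000.0, 450_000.0],    # full price repeated on each line
    "type_local": ["Appartement", "Dépendance"],  # the flat and its basement
    "surface_reelle_bati": [55.0, 8.0],
})

# A naive learner sees two unrelated data points: an 8 m² basement
# apparently "worth" as much as a 55 m² flat.
print(df)
```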
Data Variety
In this case, we are trying to predict the price of a flat in Paris. However, the data we gave the algorithm covers every real estate transaction made in France over the last few years. While you might think that more data is always better, this is not necessarily the case.
The real estate market changes according to where you are, and Paris is a very specific case in France, with prices being much higher than in other big cities and the rest of France. Of course, this can be seen in the data, but the training algorithm does not know that in advance, and it is very hard for it to learn how to price a small flat in Paris and a farm with acres of land in Lozère at the same time.
Model training parameters
In the last blog post, you saw how easy it is to use Dataiku. But this ease comes at a price: the default script can handle very simple use cases, but it is not suited for complex tasks like predicting real estate prices. I do not have much experience with Dataiku myself. However, by digging deeper into the details, I was able to correct a few obvious mistakes:
- Data types: A lot of the columns in the dataset have specific types: integers, geographic coordinates, dates, etc. Most of them are correctly identified by Dataiku, but some, such as geographic coordinates or dates, are not.
- Data analysis: If you remember the previous post, at one point we were looking at the different models trained by the algorithm, but we didn’t take the time to look at the design Dataiku had generated automatically. This section lets us tweak several elements, such as the types of algorithms we run, the learning parameters, the choice of the dataset, and so on.
With so many features present in the dataset, Dataiku tried to reduce the number of features it would analyze, in order to simplify the learning problem. But it made poor choices. For example, it considers the street number but not the street itself. Even worse, it ignores the date and the parcels’ surface area (although it does consider land surface when present), even though surface is by far the most important factor in most cities!
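For readers who prefer code to screenshots, here is roughly what the type fixes amount to outside Dataiku. This is a minimal pandas sketch: the file name and column names follow the published DVF distribution, but the exact options are assumptions on my part, not what Dataiku does internally.

```python
import pandas as pd

# Load the consolidated DVF file (here assumed to be "full.csv") with explicit types.
dvf = pd.read_csv(
    "full.csv",
    parse_dates=["date_mutation"],        # dates, not plain strings
    dtype={
        "code_postal": "string",          # keep leading zeros ("07500", not 7500)
        "code_departement": "string",     # "75", "2A", ... are codes, not numbers
        "adresse_code_voie": "string",    # the street code is an identifier, not a quantity
    },
)

# Geographic coordinates should be numeric features; coerce anything malformed to NaN.
dvf["latitude"] = pd.to_numeric(dvf["latitude"], errors="coerce")
dvf["longitude"] = pd.to_numeric(dvf["longitude"], errors="coerce")
```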
How to fix all of that
Fortunately, there are ways to solve these issues. Dataiku integrates tools to transform and filter your datasets before running your algorithms. It also allows you to change training parameters. Rather than walking you through all the steps, I’m going to summarize what I did for each of the issues we identified earlier:
Data layout
- First, I grouped the lines that corresponded to the same transaction (see the sketch after this list). Depending on the field, I either summed the values (living area surface, for example), kept one of them (address), or concatenated them (the identifier of an outbuilding, for example).
- Second, I removed several unnecessary or redundant fields that add noise for the algorithm, such as the street name (there are already street codes that are unique within each city), the street number suffix (“Bis” or “Ter”, commonly found after a street number in French addresses), and other administrative information.
- Finally, some transactions contain not only several parcels (on several lines) but also several subparcels per parcel, each with its own surface and subparcel number. This subdivision is mostly administrative, and subparcels are often previously adjoining flats that have been merged. To simplify the data, I dropped the subparcel numbers and summed their respective surfaces before regrouping the lines.
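Outside Dataiku, this regrouping step could be sketched with a pandas groupby on the DataFrame from the previous sketch. The aggregation choices mirror the list above; the exact set of columns kept is an assumption:

```python
# One output row per transaction (id_mutation), built from the `dvf` DataFrame above.
grouped = dvf.groupby("id_mutation").agg(
    valeur_fonciere=("valeur_fonciere", "first"),        # the full price is repeated, keep it once
    date_mutation=("date_mutation", "first"),
    code_departement=("code_departement", "first"),
    adresse_code_voie=("adresse_code_voie", "first"),    # keep one street code per transaction
    longitude=("longitude", "first"),
    latitude=("latitude", "first"),
    surface_reelle_bati=("surface_reelle_bati", "sum"),  # sum the living surfaces across lines
    surface_terrain=("surface_terrain", "sum"),          # sum the land surfaces across lines
    type_local=("type_local", lambda s: "|".join(sorted(set(s.dropna())))),  # concatenate property types
).reset_index()
```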
Data variety
- First, as we are trying to train a model to estimate the price of Parisian flats, I filtered out all the transactions that didn’t happen in Paris (which, as you can expect, is most of them).
- Second, I removed all the transactions that had incomplete data for important fields (such as surface or address).
- Finally, I removed outliers: transactions for properties that are not standard flats, such as houses, commercial premises, very high-end flats, and so on (a sketch of these filters follows this list).
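As above, here is what these filters could look like in pandas, applied to the regrouped DataFrame from the previous sketch. The thresholds are illustrative assumptions, not the exact values I used in Dataiku:

```python
# Keep Paris only, then drop rows with missing values in the important fields.
paris = grouped[grouped["code_departement"] == "75"]
paris = paris.dropna(subset=["valeur_fonciere", "surface_reelle_bati", "adresse_code_voie"])

# Keep standard flats and drop extreme outliers.
paris = paris[paris["type_local"].str.contains("Appartement")]
paris = paris[~paris["type_local"].str.contains("Maison|Local")]       # no houses or commercial premises
paris = paris[paris["surface_reelle_bati"].between(9, 250)]            # from the legal minimum to large flats
paris = paris[paris["valeur_fonciere"].between(50_000, 5_000_000)]     # drop symbolic and very high-end sales
```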
Model training parameters
- First, I made sure that the model considered all the features. Note: rather than removing unnecessary fields from the dataset, I could have just told the algorithm to ignore the corresponding features. However, my preference is to increase the readability of the dataset to make it easier to explore. Moreover, Dataiku loads data in RAM to process it, so making it run on a clean dataset makes it more RAM-efficient.
- Second, I trained the algorithm on different sets of features: in some cases I kept the district but not the street. Since there are a lot of different streets in Paris, the street is a categorical feature with high cardinality (many distinct values that cannot easily be turned into numbers).
- Finally, I tried different families of Machine Learning algorithms (a sketch of such a comparison follows this list): Random Forest, which builds an ensemble of decision trees; XGBoost, a gradient boosting method; SVM (Support Vector Machine), a generalization of linear classifiers; and KNN (K-Nearest-Neighbours), which estimates a data point by looking at its nearest neighbours according to some distance metric.
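Here is roughly what such a comparison looks like outside Dataiku, using scikit-learn on the cleaned `paris` DataFrame from the earlier sketches. The feature subset (surface and coordinates) is an assumption; the categorical street and district features would need to be encoded before they could be added:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Assumed numeric feature subset and target.
X = paris[["surface_reelle_bati", "longitude", "latitude"]]
y = paris["valeur_fonciere"]

models = {
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0)),
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10)),
    # xgboost.XGBRegressor could be added the same way if the xgboost package is installed.
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.3f}")
```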
Did it work?
So, after all that, how did we fare? Well, first off, let us look at the R2 score of our models. Depending on the training session, our best models have an R2 score between 0.8 and 0.85. As a reminder, an R2 score of 1 would mean that the model perfectly predicts the price of every data point used in the evaluation phase of training. The best models in our previous tries had an R2 score between 0.1 and 0.2, so we are already clearly better here. Let us now look at a few predictions from this model.
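Before diving into those predictions, a quick note on what the R2 score measures: it is 1 minus the ratio between the model’s squared errors and the variance of the target. Here is a minimal sketch of computing it on a held-out split, using scikit-learn rather than Dataiku’s internal evaluation (the 80/20 split is an assumption):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hold out 20% of the cleaned Paris data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print(r2_score(y_test, model.predict(X_test)))  # 1.0 = perfect predictions, 0.0 = no better than the mean
```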
First, I re-checked all the transactions from my street. This time, the prediction for my flat is ~16% lower than the price I paid. But unlike last time, every flat gets a different estimate, and these estimates are all in the correct order of magnitude. Most values are within 20% of the real price, and the worst estimates are off by ~50%. Obviously, this margin of error is unacceptable when investing in a flat. However, compared to the previous iteration of our model, which returned the same estimate for every flat in my street, we are making significant progress.
So, now that we at least have the correct order of magnitude, let’s tweak some values in our input dataset to see if the model reacts predictably. To do this, I took the data point of my own transaction and created new data points, each time changing one feature of the original (a sketch of this procedure follows the list):
- the surface, reducing it
- the location (street name, street code, geographic coordinates, etc.), moving it to a cheaper district
- the transaction date, moving it back to 2015 (three years before the real date)
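Concretely, this sanity check could be scripted as below, building on the earlier sketches. `MY_ROW` and the perturbed values are placeholders, not my actual transaction:

```python
import pandas as pd

# Clone one data point and perturb one feature at a time.
MY_ROW = paris.index[0]  # hypothetical: pretend the first remaining row is my own transaction
original = paris.loc[[MY_ROW], ["surface_reelle_bati", "longitude", "latitude"]]

smaller = original.copy()
smaller["surface_reelle_bati"] *= 0.6     # shrink the surface

elsewhere = original.copy()
elsewhere["longitude"] = 2.39             # illustrative coordinates of a cheaper district
elsewhere["latitude"] = 48.89

# The date shift would be tested the same way if the transaction date were encoded as a feature.
variants = pd.concat([original, smaller, elsewhere], keys=["original", "smaller", "elsewhere"])
print(model.predict(variants))            # we expect each variant to be priced below the original
```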
With each of these modifications, we would expect the new estimate to be lower than the original one (the Paris real estate market has been rising steadily for years). Let us look at the results:
| Real Price | Original Estimate | Reduced Surface Estimate | Other District Estimate | Older Estimate |
| --- | --- | --- | --- | --- |
| 100% | 84% | 45% | 61% | 76% |
At least the model behaves in an appropriate way, qualitatively speaking.
How could we do better?
At this point, we have used common sense to significantly improve our previous results and build a model that gives predictions in the right order of magnitude and that behaves as expected when we tweak the features of data points. However, the remaining margin of error makes it unsuitable for real-world applications. But why, and what could we do to keep improving our model? Well, there are several reasons:
- Data complexity: I am going to contradict myself a little. While complex data is harder for a Machine Learning algorithm to digest, that complexity must be preserved when it reflects real complexity in the quantity we are predicting. In this case, we might not only have oversimplified the data: the original data itself may lack a lot of relevant information.
We trained our algorithm on general location and surface, which admittedly are the most important criteria, but our dataset lacks very important information such as the floor, exposure, construction year, insulation diagnostics, accessibility, view, the general state of the flats, and so on.
There are private datasets built by notarial offices that are more complete than our open dataset, but while those might have features such as the floor or the construction year, they would probably lack more subjective information, such as general state or view.
- Data amount: Even if we had very complete data, we would need a vast amount of it. The more features we include in our training, the more data we need. And for such a complex task, the ~150K transactions per year we have in Paris are probably not enough. A solution could be to create artificial data points: flats that don’t really exist, but that human experts would still be able to evaluate.
But there are three issues with that: first, any bias in the experts would inevitably be passed on to the model. Second, we would have to generate a huge number of artificial but realistic data points, and then we would need the help of multiple human experts to label them. Finally, the aforementioned experts would label this artificial data based on their current experience; it would be very hard for them to remember the market prices from a few years ago. This means that, to have a homogeneous dataset over the years, we would have to create this artificial data over time, at the same pace as the real transactions happen.
- Skills: Finally, being a data scientist is a full-time job that requires experience and skill. A real data scientist would probably be able to reach better results than I did by adjusting the learning parameters and choosing the most appropriate algorithms.
Furthermore, even good data scientists would have to know their way around real estate and its pricing. It’s very hard to build advanced Machine Learning models without having a good comprehension of the topic at hand.
Summary
In this blog post, we discussed why our previous attempt at training a model to predict the price of flats in Paris failed. The data we used was not cleaned enough and we used Dataiku’s default training parameters rather than verifying that they made sense.
After that, we corrected our mistakes, cleaned the data and tweaked the training parameters. This improved the result of our model a lot, but not enough to use it realistically. There are ways to improve the model further, but the available datasets lack some information and the amount of data itself may not be sufficient to build a robust model.
Fortunately, the intent of this series was never to predict the price of flats in Paris perfectly. If it were possible, there would be no more real estate agencies. Instead, it serves as an illustration of how anyone can take raw data, find a problem related to that data, and train a model to tackle it.
However, the dataset that we used in this example was quite small: only a few gigabytes. Everything happened on a single VM and we had to do everything manually, on a fixed dataset. What would I do if I wanted to handle petabytes of data? If I wanted to handle continuously streaming data? If I wanted to expose my model so that external applications could query it?
That is what we are going to look at next time, in the final blog post of the series.