Chapter 9: Tabular Modeling Deep Dive
Tabular modeling takes data in the form of a table (like a spreadsheet or CSV). The objective is to predict the value in one column based on the values in the other columns. In this chapter we will not only look at deep learning but also more general machine learning techniques like random forests, as they can give better results depending on your problem.
We will look at how we should preprocess and clean the data as well as how to interpret the result of our models after training, but first, we will see how we can feed columns that contain categories into a model that expects numbers by using embeddings.
In tabular data some columns may contain numerical data, like "age," while others contain string values, like "sex." The numerical data can be directly fed to the model (with some optional preprocessing), but the other columns need to be converted to numbers. Since the values in those columns correspond to different categories, we often call this type of variable a categorical variable. The first type are called continuous variables.
jargon:Continuous and Categorical Variables: Continuous variables are numerical data, such as "age," that can be directly fed to the model, since you can add and multiply them directly. Categorical variables contain a number of discrete levels, such as "movie ID," for which addition and multiplication don't have meaning (even if they're stored as numbers).
At the end of 2015, the Rossmann sales competition ran on Kaggle. Competitors were given a wide range of information about various stores in Germany, and were tasked with trying to predict sales on a number of days. The goal was to help the company to manage stock properly and be able to satisfy demand without holding unnecessary inventory. The official training set provided a lot of information about the stores. It was also permitted for competitors to use additional data, as long as that data was made public and available to all participants.
One of the gold medalists used deep learning, in one of the earliest known examples of a state-of-the-art deep learning tabular model. Their method involved far less feature engineering, based on domain knowledge, than those of the other gold medalists. The paper, "Entity Embeddings of Categorical Variables" describes their approach. In an online-only chapter on the book's website we show how to replicate it from scratch and attain the same accuracy shown in the paper. In the abstract of the paper the authors (Cheng Guo and Felix Berkhahn) say:
:Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but more importantly by mapping similar values close to each other in the embedding space it reveals the intrinsic properties of the categorical variables... [It] is especially useful for datasets with lots of high cardinality features, where other methods tend to overfit... As entity embedding defines a distance measure for categorical variables it can be used for visualizing categorical data and for data clustering.
We have already noticed all of these points when we built our collaborative filtering model. We can clearly see that these insights go far beyond just collaborative filtering, however.
The paper also points out that (as we discussed in the last chapter) an embedding layer is exactly equivalent to placing an ordinary linear layer after every one-hot-encoded input layer. The authors used the diagram in <
The insight is important because we already know how to train linear layers, so this shows that from the point of view of the architecture and our training algorithm the embedding layer is just another layer. We also saw this in practice in the last chapter, when we built a collaborative filtering neural network that looks exactly like this diagram.
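To see this equivalence concretely, here is a tiny PyTorch sketch (our own illustration, not code from the paper or from this book's notebooks):

import torch
import torch.nn.functional as F

# An embedding lookup gives the same result as multiplying a one-hot encoding
# by the embedding's weight matrix -- i.e., an embedding layer is just an
# optimized linear layer applied to one-hot-encoded inputs.
emb = torch.nn.Embedding(num_embeddings=5, embedding_dim=3)
idx = torch.tensor([2])                                  # the index of one level
one_hot = F.one_hot(idx, num_classes=5).float()          # shape (1, 5)
assert torch.allclose(emb(idx), one_hot @ emb.weight)    # both are shape (1, 3)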
Where we analyzed the embedding weights for movie reviews, the authors of the entity embeddings paper analyzed the embedding weights for their sales prediction model. What they found was quite amazing, and illustrates their second key insight. This is that the embedding transforms the categorical variables into inputs that are both continuous and meaningful.
The images in <
On the left is a plot of the embedding matrix for the possible values of the State
category. For a categorical variable we call the possible values of the variable its "levels" (or "categories" or "classes"), so here one level is "Berlin," another is "Hamburg," etc. On the right is a map of Germany. The actual physical locations of the German states were not part of the provided data, yet the model itself learned where they must be, based only on the behavior of store sales!
Do you remember how we talked about distance between embeddings? The authors of the paper plotted the distance between store embeddings against the actual geographic distance between the stores (see <
We've even tried plotting the embeddings for days of the week and months of the year, and found that days and months that are near each other on the calendar ended up close as embeddings too, as shown in <
What stands out in these two examples is that we provide the model fundamentally categorical data about discrete entities (e.g., German states or days of the week), and then the model learns an embedding for these entities that defines a continuous notion of distance between them. Because the embedding distance was learned based on real patterns in the data, that distance tends to match up with our intuitions.
In addition, it is valuable in its own right that embeddings are continuous, because models are better at understanding continuous variables. This is unsurprising considering models are built of many continuous parameter weights and continuous activation values, which are updated via gradient descent (a learning algorithm for finding the minimums of continuous functions).
Another benefit is that we can combine our continuous embedding values with truly continuous input data in a straightforward manner: we just concatenate the variables, and feed the concatenation into our first dense layer. In other words, the raw categorical data is transformed by an embedding layer before it interacts with the raw continuous input data. This is how fastai and Guo and Berkhahn handle tabular models containing continuous and categorical variables.
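As a rough sketch of what that concatenation looks like (the sizes and names below are made up for illustration; fastai's tabular model handles this for you):

import torch

# Look up embeddings for a categorical column, concatenate them with the
# continuous columns, and feed the result to the first dense layer.
emb = torch.nn.Embedding(16, 4)        # a categorical column with 16 levels -> 4-dim embedding
cat_idx = torch.tensor([3, 7])         # level indices for a batch of 2 rows
cont = torch.randn(2, 5)               # the same 2 rows' 5 continuous columns

x = torch.cat([emb(cat_idx), cont], dim=1)   # shape (2, 4 + 5)
first_layer = torch.nn.Linear(9, 100)        # the first dense layer sees both kinds of input
activations = first_layer(x)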
An example using this concatenation approach is how Google does its recommendations on Google Play, as explained in the paper "Wide & Deep Learning for Recommender Systems". <
Interestingly, the Google team actually combined both approaches we saw in the previous chapter: the dot product (which they call cross product) and neural network approaches.
Let's pause for a moment. So far, the solution to all of our modeling problems has been: train a deep learning model. And indeed, that is a pretty good rule of thumb for complex unstructured data like images, sounds, natural language text, and so forth. Deep learning also works very well for collaborative filtering. But it is not always the best starting point for analyzing tabular data.
Most machine learning courses will throw dozens of different algorithms at you, with a brief technical description of the math behind them and maybe a toy example. You're left confused by the enormous range of techniques shown and have little practical understanding of how to apply them.
The good news is that modern machine learning can be distilled down to a couple of key techniques that are widely applicable. Recent studies have shown that the vast majority of datasets can be best modeled with just two methods:
 Ensembles of decision trees (i.e., random forests and gradient boosting machines), mainly for structured data (such as you might find in a database table at most companies)
 Multilayered neural networks learned with SGD (i.e., shallow and/or deep learning), mainly for unstructured data (such as audio, images, and natural language)
Although deep learning is nearly always clearly superior for unstructured data, these two approaches tend to give quite similar results for many kinds of structured data. But ensembles of decision trees tend to train faster, are often easier to interpret, do not require special GPU hardware for inference at scale, and often require less hyperparameter tuning. They have also been popular for quite a lot longer than deep learning, so there is a more mature ecosystem of tooling and documentation around them.
Most importantly, the critical step of interpreting a model of tabular data is significantly easier for decision tree ensembles. There are tools and methods for answering the pertinent questions, like: Which columns in the dataset were the most important for your predictions? How are they related to the dependent variable? How do they interact with each other? And which particular features were most important for some particular observation?
Therefore, ensembles of decision trees are our first approach for analyzing a new tabular dataset.
The exception to this guideline is when the dataset meets one of these conditions:
 There are some high-cardinality categorical variables that are very important ("cardinality" refers to the number of discrete levels representing categories, so a high-cardinality categorical variable is something like a zip code, which can take on thousands of possible levels).
 There are some columns that contain data that would be best understood with a neural network, such as plain text data.
In practice, when we deal with datasets that meet these exceptional conditions, we always try both decision tree ensembles and deep learning to see which works best. It is likely that deep learning will be a useful approach in our example of collaborative filtering, as we have at least two high-cardinality categorical variables: the users and the movies. But in practice things tend to be less cut-and-dried, and there will often be a mixture of high- and low-cardinality categorical variables and continuous variables.
Either way, it's clear that we are going to need to add decision tree ensembles to our modeling toolbox!
Up to now we've used PyTorch and fastai for pretty much all of our heavy lifting. But these libraries are mainly designed for algorithms that do lots of matrix multiplication and derivatives (that is, stuff like deep learning!). Decision trees don't depend on these operations at all, so PyTorch isn't much use.
Instead, we will be largely relying on a library called scikit-learn (also known as sklearn). Scikit-learn is a popular library for creating machine learning models, using approaches that are not covered by deep learning. In addition, we'll need to do some tabular data processing and querying, so we'll want to use the Pandas library. Finally, we'll also need NumPy, since that's the main numeric programming library that both sklearn and Pandas rely on.
We don't have time to do a deep dive into all these libraries in this book, so we'll just be touching on some of the main parts of each. For a far more in-depth discussion, we strongly suggest Wes McKinney's Python for Data Analysis (O'Reilly). Wes is the creator of Pandas, so you can be sure that the information is accurate!
First, let's gather the data we will use.
The dataset we use in this chapter is from the Blue Book for Bulldozers Kaggle competition, which has the following description: "The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations."
This is a very common type of dataset and prediction problem, similar to what you may see in your project or workplace. The dataset is available for download on Kaggle, a website that hosts data science competitions.
Kaggle is an awesome resource for aspiring data scientists or anyone looking to improve their machine learning skills. There is nothing like getting hands-on practice and receiving real-time feedback to help you improve your skills.
Kaggle provides:
 Interesting datasets
 Feedback on how you're doing
 A leaderboard to see what's good, what's possible, and what's state-of-the-art
 Blog posts by winning contestants sharing useful tips and techniques
Until now all our datasets have been available to download through fastai's integrated dataset system. However, the dataset we will be using in this chapter is only available from Kaggle. Therefore, you will need to register on the site, then go to the page for the competition. On that page click "Rules," then "I Understand and Accept." (Although the competition has finished, and you will not be entering it, you still have to agree to the rules to be allowed to download the data.)
The easiest way to download Kaggle datasets is to use the Kaggle API. You can install this using pip
by running this in a notebook cell:
!pip install kaggle
You need an API key to use the Kaggle API; to get one, click on your profile picture on the Kaggle website, and choose My Account, then click Create New API Token. This will save a file called kaggle.json to your PC. You need to copy this key to your GPU server. To do so, open the file you downloaded, copy the contents, and paste them in the following cell in the notebook associated with this chapter (e.g., creds = '{"username":"xxx","key":"xxx"}'):
creds = ''
Then execute this cell (this only needs to be run once):
cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
Now you can download datasets from Kaggle! Pick a path to download the dataset to:
path = URLs.path('bluebook')
path
path.ls()
And use the Kaggle API to download the dataset to that path, and extract it:
if not path.exists():
    path.mkdir(parents=True)
    api.competition_download_cli('bluebook-for-bulldozers', path=path)
    file_extract(path/'bluebook-for-bulldozers.zip')

path.ls(file_type='text')
Now that we have downloaded our dataset, let's take a look at it!
Kaggle provides information about some of the fields of our dataset. The Data page explains that the key fields in train.csv are:

SalesID:: The unique identifier of the sale.
MachineID:: The unique identifier of a machine. A machine can be sold multiple times.
saleprice:: What the machine sold for at auction (only provided in train.csv).
saledate:: The date of the sale.
In any sort of data science work, it's important to look at your data directly to make sure you understand the format, how it's stored, what types of values it holds, etc. Even if you've read a description of the data, the actual data may not be what you expect. We'll start by reading the training set into a Pandas DataFrame. Generally it's a good idea to specify low_memory=False
unless Pandas actually runs out of memory and returns an error. The low_memory
parameter, which is True
by default, tells Pandas to only look at a few rows of data at a time to figure out what type of data is in each column. This means that Pandas can actually end up using different data types for different rows, which generally leads to data processing errors or model training problems later.
Let's load our data and have a look at the columns:
df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)
df.columns
That's a lot of columns for us to look at! Try looking through the dataset to get a sense of what kind of information is in each one. We'll shortly see how to "zero in" on the most interesting bits.
At this point, a good next step is to handle ordinal columns. This refers to columns containing strings or similar, but where those strings have a natural ordering. For instance, here are the levels of ProductSize
:
df['ProductSize'].unique()
We can tell Pandas about a suitable ordering of these levels like so:
sizes = 'Large','Large / Medium','Medium','Small','Mini','Compact'
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'].cat.set_categories(sizes, ordered=True, inplace=True)
df['ProductSize'].unique()
The most important data column is the dependent variable—that is, the one we want to predict. Recall that a model's metric is a function that reflects how good the predictions are. It's important to note what metric is being used for a project. Generally, selecting the metric is an important part of the project setup. In many cases, choosing a good metric will require more than just selecting a variable that already exists. It is more like a design process. You should think carefully about which metric, or set of metrics, actually measures the notion of model quality that matters to you. If no variable represents that metric, you should see if you can build the metric from the variables that are available.
However, in this case Kaggle tells us what metric to use: root mean squared log error (RMSLE) between the actual and predicted auction prices. We need to do only a small amount of processing to use this: we take the log of the prices, so that rmse
of that value will give us what we ultimately need:
dep_var = 'SalePrice'
df[dep_var] = np.log(df[dep_var])
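For reference, taking the log up front means that an ordinary RMSE computed on the transformed values is exactly the competition metric (written here with the log of the price itself, matching the np.log call above):

$\text{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log \hat{y}_i - \log y_i\right)^2} = \text{RMSE}(\log \hat{y}, \log y)$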
We are now ready to explore our first machine learning algorithm for tabular data: decision trees.
Decision tree ensembles, as the name suggests, rely on decision trees. So let's start there! A decision tree asks a series of binary (that is, yes or no) questions about the data. After each question the data at that part of the tree is split between a "yes" and a "no" branch, as shown in <
This sequence of questions is now a procedure for taking any data item, whether an item from the training set or a new one, and assigning that item to a group. Namely, after asking and answering the questions, we can say the item belongs to the same group as all the other training data items that yielded the same set of answers to the questions. But what good is this? The goal of our model is to predict values for items, not to assign them into groups from the training dataset. The value is that we can now assign a prediction value for each of these groups—for regression, we take the target mean of the items in the group.
Let's consider how we find the right questions to ask. Of course, we wouldn't want to have to create all these questions ourselves—that's what computers are for! The basic steps to train a decision tree can be written down very easily:
1. Loop through each column of the dataset in turn.
2. For each column, loop through each possible level of that column in turn.
3. Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable).
4. Find the average sale price for each of those two groups, and see how close that is to the actual sale price of each of the items of equipment in that group. That is, treat this as a very simple "model" where our predictions are simply the average sale price of the item's group.
5. After looping through all of the columns and all the possible levels for each, pick the split point that gave the best predictions using that simple model (a code sketch of this search follows the list).
6. We now have two different groups for our data, based on this selected split. Treat each of these as separate datasets, and find the best split for each by going back to step 1 for each group.
7. Continue this process recursively, until you have reached some stopping criterion for each group—for instance, stop splitting a group further when it has only 20 items in it.
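Here is that sketch, a rough illustration (our own, not the book's implementation, ignoring categorical columns) of finding the best split for a single numeric column, given NumPy arrays col and y:

import numpy as np

def best_split_for_column(col, y):
    # Try each candidate threshold in the column; score a split by the total
    # squared error we get when predicting each group with that group's mean.
    best_score, best_thresh = float('inf'), None
    for thresh in np.unique(col):
        mask = col <= thresh
        if mask.all() or not mask.any():
            continue                        # skip splits that leave one group empty
        lhs, rhs = y[mask], y[~mask]
        score = ((lhs - lhs.mean())**2).sum() + ((rhs - rhs.mean())**2).sum()
        if score < best_score:
            best_score, best_thresh = score, thresh
    return best_thresh, best_score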
Although this is an easy enough algorithm to implement yourself (and it is a good exercise to do so), we can save some time by using the implementation built into sklearn.
First, however, we need to do a little data preparation.
A:Here's a productive question to ponder. If you consider that the procedure for defining a decision tree essentially chooses one sequence of splitting questions about variables, you might ask yourself, how do we know this procedure chooses the correct sequence? The rule is to choose the splitting question that produces the best split (i.e., that most accurately separates the items into two distinct categories), and then to apply the same rule to the groups that split produces, and so on. This is known in computer science as a "greedy" approach. Can you imagine a scenario in which asking a “less powerful” splitting question would enable a better split down the road (or should I say down the trunk!) and lead to a better result overall?
The first piece of data preparation we need to do is to enrich our representation of dates. The fundamental basis of the decision tree that we just described is bisection— dividing a group into two. We look at the ordinal variables and divide up the dataset based on whether the variable's value is greater (or lower) than a threshold, and we look at the categorical variables and divide up the dataset based on whether the variable's level is a particular level. So this algorithm has a way of dividing up the dataset based on both ordinal and categorical data.
But how does this apply to a common data type, the date? You might want to treat a date as an ordinal value, because it is meaningful to say that one date is greater than another. However, dates are a bit different from most ordinal values in that some dates are qualitatively different from others in a way that is often relevant to the systems we are modeling.
In order to help our algorithm handle dates intelligently, we'd like our model to know more than whether a date is more recent or less recent than another. We might want our model to make decisions based on that date's day of the week, on whether a day is a holiday, on what month it is in, and so forth. To do this, we replace every date column with a set of date metadata columns, such as holiday, day of week, and month. These columns provide categorical data that we suspect will be useful.
fastai comes with a function that will do this for us—we just have to pass a column name that contains dates:
df = add_datepart(df, 'saledate')
Let's do the same for the test set while we're there:
df_test = pd.read_csv(path/'Test.csv', low_memory=False)
df_test = add_datepart(df_test, 'saledate')
We can see that there are now lots of new columns in our DataFrame:
' '.join(o for o in df.columns if o.startswith('sale'))
This is a good first step, but we will need to do a bit more cleaning. For this, we will use fastai objects called TabularPandas
and TabularProc
.
A second piece of preparatory processing is to be sure we can handle strings and missing data. Out of the box, sklearn cannot do either. Instead we will use fastai's class TabularPandas
, which wraps a Pandas DataFrame and provides a few conveniences. To populate a TabularPandas
, we will use two TabularProc
s, Categorify
and FillMissing
. A TabularProc
is like a regular Transform
, except that:
 It returns the exact same object that's passed to it, after modifying the object in place.
 It runs the transform once, when data is first passed in, rather than lazily as the data is accessed.
Categorify
is a TabularProc
that replaces a column with a numeric categorical column. FillMissing
is a TabularProc
that replaces missing values with the median of the column, and creates a new Boolean column that is set to True
for any row where the value was missing. These two transforms are needed for nearly every tabular dataset you will use, so this is a good starting point for your data processing:
procs = [Categorify, FillMissing]
TabularPandas
will also handle splitting the dataset into training and validation sets for us. However we need to be very careful about our validation set. We want to design it so that it is like the test set Kaggle will use to judge the contest.
Recall the distinction between a validation set and a test set, as discussed in <
We don't get to see the test set. But we do want to define our validation data so that it has the same sort of relationship to the training data as the test set will have.
In some cases, just randomly choosing a subset of your data points will do that. This is not one of those cases, because it is a time series.
If you look at the date range represented in the test set, you will discover that it covers a sixmonth period from May 2012, which is later in time than any date in the training set. This is a good design, because the competition sponsor will want to ensure that a model is able to predict the future. But it means that if we are going to have a useful validation set, we also want the validation set to be later in time than the training set. The Kaggle training data ends in April 2012, so we will define a narrower training dataset which consists only of the Kaggle training data from before November 2011, and we'll define a validation set consisting of data from after November 2011.
To do this we use np.where
, a useful function that returns (as the first element of a tuple) the indices of all True
values:
cond = (df.saleYear<2011) | (df.saleMonth<10)
train_idx = np.where( cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx),list(valid_idx))
TabularPandas
needs to be told which columns are continuous and which are categorical. We can handle that automatically using the helper function cont_cat_split
:
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
A TabularPandas
behaves a lot like a fastai Datasets
object, including providing train
and valid
attributes:
len(to.train),len(to.valid)
We can see that the data is still displayed as strings for categories (we only show a few columns here because the full table is too big to fit on a page):
to.show(3)
However, the underlying items are all numeric:
to.items.head(3)
The conversion of categorical columns to numbers is done by simply replacing each unique level with a number. The numbers associated with the levels are chosen consecutively as they are seen in a column, so there's no particular meaning to the numbers in categorical columns after conversion. The exception is if you first convert a column to a Pandas ordered category (as we did for ProductSize
earlier), in which case the ordering you chose is used. We can see the mapping by looking at the classes
attribute:
to.classes['ProductSize']
Since it takes a minute or so to process the data to get to this point, we should save it—that way in the future we can continue our work from here without rerunning the previous steps. fastai provides a save
method that uses Python's pickle system to save nearly any Python object:
save_pickle(path/'to.pkl',to)
To read this back later, you would type:
to = load_pickle(path/'to.pkl')
Now that all this preprocessing is done, we are ready to create a decision tree.
To begin, we define our independent and dependent variables:
xs,y = to.train.xs,to.train.y
valid_xs,valid_y = to.valid.xs,to.valid.y
Now that our data is all numeric, and there are no missing values, we can create a decision tree:
m = DecisionTreeRegressor(max_leaf_nodes=4)
m.fit(xs, y);
To keep it simple, we've told sklearn to just create four leaf nodes. To see what it's learned, we can display the tree:
draw_tree(m, xs, size=10, leaves_parallel=True, precision=2)
Understanding this picture is one of the best ways to understand decision trees, so we will start at the top and explain each part step by step.
The top node represents the initial model before any splits have been done, when all the data is in one group. This is the simplest possible model. It is the result of asking zero questions and will always predict the value to be the average value of the whole dataset. In this case, we can see it predicts a value of 10.10 for the logarithm of the sales price. It gives a mean squared error of 0.48. The square root of this is 0.69. (Remember that unless you see m_rmse
, or a root mean squared error, then the value you are looking at is before taking the square root, so it is just the average of the square of the differences.) We can also see that there are 404,710 auction records in this group—that is the total size of our training set. The final piece of information shown here is the decision criterion for the best split that was found, which is to split based on the coupler_system
column.
Moving down and to the left, this node shows us that there were 360,847 auction records for equipment where coupler_system
was less than 0.5. The average value of our dependent variable in this group is 10.21. Moving down and to the right from the initial model takes us to the records where coupler_system
was greater than 0.5.
The bottom row contains our leaf nodes: the nodes with no answers coming out of them, because there are no more questions to be answered. At the far right of this row is the node containing records where coupler_system
was greater than 0.5. The average value here is 9.21, so we can see the decision tree algorithm did find a single binary decision that separated highvalue from lowvalue auction results. Asking only about coupler_system
predicts an average value of 9.21 versus 10.1.
Returning to the top node after the first decision point, we can see that a second binary decision split has been made, based on asking whether YearMade
is less than or equal to 1991.5. For the group where this is true (remember, this is now following two binary decisions, based on coupler_system
and YearMade
) the average value is 9.97, and there are 155,724 auction records in this group. For the group of auctions where this decision is false, the average value is 10.4, and there are 205,123 records. So again, we can see that the decision tree algorithm has successfully split our more expensive auction records into two more groups which differ in value significantly.
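You can verify the root node's numbers for yourself; a quick sketch (using the y we defined above) of how that node's prediction and error come about:

# The root node's prediction is simply the mean of the (log) sale prices in the
# training set, and its reported error is the mean squared deviation from that mean.
root_pred = y.mean()                      # roughly 10.1 for this dataset
root_mse  = ((y - root_pred)**2).mean()   # the squared error shown at the top of the tree
root_rmse = root_mse**0.5                 # roughly 0.69, the square root mentioned above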
We can show the same information using Terence Parr's powerful dtreeviz library:
samp_idx = np.random.permutation(len(y))[:500]
dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,
         fontname='DejaVu Sans', scale=1.6, label_fontsize=10,
         orientation='LR')
This shows a chart of the distribution of the data for each split point. We can clearly see that there's a problem with our YearMade
data: there are bulldozers made in the year 1000, apparently! Presumably this is actually just a missing value code (a value that doesn't otherwise appear in the data and that is used as a placeholder in cases where a value is missing). For modeling purposes, 1000 is fine, but as you can see this outlier makes visualization of the values we are interested in more difficult. So, let's replace it with 1950:
xs.loc[xs['YearMade']<1900, 'YearMade'] = 1950
valid_xs.loc[valid_xs['YearMade']<1900, 'YearMade'] = 1950
That change makes the split much clearer in the tree visualization, even though it doesn't actually change the result of the model in any significant way. This is a great example of how resilient decision trees are to data issues!
m = DecisionTreeRegressor(max_leaf_nodes=4).fit(xs, y)
dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,
         fontname='DejaVu Sans', scale=1.6, label_fontsize=10,
         orientation='LR')
Let's now have the decision tree algorithm build a bigger tree. Here, we are not passing in any stopping criteria such as max_leaf_nodes
:
m = DecisionTreeRegressor()
m.fit(xs, y);
We'll create a little function to check the root mean squared error of our model (m_rmse
), since that's how the competition was judged:
def r_mse(pred,y): return round(math.sqrt(((pred-y)**2).mean()), 6)
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)
m_rmse(m, xs, y)
So, our model is perfect, right? Not so fast... remember we really need to check the validation set, to ensure we're not overfitting:
m_rmse(m, valid_xs, valid_y)
Oops—it looks like we might be overfitting pretty badly. Here's why:
m.get_n_leaves(), len(xs)
We've got nearly as many leaf nodes as data points! That seems a little overenthusiastic. Indeed, sklearn's default settings allow it to continue splitting nodes until there is only one item in each leaf node. Let's change the stopping rule to tell sklearn to ensure every leaf node contains at least 25 auction records:
m = DecisionTreeRegressor(min_samples_leaf=25)
m.fit(to.train.xs, to.train.y)
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)
That looks much better. Let's check the number of leaves again:
m.get_n_leaves()
Much more reasonable!
A:Here's my intuition for an overfitting decision tree with more leaf nodes than data items. Consider the game Twenty Questions. In that game, the chooser secretly imagines an object (like, "our television set"), and the guesser gets to pose 20 yes or no questions to try to guess what the object is (like "Is it bigger than a breadbox?"). The guesser is not trying to predict a numerical value, but just to identify a particular object out of the set of all imaginable objects. When your decision tree has more leaves than there are possible objects in your domain, then it is essentially a well-trained guesser. It has learned the sequence of questions needed to identify a particular data item in the training set, and it is "predicting" only by describing that item's value. This is a way of memorizing the training set—i.e., of overfitting.
Building a decision tree is a good way to create a model of our data. It is very flexible, since it can clearly handle nonlinear relationships and interactions between variables. But we can see there is a fundamental compromise between how well it generalizes (which we can achieve by creating small trees) and how accurate it is on the training set (which we can achieve by using large trees).
So how do we get the best of both worlds? We'll show you right after we handle an important missing detail: how to handle categorical variables.
In the previous chapter, when working with deep learning networks, we dealt with categorical variables by one-hot encoding them and feeding them to an embedding layer. The embedding layer helped the model to discover the meaning of the different levels of these variables (the levels of a categorical variable do not have an intrinsic meaning, unless we manually specify an ordering using Pandas). In a decision tree, we don't have embedding layers—so how can these untreated categorical variables do anything useful in a decision tree? For instance, how could something like a product code be used?
The short answer is: it just works! Think about a situation where there is one product code that is far more expensive at auction than any other one. In that case, any binary split will result in that one product code being in some group, and that group will be more expensive than the other group. Therefore, our simple decision tree building algorithm will choose that split. Later during training the algorithm will be able to further split the subgroup that contains the expensive product code, and over time, the tree will home in on that one expensive product.
It is also possible to use one-hot encoding to replace a single categorical variable with multiple one-hot-encoded columns, where each column represents a possible level of the variable. Pandas has a get_dummies
method which does just that.
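For illustration, here's what that looks like on a tiny made-up DataFrame (the column name and levels are hypothetical):

import pandas as pd

# One-hot encode a single categorical column: each level becomes its own column.
df_demo = pd.DataFrame({'ProductGroup': ['TTT', 'WL', 'TTT', 'BL']})
one_hot = pd.get_dummies(df_demo, columns=['ProductGroup'])
# one_hot now has columns ProductGroup_BL, ProductGroup_TTT, ProductGroup_WL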
However, there is not really any evidence that such an approach improves the end result. So, we generally avoid it where possible, because it does end up making your dataset harder to work with. In 2019 this issue was explored in the paper "Splitting on Categorical Predictors in Random Forests" by Marvin Wright and Inke König, which said:
:The standard approach for nominal predictors is to consider all $2^{k-1} - 1$ 2-partitions of the k predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it was shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only $k - 1$ splits have to be considered for a nominal predictor with k categories.
Now that you understand how decision trees work, it's time for the best-of-both-worlds solution: random forests.
In 1994 Berkeley professor Leo Breiman, one year after his retirement, published a small technical report called "Bagging Predictors", which turned out to be one of the most influential ideas in modern machine learning. The report began:
:Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions... The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests… show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.
Here is the procedure that Breiman is proposing:
 Randomly choose a subset of the rows of your data (i.e., "bootstrap replicates of your learning set").
 Train a model using this subset.
 Save that model, and then return to step 1 a few times.
 This will give you a number of trained models. To make a prediction, predict using all of the models, and then take the average of each of those model's predictions.
This procedure is known as "bagging." It is based on a deep and important insight: although each of the models trained on a subset of data will make more errors than a model trained on the full dataset, those errors will not be correlated with each other. Different models will make different errors. The average of those errors, therefore, is: zero! So if we take the average of all of the models' predictions, then we should end up with a prediction that gets closer and closer to the correct answer, the more models we have. This is an extraordinary result—it means that we can improve the accuracy of nearly any kind of machine learning algorithm by training it multiple times, each time on a different random subset of the data, and averaging its predictions.
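As a minimal sketch of bagging (our own illustration using sklearn trees, not Breiman's code or this book's), assuming xs and y are a DataFrame and Series like the ones built above:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(xs, y, valid_xs, n_models=10, seed=42):
    rng = np.random.default_rng(seed)
    all_preds = []
    for _ in range(n_models):
        idx = rng.choice(len(xs), len(xs), replace=True)        # bootstrap sample of rows
        tree = DecisionTreeRegressor().fit(xs.iloc[idx], y.iloc[idx])
        all_preds.append(tree.predict(valid_xs))
    return np.stack(all_preds).mean(0)                          # average the models' predictions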
In 2001 Leo Breiman went on to demonstrate that this approach to building models, when applied to decision tree building algorithms, was particularly powerful. He went even further than just randomly choosing rows for each model's training, but also randomly selected from a subset of columns when choosing each split in each decision tree. He called this method the random forest. Today it is, perhaps, the most widely used and practically important machine learning method.
In essence a random forest is a model that averages the predictions of a large number of decision trees, which are generated by randomly varying various parameters that specify what data is used to train the tree and other tree parameters. Bagging is a particular approach to "ensembling," or combining the results of multiple models together. To see how it works in practice, let's get started on creating our own random forest!
We can create a random forest just like we created a decision tree, except now, we are also specifying parameters that indicate how many trees should be in the forest, how we should subset the data items (the rows), and how we should subset the fields (the columns).
In the following function definition n_estimators
defines the number of trees we want, max_samples
defines how many rows to sample for training each tree, and max_features
defines how many columns to sample at each split point (where 0.5
means "take half the total number of columns"). We can also specify when to stop splitting the tree nodes, effectively limiting the depth of the tree, by including the same min_samples_leaf
parameter we used in the last section. Finally, we pass n_jobs=-1
to tell sklearn to use all our CPUs to build the trees in parallel. By creating a little function for this, we can more quickly try different variations in the rest of this chapter:
def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True).fit(xs, y)
m = rf(xs, y);
Our validation RMSE is now much improved over our last result produced by the DecisionTreeRegressor
, which made just one tree using all the available data:
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)
One of the most important properties of random forests is that they aren't very sensitive to the hyperparameter choices, such as max_features
. You can set n_estimators
to as high a number as you have time to train—the more trees you have, the more accurate the model will be. max_samples
can often be left at its default, unless you have over 200,000 data points, in which case setting it to 200,000 will make it train faster with little impact on accuracy. max_features=0.5
and min_samples_leaf=4
both tend to work well, although sklearn's defaults work well too.
The sklearn docs show an example of the effects of different max_features
choices, with increasing numbers of trees. In the plot, the blue plot line uses the fewest features and the green line uses the most (it uses all the features). As you can see in <
To see the impact of n_estimators
, let's get the predictions from each individual tree in our forest (these are in the estimators_
attribute):
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])
As you can see, preds.mean(0)
gives the same results as our random forest:
r_mse(preds.mean(0), valid_y)
Let's see what happens to the RMSE as we add more and more trees. As you can see, the improvement levels off quite a bit after around 30 trees:
plt.plot([r_mse(preds[:i+1].mean(0), valid_y) for i in range(40)]);
The performance on our validation set is worse than on our training set. But is that because we're overfitting, or because the validation set covers a different time period, or a bit of both? With the existing information we've seen, we can't tell. However, random forests have a very clever trick called out-of-bag (OOB) error that can help us with this (and more!).
Recall that in a random forest, each tree is trained on a different subset of the training data. The OOB error is a way of measuring prediction error on the training set by only including in the calculation of a row's error trees where that row was not included in training. This allows us to see whether the model is overfitting, without needing a separate validation set.
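To make the idea concrete, here is a conceptual sketch (our own, not sklearn's implementation) of how OOB error could be computed by tracking which rows each tree saw during training:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def oob_rmse_sketch(xs, y, n_trees=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(xs)
    pred_sum, pred_count = np.zeros(n), np.zeros(n)
    for _ in range(n_trees):
        idx = rng.choice(n, n, replace=True)              # bootstrap sample for this tree
        oob = np.setdiff1d(np.arange(n), idx)             # rows this tree never trained on
        tree = DecisionTreeRegressor().fit(xs.iloc[idx], y.iloc[idx])
        pred_sum[oob] += tree.predict(xs.iloc[oob])
        pred_count[oob] += 1
    seen = pred_count > 0                                 # rows that were OOB for at least one tree
    oob_preds = pred_sum[seen] / pred_count[seen]
    return np.sqrt(((oob_preds - np.asarray(y)[seen])**2).mean())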
A:My intuition for this is that, since every tree was trained with a different randomly selected subset of rows, out-of-bag error is a little like imagining that every tree therefore also has its own validation set. That validation set is simply the rows that were not selected for that tree's training. This is particularly beneficial in cases where we have only a small amount of training data, as it allows us to see whether our model generalizes without removing items to create a validation set.
The OOB predictions are available in the oob_prediction_ attribute. Note that we compare them to the training labels, since this is being calculated on trees using the training set.
r_mse(m.oob_prediction_, y)
We can see that our OOB error is much lower than our validation set error. This means that something else is causing that error, in addition to normal generalization error. We'll discuss the reasons for this later in this chapter.
This is one way to interpret our model's predictions—let's focus on more of those now.
For tabular data, model interpretation is particularly important. For a given model, the things we are most likely to be interested in are:
 How confident are we in our predictions using a particular row of data?
 For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?
 Which columns are the strongest predictors, and which can we ignore?
 Which columns are effectively redundant with each other, for purposes of prediction?
 How do predictions vary, as we vary these columns?
As we will see, random forests are particularly well suited to answering these questions. Let's start with the first one!
We saw how the model averages the individual tree's predictions to get an overall prediction—that is, an estimate of the value. But how can we know the confidence of the estimate? One simple way is to use the standard deviation of predictions across the trees, instead of just the mean. This tells us the relative confidence of predictions. In general, we would want to be more cautious of using the results for rows where trees give very different results (higher standard deviations), compared to cases where they are more consistent (lower standard deviations).
In the earlier section on creating a random forest, we saw how to get predictions over the validation set, using a Python list comprehension to do this for each tree in the forest:
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])
preds.shape
Now we have a prediction for every tree and every auction (40 trees and 7,988 auctions) in the validation set.
Using this we can get the standard deviation of the predictions over all the trees, for each auction:
preds_std = preds.std(0)
Here are the standard deviations for the predictions for the first five auctions—that is, the first five rows of the validation set:
preds_std[:5]
As you can see, the confidence in the predictions varies widely. For some auctions, there is a low standard deviation because the trees agree. For others it's higher, as the trees don't agree. This is information that would be useful in a production setting; for instance, if you were using this model to decide what items to bid on at auction, a low-confidence prediction might cause you to look more carefully at an item before you made a bid.
It's not normally enough just to know that a model can make accurate predictions—we also want to know how it's making predictions. Feature importance gives us insight into this. We can get these directly from sklearn's random forest by looking in the feature_importances_
attribute. Here's a simple function we can use to pop them into a DataFrame and sort them:
def rf_feat_importance(m, df):
    return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}
                       ).sort_values('imp', ascending=False)
The feature importances for our model show that the first few most important columns have much higher importance scores than the rest, with (not surprisingly) YearMade
and ProductSize
being at the top of the list:
fi = rf_feat_importance(m, xs)
fi[:10]
A plot of the feature importances shows the relative importances more clearly:
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)
plot_fi(fi[:30]);
The way these importances are calculated is quite simple yet elegant. The feature importance algorithm loops through each tree, and then recursively explores each branch. At each branch, it looks to see what feature was used for that split, and how much the model improves as a result of that split. The improvement (weighted by the number of rows in that group) is added to the importance score for that feature. This is summed across all branches of all trees, and finally the scores are normalized such that they add to 1.
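As a rough sketch of that accumulation for a single fitted sklearn tree (a simplified version of what sklearn does internally, written against its public tree_ attributes):

import numpy as np

def tree_feature_importance(tree, n_features):
    t = tree.tree_
    imp = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                      # leaf node: no split here
            continue
        # Impurity decrease from this split, weighted by the rows reaching each node.
        decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                    - t.weighted_n_node_samples[left] * t.impurity[left]
                    - t.weighted_n_node_samples[right] * t.impurity[right])
        imp[t.feature[node]] += decrease
    return imp / imp.sum()                  # normalize so the importances sum to 1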
It seems likely that we could use just a subset of the columns by removing the variables of low importance and still get good results. Let's try just keeping those with a feature importance greater than 0.005:
to_keep = fi[fi.imp>0.005].cols
len(to_keep)
We can retrain our model using just this subset of the columns:
xs_imp = xs[to_keep]
valid_xs_imp = valid_xs[to_keep]
m = rf(xs_imp, y)
And here's the result:
m_rmse(m, xs_imp, y), m_rmse(m, valid_xs_imp, valid_y)
Our accuracy is about the same, but we have far fewer columns to study:
len(xs.columns), len(xs_imp.columns)
We've found that generally the first step to improving a model is simplifying it—78 columns was too many for us to study them all in depth! Furthermore, in practice often a simpler, more interpretable model is easier to roll out and maintain.
This also makes our feature importance plot easier to interpret. Let's look at it again:
plot_fi(rf_feat_importance(m, xs_imp));
One thing that makes this harder to interpret is that there seem to be some variables with very similar meanings: for example, ProductGroup
and ProductGroupDesc
. Let's try to remove any redundant features.
Let's start with:
cluster_columns(xs_imp)