In this tutorial, we are going to learn an advanced concept related to decision trees called random forests. The idea behind random forests is the concept of ensembling. Sir Francis Galton (1822–1911), an English polymath and statistician, pioneered the basic ideas of correlation and regression. Once, during a visit to a livestock fair, Galton got interested in a simple game where people tried to guess the weight of an ox. Lots of folks joined in, but no one hit the exact weight: 1,198 pounds.
But guess what? Galton discovered something cool – when he averaged out all the guesses, it was super close to the real weight: 1,197 pounds. This reminded him of the Condorcet jury theorem, showing that combining many simple guesses can give a really good result. Fast forward to 2004, an American financial journalist named James Michael Surowiecki wrote a book called “The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies, and Nations.” Surowiecki’s idea is pretty neat – when you gather info from different people, you can make better decisions, sometimes even better than what a super-smart person might decide. This is the basic idea of ensemble learning in machine learning.
An ensemble in the context of machine learning refers to the technique of combining multiple individual models to create a stronger, more robust predictive model. The idea behind ensemble methods is to leverage the diversity of multiple models to improve overall performance, generalization, and accuracy. A group of predictors is called an ensemble, and an ensemble learning algorithm is called an ensemble method.
Now the question is: how do we build an ensemble of models? What are the combination methods? And where does random forest fit into all of this? That’s what we are going to learn in this tutorial.
Prerequisites
- Linear Algebra And Calculus For Machine Learning
- Probability And Statistics For Machine Learning
- Python, Numpy, Matplotlib And Pandas
- Decision Trees – Learn here.
What You Will Learn
- Bias-Variance Tradeoff
- Bagging
- Pasting
- Ensemble Learning
- Random Forests
- Boosting
- Stacking
- Application
Bias-Variance Tradeoff
To understand random forests, we first need to understand the bias-variance tradeoff in machine learning. Imagine you are building multiple models on different samples of a dataset. Each sample carries its own quirks, so each model will produce slightly different predictions, and across the models you get a range of predictions for the same input. Bias measures how far off, on average, these models’ predictions are from the correct value. Variance measures how much the predictions for a given point vary across models trained on different realizations of the data.
When dealing with predictive models, the prediction errors are mainly made up of errors due to bias and errors due to variance. Our goal in machine learning is to reduce both so that the model makes fewer errors. But in the real world, with finite data, we have to settle for a tradeoff between bias and variance.
At its core, reducing bias and variance is really about managing model complexity, in other words about underfitting and overfitting. As our model becomes more complex, the bias reduces and the variance increases, so we try to find an optimum model complexity. A simple model may not have enough capacity to capture the underlying patterns in the data, resulting in high bias. It makes strong assumptions and may oversimplify the true relationships. As the model complexity increases, it becomes more flexible and capable of fitting complex patterns in the data. It can better approximate the underlying relationships, leading to a reduction in bias.
Simple models are less sensitive to variations in the training data, resulting in lower variance. They are more stable and consistent but may miss intricate patterns. More complex models, with greater flexibility, can fit the training data more closely. However, they become sensitive to noise and fluctuations, leading to higher variance. The model might start capturing random patterns present in the training data that don’t generalize well.
Overfitting occurs when a highly complex model fits the training data too closely, capturing noise and leading to high variance. It performs well on the training data but poorly on new data. Underfitting occurs when a model is too simple to capture the underlying patterns, resulting in high bias. It performs poorly on both training and new data.
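To make this concrete, here is a minimal sketch (assuming scikit-learn is available; the synthetic dataset and the chosen depths are arbitrary illustrations) that contrasts an underfitting shallow tree with an overfitting deep tree:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, 3, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    print(f"max_depth={depth}: train MSE={train_mse:.0f}, test MSE={test_mse:.0f}")

A depth-1 tree tends to do poorly on both sets (high bias), while an unlimited-depth tree tends to fit the training set almost perfectly but do worse on the test set (high variance).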
A detailed discussion on this topic can be found in this tutorial here.
Bagging (Bootstrap Aggregating)
One way to reduce variance is a method called bagging. It is an ensemble learning technique used to improve the stability and accuracy of machine learning models. The basic idea behind bagging is to train multiple instances of the same learning algorithm on different subsets of the training data, and then combine their predictions to achieve a more robust and generalizable model. In effect, bootstrapping gives us many slightly different training sets drawn from the same data, and averaging models trained on them lets us reduce the variance. A short code sketch follows the steps below.
Here’s how bagging works:
- Bootstrap Sampling:
- Random subsets of the training data are created by sampling with replacement (bootstrap sampling). This means that some instances may be repeated in a subset while others may not be included at all.
- Model Training:
- A base learning algorithm (e.g., decision tree) is trained independently on each of these bootstrap samples. As a result, multiple models are created, each exposed to a slightly different perspective of the data.
- Prediction Combination:
- When making predictions, the individual models’ outputs are combined through a voting or averaging process. For classification tasks, the mode (most frequent class) is often used, while for regression tasks, the average of predictions is taken.
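Here is the code sketch promised above: a minimal example of bagging with scikit-learn’s BaggingClassifier, whose default base learner is a decision tree. The synthetic dataset and hyperparameter values are purely illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

bagging = BaggingClassifier(
    n_estimators=100,   # number of trees, each trained on its own bootstrap sample
    max_samples=1.0,    # each bootstrap sample is as large as the training set
    bootstrap=True,     # sample with replacement (this is what makes it bagging)
    random_state=42,
)
print("bagging accuracy:", cross_val_score(bagging, X, y, cv=5).mean())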
Notes On Ensemble:
- Each model in the bagging ensemble is trained independently of the others. This independence means that if one model overfits a particular pattern or noise in the data, other models may not necessarily follow the same pattern. As a result, the overall ensemble is less likely to be sensitive to specific instances in the training data, reducing the risk of overfitting. The diversity introduced by bagging improves the stability and generalization performance of the ensemble. The combined knowledge of multiple models tends to be more reliable and less sensitive to variations in the training data.
- If individual models make errors in certain instances, the ensemble has the potential to compensate for these errors. Some models may correctly predict instances where others fail, leading to a more accurate and robust overall prediction.
- Bagging helps reduce the variance of the model by combining predictions from multiple models trained on different subsets of the data. This is particularly beneficial when dealing with complex models prone to overfitting, such as decision trees. By training on diverse subsets of the data, bagging makes the model less sensitive to variations in the training data and improves stability and generalization performance. Bagging can also improve a model’s robustness by reducing the impact of outliers or noisy instances in the training data. Finally, the independent training of models allows for parallelization, making bagging suitable for distributed computing environments.
But what about bias?
Bagging does little to reduce bias. The ensemble, formed by combining predictions from multiple predictors, tends to have a similar bias to that of a single predictor trained on the original training set. However, the variance of the ensemble is lower than that of a single predictor trained on the complete dataset.
Closely related to bagging is another sampling method called pasting.
Pasting
In pasting, each base model is trained on a random subset of the training data, sampled without replacement. This means that once an instance is selected for a particular subset, it cannot be selected again for that subset. Multiple base models (predictors) are trained independently on different subsets of the data so that each model sees a distinct subset of the training instances.
Advantages of Pasting:
- Reduction in Variance:
- Similar to bagging, pasting primarily aims to reduce variance. By training models on different subsets of the data, it helps create an ensemble that is less sensitive to the noise or fluctuations present in any single training set.
- Improved Generalization:
- The diversity introduced by pasting contributes to improved generalization to unseen data. The ensemble is more likely to capture the underlying patterns in the data rather than memorizing specific instances.
- Parallelization:
- Pasting allows for parallelization during training because each model is trained independently. This makes it computationally efficient and suitable for distributed computing environments.
Comparison with Bagging:
- Sampling Approach:
- In bagging, sampling is done with replacement, allowing instances to be selected multiple times for a particular subset. In pasting, sampling is done without replacement, ensuring that each instance is selected only once for a particular subset.
- Use Cases:
- Bagging is often used with high-variance models, such as decision trees. Pasting can be suitable for situations where the base models are sensitive to the training instances and can potentially overfit.
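To make the distinction concrete, here is a minimal sketch (again on an illustrative synthetic dataset) where the only difference between bagging and pasting is the bootstrap flag of scikit-learn’s BaggingClassifier:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

bagging = BaggingClassifier(n_estimators=100, max_samples=0.8,
                            bootstrap=True, random_state=42)   # with replacement
pasting = BaggingClassifier(n_estimators=100, max_samples=0.8,
                            bootstrap=False, random_state=42)  # without replacement

print("bagging:", cross_val_score(bagging, X, y, cv=5).mean())
print("pasting:", cross_val_score(pasting, X, y, cv=5).mean())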
Now the question is: how do we combine these models’ outputs to get the final prediction? Various methods are available, but I am mentioning only the ones that are most commonly used.
Combination Methods
There are various combination methods available. You don’t need to know all of them as most of them are not even implemented in Sklearn. Here are the important ones:
- Voting (Classification):
- In binary classification, each model in the ensemble predicts the class label for a given instance. The final prediction is determined by a majority vote. The class that receives the most votes is chosen as the ensemble’s prediction.
- For multiclass classification, a similar voting process is applied. Each model predicts a class, and the class with the highest number of votes is selected as the final prediction.
- The voting process can be of two types: hard voting and soft voting.
- In hard voting, each base model in the ensemble makes a classification prediction, and the final ensemble prediction is determined by a simple majority vote. The class that receives the most votes is selected as the final prediction. For example, if there are three base models, and they predict classes A, B, and A, then the majority class is A, and the final prediction is A.
- In soft voting, each base model provides a probability estimate for each class, and the final ensemble prediction is based on the average or weighted average of these probabilities. For example, if three base models estimate the probability of class A as 0.8, 0.4, and 0.7, the average probability for A is about 0.63, higher than for B, so the final prediction is A. Soft voting works well when the individual models provide meaningful probability estimates, which not all models can produce.
- Averaging (Regression):
- In regression tasks, where the goal is to predict a continuous numerical value, each model produces a numeric prediction. The final prediction is obtained by averaging these numeric predictions. This can be a simple arithmetic mean or a weighted mean, where the weights are determined based on the models’ performance or other criteria.
- Weighted Aggregation:
- Each model’s prediction may be given a specific weight based on its performance or other considerations. The final prediction is then a weighted combination of individual predictions. This approach allows for giving more influence to models that have demonstrated better accuracy on the validation set or during training.
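A minimal sketch of these combination methods using scikit-learn’s VotingClassifier, combining three different base models (the dataset and model choices are arbitrary illustrations):

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
estimators = [
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=42)),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
]

hard = VotingClassifier(estimators, voting="hard")  # majority class vote
soft = VotingClassifier(estimators, voting="soft")  # average predicted probabilities
# Weighted aggregation: pass weights to favour some models over others, e.g.
# VotingClassifier(estimators, voting="soft", weights=[1, 2, 1])

print("hard voting:", cross_val_score(hard, X, y, cv=5).mean())
print("soft voting:", cross_val_score(soft, X, y, cv=5).mean())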
Now that we have these basics in mind, let’s come to our main topic of discussion: random forests.
What are Random Forests?
In machine learning, decision trees are like individual experts. They’re simple and intuitive, but they can sometimes make errors or be overly sensitive to the data they were trained on. Random Forests, on the other hand, act as a group of decision trees, offering a more robust and accurate prediction by combining the strengths of multiple trees.
Example: Decision Trees vs. Random Forests
Decision Tree Scenario: Suppose you’re predicting whether someone will enjoy outdoor activities based on weather conditions. A decision tree might say, “If it’s sunny, they’ll likely enjoy it; if it’s rainy, they won’t.” However, this decision tree might become too specific and make errors if it encounters a rare rainy day that people actually enjoy.
Random Forest Scenario: Now, consider a Random Forest consisting of several decision trees. Each tree might look at different aspects of the weather, like temperature, humidity, or wind speed. When it’s time to make a prediction, each tree casts its vote on whether the person will enjoy the activity. The final decision is determined by the majority vote among all the trees. This way, even if one tree makes a mistake due to unusual circumstances, the overall prediction remains reliable.
The random forest is based on applying bagging to decision trees, with one important extension: in addition to sampling the records, the algorithm also samples the variables. During the training process, each decision tree in the Random Forest is exposed to a random subset of the training data and a random subset of the features. This introduces diversity among the trees, preventing them from becoming too similar. When it’s time to make a prediction, each tree in the Random Forest “votes” on the outcome. For classification, the majority vote determines the predicted class; the average of the individual predictions is taken for regression.
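Before the fuller treatment in the notebook, here is a minimal sketch of a random forest in scikit-learn; the dataset and hyperparameters are illustrative, not tuned:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees in the forest
    max_features="sqrt",  # features considered at each split (the extra randomness)
    random_state=42,
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))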
Advantages of Random Forests:
- Reduced Overfitting: Decision trees can be prone to overfitting, meaning they memorize the training data instead of learning general patterns. Random Forests mitigate this by averaging over multiple trees, which tends to smooth out individual peculiarities and provide a more generalized prediction.
- Improved Accuracy: The collective wisdom of multiple trees often leads to more accurate predictions compared to a single decision tree.
- Robustness: Random Forests are less sensitive to outliers or noisy data because they consider a broader range of perspectives.
- Feature Importance: Random Forests can also provide insights into feature importance. By observing how much each feature contributes to the accuracy of predictions across the ensemble, one can gain a better understanding of which features are more influential. We will learn about feature importance in the notebook in further detail.
Comparison with Decision Trees:
- Decision Trees:
- Single, straightforward model.
- Prone to overfitting.
- Might be sensitive to the specific training data.
- Random Forests:
- Ensemble of decision trees.
- Reduces overfitting through averaging.
- More robust due to the diversity of trees.
We will see the implementation of random forests in the notebook.
Random Forests also give a measure of feature importance. It is relative within the model, and you should note that the importance values do not necessarily reflect real-world meanings or causal relationships; they are specific to the predictions made by the model.
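As a small illustration of what such importances look like in practice (using scikit-learn’s built-in breast cancer dataset purely as an example, and assuming pandas is available):

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(data.data, data.target)

# Impurity-based importances, one value per feature, summing to 1
importances = pd.Series(forest.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head())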
“Random Patches” and “Random Subspaces”
Bagging methods come in many flavours but mostly differ from each other by the way they draw random subsets of the training set:
Random Patches:
- Random Patches involves creating multiple subsets of the training dataset by randomly sampling both instances (data points) and features (attributes or variables). It is especially useful with high-dimensional data.
- Each subset (patch) is used to train a separate model. These models are then aggregated to make predictions.
- The randomness in selecting both instances and features helps create diverse models, enhancing the overall model’s robustness and generalization.
- The Random Patches method is commonly used with algorithms like the Random Forest, where each tree is trained on a different subset of both instances and features.
Random Subspaces:
- Random Subspaces is a similar concept but focuses on randomizing only the features (attributes or variables) while keeping all instances in each subset.
- In this approach, each subset (subspace) is created by randomly selecting a subset of features for training a model.
- Again, the goal is to introduce diversity among the models by training them on different sets of features.
- Often used with algorithms like Bagged Decision Trees or other base learners.
Comparison:
- The key difference lies in what is randomized—Random Patches randomize both instances and features, while Random Subspaces randomize only features.
- Both methods contribute to reducing overfitting and improving the generalization performance of ensemble models.
- These techniques are particularly useful when dealing with high-dimensional data, as they introduce variability in the training process.
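Both schemes can be approximated with scikit-learn’s BaggingClassifier: random patches sample both instances and features, while random subspaces sample features only. A sketch with illustrative parameter values:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=50, random_state=42)

random_patches = BaggingClassifier(
    n_estimators=100,
    max_samples=0.7, bootstrap=True,            # sample instances
    max_features=0.5, bootstrap_features=True,  # and sample features
    random_state=42,
).fit(X, y)

random_subspaces = BaggingClassifier(
    n_estimators=100,
    max_samples=1.0, bootstrap=False,           # keep all instances
    max_features=0.5, bootstrap_features=True,  # sample features only
    random_state=42,
).fit(X, y)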
Out-Of-Bag-Evaluation
When training each decision tree in the ensemble, a random subset of the original training dataset is created by sampling with replacement. This means that some instances from the original dataset may be included multiple times, while others may not be included at all. On average, each decision tree in the Random Forest is trained on about 63% of the original training instances. The reason for this is the nature of bootstrapping, where, on average, about 63% of the instances are selected in each bootstrap sample. The remaining 37% of the instances that are not included in the bootstrap sample for a particular tree are referred to as out-of-bag (OOB) instances. Since these instances are not used in training a specific tree, they can be considered as a validation set for that particular tree. We use these OOB instances for evaluation. We will see its implementation in the notebook.
If you are curious about why 63%, then please visit this post here. It comes from the probability of instance sampling: the chance that a given instance is never drawn in n samples with replacement is (1 - 1/n)^n, which approaches 1/e ≈ 0.37 as n grows, so roughly 63% of the instances end up in each bootstrap sample. And do share your discovery in the forum.
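A minimal sketch of out-of-bag evaluation in scikit-learn: setting oob_score=True makes the forest score each instance using only the trees that did not see it during training, giving a built-in validation estimate without a separate hold-out set. Dataset and hyperparameters are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X, y)
print("OOB accuracy estimate:", forest.oob_score_)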
Boosting
To understand boosting, we need to know the concept of weak learners and strong learners.
A decision stump is a simple decision tree with only one level (depth of one). It makes a decision based on a single feature and a threshold. Now, consider a decision stump that predicts whether a person will buy a product based on their age. The decision stump might say “yes” if the person is older than 30 and “no” otherwise. It is an example of a weak learner. A weak learner is a model that performs slightly better than random chance or is just slightly better than random guessing. They are often simple models, such as decision stumps or shallow decision trees, which have limited expressive power. These models may not perform well on their own, but they can be combined or boosted to create a strong learner.
Now consider a random forest: it is a strong learner because it can achieve high predictive performance on its own. Strong learners are usually complex models with high expressive power and the ability to capture intricate relationships in the data. They can be more resource-intensive and prone to overfitting, as they may learn the training data too well.
Boosting is an ensemble learning technique that combines multiple weak learners to create a strong learner. The goal is to improve overall predictive performance by sequentially training weak models on the dataset, with each subsequent model giving more emphasis to the examples that the previous models misclassified. The final prediction is often made by combining the predictions of all the weak learners, typically through a weighted sum or a voting mechanism. Bagging requires little tuning but boosting requires much greater care in its application.
There are various boosting algorithms available but the most popular ones are AdaBoost, Gradient boosting, and Stochastic Gradient Boosting.
1. AdaBoost (Adaptive Boosting)
AdaBoost is the most popular boosting algorithm in use. It is called adaptive because it does not need to know error bounds on the weak classifiers, nor does it need to know the number of classifiers in advance. The algorithm assigns different weights to training examples based on their classification errors, allowing subsequent weak learners to focus on the instances that were misclassified by previous models. AdaBoost is a greedy algorithm.
How AdaBoost Works:
- Weighted Instances: AdaBoost assigns weights to each training instance. Initially, all weights are equal.
- Sequential Training: Weak learners are trained iteratively, and the algorithm pays more attention to instances that are misclassified by giving them higher weights.
- Weighted Voting: The final prediction is a weighted combination of the weak learners. Each weak learner’s weight is determined based on its accuracy, with more accurate learners having a higher say in the final prediction.
# AdaBoost with decision stumps as weak learners (a runnable sketch;
# labels are assumed to be encoded as -1/+1)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    n = len(X)
    weights = np.full(n, 1 / n)  # initialize instance weights uniformly
    learners, alphas = [], []
    for t in range(T):
        # Train a weak learner (decision stump) on the weighted data
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        # Weighted error of the weak learner
        epsilon_t = np.sum(weights[pred != y]) / np.sum(weights)
        epsilon_t = np.clip(epsilon_t, 1e-10, 1 - 1e-10)  # avoid division by zero
        # Weight (say) of this weak learner in the final vote
        alpha_t = 0.5 * np.log((1 - epsilon_t) / epsilon_t)
        # Update instance weights: misclassified instances get more weight
        weights *= np.exp(-alpha_t * y * pred)
        weights /= weights.sum()
        learners.append(stump)
        alphas.append(alpha_t)
    return learners, alphas

# Final strong learner: a weighted (linear) combination of the weak learners
def adaboost_predict(X, learners, alphas):
    scores = sum(alpha_t * clf.predict(X) for alpha_t, clf in zip(alphas, learners))
    return np.sign(scores)
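A quick usage example for the sketch above, on a synthetic dataset (the dataset and the number of boosting rounds are arbitrary choices):

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y = 2 * y - 1  # map the {0, 1} labels to {-1, +1}
learners, alphas = adaboost_fit(X, y, T=50)
print("training accuracy:", (adaboost_predict(X, learners, alphas) == y).mean())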
Though it is not necessary, if you are curious about the mathematical details, I would recommend reading the first 20 pages of this paper.
AdaBoost uses exponential loss for updating instance weights, emphasizing instances that are misclassified more often. AdaBoost is sensitive to outliers and noise. Techniques such as limiting tree depth or using more robust weak learners can mitigate these issues.
When To Use:
- You Have Weak Learners:
- AdaBoost works well when you have weak learners, which are models that perform slightly better than random chance. Common weak learners include shallow decision trees (stumps).
- High-Dimensional Data:
- AdaBoost can be effective in high-dimensional datasets where features might not be informative on their own, but their combinations contribute to better predictive performance.
- Classification Tasks:
- AdaBoost is primarily designed for classification tasks. It can be used when you have a binary or multiclass classification problem. However, Sklearn also allows you to use it for regression.
- You Want to Combine Multiple Models:
- If you want to combine the predictions of multiple models to create a strong ensemble classifier, AdaBoost is a suitable choice.
AdaBoost is a powerful ensemble learning algorithm, and several variants and extensions have been developed to address specific challenges or improve its performance in certain scenarios. Here are some notable variants of AdaBoost:
- Real AdaBoost:
- Idea: It extends AdaBoost so that weak learners contribute real-valued, confidence-rated predictions (class probability estimates) rather than just discrete class labels.
- Application: Real AdaBoost is suitable when the base learners can provide class probability estimates and you want probabilistic predictions.
- SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss):
- Idea: SAMME is an extension of AdaBoost designed for multi-class classification problems.
- Application: It is commonly used when dealing with more than two classes, and it generalizes AdaBoost to work in the multi-class setting.
- SAMME.R (Real SAMME):
- Idea: SAMME.R is an improvement over SAMME, which handles real-valued class probabilities rather than discrete class labels.
- Application: SAMME.R is beneficial when the base learner can provide class probabilities, such as in the case of decision trees with probability estimates.
You don’t need to remember each one of them. In scikit-learn, the AdaBoost classifier has historically used SAMME.R (Real SAMME) as its default algorithm, which handles real-valued class probabilities (more recent releases are moving to plain SAMME). For regression tasks, scikit-learn provides AdaBoostRegressor, an implementation of AdaBoost.R2, which is designed specifically for regression problems where the goal is to predict a continuous target variable; it supports linear, square, and exponential loss functions. We will see what to use in practice.
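In practice you would usually reach for scikit-learn’s implementation rather than rolling your own. A minimal, hedged sketch (the default base learner is a decision stump; the dataset and hyperparameters are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=42)
print("AdaBoost accuracy:", cross_val_score(ada, X, y, cv=5).mean())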
2. Gradient Boosting
Our next boosting algorithm is gradient boosting, which trains weak learners sequentially, with each one addressing the errors of the previous ones. Gradient Boosting aims to improve the accuracy of predictions by optimizing a differentiable loss function through the iterative addition of weak learners. Gradient Boosting optimizes the loss function by minimizing its gradients, whereas AdaBoost focuses on adjusting instance weights to correct misclassifications.
Components of Gradient Boosting:
a. Weak Learners (Base Models):
- Typically Decision Trees: Decision trees are commonly used as weak learners, often shallow trees to avoid overfitting.
- Can Be Other Models: While decision trees are common, Gradient Boosting can use other types of models as well.
b. Loss Function:
- Defines the Objective: A differentiable loss function is chosen based on the nature of the problem (regression or classification).
- Measures Prediction Error: The goal is to minimize the loss, which represents the discrepancy between predictions and true values.
c. Gradient Descent:
- Optimization Technique: Gradient Descent is used to minimize the loss function.
- Adjusts Predictions: At each stage, the new weak learner is trained to correct the errors made by the existing ensemble.
d. Shrinkage (Learning Rate):
- Controls Contribution of Each Weak Learner: A shrinkage parameter (learning rate) is introduced to control the contribution of each weak learner.
- Prevents Overfitting: Shrinkage serves as a regularization technique by penalizing the impact of each weak learner, preventing the model from fitting the training data too closely.
Gradient Boosting Algorithm:
- Initialize the Model:
- Set the initial prediction as the average (for regression) or log odds (for classification) of the target variable.
- For each iteration (t = 1 to T):
- Calculate the negative gradient of the loss function with respect to the current predictions.
- Train a weak learner (e.g., decision tree) to predict the negative gradient.
- Determine the step size (learning rate) to update the predictions.
- Update the predictions by adding the product of the step size and the predictions of the weak learner.
- Combine Weak Learners:
- The final prediction is the sum of the initial prediction and the weighted sum of the weak learners’ predictions.
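The loop above can be written out compactly for regression with squared-error loss, where the negative gradient is simply the residual. A minimal from-scratch sketch (tree depth, learning rate, and number of rounds are arbitrary choices):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=2):
    f0 = np.mean(y)                        # 1. initialize with the mean of the target
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_estimators):
        residuals = y - prediction         # 2. negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)             # 3. fit a shallow tree to the residuals
        prediction += learning_rate * tree.predict(X)  # 4. shrunken update
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    # Final prediction = initial value + shrunken sum of the trees' predictions
    return f0 + learning_rate * sum(t.predict(X) for t in trees)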
Challenges and Considerations:
- Computational Complexity: Training can be computationally expensive, especially with large datasets.
- Sensitivity to Noisy Data: Gradient Boosting can be sensitive to noisy data and outliers.
- Tuning Hyperparameters: Requires careful tuning of hyperparameters such as learning rate, tree depth, and number of iterations.
If you are looking for the mathematical intuition, please check this paper. The concepts explained above will become clearer in the notebook.
Gradient Boosting has evolved over time, leading to various variants and extensions that address specific challenges or aim to improve certain aspects of the algorithm. Some notable variants of Gradient Boosting include:
- XGBoost (eXtreme Gradient Boosting):
- Key Features:
- Regularization: Incorporates L1 and L2 regularization terms in the objective function to control model complexity.
- Parallelization: Enables parallel and distributed computing for faster training.
- Handling Missing Values: Can handle missing values in the dataset during training.
- Advantages:
- Often achieves better performance and faster training compared to traditional Gradient Boosting.
- Widely used in various machine learning competitions.
- LightGBM (Light Gradient Boosting Machine):
- Key Features:
- Gradient-Based One-Side Sampling (GOSS): Efficiently selects instances with large gradients for training, reducing the number of instances used.
- Exclusive Feature Bundling: Optimizes the use of memory by bundling exclusive features together.
- Support for Categorical Features: Can handle categorical features directly.
- Advantages:
- Designed for distributed and efficient training, particularly on large datasets.
- Efficiently handles large categorical feature spaces.
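Both libraries expose scikit-learn-style estimators. A hedged sketch, assuming the xgboost and lightgbm packages are installed (the hyperparameters shown are common starting points, not tuned values):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

xgb = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=4,
                    reg_lambda=1.0)  # L2 regularization on the leaf weights
lgbm = LGBMClassifier(n_estimators=300, learning_rate=0.1, num_leaves=31)

print("XGBoost:", cross_val_score(xgb, X, y, cv=5).mean())
print("LightGBM:", cross_val_score(lgbm, X, y, cv=5).mean())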
Stacking
Now, let’s move to our next concept: stacking. Imagine you’re tasked with predicting whether a student passes or fails an exam based on various features like study hours, attendance, and previous grades. You decide to use two models: a Decision Tree and a Logistic Regression. In stacking, instead of relying on just one of these models, you combine their strengths to create a more robust predictor.
Stacked generalization is a method for combining estimators to reduce their biases. More precisely, the predictions of each individual estimator are stacked together and used as input to a final estimator to compute the prediction. This final estimator is trained through cross-validation. Stacking goes beyond simple averaging or voting by training a meta-model to make predictions based on the outputs of diverse base models.
Imagine we have three types of predictors (models) – a Decision Tree, a Linear Regression, and a k-nearest Neighbors (KNN) model. We want to create a super predictor, a Blender, that combines their strengths.
1. Training the First Layer:
We take our dataset and split it into two parts – Training Set A and Hold-Out Set B.
Training Set A:
- Train our three models (Decision Tree, Linear Regression, KNN) on Training Set A. Each model learns to make predictions based on the features in this set.
Hold-Out Set B:
- We keep Hold-Out Set B untouched for now. These are instances our models have never seen during their training.
2. Making Predictions with First Layer:
Now, each model makes predictions on Hold-Out Set B. We get three sets of predictions – one from each model.
3. Creating a New Training Set for the Blender:
We create a new training set whose three features are the three sets of predictions (one feature per base model). The target values (actual outcomes) from Hold-Out Set B are kept as the targets.
4. Training the Blender (Meta-Model):
Our Blender is like a smart friend that learns from these three sets of predictions. We train a meta-model (the Blender) on this new training set of predictions.
5. Predicting with the Stacked Model:
Now, when a new instance comes in:
- Each model (Decision Tree, Linear Regression, KNN) predicts the outcome.
- These predictions become features for our Blender.
- The Blender combines these features to make the final prediction.
6. Layered Blending (Optional):
If we want to go even deeper, we can repeat this process. We split our data into three subsets, train base models on the first subset, use those models to create a training set of predictions for the next layer, and so on.
7. Sequential Prediction:
For a new instance, predictions travel through each layer sequentially – first from the base models, then through the Blender (meta-model), and so on if we have multiple layers.
This stacking approach allows us to harness the collective intelligence of diverse models, improving our overall predictive performance.
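Scikit-learn’s StackingClassifier implements this out-of-fold scheme for you. A minimal sketch in the spirit of the walkthrough, using three classifiers as the first layer and logistic regression as the Blender (all choices here are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=42)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # the Blender / meta-model
    cv=5,  # out-of-fold predictions are used to train the Blender
)
print("stacking accuracy:", cross_val_score(stack, X, y, cv=5).mean())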
Why Stack?
- Diversity Benefits:
- Each model brings a unique perspective. Decision Trees capture complexity, Logistic Regression simplifies, and KNN looks at neighbors.
- Reducing Overfitting:
- While individual models may overfit, combining them with a meta-model helps smooth out their individual quirks.
- Improved Generalization:
- The ensemble can often generalize better to new, unseen data.
Implementations of Concepts In Python
Footnotes And Sources:
- Scott Fortmann-Roe ↩︎
Sources And Further Reading:
- Data Mining with Decision Trees – Book
- https://www.math.mcgill.ca/yyang/resources/doc/randomforest.pdf
- Hands-on Machine Learning Book
- Friedman, J.H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics
- Tianqi Chen, Carlos Guestrin, “XGBoost: A Scalable Tree Boosting System”
- Guolin Ke et al., “LightGBM: A Highly Efficient Gradient Boosting Decision Tree”
- T. Hastie, R. Tibshirani and J. Friedman, “Elements of Statistical Learning Ed. 2”, Springer, 2009.
- Understanding Random Forests From Theory To Application by Gilles Louppe.