In this post, we will learn about our next machine learning algorithm: Decision Trees. Imagine you’re trying to decide what movie to watch. You ask simple questions like “Do you want action or comedy?” and, based on your answers, you keep narrowing down your choices until you find the perfect movie.
That’s exactly how Decision Trees work in computers. They help machines make smart choices by asking simple questions. Let’s dive into the world of Decision Trees and see how they make decisions and how we can use decision trees in machine learning problems.
Prerequisites
- Linear Algebra For Machine Learning
- Probability And Statistics For Machine Learning
- Python, Numpy, Matplotlib And Pandas
What You Will Learn
- Basic Concepts & Definitions
- Information Theory For ML
- ID3, C4.5, C5.0 and CART algorithms
- CART – Classification and Regression Trees
- Regularization for decision trees
- Python Implementation
- Minimal Cost-Complexity Pruning
- And more
Basic Concepts Related To Decision Trees
Before we start, let’s consider an example and see how we can solve a problem by breaking it down into several simple questions. Consider this problem: based on our data, should we offer a loan to a person or not? This is our main question, and we now wish to solve it using a decision tree approach.
Our dataset:
| Income ($) | Credit Score | Loan Approval |
|------------|--------------|---------------|
| 30,000 | 650 | No |
| 50,000 | 700 | Yes |
| 80,000 | 600 | No |
| 120,000 | 750 | Yes |
| 40,000 | 680 | Yes |
| 90,000 | 620 | No |
| 75,000 | 720 | Yes |
| 110,000 | 690 | Yes |
We have two features here, income and credit score, and it’s a binary classification problem: whether to approve the loan or not (Yes/No). We will reach a final decision by asking simple questions based on our dataset. A decision tree can be represented in a flowchart format:
```
                Credit Score < 700?
                 /              \
              Yes                No
               |                  |
               v                  v
       Income < 60,000?     Loan Approval = Yes
         /           \
      Yes             No
       |               |
       v               v
Loan Approval = No   Loan Approval = Yes
```
We first check the credit score: if it is 700 or above, we approve the loan, and if it is below 700, we then check the income feature to analyse the risk of offering the loan. If income is less than $60,000, we do not approve the loan. This is a very simple example where we solve our main problem of whether to approve a loan or not by breaking it down into simple questions. That is what a decision tree is all about. You learnt it, go home and enjoy!!
In machine learning, decision trees are powerful tools for making decisions based on input features. The goal is to create a tree-like model that can predict outcomes by asking a series of questions. Now, suppose we have a complex problem and want to build a decision tree. If we keep asking a series of questions, we may soon run into a problem: how do we decide which questions to ask, and in what order? This is where information theory helps us. We will get back to information theory shortly, but here is an image of a decision tree and some definitions related to its structure:
From the picture above we can see we have a starting node (root) then internal nodes (split) and then we have terminal nodes called leaf. Some terms you should keep in mind:
- Root Node (In circle):
- The topmost node in the tree.
- Represents the entire dataset.
- It is the starting point for the decision-making process.
- Internal Node (In circle):
- Decision points in the tree.
- Nodes where the dataset is split into subsets based on a specific feature and condition.
- Leaf Node (Terminal Node) (In square):
- Endpoints of the tree branches.
- Represent the outcome or decision.
- No further splitting occurs beyond leaf nodes.
- Branch:
- The connection between nodes.
- Represents the outcome of a decision or a path to follow based on a specific condition.
- Splitting:
- Process of dividing a dataset into subsets at an internal node.
- Based on a specific feature and condition.
- Feature:
- The attribute or variable used for splitting the dataset.
- The decision tree selects the best feature to split the data, typically based on metrics like Information Gain or Gini Index.
- Condition:
- The rule or threshold used to split the data at an internal node.
- For example, “Credit Score < 700” is a condition.
- Decision:
- Outcome or prediction assigned to a leaf node.
- Represents the final result based on the path taken through the tree.
- Decision Tree Depth:
- The length of the longest path from the root to a leaf node.
- A deeper tree can capture more complex patterns but may be prone to overfitting.
- Parent Node and Child Node:
- In a split, the node that is split is the parent node.
- The nodes created after the split are the child nodes.
Definition: According to Sklearn, Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation. The deeper the tree, the more complex the decision rules and the fitter the model.
Now, let’s discuss the concepts related to information theory that are used in machine learning and how we can use information theory in decision trees as discussed earlier.
Information Theory In Machine Learning
Let’s imagine we’re on a treasure hunt! Imagine you have a friend who loves to play a game of hide-and-seek, but instead of hiding, they leave little notes that give hints about where the treasure is buried.
Now, think of these notes as messages. Sometimes, your friend might give you very straightforward hints, like “Go left,” “Look for a big tree,” or “Count to three and then turn right.” Other times, they might be a bit tricky and give you hints like “Find the spot where the birds sing the loudest.”
Information theory is like a special set of rules we use to understand these hints and messages. It helps us figure out how much information is in each message and how to make messages more efficient.
In our treasure hunt, we want to make sure that the hints are just right—not too long and not too short. If the hints are too short, you might get confused and not find the treasure. If they’re too long, it might take forever to follow all the instructions.
So, information theory helps us find the perfect balance between giving enough information to guide you to the treasure and keeping things short and sweet. It’s like finding the magic recipe for the best hints!
Information theory is a field of study that deals with the quantification of information and the fundamental limits of compressing, transmitting, and processing information.
Information theory was formalized by Claude Shannon in the 1940s. His groundbreaking work laid the foundation for understanding the fundamental limits and possibilities of information processing. Shannon’s contributions have had a profound impact on various fields, from telecommunications to computer science.
Information theory concepts, such as entropy and information gain, are employed in machine learning algorithms. Decision trees, for example, use information gain to determine the most informative features for classification. In ML, we use these two concepts and that’s what we are going to discuss in the notes below.
Now let’s get to entropy, which is nothing but a measure of disorder or impurity in a dataset. In the context of decision trees, it is used to evaluate the homogeneity of a set of labels. A dataset with low entropy is more organized and contains more uniform labels, while high entropy indicates a more mixed dataset. We use such impurity measures as splitting criteria for the nodes (i.e., to decide which questions to ask). There are various splitting criteria, but we mostly use information gain and the Gini index. Here are their definitions and formulas:
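For a node $S$ whose samples fall into $k$ classes with proportions $p_1, \dots, p_k$, the standard textbook definitions are:

$$H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i \quad \text{(entropy)}$$

$$Gini(S) = 1 - \sum_{i=1}^{k} p_i^2 \quad \text{(Gini index)}$$

Information gain measures how much a split on feature $A$, which partitions $S$ into subsets $S_v$, reduces entropy:

$$IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} \, H(S_v)$$

Both entropy and the Gini index are 0 for a perfectly pure node; the tree prefers the split that yields the largest information gain (equivalently, the largest drop in weighted impurity).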
I am avoiding the manual calculation of decision trees, but in the application section I will show you with code how the gain guides further splits. Done by hand, things soon become cumbersome. If you are interested in the manual calculation, please visit this link here and see for yourself.
Parametric And Non-parametric Models
1. Parametric Models:
- Assumption: Parametric models assume a specific functional form for the underlying data distribution. In other words, these models make assumptions about the shape or structure of the data.
- Parameters: These models are characterized by a fixed number of parameters, regardless of the size of the dataset. Common examples include linear regression, logistic regression, and Gaussian Naive Bayes.
- Advantages:
- They are computationally efficient, especially with large datasets.
- With sufficient data and correct assumptions, parametric models can provide precise parameter estimates.
- Disadvantages:
- If the assumed form of the distribution is incorrect, the model may perform poorly.
- Limited flexibility to capture complex patterns in the data.
2. Non-parametric Models:
- Assumption: Non-parametric models make fewer assumptions about the underlying data distribution. Instead of assuming a specific form, they aim to learn the patterns directly from the data.
- Parameters: The number of parameters in non-parametric models can grow with the size of the dataset. Examples include decision trees, k-nearest neighbours, and support vector machines with non-linear kernels.
- Advantages:
- They can capture complex relationships and patterns in the data without assuming a specific distribution.
- More flexible and can adapt to various types of data.
- Disadvantages:
- They might require more data to accurately estimate the underlying patterns.
- Can be computationally intensive, especially with large datasets.
Decision trees are non-parametric, which means their parameters are not fixed before training. Such models can adapt very closely to the training data if given complete freedom, and this may lead to overfitting. So, to regularize our decision trees, we restrict some of their freedom in advance. There are various hyperparameters we can choose in Sklearn; it will become clear in the practical examples.
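As a rough sketch of what that restriction looks like in scikit-learn (the specific values below are illustrative, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unrestricted tree is free to grow until every leaf is pure (overfitting risk).
unrestricted = DecisionTreeClassifier(random_state=42).fit(X, y)

# A regularized tree: these hyperparameters cap its complexity in advance.
regularized = DecisionTreeClassifier(
    max_depth=3,           # longest allowed root-to-leaf path
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    max_leaf_nodes=8,      # upper bound on the number of leaves
    random_state=42,
).fit(X, y)

print("unrestricted depth:", unrestricted.get_depth())
print("regularized depth: ", regularized.get_depth())
```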
Some Data Pre-Processing Tips
Handling Categorical Data in the Context of Decision Trees:
Techniques for Dealing with Non-Numeric Features: Decision trees inherently handle categorical data, but most implementations require numerical input. Some techniques to handle categorical data include:
- Label Encoding: Assign unique numerical labels to categories. However, be cautious as decision trees may misinterpret the ordinal relationships introduced by this encoding.
- Ordinal Encoding: Similar to label encoding, but used when there is a meaningful order among categories.
- One-Hot Encoding:
- Context for Decision Trees: Particularly useful for decision trees, as it avoids introducing ordinal relationships. Each category becomes a binary feature (0 or 1).
- Implementation: Libraries like scikit-learn provide functions (e.g., OneHotEncoder) to convert categorical variables to a one-hot encoded format.
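Here is a minimal sketch of both options, using a hypothetical toy frame (the column names are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical and one numeric feature.
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai"],
    "income": [30_000, 50_000, 80_000, 40_000],
})

# Option 1: pandas one-hot encoding.
print(pd.get_dummies(df, columns=["city"]))

# Option 2: scikit-learn's OneHotEncoder (fits into Pipelines / ColumnTransformer).
enc = OneHotEncoder(handle_unknown="ignore")
city_encoded = enc.fit_transform(df[["city"]]).toarray()
print(enc.get_feature_names_out(["city"]))
print(city_encoded)
```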
Pruning Techniques in the Context of Decision Trees:
Pre-Pruning vs. Post-Pruning:
- Pre-Pruning (Early Stopping): Stop growing the tree before it reaches its maximum depth. It can prevent overfitting by limiting the tree’s complexity.
- Post-Pruning (Pruning After Tree Construction): Allow the tree to grow to its full depth and then prune branches. Pruning decisions are based on criteria like the minimum number of samples in a leaf or the maximum depth.
Cost-Complexity Pruning:
- Context for Decision Trees: Cost-complexity pruning optimizes decision trees by controlling their complexity while considering misclassification costs.
- Implementation: Often implemented using the cost-complexity parameter (ccp_alpha) in scikit-learn’s DecisionTreeClassifier.
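In code, this is just one constructor argument; the alpha value below is purely illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# ccp_alpha=0.0 (the default) means no cost-complexity pruning.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A small positive alpha prunes branches whose impurity improvement is not
# "worth" their added complexity.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42).fit(X_train, y_train)

print("leaves (unpruned):", full_tree.get_n_leaves(), "test acc:", round(full_tree.score(X_test, y_test), 3))
print("leaves (pruned):  ", pruned_tree.get_n_leaves(), "test acc:", round(pruned_tree.score(X_test, y_test), 3))
```

A sketch of how to pick the alpha value itself (via `cost_complexity_pruning_path`) appears in the ccp_alpha question near the end of the post.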
Handling Missing Data in the Context of Decision Trees:
Strategies for Managing Missing Values:
- Imputation: Fill missing values with a statistic (mean, median, or mode) of the available values.
- Category for Missing Values: Create a separate category for missing values if it makes sense in the context.
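A minimal imputation sketch with scikit-learn’s `SimpleImputer` (the data here is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with missing values in a numeric and a categorical column.
df = pd.DataFrame({
    "income": [30_000, np.nan, 80_000, 40_000],
    "city": ["Delhi", "Mumbai", None, "Delhi"],
})

# Numeric column: fill missing entries with the median of the observed values.
df["income"] = SimpleImputer(strategy="median").fit_transform(df[["income"]]).ravel()

# Categorical column: either fill with the mode (strategy="most_frequent"), or
# treat "missing" as its own category, as described above.
df["city"] = df["city"].fillna("Missing")

print(df)
```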
Impact on Decision Tree Construction:
- Decision Trees’ Robustness: Some decision tree algorithms (for example C4.5, or CART with surrogate splits) can handle missing data naturally. When splitting a node, the algorithm considers all available features, and if a feature has missing values, it can still decide the split based on other features.
- Decision Rules: The decision rules might be influenced by the presence or absence of missing values, but such implementations handle this without requiring imputation. Whether you can skip imputation in practice depends on the library and version; with scikit-learn’s trees, imputation is often still needed.
Advantages And Disadvantages Of Decision Trees
Based on Sklearn Documentation
Some advantages of decision trees are:
- Simple to understand and to interpret. Trees can be visualized.
- Requires little data preparation. Other techniques often require data normalization, dummy variables need to be created and blank values to be removed. Some tree and algorithm combinations support missing values.
- The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
- Able to handle both numerical and categorical data. However, the scikit-learn implementation does not support categorical variables for now. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See algorithms for more information.
- Able to handle multi-output problems.
- Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by Boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.
- Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
- Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
The disadvantages of decision trees include:
- Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
- Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
- Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations. Therefore, they are not good at extrapolation.
- Why decision trees are not good at extrapolation:
- Step-like Structure: Decision trees partition the input space into regions based on feature thresholds. Within each region, the predicted value is constant. When you move from one region to another (due to a split), the prediction can jump abruptly, creating a step-like structure.
- Local Modeling: Decision trees focus on capturing local patterns in the data. Each split is based on the best separation at that particular node, without considering the overall structure of the entire feature space. As a result, when the model encounters values outside the training data range, it may not generalize well.
- Sensitivity to Training Data: Decision trees are sensitive to variations in the training data. If the training data does not cover certain regions of the input space, the tree might not make accurate predictions in those regions.
- For tasks that involve extrapolation or capturing smooth, continuous relationships, other models like linear regression, polynomial regression, or spline-based models might be more suitable. These models are designed to capture global patterns and can provide smoother predictions even outside the training data range.
- The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
- Some concepts are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
- Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset before fitting it with the decision tree.
Tree algorithms: ID3, C4.5, C5.0 and CART
Sklearn uses an optimised version of the CART algorithm but there are so many other algorithms used in decision tree development. Here are some you should know.
ID3, C4.5, And C5.0
- ID3 (Iterative Dichotomiser 3):
- Purpose: ID3 is one of the earliest decision tree algorithms, developed by Ross Quinlan. It was designed for classification tasks.
- Split Criterion: ID3 uses Information Gain as the splitting criterion. It aims to maximize the mutual information between features and the target variable.
- Tree Construction: The algorithm recursively selects the feature that provides the highest Information Gain at each node. It continues until a stopping criterion is met, such as reaching a predefined tree depth or creating a leaf node with pure (homogeneous) target values.
- Caveats: ID3 tends to create deep trees, which can lead to overfitting.
- C4.5 (Successor to ID3):
- Purpose: C4.5, also developed by Ross Quinlan, is an improvement over ID3 and is used for classification tasks.
- Split Criterion: C4.5 uses Gain Ratio as the splitting criterion, addressing some limitations of Information Gain. Gain Ratio accounts for the intrinsic information of a feature.
- Tree Construction: Similar to ID3, C4.5 constructs a decision tree recursively. It selects the feature with the highest Gain Ratio at each node until a stopping criterion is met.
- Handling Continuous Attributes: C4.5 can handle continuous attributes by converting them into discrete intervals.
- Pruning: C4.5 incorporates a pruning mechanism to avoid overfitting by removing branches that do not significantly improve predictive accuracy on a validation dataset.
- C5.0 (Successor to C4.5):
- Purpose: C5.0, also developed by Ross Quinlan, is an enhanced version of C4.5, used for classification tasks.
- Split Criterion: C5.0 uses a proprietary metric based on the Information Gain and Gain Ratio. It also includes a different approach for continuous attributes, called “Stairs.”
- Tree Construction: Similar to its predecessors, C5.0 constructs a decision tree recursively, selecting features based on its proprietary metric.
- Handling Missing Values: C5.0 can handle missing values in the dataset during tree construction.
- Rule Generation: C5.0 can generate rules from the decision tree, providing a human-readable representation of the model.
- Performance: C5.0 is known for its improved efficiency and can handle larger datasets compared to C4.5.
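To make the difference between ID3’s and C4.5’s criteria concrete, here is a small NumPy sketch that computes information gain and gain ratio for one toy split (the labels and groups are made up for illustration):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent, children):
    """ID3 criterion: parent entropy minus the weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

def gain_ratio(parent, children):
    """C4.5 criterion: information gain normalized by the split information."""
    n = len(parent)
    split_info = -sum(len(c) / n * np.log2(len(c) / n) for c in children)
    return information_gain(parent, children) / split_info

# Toy "Yes"/"No" loan labels, split by some categorical feature into 3 groups.
parent = np.array(["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"])
children = [np.array(["Yes", "Yes", "No"]),
            np.array(["Yes", "No", "No"]),
            np.array(["Yes", "No"])]

print("information gain:", round(information_gain(parent, children), 3))
print("gain ratio:      ", round(gain_ratio(parent, children), 3))
```

The gain ratio divides by the split information, which penalizes splits that shatter the data into many small groups — the bias of plain information gain that C4.5 was designed to correct.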
For more algorithms, check out the book Data Mining with Decision Trees by Lior Rokach and Oded Maimon. Let’s talk about the CART algorithm because it is implemented in Sklearn and it is also used in regression tasks.
CART – Classification and Regression Trees
CART (Classification and Regression Trees) is a decision tree algorithm that can be used for both classification and regression tasks. It was developed by Leo Breiman in the 1980s and has become one of the most widely used decision tree algorithms. CART constructs binary trees, meaning each node has at most two children.
Key Features of CART:
- Purpose:
- CART can be used for both classification and regression tasks, making it versatile.
- For classification, the algorithm predicts the class labels of instances.
- For regression, it predicts a continuous target variable.
- Split Criterion:
- Classification: CART uses Gini impurity as the default splitting criterion for classification. The Gini impurity measures the probability of misclassifying a randomly chosen element.
- Regression: For regression tasks, CART uses mean squared error (MSE) as the splitting criterion. It aims to minimize the variance within each node.
- Tree Construction:
- The construction of the tree is a recursive process.
- At each node, CART selects the predictor variable and split point that minimizes the impurity (for classification) or minimizes the mean squared error (for regression).
- This process continues until a stopping criterion is met, such as reaching a maximum depth, having a minimum number of samples in a leaf, or no further improvement in impurity or MSE.
- Handling Categorical and Continuous Variables:
- CART handles both categorical and continuous predictor variables.
- For categorical variables, CART still makes binary splits, searching for the best partition of the categories into two groups.
- For continuous variables, CART identifies optimal split points based on impurity or MSE reduction.
- As of writing this post, Scikit-learn uses an optimized version of the CART algorithm; however, the scikit-learn implementation does not support categorical variables for now.
- Pruning:
- CART often constructs a large tree and then prunes it to prevent overfitting.
- Pruning involves removing branches of the tree that do not significantly improve predictive accuracy on a validation dataset.
- Tree Representation:
- The resulting tree is a binary tree with nodes representing decisions based on predictor variables.
- Leaf nodes represent the final predictions.
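The “Split Criterion” bullets above are easiest to see in code. Here is a simplified sketch of the threshold search CART performs on a single numeric feature using Gini impurity, applied to the toy loan table from the start of the post (a real implementation scans all features and uses more efficient bookkeeping):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def best_threshold(feature, labels):
    """Try midpoints between sorted feature values and return the threshold
    that minimizes the weighted Gini impurity of the two child nodes."""
    order = np.argsort(feature)
    x, y = feature[order], labels[order]
    best_thr, best_impurity = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        thr = (x[i] + x[i - 1]) / 2
        left, right = y[:i], y[i:]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if weighted < best_impurity:
            best_thr, best_impurity = thr, weighted
    return best_thr, best_impurity

income = np.array([30_000, 50_000, 80_000, 120_000, 40_000, 90_000, 75_000, 110_000])
approved = np.array(["No", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes"])

thr, imp = best_threshold(income, approved)
print(f"best split on income: < {thr:.0f} (weighted Gini = {imp:.3f})")
```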
CART for regression
1. Objective:
- Task: Predict a continuous target variable.
- Example: Predict the house price based on various features such as size, number of bedrooms, and location.
2. Split Criterion (Impurity Measure for Regression):
- Mean Squared Error (MSE):
- MSE is used to measure the impurity or error in each node.
- It quantifies the average squared difference between the actual target values and the mean target value in a node.
3. Tree Construction:
- Recursive Binary Splitting:
- The algorithm starts with the entire dataset as the root node.
- At each node, it selects the predictor variable and split point that minimizes the MSE.
- The data is partitioned into two subsets based on the chosen split.
- This process is applied recursively to each subset until a stopping criterion is met.
4. Prediction:
- Leaf Node Predictions:
- For a given input, it traverses the tree to reach a leaf node.
- The predicted value for the input is the mean of the target values in the leaf node.
5. Tree Pruning:
- Pruning:
- After constructing a large tree, pruning is applied to prevent overfitting.
- Pruning involves removing branches of the tree that do not significantly improve predictive accuracy on a validation dataset.
- The decision to prune is based on measures like cost-complexity pruning.
If you are looking for the mathematical formulation, visit the Sklearn documentation here.
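A minimal regression sketch (synthetic data, illustrative depths) that shows the piecewise-constant behaviour: the tree can only predict the mean of a leaf, so its output looks like steps.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)

# Noisy sine curve: a simple 1-D regression problem.
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# Two depths: the deeper tree produces more (and smaller) steps.
shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
deep = DecisionTreeRegressor(max_depth=5).fit(X, y)

X_test = np.linspace(0, 5, 10).reshape(-1, 1)
print("x:             ", np.round(X_test.ravel(), 2))
print("depth-2 preds: ", np.round(shallow.predict(X_test), 2))  # few distinct values
print("depth-5 preds: ", np.round(deep.predict(X_test), 2))     # more distinct values
```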
Advantages of CART:
- Simple and Intuitive:
- Decision trees are easy to understand and interpret. The visual representation of the tree structure makes it accessible to non-experts.
- Handling Numerical and Categorical Data:
- CART can handle both numerical and categorical data, making it versatile for a variety of datasets.
- Handling Missing Values:
- The algorithm can handle missing values by using surrogate splits. Surrogate splits are alternative decision rules that are used when the primary split is not applicable due to missing data.
- Multi-Class Classification:
- CART handles multi-class classification directly: the Gini impurity generalizes to any number of classes, and each leaf predicts the most common class among the samples that reach it.
Disadvantages of CART:
- Overfitting:
- Decision trees, including CART, tend to overfit the data, especially if the tree is allowed to grow too deep. Overfitting occurs when the model captures noise or specific patterns in the training data that do not generalize well to new, unseen data.
- Greedy Algorithm:
- CART is a greedy algorithm, meaning it makes locally optimal decisions at each node without considering the global optimal structure of the tree. This may lead to suboptimal trees.
- Bias Towards High Cardinality Predictors:
- CART may be biased towards predictors with many categories or high cardinality. It may favour splits on variables with more categories, potentially leading to trees that are sensitive to noise.
- Sensitivity to Small Changes or Noise:
- Decision trees, in general, can be sensitive to small changes or noise in the data. A small perturbation in the training data may result in a significantly different tree structure.
Application Of Decision Trees Using Sklearn
In this section, you will learn about the application of decision trees in real-world problems. I have explained everything in the file below. Please download it and start coding. Ask your queries in the forum. I will help you as soon as possible.
Things you will learn:
1. Classification On Iris Dataset – Sklearn DecisionTreeClassifier On Default Mode
2. Hyperparameters in Sklearn DecisionTreeClassifier
3. Decision Trees Classification On Breast Cancer Dataset
4. Regularization Techniques Such As Minimal Cost-Complexity Pruning
5. Decision Tree For Digits Classification
6. Decision Tree For Regression Datasets – California Housing Price
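The notebook itself is not reproduced here, but as a quick taste of item 1, here is a minimal sketch of a default-mode DecisionTreeClassifier on the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# 1. Classification on the Iris dataset with DecisionTreeClassifier in default mode.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

clf = DecisionTreeClassifier(random_state=42)  # default hyperparameters
clf.fit(X_train, y_train)

print("test accuracy:", round(clf.score(X_test, y_test), 3))

# A plain-text view of the learned decision rules.
print(export_text(clf, feature_names=load_iris().feature_names))
```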
Questions:
What Is Homogeneity In The Context Of Decision Trees?
Homogeneity of class in the context of decision trees refers to the extent to which the instances within a node share the same target class or label. It is a measure of the purity or uniformity of the class distribution within a node. The primary goal when constructing a decision tree is to create nodes that are as homogeneous as possible, meaning that the majority of instances in a node belong to the same class.
- Node Homogeneity:
- A node in a decision tree represents a subset of the training data. The homogeneity of a node is a measure of how pure or impure the node is in terms of class labels.
- A node is considered homogeneous if all the instances within that node belong to the same class. In other words, there is no mixing of different class labels.
- Impurity Measures:
- Decision tree algorithms use impurity measures to quantify the homogeneity of a node. Common impurity measures include Gini impurity and entropy.
- Gini impurity measures the probability of incorrectly classifying an instance in a node. Lower Gini impurity values indicate higher homogeneity.
- Entropy measures the average amount of information needed to identify the class of an instance in a node. Lower entropy values indicate higher homogeneity.
- Splitting Criteria:
- When growing a decision tree, the algorithm selects the best feature and threshold to split the data in a way that maximizes homogeneity in the resulting child nodes.
- The split is chosen to reduce impurity, leading to subsets (child nodes) that are more pure or homogeneous than the parent node.
- Decision Tree Growth:
- The recursive nature of decision tree growth involves continually splitting nodes until a stopping criterion is met. This criterion may include reaching a maximum depth, having nodes with a minimum number of instances, or achieving a certain level of homogeneity.
- Interpretability and Predictive Power:
- Homogeneous nodes contribute to the interpretability of the decision tree because the decision rules are clear and concise.
- High homogeneity within nodes generally leads to a more accurate predictive model, as decisions based on pure or nearly pure nodes are more reliable.
Explain Cost Functions Associated With Decision Trees Tasks.
- Minimizing Impurity (Gini Impurity):
- Gini Impurity: Gini impurity is a measure of how often a randomly chosen element would be incorrectly classified. It ranges from 0, which represents perfect homogeneity (all elements belong to the same class), up to 1 - 1/k for k evenly distributed classes (0.5 for a balanced binary node).
- Objective: In decision tree algorithms using Gini impurity, the goal is to minimize the Gini impurity at each node.
- Splitting Criteria: When selecting a feature and a threshold to split the data, the algorithm looks for the split that minimizes the weighted sum of Gini impurities in the resulting child nodes.
- Tree Growth: The algorithm recursively applies this process to create a tree with nodes that have minimal impurity.
- Maximizing Homogeneity (Minimizing Entropy):
- Entropy: Entropy measures the average amount of information needed to identify the class of an element in a node. It ranges from 0, which represents perfect homogeneity, up to log2(k) for k evenly distributed classes (1 bit for a balanced binary node); higher values represent higher impurity.
- Objective: In decision tree algorithms using entropy, the goal is to maximize homogeneity or minimize entropy at each node.
- Splitting Criteria: Similar to Gini impurity, the algorithm selects the split that minimizes the weighted sum of entropies in the resulting child nodes.
- Tree Growth: The tree is grown recursively by choosing splits that lead to nodes with minimal entropy.
- Minimizing Mean Squared Error (MSE):
- Mean Squared Error (MSE): For regression tasks, MSE is a common measure of the average squared difference between predicted values and actual values.
- Objective: In decision tree regression, the goal is to minimize the MSE at each node.
- Splitting Criteria: When selecting a feature and a threshold to split the data, the algorithm looks for the split that minimizes the weighted sum of MSE in the resulting child nodes.
- Tree Growth: The algorithm recursively applies this process to create a tree with nodes that have minimal MSE.
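For the regression case, the same “weighted cost of the children” logic applies with MSE in place of impurity; here is a tiny sketch with made-up house prices:

```python
import numpy as np

def node_mse(y):
    """MSE of predicting the node mean for every sample in the node."""
    return np.mean((y - y.mean()) ** 2)

# Hypothetical house prices (in $1000s) reaching a node, and one candidate split.
y = np.array([200.0, 210.0, 190.0, 400.0, 420.0, 390.0])
left, right = y[:3], y[3:]  # e.g. "size < some threshold"

parent = node_mse(y)
weighted_children = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y)

print("parent MSE:          ", round(parent, 1))
print("weighted child MSE:  ", round(weighted_children, 1))
print("reduction from split:", round(parent - weighted_children, 1))
```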
Explain The Alpha Parameter (ccp_alpha)
The parameter alpha (referred to as ccp_alpha) is used to control the trade-off between the goodness of fit and the complexity of the tree. Cost-complexity pruning is a technique that penalizes complex trees to avoid overfitting.
Cost-Complexity Pruning for Classification: Decision trees are prone to overfitting, especially when they are deep and capture noise in the training data. Cost-complexity pruning is a technique used to address overfitting by penalizing complex trees. It is explained in the notebook.
- The parameter alpha is a non-negative regularization parameter that controls the trade-off between tree complexity and goodness of fit.
- As alpha increases, the penalty for the number of terminal nodes in the tree becomes more significant, favouring simpler trees.
- The pruning process involves iteratively removing branches of the tree based on the cost-complexity criterion. As alpha increases, more branches are pruned, creating a simpler tree.
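Scikit-learn exposes the list of candidate alphas through `cost_complexity_pruning_path`. A minimal sketch of using it to pick an alpha (a validation set or cross-validation would be a more careful choice than the test set used here):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The "effective alphas" at which subtrees get pruned away.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

# Train one tree per candidate alpha and keep the best-scoring one.
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    score = tree.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha = {best_alpha:.5f}, test accuracy = {best_score:.3f}")
```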
In the next tutorial, we will learn some more advanced concepts related to decision trees such as random forest and ensembling methods. Feel free to ask your questions in the forum.
Resources
- https://www.cl.cam.ac.uk/teaching/0809/InfoTheory/InfoTheoryLectures.pdf
- Data Mining with Decision Trees Theory and Applications by Lior Rokach and Oded Maimon
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow.
- Python Data Science Handbook.