In this post, we will learn about our next machine learning algorithm: the support vector machine, also called SVM or support vector networks. This is a crucial concept and a powerful algorithm that has an advantage over neural networks when it comes to finding the optimum solution, since its training objective is convex. It is a supervised learning algorithm, used mainly for classification and regression tasks and sometimes even for clustering.
But what makes SVM truly special? How does it work its magic in discerning patterns within data? In this tutorial, we embark on a journey into the world of SVM, unraveling its concepts, breaking down its mathematical underpinnings, and exploring its applications across various domains. This post is everything you need to learn SVM from scratch to applications. I will explain the fundamentals and show you how you can apply them to real-world problems. So, let’s get started.
Table of Contents
Prerequisites:
- Python, Numpy, Sklearn, Pandas and Matplotlib.
- Linear Algebra For Machine Learning.
- Statistics And Probability Theory.
- Advanced Calculus For Machine Learning – Lagrangian, Gradients etc.
You can purchase the notes here for LA and S&P:
What You Will Learn:
- Concept of Support Vector Machine
- Convex Sets and Convex Functions
- Concept of Duality
- Linear SVM
- Hard Margin Classifier
- Soft Margin Classifier
- Kernels
- Kernel Tricks
- Non-Linear SVM
- SVM for regression
- Implementation of SVM and more
Concepts And Definitions Of Support Vector Machine:
Imagine we have a dataset containing two classes and we wish to develop an algorithm that classifies them successfully and generalizes well to unseen data. That’s where the SVM comes in. Other algorithms can certainly perform the same task, but what sets SVM apart is that it offers a unique solution to the problem through the concept of margin: it gives us the optimum hyperplane, the one that maximizes the margin between the two classes.
We will come back to these concepts later, but first, let’s consider the possible representations of our dataset. In the first case, the classes are easily separable by a hyperplane, which means the data points from each class lie cleanly on either side of it. In the second case, a few data points fall on the wrong side, which means the classes overlap slightly. In the final case, the dataset is not linearly separable at all. So, in a nutshell, we have to work out a solution for each of these three cases in order to develop our algorithm. Here are the three scenarios:
We are going to solve these three cases one by one. Let’s look at the definitions of SVM:
Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. They belong to a family of generalized linear classifiers. In other terms, Support Vector Machine (SVM) is a classification and regression prediction tool that uses machine learning theory to maximize predictive accuracy while automatically avoiding over-fit to the data. Support Vector machines can be defined as systems which use hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory.
– Vikramaditya Jakkula [1]
The support vector machine (SVM) is a supervised learning method that generates input-output mapping functions from a set of labelled training data. The mapping function can be either a classification function, i.e., the category of the input data, or a regression function. For classification, nonlinear kernel functions are often used to transform input data to a high-dimensional feature space in which the input data becomes more separable compared to the original input space. Maximum-margin hyperplanes are then created. The model thus produced depends on only a subset of the training data near the class boundaries. Similarly, the model produced by Support Vector Regression ignores any training data that is sufficiently close to the model prediction. SVMs are also said to belong to “kernel methods”.
– Lipo Wang, 2005
Support Vector Machines are a relatively new learning method used for binary classification. The basic idea is to find a hyperplane which separates the d-dimensional data perfectly into its two classes. However, since example data is often not linearly separable, SVMs introduce the notion of a “kernel-induced feature space” which casts the data into a higher dimensional space where the data is separable. Typically, casting into such a space would cause problems computationally, and with overfitting. The key insight used in SVMs is that the higher-dimensional space doesn’t need to be dealt with directly (as it turns out, only the formula for the dot-product in that space is needed), which eliminates the above concerns. Furthermore, the VC-dimension (a measure of a system’s likelihood to perform well on unseen data) of SVMs can be explicitly calculated, unlike other learning methods like neural networks, for which there is no measure. Overall, SVMs are intuitive, theoretically well-founded, and have shown to be practically successful. SVMs have also been extended to solve regression tasks (where the system is trained to output a numerical value, rather than a “yes/no” classification).
– Dustin Boswell [2]
Support Vector Machines, are supervised learning machines based on statistical learning theory that can be used for pattern recognition and regression. Statistical learning theory can identify rather precisely the factors that need to be taken into account to learn successfully certain simple types of algorithms, however, real-world applications usually need more complex models and algorithms (such as neural networks), which makes them much harder to analyse theoretically. SVMs can be seen as lying at the intersection of learning theory and practice. They construct models that are complex enough (containing a large class of neural networks for instance) and yet that are simple enough to be analysed mathematically. This is because an SVM can be seen as a linear algorithm in a high-dimensional space.
– Steven Busuttil [3]
As we can see from the above definitions, SVM is all about finding an optimum hyperplane (decision boundary) that maximizes the margin, in such a way that this hyperplane stays as far as possible from the closest members of both classes. Now, what is the margin? Imagine one hyperplane that sits close to class 1 and another that sits close to class 2. The margin is simply the distance between these two hyperplanes, and in SVM we try to maximize this distance.
More technically, the margin is the distance between the positive and negative hyperplanes, i.e., twice the perpendicular distance from the decision boundary to the nearest data points (called support vectors). The larger the margin, the more confident the SVM is in its classification. We can calculate the margin between these two hyperplanes by finding their distances from the origin and then subtracting them. Here is how you can visualize them:
This is how we formulate the problem: we take the optimum hyperplane, then a negative hyperplane on one class’s side and a positive hyperplane on the other’s (see the figure above). These negative and positive hyperplanes pass through the nearest data points (called support vectors), and their perpendicular distance from the optimum hyperplane is 1/∥w∥. Since we have two such hyperplanes, we multiply by 2, which gives us the margin 2 × 1/∥w∥ = 2/∥w∥.
Maximizing this margin is equivalent to minimizing ∥w∥²/2. We do this for mathematical simplicity, as it turns our problem into a quadratic programming (QP) problem. This means we will be able to use concepts such as the Lagrangian, the KKT conditions and duality.
Now we choose the constraints so that the data points from each class lie on the correct side, which is our goal here. In binary classification, there are two class labels, typically denoted +1 and -1. If the class is +1, then wᵀx + b should be greater than or equal to 1 for correct classification. If the class is -1, then wᵀx + b should be less than or equal to -1 for correct classification.
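Putting the objective and the constraints together, the hard-margin primal problem can be written compactly as follows; this is the standard textbook formulation, using the same notation as above:

```latex
\begin{aligned}
\min_{\mathbf{w},\, b} \quad & \frac{1}{2}\lVert \mathbf{w} \rVert^{2} \\
\text{subject to} \quad & y_i \left( \mathbf{w}^{T}\mathbf{x}_i + b \right) \ge 1, \qquad i = 1, \dots, n
\end{aligned}
```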
We then form the Lagrangian of this QP problem, apply the KKT conditions, and formulate the dual. We can then find our optimum hyperplane using a quadratic programming solver. These concepts are explained in the notes ahead; please read them carefully and ask your questions in the forum if you have any.
Understanding Convex Sets, Convex Functions, Duality And More Concepts:
In SVM we formulate the problem as a convex quadratic optimization problem, so you must know the concepts of convex sets and convex functions. Convexity is preferred for various reasons:
- Convex optimization problems have desirable mathematical properties, making them more amenable to efficient algorithms and optimization techniques. Many convex optimization problems can be solved globally and efficiently.
- Convex optimization problems often have unique solutions.
- Unlike non-convex optimization problems, convex optimization problems do not suffer from the presence of multiple local minima. This characteristic ensures that optimization algorithms converge to the global minimum. This is one way SVM differs from neural networks, whose non-convex objectives can trap training in local minima.
- Convex functions have the property that any local minimum is also a global minimum. This makes the optimization process more reliable and provides a guarantee that the solution obtained is the best possible.
- Convex optimization problems often have associated dual problems that provide additional insights and optimization opportunities.
- Convexity provides a level of robustness to optimization problems.
Here is the theory; please read it before moving ahead:
Concept Of Duality:
Convex Hull And Error Minimizing LPP:
We also need to understand these concepts in the context of the binary classification problem.
Now that we have these concepts, let’s derive the SVM for binary classification.
Support Vector Machines For Binary Class Classification (SVC)
In this section, we will derive SVM for binary class classification for three different scenarios as discussed in the beginning.
Linear Support Vector Machine
Scenario 1: When Data Is Linearly Separable (Hard Margin Classifier)
When the two classes are fully linearly separable by a hyperplane, we call the model a hard margin classifier because the hyperplane maintains a strict margin between the classes. The term “hard margin” indicates that there is no tolerance for misclassification: the goal is to find a hyperplane that allows no data points to fall within the margin or on the wrong side of the decision boundary. Here is the derivation of the Hard Margin Classifier:
You don’t need to memorize the derivation; rather, try to understand how it is done. The dual problem is solved using various algorithms that come pre-implemented in the scikit-learn library.
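As a minimal sketch (not the notebook that accompanies this post), a near-hard-margin classifier can be approximated in scikit-learn by combining a linear kernel with a very large C, which leaves almost no tolerance for margin violations:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs: a case where a hard margin is feasible
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.6, random_state=42)

# A very large C approximates the hard-margin classifier
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("w:", clf.coef_[0])                      # normal vector of the hyperplane
print("b:", clf.intercept_[0])                 # bias term
print("support vectors:", len(clf.support_vectors_))
print("margin width 2/||w||:", 2 / np.linalg.norm(clf.coef_[0]))
```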
When to use:
- Hard margin classifiers are suitable when the data is linearly separable.
- If the application demands zero tolerance for misclassification and the assumption of perfect separability holds, a hard margin classifier might be appropriate.
- Hard margin classifiers are sensitive to noise and outliers. Therefore, they are most effective when the dataset is clean, and outliers are either minimal or have been appropriately handled.
- Hard margin classifiers are designed for binary classification problems where there are two classes to be distinguished.
- In situations where having a clear and robust separation between classes is critical, a hard-margin classifier might be preferred.
Scenario 2: When Data Is Not Fully Linearly Separable (Soft Margin Classifier)
In the second scenario, some data points are not entirely separable from the opposite class. In this case, we introduce a non-negative slack variable ξi (“xi”) for each point to relax the initial constraints a little, so that the model can tolerate a certain degree of misclassification or points falling within the margin.
What this means is that our constraint now becomes yi(wᵀxi + b) ≥ 1 − ξi. We now allow a certain degree of misclassification depending on the value of the slack variable. In summary:
- If ξi = 0, the data point is correctly classified and satisfies the margin, i.e., it lies on the correct side of the decision boundary.
- If 0 < ξi ≤ 1, the data point falls inside the margin but is still on the correct side of the decision boundary; the violation is within the allowed tolerance.
- If ξi > 1, the data point is misclassified, i.e., it violates the margin constraint by more than the allowed tolerance.
Additionally, our optimization problem also changes a little bit (see the notes below). We add a penalty term on the slack variables, weighted by a user-defined regularization parameter C. The role of C is very important because it determines how heavily these margin violations are penalized. Please note that the first part of the objective function is the regularization term and C is the regularization parameter.
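For reference, the soft-margin primal problem takes the following standard form, again in the notation used above:

```latex
\begin{aligned}
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad & \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i \left( \mathbf{w}^{T}\mathbf{x}_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \qquad i = 1, \dots, n
\end{aligned}
```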
Small C :
- A small value of C places more emphasis on achieving a larger margin, even if it results in a higher number of misclassifications or violations of the margin constraint.
- The optimization problem is more tolerant of misclassifications, and the model may have a wider margin.
- Tends to produce a simpler model with a larger margin but potentially more misclassifications.
Large C :
- A large value of C imposes a stricter requirement on minimizing misclassifications, even if it leads to a smaller margin.
- The optimization problem is less tolerant of misclassifications, and the model may have a narrower margin.
- Tends to produce a more complex model that may have a smaller margin but fewer misclassifications.
So, how do we choose C? (See the grid-search sketch after this list.)
- The optimal value of C is often determined through techniques like grid search or cross-validation.
- Grid search involves trying out a range of values for C and selecting the one that results in the best performance on a validation set.
- The choice of C depends on the specific characteristics of the dataset and the goals of the machine learning task.
- It’s common to start with a broad range of values and narrow them down based on experimentation.
- In the presence of imbalanced classes, the choice of C may need to be adjusted to account for the class imbalance.
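As an illustrative sketch (the dataset and the grid of values are arbitrary choices, not recommendations), C can be tuned with cross-validated grid search in scikit-learn:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the features, then search over a broad range of C values
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
param_grid = {"svc__C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best C:", search.best_params_["svc__C"])
print("Test accuracy:", search.score(X_test, y_test))
```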
Here is the derivation of the soft margin classifier:
When To Use:
- When the dataset contains noise or outliers that may affect the ability to find a perfectly separable hyperplane, a soft margin classifier is more suitable. It allows for a certain degree of misclassification to handle such variations.
- In cases where the classes are not perfectly separable, a soft margin allows for a compromise by allowing some data points to fall within the margin.
- In real-world scenarios, achieving perfect separation may be impractical or unnecessary. A soft margin classifier allows for a trade-off between a larger margin and some degree of misclassification.
- When it’s important to balance the margin size with the level of acceptable misclassification, a soft margin SVM provides a controlled way to adjust this trade-off through the regularization parameter C.
- In cases where there is a significant imbalance between the number of samples in different classes, a soft margin classifier can be more robust and avoid overfitting to the majority class.
- A soft margin classifier can be less prone to overfitting, especially in situations where the training data may have inherent variability.
Non-Linear Support Vector Machines
Scenario 3: When The Data Is Not Linear
Now what about the third case? Non-linear support vector machines are used when the relationship between the input features and the target variable is not linear and cannot be adequately modeled by a linear decision boundary. In such cases, we transform the data into a higher dimensional feature space where it becomes linearly separable.
To do this, we first choose a transformation Φ that maps each data point into the new feature space. Once we have it, we can use it in our dual equation: check the dual in the notes and you will notice the term xiᵀxj; we map the points to Φ(xi) and Φ(xj), take the dot product Φ(xi)ᵀΦ(xj), and substitute it into the dual, which adapts the equation to the non-linear case.
The problem is that explicitly transforming the data into a higher-dimensional space and then performing the calculations there can quickly become computationally inefficient. So, we use the kernel trick, which computes those dot products directly and lets us skip the explicit transformation step altogether. Here are the concepts of kernels and the kernel trick that you should know.
Kernels And Kernel Trick:
Kernel trick is used in various ML algorithms so make sure to read the notes carefully and ask your doubts if any.
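To make the kernel trick concrete, here is a tiny numerical check of my own (not from the notes): the homogeneous polynomial kernel of degree 2 returns exactly the same value as explicitly mapping the two points and taking their dot product.

```python
import numpy as np

# For the kernel k(x, z) = (x . z)^2 on 2-D inputs, the implicit feature map is
# phi(x) = [x1^2, sqrt(2) * x1 * x2, x2^2].

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(z)))  # explicit transformation, then dot product: 121.0
print(poly_kernel(x, z))       # kernel trick, no transformation needed:    121.0
```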
When To Use:
- When the decision boundary in the data is nonlinear or exhibits intricate patterns, a linear SVM may not capture the complexities effectively. Nonlinear SVMs, often implemented using kernel methods, can model more intricate decision boundaries.
- If transforming the input features into a higher-dimensional space might reveal a more linearly separable structure, nonlinear SVMs with appropriate kernels can implicitly perform this transformation.
- In real-world datasets, relationships between features and the target variable are often nonlinear. Nonlinear SVMs provide a more flexible approach to capturing these nonlinearities.
- In computer vision tasks such as image recognition, the relationships between pixel values can be highly nonlinear. Nonlinear SVMs, especially with RBF kernels, are commonly used in image classification problems.
- In bioinformatics and genetics, datasets often exhibit complex relationships that are not linear. Nonlinear SVMs can be useful for tasks such as gene expression analysis or protein structure prediction.
- In speech recognition, the relationships between acoustic features and phonemes may be nonlinear. Nonlinear SVMs can be applied to model such intricate relationships.
Please note: For complex tasks such as image classification and speech recognition, it’s better to use neural networks instead of SVM. Additionally, Support Vector Machines are inherently binary classifiers, meaning they are designed for binary classification problems. However, they can be extended to multiclass classification as well but may become very inefficient on complex datasets.
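As a minimal non-linear SVM sketch in scikit-learn, an RBF kernel handles the classic two-moons data, which no linear boundary can separate well:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-circles: not linearly separable
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("Linear kernel accuracy:", linear_clf.score(X_test, y_test))
print("RBF kernel accuracy:   ", rbf_clf.score(X_test, y_test))
```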
A Different Approach To Optimization:
Until now we have seen the QP approach to finding the solution, which was constrained over w and ξi. We can also reframe the problem by replacing ξi with ξi = max(0, 1 − yi(wᵀxi + b)); this follows from solving for ξi in the QP constraints. Our problem then becomes an unconstrained optimization over w. This is still a convex optimization problem, and you can use the (sub)gradient descent algorithm here. The term max(0, 1 − yi(wᵀxi + b)) is nothing but the hinge loss, and its square is called the squared hinge loss. This formulation directly uses the hinge loss to penalize misclassifications, which is why we say SVM uses the hinge loss. Just note that there are many ways to solve the problem by framing the objective function according to your mathematical convenience.
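Here is a bare-bones sketch of this unconstrained formulation, assuming a simple full-batch subgradient step with a fixed learning rate (not a tuned production solver):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.001, epochs=200):
    """Minimize 0.5 * ||w||^2 + C * sum(hinge losses) by subgradient descent.
    Labels y must be in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        violated = margins < 1  # hinge loss is nonzero only for these points
        grad_w = w - C * (y[violated, None] * X[violated]).sum(axis=0)
        grad_b = -C * y[violated].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage on two separable blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
w, b = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```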
Support Vector Machine For Regression (SVR)
Support vector machines can also be used for regression just by changing the problem statement. In this case, we find the best-fitting hyperplane using the ε-insensitive loss function, which ignores errors smaller than ε. Here are its concepts and derivation:
If you want to learn more about SVM for regression then I recommend reading this paper here.
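For a quick feel of SVR in scikit-learn, here is a minimal sketch fitting a noisy sine curve with an RBF kernel; the values of C and ε are arbitrary illustration choices:

```python
import numpy as np
from sklearn.svm import SVR

# A noisy sine curve as a simple non-linear regression target
rng = np.random.default_rng(0)
X = np.sort(5 * rng.random((100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(100)

# epsilon controls the width of the tube inside which errors are ignored
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
svr.fit(X, y)

print("R^2 on training data:", svr.score(X, y))
print("Number of support vectors:", len(svr.support_))
```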
Support Vector Machine For Outlier Detection
SVM can also be used for outlier detection, although you should note that it is primarily a supervised binary classifier. Several approaches have been proposed, but the most popular one is the one-class SVM. The idea is to model one class (the normal class) and detect instances that do not conform to it. We are not going to cover this in depth, but here is what Mennatallah Amer, Markus Goldstein, and Slim Abdennadher say about it in their paper Enhancing One-class Support Vector Machines for Unsupervised Anomaly Detection:
In contrast to traditional SVMs, one-class SVMs attempt to learn a decision boundary that achieves the maximum separation between the points and the origin. Interestingly this was the initial idea from which traditional supervised SVMs emerged. Its origin date back to the earliest work of Vapnik et al. in 1963. The idea was hindered by the inability to learn non-linear decision boundaries as well as the inability to account for outliers. Both of these problems were solved by the introduction of kernels and the incorporation of soft margins. A one-class SVM uses an implicit transformation function φ(·) defined by the kernel to project the data into a higher dimensional space. The algorithm then learns the decision boundary (a hyperplane) that separates the majority of the data from the origin. Only a small fraction of data points are allowed to lie on the other side of the decision boundary: Those data points are considered as outliers.
– Mennatallah Amer, Markus Goldstein, and Slim Abdennadher, Enhancing One-class Support Vector Machines for Unsupervised Anomaly Detection
If you want to learn more about it, please find their paper and read it, or check the resources below for its derivation. The process is almost the same: the derivation involves introducing Lagrange multipliers and solving the dual optimization problem, leading to the final set of equations.
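A minimal one-class SVM sketch in scikit-learn, assuming synthetic data rather than the paper's setup:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# "Normal" data: a tight Gaussian cluster, plus a few far-away points
rng = np.random.default_rng(0)
normal = rng.normal(0, 1, (200, 2))
anomalies = rng.uniform(-6, 6, (10, 2))
X = np.vstack([normal, anomalies])

# nu upper-bounds the fraction of training points treated as outliers
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
labels = oc_svm.fit_predict(X)  # +1 = inlier, -1 = outlier

print("Points flagged as outliers:", int(np.sum(labels == -1)))
```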
SVM Tips – When To Use?
After learning all these complex mathematical concepts, a natural question arises: when should we use SVM, and on what kind of datasets? Here are the basic guidelines you can follow:
When to Use SVM:
- Binary Classification with Clear Margin:
- SVM performs exceptionally well in binary classification tasks where there’s a clear margin between the classes. It excels when there’s a distinct separation between different groups.
- High-Dimensional Spaces:
- SVM is effective in high-dimensional spaces, making it suitable for tasks like text classification or image recognition where each feature may represent a specific aspect or pixel.
- Small to Medium-sized Datasets:
- SVM can work well with small to medium-sized datasets. For larger datasets, training time may become a consideration, and other algorithms like Random Forest or Neural Networks might be more suitable.
- Complex Decision Boundaries:
- When the decision boundary is complex and non-linear, SVM with kernel functions (e.g., Radial Basis Function kernel) can capture intricate patterns in the data.
- Handling Outliers:
- SVM is robust against outliers. If your dataset contains noisy data points or outliers, SVM can provide a stable solution.
- When Interpretability is Essential:
- SVM models, particularly linear ones, can be more interpretable compared to complex models like Neural Networks. If interpretability is a priority, SVM may be a good choice.
And On What Kind Of Datasets?
You need to pre-process the datasets before you apply SVM. Here are some prerequisites for the datasets (an illustrative preprocessing sketch follows this list):
- Feature Scaling:
- SVM is sensitive to the scale of features. Ensure that features are scaled properly, so one feature doesn’t dominate others.
- Handling Missing Data:
- SVMs don’t handle missing data well. Impute or preprocess the dataset to address any missing values before applying SVM.
- Handling Categorical Data:
- SVM typically works with numerical data. If your dataset contains categorical features, they may need to be encoded or transformed into a format suitable for SVM.
- Sufficient Data Separation:
- SVM relies on the existence of a clear margin between classes. Ensure that classes in your dataset are well-separated for SVM to perform effectively.
- Parameter Tuning:
- SVM has parameters like the choice of the kernel, regularization parameter (C), and kernel-specific parameters. Proper parameter tuning is essential for optimal performance.
- Data Balance:
- If your dataset is highly imbalanced, where one class significantly outnumbers the other, SVM might need adjustments to handle this imbalance properly.
- Computational Resources:
- SVM can be computationally intensive, especially with large datasets. Ensure you have sufficient computational resources for training, or consider using approximations or optimizations.
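Tying several of these points together, here is an illustrative preprocessing sketch. The column names (`age`, `income`, `city`) are hypothetical placeholders, not from any dataset used in this post:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

numeric_cols = ["age", "income"]   # hypothetical numeric columns
categorical_cols = ["city"]        # hypothetical categorical column

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # handle missing values
    ("scale", StandardScaler()),                    # SVM needs scaled features
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("svm", SVC(kernel="rbf", C=1.0, class_weight="balanced")),  # helps with imbalance
])
# model.fit(X_train, y_train) would then be called on a real dataframe
```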
SVM Vs Neural Networks
We have not covered the neural networks but you can revisit the post in future to understand the difference between SVM and neural networks:
Aspects | SVM | Neural Networks |
---|---|---|
Use Cases | Effective for smaller datasets with clear class separation. Commonly used in image classification, text categorization, and bioinformatics. | Excel in tasks with large datasets, complex relationships, and unstructured data. Widely used in image recognition, natural language processing, speech recognition, and more. |
Nature of the Algorithm | Primarily designed for classification and regression tasks. Focuses on finding the optimal hyperplane to separate classes in feature space. | Versatile and can be used for classification, regression, and complex pattern recognition tasks. Consists of interconnected layers of nodes, allowing for hierarchical feature learning. |
Handling Non-Linearity | Initially designed for linear separation but can handle non-linear relationships through kernel tricks. Effective for moderate non-linearity. | Inherently capable of modelling complex non-linear relationships. Excel in tasks with high non-linearity and intricate patterns. |
Data Size | Suitable for small to medium-sized datasets. Can become computationally expensive with large datasets. | Scale well with large datasets. Require substantial data for training deep models effectively. |
Interpretability | Generally more interpretable, especially with linear kernels. Decision boundaries are easier to understand. | Complex architectures may lack interpretability. Understanding how individual features contribute can be challenging. |
Robustness to Outliers | Robust to outliers due to the nature of the hinge loss. Outliers have a limited impact on the decision boundary. | Sensitive to outliers, especially in deep networks. Outliers may affect the learning process. |
Training Speed | Faster training for smaller datasets. Can become slower for larger datasets or complex kernels. | Slower training, especially for deep architectures. Benefit from parallelization and GPU acceleration. |
Hyperparameter Sensitivity | Fewer hyperparameters to tune, but kernel choice and parameter tuning are critical. | Many hyperparameters, including the number of layers, nodes, learning rate, etc. Sensitive to architecture and hyperparameter choices. |
Limitations
Though a powerful tool it also comes with some limitations that you should know:
- SVMs can be sensitive to noisy data and outliers, potentially affecting the positioning of the decision boundary.
- Training SVMs on large datasets can be computationally expensive and time-consuming, especially when using complex kernels.
- The performance of SVMs can heavily depend on the choice of the kernel function. Selecting the appropriate kernel requires domain knowledge and experimentation.
- SVMs were originally designed for binary classification, and extending them to multiclass classification involves using techniques like one-vs-one or one-vs-all, which can be less efficient.
- SVMs have parameters like the choice of kernel and regularization parameter (C). Proper tuning of these parameters is crucial for optimal performance but can be challenging and requires expertise.
- SVMs are sensitive to the scale of features, and data preprocessing, including feature scaling, is often necessary for optimal performance.
- SVMs may become impractical for very large datasets due to their computational demands.
- SVMs may not perform well on highly imbalanced datasets, where one class significantly outnumbers the other.
- SVMs generate binary outcomes, making them less suitable for problems requiring probabilistic predictions.
SVM In Practice – Examples And Implementation
Topics covered (a compact stand-in sketch for the Iris example follows this list):
1. Linear SVM For Binary Classification
2. Non-Linear SVM
3. Face Recognition (Multiclass Classification) Example – SVM
4. IRIS Datasets SVM Classification End-to-end Example
5. Support Vector Regression
6. SVM for outlier Detection
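The full notebooks cover all six topics; as a compact stand-in for topic 4, here is an end-to-end Iris sketch using standard scikit-learn components:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scale the features, then fit an RBF-kernel SVC (multiclass handled internally)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
```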
Important Questions
Should you use the primal or the dual form of the SVM problem to train a model on a training set with millions of instances and hundreds of features?
This question applies only to linear SVMs, since kernelized SVMs can only use the dual form. The computational complexity of the primal form of the SVM problem is proportional to the number of training instances m, while the computational complexity of the dual form is proportional to something between m² and m³. So if there are millions of instances, you should definitely use the primal form, because the dual form would be much too slow.
Can an SVM classifier output a confidence score when it classifies an instance? What about probability?
An SVM classifier can output the distance between the test instance and the decision boundary, and you can use this as a confidence score. However, this score cannot be directly converted into an estimation of the class probability. If you set probability=True when creating an SVM in Scikit-Learn, then after training it will calibrate the probabilities using Logistic Regression on the SVM’s scores (trained by an additional five-fold cross-validation on the training data). This will add the predict_proba() and predict_log_proba() methods to the SVM.
Say you’ve trained an SVM classifier with an RBF kernel, but it seems to underfit the training set. Should you increase or decrease γ (gamma)? What about C?
If an SVM classifier trained with an RBF kernel underfits the training set, there might be too much regularization. You need to increase gamma or C (or both) to decrease it.
– The questions above are from the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow.
What is Platt Scaling In SVM?
- In binary classification, a classifier often provides a decision function or decision score rather than a probability. This decision function outputs a score for each instance, and the sign of the score determines the predicted class (positive or negative).
- Platt scaling applies a logistic (sigmoid) transformation to the decision scores to obtain calibrated probabilities.
- The logistic function is fitted to the decision scores, and the parameters are adjusted to maximize the likelihood of the true class labels.
- Once the parameters are estimated, the logistic function can be used to convert decision scores into calibrated probabilities. The calibrated probability represents the model’s confidence or belief that an instance belongs to the positive class.
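In scikit-learn, Platt-style sigmoid calibration can be obtained either by setting `probability=True` on `SVC`, or explicitly via `CalibratedClassifierCV`; a brief sketch of the latter:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LinearSVC has no predict_proba; sigmoid calibration (Platt scaling) adds it
base = LinearSVC(C=1.0)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

print("Calibrated probabilities for three test points:")
print(calibrated.predict_proba(X_test[:3]))
```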
Discuss the trade-off between allowing some misclassifications and achieving a wider margin.
This trade-off is governed by the regularization parameter C discussed above: a small C favours a wider margin at the cost of more margin violations, while a large C penalizes misclassifications heavily and yields a narrower margin that may overfit noisy data. The right balance depends on how noisy the dataset is and is usually found via cross-validation.
Discuss considerations when selecting linear, polynomial, or radial basis function (RBF) kernels.
Considerations when Selecting Kernels:
a. Linear Kernel:
- Use Case: Suitable for linearly separable data or when the decision boundary is expected to be close to linear.
- Advantages: Computationally efficient and interpretable.
- Considerations: This may not perform well on non-linearly separable data.
b. Polynomial Kernel:
- Use Case: Useful when the decision boundary is expected to be polynomial.
- Advantages: Can capture non-linear patterns.
- Considerations: Sensitive to the choice of hyperparameters (degree, coef0). Higher degrees may lead to overfitting.
c. Radial Basis Function (RBF) Kernel:
- Use Case: Effective for capturing complex, non-linear decision boundaries.
- Advantages: Versatile and powerful for a wide range of data distributions.
- Considerations: Sensitive to the choice of the hyperparameter γ. Higher γ values may lead to overfitting.
d. Sigmoid Kernel:
- Use Case: Suitable for data that exhibits a sigmoidal shape.
- Advantages: Can be used for non-linear problems.
- Considerations: Sensitive to the choice of hyperparameters (gamma, coef0).
e. Custom Kernels:
- Use Case: When the data exhibits a specific non-linear pattern not captured by standard kernels.
- Advantages: Flexibility to define a kernel tailored to the problem.
- Considerations: May require a deep understanding of the problem and expertise in kernel engineering.
Considerations:
- Data Exploration:
- Understand the underlying distribution and structure of the data.
- Computational Efficiency:
- Linear kernels are often computationally more efficient compared to non-linear kernels.
- Model Interpretability:
- Linear kernels provide a clear interpretation of the decision boundary.
- Hyperparameter Tuning:
- Hyperparameters like C, γ, degree, and coef0 need to be tuned for optimal performance.
- Cross-Validation:
- Use cross-validation to assess the performance of different kernels on the specific dataset.
- Overfitting:
- Non-linear kernels may lead to overfitting, especially with a small amount of data.
Discuss how the decision function output is used to make predictions.
The decision function in SVM provides a signed distance of a data point from the decision boundary (hyperplane). For a binary classification problem, the sign of the decision function output determines the predicted class (a short usage sketch follows this list):
- Decision Function Output f(x):
- f(x) > 0: The predicted class is positive (class +1).
- f(x) < 0: The predicted class is negative (class -1).
- Thresholding for Binary Classification:
- A threshold (often 0) is applied to the decision function output to determine the predicted class.
- If f(x) ≥ 0, predict class +1; otherwise, predict class -1.
- Probabilistic Interpretation (SVC with probability=True):
- In addition to the decision function, SVM can provide class probabilities using the predict_proba method.
- The decision function output is transformed into probabilities using a logistic function.
- Multi-Class Decision Function:
- For multi-class problems, decision functions are computed for each pair of classes in one-vs-one or one-vs-the-rest strategies.
- Voting or aggregation of decision functions determines the final predicted class.
- Support Vectors Influence Decision Function:
- Support vectors are crucial in determining the decision function and, consequently, the predicted class.
- Decision Boundary:
- The decision boundary is where the decision function equals the threshold. It separates instances of different classes.
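A short sketch of reading these quantities off a fitted scikit-learn SVC:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = SVC(kernel="rbf", probability=True).fit(X, y)

scores = clf.decision_function(X[:5])  # signed distance to the decision boundary
preds = clf.predict(X[:5])             # the sign of the score picks the class
probs = clf.predict_proba(X[:5])       # Platt-calibrated probabilities

print("decision scores:", scores)
print("predictions:    ", preds)
print("probabilities:\n", probs)
```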
Explain one-vs-one and one-vs-the-rest strategies.
- OvO:
- Constructs many binary classifiers, one for each pair of classes.
- Suitable for moderate numbers of classes.
- OvR:
- Constructs one binary classifier for each class versus the rest.
- More computationally efficient for a large number of classes.
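In scikit-learn, `SVC` uses OvO internally for multi-class problems while `LinearSVC` uses OvR; both strategies can also be applied explicitly, as in this small sketch:

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # 10 classes

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# With 10 classes: OvO trains 10*9/2 = 45 classifiers, OvR trains 10
print("OvO estimators:", len(ovo.estimators_))   # 45
print("OvR estimators:", len(ovr.estimators_))   # 10
```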
In practice, try to use SVM according to the need of the problem and requirements. If you have any questions, please ask in the forums.
- Tutorial on Support Vector Machine, Vikramaditya Jakkula, Washington State University ↩︎
- Introduction to Support Vector Machines, Dustin Boswell, August 6, 2002 ↩︎
- C19 Machine Learning lectures Hilary 2015, Andrew Zisserman ↩︎
- Same as 2 ↩︎
Further Reading & Resources:
- SVM [ https://www.csie.ntu.edu.tw/~cjlin/talks/itri.pdf ]
- Tutorial on Support Vector Machine (SVM) by Vikramaditya Jakkula
- Support Vector Machines Explained, Tristan Fletcher
- Enhancing One-class Support Vector Machines for Unsupervised Anomaly Detection
- CS229 Lecture notes, Andrew Ng, Support Vector Machines