In this tutorial, we will learn our next machine-learning model: Naive Bayes. Naive Bayes is widely recognized as a simple yet effective tool. Rooted in Bayesian probability theory, it has proven invaluable in diverse applications such as text classification, spam filtering, and medical diagnosis. Its strength lies in its capacity to make accurate predictions with minimal computational complexity, making it a foundational concept in both introductory machine-learning courses and real-world applications.
Table of Contents
Prerequisites
- Linear Algebra
- Calculus
- Probability & Statistics
What You Will Learn:
- Concepts And Derivation
- Application
If you are familiar with Bayes' theorem, you can easily grasp the idea of Naive Bayes. It calculates the probability of a class given the observed features, but it makes a very strong assumption about the data: that the features are independent of each other given the class, which is why it is called “naive”.
More formally, Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. The different Naive Bayes classifiers differ mainly in the assumptions they make regarding the probability distribution of the features. The method is easy to understand and fast. Before we go further, you should read this definition.
Naïve Bayes classifiers are a family of machine learning classification methods that use Bayes’ theorem to probabilistically categorize data. They are called naïve because they assume independence between the features. The main idea is to use Bayes’ theorem to determine the probability that a certain data point belongs in a certain class, given the features of that data. Despite what the name may suggest, the naïve Bayes classifier is not a Bayesian method.
This is because naïve Bayes is based on likelihood rather than Bayesian inference. While naïve Bayes classifiers are most easily seen as applicable in cases where the features have, ostensibly, well-defined probability distributions (such as classifying sex given physical characteristics), they are applicable in many other cases. While it is generally a bad idea to assume independence, naïve Bayes classifiers are still very effective, even when we can be confident there is nonzero covariance between features.
Source: https://acme.byu.edu/
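To make the “naive” assumption concrete, this is the standard formulation, with x_1, …, x_n denoting the features and y the class label:

```latex
% Bayes' theorem applied to classification
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}

% The "naive" conditional-independence assumption factorizes the likelihood
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)

% The denominator is the same for every class, so the prediction is
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```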
Because of the naive assumption of treating features independently, these methods are extremely fast to train. Essentially, the model treats each feature's distribution independently, simplifying the estimation process by considering each feature in isolation. This decoupling is particularly beneficial when dealing with what’s known as the “curse of dimensionality.” In high-dimensional spaces, where the number of features is large, traditional methods may struggle due to the increased complexity and data sparsity.
Naive Bayes, by independently estimating a one-dimensional distribution for each feature, navigates these challenges more gracefully. In simpler terms, Naive Bayes takes a smart shortcut. Instead of modelling the entire set of features jointly, which would require a number of parameters that grows rapidly with the number of features, it breaks the problem down into smaller, more manageable pieces. This not only speeds up the learning and classification processes but also helps overcome potential issues associated with having lots of features. It’s a bit like tackling a big task by breaking it down into smaller, easier-to-handle parts.
However, you should note that Naive Bayes may not be suitable when these assumptions do not hold, which is often the case in real-world data. You can use Naive Bayes as a baseline classifier to compare against more complex models, but this does not mean it can’t be used in practice. It has been shown to work quite well in many real-world situations, most famously document classification and spam filtering.
Naive Bayes belongs to the category of generative classifiers because it learns the joint probability distribution P(x, y) of inputs x and label y, then uses Bayes’ rule to calculate the conditional probability P(y ∣ x) and predicts the most likely label. Generative classifiers model the features of a class and capture how a particular class would generate the input data.
On the other hand, discriminative classifiers learn which features in the input are most useful for distinguishing between different classes. Discriminative classifiers are often more accurate in many classification tasks compared to generative classifiers. Discriminative classifiers directly solve the classification task, while generative classifiers try to solve a general problem as an intermediate step.
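In symbols, the distinction comes down to which distribution each family models (x is the feature vector, y the label; logistic regression is one example of a discriminative model):

```latex
% Generative (e.g. Naive Bayes): model the joint, then condition via Bayes' rule
P(x, y) = P(y)\,P(x \mid y), \qquad
P(y \mid x) = \frac{P(y)\,P(x \mid y)}{\sum_{y'} P(y')\,P(x \mid y')}

% Discriminative (e.g. logistic regression): model the conditional directly,
% with no model of P(x)
P(y \mid x) = f_\theta(x)
```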
Likelihood & MLE
Before we go further and derive Naive Bayes, you should know the concepts of likelihood and maximum likelihood estimation (MLE). Likelihood is a measure of how well a certain set of parameters explains the observed data. It quantifies the compatibility between the observed outcomes and the parameters of a statistical model. These concepts are also discussed in this tutorial; please refer to it for a more in-depth understanding.
Imagine you have a bag of coloured balls, and you want to estimate the probability of drawing a red ball. The likelihood of a particular probability (parameter) tells you how well that probability explains the observed outcomes (drawing red balls). If you believe there are mostly red balls in the bag, a probability that aligns with that belief will have a high likelihood.
MLE is a method for finding the parameter values that make the observed data most probable. It seeks to maximize the likelihood function, identifying the parameter values that best explain the given data. In the bag-of-coloured-balls example, MLE helps you determine the probability of drawing a red ball that maximizes the likelihood of your observed outcomes. It’s like finding the most probable explanation for your data: the set of parameters that makes your observations most plausible.
There is a difference between likelihood and probability: likelihood is about fitting parameters to data, while probability is about predicting data given parameters. Read these notes to understand the mathematics behind them.
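To make the ball-drawing example concrete, here is a small sketch (the draws are made up for illustration) showing that maximizing the Bernoulli likelihood recovers the observed fraction of red balls, which is also the closed-form MLE:

```python
import numpy as np

# Hypothetical draws from the bag: 1 = red ball, 0 = any other colour
draws = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Log-likelihood of a candidate probability p under a Bernoulli model
def log_likelihood(p, data):
    return np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

# Evaluate the likelihood on a grid of candidate probabilities
candidates = np.linspace(0.01, 0.99, 99)
scores = [log_likelihood(p, draws) for p in candidates]
p_mle = candidates[np.argmax(scores)]

print(f"Grid-search MLE: {p_mle:.2f}")
print(f"Closed-form MLE (sample mean): {draws.mean():.2f}")  # both come out to 0.70
```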
Naive Bayes Derivation
The derivation of Naive Bayes involves modelling the probability distribution of the features given the class label. Depending on the type of features, different distributions can be assumed. Here, I’ll describe the theory behind the derivation of Naive Bayes for different distributions. In the notes, I have provided the derivation of Naive Bayes; you can substitute the distribution of your choice.
1. Gaussian Naive Bayes:
Gaussian Naive Bayes is a variant of the Naive Bayes classifier that assumes the features within each class follow a Gaussian (normal) distribution. This assumption allows the model to work well with continuous data. To derive this model, we start by calculating the mean and standard deviation of each feature for each class in the dataset. When a new instance needs to be classified, the model uses the Gaussian probability density function to compute the likelihood of each feature belonging to a particular class. By applying Bayes’ theorem, the model then calculates the posterior probability for each class and selects the class with the highest probability as the predicted class.
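As a quick numeric sketch of these steps for a single continuous feature (the numbers and class names below are invented for illustration):

```python
import numpy as np
from scipy.stats import norm

# Toy training data: one continuous feature, two classes
heights_class_a = np.array([170.0, 175.0, 168.0, 172.0])
heights_class_b = np.array([158.0, 162.0, 160.0, 165.0])

# Step 1: estimate the Gaussian parameters (mean, standard deviation) per class
mu_a, sigma_a = heights_class_a.mean(), heights_class_a.std()
mu_b, sigma_b = heights_class_b.mean(), heights_class_b.std()

# Step 2: likelihood of a new observation under each class's Gaussian
x_new = 166.0
like_a = norm.pdf(x_new, mu_a, sigma_a)
like_b = norm.pdf(x_new, mu_b, sigma_b)

# Step 3: combine with (equal) priors and normalize to get the posterior
prior_a = prior_b = 0.5
post_a = like_a * prior_a / (like_a * prior_a + like_b * prior_b)
print(f"P(class A | x = 166) = {post_a:.3f}")
```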
2. Multinomial Naive Bayes:
The Multinomial Naive Bayes model is particularly well-suited for discrete data, making it a popular choice for text classification tasks, where features often represent word counts. This model assumes that the features follow a multinomial distribution within each class. To derive predictions, the model calculates the probability of observing a specific term count given the class. Since it’s common for some words not to appear in certain classes, Multinomial Naive Bayes employs Laplace smoothing to avoid zero probabilities, ensuring that every word has a non-zero probability of occurring in each class. This approach is critical in applications like spam detection or sentiment analysis. We will discuss this topic further in the NLP tutorial.
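A minimal sketch of this kind of text classification using scikit-learn, with a tiny made-up corpus (MultinomialNB applies Laplace smoothing through its alpha parameter):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus for illustration
texts = [
    "win a free prize now",
    "limited offer win money",
    "meeting agenda for monday",
    "project update and meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

# Turn documents into word-count feature vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# alpha=1.0 is Laplace (add-one) smoothing
model = MultinomialNB(alpha=1.0)
model.fit(X, labels)

print(model.predict(vectorizer.transform(["free money offer"])))  # likely ['spam']
```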
3. Bernoulli Naive Bayes:
Bernoulli Naive Bayes is designed for binary feature data, where each feature represents the presence or absence of a term. This makes it another strong candidate for text classification, particularly when we’re interested in whether a word is present rather than how many times it appears. The model assumes a Bernoulli distribution for each feature within each class. For each class, it calculates the probability that a given feature is present or absent. Like the Multinomial variant, Bernoulli Naive Bayes also uses Laplace smoothing to handle cases where certain terms are absent from training examples in a class, ensuring robust performance.
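A similar sketch with binary presence/absence features, again on invented data, using scikit-learn's BernoulliNB:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Presence/absence features for three words: ["free", "meeting", "offer"]
X = np.array([
    [1, 0, 1],   # spam: contains "free" and "offer"
    [1, 0, 0],   # spam: contains "free"
    [0, 1, 0],   # ham: contains "meeting"
    [0, 1, 0],   # ham: contains "meeting"
])
y = ["spam", "spam", "ham", "ham"]

# alpha=1.0 applies Laplace smoothing to the per-feature Bernoulli estimates
model = BernoulliNB(alpha=1.0)
model.fit(X, y)

print(model.predict([[1, 0, 0]]))  # a message containing only "free" -- likely ['spam']
```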
General Steps in Derivation:
- Assumption of Feature Independence: Naive Bayes assumes that the features are conditionally independent given the class label. This simplifying assumption allows the model to calculate the joint probability of the features as the product of individual probabilities, significantly simplifying the computation.
- Parameter Estimation: For each feature and class, the model estimates the parameters of the assumed distribution. For Gaussian Naive Bayes, this involves calculating the mean and standard deviation of each feature within each class. For Multinomial and Bernoulli Naive Bayes, the model calculates the probability of each feature occurring within each class.
- Likelihood Calculation: When a new instance is introduced, the model calculates the likelihood of each feature belonging to each class based on the parameters estimated in the previous step. This involves using the Gaussian probability density function for Gaussian Naive Bayes, and the multinomial or Bernoulli probabilities for the respective models.
- Bayesian Framework: The model then applies Bayes’ theorem to compute the posterior probability of each class given the observed features. This step combines the prior probability of each class with the likelihood calculated for the features.
- Decision Rule: Finally, the model classifies the instance by selecting the class with the highest posterior probability. This decision rule ensures that the instance is assigned to the most likely class based on the observed data.
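Putting these general steps together, here is a minimal from-scratch sketch of Gaussian Naive Bayes (the class and variable names are my own; log-probabilities are used for numerical stability, and a small variance floor avoids division by zero):

```python
import numpy as np

class SimpleGaussianNB:
    """Minimal Gaussian Naive Bayes following the steps above."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_, self.means_, self.vars_ = {}, {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            # Parameter estimation: prior, plus mean and variance per feature, per class
            self.priors_[c] = len(Xc) / len(X)
            self.means_[c] = Xc.mean(axis=0)
            self.vars_[c] = Xc.var(axis=0) + 1e-9  # variance floor for stability
        return self

    def _log_posterior(self, x, c):
        # Likelihood calculation + Bayesian framework:
        # log prior plus the sum of per-feature Gaussian log-likelihoods
        mean, var = self.means_[c], self.vars_[c]
        log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        return np.log(self.priors_[c]) + log_likelihood

    def predict(self, X):
        # Decision rule: pick the class with the highest posterior
        return np.array([
            max(self.classes_, key=lambda c: self._log_posterior(x, c)) for x in X
        ])

# Tiny illustrative example: two features (e.g. height and weight), two classes
X = np.array([[170, 65], [175, 70], [160, 50], [158, 48]], dtype=float)
y = np.array(["A", "A", "B", "B"])
print(SimpleGaussianNB().fit(X, y).predict(np.array([[168.0, 60.0]])))  # likely ['A']
```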
Laplace Smoothing is often used to handle situations where certain feature values have zero probabilities. Laplace smoothing involves adding a small constant (usually 1) to the observed counts of all possible outcomes before calculating probabilities. This “pseudocount” ensures that no probability becomes zero. Let’s say you are estimating the probability of each word in a vocabulary occurring in a text document. If a particular word has not been observed in the training data, its probability would be zero. Laplace smoothing helps avoid this by adding a pseudo count, ensuring that even unseen words get a non-zero probability.
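A quick numeric sketch of add-one smoothing on some invented word counts:

```python
# Invented word counts for one class: "prize" was never observed in training
counts = {"free": 30, "offer": 15, "prize": 0}
vocab_size = len(counts)
total = sum(counts.values())

# Without smoothing, "prize" gets probability 0, which zeroes out the whole product
unsmoothed = {w: c / total for w, c in counts.items()}

# Laplace (add-one) smoothing: add 1 to every count and the vocabulary size to the denominator
smoothed = {w: (c + 1) / (total + vocab_size) for w, c in counts.items()}

print(unsmoothed)  # {'free': 0.667, 'offer': 0.333, 'prize': 0.0}
print(smoothed)    # {'free': ~0.646, 'offer': ~0.333, 'prize': ~0.021}
```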
All the theoretical concepts will become clear in the notebook.
When To Use Naive Bayes?
Naive Bayes is a valuable algorithm for certain scenarios, especially when simplicity, efficiency, and interpretability are priorities. However, its suitability depends on the nature of the data and the specific requirements of the problem at hand. It’s often a good idea to experiment with multiple algorithms to find the one that best fits the characteristics of the dataset. Here are some things to keep in mind.
Advantages of Naive Bayes:
- Simplicity: Naive Bayes is easy to understand and implement. It’s a straightforward algorithm, making it suitable for quick model building.
- Efficiency: The algorithm is computationally efficient and works well with large datasets. Training and prediction times are typically faster compared to more complex models.
- Low Resource Requirements: Naive Bayes doesn’t require a lot of memory or computational resources. It’s suitable for applications with resource constraints.
- Effective with High-Dimensional Data: Naive Bayes performs well when dealing with datasets with a large number of features or dimensions.
- Good for Text Classification: Particularly effective for text classification tasks such as spam filtering and sentiment analysis due to its ability to handle high-dimensional, sparse data.
- Robust to Irrelevant Features: Naive Bayes can handle irrelevant attributes without significantly impacting performance.
- Works with Small Datasets: It can work well with small datasets, making it suitable for situations where the amount of training data is limited.
Disadvantages of Naive Bayes:
- Assumption of Feature Independence: Naive Bayes assumes that all features are conditionally independent given the class label. However, this assumption often doesn’t hold in real-world datasets where features can be correlated. When this independence assumption is violated, the model may produce inaccurate probability estimates, leading to suboptimal classification performance. This limitation can significantly affect the model’s accuracy, particularly in complex datasets where interactions between features are important.
- Sensitivity to Outliers: Naive Bayes can be sensitive to outliers, as extreme values can disproportionately influence the probability estimates. Since the model calculates the likelihood based on the entire dataset, outliers can skew the results, especially in Gaussian Naive Bayes, which assumes a normal distribution. This sensitivity may result in incorrect classifications, particularly if the outliers do not represent the underlying data distribution.
- Limited Expressiveness: Due to its simplicity and the assumption of feature independence, Naive Bayes may struggle to capture complex relationships within the data. This limitation restricts the model’s ability to account for interactions between features, which are often crucial for accurate predictions in more sophisticated tasks. Consequently, while Naive Bayes can be effective for simpler problems, it may underperform in scenarios where the data exhibits intricate patterns or dependencies among features.
Comparison to Other Models:
Logistic Regression
- Advantages of Naive Bayes: Simplicity and efficiency make Naive Bayes a strong candidate for basic classification tasks, especially when computational resources are limited.
- Disadvantage of Naive Bayes: When dealing with more complex relationships in the data, Logistic Regression may outperform Naive Bayes by capturing interactions between features that Naive Bayes might miss.
Decision Trees
- Advantages of Naive Bayes: The simplicity and efficiency of Naive Bayes allow for quick model training and prediction, particularly in scenarios with less complex data.
- Disadvantage of Naive Bayes: Decision Trees have the ability to capture more complex relationships and interactions within the data, which can lead to better performance in certain tasks.
Support Vector Machines (SVM)
- Advantages of Naive Bayes: Naive Bayes requires fewer computational resources, making it more practical in environments with limited processing power.
- Disadvantage of Naive Bayes: SVMs excel in handling non-linear relationships, often providing more accurate predictions in cases where data exhibits complex boundaries between classes.
Random Forests
- Advantages of Naive Bayes: The simplicity of Naive Bayes contributes to faster training times, which can be beneficial in time-sensitive applications.
- Disadvantage of Naive Bayes: Random Forests generally offer better predictive performance by combining multiple decision trees, making them more robust in various scenarios.
K-Nearest Neighbors (KNN)
- Advantages of Naive Bayes: Naive Bayes is faster to train compared to KNN, especially as the dataset size increases.
- Disadvantage of Naive Bayes: KNN can sometimes provide more accurate predictions, particularly when the importance of local patterns in the data is high.
When to Choose Naive Bayes:
Naive Bayes is a good choice for several types of problems. First, it’s really effective for text classification tasks. This includes things like sorting emails into spam folders, analyzing customer sentiment, or filtering out irrelevant content. It works well for these tasks because it handles a lot of features, such as individual words in a document, very efficiently.
When you have high-dimensional data, meaning your dataset has a large number of features or dimensions, Naive Bayes is also a strong option. Its approach simplifies the calculations needed for classification, making it practical for datasets with many variables. In situations where you’re dealing with resource constraints, such as limited computational power or memory, Naive Bayes is beneficial. It doesn’t require much in terms of resources, which makes it suitable for applications on devices with limited processing capabilities.
Finally, if you need a baseline model for a classification task, Naive Bayes is useful. It’s simple to implement and gives you a quick way to measure how well your more complex models perform. It helps you get a sense of how your classification task is doing before you dive into more sophisticated techniques.
Implementation In Python
Please don’t share the pdf without our permission. It’s only for enrolled students.
If you have any questions, feel free to ask in the forum.
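Until you get to the notebook, here is a minimal, self-contained scikit-learn quick start on a standard dataset (a sketch only, not the course notebook):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load a small, continuous-feature dataset and split it
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Gaussian Naive Bayes: fit class priors and per-feature Gaussians, then predict
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Test accuracy: {accuracy_score(y_test, y_pred):.3f}")
```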
Further Reading And Sources:
- The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm, Michael Collins
- https://www3.cs.stonybrook.edu/~cse634/19Bayes2.pdf
- https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
- https://web.stanford.edu/~jurafsky/slp3/4.pdf
- https://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf
- https://scikit-learn.org/stable/modules/naive_bayes.html