In our last tutorial, we discussed naive Bayes, a nifty tool for categorizing things. Now, let’s delve into logistic regression, another heavyweight in the world of classification.
So, why go with logistic regression instead of naive Bayes? Picture this: naive Bayes can get a bit confused when features are strongly correlated. It treats them as independent entities and tends to be overconfident in its predictions. Logistic regression steps in as the problem-solver: it handles these closely tied features more effectively by sharing the workload between them, providing more reliable results.
When dealing with loads of data or hefty documents, logistic regression is like a dependable friend you can count on. It’s the default choice because it simply gets the job done. While naive Bayes can still yield correct classification decisions and, in some cases, outperform logistic regression, especially in scenarios involving small datasets or brief documents, the latter’s adaptability and default performance make it an indispensable choice for machine learning practitioners.
In this tutorial, we’re diving into the basics of logistic regression. We’ll see why it’s great in certain situations, keeping it simple and clear.
Prerequisites
- Linear Algebra
- Calculus
- Probability & Statistics
- Python
- Naive Bayes And Linear Regression
Definition And Basic Concepts:
In the previous tutorial, we talked about two types of classifiers: generative and discriminative. Now imagine we have the task of classifying images of apples and oranges. In a generative approach, the model would strive to understand the unique characteristics of both apples and oranges. It might learn about the colour, texture, and shape of each fruit. If you asked this model to ‘generate’ an image, it would attempt to create representations of both an apple and an orange. When faced with a new image, the model would assess whether the features align more closely with its learned concept of an apple or an orange, ultimately choosing the label that makes the most sense. Naive Bayes follows the generative approach.
On the other hand, a discriminative model wouldn’t bother much with understanding the essence of apples or oranges. It might simply notice that apples are often red and oranges are typically orange. If this single feature effectively distinguishes between the two classes, the model is content. When asked what it knows about apples, it might only be able to say that they are red, because that’s the feature it found most useful for classification. Logistic regression follows a discriminative approach; it is a discriminative classifier.
A generative model comprehensively grasps the features of each class and aims to ‘generate’ instances of them, while a discriminative model concentrates on drawing clear boundaries between classes without delving deeply into what defines each class.
So, what is logistic regression? Is it a regression just like linear regression?
At its core, logistic regression is a powerful tool used in machine learning for classification tasks. But what’s classification? Imagine you have a bunch of things, like emails or images, and you want a computer to automatically sort them into different categories. That’s where classification comes in, and logistic regression is like a smart assistant helping the computer figure out which category each thing belongs to. We already know this from the previous tutorial.
Logistic regression is particularly handy when we’re dealing with things that have two possible outcomes, like spam or not spam, cats or dogs, apples or oranges. It helps us predict the probability of an item belonging to one of these two categories.
Now, let’s get a bit technical. Logistic regression starts by establishing a linear relationship between the features of our things (let’s call them ‘X’) and the log odds of belonging to a particular category. This log-odds thing might sound complex, but it’s essentially a way of expressing probabilities. To convert these log odds into probabilities between 0 and 1, logistic regression uses a special function called the sigmoid function. This function squashes the values into the desired range, making it perfect for our probability predictions.
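To see that squashing in action, here is a minimal sketch of the sigmoid function in Python; the score values fed to it are made up purely for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score (the log odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A few illustrative scores and the probabilities they squash to.
for z in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(f"log odds = {z:+.1f}  ->  probability = {sigmoid(z):.3f}")
```

A score of 0 maps to a probability of exactly 0.5, and large positive or negative scores push the probability towards 1 or 0 without ever reaching them.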
Now, we need to teach our logistic regression model to get things right. We do this by showing it a bunch of examples where we know the correct category. The model adjusts its parameters (those are the things controlling the linear relationship) so that it gets better at predicting the right categories. The logistic regression model learns to draw a line (or a boundary) that best separates the two categories. This line helps us make decisions – is this email spam or not? Is this image of a cat or a dog?
Once our logistic regression model is trained, we can use it to make predictions on new, unseen things. We plug in the features of the thing into our trained model, and it tells us the probability of it belonging to one of the categories. We can then make a decision based on this probability – if it’s more than 0.5, we might say it belongs to one category; if less, it belongs to the other.
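As an illustrative sketch of that prediction step, assume we already have a trained model with (hypothetical) weights and a bias for two features; we compute the probability and threshold it at 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters for a model with two features.
weights = np.array([1.2, -0.8])
bias = -0.3

# Features of a new, unseen example.
x_new = np.array([0.9, 0.4])

# Linear score (log odds) -> probability -> class decision.
log_odds = np.dot(weights, x_new) + bias
probability = sigmoid(log_odds)
predicted_class = 1 if probability > 0.5 else 0
print(f"probability = {probability:.3f}, predicted class = {predicted_class}")
```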
The name “logistic regression” might be a bit confusing, but here’s the trick: even though it has “regression” in its name, it’s not exactly what you might think. Back in the history of statistics, logistic regression was developed as an extension of linear regression. Linear regression is indeed a type of regression, and logistic regression borrowed some concepts from it. Both linear regression and logistic regression share a similar formal structure in their equations. The word “regression” kind of stuck from the earlier days of statistical modelling, and logistic regression inherited it.
Traditional regression, like linear regression, predicts continuous values. For example, predicting the price of a house or the temperature. Logistic regression, on the other hand, is all about predicting probabilities and dealing with classification problems, like spam or not spam, yes or no. While linear regression is concerned with linear relationships (straight lines), logistic regression uses the sigmoid function to transform this linear relationship into a non-linear one for probabilities. In essence, logistic regression captures a linear relationship in terms of log odds, which translates into a non-linear relationship in terms of probabilities.
So, in a way, logistic regression has the term “regression” in its name because of its historical ties to linear regression. However, it’s not your typical regression task; it’s more of a classification superhero, helping us decide which category things belong to. Think of it as a regression model that does classification, or one that becomes a classifier once we bring a threshold into the picture: usually a value above 0.5 is assigned to the positive class and a value below 0.5 to the negative class. So yes, in a way it is a regression because it predicts a probability, but it is not your typical regression.
Some Math Basics:
To understand logistic regression, you must know these concepts:
1. Odds:
- Odds are a way to express the likelihood of an event happening, and they are calculated as the ratio of the probability of the event occurring to the probability of it not occurring.
- Mathematically, if P is the probability of an event, the odds (O) are given by O = P / (1− P).
- Odds provide a convenient way to represent probabilities and offer a more interpretable measure, especially when dealing with events that may or may not happen.
- If the probability of winning a game is 0.7, the odds of winning are 0.7/(1−0.7) ≈ 2.33. This means the chance of winning is 2.33 times the chance of losing.
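A quick sketch to check that arithmetic (the 0.7 is just the example above):

```python
p_win = 0.7                     # probability of winning
odds_win = p_win / (1 - p_win)  # odds = P / (1 - P)
print(f"odds of winning = {odds_win:.2f}")  # ~2.33
```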
2. Log Odds ( Logit Or Logit Function )
- Log odds is the logarithm of the odds, denoted ln(O) or log(O), where ln represents the natural logarithm.
- Log odds are often preferred over simple probabilities because they transform the odds (which can range from 0 to infinity) into a continuous scale from negative to positive infinity.
- In logistic regression, log odds are utilized to linearize the relationship between input features and the likelihood of an event. This transformation simplifies the modelling process.
- In logistic regression, the goal is often to predict the probability of an event occurring. However, probabilities can become skewed towards extreme values (0 or 1), making them hard to work with. The log odds, being on a continuous scale, provide a more linear and manageable relationship with input features.
- If the odds of winning are 2.33, the log odds are ln(2.33) ≈ 0.85. This transformation allows us to express the odds on a logarithmic scale.
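The small sketch below shows how the log-odds transformation maps probabilities onto a symmetric scale around zero; the probability values are arbitrary illustrations.

```python
import math

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    odds = p / (1 - p)
    log_odds = math.log(odds)  # natural logarithm
    print(f"P = {p:.1f}  odds = {odds:.2f}  log odds = {log_odds:+.2f}")
```

Notice that P = 0.5 maps to log odds of exactly 0, and probabilities symmetric about 0.5 map to log odds of equal magnitude and opposite sign.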
3. Using Concepts 1 & 2 In Logistic Regression
- In logistic regression, we often deal with binary outcomes, like predicting if an email is spam or not. The probability (P) of an event occurring is transformed into odds (O) using the formula O = P / (1− P).
- For example, if the probability of an email being spam (P) is 0.7, the odds (O) become 0.7/(1−0.7) ≈ 2.33. This means the email is 2.33 times as likely to be spam as not spam.
- Now, we take the natural logarithm (log) of the odds to get the log odds. Using the previous example, if the odds are 2.33, the log odds would be ln(2.33). This transformation turns the odds into a continuous scale, making it more manageable for modelling. In logistic regression, the goal is to model the relationship between input features and the log odds of an event occurring. This relationship is often expressed as:
ln( P / (1 − P) ) = β0 + β1x1 + β2x2 + … + βnxn, where β0, β1, …, βn are the coefficients and x1, x2, …, xn are the input features.
- The logistic regression model essentially linearizes the log odds, making it a linear combination of input features with some weights (coefficients).
- During the training phase of logistic regression, the model’s parameters (coefficients) are iteratively adjusted to minimize the difference between the predicted log odds and the actual log odds of the observed outcomes in the training data.
- This optimization process is typically achieved using techniques like Maximum Likelihood Estimation (MLE). The goal is to find the coefficients that best fit the observed data.
- After training, when you want to make predictions on new, unseen data, the model provides the log odds of the event occurring based on the input features. To convert these log odds into probabilities, you apply the inverse transformation using the sigmoid function. You will learn more about the sigmoid function in the notes.
- This sigmoid function maps the log odds back to the probability scale, producing a probability value between 0 and 1.
- The logistic regression equation includes coefficients that represent the impact of each input feature on the log odds of the event occurring.
- Each coefficient βi represents the change in the log odds for a one-unit increase in the corresponding input feature xi. For instance, if β1 is 0.5, it means that a one-unit increase in x1 is associated with a 0.5 increase in the log odds of the event occurring.
- To make the interpretation more intuitive and relatable, we exponentiate the coefficients to obtain odds ratios. The odds ratio (e^βi) tells us how much the odds of the event occurring change for a one-unit increase in the corresponding input feature. If the odds ratio is 2, the odds of the event double for each one-unit increase in that feature. For example, if β1 is 0.5, exponentiating it (e^0.5) gives an odds ratio of approximately 1.65. An odds ratio of 1 suggests no effect: a one-unit increase in the feature doesn’t change the odds. An odds ratio greater than 1 indicates an increase in the odds of the event occurring with a one-unit increase in the feature, and an odds ratio of less than 1 indicates a decrease.
- Understanding the odds ratios helps in identifying which features have a substantial impact on the likelihood of the event, providing valuable insights for decision-making. However, in machine learning, we are mainly concerned with the classification metrics or classification ability of a model.
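To make the coefficient-to-odds-ratio step concrete, here is a hedged sketch using scikit-learn on synthetic data; the dataset and parameter values are stand-ins, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a real dataset (purely illustrative).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

model = LogisticRegression().fit(X, y)

# Exponentiating a coefficient gives the odds ratio for a one-unit
# increase in that feature, holding the other features fixed.
for i, beta in enumerate(model.coef_[0]):
    print(f"feature {i}: beta = {beta:+.3f}, odds ratio = {np.exp(beta):.3f}")
```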
Cost Function In Logistic Regression:
The logistic regression model is trained using Maximum Likelihood Estimation (MLE), which estimates the parameters by maximizing the likelihood of the observed data. This process can be optimized using gradient descent, which we will explore in more detail in a later tutorial when we cover neural networks.
In logistic regression, we minimize log loss (also called cross-entropy loss ). Various terms are used interchangeably, such as “loss function,” “cost function,” or “objective function” to represent the term we are minimizing using our optimization algorithm. Here our cost function is log loss.
In logistic regression, the model estimates the probability P of an instance belonging to the positive class. Let y be the true class label (0 or 1).
- The likelihood function L measures how well the model’s estimated probabilities match the true labels. For one instance, it is given by L = p^y ⋅ (1−p)^(1−y).
- Taking the logarithm of the likelihood function simplifies computations and is a common practice. The log-likelihood function ℓ is defined as ℓ = log( L ). For one instance, ℓ = y ⋅ log(p) + (1−y) ⋅ log(1−p).
- To convert the optimization problem into a minimization task (as optimization algorithms typically minimize), we consider the negative log-likelihood: −ℓ = −[ y ⋅ log(p) + (1−y) ⋅ log(1−p) ].
- For N instances, the average negative log-likelihood is: −(1/N) ∑ᵢ [ yi ⋅ log(pi) + (1−yi) ⋅ log(1−pi) ], where the sum runs over all N instances. This is the final form of the log loss equation for binary logistic regression.
- The term yi ⋅ log(pi) penalizes the model when the true class is 1 but the predicted probability pi is low, while (1−yi) ⋅ log(1−pi) penalizes it when the true class is 0 but pi is high. The overall log loss is the average of these penalties across all instances (a small sketch after this list shows the computation).
- Log loss has a close connection to information theory, specifically through the concepts of entropy and Kullback-Leibler (KL) divergence. In information theory, entropy measures uncertainty or disorder in a system, and KL divergence quantifies the difference between two probability distributions. Log loss can be viewed as a form of KL divergence between the true distribution (yi) and the predicted distribution (pi ). You will learn about information theory in the decision trees tutorial.
- Minimizing log loss during training implies that the model is becoming more certain and informative about the true distribution. The optimization algorithm adjusts the parameters to minimize the discrepancy between predicted probabilities and actual outcomes. Minimizing this loss function is a convex optimization problem.
- One practical implication of log loss is that it encourages the model to provide well-calibrated probabilities. Calibrated probabilities are crucial in scenarios where decision-makers rely on the confidence of the model’s predictions, such as in medical diagnosis or fraud detection.
- However, log loss is not without its trade-offs. The metric can be sensitive to outliers and extreme predictions, and finding the optimal balance between avoiding overfitting and underfitting is something you should keep in mind.
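As a sanity check on the formula above, here is a minimal NumPy implementation of binary log loss; the labels and predicted probabilities are made-up examples, and scikit-learn’s log_loss computes the same quantity.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood for binary labels and predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)  # clip to avoid log(0)
    y = np.asarray(y_true, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up true labels and predicted probabilities of the positive class.
y_true = [1, 0, 1, 1, 0]
p_pred = [0.9, 0.2, 0.6, 0.8, 0.1]
print(f"log loss = {binary_log_loss(y_true, p_pred):.4f}")
```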
Mathematical Derivation:
Read these before you move ahead:
Just like linear regression, we can also regularize logistic regression models using an L1 or L2 penalty, or a combination of both (elastic net). It works the same way as in linear regression: add the penalty term to the cost function and perform training. The choice between L1 and L2 regularization depends on the specific characteristics of the dataset and the desired properties of the model. L1 regularization is favoured when feature selection is crucial, whereas L2 regularization is preferred for preventing large coefficients.
L2 regularization (the ridge penalty) is computationally more convenient due to its straightforward derivative (the derivative of θ² is simply 2θ). By contrast, L1 regularization introduces complexity because |θ| is not differentiable at zero.
While L2 regularization encourages weight vectors with many small weights, L1 regularization prefers sparse solutions. In practical terms, L1 regularization results in weight vectors with fewer features, as it tends to set many weights to zero. This implies that L1 regularization favors models with a reduced set of influential features, promoting sparsity in the solution.
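In scikit-learn, switching between penalties comes down to the penalty and C arguments of LogisticRegression (C is the inverse of the regularization strength, and the L1 penalty needs a compatible solver such as liblinear or saga). The sketch below is a minimal illustration on synthetic data, not a tuned setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# L2 penalty (the default): shrinks all weights smoothly toward zero.
l2_model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# L1 penalty: tends to drive some weights exactly to zero (sparse solution).
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)

print("non-zero weights with L2:", int((l2_model.coef_ != 0).sum()))
print("non-zero weights with L1:", int((l1_model.coef_ != 0).sum()))
```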
Daniel Jurafsky & James H. Martin, Stanford, Speech and Language Processing, logistic regression using SGD
Logistic Regression In Practice
There are various steps we follow in practice before we apply logistic regression to a problem. We will see how to apply logistic regression in practice, but first have a look at some important steps and considerations when working on any ML classification problem, especially logistic regression (a sketch that ties several of these steps together follows the list):
- Understanding the Data: Before diving into modelling, thoroughly understand your dataset. Examine summary statistics, identify missing values, and explore the distribution of features.
- Relevant Features: Choose features that are relevant to your prediction task. Consider using domain knowledge or feature importance techniques to identify key variables.
- Handling Missing Values: Decide on a strategy for handling missing values, whether through imputation or removal.
- Categorical Variables: Encode categorical variables appropriately, either through one-hot encoding or label encoding.
- Feature Scaling: Depending on the algorithm and features, consider scaling numerical features to ensure uniform influence.
- Train-Test Split: Split your dataset into training and testing sets to evaluate model performance on unseen data.
- Stratified Sampling: If dealing with imbalanced classes, use stratified sampling to maintain class distribution in both training and testing sets.
- Class Imbalance: If your dataset has imbalanced classes, explore techniques like oversampling, undersampling, or using class weights to address this issue.
- Choose Appropriate Model: Ensure logistic regression is suitable for your problem. For complex relationships, consider other algorithms.
- Hyperparameter Tuning: Experiment with hyperparameters like the regularization strength (C) to find the optimal model.
- Regularization: Be mindful of overfitting or underfitting. Adjust regularization strength to balance model complexity. Decide on the type of regularization (L1 or L2) based on your objectives.
- Metrics Selection: Choose appropriate evaluation metrics based on the nature of your problem (accuracy, precision, recall, F1-score).
- Cross-Validation: Use cross-validation to get a more robust estimate of your model’s performance.
- Interpretability: Logistic regression provides interpretable coefficients. Consider the interpretability of your model in the context of your analysis.
- Linearity: Logistic regression assumes a linear relationship between features and the log odds of the response variable. Check for linearity in your data.
- No Multicollinearity: Avoid strong correlations between independent variables (multicollinearity), as it can affect the stability of coefficients.
- Scalability: Consider the scalability of your model if deploying for real-time predictions. Regularly monitor model performance in a production environment and retrain as needed.
- Fairness and Bias: Be aware of potential biases in your data and model predictions. Address fairness concerns if applicable.
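To tie several of these steps together, here is a hedged end-to-end sketch using scikit-learn on synthetic data; the dataset, parameters, and metric choices are stand-ins for illustration, not a prescription for your problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced stand-in for a real, cleaned dataset.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.7, 0.3],
                           random_state=42)

# Stratified train-test split preserves the class balance in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale features, then fit a regularized logistic regression.
clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
clf.fit(X_train, y_train)

# Cross-validated estimate on the training data plus a held-out test report.
cv_scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="f1")
print("5-fold F1 scores:", cv_scores.round(3))
print(classification_report(y_test, clf.predict(X_test)))
```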
All practical concepts are discussed in the notebook below. Please note: you don’t have permission to share, sell, or distribute this file to anyone. We reserve all rights, and we expect you to respect our intellectual property.
When to Use Logistic Regression:
So, when should you use logistic regression?
- Binary Outcome:
- Use logistic regression when your dependent variable is binary (two classes). If you have more than two classes, you might consider other models like multinomial logistic regression.
- Linear Relationship:
- When there is a linear relationship between the independent variables and the log odds of the dependent variable, logistic regression is appropriate. Ensure this assumption is met or consider alternative models.
- Interpretability Matters:
- If interpretability is crucial and you need to understand the impact of each variable on the outcome, logistic regression is a good choice. The coefficients have a straightforward interpretation in terms of odds ratios.
- Limited Features:
- Logistic regression is suitable when the number of features is moderate, and there is a concern about overfitting. It’s especially useful in situations where the “curse of dimensionality” may be problematic for more complex models.
- Probabilistic Output:
- When the output in terms of probabilities is important for decision-making, such as in risk assessment or marketing, logistic regression provides a natural way to estimate probabilities.
Sources & Recommended Readings:
- https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
- https://home.iitk.ac.in/~shalab/regression/Chapter16-Regression-Generalized%20Linear%20Models.pdf
- https://sscnars.icar.gov.in/socialsci/2-logistic%20regression.pdf
- https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf