Imagine you are a real estate agent trying to estimate the price of a house based on its size. You’ve probably noticed that larger houses tend to have higher prices, but how can you quantify this relationship? This is where linear regression, a fundamental concept in machine learning, comes into play.
Linear regression is a powerful tool that helps us understand and predict the relationship between two variables. In our real estate example, the size of the house is the independent variable (also known as the predictor or feature), and the price is the dependent variable (the target or outcome). Linear regression aims to find the best-fitting line through the data points that can be used to make predictions.
To delve into linear regression and how it is used in machine learning, it’s essential to understand the core concepts involved.
Linear regression is a stepping stone for more advanced machine learning techniques and is widely employed in diverse fields such as finance, healthcare, and marketing. So, let’s discover how it can be a valuable asset in your machine-learning toolkit.
Table of Contents

- Introduction to Linear Regression
- How to Fit a Linear Model to Data
- Understanding the Line of Best Fit
- Evaluation Metrics for Linear Regression
- Linear Regression Implementation in Scikit-Learn
Before you proceed, I highly recommend reading these notes, as they explain most of the basic concepts related to probability and statistics. You should also be comfortable with calculus. Finally, check out this tutorial before you move ahead!
Definitions And Basics
When two variables are associated, changes in one variable tend to be related to changes in the other. Correlation is a measure of association, indicating the strength and direction of a linear relationship between two variables. However, association alone does not imply causation. Just because two variables are correlated does not mean that one causes the other. Causation, on the other hand, implies a cause-and-effect relationship between two variables. Establishing causation is more complex than identifying an association.
While correlation provides a measure of how closely two variables move together, it doesn’t reveal the direction of causation or rule out the influence of a third variable. This is where regression comes into play. Linear regression goes a step further than correlation by providing a predictive model that can be used for making informed predictions and understanding the impact of one variable on another.
Consider our housing example:
| Size (sq. ft) | Number of Bedrooms | Price ($) |
|---------------|--------------------|-----------|
| 1500          | 3                  | 250,000   |
| 2000          | 4                  | 320,000   |
| 1200          | 2                  | 200,000   |
| 1600          | 3                  | 275,000   |
| 1800          | 4                  | 300,000   |
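As a quick illustration of the correlation discussed above, here is a minimal sketch (assuming NumPy is available) that computes the Pearson correlation between size and price for the five houses in this table:

```python
import numpy as np

size = np.array([1500, 2000, 1200, 1600, 1800])
price = np.array([250_000, 320_000, 200_000, 275_000, 300_000])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between size and price.
r = np.corrcoef(size, price)[0, 1]
print(f"Correlation between size and price: {r:.3f}")
```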
In this dataset, we have the house size, the number of bedrooms, and the price. We use some standard terminology to refer to these parts of a dataset.
- Features: The characteristics or attributes that we use to predict the house price. These are the independent variables. Here, Size (sq. ft) and Number of Bedrooms are the features of the dataset. Each row in the dataset represents a house, and the corresponding values for size and number of bedrooms are the features associated with that house. We use the term “feature” interchangeably with “predictor,” “independent variable,” “x,” or “input.”
- Label (Target): The variable we want to predict. In this case, it’s the house price. This is the dependent variable. In regression, we use the features to predict the target. We use the term “label” interchangeably with “target,” “dependent variable,” “y,” or “output.”
- Observations (or Data Points): Each row in the dataset is an observation or a data point. For example, the first row with a size of 1500 sq. ft, 3 bedrooms, and a price of $250,000 is one observation.
- Dataset: The entire set of data containing both features and labels is the dataset. In our example, it’s the table with size, number of bedrooms, and price information for each house.
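To make this terminology concrete, here is a minimal sketch (assuming pandas is installed; the column names are my own choice) that loads the example table into a DataFrame and separates the features from the label:

```python
import pandas as pd

# The example table as a DataFrame: each row is one observation (a house).
data = pd.DataFrame({
    "size_sqft": [1500, 2000, 1200, 1600, 1800],
    "bedrooms": [3, 4, 2, 3, 4],
    "price": [250_000, 320_000, 200_000, 275_000, 300_000],
})

X = data[["size_sqft", "bedrooms"]]  # features (independent variables)
y = data["price"]                    # label / target (dependent variable)

print(X.shape)  # (5, 2): five observations, two features
print(y.shape)  # (5,): one label per observation
```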
Linear regression is a supervised machine learning model. In supervised learning, we have a dataset where each example comes with input features and an associated output label. The algorithm learns to map the input features to the correct output label during the training phase. The goal is to make accurate predictions on new, unseen data. Supervised learning problems can be mainly categorised into regression or classification.
In regression, the task is to predict a continuous value. In the context of our house dataset, we’re dealing with predicting the price of a house, which is a continuous and numeric variable. In our example data above, we want the algorithm to learn a relationship between the size of a house, the number of bedrooms, and the price of the house. The output is a numerical value that can vary across a range. For instance, the algorithm might learn that, on average, a house with more square footage and more bedrooms tends to have a higher price. The trained model can then make predictions about the price of a new house based on its size and number of bedrooms.
On the other hand, in classification, the task is to predict which category or class an input belongs to. In our house dataset, we might want to categorize houses into different price ranges or classes. Here, we’re not predicting an exact price; instead, we’re placing houses into discrete categories.
| Size (sq. ft) | Number of Bedrooms | Price ($) | Category (Target) |
|---------------|--------------------|-----------|-------------------|
| 1500          | 3                  | 250,000   | Affordable        |
| 2000          | 4                  | 320,000   | Moderate          |
| 1200          | 2                  | 200,000   | Affordable        |
| 1600          | 3                  | 275,000   | Affordable        |
| 1800          | 4                  | 300,000   | Moderate          |
We might define these categories based on price ranges, like “Affordable” for houses below a certain price, “Moderate” for houses in an intermediate price range, and “Expensive” for high-priced houses. The classification algorithm learns the patterns in the data to assign each house to the correct category. For example, it might be learned that houses with smaller sizes and fewer bedrooms tend to fall into the “Affordable” category, while larger houses with more bedrooms tend to be “Expensive.”
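As a small illustration, here is a sketch of how such categories could be derived from price ranges with pandas. The cut-off values (280,000 and 400,000) are assumptions chosen only to reproduce the labels in the table above:

```python
import pandas as pd

prices = pd.Series([250_000, 320_000, 200_000, 275_000, 300_000])

# Bin edges are illustrative assumptions, chosen to match the table above.
categories = pd.cut(
    prices,
    bins=[0, 280_000, 400_000, float("inf")],
    labels=["Affordable", "Moderate", "Expensive"],
)
print(categories.tolist())
# ['Affordable', 'Moderate', 'Affordable', 'Affordable', 'Moderate']
```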
Depending on the nature of the problem, we choose our machine learning models. In future tutorials, we are going to learn different models and how to use them for different applications. However, in this tutorial, we are going to focus on regression only. In the next tutorial, we will learn about Naive Bayes, a classification model, followed by logistic regression, which is another classification model built on regression.
Now, let’s come back to why linear regression is preferred over correlation. Linear regression is preferred for the following reasons:
- Directional Causation: Linear regression helps identify the direction of causation. In a predictive model, the independent variable (predictor) is explicitly used to estimate the dependent variable (outcome). This directional information is crucial for understanding the cause-and-effect relationship.
- Quantitative Predictions: Linear regression provides a quantitative prediction of the dependent variable based on the values of the independent variable(s). This is particularly valuable in scenarios where you want to estimate or forecast an outcome.
- Control for Confounding Variables: Linear regression allows for the inclusion of multiple independent variables, enabling the control for potential confounding factors, which are variables that might affect both the predictor and outcome. This helps in isolating the true effect of the independent variable on the dependent variable, ensuring a more accurate analysis of the relationship.
Now let’s frame the problem of linear regression based on our example dataset. We want to predict the price of a house based on its size (in square feet) and the number of bedrooms it has. Our goal is to find a formula or equation that, given the size of the house and the number of bedrooms, can estimate the price of the house. In the context of our problem, the linear regression equation looks like this:
Price = m×Size + n×Bedrooms + b
Here:
1. Price is the predicted house price.
2. Size is the size of the house.
3. Bedrooms is the number of bedrooms.
4. m is the weight (or coefficient) associated with the size of the house. This tells us how much the predicted price changes for each additional square foot increase in the size of the house. If m is, for example, 100, it means that for every additional square foot, the predicted price increases by $100.
5. n is the weight (or coefficient) associated with the number of bedrooms. Similarly, this tells us how much the predicted price changes for each additional bedroom. If n is, for example, 20, it means that for every additional bedroom, the predicted price increases by $20.
6. b is the y-intercept, representing the base price of a house when both size and number of bedrooms are zero.
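As a minimal sketch, here is what that equation looks like in code, with made-up values for m, n, and b purely for illustration; the real values come out of the fitting process described next:

```python
# The prediction equation Price = m*Size + n*Bedrooms + b as a function.
# The coefficient values below are made up for illustration only.
def predict_price(size_sqft, bedrooms, m=100.0, n=5_000.0, b=50_000.0):
    """Return the predicted price for a house with the given size and bedroom count."""
    return m * size_sqft + n * bedrooms + b

# With these made-up weights, a 1500 sq. ft, 3-bedroom house is priced at:
print(predict_price(1500, 3))  # 215000.0
```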
To solve this problem or to build our model, our objective is to find the best values for m, n, and b that minimize the difference between our predicted prices and the actual prices in our dataset. When we say “best values,” we mean the values for m, n, and b that make our predictions as close as possible to the actual prices of the houses in our dataset. Our goal is to make the predictions as accurate as possible. We want the predicted prices to be very close to the actual prices for the houses we already know. To achieve this, we need to minimize the difference between our predicted prices and the actual prices.
But you may ask, how do we find these best values? 🤔 We use mathematical “optimization” techniques to find the best values. In simple terms, optimization is like adjusting the knobs on our formula until we get the best results. We have a “cost function” that measures how far off our predictions are from the actual prices. The goal is to minimize this cost. Gradient descent is a technique used to minimize the cost function. We will learn about it in another tutorial but for the time being, you don’t need to know what it is. It’s like a step-by-step process of adjusting the values of m, n, and b to make our predictions better and better.
We repeat this process iteratively until we find values for m, n, and b that result in the smallest possible difference between our predicted prices and the actual prices in our dataset. Once we have these optimal values for m, n, and b (the ones that give us the most accurate predictions across all the houses in our dataset), we can use our linear regression equation to predict the price of a house from its size and number of bedrooms for new, unseen data points.
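If you are curious what this iterative adjustment looks like in practice, here is a minimal gradient descent sketch for the single-feature case (Price = m × Size + b) using only NumPy. The learning rate, iteration count, and feature scaling are arbitrary choices for illustration; we will cover the details in the gradient descent tutorial:

```python
import numpy as np

size = np.array([1500, 2000, 1200, 1600, 1800], dtype=float)
price = np.array([250_000, 320_000, 200_000, 275_000, 300_000], dtype=float)

# Standardize the feature so a simple fixed learning rate converges nicely.
x = (size - size.mean()) / size.std()
y = price

m, b = 0.0, 0.0   # initial guesses for the weight and intercept
lr = 0.1          # learning rate: how large each "knob adjustment" is

for _ in range(1000):
    pred = m * x + b                 # current predictions
    error = pred - y                 # how far off we are
    grad_m = 2 * np.mean(error * x)  # gradient of the MSE cost w.r.t. m
    grad_b = 2 * np.mean(error)      # gradient of the MSE cost w.r.t. b
    m -= lr * grad_m                 # step against the gradient
    b -= lr * grad_b

print(m, b)  # slope and intercept for the standardized size feature
```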
That’s all that we do in regression tasks. Now let’s put it in mathematical form. We mainly have three variants:
Simple linear regression: Here we have one independent variable (feature) predicting a single dependent variable (label). The formula for simple linear regression for our example is:
Price = m × Size + b
Multiple linear regression: Here we extend the simple linear regression to include multiple independent variables. For our housing dataset, with both size and number of bedrooms as features, the formula becomes:
Price = m × Size + n × Bedrooms + b
Polynomial Regression: Here we capture nonlinear relationships between the features and the target variable by introducing polynomial terms. For example, a quadratic polynomial regression with one feature (size) would look like:
Price = a×Size^2 + b×Size + c
In general, the formal formulation of linear regression, including multiple and polynomial regression, can be represented as:
Y = β₀ + β₁ × Feature₁ + β₂ × Feature₂ + … + βₙ × Featureₙ
where β₀ is the y-intercept and β₁, β₂, …, βₙ are the coefficients associated with each feature.
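To connect this general formula to code, here is a minimal sketch that builds the design matrix for our two features and solves for β₀, β₁, and β₂ with ordinary least squares using NumPy’s lstsq. In practice you would usually let a library such as scikit-learn do this for you, as shown later in this tutorial:

```python
import numpy as np

size = np.array([1500, 2000, 1200, 1600, 1800], dtype=float)
bedrooms = np.array([3, 4, 2, 3, 4], dtype=float)
price = np.array([250_000, 320_000, 200_000, 275_000, 300_000], dtype=float)

# Design matrix: a column of ones for the intercept β0, then one column per feature.
X = np.column_stack([np.ones_like(size), size, bedrooms])

# Ordinary least squares: find the β that minimizes ||Xβ - price||^2.
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
print(beta)  # [β0, β1 (size), β2 (bedrooms)]
```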
Geometrically, we are trying to find the line (or, with multiple features, the plane) that best fits the dataset by minimizing the vertical distance from the data points to it.
I hope these basic concepts are clear. Now let’s get into the mathematical derivations and then we will move to more advanced concepts. I have provided the mathematical derivations in the notes below, please read them before you proceed further.
Mathematics Behind Linear Regression
Before you proceed, I am assuming that you have already followed my notes on probability and statistics, because a detailed derivation of linear regression is provided in that note. The note here covers some basic and advanced concepts:
In the note above, lasso uses the L1 penalty and ridge regression uses the L2 penalty. Feel free to ask your questions in the forum here.
Application Of Linear Regression In Python
Now that you have a basic understanding of linear regression fundamentals, let’s see how we can apply it to real-world problems. Here is the PDF that you can download for your own use or read online. Try to code along for better understanding. Also, please don’t share the files anywhere online without our permission, as it takes a lot of effort to create these tutorials. Open your Jupyter notebook and start coding.
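If you want a quick preview before opening the notebook, here is a minimal scikit-learn sketch fitted on the small example table from earlier. A real workflow would also split the data into training and test sets and evaluate the fit:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The small example dataset from earlier in the tutorial.
data = pd.DataFrame({
    "size_sqft": [1500, 2000, 1200, 1600, 1800],
    "bedrooms": [3, 4, 2, 3, 4],
    "price": [250_000, 320_000, 200_000, 275_000, 300_000],
})
X = data[["size_sqft", "bedrooms"]]
y = data["price"]

# Fit the linear model: scikit-learn finds the coefficients and intercept for us.
model = LinearRegression()
model.fit(X, y)

print("coefficients:", model.coef_)    # m (size) and n (bedrooms)
print("intercept:", model.intercept_)  # b

# Predict the price of a new, unseen 1700 sq. ft, 3-bedroom house.
new_house = pd.DataFrame({"size_sqft": [1700], "bedrooms": [3]})
print("prediction:", model.predict(new_house))
```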
Important Questions
These are some important questions that I have designed, and they may be helpful in your interviews. Some concepts may be advanced, but if you read the scikit-learn documentation you will see that everything has been covered. I have provided links to important topics and sources in the notebook and the tutorial; please read them for further detail. You don’t have to learn everything in one go. As you continue to practice these questions, you will get a feel for how to use them.
- Assumptions Challenge:
- Explain the assumptions of linear regression. How might violation of these assumptions affect the reliability of your model?
- The assumptions of linear regression include linearity, independence, homoscedasticity, and normality of residuals. Violating these assumptions can affect the reliability of the model. For example, if the assumption of linearity is violated, the model may not accurately capture the true relationship between variables, leading to biased predictions. Violating independence assumptions might result in correlated errors, affecting the statistical tests. Homoscedasticity assumption implies constant variance of residuals, and violating it may lead to inefficient coefficient estimates. Normality of residuals is necessary for accurate confidence intervals and hypothesis tests, and its violation may impact the validity of statistical inferences.
- Coefficient Interpretation:
- If the coefficient of a predictor variable is 0.05, how would you interpret this in the context of linear regression?
- If the coefficient of a predictor variable is 0.05, it means that for a one-unit increase in the predictor variable, the response variable is expected to increase by 0.05 units, assuming all other variables are held constant. The coefficient represents the change in the mean response for a one-unit change in the predictor, providing insight into the strength and direction of the relationship between the predictor and the response.
- Multicollinearity Awareness:
- What is multicollinearity, and why is it a concern in multiple linear regression? How can you detect and handle multicollinearity in your data?
- Multicollinearity occurs when predictor variables in a regression model are highly correlated. This can lead to instability in coefficient estimates. Detection methods include variance inflation factor (VIF) analysis, where VIF values above a certain threshold (commonly 5 or 10) suggest multicollinearity. To address multicollinearity, one can consider removing one of the correlated variables, combining them, or using dimensionality reduction techniques like principal component analysis (PCA). A short VIF sketch follows this list.
- R-squared Critique:
- You have two models with R-squared values of 0.75 and 0.85. Which model is better, and why? Are there any limitations to relying solely on R-squared for model evaluation?
- R-squared measures the proportion of variance in the dependent variable explained by the independent variables. A higher R-squared generally indicates a better fit. However, it should not be the sole criterion for model evaluation. Models with higher R-squared values may overfit the data. Adjusted R-squared, which accounts for the number of predictors, and other evaluation metrics, such as Mean Squared Error (MSE), should be considered to provide a more comprehensive evaluation of model performance.
- Outlier Impact:
- Discuss the impact of outliers on linear regression models. How can you identify and address outliers in your dataset?
- Outliers can disproportionately influence coefficient estimates, leading to biased models. They can also affect assumptions like normality and homoscedasticity. Identifying outliers can be done through visual inspection of residual plots or using statistical methods like the Z-score or leverage. Addressing outliers may involve transforming the data, using robust regression techniques, or excluding extreme values based on domain knowledge.
- Heteroscedasticity Challenge:
- Define heteroscedasticity in the context of linear regression. How might it affect the reliability of your regression model, and what methods can be employed to address it?
- Heteroscedasticity is the unequal variance of residuals across different levels of the independent variable. It violates the assumption of homoscedasticity. To address heteroscedasticity, transformations like log transformations may be applied to the dependent variable, or weighted least squares regression can be used. Residual plots can help visually identify heteroscedastic patterns.
- Interaction Terms Understanding:
- What are interaction terms in the context of linear regression? Provide an example of when and why you might include an interaction term in your model.
- Interaction terms capture the combined effect of two or more variables on the dependent variable. For instance, in a model predicting house prices, an interaction term between square footage and neighborhood type could account for different price effects based on both factors. Including interaction terms allows the model to capture nuanced relationships. Care should be taken to include only relevant interactions, as including too many may lead to overfitting.
- Gradient Descent Knowledge:
- Explain the concept of gradient descent as it relates to linear regression. How does it help in finding the optimal coefficients?
- Gradient descent is an optimization algorithm used to minimize the error (cost) function in linear regression. It iteratively adjusts the coefficients to find the minimum of the cost function. The algorithm calculates the gradient (partial derivatives) of the cost function with respect to each coefficient and updates the coefficients in the opposite direction of the gradient. This process is repeated until convergence. Gradient descent is computationally efficient and particularly useful for large datasets.
- Residual Analysis:
- Describe the process of residual analysis in the context of linear regression. How can residual plots be used to evaluate the assumptions of the model?
- Residual analysis involves examining the differences between observed and predicted values (residuals) to assess the model’s assumptions. Residual plots, such as scatterplots or histograms, help identify patterns like non-linearity, heteroscedasticity, and outliers. A well-fitted model should have residuals that are randomly distributed around zero, showing no systematic patterns. Residual analysis is crucial for validating the assumptions and diagnosing potential issues with the model.
- Model Comparison:
- Compare and contrast the differences between simple linear regression and multiple linear regression. In what scenarios would you prefer one over the other?
- Simple linear regression involves predicting a dependent variable using only one independent variable, while multiple linear regression involves two or more independent variables. Simple models may be preferred when the relationship is clear and additional predictors do not add significant explanatory power. Multiple regression allows for more complexity, capturing interactions and adjusting for confounding variables. The choice depends on the nature of the relationship and the research question. Model selection criteria, such as AIC or BIC, can aid in deciding between models.
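For the multicollinearity question above, here is a minimal sketch of the VIF check (assuming statsmodels is installed; it reuses the small housing table, so the exact numbers are only illustrative):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({
    "size_sqft": [1500, 2000, 1200, 1600, 1800],
    "bedrooms": [3, 4, 2, 3, 4],
})
X = X.assign(const=1.0)  # statsmodels expects an explicit intercept column

feature_cols = ["size_sqft", "bedrooms"]
vif = pd.Series(
    [variance_inflation_factor(X.values, X.columns.get_loc(col)) for col in feature_cols],
    index=feature_cols,
)
print(vif)  # values far above ~5-10 suggest multicollinearity
```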
Recommended Reading:
- https://www.uoguelph.ca/lang/system/files/Regression.pdf
- https://www.westga.edu/academics/research/vrc/assets/docs/linear_regression_notes.pdf
- https://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_lars_ic.html
- https://scikit-learn.org/stable/modules/linear_model.html#bayesian-regression