In this tutorial, we will learn our next machine-learning model: nearest neighbors (NN), better known as k-nearest neighbors (k-NN). It belongs to the family of supervised learning algorithms, specifically the category of instance-based learning. Unlike many other models that involve complex mathematical equations and intricate training processes, k-NN relies on a beautifully simple principle: similar instances are likely to share common characteristics.
Traditional machine learning models go through a training phase where they learn patterns and relationships from the provided data. In contrast, k-NN doesn’t have a distinct training phase: it memorizes the entire training dataset, essentially storing it for later use when making predictions. This is why it is called a lazy learning algorithm. Nearest neighbors can be used for both supervised and unsupervised tasks.
Let’s get started and learn about our next model called nearest neighbors.
Prerequisites:
- Python, Numpy, Sklearn, Pandas and Matplotlib.
- Linear Algebra For Machine Learning.
- Statistics And Probability Theory.
Concepts Of Nearest Neighbors
Imagine you have a big box full of different coloured balls, each representing a data point with various features (like size, weight, or color intensity). Now, let’s say you have a new ball, and you want to figure out which group it belongs to or how similar it is to the balls you already have in the box. That’s where Nearest Neighbors (NN) comes into play!
- In our example, each ball in the box is a “neighbor” to the new ball. Neighbors are simply the existing data points in your dataset.
- Neighbors are determined based on how close they are to the new data point in terms of their features. Think of closeness as a measure of similarity or distance.
- The “k” in k-Nearest Neighbors (k-NN) represents the number of neighbors we want to consider when making a decision. If we choose “k = 3,” for instance, we’re saying, “Hey, let’s look at the three closest balls to our new ball and see what group they belong to.” When k = 1, the model is simply called nearest neighbor (1-NN); for larger k, it is k-NN.
- For classification tasks (like sorting balls into different coloured groups), we see which group the majority of the nearest neighbors belong to; for regression tasks (like predicting a numerical value), we take the average of the nearest neighbors’ target values.
- To figure out which balls are closest to our new ball, we need a way to measure the distance between them. This could be based on the features of the balls, like their colours or sizes. Common distance metrics include Euclidean distance (straight-line distance) and Manhattan distance (distance along grid lines). We have already seen these in our notes on Linear Algebra.
- What’s cool about NN is its flexibility. It doesn’t make any assumptions about the underlying structure of the data; it simply finds the most similar data points. This flexibility makes it great for both classification and regression tasks, and it can handle data with complex patterns.
Learning in k-NN:
Imagine you have a collection of objects, each described by different characteristics. In our case, let’s say these objects are like different types of fruits, and the characteristics are things like colour, shape, and size. The learning process in k-NN is super simple – it memorizes all these fruits and their characteristics.
The algorithm just remembers the entire dataset. Unlike some other algorithms, k-NN doesn’t go through a formal training phase where it adjusts internal parameters (that makes it a nonparametric model) or builds a specific model. It just stores all the data in its memory. Now, let’s say you find a new fruit, and you want to know what type it is based on its characteristics. This is where the prediction process begins.
You look at the new fruit’s features (colour, shape, size), and the k-NN algorithm asks, “Hey, which fruits in my memory are closest to this new one in terms of features?” These closest fruits are the “neighbors.” If you’re doing a classification task (like sorting fruits into categories), you might ask the neighbors, “What category do you belong to?” Then, you classify the new fruit based on the majority opinion of its neighbors.
If you’re doing a regression task (like predicting the sweetness of a fruit), you ask the neighbors, “What’s your sweetness level?” Then, you predict the new fruit’s sweetness by taking the average sweetness level of its neighbors.
And there you have it – the algorithm makes a decision based on what its nearest neighbors say. If they’re mostly apples, it says the new fruit is likely an apple. If they have an average sweetness of 8, it predicts the sweetness of the new fruit as 8.
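To make this concrete, here is a minimal from-scratch sketch of the “memorize, then vote or average” idea. The fruit measurements and labels below are made up purely for illustration:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    dists = np.linalg.norm(X_train - x_new, axis=1)   # distance to every stored point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    if task == "classification":
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]              # majority vote among neighbors
    return y_train[nearest].mean()                    # average of neighbors' targets

# Hypothetical "fruits" described by two features, e.g. (size, colour intensity):
X = np.array([[7, 3], [6, 3], [3, 7], [2, 8]], dtype=float)
y_class = np.array(["apple", "apple", "orange", "orange"])
y_sweetness = np.array([7.0, 8.0, 5.0, 4.0])

new_fruit = np.array([6.5, 2.5])
print(knn_predict(X, y_class, new_fruit, k=3))                       # -> 'apple'
print(knn_predict(X, y_sweetness, new_fruit, k=3, task="regression"))  # -> average sweetness
```

Notice that “training” is nothing more than keeping `X` and `y` around; all the work happens at prediction time.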
So, What Are The Similarity Measures We Use In k-NN?
In nearest neighbors we use various similarity measures to find the nearest neighbors. The underlying idea is that points (or instances) that are close to each other in the feature space share similar characteristics. If you’re trying to predict whether an email is spam or not, similar words and phrases in the email’s content may indicate its class.
In our daily lives, we often make decisions based on what’s similar or close to what we know. By considering closeness, k-NN adapts to local decision boundaries. It doesn’t make global assumptions about the data; it simply looks at what’s nearby. This flexibility allows it to capture intricate patterns that might be missed by more rigid algorithms.
Mathematically, we use various similarity measures, also known as distance metrics, to calculate the distance between data points. Here are some important measures that we are already familiar with:
- Euclidean Distance: The straight-line (L2) distance between two points in space. It is widely used when the magnitude and scale of features matter and the features are measured in comparable units; make sure to scale your data first.
- Manhattan Distance: The sum of the absolute differences along each dimension (L1 norm). It is useful when features are measured in different units or when you want to measure distance along grid lines, and it is more robust to outliers than Euclidean distance.
- Minkowski Distance: Minkowski distance is a generalized form that includes both Euclidean and Manhattan distances. It’s parameterized by a variable p.
- Cosine Similarity: Cosine similarity measures the cosine of the angle between two vectors. It’s often used when the magnitude of the vectors is not crucial, but the direction matters. Suitable for scenarios where the absolute values of features are not as important as their directions or where the relationship between features matters more than their absolute values. Commonly used in text mining and natural language processing.
It’s often a good practice to experiment with multiple distance metrics and see which one performs best for your specific task. Grid search or cross-validation can help in the evaluation process. You can find their formulas in the notes below.
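As a quick reference, here is a minimal sketch (assuming SciPy is available) that computes the measures above for two example vectors. Note that SciPy’s `cosine` returns the cosine distance, i.e. 1 minus the cosine similarity:

```python
from scipy.spatial import distance

a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, 3.0]

print(distance.euclidean(a, b))        # straight-line (L2) distance
print(distance.cityblock(a, b))        # Manhattan (L1) distance
print(distance.minkowski(a, b, p=3))   # generalized Minkowski distance with p = 3
print(distance.cosine(a, b))           # cosine distance = 1 - cosine similarity
```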
Commonly Used Nearest Neighbors:
Various algorithms have been designed to find the nearest neighbors. These were developed mainly due to the inefficiency of the core algorithm in higher dimensional space. We will talk about that later on but first, let’s see two types of nearest neighbors implementations we commonly use in practice:
1. K-Nearest Neighbors (KNN):
- Overview: KNN is the foundational and most widely used Nearest Neighbors algorithm. It classifies or predicts a new data point by considering the majority class or average of the k-nearest neighbors.
- How it Works:
- Given a new data point, the algorithm identifies its k-nearest neighbors based on a specified distance metric (e.g., Euclidean distance).
- For classification, it assigns the class label that is most common among the k-neighbors.
- For regression, it predicts the average of the target values of the k-neighbors.
- Parameters:
- k (Number of Neighbors): The user-defined parameter that specifies how many neighbors to consider.
- Distance Metric: The metric used to measure the distance between data points.
- Considerations:
- KNN is sensitive to the choice of distance metric and the value of k.
- It can be computationally expensive for large datasets as it needs to compute distances to all data points.
- Feature scaling is often recommended to ensure all features contribute equally to the distance calculation.
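Here is a minimal sketch of these two parameters in scikit-learn’s `KNeighborsClassifier`; the tiny two-cluster dataset is just an illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters of points, labelled 0 and 1.
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
clf.fit(X_train, y_train)               # "training" just stores the data
print(clf.predict([[2, 2], [9, 9]]))    # -> [0 1], by majority vote of the 3 neighbors
```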
2. Radius Neighbors: Radius Neighbors is an extension of KNN that considers all neighbors within a specified radius around a data point, rather than a fixed number (k).
- How it Works:
- For each data point, it identifies all neighbors within a user-defined radius.
- The algorithm can be used for both classification and regression tasks.
- In classification, the majority class among the neighbors is assigned.
- In regression, the average of the target values of the neighbors is used for prediction.
- Parameters:
- Radius: The maximum distance within which neighbors are considered.
- Considerations:
- It can be more flexible than KNN in situations where the number of relevant neighbors varies across the dataset.
- Like KNN, it is sensitive to the choice of distance metric.
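A minimal sketch of the radius-based variant with scikit-learn’s `RadiusNeighborsClassifier`, using the same toy data as above; the `outlier_label` setting decides what happens when no neighbor falls inside the radius:

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# All neighbors within distance 2.0 vote, however many there are.
clf = RadiusNeighborsClassifier(radius=2.0, outlier_label="most_frequent")
clf.fit(X_train, y_train)
print(clf.predict([[2, 2], [9, 9]]))    # -> [0 1]
```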
Nearest Neighbors Decision Boundary
There is a very simple way to understand the decision boundary of k-NN: the Voronoi diagram (or Voronoi tessellation). Suppose every training point belongs to a distinct class and we use 1-NN. The decision boundary is then exactly the set of Voronoi cell boundaries: a new test point is assigned to the cell it falls into, i.e. to the training point it is closest to. These cells are formed by the perpendicular bisectors of the line segments joining neighboring data points. When several nearby points share the same class, their cells merge, and the decision boundary becomes the combined outer boundary of those cells.
The decision boundary gets smoother as you increase the number of nearest neighbors because more and more training points contribute to the decision.
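As an illustration, the following sketch plots the decision regions of a 1-NN and a 7-NN classifier on random 2-D points (the data and the diagonal class split are arbitrary choices). With k = 1 the region edges trace Voronoi cell boundaries; with larger k they become smoother:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 2))           # 30 random 2-D training points
y = (X[:, 0] + X[:, 1] > 10).astype(int)       # two classes split along a diagonal

# Evaluate the classifier on a dense grid to draw the decision regions.
xx, yy = np.meshgrid(np.linspace(0, 10, 300), np.linspace(0, 10, 300))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, k in zip(axes, [1, 7]):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    zz = clf.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)          # shaded decision regions
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
    ax.set_title(f"k = {k}")
plt.show()
```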
“Curse of Dimensionality” In k-Nearest Neighbors (k-NN)
k-NN may sound like a simple model, but when you have higher-dimensional data, things can get very complicated, especially in terms of computational efficiency. Imagine you have a dataset in two dimensions, like a scatter plot where each point represents a data instance. Think of these instances as houses, and the two dimensions as features like square footage and number of bedrooms. In this 2D space, it’s easy to visualize and understand the relationships between the houses. You can easily find neighbors by looking at the proximity of points.
Now, let’s say you want to add more features, such as the distance to the nearest school, the crime rate in the neighborhood, and the year the house was built. These features add more dimensions to your dataset, turning it into a multi-dimensional space. As you add more and more dimensions, the space becomes increasingly “sparse,” meaning that data points become more spread out.
In high-dimensional spaces, the concept of “closeness” becomes less intuitive, because most points end up far away from each other. Consider a hypercube: as the number of dimensions grows, most of its volume concentrates near the surface and corners, leaving the interior mostly empty.
As you move to higher dimensions, this sparsity issue becomes more pronounced. In k-NN, the algorithm relies on the proximity of points to make predictions. As the dimensionality increases, finding truly close neighbors becomes challenging. Imagine trying to find the five nearest houses in a 10-dimensional space – the chances of having houses with similar values across all features become lower, making the concept of “neighborhood” less meaningful. This is called the curse of dimensionality.
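A small simulation (using uniform random data, an arbitrary choice) makes this concrete: as the number of dimensions grows, the nearest and the farthest point from a query become almost equally far away:

```python
import numpy as np

rng = np.random.default_rng(42)
n_points = 1000
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n_points, d))        # random points in the unit hypercube
    q = rng.uniform(size=d)                    # a random query point
    dists = np.linalg.norm(X - q, axis=1)
    print(f"d={d:5d}  nearest/farthest distance ratio = {dists.min() / dists.max():.3f}")
# The ratio approaches 1 in high dimensions, so "nearest" loses much of its meaning.
```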
The curse of dimensionality affects k-NN’s ability to accurately find neighbors and make predictions. More dimensions lead to a sparser distribution of data points, making it harder to identify a sufficient number of neighbors with similar characteristics. As a result, the predictions become less reliable, and the algorithm’s performance may suffer. So, how do we solve this problem?
There are various ways we can follow to solve the problem of the curse of dimensionality. We can change the way we select nearest neighbors or we can change the way we look for nearest neighbors in the search space. We can also perform feature selection, feature engineering and data preprocessing steps like Z-score normalization. Additionally, we can also perform dimensionality reduction on our dataset.
In addition to the above solutions, there are different data structures that have been developed to improve the computational performance of k-NN.
Brute-Force Algorithm
Basic k-NN uses the brute-force approach: it calculates the distance between the new data point and every single point in the training dataset, then selects the k nearest neighbors based on the smallest distances. It is straightforward to implement and works well for small datasets or low-dimensional feature spaces. However, it is computationally expensive for large datasets or high-dimensional feature spaces, and inefficient because it checks every data point.
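A minimal sketch of this brute-force search (the six toy points are arbitrary): compute the distance from the query to every training point, then take the k smallest, which is O(N × D) work per query.

```python
import numpy as np

def brute_force_knn(X_train, query, k=3):
    dists = np.linalg.norm(X_train - query, axis=1)   # distance to every training point
    return np.argsort(dists)[:k]                      # indices of the k nearest points

X_train = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]], dtype=float)
print(brute_force_knn(X_train, np.array([6.0, 5.0]), k=3))
```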
There are other algorithms that have been proposed and implemented to overcome this.
KD-Tree Algorithm
A KD-Tree (K-Dimensional Tree) is a data structure used for efficient multidimensional searching, for example in k-Nearest Neighbors (k-NN). It is designed to organize points in a k-dimensional space and facilitate fast retrieval of nearest neighbors.
An early approach to taking advantage of this aggregate information was the KD tree data structure (short for K-dimensional tree), which generalizes two-dimensional Quad-trees and 3-dimensional Oct-trees to an arbitrary number of dimensions. The KD tree is a binary tree structure which recursively partitions the parameter space along the data axes, dividing it into nested orthotropic regions into which data points are filed.
The construction of a KD tree is very fast: because partitioning is performed only along the data axes, no D-dimensional distances need to be computed. Once constructed, the nearest neighbor of a query point can be determined with only O[log(N)] distance computations. Though the KD tree approach is very fast for low-dimensional (D < 20) neighbors searches, it becomes inefficient as D grows very large: this is one manifestation of the so-called “curse of dimensionality”.
(Source: the scikit-learn documentation.)
This is how it works:
Construction of the KD-Tree:
- Recursive Partitioning:
- Objective: Divide the data into subsets along specific dimensions to create a binary tree structure.
- Procedure:
- Select a dimension for the current partition.
- Choose a value along that dimension to split the data into two subsets.
- Recursively apply the same process to each subset until a stopping condition is met.
- Dimension Selection:
- Criteria: The dimension for partitioning is selected based on a cycling order or by choosing the dimension with the maximum variance.
- Example: If the tree is being constructed in 3D space, the dimensions might be selected in the order X, Y, Z, X, Y, Z, and so on.
- Stopping Condition:
- Criteria: The tree construction stops when a certain condition is met, such as when the subset becomes small enough or a maximum depth is reached.
- Leaf Nodes: The final subsets become the leaf nodes of the tree.
Querying the KD-Tree In k-NN: Once the tree has been built, we can query it for nearest neighbors.
- Initialization:
- Start at the Root: Begin at the root node of the KD-Tree.
- Traversing the Tree:
- Comparison with Query Point: At each node, compare the query point with the splitting value along the selected dimension.
- Choosing a Child: Move to the left or right child node based on the comparison result.
- Recursion: Recursively apply the process until a leaf node is reached.
- Backtracking:
- Backtrack to Previous Nodes: After reaching a leaf node, backtrack to the parent nodes, checking if there are closer points in other regions of the tree.
- Distance Computation:
- Distance Calculation: Compute the distance between the query point and the points in the leaf node.
- Update Best Candidates: Keep track of the closest points found during the backtracking process.
- Pruning:
- Bounding Box Checks: During the backtracking, check if it’s necessary to explore the other child node based on bounding box checks.
- Eliminate Subtrees: Prune subtrees that cannot contain closer points than those already found.
Example:
Let’s imagine our dataset to be {(2,3),(5,4),(9,6),(4,7),(8,1),(7,2)}. Now these are the steps we follow in structure preparation:
- Initial Setup:
- Start with the entire dataset.
- Choose the first dimension (X) for the root node.
- Root Node (X-Dimension):
- Sort the dataset based on the X-coordinate: (2,3), (4,7), (5,4), (7,2), (8,1), (9,6)
- Choose the median point along the X-dimension as the root node: (7,2).
- Left Subtree (Lower X):
- Points to the left of the root have lower X-values.
- Recursively apply the KD-Tree construction to the left subset: (2,3),(4,7),(5,4).
- Choose the second dimension (Y) for the left subtree.
- Left Subtree (Y-Dimension):
- Sort the left subset based on the Y-coordinate: (2,3), (5,4), (4,7).
- Choose the median point along the Y-dimension as the left child of the root: (5,4).
- Right Subtree (Higher X):
- Points to the right of the root have higher X-values.
- Recursively apply the KD-Tree construction to the right subset: (8,1),(9,6).
- Choose the second dimension (Y) for the right subtree.
- Right Subtree (Y-Dimension):
- Sort the right subset based on the Y-coordinate: (8,1),(9,6).
- Choose the median point along the Y-dimension as the right child of the root: (9,6).
Essentially, we are building a data structure like this tree:
```
              (7,2)
             /     \
        (5,4)       (9,6)
        /   \       /
    (2,3)  (4,7)  (8,1)
```
Now it’s time to find the nearest neighbor. Let’s say we are querying the tree for a new data point (6, 5):
- Start at the Root:
- Begin at the root node (7,2), which splits on the X-dimension.
- Traverse to the Left (X-Dimension):
- Compare the query’s X-value with the root’s: 6 < 7, so move to the left child (5,4).
- Traverse to the Right (Y-Dimension):
- The node (5,4) splits on the Y-dimension. Compare 5 > 4, so move to the right child and reach the leaf (4,7).
- Record the First Candidate:
- Compute the distance from (6,5) to the leaf (4,7): √8 ≈ 2.83. This becomes the current best.
- Backtrack to (5,4):
- The distance from (6,5) to (5,4) is √2 ≈ 1.41, which is closer, so (5,4) becomes the new best.
- The query point is only 1 unit from the splitting line y = 4, which is less than 1.41, so the other child (2,3) must also be checked; its distance (≈ 4.47) does not improve the best.
- Backtrack to the Root and Check the Right Subtree:
- The distance from (6,5) to the root (7,2) is √10 ≈ 3.16, so no improvement.
- The query point is 1 unit from the splitting line x = 7, which is less than 1.41, so the right subtree must be explored: (9,6) and (8,1) are both farther than √2, so nothing changes.
- Final Result:
- The nearest neighbor to the query point (6,5) is (5,4).
Visual:
```
              (7,2)
             /     \
        (5,4)       (9,6)
        /   \       /
    (2,3)  (4,7)  (8,1)

Start at root (7,2)          split on X: 6 < 7  -> go left
        ↓
Node (5,4)                   split on Y: 5 > 4  -> go right
        ↓
Leaf (4,7)                   best so far: distance ≈ 2.83
        ↓
Backtrack to (5,4)           distance ≈ 1.41 -> new best; check (2,3): no improvement
        ↓
Backtrack to root (7,2)      distance ≈ 3.16; check right subtree (9,6), (8,1): no improvement
        ↓
Final result: nearest neighbor (5,4)
```
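As a quick sanity check, here is a minimal sketch that builds a KD-tree over the same six points with SciPy’s `scipy.spatial.KDTree` and queries it. SciPy’s internal splitting rules may differ slightly from the median-split walkthrough above, but the answer is the same:

```python
import numpy as np
from scipy.spatial import KDTree

points = np.array([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)], dtype=float)
tree = KDTree(points)                  # construction: recursive axis-aligned splits

dist, idx = tree.query([6, 5], k=1)    # nearest neighbor of the query point (6, 5)
print(points[idx], dist)               # -> [5. 4.]  1.414...
```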
The KD-Tree significantly reduces the search space during the nearest-neighbor search, making it more efficient than brute-force methods. The tree tends to be balanced, ensuring that the search process covers a well-defined space. KD-Trees are well suited to low- to moderate-dimensional spaces, where they can outperform brute-force search. However, the KD-Tree also struggles with the curse of dimensionality: its efficiency degrades as the dimensionality increases. Building the tree can also be computationally expensive, especially for large datasets. Apart from k-NN, KD-Trees are used in other applications as well.
Ball Tree Algorithm
The Ball Tree algorithm offers an alternative approach to the KD Tree. Instead of partitioning along individual dimensions, it organizes the data into “balls,” each containing multiple data points. This grouping helps address the sparse distribution of points in high-dimensional spaces. Imagine we have a dataset with houses represented by their square footage, number of bedrooms, and distance to the nearest school. In a high-dimensional space, the Ball Tree organizes these houses into clusters or “balls,” grouping together houses that are close to each other.
The Ball Tree dynamically chooses representative balls during its construction. It selects dimensions based on the maximum spread or variance, allowing it to adapt to the specific characteristics of the data. When a k-NN query is made, the Ball Tree is traversed, moving to balls that are closer to the query point. This avoids exhaustive searches through all data points, especially beneficial in high-dimensional spaces where the curse of dimensionality is more pronounced.
The Ball Tree excels in situations where the KD-Tree struggles. By grouping similar points into balls, the algorithm reduces the search space, leading to more efficient k-NN queries. This efficiency is particularly valuable in machine learning tasks dealing with datasets where the number of features is large.
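A minimal sketch of building and querying a Ball Tree with scikit-learn’s `BallTree` class; the random 15-dimensional data is just an example:

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 15))        # 1000 points in a 15-dimensional space

tree = BallTree(X, leaf_size=40)       # points are grouped into nested "balls"
dist, ind = tree.query(X[:1], k=5)     # 5 nearest neighbors of the first point
print(ind, dist)
```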
If you want to know more about Ball Tree construction, read this paper. Also, note that you don’t need to memorize these algorithms; they are used internally by k-NN implementations, so you won’t interact with them directly when using KNN.
Although not important, If you are interested in understanding the Bayesian perspective of KNN I recommend reading this paper.
Application In Python:
Now let’s look at how we can implement it in real-world problems.
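Below is a minimal end-to-end sketch using scikit-learn’s `KNeighborsClassifier`; the Iris dataset and the choice of k = 5 are assumptions made for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale the features so each one contributes equally to the distance calculation.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```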
Pros and Cons of Nearest Neighbors:
Pros:
- Simplicity: NN is easy to understand and implement. The underlying idea is straightforward – predictions are based on the majority class or average of the k-nearest neighbors.
- No Training Phase: Nearest Neighbors doesn’t require a training phase, which can be beneficial in scenarios where the data is constantly changing or when labeled data is limited.
- Non-parametric: NN is a non-parametric algorithm, meaning it doesn’t make any assumptions about the underlying data distribution. This makes it more flexible and applicable to various types of data.
- Adaptability to Data: NN can handle complex decision boundaries and is effective when dealing with non-linear relationships in the data.
- Useful for Anomaly Detection: Nearest Neighbors can be used for anomaly detection since outliers are likely to have fewer neighbors and, therefore, stand out in the data.
- No Model Training Overhead: Since there is no model training phase, getting started is quick; the computational cost is simply shifted to the prediction phase.
Cons:
- Computational Complexity: Predicting a new instance’s class involves finding the nearest neighbors, which can be computationally expensive, especially with large datasets.
- Sensitive to Noise and Outliers: NN can be sensitive to noise and outliers in the data, as they can significantly influence the nearest neighbors’ selection.
- Curse of Dimensionality: As the number of features (dimensions) increases, the distance between points becomes less meaningful, which can negatively impact the performance of NN. This is known as the “curse of dimensionality.”
- Memory Usage: The algorithm needs to store the entire training dataset in memory, making it memory-intensive, especially for large datasets.
- Choosing Optimal ‘k’: The performance of NN is highly dependent on the choice of the parameter ‘k’ (number of neighbors). Selecting an optimal ‘k’ value is a subjective task and might require experimentation.
- Imbalanced Data: NN can struggle with imbalanced datasets, where one class significantly outnumbers the others, as it tends to predict the majority class more often.
Here is a quick comparison of Nearest Neighbors with Decision Trees and SVM:

| Aspect | Nearest Neighbors (NN) | Decision Trees | Support Vector Machines (SVM) |
| --- | --- | --- | --- |
| Complexity | Simple | Medium | Medium to High |
| Interpretability | Low | High | Low |
| Non-linearity Handling | Good | Good | Excellent |
| Computational Complexity | High | Medium | High |
| Handling Outliers | Sensitive | Less sensitive | Less sensitive |
| Imbalanced Data | Challenging | Biased toward majority class | May require parameter tuning |
Tips for Improving the Performance of Nearest Neighbors Models:
Feature Selection and Preprocessing:
- Feature Scaling:
- Standardize or normalize features to ensure that all variables contribute equally to the distance calculation.
- Dimensionality Reduction:
- Use techniques like Principal Component Analysis (PCA) to reduce the dimensionality of the dataset, especially if it’s high-dimensional.
- Remove Irrelevant Features:
- Identify and remove features that do not contribute significantly to the classification or regression task.
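Putting these preprocessing tips together, here is a minimal sketch of a scaling + PCA + k-NN pipeline; the Wine dataset and the 95% variance threshold are example choices:

```python
from sklearn.datasets import load_wine
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),           # equal contribution to the distance
    ("pca", PCA(n_components=0.95)),       # keep components explaining 95% of variance
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
print(cross_val_score(pipe, X, y, cv=5).mean())
```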
Model Tuning:
- Optimal ‘k’:
- Experiment with different values of ‘k’ to find the optimal number of neighbors for your specific dataset. Perform cross-validation to evaluate the model’s performance with different ‘k’ values.
- Distance Metric:
- Try different distance metrics (e.g., Euclidean, Manhattan, Minkowski) based on the characteristics of your data. Some metrics may be more suitable for certain types of features.
- Use Weighted Voting:
- Consider using weighted voting, where closer neighbors have a higher influence on the prediction.
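A minimal sketch of tuning 'k', the voting scheme, and the distance metric with cross-validated grid search; the Breast Cancer dataset is an example choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 11, 15],
    "knn__weights": ["uniform", "distance"],          # weighted voting
    "knn__metric": ["euclidean", "manhattan", "minkowski"],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```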
Handling Imbalanced Data:
- Resampling Techniques:
- Apply resampling techniques like under-sampling or over-sampling to balance the class distribution.
- Adjust Class Weights:
- If the imbalance is severe, adjust class weights to penalize misclassifications of the minority class more.
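A minimal sketch of the resampling idea, assuming the third-party imbalanced-learn package is installed (scikit-learn’s k-NN has no class_weight option, so resampling or distance weighting are the usual workarounds):

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import RandomOverSampler

# A synthetic 90/10 imbalanced classification problem, just for illustration.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))

KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_res, y_res)
```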
Troubleshooting and Common Issues:
- Data Quality Check:
- Ensure that your dataset is clean and there are no missing or incorrect values. Nearest Neighbors can be sensitive to outliers and noisy data.
- Curse of Dimensionality:
- If you experience poor performance with high-dimensional data, consider dimensionality reduction techniques to mitigate the curse of dimensionality.
- Optimal Distance Metric:
- Experiment with different distance metrics, as the default may not be suitable for your specific data distribution.
- Memory Constraints:
- Be mindful of the algorithm’s memory requirements, especially with large datasets. Consider using approximate nearest neighbors methods for efficiency.
- Grid Search:
- Use grid search for hyperparameter tuning, exploring different combinations of parameters to find the best model configuration.
- Visualization:
- Visualize the decision boundaries and distribution of classes to gain insights into the behavior of the algorithm.
- Check Class Distribution:
- Ensure that each class has a sufficient number of samples for the algorithm to make meaningful predictions.
- Feature Importance Analysis:
- Conduct feature importance analysis to understand which features contribute most to the model’s predictions.
- Cross-Validation:
- Perform cross-validation to assess the model’s generalization performance and identify potential overfitting issues.
- Optimize Computational Efficiency:
- For large datasets, consider using approximate nearest neighbors methods or other algorithms better suited for scalability.
- Evaluate Results:
- Carefully evaluate performance metrics, such as precision, recall, F1 score, and ROC-AUC, to understand the strengths and weaknesses of the model.
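For the evaluation step, here is a minimal sketch combining cross-validation with a detailed classification report and ROC-AUC; the dataset is again just an example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))

print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
model.fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))          # precision, recall, F1
print("ROC-AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```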
Recommended Readings And Sources:
- https://sebastianraschka.com/pdf/lecture-notes/stat479fs18/02_knn_notes.pdf
- https://www.inf.ed.ac.uk/teaching/courses/inf2b/lectureSchedule.html
Sources:
- K-nearest neighbor
- Nearest Neighbors
- https://www.stat.cmu.edu/~cshalizi/dm/22/lectures/11/lecture-11.pdf
- http://ciml.info/dl/v0_99/ciml-v0_99-ch03.pdf
- Explaining the Success of Nearest Neighbor Methods in Prediction by George H. Chen, and Devavrat Shah