In the previous post, we learned about SVD and how to use it for low-rank approximation. Building on those concepts, let’s now learn about principal component analysis (PCA) and how to use this tool for machine learning problems. Imagine you are working with a dataset that has many features, that is, a high-dimensional dataset. The first problem you will face is figuring out what these features say about the data and which of them are important.
Additionally, it becomes really challenging to visualize such features since they exist in a higher dimension. That’s where PCA comes in to help us. One thing to note: since we are working with the features only and not the target, PCA is an unsupervised learning algorithm. Before we start, here is what you will learn in this post:
- Concept of Principal Component Analysis (PCA)
- How to perform PCA on datasets
- Dimensionality Reduction and Feature Extraction
- PCA Implementation Using Scikit-Learn
What is Principal Component Analysis?
Here are some definitions that will help you get an idea of what PCA is:
Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data. Formally, PCA is a statistical technique for reducing the dimensionality of a dataset. This is accomplished by linearly transforming the data into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.
– Wikipedia [1]
Principal component analysis (PCA) is a multivariate technique that analyzes a data table in which observations are described by several inter-correlated quantitative dependent variables. Its goal is to extract the important information from the table, to represent it as a set of new orthogonal variables called principal components, and to display the pattern of similarity of the observations and of the variables as points in maps.
– Hervé Abdi and Lynne J. Williams [2]
Principal components analysis (PCA) is one of a family of techniques for taking high-dimensional data and using the dependencies between the variables to represent it in a more tractable, lower-dimensional form, without losing too much information. PCA is one of the simplest and most robust ways of doing such dimensionality reduction. It is also one of the oldest and has been rediscovered many times in many fields, so it is also known as the Karhunen-Loève transformation, the Hotelling transformation, the method of empirical orthogonal functions, and singular value decomposition.
– Cosma Shalizi [3]
Principal component analysis (PCA) has been called one of the most valuable results from applied linear algebra. PCA is used abundantly in all forms of analysis – from neuroscience to computer graphics – because it is a simple, non-parametric method of extracting relevant information from confusing data sets. With minimal additional effort, PCA provides a roadmap for how to reduce a complex data set to a lower dimension to reveal the sometimes hidden, simplified structure that often underlies it.
– Jonathon Shlens [4]
I hope the above definitions gave you an idea of what PCA is all about. Now let’s move on, starting with some basic concepts, then the more complex ones, and finally applying PCA to our datasets.
Goals Of PCA:
- Extract the most important information from the data table;
- Compress the size of the data set by keeping only important information
- Simplify the description of the data set; and
- Analyze the structure of the observations and the variables.
The Concept Of Covariance Matrix (Σ)
To understand the covariance matrix you need to know these basic terms:
Variance: variance measures the spread or dispersion of a set of values in a dataset. For a one-dimensional dataset, the variance is calculated as the average of the squared differences between each data point and the mean of the dataset.
Var(X) = Σ (Xᵢ - X̄)² / n
Covariance: Covariance is a measure of how much two random variables change together. The sign of the covariance shows the tendency of the linear relationship between variables.
Cov(X, Y) = Σ (Xᵢ - X̄)(Yᵢ - Ȳ) / n
Also, note one important thing: the covariance of a variable with itself is simply its variance. If you substitute Y = X into the equation above, you recover the formula for variance.
cov(x, x) = var(x)
Additionally, you need to note this as well:
cov(x, y) = cov(y, x)
Correlation: Correlation on the other hand is also a measure of how two variables change together, but it’s more standardized and provides information about the strength and direction of the relationship. The correlation coefficient always ranges between -1 and 1. A correlation of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. Just know that they only talk about linear relationships, nothing else.
correlation (X, Y) = covariance (X, Y) / (std_dev (X) * std_dev (Y))
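Here is a quick NumPy check of these relationships (a small sketch with made-up numbers; note that the formulas above divide by n, while NumPy’s sample versions with ddof=1 divide by n − 1, so we use ddof=1 consistently):
import numpy as np
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])
cov_xy = np.cov(x, y)[0, 1]                               # sample covariance of x and y
print(np.cov(x, x)[0, 1], np.var(x, ddof=1))              # cov(x, x) equals var(x)
corr_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(corr_manual, np.corrcoef(x, y)[0, 1])               # both give the same correlation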
Difference Between Covariance And Correlation:
You need to understand the difference between these two terms before we move further:
Interpretation:
- Covariance: Covariance indicates the direction of a linear relationship (positive covariance means a positive relationship, negative covariance a negative one), but its magnitude is not standardized, making it challenging to compare the strengths of relationships between different pairs of variables.
- Correlation: The correlation coefficient provides a standardized measure of both the strength and direction of a linear relationship. A correlation of 0 means no linear relationship, 1 indicates a perfect positive linear relationship, and -1 indicates a perfect negative linear relationship.
Units:
- Covariance: The units of covariance are the product of the units of the two variables. This makes it difficult to compare covariances between variables measured in different units.
- Correlation: Being a dimensionless quantity, correlation is not affected by the units of measurement of the variables. This allows for more straightforward comparisons between different pairs of variables.
Range:
- Covariance: The range of covariance is unbounded.
- Correlation: The range of correlation is always between -1 and 1.
Scale Independence:
- Covariance: Covariance can be influenced by the scale of the variables. Therefore, it may not be a suitable measure for comparing the strength of relationships between different pairs of variables.
- Correlation: Correlation is not influenced by the scale of the variables, making it a more robust measure for comparing the strength and direction of relationships between different pairs of variables.
Correlation is often preferred in practice due to its normalization and ease of interpretation.
Now that we have an idea of what these terms are, let’s build our intuition for higher-dimensional datasets so that we can understand the covariance matrix.
Imagine a column vector v1 of size n × 1 (n rows and 1 column) whose elements are v11, v21, v31, …, vn1. Similarly, you can imagine p such vectors and their elements. Now, let’s consider a dataset with p = 3 and n = 4 and see how they fit:
   v1        v2        v3
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
2  0.950088 -0.151357 -0.103219
3  0.410598  0.144044  1.454274
The columns (features) of a dataset can be viewed as vectors (v1, v2 and v3) and the entries of each column (v11, v21, v31, v41; v12, v22, v32, v42; v13, v23, v33, v43) as the rows. When we deal with such a multi-dimensional dataset, we need a compact way to represent all the variances and covariances; that is where the covariance matrix comes in. If we compute the covariance for every pair of columns of the above dataset, we can arrange the results in a sample covariance matrix. For the above dataset:
      v1        v2        v3
v1  1.571125  0.788607 -0.609854
v2  0.788607  1.434861 -0.242534
v3 -0.609854 -0.242534  1.160703
Notice that in the above covariance matrix:
- The diagonal elements (e.g., 1.571125, 1.434861, 1.160703) represent the variances of individual variables.
- The off-diagonal elements (e.g., 0.788607, -0.609854, etc.) represent the covariances between pairs of variables.
The diagonal elements will always give you the variances of the individual variables because each is just the covariance of a variable with itself. Now, if the variables are entirely uncorrelated, the covariance matrix has zeros in all off-diagonal positions and the diagonal elements are the variances. This is what it looks like for three variables (you can extend it to p variables and n rows); a short code sketch for computing a covariance matrix follows after the matrix:
Symbolic Covariance Matrix for Uncorrelated Variables:
σ1² 0 0
0 σ2² 0
0 0 σ3²
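Here is a minimal sketch of how such a covariance matrix can be computed with pandas/NumPy. The random data below is only illustrative, so the exact numbers will differ from the tables above:
import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((4, 3)), columns=["v1", "v2", "v3"])
cov_matrix = df.cov()                                # p x p sample covariance matrix (3 x 3 here)
print(cov_matrix)
print(np.allclose(np.diag(cov_matrix), df.var()))    # diagonal entries equal the variances
print(np.allclose(cov_matrix, cov_matrix.T))         # the matrix is symmetric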
Now that we have the basic intuition of a higher-dimensional dataset and its covariance matrix, let’s define it formally:
Definition of Covariance Matrix
The covariance matrix or variance-covariance Matrix is a symmetric and positive semi-definite matrix that summarizes the variances and covariances between different variables in a multivariate dataset. It is often denoted by Σ and has dimensions p × p, where p is the number of variables.
For a set of n observations on p variables, the elements of the covariance matrix Σ are:
- Diagonal elements (i = j): Σii represents the variance of the ith variable.
- Off-diagonal elements (i ≠ j): Σij represents the covariance between the ith and jth variables.
Properties of Covariance Matrix:
Symmetry:
- Property: Σ = ΣT (The matrix is symmetric).
- Reason: The covariance between variables Xi and Xj is the same as the covariance between Xj and Xi . This symmetry arises from the definition of covariance as the average product of deviations from the means.
Diagonal Elements:
- Property: Diagonal elements represent variances.
- Reason: The diagonal elements represent the covariance of a variable with itself, which is equivalent to its variance. Covariance with itself captures the spread or variability of each variable.
Non-Negativity of Diagonal Elements:
- Property: The diagonal elements (variances) are non-negative, i.e., greater than or equal to zero, never negative. The off-diagonal elements (covariances), in contrast, can be negative.
Invertibility (Under Certain Conditions):
- Property: Σ is invertible (non-singular) if variables are linearly independent.
- Reason: Invertibility ensures that there is no perfect linear relationship between variables. If the variables are linearly dependent, the matrix becomes singular and non-invertible.
Normalization:
- Property: The correlation matrix (ρ) is obtained by normalizing the covariance matrix.
- Reason: Normalizing by the standard deviations converts covariances to correlations, making the values unitless and facilitating comparison across different scales.
Orthogonality Of Eigenvectors:
- For a symmetric matrix (including a symmetric covariance matrix), the eigenvectors corresponding to distinct eigenvalues are orthogonal. We will use this property in PCA!!
Positive Semidefinite Property:
- The covariance matrix Σ is positive semidefinite, denoted as Σ ⪰ 0, which means that for any non-zero vector v, the expression vTΣv is nonnegative.
- The positive semidefinite property does not mean every entry of the matrix is non-negative; it means that no linear combination of the variables can have negative variance, since Var(vTX) = vTΣv ≥ 0 for any vector v.
- This means it guarantees that the covariances and variances in the matrix are consistent with the inherent variability and relationships among the variables.
- In the context of statistical analysis and machine learning, this property contributes to the stability of algorithms involving covariance matrices such as PCA that we are about to learn.
- All eigenvalues of a positive semidefinite matrix are nonnegative, providing insights into the variability along different dimensions.
Note: If you are a beginner you may ask: the elements (covariances) of the covariance matrix can be negative too, so what does “positive semidefinite” mean? This property does not imply that the entries of the covariance matrix are non-negative; it refers to a mathematical characteristic of the matrix as a whole. A matrix Σ is positive semidefinite if and only if all its eigenvalues are non-negative, which is equivalent to saying that the quadratic form vTΣv is non-negative for every vector v. The entries themselves represent covariances and variances: covariances can indeed be negative, indicating a negative linear relationship between variables, while the variances (diagonal elements) are non-negative.
The term “semidefinite” is used because positive semidefinite matrices allow for the possibility of zero eigenvalues. If all eigenvalues are strictly greater than zero, the matrix is called positive definite; if some eigenvalues are zero, it is positive semidefinite. So, when we say a covariance matrix is positive semidefinite, we mean that its mathematical properties guarantee a certain stability and well-behavedness in statistical and machine learning contexts, not that all its elements are non-negative.
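A small numerical sketch of this point: a covariance matrix computed from data can contain negative covariances and still be positive semidefinite (all eigenvalues non-negative, quadratic form non-negative). The data below is synthetic and only illustrative:
import numpy as np
rng = np.random.default_rng(1)
data = rng.standard_normal((50, 3))
data[:, 1] = -0.8 * data[:, 0] + 0.2 * data[:, 1]   # force a negative covariance between two columns
sigma = np.cov(data, rowvar=False)
print(sigma)                                        # some off-diagonal entries are negative
print(np.linalg.eigvalsh(sigma))                    # yet all eigenvalues are >= 0
v = rng.standard_normal(3)
print(v @ sigma @ v >= 0)                           # the quadratic form vᵀΣv is non-negative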
I hope this was clear to you. Now, let’s see what is a correlation matrix. We use the correlation matrix in PCA as well.
Difference Between Normalization And Standardization
- Normalization is the process of scaling individual samples to have unit norm, typically the L2 (Euclidean) norm. To calculate it, simply divide each sample vector by its L2 norm.
- Standardization, also known as z-score normalization, scales each feature to have a mean of 0 and a standard deviation of 1. To calculate it, subtract the feature’s mean and then divide by its standard deviation. A short sketch of both follows below.
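Here is a small sketch of both with scikit-learn (the numbers are made up, just to show that normalization acts on rows and standardization on columns):
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler
X = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 40.0, 250.0]])
X_norm = Normalizer(norm="l2").fit_transform(X)     # normalization: each row scaled to unit L2 norm
print(np.linalg.norm(X_norm, axis=1))               # every row now has norm 1
X_std = StandardScaler().fit_transform(X)           # standardization: each column gets mean 0, std 1
print(X_std.mean(axis=0), X_std.std(axis=0))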
Correlation Matrix (ρ):
A correlation matrix is nothing but the standardized version of the covariance matrix that provides a measure of the linear relationship between variables while accounting for differences in their scales.
Properties Of (ρ):
- Symmetry:
- The correlation matrix is symmetric, similar to the covariance matrix.
- Diagonal Elements:
- The diagonal elements are always equal to 1, as a variable has a perfect correlation with itself.
- Range of Values:
- Correlation coefficients lie in the range [−1, 1], unlike covariances, which are unbounded.
- Unit Diagonal Matrix:
- The correlation matrix has ones on the diagonal and correlation coefficients off the diagonal, making it a unit diagonal matrix.
- Positive Semidefinite: The correlation matrix is positive semidefinite.
- Eigenvalues: All eigenvalues of a correlation matrix are non-negative.
With these basic concepts in mind, we can now proceed further to understand principal component analysis. Please make sure to understand the above concepts clearly before moving ahead. You can ask your questions in the forum if any.
Key Terms 💡
- Positive Semidefinite Matrix:
- A symmetric n×n real matrix M for which zTMz ≥ 0 for all vectors z with real entries; equivalently, all its eigenvalues are non-negative.
- Negative Semidefinite Matrix:
- A symmetric n×n real matrix M for which zTMz ≤ 0 for all vectors z with real entries; equivalently, all its eigenvalues are non-positive.
- Invertible Matrix:
- A square matrix A for which there exists another matrix B such that AB = BA = I, where I is the identity matrix.
- Singular Matrix:
- A square matrix that is not invertible, i.e., its determinant is zero.
- Positive Definite Matrix:
- A symmetric n×n real matrix M for which zTMz > 0 for all non-zero vectors z with real entries.
- Negative Definite Matrix:
- A symmetric n×n real matrix M for which zTMz < 0 for all non-zero vectors z with real entries.
- Eigenvalues:
- Scalars λ for which the equation Av = λv has a non-zero solution v, where A is a square matrix.
- Eigenvectors:
- Non-zero vectors v that satisfy the equation Av = λv, where A is a square matrix and λ is an eigenvalue.
- Characteristic Equation:
- The equation det(A – λI) = 0, used to find the eigenvalues of a matrix A.
- Diagonalization:
- The process of finding a diagonal matrix similar to a given matrix.
- Linearly Independent:
- A set of vectors {v1, …, vn} is linearly independent if the equation c1v1 + … + cnvn = 0 is only satisfied when all ci = 0.
- Linearly Dependent:
- A set of vectors that is not linearly independent.
- Basis:
- A linearly independent set of vectors that spans a vector space.
- Span:
- The set of all linear combinations of a set of vectors.
- Standardization:
- A preprocessing technique that transforms features to have a mean (μ) of 0 and a standard deviation (σ) of 1, using the formula: z = (x – μ) / σ.
- Z-score Normalization:
- Another term for standardization, referring to the resulting standard normal distribution.
- Normalization:
- A preprocessing technique that scales features to a fixed range, typically [0, 1], using the formula: x’ = (x – min(x)) / (max(x) – min(x)).
Principal Component Analysis: Derivation And Concepts
We use principal component analysis primarily when working with higher-dimensional datasets: we project the data onto a lower dimension while preserving as much information as possible. Some information is inevitably lost in the process, and this loss is called the reconstruction error; so if we want a good projection, we need to minimize the amount of information we lose, i.e., the reconstruction error. Equivalently, we need to maximize the variance of the projected data. So, here we have two ways to approach this problem:
- Minimize the reconstruction error.
- Maximize the variance.
Notice that we don’t need to treat the two points above separately, because minimizing the reconstruction error and maximizing the variance are equivalent. Additionally, PCA can be derived in other ways that I will talk about shortly, but first let’s look at an example image:
Notice how 3D points are projected to 2D. The first and second principal components carry the maximum information or variation. Any 2 × 2 matrix A can be expressed as a linear combination of the standard basis matrices: A = a11·E11 + a12·E12 + a21·E21 + a22·E22. The main goal of PCA is to find another basis, a linear combination of the original basis, that best represents our dataset. This is what we set out to find.
Let X be an m × n matrix representing our dataset and P a matrix that transforms X into Y (Y = PX). The rows of P, {p1, . . . , pm}, form a new set of basis vectors for expressing the columns of X; the equation Y = PX is simply a change of basis (a projection). The row vectors {p1, . . . , pm} will become the principal components of X.
Now, once we have transformed the dataset into the new basis, we want to maximize the variance captured. PCA assumes that the basis vectors {p1, . . . , pm} are orthonormal and that the direction in which the variance is maximized is the most important. So, here is how it goes:
- Choose a unit vector in an m-dimensional space that maximizes the variance in X. Save this vector as p1.
- Find another direction in which variance is maximized but restrict the search to all directions perpendicular to all previously selected directions.
- Keep going in the same way until m directions have been selected.
This is the basic algorithm we use for PCA. Principal components are the key vectors obtained through Principal Component Analysis. Each principal component is a linear combination of the original variables, capturing the maximum variance in the data. The first principal component (PC1) captures the maximum variance present in the data.
Subsequent components (PC2, PC3, etc.) capture decreasing amounts of variance, ensuring an ordered representation of variability. The sum of the variances of all principal components equals the total variance in the data. This conservation property ensures that no information is lost during the transformation.
Assumptions Of PCA (For Calculation Simplicity):
PCA assumes the following things for the sake of simplicity and to make the best use of linear algebra tricks:
- Mean and variance are sufficient statistics – this strictly holds only for Gaussian-distributed data.
- Linearity – this can be extended to non-linear settings using kernel PCA.
- Large variances have important dynamics.
- The principal components are orthogonal.
Why Is Orthogonality Preferred Between The Principal Components?
There are various reasons for choosing orthogonality:
- Variation captured by one principal component is not duplicated by others.
- Independence simplifies the interpretation of each principal component.
- In the absence of orthogonality, the variance captured by one component would be redundant or correlated with the variance captured by another component.
- We will see that PCA involves the computation of the covariance matrix of the standardized data. Diagonalizing the covariance matrix (making it diagonal) results in orthogonal eigenvectors, which become the principal components. This process ensures that the principal components are uncorrelated and orthogonal.
- Orthogonal basis vectors simplify mathematical computations, such as the eigenvalue decomposition of the covariance matrix.
- Orthogonality helps mitigate multicollinearity issues that can arise when variables are highly correlated. In the principal component space, each component is uncorrelated, reducing the risk of multicollinearity.
- Orthogonal matrices are computationally efficient. Eigen-decomposition of orthogonal matrices involves simpler algebraic operations compared to non-orthogonal matrices.
Derivation Of PCA
In this section, we will derive PCA and show you how to calculate the principal components. One thing to keep in mind: we always centre the data before doing any calculation, and if the scales of the variables are very different we can standardize the dataset as well.
Why Do We Centre The Data?
- Centering the data involves subtracting the mean of each variable from the corresponding observations. This centres the data around the origin. By centring the data, we ensure that the origin (0, 0, …, 0) is the mean of the data. This is important because PCA is sensitive to the location of the data points.
- If the data is not centred, the origin may not coincide with the mean of the data. In such cases, the principal components might be influenced more by the location of the data points than by their dispersion.
- The covariance between two variables involves the product of deviations from their means. If the data is not centred, these deviations include the overall location of the data, affecting the covariance.
- Centring removes translation effects, allowing the principal components to capture the intrinsic variability and relationships among variables, rather than being influenced by the overall position of the data.
- Centring ensures that the mean of each variable is zero. If the data is not centred, the means of variables can introduce biases in the covariance matrix, affecting the orthogonality of the principal components.
- Centering allows for meaningful comparisons across datasets. If datasets have different means, their covariance structures might differ, making it challenging to compare principal components directly.
PCA Derivation Using Covariance Matrix Or By Diagonalizing The Covariance Matrix
Our goal is to find a new basis by maximizing the variance. We already know that the covariance matrix contains information about the direction of maximal variance. So, we will make a non-diagonal covariance matrix diagonal by rotating the coordinate system which is done using diagonalization.
The eigenvectors of the covariance matrix align with the directions of both maximum and minimum variance. The corresponding eigenvalues represent the variances along these directions. Optimal linear dimensionality reduction is achieved by projecting the data onto the eigenvectors with the largest eigenvalues.
Here is the detailed derivation using the covariance matrix. Please read and understand carefully and ask your queries in the forums.
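Alongside the derivation, here is a small numerical sketch of the idea (synthetic data, illustrative only): rotating centred data onto the eigenvectors of its covariance matrix makes the covariance matrix diagonal, and the diagonal entries are the eigenvalues, i.e., the variances along the principal components.
import numpy as np
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500)
Xc = X - X.mean(axis=0)                         # centre the data
C = np.cov(Xc, rowvar=False)                    # covariance matrix (not diagonal)
eigenvalues, eigenvectors = np.linalg.eigh(C)   # eigh is for symmetric matrices
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(2)))   # eigenvectors are orthonormal
Y = Xc @ eigenvectors                           # rotate the data onto the eigenvector basis
print(np.round(np.cov(Y, rowvar=False), 6))     # now (numerically) diagonal
print(np.round(eigenvalues, 6))                 # diagonal entries = eigenvalues = variances along PCs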
PCA Derivation Using Reconstruction or Projection Error:
We can also derive PCA in another way. When we project our data our goal here is to minimize the projection error. Here is the intuition for this:
Here is the derivation:
PCA Vs Regression:
In regression, the projection direction is always vertical (errors are measured along the target axis), whereas in PCA the projection direction is orthogonal to the projection hyperplane. In both methods, however, we minimize the sum of the squares of the projection distances. [7]
PCA Derivation Using SVD And Why Principal Components Analysis (PCA) Is A Special Case Of The SVD
Here is why:
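A short numerical check of this equivalence (synthetic data, illustrative only): for centred data X, the right singular vectors of X are the eigenvectors of its covariance matrix (up to sign), and the eigenvalues equal the squared singular values divided by n − 1.
import numpy as np
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3)) @ np.array([[2.0, 0.0, 0.0],
                                              [0.5, 1.0, 0.0],
                                              [0.0, 0.3, 0.5]])
Xc = X - X.mean(axis=0)                                  # centre the data first
n = Xc.shape[0]
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]       # sort in descending order
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)        # Xc = U S V^T
print(np.allclose(eigvals, S**2 / (n - 1)))              # eigenvalues = squared singular values / (n - 1)
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T)))        # right singular vectors = eigenvectors (up to sign)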
If you want to know more about SVD, check out my previous post on SVD.
Things To Know:
- The total variance of the data is the sum of the eigenvalues of its covariance matrix.
- The ith principal component accounts for a proportion λi / (λ1 + λ2 + … + λp) of the total variation in the original data, where λi is its eigenvalue.
- You can also perform spectral decomposition on the covariance matrix to know the contribution of respective PCs.
- There is no single rule for choosing which PCs to keep, but researchers usually suggest discarding components whose eigenvalues are less than one (when working with the correlation matrix) and keeping only those PCs that account for a large proportion of the total variance. The arbitrariness of this choice is a serious drawback of the method. We will see how to select important PCs in practice shortly.
- The modern tendency is to view Principal Component Analysis as a mathematical technique with no underlying statistical model.
- Principal components are artificial variables and often it is not possible to assign physical meaning to them.
- If the original variables are uncorrelated, then there is no point in carrying out the Principal Components Analysis.
- Principal components depend on the scale of measurement. A conventional way of getting rid of this problem is to use standardized variables with unit variances.
Scaling Problem In PCA:
Principal components generally change under rescaling of the variables and are therefore not a unique characteristic of the data. If one variable has far more variability than the others, it will dominate the first principal component of the covariance matrix regardless of the correlation structure; if instead we scale all variables to have the same variance, the first principal component comes out quite different. Because of these issues, covariance-based PCA is often considered appropriate only when the variables have roughly similar variances, for example when they are percentages or measured in the same units.
To tackle the scaling problem, a common approach is to work with the correlation matrix instead of the covariance matrix. This choice is still somewhat arbitrary and depends on the data: it does not completely solve the scaling issue, but it avoids it.
If the variables are not considered equally important, using the correlation matrix is not recommended. Using the correlation matrix also makes it harder to compare PCA results between different samples, and when you transform the variables back to their original scale, the principal components of the correlation matrix are no longer perpendicular, because rescaling two perpendicular directions in Euclidean space does not generally keep them perpendicular. This is part of why the scaling problem shows up in the first place.
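Here is a small sketch of the scaling issue with synthetic data (illustrative only): on the raw scale the high-variance feature dominates the first component, while after standardization (equivalent to working with the correlation matrix) the variance is shared far more evenly.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
rng = np.random.default_rng(0)
X = np.column_stack([rng.standard_normal(200),           # variance around 1
                     1000 * rng.standard_normal(200)])   # variance around 1,000,000
print(PCA(n_components=2).fit(X).explained_variance_ratio_)      # PC1 explains almost everything
X_std = StandardScaler().fit_transform(X)
print(PCA(n_components=2).fit(X_std).explained_variance_ratio_)  # variance split roughly evenly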
PCA Applications: What We Can Do With It?
There are so many things we can do with PCA. Here are a few:
- One can look for outliers, groups, or ‘clusters’ of individuals. This is an important use of Principal Component Analysis, and it often reveals groupings of observations or variables that would not be found by other means.
- When you have more than three variables, making plots gets tricky. However, if the first two components grab a big chunk of the overall variation, it’s handy to plot the scores of these two components for each individual. This is called a Biplot. A Biplot is like a scatter plot showing both the row and column factors of two-way data on the same graph.
- Multiple regression can be dangerous if the so-called independent variables are highly correlated. In this case, we regress on principal components which is known as principal components regression.
- Cutting down the number of variables (dimensionality reduction) can be very useful in discriminant analysis, especially when we have many correlated variables (p) but few observations (n, with n < p). Having fewer observations than variables causes problems, but reducing the number of variables first with Principal Component Analysis can fix that issue.
Practical Applications:
PCA is a useful tool for handling complex data across different fields. For image compression, it helps shrink image files while keeping important details, which makes them easier to store and share. In face recognition, PCA picks out the key facial features that make each face unique, speeding up and improving the accuracy of identifying people.
In bioinformatics, PCA simplifies data from gene or protein studies by highlighting important patterns, helping researchers understand biological processes better. For speech recognition, PCA makes it easier for systems to process and understand spoken words by focusing on the most relevant features of the speech signal.
In financial analysis, PCA helps to make sense of complex market data by identifying the main factors that drive trends, which makes financial models more effective. In chemistry, PCA is used to analyze complicated data from experiments, making it easier to understand and classify different substances.
In remote sensing, PCA reduces the amount of data needed to interpret satellite or aerial images, helping to find important patterns in the data more quickly. In manufacturing, PCA helps to monitor and improve production processes by identifying key factors that affect product quality, making it easier to spot and fix problems.
In brain imaging, PCA helps to simplify data from studies on brain activity, making it easier to understand how the brain works. Finally, in marketing, PCA helps businesses understand customer behavior by focusing on the most important factors, which helps in targeting marketing efforts and segmenting customers effectively.
and the list goes on!
PCA Implementation Using SkLearn And Numpy
I have added the code here. You will have to type it into your notebook to see the results and follow my explanation accordingly. I recommend writing the code yourself, but if you want to see the outputs you can also visit my notebook below.
Example 1 – Basic Calculation Of PCA using Numpy
# create a dataset
import numpy as np
X = np.array([[1, 3, 5, 7, 9, 13, 20, 20, 21, 24, 26],
[5, 7, 11, 14, 15, 17, 18, 19, 21, 22, 26]])
X.shape
# take the transpose so that rows are observations and columns are features
X = X.T
# Goal - Apply PCA and reduce the dataset from 2-D to 1-D
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1]);
# Step 1 -- Centre the data (subtract the mean of each column)
X_normalized = X - np.mean(X, axis = 0)
print(X_normalized)
# plot the centred data
plt.scatter(X_normalized[:, 0], X_normalized[:, 1])
# let's plot both datasets - notice the difference by checking the coordinates
plt.scatter(X[:, 0], X[:, 1]);
plt.scatter(X_normalized[:, 0], X_normalized[:, 1]);
# Step - 2 Calculate the covariance matrix
C = np.cov(X_normalized, rowvar = False)
C
# Step - 3 - Calculate the eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(C)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:")
print(eigenvectors)
# Step - 4 - Sort the eigenvalues in desc. order
sorted_index = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[sorted_index]
sorted_eigenvalues
# let's check the variance explained by eigenvalues
total_variance = np.sum(sorted_eigenvalues, axis = 0)
# it should come out 80.87272727 + 40.69090909 = 121.56363636363636
total_variance
# percentage of variance explained by each eigenvalue
first = sorted_eigenvalues[:1]/total_variance
second = sorted_eigenvalues[1:]/total_variance
print(first)
print(second)
As we can see, the first (largest) eigenvalue captures about 98% of the variance in the data, while the second captures only around 2%. In practice, when working with larger datasets, we usually discard components that explain such a small share of the variance (or, when using the correlation matrix, components whose eigenvalues are less than 1).
sorted_eigenvectors = eigenvectors[:, sorted_index]
sorted_eigenvectors
# Step - 5 select the eigenvectors
n = 1 # since we want to go from 2-D to 1- D
eigenvectors_subset = sorted_eigenvectors[:,0:n]
eigenvectors_subset
# Step - 6 Transform the data, i.e., project it onto the subspace given by eigenvectors_subset
X_reduced = np.dot(eigenvectors_subset.transpose(), X_normalized.transpose()).transpose()
X_reduced
The above output is the one-dimensional representation of the original data. These values are called factor scores along each component. Each row corresponds to an observation, and the column contains the factor score for the first principal component.
Interpretation: The factor scores indicate how much each observation contributes to the first principal component.
Positive scores suggest that the observation has a positive influence on the first principal component.
Negative scores suggest a negative influence.
Data Reconstruction: You can reconstruct the data using the factor scores and the eigenvectors. The reconstructed data in the original feature space can be obtained by multiplying the factor scores by the transpose of the eigenvectors and adding back the mean of the original data.
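Continuing with the variables from the NumPy example above (X, X_reduced, eigenvectors_subset), here is a minimal sketch of that reconstruction:
# factor scores (11 x 1) times the transposed eigenvector subset (1 x 2), plus the column means
X_reconstructed = X_reduced @ eigenvectors_subset.T + np.mean(X, axis=0)
print(X_reconstructed)
# the total squared difference between the original and reconstructed data is the reconstruction error
print(np.sum((X - X_reconstructed) ** 2))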
Data Visualization:
- Plot the factor scores to visualize the distribution of data along the first principal component. This can help you identify patterns or groupings in your data.
- Clusters or groups of observations with similar patterns along the first principal component may suggest subpopulations or distinct patterns in your data. You can explore whether these clusters correspond to known categories or characteristics.
- Observations with extreme factor scores (either very high or very low) might be considered outliers. Investigate these cases to understand if they represent unusual patterns or if there are data issues.
# The factor scores we obtained
factor_scores = X_reduced
# Additional analysis or visualization can be performed based on the factor scores.
# For example, you can plot the factor scores to visualize the distribution.
import matplotlib.pyplot as plt
plt.plot(factor_scores, 'o-')
plt.xlabel('Observations')
plt.ylabel('Factor Score (PC1)')
plt.title('Distribution of Observations along PC1')
plt.show()
PCA Using Sklearn
# let's use sklearn to calculate PCA
from sklearn.decomposition import PCA
pca = PCA(n_components =1) # we are going from 2D to 1D and choosing 1 PC
pca.fit(X_normalized) # calculation
Xr = pca.transform(X_normalized) #projection
print(Xr)
As you can see, the results are the same (possibly up to a sign flip of the component, which carries no extra information).
# here is our covariance matrix same as before
pca.get_covariance()
pca.get_feature_names_out()
# this will transform the reduced data back to the original feature space
pca.inverse_transform(Xr)
Example 2: PCA For Noise Filtering or Data Compression
We will use the digits dataset for this example to show how you can use PCA for noise filtering and how you can choose the number of PCs.
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import seaborn as sns; sns.set()
digits = load_digits()
digits.data.shape
Let’s first fit PCA without specifying n_components, which keeps all the PCs. We can then plot the cumulative explained variance to see how much variation is explained by how many PCs.
How to choose the n_components
pca = PCA().fit(digits.data)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance');
From the graph above we can see that we need at least 20 components to describe 90–95% of the variance in the data; reducing the dataset to 2D or 3D would lose a lot of variance. This is how you can choose how many components to pass as n_components.
Another way of doing PCA in sklearn while retaining a chosen amount of variance is to pass a fraction between 0 and 1 as n_components, for example n_components = 0.95 to keep 95% of the variance.
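For instance, a quick sketch using the digits data loaded above (pca_95 is just an illustrative name):
pca_95 = PCA(n_components=0.95).fit(digits.data)   # keep enough PCs to explain 95% of the variance
print(pca_95.n_components_)                        # number of components actually kept
print(pca_95.explained_variance_ratio_.sum())      # should be at least 0.95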
Next, we are going to add some noise to our data and then use PCA to filter that noise out. This is just to show you how PCA can be used to denoise noisy data.
# create a function to plot our datset
def plot_digits(data):
fig, axes = plt.subplots(4, 10, figsize=(10, 4),
subplot_kw={'xticks':[], 'yticks':[]},
gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
ax.imshow(data[i].reshape(8, 8),
cmap='binary', interpolation='nearest',
clim=(0, 16))
plot_digits(digits.data)
# let's introduce some noise
np.random.seed(42)
noisy_data = np.random.normal(digits.data, 4)
plot_digits(noisy_data)
pca = PCA().fit(noisy_data)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance');
# let's see how many components preserve the 60% of variance
pca = PCA(0.60).fit(noisy_data)
pca.n_components_
# let's apply and transform it back to recover the filtered data
components = pca.transform(noisy_data)
filtered_data = pca.inverse_transform(components)
plot_digits(filtered_data)
As we can see, the result is a bit better than the noisy data; we can apply PCA more aggressively (keep fewer components) to make things clearer. Also notice that the reconstructed data is not identical to the original data; this difference is what we call the reconstruction error.
pca = PCA(0.45).fit(noisy_data)
components = pca.transform(noisy_data)
filtered_data = pca.inverse_transform(components)
plot_digits(filtered_data)
Well, it’s better than the noisy version, but the point here is to learn how to use PCA to reduce noise in data. Note that you can also use PCA to compress data by specifying the amount of variance you want to preserve. In other words, you can use PCA for dimensionality reduction when working with high-dimensional datasets.
Example 3 – Dimensionality Reduction or Data Compression
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
faces.data.shape
# When the dataset is too big we use randomized PCA in sklearn
rnd_pca = PCA(n_components=150, svd_solver="randomized")
data_reduced = rnd_pca.fit_transform(faces.data)
plt.plot(np.cumsum(rnd_pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
# the principal components ("eigenfaces") learned by PCA
fig, axes = plt.subplots(3, 8, figsize=(9, 4),
subplot_kw={'xticks':[], 'yticks':[]},
gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
ax.imshow(rnd_pca.components_[i].reshape(62, 47), cmap='bone')
projected_data = rnd_pca.inverse_transform(data_reduced)
projected_data.shape
# Plot the results
fig, ax = plt.subplots(2, 10, figsize=(10, 2.5),
subplot_kw={'xticks':[], 'yticks':[]},
gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i in range(10):
ax[0, i].imshow(faces.data[i].reshape(62, 47), cmap='binary_r')
ax[1, i].imshow(projected_data[i].reshape(62, 47), cmap='binary_r')
ax[0, 0].set_ylabel('full-dim\ninput')
ax[1, 0].set_ylabel('150-dim\nreconstruction');
As we can see above, we have reduced the dimensionality of our data down to 150 components, which in effect compresses it. We can then feed data_reduced into whatever algorithm we want to use next.
What this means is that in a pipeline we can first perform PCA, which preserves as much variance (the essential information) as we want, and then pass the reduced data to another algorithm so it runs efficiently without having to handle the full-size data. A minimal sketch of such a pipeline follows.
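Here is a small sketch of such a pipeline on the digits data; the choice of LogisticRegression and the 95% variance threshold are illustrative, not prescriptive:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=42)
model = make_pipeline(PCA(n_components=0.95),              # step 1: keep 95% of the variance
                      LogisticRegression(max_iter=5000))   # step 2: any downstream estimator
model.fit(X_train, y_train)
print(model.score(X_test, y_test))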
Other forms of PCA in Practice:
Incremental PCA
Incremental PCA is useful when you have a large dataset that doesn’t fit into memory and you want to perform PCA in smaller batches.
import numpy as np
from sklearn.decomposition import IncrementalPCA
# Generate a small dataset for demonstration
np.random.seed(42)
X = np.random.rand(100, 5)
# Specify the batch size for Incremental PCA
batch_size = 10
# Create an Incremental PCA object
ipca = IncrementalPCA(n_components=3, batch_size=batch_size)
# Incrementally fit the model on batches of the data
for i in range(0, len(X), batch_size):
batch = X[i:i + batch_size]
ipca.partial_fit(batch)
# Transform the entire dataset using the fitted Incremental PCA model
X_transformed = ipca.transform(X)
# Inverse transform to obtain an approximation of the original data
X_approximated = ipca.inverse_transform(X_transformed)
# Print the transformed data
print("Original Data Shape:", X.shape)
print("Transformed Data Shape:", X_transformed.shape)
print("Approximated Data Shape:", X_approximated.shape)
Sparse PCA
It introduces sparsity in the loadings (coefficients) of the principal components. The objective is to find a sparse representation of the data, meaning that most coefficients are zero. This can be useful when you suspect that only a small number of features are relevant.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import SparsePCA
from sklearn.datasets import make_multilabel_classification
from sklearn.preprocessing import StandardScaler
# Generate a synthetic dataset for demonstration
X, _ = make_multilabel_classification(n_samples=100, n_features=20, n_classes=2, n_labels=1, random_state=42)
# Standardize the data (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Create a Sparse PCA object
spca = SparsePCA(n_components=3, alpha=0.1) # Adjust alpha for desired sparsity level
# Fit the Sparse PCA model on the scaled data
spca.fit(X_scaled)
# Transform the data using the fitted Sparse PCA model
X_transformed = spca.transform(X_scaled)
# Inverse transform to obtain an approximation of the original data
X_approximated = spca.inverse_transform(X_transformed)
# Print the transformed data
print("Original Data Shape:", X_scaled.shape)
print("Transformed Data Shape:", X_transformed.shape)
print("Approximated Data Shape:", X_approximated.shape)
Truncated PCA
Truncated SVD (Singular Value Decomposition) is often used as an approximation of PCA, especially when dealing with sparse data. The term “truncated” indicates that only the top-k singular values and their corresponding singular vectors are retained, leading to a reduced-dimensional representation of the data. This can be useful for tasks like dimensionality reduction and matrix factorization.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.datasets import make_sparse_coded_signal
from sklearn.preprocessing import normalize
# Generate a small synthetic sparse dataset for demonstration
n_samples, n_features = 100, 20
# Specify the number of components and nonzero coefficients
n_components, n_nonzero_coefs = 5, 3
X, _, _ = make_sparse_coded_signal(n_samples=n_samples, n_features=n_features, n_components=n_components,
n_nonzero_coefs=n_nonzero_coefs, random_state=42)
# Normalize for Truncated SVD
X = normalize(X, axis=0)
# Specify the number of components (k) for Truncated SVD
n_components = 5
# Create a Truncated SVD object
svd = TruncatedSVD(n_components=n_components)
# Fit the Truncated SVD model on the data
X_transformed = svd.fit_transform(X)
# Inverse transform to obtain an approximation of the original data
X_approximated = svd.inverse_transform(X_transformed)
# Print the explained variance ratio for each component
print("Explained Variance Ratio:", svd.explained_variance_ratio_)
# Print the transformed data
print("Original Data Shape:", X.shape)
print("Transformed Data Shape:", X_transformed.shape)
print("Approximated Data Shape:", X_approximated.shape)
Kernel PCA
Kernel PCA (Kernel Principal Component Analysis) is an extension of PCA that uses kernel methods to perform non-linear dimensionality reduction.
In standard PCA, the principal components are obtained by linearly transforming the data into a new coordinate system. Kernel PCA, on the other hand, implicitly maps the data into a higher-dimensional space using a kernel function, making it possible to capture non-linear relationships between variables. We will study this in another tutorial.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_circles
# Generate a synthetic dataset for demonstration (non-linear data)
X, _ = make_circles(n_samples=100, factor=0.5, noise=0.05, random_state=42)
# Apply Kernel PCA with the radial basis function (RBF) kernel
kpca = KernelPCA(kernel='rbf', gamma=15, n_components=2)
X_kpca = kpca.fit_transform(X)
# Plot the original and transformed data
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c='b', marker='o', edgecolors='k', s=50)
plt.title('Original Data')
plt.subplot(1, 2, 2)
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c='r', marker='o', edgecolors='k', s=50)
plt.title('Kernel PCA Transformed Data')
plt.show()
Example 4: PCA on IRIS Dataset Using Plotly
This dataset consists of the sepal and petal measurements of 3 different types of iris (Setosa, Versicolor, and Virginica).
The rows are the samples and the columns are: Sepal Length, Sepal Width, Petal Length and Petal Width.
# import the data and plot it
import plotly.express as px
df = px.data.iris()
features = ["sepal_width", "sepal_length", "petal_width", "petal_length"]
fig = px.scatter_matrix(
df,
dimensions=features,
color="species"
)
fig.update_traces(diagonal_visible=False)
fig.show()
# let's perform PCA and then plot using all PCs
from sklearn.decomposition import PCA
pca = PCA()
components = pca.fit_transform(df[features])
labels = {
str(i): f"PC {i+1} ({var:.1f}%)"
for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}
fig = px.scatter_matrix(
components,
labels=labels,
dimensions=range(4),
color=df["species"]
)
fig.update_traces(diagonal_visible=False)
fig.show()
# plot using 2D
pca = PCA(n_components=2)
components = pca.fit_transform(df[features])
fig = px.scatter(components, x=0, y=1, color=df['species'])
fig.show()
# using 3D
pca = PCA(n_components=3)
components = pca.fit_transform(df[features])
total_var = pca.explained_variance_ratio_.sum() * 100
fig = px.scatter_3d(
components, x=0, y=1, z=2, color=df['species'],
title=f'Total Explained Variance: {total_var:.2f}%',
labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'}
)
fig.show()
# plot the explained variance of the dataset
import numpy as np
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)
px.area(
x=range(1, exp_var_cumul.shape[0] + 1),
y=exp_var_cumul,
labels={"x": "# Components", "y": "Explained Variance"}
)
PCA For Outlier Detection
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs
# Generate a synthetic dataset with outliers for demonstration
np.random.seed(42)
X, _ = make_blobs(n_samples=300, centers=1, cluster_std=1.0, random_state=42)
outliers = np.array([[15, -4], [14, -3], [13, -2]]) # Add outliers
X = np.vstack([X, outliers])
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Outlier score: squared Mahalanobis distance in the principal-component space
# (PC scores scaled by the standard deviation along each component)
scores = np.sum((X_pca / np.sqrt(pca.explained_variance_)) ** 2, axis=1)
# Set a threshold for detecting outliers (adjust as needed)
threshold = np.percentile(scores, 95)
# Plot the data and outliers
plt.figure(figsize=(10, 6))
# Plot the data points
plt.scatter(X[:, 0], X[:, 1], c='b', marker='o', edgecolors='k', s=50, label='Inliers')
# Highlight the outliers
outliers_mask = scores > threshold
plt.scatter(X[outliers_mask, 0], X[outliers_mask, 1], c='r', marker='o', edgecolors='k', s=100, label='Outliers')
# Plot the principal components (eigenvectors)
origin = pca.mean_
components = pca.components_.T * 3 # Scale for better visualization
plt.quiver(*origin, *components[:, 0], color='orange', scale=1, scale_units='xy', angles='xy', label='PC1')
plt.quiver(*origin, *components[:, 1], color='green', scale=1, scale_units='xy', angles='xy', label='PC2')
plt.title('PCA for Outlier Detection')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
You can then remove the outliers from the data using the outliers_mask variable.
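For example, with the variables from the block above, a one-line sketch:
X_clean = X[~outliers_mask]          # keep only the rows not flagged as outliers
print(X.shape, X_clean.shape)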
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = load_iris()
X = iris.data
# Standardize the data before applying PCA
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_standardized)
# Calculate reconstruction errors (squared distance between each standardized point
# and its reconstruction from the first two principal components)
reconstruction_errors = np.sum((X_standardized - pca.inverse_transform(X_pca))**2, axis=1)
# Set a threshold for detecting outliers (adjust as needed)
threshold = np.percentile(reconstruction_errors, 95)
# Plot the data and outliers
plt.figure(figsize=(12, 6))
# Plot the data points
plt.scatter(X_pca[:, 0], X_pca[:, 1], c='b', marker='o', edgecolors='k', s=50, label='Inliers')
# Highlight the outliers
outliers_mask = reconstruction_errors > threshold
plt.scatter(X_pca[outliers_mask, 0], X_pca[outliers_mask, 1], c='r', marker='o', edgecolors='k', s=100, label='Outliers')
# In the principal-component coordinate system, the principal components are simply the axes
origin = [0, 0]
plt.quiver(*origin, 3, 0, color='orange', scale=1, scale_units='xy', angles='xy', label='PC1')
plt.quiver(*origin, 0, 3, color='green', scale=1, scale_units='xy', angles='xy', label='PC2')
plt.title('PCA for Outlier Detection on Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
If you have any questions please ask in the forum/community support.
Footnotes:
1. PCA, Wikipedia – Source ↩︎
2. Hervé Abdi and Lynne J. Williams, "Principal Component Analysis", Wiley Interdisciplinary Reviews: Computational Statistics, John Wiley & Sons, Inc., 2010 ↩︎
3. Cosma Shalizi – https://www.stat.cmu.edu/~cshalizi/uADA/12/ ↩︎
4. Jonathon Shlens, A Tutorial on Principal Component Analysis, December 10, 2005; Version 2 ↩︎
5. Jonathan Richard Shewchuk – Unsupervised Learning and Principal Components Analysis ↩︎
6. Laurenz Wiskott, PCA Stanford Notes ↩︎
7. Same as point 5 ↩︎
Other Sources & Further Readings:
- IIT, Roorkee, ML Mathematics [https://www.iitr.ac.in/]
- IGNOU [ Link ]
- https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/
- https://www.oreilly.com/library/view/python-data-science/9781491912126/