The singular value decomposition (SVD) is a crucial concept in machine learning that underpins many algorithms you will use throughout your machine learning journey. It is not only used in ML and data science but in many other fields too. At its core, SVD breaks a complex dataset down into simpler, more interpretable components, giving us deeper insight into its structure. It also provides practical solutions to problems such as image compression, text mining, and multidimensional data analysis.
That’s why it’s so important to understand. For machine learning, you don’t need to implement SVD from scratch in Python, because many libraries already provide it, and writing your own implementation is a complex project that would distract you from the main goal of learning how to use it. However, to use SVD well you need to understand its basics, both to get the most out of it and to follow more advanced concepts such as PCA, which I will explain in another post. So, let’s get started and learn SVD.
Table of Contents
- Concept of Singular Value Decomposition (SVD)
- How to perform SVD on matrices
- Low Rank Approximation
- SVD & Low Rank Approximation Using Numpy
What is SVD?
You may have heard about eigendecomposition, which is explained in our notes on linear algebra, and how we use it to understand the underlying features of a matrix. The problem with eigendecomposition is that the matrix must be square, but real-world data rarely comes in square matrix form.
So we need a way to decompose non-square matrices and learn about their underlying features. Singular value decomposition (SVD) is exactly that: a factorization that works for any matrix, which lets us analyze the rectangular matrices we meet in practice.
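To make this concrete, here is a minimal NumPy sketch using the same 3×2 matrix that appears in the example later in this post: eigendecomposition refuses a non-square matrix, while SVD handles it without complaint.
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # 3 rows, 2 columns: not square

# Eigendecomposition requires a square matrix, so this raises an error.
try:
    np.linalg.eig(A)
except np.linalg.LinAlgError as err:
    print("eig failed:", err)

# SVD works for any shape.
U, s, Vt = np.linalg.svd(A)
print(U.shape, s.shape, Vt.shape)   # (3, 3) (2,) (2, 2)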
A formal definition: the singular value decomposition of a matrix A is its factorization into the product of three matrices, A = UΣVT, where the columns of U and V (and hence the rows of VT, the transpose of V) are orthonormal, and Σ is a diagonal matrix with positive real entries σ1, …, σr, the singular values of A, satisfying σ1 ≥ σ2 ≥ ··· ≥ σr > 0. (A short NumPy check of these properties follows the list below.)
1. Matrix U:
- U is an orthogonal matrix, meaning its columns are unit vectors (length 1) that are perpendicular to each other. This property is expressed as UTU = I, where I is the identity matrix.
- U represents an orthogonal transformation that reorients the coordinate system of the rows of A. It aligns the data with a new basis where the rows are positioned along the principal directions determined by VT.
- The columns of U are the left singular vectors. They describe the relationships between rows in the original matrix A and the new coordinate system.
2. Matrix Σ:
- Σ is a diagonal matrix, meaning all its elements outside the main diagonal are zero. The diagonal elements are the singular values of A, denoted as σ1,σ2,…,σr (where r is the rank of A).
- The singular values in Σ represent the scaling factors by which the corresponding directions in U and V are stretched or shrunk during the transformation. Larger singular values indicate directions of greater importance.
- Geometrically, Σ stretches or shrinks the data along the principal axes determined by VT.
3. Matrix VT:
- Similar to U, V (and therefore its transpose VT) is an orthogonal matrix: its columns are unit vectors and VVT = VTV = I.
- Geometrically, VT reorients the coordinate system for the columns of A, aligning them with the directions (principal axes) that capture the most variance.
- The rows of VT (equivalently, the columns of V) are the right singular vectors. These vectors provide a new orthonormal basis for the columns of A. They describe the directions along which the data is most spread out in the column space.
4. Rank of A:
- The rank of A is the number of non-zero singular values in Σ. In other words, the rank of a matrix is the number of linearly independent rows or columns. It indicates the number of dimensions or directions in which the data has non-zero variance.
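Here is a minimal NumPy sketch of these properties on an arbitrary small matrix of my own choosing; it checks that U and V are orthogonal, that the singular values come out in descending order, and that the rank equals the number of non-zero singular values.
import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0]])   # an arbitrary 2x3 example

U, s, Vt = np.linalg.svd(A)

# U and V are orthogonal: U^T U = I and V^T V = I (up to rounding).
print(np.allclose(U.T @ U, np.eye(U.shape[0])))     # True
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))  # True

# Singular values are non-negative and sorted in descending order.
print(s)                              # two non-negative values, largest first
print(np.all(np.diff(s) <= 0))        # True

# The rank of A equals the number of non-zero singular values.
print(np.linalg.matrix_rank(A), np.sum(s > 1e-10))  # 2 2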
What Does Singular Value Decomposition Tell Us?
We can deduce a few things about SVD based on our above definition:
- The matrices U and VT in the SVD are orthogonal. This implies that the columns of U and the rows of VT form orthonormal bases, and that the transformations represented by U and VT preserve lengths and angles between vectors.
- The matrix Σ is diagonal, and its diagonal elements are the singular values of the original matrix. These singular values are non-negative and represent the importance or magnitude of the corresponding singular vectors. They are arranged in descending order, which is what lets us compress data without losing much information: we can keep the largest ones and drop the rest.
- The rank of the matrix is equal to the number of non-zero singular values in Σ. If the matrix has a rank r, then only the first r singular values are non-zero, and the rest are zero.
- The singular values of a matrix A are the square roots of the eigenvalues of ATA (or AAT). This connection provides a link between SVD and the spectral properties of the matrix, that is, the properties related to its eigenvalues and eigenvectors. (A quick NumPy check of this relation follows this list.)
- The SVD provides the best low-rank approximation of a matrix. The best rank-k approximation is obtained by retaining the first k singular values and their corresponding singular vectors. This property is useful in data compression and noise reduction: instead of storing the full matrix, we can store a much smaller approximation that preserves most of the information.
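And here is the quick check of the eigenvalue relation promised above, again on an arbitrary small matrix: the squared singular values of A match the non-zero eigenvalues of ATA.
import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])          # arbitrary 2x3 example

s = np.linalg.svd(A, compute_uv=False)   # singular values only
eigvals = np.linalg.eigvalsh(A.T @ A)    # eigenvalues of A^T A, ascending order

# Squared singular values equal the largest (non-zero) eigenvalues of A^T A.
print(np.sort(s**2))
print(np.allclose(np.sort(s**2), eigvals[-len(s):]))   # True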
Let’s build some intuition for these concepts with a concrete example.
To appreciate the utility of SVD, it’s helpful to start by thinking about what a matrix represents. We have already talked about this in our tutorial on mathematical intuition. In many applications, a matrix is a way to store data, where the rows and columns correspond to different variables or observations. For example, in image processing, a matrix might represent pixel intensities, while in other tasks, it might represent user ratings for a set of movies. However, not all data in a matrix is equally informative; some of it might be noise, redundancy, or irrelevant details.
Let’s consider a simple, concrete example: imagine a 2D dataset where each data point is represented by a vector in a 2D space, like coordinates on a graph. The data might look like a cloud of points, with some directions (or axes) showing more spread or variation than others. Suppose we have a set of points on a 2D plane, and we arrange these points into a matrix A, where each row of A represents a point, and each column represents a dimension (in this case, just two dimensions: x and y coordinates). The matrix A might look something like this:
A = [[1, 2], [3, 4], [5, 6]]
Here, each row such as [1, 2] corresponds to a point in the plane. The question is, what patterns or structures can we extract from this matrix? This is where SVD comes in. SVD decomposes the matrix A into three matrices: A = UΣVT. U is an orthogonal matrix containing the left singular vectors, Σ is a diagonal matrix containing the singular values, and VT is the transpose of the orthogonal matrix V, which contains the right singular vectors. But what does this mean intuitively?
In a 2D or 3D space, vectors are orthogonal if they are at right angles to each other. For example, in the 2D plane, the x-axis and y-axis are orthogonal because they meet at a 90-degree angle. Orthogonality in higher dimensions generalizes this idea: two vectors are orthogonal if their dot product is zero, meaning they are independent of each other in a geometric sense.
In the context of SVD, orthogonal matrices like U and V consist of vectors (or columns) that are mutually orthogonal. This orthogonality is crucial because it means that the singular vectors provide a set of independent directions that span the space where our data lives. These directions are like the principal axes of an ellipse that best describes the shape of our data cloud.
VT captures the directions in the original space (columns of A) that explain the most variance in the data. It provides the axes along which the data is most “spread out.” These are the principal directions or eigenvectors of the data covariance matrix. Σ is a diagonal matrix where the diagonal entries are the singular values. These values tell us how much variance or information is contained along each direction identified by VT. The larger the singular value, the more important that direction is. Smaller singular values indicate directions that contribute less to the overall structure of the data. U tells us how to combine the principal directions given by VT to reconstruct the original data. It provides the transformation to align the rows of A with these principal directions.
Now, let’s go back to our example matrix A and apply SVD to it (the short NumPy snippet after the matrices shows how to compute them). Rounded to two decimals, we obtain the following matrices:
U:
- [-0.23, 0.88]
- [-0.52, 0.24]
- [-0.82, -0.40]
Σ:
- [9.53, 0]
- [0, 0.51]
V^T:
- [-0.62, -0.78]
- [-0.78, 0.62]
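These values can be reproduced with NumPy; here is a minimal snippet (using the “thin” SVD so that U comes back with the same 3×2 shape shown above):
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.round(U, 2))    # left singular vectors (as columns)
print(np.round(s, 2))    # singular values, largest first: [9.53 0.51]
print(np.round(Vt, 2))   # right singular vectors (as rows)

# Reconstruction check: U @ diag(s) @ V^T recovers A (up to rounding error).
print(np.allclose(U @ np.diag(s) @ Vt, A))   # True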
The rows of VT represent the directions in the 2D space that capture the most variance in the data. These directions are orthogonal, meaning they are at right angles to each other, just like the principal axes of an ellipse. The values 9.53 and 0.51 in Σ represent how much “stretching” occurs along each of these directions. The larger singular value (9.53) indicates that most of the data’s variance lies along the first direction given by VT; the smaller value (0.51) indicates much less variance along the second direction. The columns of U are the left singular vectors; they are orthonormal, and each row of U gives the coordinates of the corresponding data point (row of A) along the principal directions from VT, up to scaling by the singular values.
Essentially, when reconstructing the original matrix A, VT provides the principal directions in the column space, Σ scales the data along those directions, and U maps the scaled result back onto the original rows of A.
It can help to think of building A back up from these three matrices rather than decomposing it; just keep in mind that SVD itself works in the opposite direction, from A to the factors.
This is what SVD is all about. We can exploit this structure in many applications; one of them is low-rank approximation.
By keeping only the most significant singular values and vectors, we can represent our data much more compactly. This is also the basis for many image and signal compression techniques. A matrix is considered low-rank if its rank r (the number of non-zero singular values) is much smaller than its dimensions. The matrix might have many rows and columns, but if its rank is much smaller, the data can be effectively described with fewer dimensions, which indicates redundancy or correlation among the data dimensions. The idea behind low-rank approximation is to approximate a matrix A by another matrix A′ with lower rank, capturing the most significant patterns while ignoring the less important details.
The low-rank approximation is achieved by truncating the SVD components. Specifically, by retaining only the top k singular values and their corresponding singular vectors, we can construct a matrix A′ that is a rank-k approximation of A. Mathematically, this can be expressed as A′ = UkΣkVkT, where Uk contains the first k columns of U, VkT contains the first k rows of VT, and Σk is the diagonal matrix containing only the top k singular values. The resulting matrix A′ is the best rank-k approximation of A in the sense of minimizing the Frobenius norm of the difference between A and A′.
This truncation is not arbitrary. The singular values are sorted in decreasing order, so the first few capture most of the variance in the data. By focusing on these dominant singular values, we retain the most significant features of the matrix while reducing its complexity. This reduction is the core idea behind dimensionality reduction techniques like Principal Component Analysis (PCA), which is fundamentally linked to SVD.
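As a sketch of this idea, here is a small helper built on np.linalg.svd that returns the rank-k approximation A′ described above; the function name is my own, and the example matrix is the same 3×4 matrix used in the NumPy section later in this post.
import numpy as np

def rank_k_approximation(A, k):
    """Return the best rank-k approximation of A in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the first k singular values and their singular vectors.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.array([[1.0, 2.0, 3.0, 4.0],
              [1.0, 1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0, 0.0]])

A2 = rank_k_approximation(A, 2)
print(np.linalg.matrix_rank(A2))   # 2
print(np.linalg.norm(A - A2))      # Frobenius error = the discarded singular value (about 0.22)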
Imagine a high-resolution image stored as a matrix, where each pixel’s intensity is a matrix element. A high-resolution image might have millions of pixels, but not every pixel contains unique information. Large areas of the image might be uniform, or patterns might repeat across different regions. By applying SVD and low-rank approximation, we can compress the image by retaining only the most critical information that defines its structure and discarding the less significant details. The result is a much smaller representation that still captures the essence of the original image. See the example below.
By approximating an image with a low-rank matrix, we can reduce the storage space required while preserving the visual quality.
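As a rough sketch of image compression with SVD (the “image” here is a synthetic NumPy array standing in for real pixel data; with a photo you would load its grayscale pixel matrix instead):
import numpy as np

# Synthetic 200x300 grayscale "image": a low-rank pattern plus a little noise.
rows = np.linspace(0, 1, 200)[:, None]
cols = np.linspace(0, 1, 300)[None, :]
img = rows * cols + 0.2 * np.sin(20 * cols) + 0.05 * np.random.default_rng(0).normal(size=(200, 300))

U, s, Vt = np.linalg.svd(img, full_matrices=False)

k = 10                                         # number of singular values to keep
img_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation of the image

# Storage cost: k*(rows + cols + 1) numbers instead of rows*cols.
print(img.size, k * (img.shape[0] + img.shape[1] + 1))    # 60000 vs 5010
print(np.linalg.norm(img - img_k) / np.linalg.norm(img))  # small relative error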
Theorem, Proofs, Examples Of SVD & Low-Rank Approximation
Please expand the embedded pages below and work through the theorems, proofs, and examples; once you have grasped these, move ahead.
Applications of SVD
Here is an example of how we can use SVD. Suppose you are trying to send a photograph captured by a satellite from space back to Earth. You want to transmit the image in a form that takes less memory without losing its essential content. We can compress the image with an SVD low-rank approximation before sending it.
Here is how the image looks when different numbers of singular values are retained:
Singular Value Decomposition using Numpy
In this section, we will use Numpy to perform SVD on a matrix and then implement a low-rank approximation of it. You can apply the same steps to images as well.
import numpy as np

np.set_printoptions(precision=4, suppress=True)  # round printed output for readability

# Creating a 3x4 matrix
a = np.array([[1, 2, 3, 4],
              [1, 1, 2, 3],
              [0, 1, 1, 0]])

np.linalg.matrix_rank(a)
# 3

u, s, vh = np.linalg.svd(a, full_matrices=True)

# U matrix (left singular vectors as columns)
u
# array([[ 0.8109,  0.0934, -0.5776],
#        [ 0.57  , -0.3493,  0.7437],
#        [ 0.1323,  0.9324,  0.3365]])

# U is orthogonal: U U^T = I
u.dot(u.T)
# array([[ 1.,  0.,  0.],
#        [ 0.,  1., -0.],
#        [ 0., -0.,  1.]])

# V^T matrix (right singular vectors as rows)
vh
# array([[ 0.2046,  0.3443,  0.5488,  0.7338],
#        [-0.2181,  0.6561,  0.438 , -0.5746],
#        [ 0.7598, -0.3431,  0.4167, -0.3625],
#        [-0.5774, -0.5774,  0.5774,  0.    ]])

# V^T is orthogonal as well
vh.dot(vh.T)
# array([[ 1.,  0.,  0.,  0.],
#        [ 0.,  1.,  0.,  0.],
#        [-0.,  0.,  1.,  0.],
#        [ 0.,  0.,  0.,  1.]])

# Singular values, sorted in descending order
s
# array([6.7509, 1.1734, 0.2186])

# Build the 3x4 sigma matrix: singular values on the diagonal,
# padded with a zero column so that u @ sigma @ vh is defined.
sd = np.diag(s)
# array([[6.7509, 0.    , 0.    ],
#        [0.    , 1.1734, 0.    ],
#        [0.    , 0.    , 0.2186]])

b = np.zeros((3, 4))
b[:, :-1] = sd
sigma = b
sigma
# array([[6.7509, 0.    , 0.    , 0.    ],
#        [0.    , 1.1734, 0.    , 0.    ],
#        [0.    , 0.    , 0.2186, 0.    ]])

# Reconstruct the original matrix: U Sigma V^T
print(np.dot(np.dot(u, sigma), vh))
# [[1. 2. 3. 4.]
#  [1. 1. 2. 3.]
#  [0. 1. 1. 0.]]
We padded Σ with a column of zeros so that it has the same shape as the original matrix (3×4), which makes the product UΣVT well-defined. Multiplying the three factors gives back our original matrix!
Low-Rank Approximation Of A Matrix
Now let’s find a rank-2 approximation of the above matrix.
# Zero out the smallest singular value so that only the top two are kept.
sd = np.diag(s)
sd[2, 2] = 0
sd
# array([[6.7509, 0.    , 0.    ],
#        [0.    , 1.1734, 0.    ],
#        [0.    , 0.    , 0.    ]])

# Pad to the 3x4 shape again.
b = np.zeros((3, 4))
b[:, :-1] = sd
sigma = b
sigma
# array([[6.7509, 0.    , 0.    , 0.    ],
#        [0.    , 1.1734, 0.    , 0.    ],
#        [0.    , 0.    , 0.    , 0.    ]])

# Rank-2 approximation of a
rank2 = np.dot(np.dot(u, sigma), vh)
print(rank2)
# [[ 1.0959  1.9567  3.0526  3.9542]
#  [ 0.8764  1.0558  1.9322  3.0589]
#  [-0.0559  1.0252  0.9693  0.0267]]

np.linalg.matrix_rank(rank2)
# 2

# Compare with the original matrix
print(a)
# [[1 2 3 4]
#  [1 1 2 3]
#  [0 1 1 0]]
The result above approximates the rank-3 matrix a with a matrix of rank 2. Compare its entries with those of a: they are close, even though the smallest singular value was discarded. I hope this was helpful. For more details, check out the additional resources below. The main concept you should take from this tutorial is the diagonal matrix of singular values and the role it plays in capturing variance; it will be useful in the next tutorial, on PCA. If you have any queries, please drop your questions here: Linear Algebra.
Footnotes And Sources
- Singular Value Decomposition, Andrew Lounsbury, Department of Mathematics, Tennessee Technological University [Link]
- In Vivo Dynamic Image Characterization of Brain Tumor Growth Using Singular Value Decomposition and Eigenvalues [Link]