Cracking Principal Components Analysis (PCA) — Part 1
This blog is based on Professor Tom Sager’s Unsupervised Learning class
Many aspiring data scientists have heard of Principal Components Analysis (PCA), but most (including myself) simply use PCA without a comprehensive understanding of it. In this article, I would like to explore what PCA truly is, why people use it, and some of its drawbacks!
In short, PCA uncovers structure in a dataset by replacing the old variables with new variables. In other words, the new variables capture most of the information in the old variables. The new variables are uncorrelated with each other (which avoids multicollinearity) and reveal hidden structure and perspectives with fewer variables than before. That is the dimensionality reduction PCA is famous for!
Along with its famous dimensionality reduction, PCA is also a classic example of unsupervised learning. Unsupervised means that there is no Y variable to guide the analysis. Unlike techniques such as regression, where the independent variables and the dependent variable are obvious, PCA has no variable to be estimated or predicted. One common application of PCA is replacing the X variables in a regression, which is called Principal Component Regression. Even in that application, however, the Y variable plays no role in the PCA performed on the X variables. This is why PCA is known as unsupervised learning, and there is more to PCA than a simple add-on to regression! Now let’s delve into PCA.
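As an aside, here is a minimal sketch of what Principal Component Regression can look like in practice, using scikit-learn. The data, the pipeline, and the choice of two components are my own illustrative assumptions, not part of the course material; the point is only that the PCA step never sees y:

```python
# Illustrative Principal Component Regression: PCA on the X variables only,
# followed by an ordinary regression on the resulting components.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                          # made-up predictors
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)

pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X, y)                                          # y is used only by the regression step
print(pcr.score(X, y))                                 # R^2 on the training data
```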
In order to understand what PCA truly is, we first have to understand what a rotation is. A rotation simply turns the 3D coordinate system so that we view the dataset from a new perspective. A few visual representations will make this perfectly clear!
Figure 1 shows simple pairwise scatterplots of X, Y, and Z for 100 data points. As you can observe, there is not much going on here, except that the X vs. Y plot shows a positive correlation.
Figure 2 shows the 3D version of Figure 1. As before, there is not much going on. But what if we “rotate” the cube a little?
Now we can see the same dataset from a different perspective by “rotating” the cube! Here we can conclude that Z is essentially constant, since all the data are concentrated around Z = 40. This is the power of PCA! By rotating the cube, we see a new perspective on the same dataset, and it allows us to treat Z as constant. Let’s “rotate” the cube some more to get yet another perspective on the data.
Do you see the Z axis now? No! We are left with a 2D plane because Z is effectively constant. Interestingly enough, we haven’t manipulated or dropped a single data point in order to reduce the dimensionality; we only moved the vantage point, i.e., rotated the cube. Note that Figure 4 is the same plot as the top-left plot of Figure 1.
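If you want to reproduce this little experiment yourself, here is a rough sketch; the data-generating choices are mine and only approximate the figures. It builds 100 points with correlated X and Y and a Z that is essentially constant at 40, then sets a 3D view you can rotate interactively:

```python
# Illustrative version of Figures 1-4: correlated X and Y, Z roughly constant at 40.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.normal(50, 10, 100)
y = 0.8 * x + rng.normal(0, 5, 100)          # positive X-Y correlation
z = 40 + rng.normal(0, 0.5, 100)             # Z is effectively constant

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(x, y, z)
ax.set_xlabel("X"); ax.set_ylabel("Y"); ax.set_zlabel("Z")
ax.view_init(elev=0, azim=-60)               # "rotate the cube" to see the flat wall at Z = 40
plt.show()
```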
Let’s try a different dataset, [U, V, W], again with 100 data points.
It looks like we have some clear relationships among U, V, and W! As before, let’s see the plots in 3D.
Let’s rotate the 3D plane for an interesting perspective:
It seems…wait…how are we supposed to interpret this? We once again have a flattened disk of data points, yet the “wall” is not obvious: we cannot tell whether W, V, or U is constant. The disk doesn’t lie along a plane parallel to one of the walls. Although [U,V,W] looks very similar to [X,Y,Z], we cannot reduce the dimensionality of [U,V,W] the same way. The reason we could reduce the dimensionality of [X,Y,Z] is that we could clearly see Z is constant; in the [U,V,W] case, there is no readily apparent constant.
Here is a very interesting fact: the [X,Y,Z] dataset and the [U,V,W] dataset are exactly the same! The geometry of the points remained the same, but the background coordinate grid was moved to a new location, resulting in the new [U,V,W] coordinates.
But if the two datasets are the same, then the [U,V,W] dataset should also have a superfluous dimension. If we can figure out how to express Z in terms of [U,V,W], then we can identify that unnecessary dimension. The equation turns out to be the following:
And since Z is constant and equal to 40 (Figure 3), the above equation becomes:
This follows because Equation 1 holds for every point in 3D space, so substituting Z = 40 describes the plane our data lie in. The superfluous dimension can therefore be expressed as Equation 1. This is certainly not an obvious equation! Given just the U, V, W data points, we would not know how to find the equation of the plane of the flattened disk in Figure 7. Moreover, if Equation 1 is the superfluous dimension, what are the non-superfluous dimensions? In the case of X, Y, Z, the non-superfluous dimensions were X and Y. It turns out their expressions are as follows:
Please note that these equations were defined by Professor Sager, so the specific equations may differ in your case. Using the two equations above, we can derive the following set of equations:
So, along with the superfluous dimension Z (i.e., Equation 1), the non-superfluous dimensions are the expressions for X and Y in Equation 3. A bit confused? An easier way to digest this is to think of X,Y,Z and U,V,W as two coordinate descriptions of the very same points. Instead of visualizing six dimensions (which would be impossible to interpret), we split the coordinates into the two triples X,Y,Z and U,V,W, which are much easier to visualize and interpret, and we use the equations above to switch between the two descriptions.
So these two sets of equations allow us to change U,V,W into X,Y,Z using Equation 2, and vice versa using Equation 3.
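The actual coefficients of Equations 2 and 3 appear only as images in the original post, so here is a sketch with a made-up orthonormal matrix M standing in for them. It only demonstrates the mechanics: multiplying by M plays the role of Equation 2 (one coordinate system to the other), and multiplying by its transpose plays the role of Equation 3 (back again).

```python
# A hypothetical orthonormal (rotation) matrix standing in for Equation 2's coefficients.
import numpy as np

t1, t2 = np.deg2rad(30), np.deg2rad(45)
Rz = np.array([[np.cos(t1), -np.sin(t1), 0],
               [np.sin(t1),  np.cos(t1), 0],
               [0,           0,          1]])
Rx = np.array([[1, 0,           0],
               [0, np.cos(t2), -np.sin(t2)],
               [0, np.sin(t2),  np.cos(t2)]])
M = Rx @ Rz                        # mixes all three coordinates, like the X,Y,Z -> U,V,W rotation

xyz = np.array([12.0, 7.0, 40.0])  # a point in X,Y,Z coordinates (with Z = 40)
uvw = M @ xyz                      # "Equation 2": the same point in U,V,W coordinates
back = M.T @ uvw                   # "Equation 3": the transpose undoes the rotation
print(np.allclose(back, xyz))      # True: nothing about the point itself has changed
```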
“A big take-away from this is that a system of linear equations is equivalent to a rotation of the coordinate system, which is a change of perspective.”
So how do we know that the rotation in Equation 2 preserves the geometry of the data? The unit basis vectors of the X,Y,Z system remain of unit length and remain at right angles to each other when transformed into the U,V,W system; this is called an orthonormal transformation. An orthonormal transformation is one that preserves the original geometry of the data points. Here is the procedure that shows the X,Y,Z unit basis vectors remain of unit length in U,V,W coordinates:
How can we tell that they remain at right angles to each other? If the cosine of the angle between each pair is zero, then the angle is a right angle, and the cosine is proportional to the inner product of the coordinates.
We can see that the X,Y,Z unit basis vectors remain perpendicular to each other in U,V,W coordinates.
Two conditions for Orthonormal Transformation:
1. The sum of squares of each column is equal to 1.
2. The inner product of each pair of different columns is equal to zero (the columns are perpendicular).
An easier way to check for an orthonormal transformation starts by forming the matrix of coefficients of the transformation. Here is the matrix of coefficients for Equation 2:
By the same logic, Equation 3 should define an orthonormal transformation as well. Here is the matrix of coefficients for Equation 3:
The sum of squares of each column is 1, and the inner product of each pair of different columns is zero. Moreover, because the transformation is orthonormal, the transpose M^T is also the inverse of M. Here is the equation that expresses this:
If you follow Equation 2 with Equation 3, you get back to where you started: the original unit basis vectors.
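Here is a quick numerical check of both conditions for any candidate matrix of coefficients; verifying M^T M = I covers the unit-length columns and the zero inner products in one shot. M below is the same made-up rotation from the earlier sketch, since the real matrix is shown only as an image.

```python
import numpy as np

def is_orthonormal(M, tol=1e-10):
    # M.T @ M has the column sums of squares on its diagonal and the inner
    # products of different columns off the diagonal, so comparing it to the
    # identity checks both conditions at once.
    return np.allclose(M.T @ M, np.eye(M.shape[1]), atol=tol)

t1, t2 = np.deg2rad(30), np.deg2rad(45)
Rz = np.array([[np.cos(t1), -np.sin(t1), 0],
               [np.sin(t1),  np.cos(t1), 0],
               [0,           0,          1]])
Rx = np.array([[1, 0,           0],
               [0, np.cos(t2), -np.sin(t2)],
               [0, np.sin(t2),  np.cos(t2)]])
M = Rx @ Rz

print(is_orthonormal(M))                       # True
print(np.allclose(M.T, np.linalg.inv(M)))      # True: the transpose is also the inverse
```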
However, Equation 2 is not the only orthonormal rotation; there are infinitely many of them, and not all of them are principal components. PCA provides one particularly interesting orthonormal rotation of the data.
“All PCs are orthonormal, but not all orthonormal are PCs.”
So how does PCA choose interesting coordinates for the data? PCA chooses the new axes sequentially, one after another. In other words, PCA looks at the data from all possible angles and figures out which of the infinitely many perspectives spreads the data out the most. The first new axis is the one along which the data are maximally spread out, so it carries the most information for distinguishing the points from each other. Consider the U,V,W dataset again:
Here we can see that the dataset is maximally spread out in the direction of the black arrow. This is our first PC. Naturally, the variance of the first PC is higher than that of the second PC, and so forth.
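Computationally, one way to find that variance-maximizing direction is to let a PCA routine do the search. This is a sketch on made-up data, not the JMP workflow or the actual U,V,W values from the figures:

```python
# Illustrative stand-in for the U,V,W cloud: a flat, tilted disk of 100 points.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
flat = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 3))     # rank-2 structure
uvw = flat + rng.normal(scale=0.05, size=(100, 3))             # plus a little thickness

pca = PCA(n_components=3).fit(uvw)
print(pca.components_[0])          # direction of the black arrow: the first PC
print(pca.explained_variance_)     # variances come out ordered: PC1 >= PC2 >= PC3
```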
It is worth knowing that PCA usually standardizes the data before finding the variance-maximizing dimension:
Each standardized variable now has mean 0 and variance 1. Standardization is employed in order to treat all variables equally. We will call the standardized versions of U, V, W by the names A, B, C. The variance of each of A, B, C is 1, and the total variance of A, B, C is 3 (1 + 1 + 1). Moreover, the total variance of the data does not change under an orthonormal transformation:
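A minimal sketch of the standardization bookkeeping (the U, V, W values here are made up; only the arithmetic matters):

```python
import numpy as np

rng = np.random.default_rng(1)
uvw = rng.normal(size=(100, 3)) * [3.0, 7.0, 0.5] + [10.0, 20.0, 40.0]  # made-up U, V, W

# A, B, C: subtract each column's mean and divide by its standard deviation.
abc = (uvw - uvw.mean(axis=0)) / uvw.std(axis=0, ddof=1)

print(abc.var(axis=0, ddof=1))         # [1, 1, 1]: each of A, B, C has unit variance
print(abc.var(axis=0, ddof=1).sum())   # 3: total variance equals the number of variables
```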
As the definition of an orthonormal transformation requires, the second PC must be perpendicular to the first PC, and the same logic applies to the third PC as well.
Note that there would be more than three PCs if there were more than three original variables. Here are my notes on the output graphs from JMP:
From the graph above, we can create equations of rotation:
The matrix of coefficients has columns of unit length and zero inner products between different columns, so it is an orthonormal transformation. It is worth noting that the PCs are ordered in terms of importance, variance, and information.
It is also notable that, when PCs are extracted from standardized data, the total variance always equals the number of variables (since each standardized variable has unit variance). Because the first two PCs capture almost all of that variance and Prin 3's variance is nearly zero, Prin 1 and Prin 2 can substitute for the original data while Prin 3 cannot add anything. So, intuitively, variance and information are fundamentally unified concepts.
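The same bookkeeping in code, again on made-up data rather than the JMP output above: for standardized variables the PC variances add up to (approximately) the number of variables, and a near-zero variance for the third PC is exactly what licenses dropping it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
flat = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 3))     # nearly flat cloud, like U,V,W
uvw = flat + rng.normal(scale=0.05, size=(100, 3))

abc = StandardScaler().fit_transform(uvw)                      # standardized A, B, C
pca = PCA().fit(abc)

print(pca.explained_variance_)                  # ordered variances; the third is close to zero
print(pca.explained_variance_.sum())            # ~3: the number of variables
print(pca.explained_variance_ratio_.cumsum())   # the first two PCs carry essentially all the information
```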
Okay, we now have a much deeper understanding of PCA. But how can we interpret the results of PCA? In part 2, we will delve into the interpretation of PCA!
Click here for part 2 of PCA!