Principal Component Analysis (PCA) is a dimensionality-reduction technique that is often used to transform a high-dimensional dataset into a lower-dimensional subspace before running a machine learning algorithm on the data.
When should you use PCA?
It is often helpful to use a dimensionality-reduction technique such as PCA prior to performing machine learning because:
- Reducing the dimensionality of the dataset reduces the size of the space in which k-nearest-neighbors (kNN) must calculate distances, which improves the performance of kNN.
- Reducing the dimensionality of the dataset reduces the number of degrees of freedom of the hypothesis, which reduces the risk of overfitting.
- Most algorithms run significantly faster when they have fewer dimensions to process.
- Reducing the dimensionality via PCA can simplify the dataset, facilitating description, visualization, and insight.
What does PCA do?
Principal Component Analysis does just what it advertises: it finds the principal components of the dataset. PCA transforms the data into a new coordinate system, usually a lower-dimensional subspace. In the new coordinate system, the first axis corresponds to the first principal component, which is the component that explains the greatest amount of the variance in the data.
Can you ELI5?
Let’s say your original dataset has two variables, x1 and x2, and that a scatter plot of the data forms an elongated oval:
Now, we want to identify the first principal component, the one that explains the highest amount of variance. Graphically, if we draw a line that splits the oval lengthwise, that line signifies the component that explains the most variance:
Let's say we wanted to project the data onto the first principal component only. In other words, we want to use PCA to reduce our two-dimensional dataset to a one-dimensional dataset.
Basically, we would collapse our dataset onto a single line (by projecting it onto that line). The single line is the first principal component.
Here is a picture:
Here is what the dataset will look like after being projected onto a single dimension corresponding to the first principal component:
We can see that we destroyed some of the original information when we went from a two-dimensional dataset to the one-dimensional projection.
You can think of the projection as something like a shadow of the original data.
Although we lost some information in the transformation, we did keep the most important axis, which incorporates information from both x1 and x2.
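The projection described above can be sketched in a few lines of NumPy. This is only an illustration: the synthetic data, the random seed, and the variable names (`X`, `first_pc`, `projected`, `retained`) are assumptions and do not come from the original example.

```python
import numpy as np

# Hypothetical two-variable dataset: x2 roughly follows x1,
# so the point cloud looks like an elongated oval.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.5 * x1 + 0.2 * rng.normal(size=200)
X = np.column_stack([x1, x2])

# Center the data so the principal components pass through the mean.
X_centered = X - X.mean(axis=0)

# Eigen-decomposition of the 2x2 covariance matrix.
# eigh returns the eigenvalues in ascending order.
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# The first principal component is the eigenvector with the largest eigenvalue.
first_pc = eigenvectors[:, -1]

# Project every point onto that single direction: 200 x 2 becomes 200 values.
projected = X_centered @ first_pc

# Fraction of the total variance that survives the projection.
retained = projected.var(ddof=1) / X_centered.var(axis=0, ddof=1).sum()
print(f"shape after projection: {projected.shape}")
print(f"variance retained: {retained:.1%}")
```

The retained fraction is below 100%, which is the "lost information" mentioned above, but for strongly correlated variables like these it stays high.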
How do we calculate the second principal component?
The second principal component must be orthogonal to the first principal component. Subject to that constraint, it does its best to capture the variance in the data that is not captured by the first principal component.
For our two-dimensional dataset, there can be only two principal components.
Here is a picture of the data and its first and second principal components:
Again, you can see that the two principal components are perpendicular to each other. They capture uncorrelated aspects of the dataset.
If we were to perform PCA on this dataset and project the original dataset onto the first two principal components, then no information would be lost. (We are transforming from a two-dimensional dataset to a new two-dimensional dataset.) Instead, we would be merely rotating the data to use new dimensions.
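To make the "no information is lost" claim concrete, here is a small sketch, assuming a synthetic two-variable dataset and illustrative names (`components`, `rotated`, `restored`). It projects the data onto both principal components and then undoes the transformation. Strictly speaking, the change of coordinates may include a reflection as well as a rotation.

```python
import numpy as np

# Hypothetical correlated two-variable dataset (an assumption for illustration).
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
X = np.column_stack([x1, 0.5 * x1 + 0.2 * rng.normal(size=100)])

mean = X.mean(axis=0)
X_centered = X - mean

# Columns of `components` are the principal components,
# reordered so the one with the largest variance comes first.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
components = eigenvectors[:, ::-1]

# Projecting onto both components only changes the coordinate system.
rotated = X_centered @ components

# Because `components` is an orthogonal matrix, its transpose undoes the change.
restored = rotated @ components.T + mean

print(np.allclose(restored, X))                 # original data recovered
print(np.round(np.cov(rotated, rowvar=False), 6))  # covariance in the new axes
```

In the new coordinate system the off-diagonal covariance is (numerically) zero, meaning the two new axes are uncorrelated.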
Here is a picture of what the dataset looks like projected onto the first two principal components:
How many principal components do you need?
In general, the data will tend to follow the 80/20 rule: most of the variance (the interesting part of the data) will be explained by a small number of principal components. You might be able to explain 95% of the variance in your dataset using only 10% of the original number of attributes. However, this is entirely dependent on the dataset. Often, a good rule of thumb is to keep the principal components that together explain 99% of the variance in the data.
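One way to apply such a threshold is to accumulate the explained variance component by component and stop once the target is reached. The helper below is a minimal sketch. The function name `components_needed`, its `threshold` parameter, and the synthetic data are all hypothetical choices for illustration.

```python
import numpy as np

def components_needed(X, threshold=0.99):
    """Return the smallest number of principal components whose
    cumulative explained variance reaches `threshold` (a fraction)."""
    X_centered = X - X.mean(axis=0)
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    eigenvalues = np.linalg.eigvalsh(np.cov(X_centered, rowvar=False))[::-1]
    # Guard against tiny negative values caused by floating-point error.
    eigenvalues = np.clip(eigenvalues, 0.0, None)
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Index of the first cumulative ratio that meets the threshold, as a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(eigenvalues))

# Hypothetical data: 10 attributes driven by only 2 underlying factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

print(components_needed(X, threshold=0.95))
print(components_needed(X, threshold=0.99))
```

Because this example dataset is built from two underlying factors, at most two of its ten possible components are needed to reach either threshold. Real datasets may need many more.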
You cannot have more principal components than the number of attributes in the original dataset.
Mathematically, the principal components are the eigenvectors of the covariance matrix of the original dataset. Because the covariance matrix is symmetric, the eigenvectors are orthogonal.
The principal components (eigenvectors) correspond to directions in the original n-dimensional space; the first one points in the direction with the greatest variance in the data.
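The eigenvector view can be checked numerically. The sketch below assumes synthetic data and a hypothetical helper named `principal_components`. It computes the eigenvectors of the covariance matrix, verifies that they are orthogonal, and confirms that no randomly chosen direction has more variance than the first component.

```python
import numpy as np

def principal_components(X):
    """Return (eigenvalues, components) sorted by decreasing eigenvalue.
    Each column of `components` is one principal component."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh is intended for symmetric matrices and returns orthonormal eigenvectors.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]

# Hypothetical data: 300 samples with 4 correlated attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))

eigenvalues, components = principal_components(X)

# Orthogonality: the dot products between components form the identity matrix.
print(np.allclose(components.T @ components, np.eye(4)))

# Compare against the variance along 1000 random unit-length directions.
X_centered = X - X.mean(axis=0)
directions = rng.normal(size=(4, 1000))
directions /= np.linalg.norm(directions, axis=0)
variances = (X_centered @ directions).var(axis=0, ddof=1)
print(variances.max() <= eigenvalues[0] + 1e-9)
```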
Each eigenvector has a corresponding eigenvalue. An eigenvalue is a scalar. Recall that an eigenvector corresponds to a direction. The corresponding eigenvalue is a number that indicates how much variance there is in the data along that eigenvector (or principal component).
In other words, a large eigenvalue means that its principal component explains a large amount of the variance in the data.
A principal component with a very small eigenvalue does not do a good job of explaining the variance in the data.
In the extreme case, if a principal component had an eigenvalue of zero, then it would mean that it explained none of the variance in the data.
When PCA is used for dimensionality reduction, you will typically want to discard any principal components with zero or near-zero eigenvalues.
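As an illustration of the link between eigenvalues and variance, the sketch below builds a hypothetical three-attribute dataset in which one attribute is an exact linear combination of the other two. One eigenvalue comes out (numerically) zero, and the corresponding component can be discarded. The relative cutoff of `1e-10` is an arbitrary assumption for this example, not a recommended default.

```python
import numpy as np

# Hypothetical data: the third attribute is an exact linear combination of the
# first two, so the points really lie in a two-dimensional plane.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, 2.0 * a - b])

X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
# Reverse so the largest eigenvalue (and its eigenvector) comes first.
eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]

# Each eigenvalue equals the variance of the data projected onto its eigenvector.
projected_variances = (X_centered @ eigenvectors).var(axis=0, ddof=1)
print(np.allclose(projected_variances, eigenvalues))

# Keep only components whose eigenvalue is not negligible relative to the largest.
keep = eigenvalues > 1e-10 * eigenvalues[0]
reduced = X_centered @ eigenvectors[:, keep]
print(eigenvalues)
print(reduced.shape)
```

Dropping the zero-eigenvalue component here reduces the data from three columns to two without discarding any variance.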
One final word of caution
When performing PCA, it is typically a good idea to normalize the data first, for example by rescaling each attribute to zero mean and unit variance. Because PCA seeks to identify the principal components with the highest variance, if the data are not properly normalized, attributes with large values and large variances (in absolute terms) will end up dominating the first principal component when they should not. Normalizing the data gets each attribute onto more or less the same scale, so that each attribute has an opportunity to contribute to the principal component analysis.
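The effect of scale is easy to see on a toy example. In the sketch below, the two attributes (a height in metres and an income in dollars), their distributions, and the helper `first_component` are all hypothetical. Standardizing to zero mean and unit variance is one common normalization, used here as an assumption.

```python
import numpy as np

def first_component(X):
    """Return the unit-length first principal component of X."""
    X_centered = X - X.mean(axis=0)
    _, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
    # eigh sorts eigenvalues in ascending order, so the last column is the first PC.
    return eigenvectors[:, -1]

# Hypothetical data: two correlated attributes on very different scales.
rng = np.random.default_rng(0)
height = rng.normal(1.7, 0.1, size=500)
income = 50_000 + 100_000 * (height - 1.7) + rng.normal(0, 10_000, size=500)
X = np.column_stack([height, income])

# Without normalization, the large-valued attribute dominates the first component.
print(np.abs(first_component(X)))

# Standardize: subtract each column's mean and divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.abs(first_component(X_scaled)))
```

On the raw data the first component points almost entirely along income. After standardizing, both attributes carry equal weight in the first component, which is always the case for two standardized, correlated attributes.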
- Wikipedia on Principal Component Analysis
- Principal Component Analysis 4 Dummies: Eigenvectors, Eigenvalues and Dimension Reduction by George Dallas
- Georgia Tech lectures on PCA
Thanks to Charles Isbell and Michael Littman for recording these lectures.