Principal Component Analysis (PCA)
What is it and what’s behind it
In some situations, when working with machine learning models, we have to deal with high-dimensional data. This brings various problems to the table, from higher computational time to the higher complexity required by the model to describe the data (check Curse of Dimensionality). To address this problem, various techniques to reduce the dimensionality of data have been proposed; one of them is Principal Component Analysis (PCA). PCA is a technique that can be used for different applications, from dimensionality reduction and feature extraction to lossy data compression and data visualization. In this article, we will present the concepts behind this method, and then present two implementations in Python from scratch.
Description
Principal Component Analysis is a linear transformation that maps the data into a new orthogonal space spanned by vectors called the principal component axes. These axes are found by searching for the directions that capture the highest variance in the data: the first axis is the one with the highest variance, the last one with the lowest variance.
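As a quick illustration, here is a minimal NumPy sketch (the data and variable names are just for this example) showing that the eigenvectors of the covariance matrix of some correlated 2D data point along the directions of highest variance:

import numpy as np

rng = np.random.default_rng(0)
# Correlated 2D data: the second feature is roughly half the first, plus noise
x = rng.normal(size=500)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=500)])

# The eigenvectors of the covariance matrix point along the principal axes
cov = np.cov(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues in ascending order

print(eigenvectors[:, -1])  # axis of highest variance, close to (1, 0.5) normalized (up to sign)
print(eigenvalues[::-1])    # variance captured along each axis, in descending order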
Formal Definition
For the formal definition, we will present the Singular Value Decomposition (SVD) formalization, as it is more stable and general compared to the Eigendecomposition; you can check the implementation section to get an idea of the differences.
Let $\mathbf{X} \in \mathbb{R}^{n \times d}$ be the observation matrix, with $n$ observations as rows and $d$ features as columns. The first step is to center the data by removing the mean, $\mathbf{X}' = \mathbf{M X}$, where $\mathbf{M}$ is the Centering Matrix. Then, we perform SVD on $\mathbf{X}'$, obtaining:
$$\mathbf{X}' = \mathbf{U\Sigma V}^\top$$
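To make the notation concrete, here is a small NumPy sketch (the sizes and random data are placeholders) verifying that multiplying by the centering matrix is the same as subtracting the column means, and showing how the SVD factors are obtained:

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))

# Centering matrix M = I - (1/n) 1 1^T
M = np.eye(n) - np.ones((n, n)) / n
X_centered = M @ X
assert np.allclose(X_centered, X - X.mean(axis=0))

# SVD of the centered data; numpy returns V^T as the last factor
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
assert np.allclose(X_centered, U @ np.diag(S) @ Vt)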
The covariance matrix $\mathbf{C}$ is given by:
$$\mathbf{C} = \frac{1}{n-1}\mathbf{X}'^\top \mathbf{X}'$$
which, by plugging in the SVD of $\mathbf{X}'$, becomes:
$$\mathbf{C} = \frac{1}{n-1} \mathbf{V\Sigma U}^\top \mathbf{U\Sigma V}^\top$$
and, since $\mathbf{U}$ is orthogonal ($\mathbf{U}^\top\mathbf{U} = \mathbf{I}$):
$$\mathbf{C} = \frac{1}{n-1}\mathbf{V\Sigma}^2\mathbf{V}^\top$$
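Continuing the sketch above (reusing X_centered, S, and Vt from the previous snippet), we can check this identity numerically:

# Covariance matrix computed directly from the centered data
C = X_centered.T @ X_centered / (n - 1)

# The same matrix rebuilt from the SVD factors: V Sigma^2 V^T / (n - 1)
C_from_svd = Vt.T @ np.diag(S ** 2) @ Vt / (n - 1)
assert np.allclose(C, C_from_svd)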
The columns of $\mathbf{V}$ (the right singular vectors) are the principal axes; sometimes they are also referred to as principal directions or even principal components. To get the principal components, also called principal component scores, we project the input onto the principal axes: $$\mathbf{X}'\mathbf{V}$$
We can also rewrite the transformation as: $$\mathbf{X}'\mathbf{V} = \mathbf{U\Sigma V}^\top \mathbf{V} = \mathbf{U\Sigma}$$
which is used when we want to apply the transformation to the same data we are fitting.
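Reusing the same factors once more, both ways of computing the scores agree:

# Scores obtained by projecting onto the principal axes
scores_projection = X_centered @ Vt.T

# Scores obtained directly as U * Sigma (each column of U scaled by its singular value)
scores_us = U * S
assert np.allclose(scores_projection, scores_us)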
The explained variance of each principal component is given by the corresponding diagonal element of:
$$\sigma = \frac{\mathbf{\Sigma}^2}{n-1}$$
To perform dimensionality reduction to a size $k < d$, we can then pick the first $k$ columns of $\mathbf{V}$ (equivalently, the first $k$ rows of $\mathbf{V}^\top$) as the reduced principal axes, and the first $k$ diagonal elements of $\sigma$ as the variance explained by these principal components. Finally, we project $\mathbf{X}'$ onto the reduced space using the truncated $\mathbf{V}$.
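One last check with the same factors: keeping only the first $k$ axes gives the reduced representation and its explained variance (the value of $k$ below is an arbitrary choice for the sketch):

k = 2  # arbitrary target dimensionality for this sketch

V_k = Vt[:k]                    # first k principal axes (rows of V^T)
X_reduced = X_centered @ V_k.T  # n x k representation of the data
explained_variance = S[:k] ** 2 / (n - 1)

print(X_reduced.shape)     # (100, 2)
print(explained_variance)  # variance captured by each of the k components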
Implementation from Scratch
For the implementation, we will create an abstract class with an interface similar to the one provided by the sklearn library (if you are not familiar with abstract classes, check this link). We will use this class as a base for our two implementations, which use the previously described methods to compute the PCA.
from abc import ABC, abstractmethod

import numpy as np


class PCA(ABC):
    def __init__(self, n_components):
        super().__init__()
        self._n_components = n_components
        self._components = None          # principal axes, stored one per row
        self._explained_variance = None  # variance captured by each axis
        self._mean = None                # per-feature mean used for centering

    @property
    def mean(self):
        return self._mean

    @property
    def explained_variance(self):
        return self._explained_variance

    @property
    def components(self):
        return self._components

    @abstractmethod
    def fit(self, X):
        pass

    def transform(self, X):
        if self._mean is not None:
            # Work on a copy so we do not modify the caller's array in place
            X = X - self._mean
        X_transformed = np.dot(X, self._components.T)
        return X_transformed
The first implementation uses SVD; this method is more stable and is similar to the one used in sklearn.
class PCASVD(PCA):
    def fit(self, X, y=None, *args, **kwargs):
        """
        Finds the principal axes and their explained variance
        :param X: Numpy array
        :return: self
        """
        n, _ = X.shape
        self._mean = np.mean(X, axis=0)
        X = X - self._mean  # center a copy instead of mutating the input
        # numpy returns V^T as the last factor, so its rows are the principal axes
        u, s, v = np.linalg.svd(X)
        self._components = v[:self._n_components]
        # Explained variance: squared singular values scaled by 1 / (n - 1)
        explained_variance = np.square(s[:self._n_components]) / (n - 1)
        self._explained_variance = explained_variance
        return self
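As a quick sanity check (a sketch assuming scikit-learn is installed; the data is random, and the signs of individual axes may differ since each singular vector is only defined up to a sign flip), the SVD implementation can be compared against sklearn:

import numpy as np
from sklearn.decomposition import PCA as SklearnPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

ours = PCASVD(n_components=3).fit(X)
ref = SklearnPCA(n_components=3).fit(X)

# Compare absolute values because each principal axis is only defined up to a sign flip
assert np.allclose(np.abs(ours.components), np.abs(ref.components_))
assert np.allclose(ours.explained_variance, ref.explained_variance_)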
The second implementation uses Eigenvalue Decomposition. As mentioned, this is less stable: when computing the product $\mathbf{X}^\top\mathbf{X}$ for a numerically bad matrix (more formally, when the Condition Number of the matrix is high), the error will increase. This problem is avoided when using SVD, since we don't compute the covariance matrix explicitly from the observation matrix.
class PCAEign(PCA):
    def fit(self, X):
        """
        Finds the principal axes and their explained variance
        :param X: Numpy array
        :return: self
        """
        n, _ = X.shape
        self._mean = np.mean(X, axis=0)
        X = X - self._mean  # center a copy instead of mutating the input
        covariance = 1 / (n - 1) * np.matmul(X.T, X)
        # The covariance matrix is symmetric, so use eigh; it returns
        # eigenvalues in ascending order, so flip them to descending
        eigenvalues, eigenvectors = np.linalg.eigh(covariance)
        eigenvalues = eigenvalues[::-1]
        eigenvectors = eigenvectors[:, ::-1]
        # Eigenvectors come back as columns; store the top ones as rows
        self._components = eigenvectors[:, :self._n_components].T
        self._explained_variance = eigenvalues[:self._n_components]
        return self
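Finally, a short sketch with random data showing that the two implementations agree (again comparing absolute values because of the sign ambiguity):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

svd_pca = PCASVD(n_components=3).fit(X)
eig_pca = PCAEign(n_components=3).fit(X)

assert np.allclose(np.abs(svd_pca.components), np.abs(eig_pca.components))
assert np.allclose(svd_pca.explained_variance, eig_pca.explained_variance)
assert np.allclose(np.abs(svd_pca.transform(X)), np.abs(eig_pca.transform(X)))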