Principal component analysis is a commonly used technique in multi-variate statistics and pattern recognition literature. In this post I try to merge ideas of Geometric and Algebraic interpretation of data as vectors in a vector space and its relationship with PCA. The 3 major sources used in this blog are:

[1] Thomas D. Wickens (1995). The geometry of multivariate statistics.  Psychology Press.

[2] Bishop, C. M. (2006). Pattern Recognition And Machine Learning. Springer. http://doi.org/10.1117/1.2819119

[3] Lindsay I Smith (2002). A tutorial on Principal Component Analysis. http://snobear.colorado.edu/Markw/BioMath/Otis/PCA/principal_components.ps

The code used in the blog can be seen here.

## Vectors and Variables

In data analysis, we come across collections of numbers associated with each other that are grouped logically, e.g. a collection of ages and a collection of weights from a sample of people.  These variables can be interpreted as algebraic vectors (in row or column format) or geometric vectors. Extending this idea, the vectors can be represented in various forms, like magnitude and direction form (e.g. polar coordinate form) and component form – while both forms can be interchangeable. Taking the example from [1], we have a data set with two variables X and Y, measured on 10 different subjects, which has been centred (i.e. mean subtracted) and has a correlation of about 0.904.

> df
x  y
S1  -3 -8
S2  -3 -5
S3  -1 -3
S4  -1  0
S5   0 -1
S6   0  0
S7   1  5
S8   2  1
S9   2  6
S10  3  5
> cor(df$x, df$y);
[1] 0.9039935
> angleRad = acos(cor(df$x, df$y))
[1] 0.4417766
> rad2deg(acos(cor(df$x, df$y)))
[1] 25.31193
> magX = norm(df$x, '2') > ptX = c(x=magX * cos(0), y=magX * sin(0)) > magY = norm(df$y, '2')
## magnitude
> magX
[1] 6.164414
> magY
[1] 13.63818
## polar coordinates
> ptX
x        y
6.164414 0.000000
> ptY
x         y
12.328828  5.830952
## standard deviations of variables/vectors
> apply(df, 2, sd)
x        y
2.054805 4.546061


The data has been plotted in 3 different ways: 1) in variable space as a scatter plot (where the axes are the variables X and Y); 2) in subject space where X and Y are the the vectors and each subject is a component – this however is not possible to do for more than 3 subjects (due to my inability to plot more than 3 dimensions on a flat screen); 3) the two vectors X and Y in subject space are plotted as actual geometric vectors.

The vector representation of the two variables (bottom panel of the figure) shows a few things: the relationship between the two variables (as the angle between these vectors); the standard deviation of the two vectors and thus the variability (Y has a higher standard deviation and is longer). The code snippet above shows how to calculate the angle, magnitude and polar coordinates of these vectors for drawing purposes.

## Some Basic Vector Concepts:

I will try to summarise some basic vector concepts and operations, the details can be found in [1] or a text book on Linear Algebra.

Two vectors with the same dimension, n, can be added as:

$(a_i)+(b_i) = (a_i+b_i) \; where \; i=1, 2, 3, ...,n$

A vector a of dimension n and a scalar s, the scalar multiplication is:

Subtraction can be performed by multiplying the vector to be subtracted by -1 and then performing the addition operation.

### Length or Magnitude

The length of vector can be calculated from its algebraic component form by using the Pythagorean theorem.

This is also called an $l^2norm$ and can be calculated directly in R using the norm function. This also links with the statistical concept of sum of squares and standard deviation. Longer vectors have higher variability and standard deviation. After standardizing the variables (e.g. z-scores), all the vectors have the same length and standard deviation of 1 – and only their direction is important.

### Linear Combination

A linear combination involves addition and scalar multiplication of two or more vectors to produce a new vector. e.g.

This can also be visualised as an input output operation, where some data in the form of vectors is provided, while the function performs the scalar multiplication and addition operation – and the function is called by vectorizing the operation instead of using a loop.

lincomb = function(x, y){
return(3*x - 2*y)
}

## vectorized call to the function
lincomb(c(2,1), c(-1,3))

[1]  8 -3


### Dot or Scalar Product

The dot or scalar product of two vectors a and b also takes into account the angle between them. It can be calculated in two ways:

$a \cdot b = \left| a \right| \left| b \right| cos \theta \newline a \cdot b = a_1b_1+a_2b_2+...+a_nb_n \newline \therefore cos \theta = \frac{a \cdot b}{\left| a \right| \left| b \right|} = Pearson \;correlation \;coefficient\newline where \; \theta \;(0 \leq \theta \geq \pi) \; is \; in\; radians$

The angle between the two vectors also highlights if vectors are colinear (angle is 0 or 180 degrees and cosine is 1 or -1) or orthogonal (angle 90 degrees and cosine is 0).

> ## define 2 vectors
> x = c(2,1); y = c(-1, 3)
>
> ## dot product
> t(x) %*% y
[,1]
[1,]    1
> ## magnitude x
> norm(x, '2')
[1] 2.236068
> ## or
> sqrt(t(x) %*% x)
[,1]
[1,] 2.236068
> ## magnitude y
> norm(y, '2')
[1] 3.162278
> ## calculate cosine of angle
> 1/( 2.236068 * 3.162278)
[1] 0.1414213
> ## convert this to the angle
> acos(0.1414213)
[1] 1.428899
> ## convert it to degrees
[1] 81.86988


Some other points that I am skipping at the moment are areas, volumes and rotations. Briefly, areas and volumes can explain colinearity among variables and rotations change the relationships of vectors to the coordinate system but does not alter the relationships among the vectors – more details can be seen in [1].

## Vector Space

Suppose vectors $x_1$ and $x_2$ are a pair of vectors that are not colinear – they produce a $y$ by a linear combination:

$y=\beta_1x_1+\beta_2x_2$

such that any point in this 2 dimensional plane can be reached by appropriate coefficients. The two vectors x belong to the 2 dimensional vector space, which is closed under linear combinations – i.e. a linear combination can not produce a vector outside this space. Similarly Basis vectors are the minimal set of vectors required to represent a space – defining the dimension of the space. If the basis vectors are orthogonal and have unit length – they are called orthonormal basis.

## Projection into Subspace

Projection divides a vector into two orthogonal components – each in an orthogonal subspace.

The figure shows a vector Y in 3-D space, that is projected on a 2-D plane as vector Y’, while the second orthogonal component is R’:

$Y=Y^{'}+R_{\perp }^{'}$

These two sub-spaces split the direction and length of the vector Y in the larger space into two parts.

$dim(Y)=dim(Y^{'})+dim(R_{\perp }^{'})$

## Linear Dependence

Collinear vectors lie along the same line, and that space can be represented by only one of them. Similarly multicollinear vectors span a space that can be represented by a smaller set of vectors – i.e. any point in the space that can be reached by all vectors can also be reached by a smaller set of non-redundant vectors. More formally when the vectors are linearly dependent, there is a solution to:

$\beta_1x_1+\beta_2x_2+...+\beta_px_p=0$

where at least one coefficient is nonzero.

This is also related to the concept of volume where the volume of a set of vectors is zero within p-dimensional space, if it falls in a space with fewer than p-dimensions – i.e. the covariance matrix of the vectors is singular [1].

## Matrices

Matrices are convenient ways to organise a calculation, and keeping the data from multiple variables and subjects together. Thinking in terms of an input output system and linear combinations, a set of inputs can be organised as vectors, and a collection of multiple inputs can be organised as an input matrix. Similarly a function performing a linear combination, can be written in a vector form, or a set of functions can be organised into an operation matrix.

> ### input output operations, functions, vectors and matrices
> ## some operations or functions
> op1 = function(x, y, z){
+   return(3*x + 4*y + 5*z)
+ }
>
> op2 = function(x, y, z){
+   return(3*x + 0*y + 0*z)
+ }
>
> op3 = function(x, y, z){
+   return(4*x + (-3)*y + 2*z)
+ }
>
>
> ## some vector inputs
> inp1 = c(5, 10, 12)
> inp2 = c(4, 6, -2)
> inp3 = c(8, -5, 4)
>
> ## perform operation on input
> op1(inp1[1], inp1[2], inp1[3])
[1] 115
> op2(inp2[1], inp2[2], inp2[3])
[1] 12
> op3(inp3[1], inp3[2], inp3[3])
[1] 55


These sets of inputs and operations shown above can be organised instead into an input and output operation in matrix form:

> ## create input matrix
> mInputs = cbind(inp1, inp2, inp3)
>
> ## operations matrix
> mOperations = rbind(c(3, 4, 5),
+                     c(3, 0, 0),
+                     c(4, -3, 2))
>
> ## matrix multiplication
> ## operations %*% inputs
> mOutput = mOperations %*% mInputs
> ## resultant matrix = rows=size of operation (3) and cols=number of inputs (3)
> mOutput
inp1 inp2 inp3
[1,]  115   26   24
[2,]   15   12   24
[3,]   14   -6   55


The diagonal of the matrix shows the results of the 3 operations performed on 3 inputs. Other related concept that we usually come across often is the determinant of the matrix (which intuitively means the scale of the transformation – read here for some more details). Only square matrices have determinants, and if the determinant is zero, this means that the operation performed is destructive. If determinant is non-zero then an inverse can be found and the original input can be retrieved, which is useful in performing linear transformations.

> det(mOperations)
[1] -69
> ## get inverse of the matrix
> mOperations.inv = solve(mOperations)
> ## retrieve original input
> mOperations.inv %*% mOutput
inp1 inp2 inp3
[1,]    5    4    8
[2,]   10    6   -5
[3,]   12   -2    4
> # original matrix
> mInputs
inp1 inp2 inp3
[1,]    5    4    8
[2,]   10    6   -5
[3,]   12   -2    4


If a square matrix is non-invertiable then it is called singular, this is related to the concept of linear dependencies. If the matrix is invertible then the matrix approach can be applied to solve a system of simultaneous linear equations. These transformations use the operation or transformation matrix and transform one set of input (or points/vectors/space) into a set of outputs. Singular transformation matrix will perform a destructive operation and will compress the space e.g. a 2 dimensional space to a line and the original inputs can not be retrieved. This post gives an intuitive understanding of transformations and singular matrices.

## Eigenvectors

If is a square operation/transformation matrix, the non-zero vector v is an eigenvector of A, if Av = λv where λ is a scalar number, also called the corresponding eigenvalue. This means that transforming v using does not change the space of the output value (i.e. it lies along the same line but in the same or opposite direction), so the vector v only changes scale. These lines are the axes of the transformation defined by the matrix A.

The figure shows an input matrix (a unit square with 4 points) in grey, that undergoes a transformation and produces an output matrix shown by the output rectangle (red). The transformation matrix scales the grey matrix to produce the red matrix – while the black arrows show eigenvectors of the transformation matrix, and the red arrows show the eigenvectors scaled by multiplying with the eigenvalues.

> ### eigen vectors
> ## a matrix representing a square with 4 points in x, y coordinates
> mOriginal = cbind(c(0,0), c(1,0), c(1,1), c(0,1))
> mOriginal
[,1] [,2] [,3] [,4]
[1,]    0    1    1    0
[2,]    0    0    1    1
> ## transformation or operation to be performed
> mTransform = cbind(c(3, 0), c(0, 2))
> mTransform
[,1] [,2]
[1,]    3    0
[2,]    0    2
> mChanged = mTransform %*% mOriginal
> mChanged
[,1] [,2] [,3] [,4]
[1,]    0    3    3    0
[2,]    0    0    2    2
> eigen(mTransform)
eigen() decomposition
$values [1] 3 2$vectors
[,1] [,2]
[1,]   -1    0
[2,]    0   -1


Eigenvalues and eigenvectors are related to the concept of linear dependencies and singular matrices. A zero eigenvalue represents a singular matrix, while small (how small is small) represent near singular (near multicolinearity) matrices. Compression matrices will have small eigenvalues, and a ratio of the largest to the smallest eigenvalue should give a clue about near singularity. I add links to few sources here that I found useful to understand this (link 1, link 2).

# Principal Component Analysis

PCA is a widely used technique and has applications in dimension reduction, data compression and visualisation. A linearly dependent set of vectors span a space that can be represented by a smaller set of independent vectors. In real data sets with colinear variables, the actual variability in the data is usually concentrated in fewer dimensions than the number of vectors. Sometimes the nature of the problem allows us to combine variables by adding or subtracting (or some other combination) of these variables to form a new variable. However in other cases or in an exploratory setting, the variables may not be combined directly but rather projected into a different space using PCA. Details about PCA can be seen in Ref [1, 2] and a gentle tutorial can be found in [3]. PCA can be applied in data compression, data pre-processing and visualisation.

Using the old faithful data, I reproduce some of the panels of the Figure 12.6 from [2]. The panels from left to right show the raw data, after linear rescaling of the variables to have zero mean and unit variance (standardizing or z-scoring). The second row of figures show the more substantial adjustment of the data after PCA, to make the variables decorrelated with centring or with centring and scaling to have unit variance. Another name for this is also called whitening or sphereing the data. This kind of transformation is useful if we wish to use some sort of a clustering algorithm on the data or perhaps perform principal component regression in a prediction setting.

The linear transformation or projection is shown as an input output system in the R code below:

> data("faithful")
> mData = as.matrix(faithful)
> ## structure of the data
> str(mData)
num [1:272, 1:2] 3.6 1.8 3.33 2.28 4.53 ...
- attr(*, "dimnames")=List of 2
..$: chr [1:272] "1" "2" "3" "4" ... ..$ : chr [1:2] "eruptions" "waiting"
> ## now perform PCA transformation
> ivMeans = colMeans(mData)
> # centered data
> mData.s = sweep(mData, 2, ivMeans, '-')
> # covariance matrix for the data without scaling/standardizing
> mCov = cov(mData)
> lEigens = eigen(mCov)
> colnames(lEigens$vectors) = c('Vector1', 'Vector2') > lEigens eigen() decomposition$values
[1] 185.8818239   0.2442167

$vectors Vector1 Vector2 [1,] 0.0755118 -0.9971449 [2,] 0.9971449 0.0755118 > ## inputs are the variables i.e. the data, one column at a time > t(mData.s)[,1:6] 1 2 3 4 5 6 eruptions 0.1122169 -1.687783 -0.1547831 -1.204783 1.045217 -0.6047831 waiting 8.1029412 -16.897059 3.1029412 -8.897059 14.102941 -15.8970588 > ## operations/transformation matrix is the matrix of eigen vectors > t(lEigens$vectors)
[,1]      [,2]
Vector1  0.0755118 0.9971449
Vector2 -0.9971449 0.0755118
> ## get the transformed points after running them through the matrix, i.e. input output system
> mData.rotated = t(lEigens$vectors) %*% t(mData.s) > rownames(mData.rotated) = c('Comp1', 'Comp2') > mData.rotated[,1:6] 1 2 3 4 5 6 Comp1 8.0882802 -16.976264 3.0823940 -8.9626322 14.14160220 -15.8973395 Comp2 0.4999712 0.407037 0.3886498 0.5295104 0.02270577 -0.5973592 > ## get the original data back > ## rowDataMatrix = (inverse(rowEigenVectors) * rotated Data) + original Means > solve(t(lEigens$vectors)) ## this equals the transpose of the rowEigenVectors matrix
Vector1    Vector2
[1,] 0.0755118 -0.9971449
[2,] 0.9971449  0.0755118
> mData.original.s = (solve(t(lEigens\$vectors)) %*% mData.rotated)
> mData.original.s[,1:6]
1          2          3         4         5           6
[1,] 0.1122169  -1.687783 -0.1547831 -1.204783  1.045217  -0.6047831
[2,] 8.1029412 -16.897059  3.1029412 -8.897059 14.102941 -15.8970588
> ## add the mean to un-center the data
> mData.original = sweep(mData.original.s, 1, ivMeans, '+')
> mData.original[,1:6]
1    2      3      4      5      6
[1,]  3.6  1.8  3.333  2.283  4.533  2.883
[2,] 79.0 54.0 74.000 62.000 85.000 55.000
> t(mData)[,1:6]
1    2      3      4      5      6
eruptions  3.6  1.8  3.333  2.283  4.533  2.883
waiting   79.0 54.0 74.000 62.000 85.000 55.000


The PCA equation can also be presented in a Bayesian framework where the likelihood function representing the distribution of the observed variable is Gaussian, conditioned on the latent variable. This is a different view of PCA based on mapping from projected space to the original data space. Chapter 12 in Ref [2] is a good place to read the details. I tried to implement this Bayesian model in Stan in 4 different ways, 3 of which seem to be adequate, but require a large simulation size and can still produce small effective sample sizes if the number of variables is small. A couple of other resources where I looked for some guidance are the Stan manual, here and here.

The figures above show the Iris data set with the PCA performed analytically and using 3 different Stan models with slightly different parameterizations (the models can be seen at github. I sampled from this model 3 chains with 5000 iterations each – we can see that the structure in the data with 2 species (blue and green) more similar to each other than the 3rd species (red). I can see a few reasons where we may want to perform PCA using an MCMC approach which include: missing data, variable containing too many zeros thus causing problems with scaling, using likelihood other than a normal distribution, or mixture distributions.

Data is considered high-dimensional where the number of subjects (N) is smaller than the number of variables or features (D) – i.e. N < D. In this case the number of eigen vectors/values that are extracted from the high-dimensional matrix will be no greater than N. The data I use in this example can be found here (it comes from a normalised set of gene expression data). In this matrix, there are a lot of zeros present in some variables and hence a PCA with scaling does not work, unless we add a slight jitter to every point (say normal(0, 1)).

Looking at both the figures we can see that the MCMC approach shows 4 possible clusters in the data, which is the same as the analytical approach. Another use for Bayesian PCA can be to decide how many dimensions to use, particularly if we place a prior over the scale parameter of the eigen vectors [2]. I use an independent normal prior over each eigen vector, where the scale hyperparameter has a weak prior e.g. cauchy(0, 2.5), or a gamma(0.5, 1e-3).

The 2 plots show the standard deviations of the eigen vectors (sorted from high to low). The plot on the left shows the scale parameter and suggests that most of the data variation can be captured in one dimension, while the first 6 dimensions are perhaps enough. A similar trend can be seen in the analytical SD (the square root of the eigen values) and suggests a similar trend of using the first 6 to 7 dimensions.