Pre-treatment of Data (Prior to PCA)

 

PCA is a maximum-variance projection method, so a variable with a large variance is more likely to be expressed in the model than a low-variance variable.

To give the variables equal weight in the data analysis, we standardize them.

Standardization is also known as "scaling" or "weighting", and means that the length of each coordinate axis in the variable space is regulated according to a predetermined criterion. The first time a dataset is analyzed, it is recommended to set all variable axes to equal length. The most common criterion is to scale each variable axis to the same variance (unit variance).

In unit variance (UV) scaling, one calculates the standard deviation (Sk) of each variable (column k) and obtains the scaling weight as the inverse standard deviation (1/Sk). Subsequently, each column of X is multiplied by its own weight 1/Sk. Each scaled variable then has equal (unit) variance. UV scaling is also called 'auto-scaling'.
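The UV scaling step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the data matrix is made up, and the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which estimator is used.

```python
import numpy as np

# Hypothetical data: 5 observations (rows) x 2 variables (columns)
X = np.array([
    [1.0, 1000.0],
    [2.0, 1200.0],
    [3.0, 1100.0],
    [4.0, 1500.0],
    [5.0, 1400.0],
])

# Standard deviation S_k of each column (ddof=1 is an assumed choice)
s = X.std(axis=0, ddof=1)

# The scaling weight of each variable is the inverse standard deviation, 1/S_k
weights = 1.0 / s

# Multiply each column of X by its own weight
X_uv = X * weights

print(X_uv.var(axis=0, ddof=1))  # every column now has variance 1 (up to rounding)
print(X_uv.mean(axis=0))         # the column means are still different
```

Note that scaling alone does not move the column means to a common value; that is the job of mean centering.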

The plot below shows the effect of UV scaling on the variables.



Prior to any pre-processing, the variables have different variances and mean values. After scaling to UV, the length of each variable is identical. The mean values, however, still differ.

Like many projection methods, PCA is sensitive to scaling. However, one must not overlook the risk of scaling subjectively to obtain the model one wants. Generally, UV scaling is the most objective approach and is recommended when there is no prior information about the data.
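To make the sensitivity to scaling concrete, the sketch below compares the first principal direction of the same data with and without UV scaling. Everything in it is illustrative: the synthetic data, the helper name `first_loading`, and the use of an SVD of the mean-centered matrix to obtain the principal directions are assumptions, not something taken from the reference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two correlated variables on very different scales
n = 200
t = rng.normal(size=n)
x1 = t + 0.5 * rng.normal(size=n)              # standard deviation near 1
x2 = 1000.0 * (t + 0.5 * rng.normal(size=n))   # standard deviation near 1000
X = np.column_stack([x1, x2])

def first_loading(data):
    """Return the first principal direction of the mean-centered data."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

p_raw = first_loading(X)                          # no scaling
p_uv = first_loading(X / X.std(axis=0, ddof=1))   # UV scaling first

print(np.abs(p_raw))  # almost entirely the high-variance variable
print(np.abs(p_uv))   # both variables contribute equally
```

Without scaling, the first component essentially reproduces the high-variance variable; after UV scaling, both variables carry the same weight in it.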

Sometimes no scaling at all is appropriate, especially when all the variables are expressed in the same unit, for instance with spectroscopic data, or with chemical composition measured by two different methods in the same unit (g/L).

Mean centering is the second part of the standard pre-processing procedure. With mean centering, the average value of each variable is calculated and then subtracted from the data (see the graph below). This improves the interpretability of the model.


After UV-scaling and mean-centering, all variables will have equal length and mean value zero.
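Both pre-processing steps can be combined in one small helper. The sketch below is only an illustration: the function name `autoscale` and the example matrix are hypothetical, and the sample standard deviation (`ddof=1`) is again an assumed choice. Centering first and then dividing by the standard deviation gives the same result as scaling first and then centering.

```python
import numpy as np

def autoscale(X):
    """Mean-center each column, then scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)        # average value of each variable
    std = X.std(axis=0, ddof=1)  # standard deviation S_k of each variable
    return (X - mean) / std

# Hypothetical data: 5 observations (rows) x 2 variables (columns)
X = np.array([
    [1.0, 1000.0],
    [2.0, 1200.0],
    [3.0, 1100.0],
    [4.0, 1500.0],
    [5.0, 1400.0],
])

Z = autoscale(X)

print(Z.mean(axis=0))         # approximately 0 for every variable
print(Z.var(axis=0, ddof=1))  # approximately 1 for every variable
```

A constant column has zero standard deviation and would cause a division by zero here, so such variables need to be removed or handled separately before scaling.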

Reference

  1. Chapter 3 of book of Multivariate Training by Umetrics.



