Squared distance measures; Chi-squared and Mahalanobis Distance

 Chi-squared and Mahalanobis Distance
(Squared standardized distance to mean)

1.      Using standardized distance to mean

Ø  For univariate normal data, you can judge whether an observation is likely to belong to a predefined population by how many standard deviations it lies from the mean (e.g. its Z-score).


Ø  An observation can lie above or below the mean, so its standardized distance from the mean can be positive or negative.


Ø  Likewise, for a multivariate normal (multinormal) distribution, the distance from the mean, or center, of the measurements can be used to judge how likely an observation is to belong to a predefined population.


Ø  For multivariate data, however, there is no single positive or negative direction away from the center, so a different metric is needed.
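The univariate case described above can be sketched in a few lines of Python. This is only an illustration: the function name `z_score` and the population values (mean 50, standard deviation 5) are made up for the example and do not come from the reference.

```python
def z_score(x, mean, sd):
    """Signed standardized distance of x from the mean, in units of sd."""
    return (x - mean) / sd


# Hypothetical population: mean 50, standard deviation 5.
mean, sd = 50.0, 5.0
for x in (42.0, 50.0, 61.0):
    # Negative values lie below the mean, positive values above it.
    print(x, z_score(x, mean, sd))
```

The sign tells you on which side of the mean the observation lies, which is exactly the information that has no counterpart once there is more than one variable.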


2.      Using squared standardized distance to mean

Ø  For univariate data, you can calculate the squared standardized distance to the mean (the squared Z-score), which is termed the chi-square value of an observation.


Ø  The chi-square values of univariate normal observations are expected to follow a chi-square distribution with 1 degree of freedom, which is a positively skewed distribution.


Ø  When only one variable is measured, the chi-squared distribution relates simply to the normal distribution. Because the normal distribution is symmetric, the proportion of observations with a chi-squared value below 4 equals the proportion lying within two standard deviations on either side of the mean (about 95%).
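The link between a chi-squared value of 4 and two standard deviations can be checked numerically. The sketch below uses only the Python standard library; the helper names, the sample size, and the seed are arbitrary choices for illustration. It relies on the identity P(Z² < x) = erf(√(x/2)) for a standard normal Z.

```python
import math
import random


def chi2_1df_cdf(x):
    """P(Z**2 < x) for standard normal Z, i.e. the 1-df chi-squared CDF."""
    return math.erf(math.sqrt(x / 2.0))


def simulated_fraction_below(threshold, n=100_000, seed=0):
    """Fraction of n squared standard normal draws that fall below threshold."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) ** 2 < threshold)
    return hits / n


exact = chi2_1df_cdf(4.0)                # same as P(-2 < Z < 2)
approx = simulated_fraction_below(4.0)   # Monte Carlo estimate of the same quantity
print(round(exact, 4), round(approx, 4))
```

The exact value rounds to 0.9545, and the simulated fraction should agree with it to within sampling error.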


Ø  For multivariate data, how far an observation lies from the center depends on the covariance between the variables, in addition to the variance of each variable.


Ø  A distance measure that accounts for the covariance among the variables is therefore required. The squared Mahalanobis distance is such a measure: it takes both the variances and the covariances of the variables into account.


Ø  The squared Mahalanobis distance of multivariate normal observations is expected to follow a chi-square distribution with k degrees of freedom, where k is the number of variables.
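To see how covariance changes the picture, here is a minimal sketch for the two-variable case, assuming the usual definition of the squared Mahalanobis distance, (x − μ)ᵀ Σ⁻¹ (x − μ). The function names, the covariance matrix, and the two test points are hypothetical. The tail probability uses the closed form exp(−d²/2), which holds for k = 2 only.

```python
import math


def mahalanobis_sq_2d(x, mean, cov):
    """Squared Mahalanobis distance for two variables.

    cov is [[var1, cov12], [cov12, var2]]; the 2x2 inverse is written out by hand.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # (dx, dy) @ inverse(cov) @ (dx, dy)^T, with inverse = [[d, -b], [-c, a]] / det
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def chi2_2df_tail(d2):
    """P(chi-squared with 2 df > d2); this closed form is valid for k = 2 only."""
    return math.exp(-d2 / 2.0)


# Hypothetical population: unit variances, correlation 0.8, centered at the origin.
mean = (0.0, 0.0)
cov = [[1.0, 0.8], [0.8, 1.0]]

along = mahalanobis_sq_2d((1.0, 1.0), mean, cov)     # follows the correlation
against = mahalanobis_sq_2d((1.0, -1.0), mean, cov)  # goes against the correlation
print(along, chi2_2df_tail(along))
print(against, chi2_2df_tail(against))
```

Both points are the same Euclidean distance from the center, yet their squared Mahalanobis distances are roughly 1.1 and 10. Under the 2-degrees-of-freedom chi-square distribution, the first point is unremarkable (tail probability near 0.57), while the second is very unlikely to belong to the population (tail probability below 0.01).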

Ø  Once k is greater than or equal to 3, the chi-squared distribution no longer has its maximum at 0. This means that the center of the population is no longer where the data are most likely to be found. Although the density is still greatest at the center, the surface of the hypersphere that represents equal Mahalanobis distance from the center grows with distance, and initially it grows faster than the density falls, so the result is a trade-off between density and surface area.
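The trade-off can be written out explicitly. The following is a sketch for standardized, uncorrelated variables, where r denotes the Mahalanobis distance and x = r² its square; the symbols r, x, S_k and f_k are notation introduced here, not taken from the reference.

```latex
% Surface area of the (k-1)-sphere of radius r, and the multinormal density on it
S_k(r) = \frac{2\pi^{k/2}}{\Gamma(k/2)}\, r^{k-1},
\qquad
\phi_k(r) = (2\pi)^{-k/2}\, e^{-r^2/2}

% Density of x = r^2: shell area times point density, with dr = dx / (2\sqrt{x})
f_k(x) = S_k(\sqrt{x})\,\phi_k(\sqrt{x})\,\frac{1}{2\sqrt{x}}
       = \frac{x^{k/2-1}\, e^{-x/2}}{2^{k/2}\,\Gamma(k/2)}

% The growing factor x^{k/2-1} (area) competes with the decaying factor e^{-x/2} (density)
\frac{d}{dx}\ln f_k(x) = \frac{k/2 - 1}{x} - \frac{1}{2} = 0
\;\Longrightarrow\; x = k - 2 \quad (k > 2)
```

For k = 2 the area factor is constant in x, so the maximum stays at 0; for k ≥ 3 the area factor vanishes at the center and pushes the maximum out to k − 2.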

Ø  When k is large, the chi-squared distribution resembles a normal distribution.


Ø  The chi-squared distribution is positively skewed, but the skewness decreases as the degrees of freedom increase.


Ø  The squared Mahalanobis distance of multivariate data follows a chi-square distribution only if the underlying distribution is multivariate normal.


Ø  The mean of the chi-squared distribution equals k and its variance equals 2k.

a.      If k > 2, the maximum of the distribution occurs at a chi-square value of k-2.
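The mean, variance, and location of the maximum stated above can be checked with a short simulation. This is a rough sketch using only the standard library: it assumes the standard chi-squared density formula, and the choice k = 5, the sample size, the seed, and the grid step are arbitrary.

```python
import math
import random


def chi2_pdf(x, k):
    """Chi-squared density with k degrees of freedom (x > 0)."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))


def simulate_chi2(k, n=20_000, seed=1):
    """Draw n values, each the sum of k squared standard normal deviates."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k)) for _ in range(n)]


k = 5
draws = simulate_chi2(k)
sample_mean = sum(draws) / len(draws)
sample_var = sum((v - sample_mean) ** 2 for v in draws) / (len(draws) - 1)

# Locate the density peak on a grid of step 0.01 over (0, 20].
grid = [i / 100 for i in range(1, 2001)]
mode = max(grid, key=lambda x: chi2_pdf(x, k))

print(sample_mean, sample_var, mode)
```

With k = 5 the sample mean should land close to 5, the sample variance close to 10, and the grid search should place the peak of the density at k − 2 = 3.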


Reference:

1. Chemometric column by Richard G. Brereton, J. Chemometrics 2015, 29: 9–12

2. Chi-square distribution - Minitab