Squared distance measures; Chi-squared and Mahalanobis Distance
Chi-squared and Mahalanobis Distance (Squared Standardized Distance to Mean)
1. Using standardized distance to mean
Ø For univariate normal data, you can judge whether an observation is
likely to belong to a predefined population by how many standard
deviations it lies from the mean (e.g. its Z-score).
Ø An observation can be higher or lower
than the mean, so its standardized distance from the mean can be
positive or negative.
Ø Likewise, for a multivariate normal (multinormal)
distribution, the distance from the mean, or center, of the measurements can be
used to judge how likely an observation is to belong to a predefined population.
Ø For multivariate data, however, there is no
single positive or negative direction away from the center of the data,
so a different metric is needed.
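As a minimal sketch of the univariate case described above, the following Python snippet computes a signed Z-score and the matching two-sided tail probability. The population parameters (mean 50, standard deviation 5) and the helper name z_score are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Signed number of standard deviations between x and the mean."""
    return (x - mean) / sd

# Hypothetical population: mean 50, standard deviation 5.
population = NormalDist(mu=50.0, sigma=5.0)

for x in (42.0, 50.0, 61.0):
    z = z_score(x, population.mean, population.stdev)
    # Two-sided tail probability: chance of a value at least this far from the mean.
    p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    print(f"x={x:5.1f}  z={z:+.2f}  two-sided p={p_two_sided:.4f}")
```

The sign of z says on which side of the mean the observation lies. That sign is exactly what has no counterpart once several variables are measured together.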
2. Using squared standardized distance to mean
Ø For univariate data, you can calculate the squared standardized
distance to the mean, termed the chi-square value of an observation.
Ø The chi-square values of univariate
normal observations are expected to follow a chi-square distribution with 1
degree of freedom, which is a positively skewed distribution.
Ø When only one variable is measured,
the chi-squared distribution relates simply to the normal distribution.
The proportion of observations below 4 chi-squared units is the same as the
proportion of observations within two standard deviations on either side of the mean
(about 95%), because the normal distribution is symmetric and squaring maps both sides onto the same value.
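The equivalence in the previous point can be checked numerically. The sketch below uses only the Python standard library. The helper chi2_cdf_1df is a hypothetical name, and it relies on the fact that for a standard normal Z, Z² ≤ x exactly when |Z| ≤ √x.

```python
import math
from statistics import NormalDist

def chi2_cdf_1df(x):
    """CDF of the chi-square distribution with 1 degree of freedom.

    If Z is standard normal, Z**2 <= x exactly when |Z| <= sqrt(x),
    so the CDF equals erf(sqrt(x / 2)).
    """
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0

std_normal = NormalDist()
within_two_sd = std_normal.cdf(2.0) - std_normal.cdf(-2.0)
below_four = chi2_cdf_1df(4.0)

print(f"P(|Z| < 2)    = {within_two_sd:.4f}")
print(f"P(chi2_1 < 4) = {below_four:.4f}")
```

Both probabilities come out at about 0.9545, since a chi-square value of 4 corresponds to a Z-score of +2 or -2.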
Ø For multivariate data, the covariance between the variables, in addition
to the variance of each variable, determines how far an observation lies
from the center of the data.
Ø A distance measure that
accounts for covariance among the variables is therefore required. Another squared distance,
the squared Mahalanobis distance, which accounts for both the variances and the covariances
of the variables, serves this purpose.
Ø The squared Mahalanobis distance of multivariate
normal observations is expected to follow a chi-square distribution with k degrees
of freedom, where k is the number of variables.
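To illustrate how covariance changes the distance, here is a minimal two-variable sketch in plain Python. The covariance matrix, the two test points and the function name mahalanobis_sq_2d are hypothetical. The 95% cut-off uses the closed-form chi-square CDF for k = 2, which is 1 - exp(-x/2).

```python
import math

def mahalanobis_sq_2d(point, mean, cov):
    """Squared Mahalanobis distance d^T S^-1 d for two variables.

    cov is a 2x2 covariance matrix given as ((s_xx, s_xy), (s_yx, s_yy)).
    """
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    (s_xx, s_xy), (s_yx, s_yy) = cov
    det = s_xx * s_yy - s_xy * s_yx
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Inverse of the 2x2 matrix applied to d, written out explicitly.
    return (s_yy * dx * dx - (s_xy + s_yx) * dx * dy + s_xx * dy * dy) / det

mean = (0.0, 0.0)
cov = ((1.0, 0.8), (0.8, 1.0))  # hypothetical: unit variances, covariance 0.8

along = (1.0, 1.0)     # lies along the direction of positive correlation
against = (1.0, -1.0)  # same Euclidean distance, but against the correlation

d2_along = mahalanobis_sq_2d(along, mean, cov)
d2_against = mahalanobis_sq_2d(against, mean, cov)

# 95% point of the chi-square distribution with k = 2 degrees of freedom,
# from its closed-form CDF 1 - exp(-x / 2).
threshold = -2.0 * math.log(0.05)

print(f"along the correlation:   d^2 = {d2_along:.3f}")
print(f"against the correlation: d^2 = {d2_against:.3f}")
print(f"95% chi-square cut-off (k=2): {threshold:.3f}")
```

The two points are equally far from the center in Euclidean terms, yet only the one lying against the correlation exceeds the cut-off. This is the effect of the covariance term.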
Ø Once k is greater than or equal to 3,
the chi-square distribution no longer has its maximum at 0. This means that the center of the population
is no longer the distance at which data are most likely to be found. The density of the
multivariate normal distribution is still greatest at the center, but the
hypersurface of the hypersphere that represents equal Mahalanobis distance from
the center initially grows faster than the density falls, so the result is a trade-off between density and
area.
Ø When k is large, the chi squared
distribution resembles a normal distribution.
Ø The distribution is positively skewed, but skewness decreases with more degrees of freedom.
Ø The squared Mahalanobis distance of multivariate
data follows a chi-square distribution only if the underlying distribution is multivariate
normal.
Ø The mean of the chi-squared distribution
equals k, and its variance equals 2k.
a. If k ≥ 2, the maximum of the distribution occurs at a chi-square value of k - 2.
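The shape properties listed above can be explored with a short standard-library sketch. The function names chi2_pdf and numeric_mode are hypothetical, and the mode is found by a crude grid search rather than analytically. The skewness formula sqrt(8/k) is the standard closed form for the chi-square distribution and is not taken from the text above.

```python
import math

def chi2_pdf(x, k):
    """Density of the chi-square distribution with k degrees of freedom (x > 0)."""
    log_pdf = ((k / 2.0 - 1.0) * math.log(x) - x / 2.0
               - (k / 2.0) * math.log(2.0) - math.lgamma(k / 2.0))
    return math.exp(log_pdf)

def numeric_mode(k, step=0.01, upper=60.0):
    """Grid-search the x that maximises the density (crude but dependency-free)."""
    grid = [step * i for i in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda x: chi2_pdf(x, k))

for k in (2, 3, 5, 10, 30):
    mode = numeric_mode(k)
    skewness = math.sqrt(8.0 / k)  # closed-form skewness of the chi-square distribution
    print(f"k={k:2d}  mode~{mode:5.2f} (k-2={k - 2})  "
          f"mean={k}  var={2 * k}  skew={skewness:.2f}")
```

For k = 2 the grid search returns its smallest grid point, because the density decreases monotonically from 0. For larger k the numerical mode sits at k - 2 to within the grid step, and the skewness shrinks as k grows.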
1. Chemometrics column by Richard G. Brereton, J. Chemometrics 2015, 29: 9–12