Clear Understanding on Mahalanobis Distance


In multivariate (multi-characteristic) data, a measure of divergence or distance between groups in terms of multiple characteristics is required.

Suppose you are interested in measuring the difference (distance) between two groups G1 and G2 (each p-dimensional). A common assumption is that the p-dimensional random vector X has the same variation about its mean within either group.

The difference between the groups can then be expressed as the difference between the mean vectors of X in each group, relative to the common within-group variation (using the common, pooled covariance matrix).


The most frequently used such measure for multivariate data is the Mahalanobis distance (Mahalanobis Δ, where Δ is an uppercase Delta).

The square of the Mahalanobis distance is given by

Δ² = (µ1 − µ2)ᵀ Σ⁻¹ (µ1 − µ2)        or        Δ² = (µ1 − µ2)′ Σ⁻¹ (µ1 − µ2)

where the superscript T or ′ denotes matrix transpose, and Σ denotes the common covariance matrix of X in the groups G1 and G2.

Since Σ is a nonsingular covariance matrix, it is positive-definite. Hence Δ is a metric.


If the variables in X were uncorrelated within each group and were scaled to have unit variances, then Σ would be the identity matrix I, and the squared Mahalanobis distance Δ² would reduce to the squared Euclidean distance between the group mean vectors µ1 and µ2.
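A minimal sketch of this special case in NumPy (the mean vectors here are made-up illustrations): with Σ = I, the squared Mahalanobis distance coincides with the squared Euclidean distance between the means.

```python
import numpy as np

# Hypothetical group means (p = 3); the values are illustrative only
mu1 = np.array([1.0, 2.0, 3.0])
mu2 = np.array([2.0, 0.0, 1.0])

def mahalanobis_sq(m1, m2, cov):
    """Squared Mahalanobis distance (m1 - m2)' Sigma^{-1} (m1 - m2)."""
    d = m1 - m2
    return float(d @ np.linalg.inv(cov) @ d)

# With Sigma = I, the quadratic form collapses to a plain dot product,
# i.e. the squared Euclidean distance between the mean vectors.
print(mahalanobis_sq(mu1, mu2, np.eye(3)))   # 9.0
print(float(np.sum((mu1 - mu2) ** 2)))       # 9.0
```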

Because Σ is symmetric, it equals its own transpose, and its inverse Σ⁻¹ is therefore symmetric as well.


The presence of the inverse covariance matrix Σ⁻¹ of X in the quadratic form of the Mahalanobis distance formula allows for the different scales on which the variables are measured and for non-zero correlations between the variables.

Alternatively, the quadratic form in Σ⁻¹ has the effect of transforming the variables to uncorrelated, standardized variables Y, and computing the squared Euclidean distance between the mean vectors of Y in the two groups.
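This equivalence can be sketched with a whitening transform. Assuming an illustrative positive-definite Σ and made-up means, mapping x to L⁻¹x (with L the Cholesky factor of Σ) produces variables with identity covariance, and the ordinary squared Euclidean distance between the transformed means equals Δ²:

```python
import numpy as np

# Illustrative Sigma (positive definite) and group means; values are assumptions
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
mu1 = np.array([0.0, 0.0])
mu2 = np.array([1.0, 2.0])

# Whitening: with L the Cholesky factor (Sigma = L L'), the map
# y = L^{-1} x yields uncorrelated, unit-variance variables.
L = np.linalg.cholesky(Sigma)
y1 = np.linalg.solve(L, mu1)
y2 = np.linalg.solve(L, mu2)

delta_sq_direct = (mu1 - mu2) @ np.linalg.inv(Sigma) @ (mu1 - mu2)
delta_sq_whitened = np.sum((y1 - y2) ** 2)
print(np.isclose(delta_sq_direct, delta_sq_whitened))  # True
```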


To understand the quadratic form of a matrix: if A is a square matrix, the quadratic form is computed with a vector x as

Q(x) = xᵀ A x = Σi Σj aij xi xj.

By looking at the exponents in the expanded expression (every term has total degree 2 in the entries of x), you can see why this is called a quadratic form of A.
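A small numerical illustration (the matrix A and vector x are arbitrary choices):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x = np.array([1.0, 2.0])

# Quadratic form x' A x; expanded, every term has degree 2 in x:
# here 2*x1^2 + 2*x1*x2 + 3*x2^2 = 2 + 4 + 12 = 18
q = x @ A @ x
print(q)  # 18.0
```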


It is now known that many standard distance measures such as Kolmogorov's variational distance, the Hellinger distance, Rao's distance, etc., are increasing functions of Mahalanobis distance under assumptions of normality and homoscedasticity and in certain other situations.


Sample Version of the Mahalanobis Distance, D:

In practice, the means µ1 and µ2 and the common covariance matrix Σ of the two groups G1 and G2 are generally unknown and must be estimated from random samples of sizes n1 and n2 from G1 and G2, yielding sample means x̄1 and x̄2 and (bias-corrected) sample covariance matrices S1 and S2.

The common covariance matrix Σ can then be estimated by the pooled estimate, given by

S = [(n1 − 1) S1 + (n2 − 1) S2] / N

where N = n1 + n2 − 2.
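A minimal sketch of the pooled estimate in NumPy (the sample sizes and data are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two illustrative samples (sizes and data are assumptions)
X1 = rng.normal(size=(30, 2))   # n1 = 30 observations from G1
X2 = rng.normal(size=(20, 2))   # n2 = 20 observations from G2

n1, n2 = len(X1), len(X2)
S1 = np.cov(X1, rowvar=False)   # bias-corrected (divides by n1 - 1)
S2 = np.cov(X2, rowvar=False)

# Pooled estimate: S = ((n1 - 1) S1 + (n2 - 1) S2) / N, with N = n1 + n2 - 2
N = n1 + n2 - 2
S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / N
print(S_pooled.shape)  # (2, 2)
```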

The sample version of Δ² is denoted by D² and is given by

D² = (x̄1 − x̄2)ᵀ S⁻¹ (x̄1 − x̄2).
The sample Mahalanobis distance D² is known to overestimate its population counterpart Δ².
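Putting the pieces together, a sketch of computing D² from two samples (the data are simulated assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative samples from two groups (locations chosen arbitrarily)
X1 = rng.normal(loc=0.0, size=(25, 3))
X2 = rng.normal(loc=1.0, size=(25, 3))

n1, n2 = len(X1), len(X2)
xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)

# Pooled covariance, then the sample Mahalanobis D^2
S = ((n1 - 1) * np.cov(X1, rowvar=False)
     + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
diff = xbar1 - xbar2
D_sq = float(diff @ np.linalg.inv(S) @ diff)
print(D_sq > 0)  # True: S is positive definite, so D^2 is positive here
```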

In situations where D² is used, knowledge of the distribution of D² is needed.

It follows, under the assumption of normality, that cD² is distributed as a noncentral F-distribution with p and N − p + 1 degrees of freedom and noncentrality parameter cΔ², where c = k(N − p + 1)/(pN) and k = n1n2/(n1 + n2).


When the Mahalanobis distance is used to test whether an observed random sample x1, ..., xn comes from a multivariate normal distribution, then under the null hypothesis the Dj² should be (approximately) independently distributed, with a common distribution that can be approximated by a chi-squared distribution with p degrees of freedom, where Dj² is the distance for the jth observation (j = 1, ..., n).
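A sketch of this chi-squared check on simulated data (sample size and dimension are arbitrary assumptions): under normality, roughly 95% of the Dj² should fall below the chi-squared(p) 0.95 quantile.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
p, n = 2, 200
X = rng.normal(size=(n, p))          # simulated multivariate normal sample

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
# D_j^2 for every observation x_j against the sample mean
D_sq = np.einsum('ij,jk,ik->i', X - xbar, S_inv, X - xbar)

# Under the null, about 95% of the D_j^2 should lie below the
# chi-squared(p) 0.95 quantile
frac = np.mean(D_sq < chi2.ppf(0.95, df=p))
print(frac)
```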

The Mahalanobis formula, in terms of the respective observation, is

Dj² = (xj − x̄)ᵀ S⁻¹ (xj − x̄)

where x̄ denotes the sample mean and S denotes the (bias-corrected) sample covariance matrix of the n observations in the observed sample.


Alternatively, we can form the modified Mahalanobis distances d1, ..., dn, where

dj² = (xj − x̄(j))ᵀ S(j)⁻¹ (xj − x̄(j))

where x̄(j) and S(j) denote respectively the sample mean and (bias-corrected) sample covariance matrix of the n − 1 observations after the deletion of xj (j = 1, ..., n).

In this case, the dj² can be taken to be approximately independent, with the common distribution of q·dj² given exactly by an F-distribution with p and n − p − 1 degrees of freedom, where q = (n − 1)(n − p − 1)/{pn(n − 2)}.
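The leave-one-out construction above can be sketched as follows (the simulated data, sample size and dimension are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 2
X = rng.normal(size=(n, p))          # simulated sample

def modified_mahalanobis_sq(X, j):
    """d_j^2 computed after deleting observation j (leave-one-out)."""
    X_del = np.delete(X, j, axis=0)
    xbar_j = X_del.mean(axis=0)
    S_j = np.cov(X_del, rowvar=False)   # bias-corrected, from n - 1 obs
    d = X[j] - xbar_j
    return float(d @ np.linalg.inv(S_j) @ d)

d_sq = np.array([modified_mahalanobis_sq(X, j) for j in range(n)])

# Scaling constant: q * d_j^2 ~ F(p, n - p - 1) under normality
q = (n - 1) * (n - p - 1) / (p * n * (n - 2))
print((d_sq >= 0).all())  # True: each d_j^2 is a positive quadratic form
```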


Interesting questions that can be answered using the Mahalanobis distance:

  1. How different are the metabolic characteristics of normal persons, chemical diabetics and overt diabetics, as determined by a total glucose tolerance test, and how can a diagnosis be made?
  2. On the basis of remote-sensing data from satellites, how do you classify various tracts of land by vegetation type, rock type, etc.?
Answers to the above questions are important in developing methods for medical diagnosis and in developing Geographical Information Systems (GIS).


Other questions that can be addressed are:
  1. The problem of pattern recognition or discriminant analysis (using an optimal discriminant function, measured in terms of Δ²).
  2. The classification problem (and how it differs from discriminant analysis).
