Some of the methods described in this chapter take a high-dimensional data
matrix as input and return the result as a "rearrangement" of this input. The
rearrangement is usually performed with respect to some similarity or dissimilarity
measure. In clustering algorithms, two relatively similar objects should be
placed in the same cluster. In projection algorithms, they should be placed
in the vicinity of each other in the projected space.
The input to our algorithms is an n x m matrix consisting of n vectors of length
m; we assume that all vectors are of the same length. When we refer to vector
x in state k, we mean element (x,k) of the input matrix. We denote the
distance between vector x and vector y by d(x,y).
If the input matrix consists of n vectors with 1 state each (an n x 1 input
matrix), the similarity decision is simple: d(x,y) = |x1-y1|. If vector x has a
value of 3.0 and vector y has a value of 1.0, the distance between them is
d(x,y) = 3.0 - 1.0 = 2.0. The distance from x to y should be the same as the
distance from y to x (symmetric distance), so we define the distance as (1):

d(x,y) = |x1-y1| = |y1-x1|    (1)
The most commonly used proximity measure, at least for ratio scales (scales with an absolute zero point), is the Minkowski metric (3), which is a generalization of the distance between points in Euclidean space:

d(x,y) = ( Σ k=1..m |xk-yk|^r )^(1/r)    (3)
The following is a list of the common Minkowski distances for specific values of r; a short code sketch covering these cases follows the list:
r = 1. Manhattan / city-block distance. A common example of this is the Hamming distance, which is simply the number of bits that differ between two binary vectors.
r = 2. Euclidean distance. The most common measure of the distance between two points.
r → ∞. Supremum (Lmax norm, L∞ norm) distance. This is the maximum difference between any component of the vectors.
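To make the list concrete, the following is a minimal sketch of a Minkowski distance function in Java (the class and method names are illustrative, not taken from the J-Express source). It computes equation (3) for an arbitrary r and treats an infinite r as the supremum distance:

    /** Minimal sketch of the Minkowski distance (3); names are illustrative only. */
    public final class Minkowski {

        /** d(x,y) = (sum over k of |xk-yk|^r)^(1/r); pass Double.POSITIVE_INFINITY for the supremum distance. */
        public static double distance(double[] x, double[] y, double r) {
            if (x.length != y.length) {
                throw new IllegalArgumentException("vectors must have the same length");
            }
            if (Double.isInfinite(r)) {            // r -> infinity: maximum component difference
                double max = 0.0;
                for (int k = 0; k < x.length; k++) {
                    max = Math.max(max, Math.abs(x[k] - y[k]));
                }
                return max;
            }
            double sum = 0.0;
            for (int k = 0; k < x.length; k++) {
                sum += Math.pow(Math.abs(x[k] - y[k]), r);
            }
            return Math.pow(sum, 1.0 / r);
        }

        public static void main(String[] args) {
            double[] x = {3.0, 0.0};
            double[] y = {1.0, 1.0};
            System.out.println(distance(x, y, 1));                         // Manhattan: 3.0
            System.out.println(distance(x, y, 2));                         // Euclidean: ~2.236
            System.out.println(distance(x, y, Double.POSITIVE_INFINITY));  // supremum: 2.0
        }
    }

Note that r = 1 and r = 2 reproduce the Manhattan and Euclidean distances above, and that the symmetry required by equation (1) follows directly from the absolute value.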
The distance functions implemented in J-Express:
A weakness of the standard Minkowski distance measure is that if one of the input attributes has a relatively large range, it can overpower the other attributes. For example, if x0 has a value range of 0-100 and x1 has a value range of 0-10, then x1's influence on the distance function will usually be overpowered by x0's influence. This is normally not a problem in microarray experiments, as all attributes generally have the same value span. However, to prevent overpowering, the data is often normalized. A simple way of normalizing the vectors is variance and mean normalization (7):

x'k = (xk - mean(x)) / std(x)    (7)

where mean(x) is the mean and std(x) the standard deviation of the elements of vector x.
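The following is a minimal sketch of the normalization in equation (7) in Java (again with illustrative names, not the actual J-Express API). Each vector is shifted to mean 0 and scaled to standard deviation 1:

    /** Sketch of variance and mean normalization (7); names are illustrative only. */
    public final class Normalize {

        /** Returns a copy of x shifted to mean 0 and scaled to standard deviation 1. */
        public static double[] meanVariance(double[] x) {
            double mean = 0.0;
            for (double v : x) mean += v;
            mean /= x.length;

            double var = 0.0;
            for (double v : x) var += (v - mean) * (v - mean);
            double std = Math.sqrt(var / x.length);

            double[] out = new double[x.length];
            for (int k = 0; k < x.length; k++) {
                // A constant vector has zero variance; here we simply center it at 0.
                out[k] = (std == 0.0) ? 0.0 : (x[k] - mean) / std;
            }
            return out;
        }
    }

How zero-variance (flat) vectors are treated is a design choice; the sketch above simply centers them at 0.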
It is very important to understand the effect that normalization has on the data. In the case of equation (7), vectors with very small absolute values will be scaled to have the same variation as vectors with initially very large values (Figure 1). Sometimes this is not desirable.
If we are looking only for profile similarity, that is, the shape of the expression profiles, normalization prior to the distance calculation is appropriate and allows a simple distance measure (e.g., Euclidean) to be used. However, if the absolute values of the vector carry meaning, this information is lost after variance normalization.
Typically, the different distance measures fall into one of two classes: metric and semi-metric. To be classified as metric, a distance d between two vectors x and y must obey the following rules:

1. d(x,y) ≥ 0 (the distance is non-negative)
2. d(x,y) = 0 if and only if x = y
3. d(x,y) = d(y,x) (the distance is symmetric)
4. d(x,z) ≤ d(x,y) + d(y,z) (the triangle inequality)
Distance measures that obey the first three rules but fail to obey rule 4 are referred to as semi-metric.
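A standard example of a semi-metric (our example, not one singled out by J-Express) is the squared Euclidean distance: it satisfies rules 1-3 but can violate the triangle inequality, as this small check illustrates:

    /** The squared Euclidean distance is a semi-metric: rule 4 can fail. */
    public final class SemiMetricDemo {

        static double squaredEuclidean(double[] a, double[] b) {
            double sum = 0.0;
            for (int k = 0; k < a.length; k++) {
                double diff = a[k] - b[k];
                sum += diff * diff;
            }
            return sum;
        }

        public static void main(String[] args) {
            double[] x = {0.0};
            double[] y = {1.0};
            double[] z = {2.0};
            // d(x,z) = 4, but d(x,y) + d(y,z) = 1 + 1 = 2, so rule 4 fails.
            System.out.println(squaredEuclidean(x, z));                          // 4.0
            System.out.println(squaredEuclidean(x, y) + squaredEuclidean(y, z)); // 2.0
        }
    }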