Chi Square Distance

The chi-squared distance d(x,y) is a distance between two histograms x=[x_1,...,x_n] and y=[y_1,...,y_n], each having n bins. Both histograms are assumed to be normalized, i.e. their entries sum to one.
The distance measure d is usually defined (although alternative definitions exist) as d(x,y) = sum( (xi-yi)^2 / (xi+yi) ) / 2, where the sum runs over the bins i = 1,...,n. It is often used in computer vision to compute distances between bag-of-visual-words representations of images.
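The definition above translates almost directly into code. The following is a minimal sketch in plain Python; the function name chi2_distance and the handling of bins that are empty in both histograms (they are skipped to avoid a division by zero) are assumptions made for illustration, not part of the definition.

```python
def chi2_distance(x, y):
    """Chi-squared distance between two histograms with the same number of bins."""
    if len(x) != len(y):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for xi, yi in zip(x, y):
        denom = xi + yi
        # Assumption: a bin that is empty in both histograms contributes nothing.
        if denom > 0:
            total += (xi - yi) ** 2 / denom
    return total / 2.0


# Two normalized 3-bin histograms (illustrative values).
x = [0.2, 0.3, 0.5]
y = [0.1, 0.4, 0.5]
print(chi2_distance(x, y))
```

With this definition, two identical histograms are at distance 0, and two normalized histograms with no overlapping mass are at distance 1.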

The name of the distance is derived from Pearson's chi-squared test statistic X²(x,y) = sum( (xi-yi)^2 / xi ) for comparing discrete probability distributions (i.e. histograms). However, unlike the test statistic, d(x,y) is symmetric with respect to x and y, which is often useful in practice, e.g. when you want to construct a kernel from the histogram distances.
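The difference in symmetry can be checked numerically. The sketch below compares the test statistic and the distance on one pair of histograms, and builds a kernel value from the distance. The exponential form exp(-gamma * d) and the parameter name gamma are assumptions here (one common way to turn a distance into a kernel), as are all function names.

```python
import math


def pearson_chi2(x, y):
    # X^2(x, y) = sum((xi - yi)^2 / xi); assumes every xi > 0.
    return sum((xi - yi) ** 2 / xi for xi, yi in zip(x, y))


def chi2_distance(x, y):
    # d(x, y) = sum((xi - yi)^2 / (xi + yi)) / 2, skipping bins empty in both.
    return sum((xi - yi) ** 2 / (xi + yi)
               for xi, yi in zip(x, y) if xi + yi > 0) / 2.0


def chi2_kernel(x, y, gamma=1.0):
    # Assumed kernel form: exp(-gamma * d(x, y)).
    return math.exp(-gamma * chi2_distance(x, y))


x = [0.2, 0.8]
y = [0.5, 0.5]
print(pearson_chi2(x, y), pearson_chi2(y, x))    # the two values differ
print(chi2_distance(x, y), chi2_distance(y, x))  # the two values agree
print(chi2_kernel(x, y), chi2_kernel(y, x))      # the two values agree
```

Because the kernel value depends only on d(x,y), it inherits the symmetry of the distance.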

Chi-Square Distance

Consider a frequency table with n rows and p columns. From it we can calculate row profiles and column profiles, and plot the n or p points corresponding to each profile. We can then define distances between these points. The Euclidean distance between the components of the profiles, on which a weighting is defined (each term has a weight that is the inverse of its frequency), is called the chi-square distance. The name of the distance is derived from the fact that the mathematical expression defining it is identical to the one encountered in the derivation of the chi-square goodness of fit test.

MATHEMATICAL ASPECTS

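Let (f_ij) be the frequency of the ith row and jth column in a frequency table with n rows and p columns. The chi-square distance between two rows i and i' is given by the formula:

$$d(i,i') = \sqrt{\sum_{j=1}^{p} \frac{1}{f_{.j}} \left( \frac{f_{ij}}{f_{i.}} - \frac{f_{i'j}}{f_{i'.}} \right)^{2}}$$

where
f_i. is the sum of the components of the ith row;
f_.j is the sum of the components of the jth column;
(f_ij / f_i.) is the ith row profile for j = 1,2,...,p.
Likewise, the distance between two columns j and j' is given by:

$$d(j,j') = \sqrt{\sum_{i=1}^{n} \frac{1}{f_{i.}} \left( \frac{f_{ij}}{f_{.j}} - \frac{f_{ij'}}{f_{.j'}} \right)^{2}}$$

where (f_ij / f_.j) is the jth column profile for i = 1,...,n. In the example further down, the f_ij are relative frequencies, so the weights f_i. and f_.j are the row and column totals of the relative frequency table.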
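The two formulas can be sketched as a pair of small Python functions. The function names and the 2 x 2 table are hypothetical. The sketch assumes the weights are relative marginal frequencies (the table normalized to sum to one); using raw counts as weights instead would rescale every distance by the same constant. It also assumes that no row or column total is zero.

```python
import math


def row_distance(table, i, k):
    """Chi-square distance between rows i and k of a frequency table (list of rows)."""
    total = sum(sum(row) for row in table)
    # Relative column totals f_.j, used as weights.
    col_mass = [sum(col) / total for col in zip(*table)]
    # Row profiles f_ij / f_i. (identical for counts and relative frequencies).
    prof_i = [v / sum(table[i]) for v in table[i]]
    prof_k = [v / sum(table[k]) for v in table[k]]
    return math.sqrt(sum((a - b) ** 2 / m
                         for a, b, m in zip(prof_i, prof_k, col_mass)))


def column_distance(table, j, l):
    """Chi-square distance between columns j and l: the row distance on the transpose."""
    transposed = [list(col) for col in zip(*table)]
    return row_distance(transposed, j, l)


# Hypothetical 2 x 2 table of counts.
table = [[10, 20],
         [30, 40]]
print(row_distance(table, 0, 1))
print(column_distance(table, 0, 1))
```

Treating the column distance as the row distance of the transposed table keeps the two formulas in one place, which mirrors the symmetry between the two definitions.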

DOMAINS AND LIMITATIONS

The chi-square distance incorporates a weight that is inversely proportional to the total of each row (or column), which increases the importance of small deviations in the rows (or columns) that have a small total relative to those with a larger total.

The chi-square distance has the property of distributional equivalence, meaning that it ensures that the distances between rows and columns are invariant when two columns (or two rows) with identical profiles are aggregated.
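Distributional equivalence can be checked numerically on a small table. In the sketch below, the table values and helper names are hypothetical: the second column is exactly twice the first, so the two columns have identical profiles, and merging them should leave all row distances unchanged. Weights are taken as relative column totals, which is an assumption consistent with the worked example in this article.

```python
import math


def row_distance_matrix(table):
    """All pairwise chi-square distances between the rows of a frequency table."""
    total = sum(sum(row) for row in table)
    col_mass = [sum(col) / total for col in zip(*table)]
    profiles = [[v / sum(row) for v in row] for row in table]
    return [[math.sqrt(sum((a - b) ** 2 / m for a, b, m in zip(p, q, col_mass)))
             for q in profiles] for p in profiles]


def merge_columns(table, j, l):
    """Return a copy of the table with column l added into column j, then dropped."""
    merged = []
    for row in table:
        new_row = list(row)
        new_row[j] += new_row[l]
        del new_row[l]
        merged.append(new_row)
    return merged


# Column 1 is twice column 0, so both columns have the same profile.
table = [[10, 20, 5],
         [30, 60, 40],
         [20, 40, 15]]

before = row_distance_matrix(table)
after = row_distance_matrix(merge_columns(table, 0, 1))
same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
           for row_a, row_b in zip(before, after)
           for a, b in zip(row_a, row_b))
print(same)
```

The comparison uses a small tolerance because the two computations sum floating-point terms in a different order.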

EXAMPLES

Consider a contingency table charting how satisfied employees working for three different businesses are. Let us establish a distance table using the chi-square distance.

Values for the studied variable X can fall into one of three categories:

  • X 1: high satisfaction;
  • X 2: medium satisfaction;
  • X 3: low satisfaction.

The observations collected from samples of individuals from the three businesses are given below:

 

         Business 1   Business 2   Business 3   Total
X 1          20           55           30        105
X 2          18           40           15         73
X 3          12            5            5         22
Total        50          100           50        200

The relative frequency table is obtained by dividing all of the elements of the table by 200, the total number of observations:

 

         Business 1   Business 2   Business 3   Total
X 1         0.1          0.275        0.15      0.525
X 2         0.09         0.2          0.075     0.365
X 3         0.06         0.025        0.025     0.11
Total       0.25         0.5          0.25      1
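The relative frequency table and its margins can be recomputed from the counts with a few lines of Python. This is only a sketch; the variable names are chosen for illustration.

```python
# Observed counts: rows are X 1, X 2, X 3; columns are Business 1, 2, 3.
counts = [[20, 55, 30],
          [18, 40, 15],
          [12, 5, 5]]

total = sum(sum(row) for row in counts)                # total number of observations
rel = [[v / total for v in row] for row in counts]     # relative frequencies
row_mass = [sum(row) for row in rel]                   # "Total" column
col_mass = [sum(col) for col in zip(*rel)]             # "Total" row

for row, mass in zip(rel, row_mass):
    print("  ".join(f"{v:.3f}" for v in row), f"| {mass:.3f}")
print("  ".join(f"{m:.3f}" for m in col_mass), f"| {sum(col_mass):.3f}")
```

The row and column totals computed here are the weights that enter the chi-square distances between columns and between rows, respectively.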

We can calculate the difference in employee satisfaction between the three businesses. The column profile matrix is given below:

 

         Business 1   Business 2   Business 3   Total
X 1         0.4          0.55         0.6       1.55
X 2         0.36         0.4          0.3       1.06
X 3         0.24         0.05         0.1       0.39
Total       1            1            1         3
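Each column profile is obtained by dividing a column by its own total. A short Python sketch, with variable names chosen for illustration, reproduces the matrix from the observed counts:

```python
# Observed counts: rows are X 1, X 2, X 3; columns are Business 1, 2, 3.
counts = [[20, 55, 30],
          [18, 40, 15],
          [12, 5, 5]]

col_totals = [sum(col) for col in zip(*counts)]
# Same layout as the table: one row per satisfaction level, one column per business.
col_profiles = [[v / t for v, t in zip(row, col_totals)] for row in counts]

for label, row in zip(["X 1", "X 2", "X 3"], col_profiles):
    print(label, "  ".join(f"{v:.2f}" for v in row))
```

Dividing the relative frequencies by the relative column totals gives the same profiles, since the factor 200 cancels.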

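This allows us to calculate the distances between the different columns. For example, using the column profiles above and the row totals of the relative frequency table as weights, the distance between Business 1 and Business 2 is:

$$d(1,2) = \sqrt{\frac{(0.4-0.55)^{2}}{0.525} + \frac{(0.36-0.4)^{2}}{0.365} + \frac{(0.24-0.05)^{2}}{0.11}} = 0.613$$

We can calculate d(1,3) and d(2,3) in a similar way. The distances obtained are summarized in the following distance table: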
 

              Business 1   Business 2   Business 3
Business 1       0            0.613        0.514
Business 2       0.613        0            0.234
Business 3       0.514        0.234        0
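The distance table between businesses can be recomputed from the observed counts. The sketch below assumes, as in the rest of the example, that the weights are the relative row totals (0.525, 0.365 and 0.11); the helper name chi2 is hypothetical.

```python
import math

# Observed counts: rows are X 1, X 2, X 3; columns are Business 1, 2, 3.
counts = [[20, 55, 30],
          [18, 40, 15],
          [12, 5, 5]]

total = sum(sum(row) for row in counts)
row_mass = [sum(row) / total for row in counts]        # relative row totals (weights)
columns = [list(col) for col in zip(*counts)]
col_profiles = [[v / sum(col) for v in col] for col in columns]


def chi2(p, q, weights):
    """Weighted Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 / w for a, b, w in zip(p, q, weights)))


dist = [[chi2(p, q, row_mass) for q in col_profiles] for p in col_profiles]
for row in dist:
    print("  ".join(f"{d:.3f}" for d in row))
```

Businesses 2 and 3 come out closest to each other, while Business 1 is the most distant from both, mainly because of its larger share of low satisfaction.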

We can also calculate the distances between the rows, in other words the differences between the levels of employee satisfaction; to do this we need the row profile table:

 

         Business 1   Business 2   Business 3   Total
X 1         0.19         0.524        0.286     1
X 2         0.246        0.548        0.206     1
X 3         0.546        0.227        0.227     1
Total       0.982        1.299        0.719     3

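This allows us to calculate the distances between the different rows. For example, using the row profiles above and the column totals of the relative frequency table as weights, the distance between X 1 and X 2 is:

$$d(1,2) = \sqrt{\frac{(0.19-0.246)^{2}}{0.25} + \frac{(0.524-0.548)^{2}}{0.5} + \frac{(0.286-0.206)^{2}}{0.25}} = 0.198$$

We can calculate d(1,3) and d(2,3) in a similar way. The differences between the degrees of employee satisfaction are finally summarized in the following distance table: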

 

        X 1      X 2      X 3
X 1     0        0.198    0.835
X 2     0.198    0        0.754
X 3     0.835    0.754    0
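The row distance table can be reproduced from the rounded row profiles printed earlier, together with the relative column totals 0.25, 0.5 and 0.25 as weights. This is a sketch with hypothetical variable names. Starting from unrounded profiles instead (for example 12/22 rather than 0.546) may change the third decimal of some distances.

```python
import math

# Rounded row profiles as printed in the row profile table (X 1, X 2, X 3).
row_profiles = [[0.19, 0.524, 0.286],
                [0.246, 0.548, 0.206],
                [0.546, 0.227, 0.227]]
# Relative column totals for Business 1, 2, 3, used as weights.
col_mass = [0.25, 0.5, 0.25]


def chi2(p, q, weights):
    """Weighted Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 / w for a, b, w in zip(p, q, weights)))


dist = [[chi2(p, q, col_mass) for q in row_profiles] for p in row_profiles]
for label, row in zip(["X 1", "X 2", "X 3"], dist):
    print(label, "  ".join(f"{d:.3f}" for d in row))
```

The high and medium satisfaction profiles are close to each other, while the low satisfaction profile is far from both, which matches the pattern in the table.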

http://www.researchgate.net/post/What_is_chi-squared_distance_I_need_help_with_the_source_code

http://www.springerreference.com/docs/html/chapterdbid/60817.html
