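The wonder of mathematics is that it takes scattered, seemingly unrelated things, expresses them as numbers, organizes them with formulas and models, and computes a result that turns out to confirm its own validity. Locality-sensitive hashing (局部哈希, shown here in its SimHash form) illustrates this perfectly. It differs from an ordinary hash in that it gently describes how an object keeps its main features while it changes into another object: similar inputs yield similar hash values. An ordinary hash, by contrast, changes abruptly and without warning, like a genetic mutation, as soon as the input changes at all.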
航天员 (astronaut): [0110011010100110010000010100000001111110010010110011011111011010] -> 0.042866474940001
载人 (crewed): [1001011010001100001110101111111110010110111010101100111110100010] -> 0.03454385915100477
交会 (rendezvous): [1011000111010111110101011110010100000010111000010000111001010101] -> 0.03324403373307827
对接 (docking): [1011000001110110101011110011011100000110110001111111000111100001] -> 0.02921719003554388
天宫 (Tiangong): [0010000101101100110111111011101001101010111110101110111011001110] -> 0.02515174408716358
航天 (spaceflight): [0111110101011100000100011010010000001111100101001010100011010011] -> 0.023303336236545936
发射 (launch): [0110110000000111010000000101000000010010101010111010000010011100] -> 0.021250399979987955
节点 (node): [1110111001010101111001001110010001111110001010110010000100001001] -> 0.019656970658882354
我国 (our country): [1110100101101110000101101001111111011100100110100100100110111010] -> 0.018924900735380758
飞船 (spacecraft): [1011011101111100011100010101001011010000001100101000000101001000] -> 0.016226164698362116
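Take the record "航天员: [0110011010100110......] -> 0.042866474940001": 航天员 (astronaut) is the token produced by word segmentation, [0110011010100110......] is the token's 64-bit binary hash, and 0.042866474940001 is the token's weight.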
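To make the record layout concrete, here is a minimal sketch of parsing one such line into its three parts. The class name `TokenRecord`, the regular expression, and the assumption that every record follows exactly the `token: [64 bits] -> weight` layout are illustrative choices of mine, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: one parsed "token: [bits] -> weight" record.
final class TokenRecord {
    final String token;   // the segmented word
    final long hash;      // the 64-bit hash, read from the bit string
    final double weight;  // the token's weight

    // Assumed layout: token, colon, 64 binary digits in brackets, arrow, weight.
    private static final Pattern LINE =
            Pattern.compile("(.+?):\\s*\\[([01]{64})\\]\\s*->\\s*(\\S+)");

    TokenRecord(String token, long hash, double weight) {
        this.token = token;
        this.hash = hash;
        this.weight = weight;
    }

    static TokenRecord parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected record: " + line);
        }
        // parseUnsignedLong accepts all 64 bits, including a leading 1,
        // which ends up as the sign bit of the resulting long.
        long hash = Long.parseUnsignedLong(m.group(2), 2);
        double weight = Double.parseDouble(m.group(3));
        return new TokenRecord(m.group(1), hash, weight);
    }

    public static void main(String[] args) {
        // "astronaut" is an ASCII stand-in for the first token in the sample data.
        TokenRecord r = parse("astronaut: "
                + "[0110011010100110010000010100000001111110010010110011011111011010]"
                + " -> 0.042866474940001");
        System.out.println(r.token + " " + Long.toHexString(r.hash) + " " + r.weight);
    }
}
```

The parsed values map directly onto the three fields of the `XxPair` class used later (`key`, `md5`, `val`).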
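The algorithm works as follows. For each of the 64 bit positions, walk through every token's binary hash: if the token's bit at that position is 1, add the token's weight to a running total, otherwise subtract it. Once every token has been counted for that position, record 1 in the same bit of the return value if the total is greater than zero, and 0 otherwise. Doing this for all 64 positions of the long integer produces a brand-new long value.
Geometrically, each token can be seen as a point in a multi-dimensional coordinate system, with its hash giving its starting side on every axis: 1 stands for the positive side and 0 for the negative side. The weights keep accumulating along those directions, and in the end all that matters is whether each total lands on the positive or the negative side. That is how the underlying goal of dimensionality reduction is achieved. The correctness of cosine computation in such a multi-dimensional space was proven by mathematicians long ago, so there is no need to doubt it.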
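The voting step is easier to follow on a small case before reading the 64-bit version. The sketch below shrinks the hashes to 4 bits and uses three made-up tokens. The class name `SimHashToy`, the hash values, and the weights are all hypothetical and chosen only to keep the arithmetic readable.

```java
import java.util.Arrays;

// Toy 4-bit version of the weighted bit vote (illustrative only).
final class SimHashToy {
    // Per-bit totals, most significant bit first:
    // +weight where the token's bit is 1, -weight where it is 0.
    static double[] tally(int[] hashes, double[] weights, int bits) {
        double[] sums = new double[bits];
        for (int i = 0; i < hashes.length; i++) {
            for (int b = 0; b < bits; b++) {
                int mask = 1 << (bits - 1 - b);
                sums[b] += (hashes[i] & mask) != 0 ? weights[i] : -weights[i];
            }
        }
        return sums;
    }

    // Keep only the sign of each total: positive -> 1, otherwise 0.
    static int fingerprint(double[] sums) {
        int fp = 0;
        for (double s : sums) {
            fp = (fp << 1) | (s > 0 ? 1 : 0);
        }
        return fp;
    }

    public static void main(String[] args) {
        int[] hashes = {0b1010, 0b0110, 0b0011}; // made-up 4-bit hashes
        double[] weights = {0.5, 0.3, 0.1};      // made-up weights
        double[] sums = tally(hashes, weights, 4);
        System.out.println(Arrays.toString(sums));
        System.out.println(Integer.toBinaryString(fingerprint(sums)));
    }
}
```

With these inputs the four totals are roughly +0.1, -0.3, +0.9 and -0.7, so the fingerprint is 1010. The heaviest token dominates the result, while the lighter ones only tip a bit when they agree with each other strongly enough.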
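// One weighted token: its 64-bit hash, the token text, and its weight.
public static class XxPair {
    public Long md5;    // 64-bit binary hash of the token
    public String key;  // the token itself
    public Double val;  // the token's weight
    public XxPair() {}
    public XxPair(Long m, String k, Double v) { md5 = m; key = k; val = v; }
}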
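// Number of bits in the fingerprint (0X40 == 64).
private static final int XX_BITS = 0X40;
// XX_MASK[i] selects bit i, counting from the most significant bit.
private static final Long[] XX_MASK = {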
0X8000000000000000L, 0X4000000000000000L, 0X2000000000000000L, 0X1000000000000000L,
0X0800000000000000L, 0X0400000000000000L, 0X0200000000000000L, 0X0100000000000000L,
0X0080000000000000L, 0X0040000000000000L, 0X0020000000000000L, 0X0010000000000000L,
0X0008000000000000L, 0X0004000000000000L, 0X0002000000000000L, 0X0001000000000000L,
0X0000800000000000L, 0X0000400000000000L, 0X0000200000000000L, 0X0000100000000000L,
0X0000080000000000L, 0X0000040000000000L, 0X0000020000000000L, 0X0000010000000000L,
0X0000008000000000L, 0X0000004000000000L, 0X0000002000000000L, 0X0000001000000000L,
0X0000000800000000L, 0X0000000400000000L, 0X0000000200000000L, 0X0000000100000000L,
0X0000000080000000L, 0X0000000040000000L, 0X0000000020000000L, 0X0000000010000000L,
0X0000000008000000L, 0X0000000004000000L, 0X0000000002000000L, 0X0000000001000000L,
0X0000000000800000L, 0X0000000000400000L, 0X0000000000200000L, 0X0000000000100000L,
0X0000000000080000L, 0X0000000000040000L, 0X0000000000020000L, 0X0000000000010000L,
0X0000000000008000L, 0X0000000000004000L, 0X0000000000002000L, 0X0000000000001000L,
0X0000000000000800L, 0X0000000000000400L, 0X0000000000000200L, 0X0000000000000100L,
0X0000000000000080L, 0X0000000000000040L, 0X0000000000000020L, 0X0000000000000010L,
0X0000000000000008L, 0X0000000000000004L, 0X0000000000000002L, 0X0000000000000001L
};
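// Requires: import java.util.Vector;
// Folds the weighted token hashes into a single 64-bit SimHash value.
private static Long SimHashGetValue(Vector<XxPair> v) {
    long sim = 0L;
    for (int bit = 0; bit < XX_BITS; bit++) {
        // Weighted vote for this bit: add the weight if the token's bit is 1,
        // subtract it otherwise.
        double wgt = 0.0;
        for (int idx = 0; idx < v.size(); idx++) {
            XxPair p = v.get(idx);
            if (0 != (p.md5 & XX_MASK[bit])) {
                wgt += p.val;
            } else {
                wgt -= p.val;
            }
        }
        // A positive total sets the bit; zero or negative leaves it at 0.
        if (wgt > 0) {
            sim |= XX_MASK[bit];
        }
    }
    return sim;
}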
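The code above stops at producing the fingerprint and the text does not show how two fingerprints are compared. One common choice, assumed here rather than taken from the original, is the Hamming distance: count the bits in which the two values differ and treat a small count as "similar". The class name `SimHashCompare` and the `maxDistance` threshold are hypothetical, and a suitable threshold would have to be tuned on real data.

```java
// Hypothetical comparison helper for two 64-bit SimHash values.
final class SimHashCompare {
    // Number of bit positions in which a and b differ (0..64).
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Treats two fingerprints as similar when they differ in at most
    // maxDistance bits; the threshold is an assumption to be tuned.
    static boolean isSimilar(long a, long b, int maxDistance) {
        return hammingDistance(a, b) <= maxDistance;
    }

    public static void main(String[] args) {
        long a = 0b1010L;
        long b = 0b0110L;
        System.out.println(hammingDistance(a, b));
        System.out.println(isSimilar(a, b, 3));
    }
}
```

Because each bit of the fingerprint records only the sign of a weighted total, a small edit to a document tends to flip few bits, which is what makes a bit-count comparison meaningful here and meaningless for an ordinary hash.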
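The greatness of mathematics lies in its power to explore and predict the mysteries of the unknown from the world we already know. I believe that higher forms of life must exist in higher-dimensional space too, and that every living being is like a multi-dimensional model: bad habits accumulate as karma, good habits bear fruit, and which quadrant one finally lands in is decided entirely by the "weights".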