Hamming distance
The Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ.
For example:
A=abcdef
B=adddef
The Hamming distance between A and B is 2, because the second and third characters differ.
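As a minimal sketch of the definition, the comparison can be written as a simple loop over character positions. The class name StringHamming and the method name distance are illustrative assumptions, not part of the original code.

```java
// Illustrative helper; the names are assumptions, not taken from the original post.
class StringHamming {
    /** Counts the positions at which two equal-length strings differ. */
    static int distance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        int count = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The example from the text: the second and third characters differ.
        System.out.println(distance("abcdef", "adddef")); // prints 2
    }
}
```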
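Comparing the Hamming distance of two hash values is very fast, but a plain sequential search does not survive explosive data growth. Finding similar hash values in a massive data set gets steadily slower, and this most basic form of lookup clearly cannot scale to hundreds of millions of records.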
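In image recognition, two hashes whose Hamming distance lies between 0 and 10 are considered similar. With sequential search, every query has to be compared against every stored hash, for example 10,000 comparisons for a library of 10,000 images.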
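To make the cost concrete, here is a hedged sketch of the sequential approach for 64-bit hashes. It assumes the hashes are stored as long values. The class LinearScan and its methods are hypothetical names used only for illustration. The distance of two hashes is the number of set bits in their XOR, which Long.bitCount computes directly.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; class and method names are assumptions.
class LinearScan {
    /** Hamming distance of two 64-bit hashes: XOR marks the differing bits, bitCount counts them. */
    static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /** Sequential search: one comparison per stored hash, so the cost grows linearly with the data. */
    static List<Long> findSimilar(long query, long[] stored, int maxDistance) {
        List<Long> hits = new ArrayList<>();
        for (long h : stored) {
            if (distance(query, h) <= maxDistance) {
                hits.add(h);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        long[] stored = {0x0FL, 0xFFL, -1L};
        // 0x0E differs from 0x0F in 1 bit, from 0xFF in 5 bits, and from -1L (all ones) in 61 bits.
        System.out.println(findSimilar(0x0EL, stored, 10)); // prints [15, 255]
    }
}
```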
Algorithm principle
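If a hash is split into 11 parts, then two similar hashes (Hamming distance at most 10) must have at least one part that is exactly identical, because 10 differing bits can touch at most 10 of the 11 parts. That split is awkward, though, since 64 bits do not divide evenly into 11 parts. Instead we can split the hash into 8 parts: if every part differs, the Hamming distance is at least 8; conversely, if the Hamming distance is less than 8, at least one part is identical.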
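The pigeonhole argument for the 8-part split can be checked with a small experiment. The sketch below is illustrative only; the class PigeonholeCheck and the method matchingBlocks are assumed names. It flips at most 7 bits of a random 64-bit value and confirms that at least one 8-bit block is untouched, then shows that the bound is tight: with 8 differing bits, one per block, no block matches.

```java
import java.util.Random;

// Illustrative check of the pigeonhole argument; names are assumptions.
class PigeonholeCheck {
    /** Returns how many of the eight 8-bit blocks are exactly equal in a and b. */
    static int matchingBlocks(long a, long b) {
        long diff = a ^ b;
        int matches = 0;
        for (int shift = 0; shift < 64; shift += 8) {
            if (((diff >>> shift) & 0xFF) == 0) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        for (int trial = 0; trial < 100_000; trial++) {
            long a = random.nextLong();
            long b = a;
            // Flip 7 bit positions; positions may repeat, so the distance is at most 7.
            for (int k = 0; k < 7; k++) {
                b ^= 1L << random.nextInt(64);
            }
            if (matchingBlocks(a, b) == 0) {
                throw new AssertionError("distance < 8 but no identical block");
            }
        }
        System.out.println("all pairs with distance < 8 shared at least one block");

        // Tight bound: one differing bit in each block gives distance 8 and no identical block.
        long spread = 0x0101010101010101L;
        System.out.println(Long.bitCount(spread) + " differing bits, "
                + matchingBlocks(0L, spread) + " identical blocks"); // prints 8 differing bits, 0 identical blocks
    }
}
```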
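Based on this principle, an index over massive data can be built with the following steps:
- Split the 64-bit hash into 8 equal parts of 8 bits each.
- Rearrange the 64-bit hash so that each part in turn becomes the leading 8 bits; this gives 8 tables in total.
- Look up the leading 8 bits by exact match.
- If there is a match, compute the exact Hamming distance for the candidate hashes found there.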
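The four steps can be sketched in a compact, self-contained form. This is only an illustration under stated assumptions: the class BlockIndexSketch and its methods are hypothetical, hashes are plain long values, and the "rearrange" step is done with Long.rotateLeft so that the chosen block really ends up in the leading 8 bits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the 8-table index; all names are assumptions.
class BlockIndexSketch {
    static final int TABLES = 8;     // the 64-bit hash is split into 8 blocks
    static final int BLOCK_BITS = 8; // each block is 8 bits wide

    // One table per block position: block value -> hashes having that value at that position.
    final List<Map<Integer, List<Long>>> tables = new ArrayList<>();

    BlockIndexSketch() {
        for (int i = 0; i < TABLES; i++) {
            tables.add(new HashMap<>());
        }
    }

    /** Rotates block number `table` into the leading 8 bits and returns it as the lookup key. */
    static int key(long hash, int table) {
        long rotated = Long.rotateLeft(hash, table * BLOCK_BITS);
        return (int) (rotated >>> (64 - BLOCK_BITS));
    }

    void add(long hash) {
        for (int t = 0; t < TABLES; t++) {
            tables.get(t).computeIfAbsent(key(hash, t), k -> new ArrayList<>()).add(hash);
        }
    }

    /** Exact-match lookup in each table, followed by an exact Hamming check on the candidates. */
    Set<Long> search(long query, int maxDistance) {
        Set<Long> result = new LinkedHashSet<>(); // a set, because a hash can be found via several tables
        for (int t = 0; t < TABLES; t++) {
            List<Long> candidates = tables.get(t).get(key(query, t));
            if (candidates == null) {
                continue;
            }
            for (long candidate : candidates) {
                if (Long.bitCount(query ^ candidate) <= maxDistance) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BlockIndexSketch index = new BlockIndexSketch();
        index.add(0x1234_5678_9ABC_DEF0L);
        index.add(0L);
        // The query differs from the first stored hash in 2 bits of the last block.
        Set<Long> hits = index.search(0x1234_5678_9ABC_DEF3L, 7);
        System.out.println(hits.contains(0x1234_5678_9ABC_DEF0L)); // prints true
    }
}
```

A maximum distance of 7 is used in the example because, with 8 blocks, only matches with a distance below 8 are guaranteed to share a block with the query.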
Java implementation
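import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
// DHash, HammingDistance and Top are helper classes that are not shown in this section.
public class DIndex implements Serializable {
    private static final long serialVersionUID = -4463444087393922139L;
    // One table per 8-bit block position: block value -> fingerprints with that block.
    List<Map<Integer, List<Long>>> index_store = new ArrayList<Map<Integer, List<Long>>>();
    // Fingerprint -> image it was computed from.
    Map<Long, String> data = new HashMap<Long, String>();
    static final int STORE_COUNT = 8;
    // Matches must be closer than this. Note that an 8-block index is only guaranteed
    // to find fingerprints whose distance to the query is less than 8.
    static final int MAX_DIS = 25;
    private String indexPath = "";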
    /**
     * @param indexPath
     *            path where the index is saved
     */
    public DIndex(String indexPath) {
        this.indexPath = indexPath;
        File file = new File(indexPath);
        if (!file.exists()) {
            try {
                file.createNewFile();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        // Initialize the 8 index tables.
        for (int i = 0; i < STORE_COUNT; i++) {
            index_store.add(new HashMap<Integer, List<Long>>());
        }
    }
    /** Fingerprints an image and adds it to the index. */
    public void index(String image) throws IOException {
        long fingerprint = DHash.fingerprint(image);
        intoIndex(fingerprint);
        data.put(fingerprint, image);
    }
    /** Registers the fingerprint in each of the 8 tables, keyed by the block at that position. */
    public void intoIndex(Long fingerprint) {
        int[] subHash = subHash(fingerprint);
        for (int i = 0; i < STORE_COUNT; i++) {
            int hash = subHash[i];
            Map<Integer, List<Long>> map = index_store.get(i);
            intoIndex(hash, fingerprint, map);
        }
    }
    /** Appends value to the list stored under key, creating the list on first use. */
    public void intoIndex(Integer key, Long value, Map<Integer, List<Long>> index) {
        List<Long> list = index.get(key);
        if (list == null) {
            list = new ArrayList<Long>();
        }
        list.add(value);
        index.put(key, list);
    }
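    /** Searches the index for images similar to the given image. */
    public Top<String, Integer> search(String image) throws IOException {
        long fingerprint = DHash.fingerprint(image);
        return search(fingerprint);
    }
    /** Looks up block i of the query in table i only, then verifies each candidate exactly. */
    public Top<String, Integer> search(long finger0) throws IOException {
        int[] subHash = subHash(finger0);
        Top<String, Integer> top = new Top<String, Integer>();
        // A fingerprint that shares several blocks with the query shows up in several
        // tables; remember what was already checked so it is reported only once.
        Set<Long> seen = new HashSet<Long>();
        for (int i = 0; i < STORE_COUNT; i++) {
            List<Long> fingers = index_store.get(i).get(subHash[i]);
            if (fingers == null) {
                continue;
            }
            for (Long finger : fingers) {
                if (!seen.add(finger)) {
                    continue;
                }
                int dis = HammingDistance.distance(finger0, finger);
                if (dis < MAX_DIS) {
                    top.add(data.get(finger), dis);
                }
            }
        }
        return top;
    }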
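    /** Splits the 64-bit fingerprint into 8 blocks of 8 bits, most significant block first. */
    public int[] subHash(long fingerprint) {
        int[] subHash = new int[STORE_COUNT];
        // The step equals STORE_COUNT only because 64 bits / 8 blocks = 8 bits per block.
        for (int i = 56; i >= 0; i -= STORE_COUNT) {
            int hash = (int) ((fingerprint >> i) & 0xff);
            subHash[STORE_COUNT - i / STORE_COUNT - 1] = hash;
        }
        return subHash;
    }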
    /** Brute-force search that compares the query against every stored fingerprint. */
    public Top<String, Integer> fullSearch(String toFind) throws IOException {
        long fingerprint = DHash.fingerprint(toFind);
        Top<String, Integer> top = new Top<String, Integer>();
        for (Map.Entry<Long, String> entry : data.entrySet()) {
            int dis = HammingDistance.distance(fingerprint, entry.getKey());
            if (dis < MAX_DIS) {
                top.add(entry.getValue(), dis);
            }
        }
        return top;
    }
    /** Serializes the whole index to indexPath. */
    public void write() throws IOException {
        // try-with-resources closes the stream even if writeObject throws.
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(indexPath))) {
            out.writeObject(this);
        }
    }
    /** The constructor may create an empty file, so only a non-empty file counts as a saved index. */
    public boolean canReload() {
        File file = new File(indexPath);
        return file.isFile() && file.length() > 0;
    }
    /** Loads a previously written index, or returns null if it cannot be read. */
    public DIndex reload() {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(indexPath))) {
            return (DIndex) in.readObject();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
}