mysql 汉明距离检索,如何加快此BIT_COUNT查询的汉明距离?

I have a php script that checks hamming distance between 2 still photos taken from a security camera.

The table is mySQL with 2.4M rows, and consists of a Key and 4 INT(10)s. The INT(10)s have been indexed individually, together, and together with the Key, but I don't have significant evidence that any combination was faster than the others. I can try again if you suggest to do so.

The hamming weights are calculated by transforming the image into 8x16 pixels, and each quarter of the bits is stored in a column, pHash0, pHash1... etc.

There are 2 ways I have written it. The first way was to use nested derived tables. Theoretically, each derivation should have lesser data to check than it's predecessor. The query is a prepared statement, and the ? fields are the pHash[0-3] of the file I'm checking against.

Select

`Key`,

Bit_Count(T3.pHash3 ^ ?) + T3.BC2 As BC3

From

(Select

*,

Bit_Count(T2.pHash2 ^ ?) + T2.BC1 As BC2

From

(Select

*,

Bit_Count(T1.pHash1 ^ ?) + T1.BC0 As BC1

From

(Select

`Key`,

pHash0,

pHash1,

pHash2,

pHash3,

Bit_Count(pHash0 ^ ?) As BC0

From

files

Where

Not pHash0 Is Null And

Bit_Count(pHash0 ^ ?) < 4) As T1

Where

Bit_Count(T1.pHash1 ^ ?) + T1.BC0 < 4) As T2

Where

Bit_Count(T2.pHash2 ^ ?) + T2.BC1 < 4) As T3

Where

Bit_Count(T3.pHash3 ^ ?) + T3.BC2 < 4

The second approach was a bit more direct. It just did all of the work at once.

Select

`Key`,

From

files

Where

Not pHash0 is null AND

Bit_Count(pHash0 ^ ?) + Bit_Count(pHash1 ^ ?) + Bit_Count(pHash2 ^

?) + Bit_Count(pHash3 ^ ?) < 4

The first query is faster on large recordsets, while the second is faster on smaller recordsets, but neither will exceed 1-1/3 seconds per compare on 2.4M records.

Do you see a way of tweaking this process to make it faster? Any suggestions can be quickly tried, such as changing datatypes or indexes.

The setup is Win7x64, MySQL/5.6.6 and InnoDB, nginx/1.99, php-cgi/7.0.0 with zend enabled. The script is called from a webpage, and has buffering turned off for immediate feedback.

EDIT:

It might work better if I change the 4 32-bit integers to 1 binary(16), which would change the compares from 4 to one, but I'd also have to convert my 4 parameters to a 128-bit character, which php won't do. If there was a fast way to combine them, it might squeeze a bit more time off.

EDIT

The accepted answer has increased the speed by ~500%. A quick synopsis of our hypothesis: The bitcount of pHash "A" will always be within pHash "B" +/- Hamming Distance.

Special thanks to @duskwuff for tenacity and patience. Cheers @duskwuff!

EDIT

This was my most recent query:

Select

files.`Key`,

Bit_Count(? ^ pHash0) + Bit_Count(? ^ pHash1) +

Bit_Count(? ^ pHash2) + Bit_Count(? ^ pHash3) as BC

From

files FORCE INDEX (bitcount)

Where

bitCount Between ? And ?

AND Bit_Count(? ^ pHash0) + Bit_Count(? ^ pHash1) +

Bit_Count(? ^ pHash2) + Bit_Count(? ^ pHash3) <= ?

ORDER BY Bit_Count(? ^ pHash0) + Bit_Count(? ^ pHash1) +

Bit_Count(? ^ pHash2) + Bit_Count(? ^ pHash3)

Where the first 4 "?" represent the 4 32-bit hashes of the file being checked, the next 2 "?" represent the pre-calculated bitcount of that file +/- the desired hamming distance, and the last "?" represents that hamming distance. The ORDER BY clause is necessary only to bring the closest matches to the top, where a LIMIT 1 clause will return the best match. There is a B-TREE index on the bitcount field.

The dispersion of bitcounts from 2.4-million files fell into a bell curve, having 3 or 4 on the extremes, with 70,000 in the center. If given a file with a bitcount of 64 (which is worst-case), looking for files within a hamming distance of 3 means comparing 20% of the files (490,000 in my case), whereas looking for a hamming distance of 0 would compare only 2.8% of the records (70,000, of course).

解决方案

Observe that BIT_COUNT(a ^ b) is bounded below by the difference between BIT_COUNT(a) and BIT_COUNT(b). (That is, it is always at least equal to the difference, and may be greater.) If you precalculate the total bit count for each row, you can use that to rule out rows which have a total bit count that's too far away from your target. Even better, you can create an index on that column, and that index will be used.

What I'd have in mind would be something along the lines of:

ALTER TABLE files ADD COLUMN totalbits INTEGER;

CREATE INDEX totalbits_index ON files (totalbits);

UPDATE files SET totalbits = BIT_COUNT(pHash1) + BIT_COUNT(pHash2)

+ BIT_COUNT(pHash3) + BIT_COUNT(pHash4);

SELECT `Key` FROM files WHERE (totalbits BETWEEN … AND …) AND …

Note that, with this in place, there's no need to split your hashes into four chunks. Combining them back into a single column would make things easier.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值