I have a PHP script that checks the Hamming distance between two still photos taken from a security camera.
The table is MySQL with 2.4M rows, and consists of a Key and four INT(10)s. The INT(10)s have been indexed individually, together, and together with the Key, but I don't have significant evidence that any combination was faster than the others. I can try again if you suggest doing so.
The Hamming weights are calculated by reducing the image to 8x16 pixels, and each quarter of the bits is stored in one column: pHash0, pHash1, pHash2, pHash3.
There are two ways I have written it. The first way was to use nested derived tables. Theoretically, each derivation should have less data to check than its predecessor. The query is a prepared statement, and the ? fields are the pHash[0-3] values of the file I'm checking against.
Select
    `Key`,
    Bit_Count(T3.pHash3 ^ ?) + T3.BC2 As BC3
From
    (Select
        *,
        Bit_Count(T2.pHash2 ^ ?) + T2.BC1 As BC2
    From
        (Select
            *,
            Bit_Count(T1.pHash1 ^ ?) + T1.BC0 As BC1
        From
            (Select
                `Key`,
                pHash0,
                pHash1,
                pHash2,
                pHash3,
                Bit_Count(pHash0 ^ ?) As BC0
            From
                files
            Where
                Not pHash0 Is Null And
                Bit_Count(pHash0 ^ ?) < 4) As T1
        Where
            Bit_Count(T1.pHash1 ^ ?) + T1.BC0 < 4) As T2
    Where
        Bit_Count(T2.pHash2 ^ ?) + T2.BC1 < 4) As T3
Where
    Bit_Count(T3.pHash3 ^ ?) + T3.BC2 < 4
The second approach was a bit more direct. It just did all of the work at once.
Select
    `Key`
From
    files
Where
    Not pHash0 Is Null And
    Bit_Count(pHash0 ^ ?) + Bit_Count(pHash1 ^ ?) +
    Bit_Count(pHash2 ^ ?) + Bit_Count(pHash3 ^ ?) < 4
The first query is faster on large recordsets, while the second is faster on smaller ones, but neither does better than roughly 1 to 1⅓ seconds per compare across 2.4M records.
Do you see a way of tweaking this process to make it faster? Any suggestion can be tried quickly, such as changing datatypes or indexes.
The setup is Win7 x64, MySQL 5.6.6 with InnoDB, nginx/1.99, and php-cgi 7.0.0 with Zend enabled. The script is called from a webpage, and has output buffering turned off for immediate feedback.
EDIT:
It might work better if I changed the four 32-bit integers to one BINARY(16), which would reduce the compares from four to one, but I'd also have to convert my four parameters into a 128-bit value, which PHP won't do natively. If there were a fast way to combine them, it might squeeze a bit more time off.
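For what it's worth, the four 32-bit values can be concatenated into a 16-byte binary string with pack(), which could then be bound against a BINARY(16) column. A minimal sketch, with illustrative values in place of the real hashes:

```php
<?php
// Combine four 32-bit hash quarters into one 16-byte big-endian binary
// string, suitable for binding against a BINARY(16) column.
// ($p0..$p3 are illustrative names for the four pHash integers.)
$p0 = 0x12345678;
$p1 = 0x9ABCDEF0;
$p2 = 0x0F0F0F0F;
$p3 = 0xFFFFFFFF;

$bin = pack('N4', $p0, $p1, $p2, $p3);  // 'N' = unsigned 32-bit, big-endian

echo strlen($bin), "\n";   // 16
echo bin2hex($bin), "\n";  // 123456789abcdef00f0f0f0fffffffff
```

Whether a single BINARY(16) compare actually beats four INT compares would still need measuring.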
EDIT:
The accepted answer increased the speed by roughly 500%. A quick synopsis of the underlying observation: the bit count of pHash "A" will always be within the bit count of pHash "B" plus or minus their Hamming distance. For example, if A has 30 set bits and B has 34, the Hamming distance between them must be at least 4.
Special thanks to @duskwuff for tenacity and patience. Cheers @duskwuff!
EDIT:
This was my most recent query:
Select
    files.`Key`,
    Bit_Count(? ^ pHash0) + Bit_Count(? ^ pHash1) +
    Bit_Count(? ^ pHash2) + Bit_Count(? ^ pHash3) As BC
From
    files FORCE INDEX (bitcount)
Where
    bitcount Between ? And ?
    And Bit_Count(? ^ pHash0) + Bit_Count(? ^ pHash1) +
        Bit_Count(? ^ pHash2) + Bit_Count(? ^ pHash3) <= ?
Order By
    Bit_Count(? ^ pHash0) + Bit_Count(? ^ pHash1) +
    Bit_Count(? ^ pHash2) + Bit_Count(? ^ pHash3)
Here the first four "?" placeholders are the four 32-bit hashes of the file being checked, the next two are that file's pre-calculated bit count minus and plus the desired Hamming distance, and the last one in the WHERE clause is that Hamming distance itself (the ORDER BY placeholders repeat the four hashes). The ORDER BY clause is needed only to bring the closest matches to the top, where a LIMIT 1 clause will return the best match. There is a B-TREE index on the bitcount column.
The dispersion of bit counts across the 2.4 million files forms a bell curve, with 3 or 4 files at the extremes and about 70,000 at the center. Given a file with a bit count of 64 (the worst case, since it sits at the peak of the curve), searching within a Hamming distance of 3 means comparing about 20% of the files (490,000 in my case), whereas a Hamming distance of 0 compares only about 2.8% of the records (the 70,000, of course).
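On the PHP side, the fifteen placeholders can all be derived from the probe image's four hashes. A rough sketch, assuming a PDO connection; popcount32() is a helper defined here (not part of the original script), and $p0..$p3 are illustrative names for the probe's hash quarters:

```php
<?php
// Build the 15 bound parameters for the bit-count-windowed query above.
// popcount32() is a SWAR population count for 32-bit values.
function popcount32(int $v): int {
    $v = $v - (($v >> 1) & 0x55555555);
    $v = ($v & 0x33333333) + (($v >> 2) & 0x33333333);
    return (((($v + ($v >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24) & 0xFF;
}

$p0 = 0x12345678; $p1 = 0x9ABCDEF0; $p2 = 0x0F0F0F0F; $p3 = 0x00000000;
$maxDistance = 3;

$totalBits = popcount32($p0) + popcount32($p1) + popcount32($p2) + popcount32($p3);

$params = array_merge(
    [$p0, $p1, $p2, $p3],                                    // SELECT list
    [$totalBits - $maxDistance, $totalBits + $maxDistance],  // BETWEEN window
    [$p0, $p1, $p2, $p3, $maxDistance],                      // WHERE filter
    [$p0, $p1, $p2, $p3]                                     // ORDER BY
);
// e.g. with PDO: $stmt = $pdo->prepare($sql); $stmt->execute($params);
echo count($params), "\n";   // 15
```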
Solution
Observe that BIT_COUNT(a ^ b) is bounded below by the absolute difference between BIT_COUNT(a) and BIT_COUNT(b). (That is, it is always at least that difference, and may be greater.) If you precalculate the total bit count for each row, you can use it to rule out rows whose total bit count is too far from your target. Better still, you can create an index on that column, and that index will be used.
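Intuitively, XOR flips exactly the differing bits, so it cannot change the set-bit count by less than the gap between the two counts. A throwaway spot-check of the bound on random values (not part of the original script):

```php
<?php
// Spot-check the bound |BIT_COUNT(a) - BIT_COUNT(b)| <= BIT_COUNT(a ^ b)
// on random 32-bit values.
function popcount(int $v): int {
    return substr_count(decbin($v), '1');
}

mt_srand(1);
$ok = true;
for ($i = 0; $i < 10000; $i++) {
    $a = mt_rand(0, 0x7FFFFFFF);
    $b = mt_rand(0, 0x7FFFFFFF);
    $lower = abs(popcount($a) - popcount($b));
    if (popcount($a ^ $b) < $lower) {
        $ok = false;
    }
}
echo $ok ? "bound holds\n" : "bound violated\n";
```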
What I have in mind is something along the lines of:
ALTER TABLE files ADD COLUMN totalbits INTEGER;
CREATE INDEX totalbits_index ON files (totalbits);
UPDATE files SET totalbits = BIT_COUNT(pHash0) + BIT_COUNT(pHash1)
    + BIT_COUNT(pHash2) + BIT_COUNT(pHash3);
SELECT `Key` FROM files WHERE (totalbits BETWEEN … AND …) AND …
Note that, with this in place, there's no need to split your hashes into four chunks. Combining them back into a single column would make things easier.