I have a PHP script that checks the Hamming distance between two still photos taken from a security camera.
The table is MySQL with 2.4M rows, and consists of a Key and four INT(10)s. The INT(10)s have been indexed individually, together, and together with the Key, but I don't have significant evidence that any combination was faster than the others. I can try again if you suggest doing so.
The Hamming weights are calculated by reducing the image to 8x16 pixels, and each quarter of the bits is stored in one column: pHash0, pHash1, pHash2, pHash3.
There are two ways I have written it. The first way was to use nested derived tables. Theoretically, each derivation should have less data to check than its predecessor. The query is a prepared statement, and the ? fields are the pHash[0-3] values of the file I'm checking against.
Select
    `Key`,
    Bit_Count(T3.pHash3 ^ ?) + T3.BC2 As BC3
From
    (Select
        *,
        Bit_Count(T2.pHash2 ^ ?) + T2.BC1 As BC2
    From
        (Select
            *,
            Bit_Count(T1.pHash1 ^ ?) + T1.BC0 As BC1
        From
            (Select
                `Key`,
                pHash0,
                pHash1,
                pHash2,
                pHash3,
                Bit_Count(pHash0 ^ ?) As BC0
            From
                files
            Where
                Not pHash0 Is Null And
                Bit_Count(pHash0 ^ ?) < 4) As T1
        Where
            Bit_Count(T1.pHash1 ^ ?) + T1.BC0 < 4) As T2
    Where
        Bit_Count(T2.pHash2 ^ ?) + T2.BC1 < 4) As T3
Where
    Bit_Count(T3.pHash3 ^ ?) + T3.BC2 < 4
The second approach was a bit more direct. It just did all of the work at once.
Select
    `Key`
From
    files
Where
    Not pHash0 Is Null And
    Bit_Count(pHash0 ^ ?) + Bit_Count(pHash1 ^ ?) +
    Bit_Count(pHash2 ^ ?) + Bit_Count(pHash3 ^ ?) < 4
The first query is faster on large recordsets, while the second is faster on smaller ones, but neither does better than roughly 1 to 1⅓ seconds per compare across 2.4M records.
Do you see a way of tweaking this process to make it faster? Any suggestion can be tried quickly, such as changing datatypes or indexes.
The setup is Win7 x64, MySQL 5.6.6 with InnoDB, nginx/1.99, and php-cgi 7.0.0 with Zend enabled. The script is called from a webpage, and has output buffering turned off for immediate feedback.
EDIT:
It might work better if I changed the four 32-bit integers to one BINARY(16), which would reduce the compares from four to one, but I'd also have to convert my four parameters into a 128-bit value, which PHP won't do natively. If there were a fast way to combine them, it might squeeze a bit more time off.
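For what it's worth, the four 32-bit values can be concatenated into a 16-byte binary string with pack(), which could then be bound against a BINARY(16) column. A minimal sketch, with illustrative values in place of the real hashes:

```php
<?php
// Combine four 32-bit hash quarters into one 16-byte big-endian binary
// string, suitable for binding against a BINARY(16) column.
// ($p0..$p3 are illustrative names for the four pHash integers.)
$p0 = 0x12345678;
$p1 = 0x9ABCDEF0;
$p2 = 0x0F0F0F0F;
$p3 = 0xFFFFFFFF;

$bin = pack('N4', $p0, $p1, $p2, $p3);  // 'N' = unsigned 32-bit, big-endian

echo strlen($bin), "\n";   // 16
echo bin2hex($bin), "\n";  // 123456789abcdef00f0f0f0fffffffff
```

Whether a single BINARY(16) compare actually beats four INT compares would still need measuring.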
EDIT:
The accepted answer increased the speed by roughly 500%. A quick synopsis of the underlying observation: the bit count of pHash "A" will always be within the bit count of pHash "B" plus or minus their Hamming distance. For example, if A has 30 set bits and B has 34, the Hamming distance between them must be at least 4.
Special thanks to @duskwuff for tenacity and patience. Cheers @duskwuff!
EDIT:
This was my most recent query:
Select
    files.`Key`,
    Bit_Count(? ^ pHash0) + Bit_Count(? ^ pHash1) +
    Bit_Count(? ^ pHash2) + Bit_Count(? ^ pHash3) As BC
From
    files FORCE INDEX (bitcount)
Where
    bitcount Between ? And ?
    And Bit_Count(? ^ pHash0) + Bit_Count(? ^ pHash1) +
        Bit_Count(? ^ pHash2) + Bit_Count(? ^ pHash3) <= ?
Order By
    Bit_Count(? ^ pHash0) + Bit_Count(? ^ pHash1) +
    Bit_Count(? ^ pHash2) + Bit_Count(? ^ pHash3)
Here the first four "?" placeholders are the four 32-bit hashes of the file being checked, the next two are that file's pre-calculated bit count minus and plus the desired Hamming distance, and the last one in the WHERE clause is that Hamming distance itself (the ORDER BY placeholders repeat the four hashes). The ORDER BY clause is needed only to bring the closest matches to the top, where a LIMIT 1 clause will return the best match. There is a B-TREE index on the bitcount column.
The dispersion of bit counts across the 2.4 million files forms a bell curve, with 3 or 4 files at the extremes and about 70,000 at the center. Given a file with a bit count of 64 (the worst case, since it sits at the peak of the curve), searching within a Hamming distance of 3 means comparing about 20% of the files (490,000 in my case), whereas a Hamming distance of 0 compares only about 2.8% of the records (the 70,000, of course).
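On the PHP side, the fifteen placeholders can all be derived from the probe image's four hashes. A rough sketch, assuming a PDO connection; popcount32() is a helper defined here (not part of the original script), and $p0..$p3 are illustrative names for the probe's hash quarters:

```php
<?php
// Build the 15 bound parameters for the bit-count-windowed query above.
// popcount32() is a SWAR population count for 32-bit values.
function popcount32(int $v): int {
    $v = $v - (($v >> 1) & 0x55555555);
    $v = ($v & 0x33333333) + (($v >> 2) & 0x33333333);
    return (((($v + ($v >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24) & 0xFF;
}

$p0 = 0x12345678; $p1 = 0x9ABCDEF0; $p2 = 0x0F0F0F0F; $p3 = 0x00000000;
$maxDistance = 3;

$totalBits = popcount32($p0) + popcount32($p1) + popcount32($p2) + popcount32($p3);

$params = array_merge(
    [$p0, $p1, $p2, $p3],                                    // SELECT list
    [$totalBits - $maxDistance, $totalBits + $maxDistance],  // BETWEEN window
    [$p0, $p1, $p2, $p3, $maxDistance],                      // WHERE filter
    [$p0, $p1, $p2, $p3]                                     // ORDER BY
);
// e.g. with PDO: $stmt = $pdo->prepare($sql); $stmt->execute($params);
echo count($params), "\n";   // 15
```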
Solution
Observe that BIT_COUNT(a ^ b) is bounded below by the absolute difference between BIT_COUNT(a) and BIT_COUNT(b). (That is, it is always at least that difference, and may be greater.) If you precalculate the total bit count for each row, you can use it to rule out rows whose total bit count is too far from your target. Better still, you can create an index on that column, and that index will be used.
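Intuitively, XOR flips exactly the differing bits, so it cannot change the set-bit count by less than the gap between the two counts. A throwaway spot-check of the bound on random values (not part of the original script):

```php
<?php
// Spot-check the bound |BIT_COUNT(a) - BIT_COUNT(b)| <= BIT_COUNT(a ^ b)
// on random 32-bit values.
function popcount(int $v): int {
    return substr_count(decbin($v), '1');
}

mt_srand(1);
$ok = true;
for ($i = 0; $i < 10000; $i++) {
    $a = mt_rand(0, 0x7FFFFFFF);
    $b = mt_rand(0, 0x7FFFFFFF);
    $lower = abs(popcount($a) - popcount($b));
    if (popcount($a ^ $b) < $lower) {
        $ok = false;
    }
}
echo $ok ? "bound holds\n" : "bound violated\n";
```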
What I have in mind is something along the lines of:
ALTER TABLE files ADD COLUMN totalbits INTEGER;
CREATE INDEX totalbits_index ON files (totalbits);
UPDATE files SET totalbits = BIT_COUNT(pHash0) + BIT_COUNT(pHash1)
    + BIT_COUNT(pHash2) + BIT_COUNT(pHash3);
SELECT `Key` FROM files WHERE (totalbits BETWEEN … AND …) AND …
Note that, with this in place, there's no need to split your hashes into four chunks. Combining them back into a single column would make things easier.