Hamming distance in binary strings in SQL

I have a table in my database where I store SHA256 hashes in the BINARY column (32). I am looking for a way to calculate the Hamming distance of records in a column to a given value, i.e. something like:

SELECT * FROM table ORDER BY HAMMINGDISTANCE(hash, UNHEX(<insert supplied sha256 hash here>)) ASC LIMIT 10 

(if you're interested, the Hamming distance of lines A and B is defined as BIT_COUNT(A^B) , where ^ is the bitwise XOR operator, and BIT_COUNT returns the number 1s in the binary string).

Now I know that both the operator ^ function and the BIT_COUNT function work only with INTEGER, so I would say that probably the only way to do this is to split the binary strings in substrings, drop each binary substring into an integer, calculate the Hamming distance by substring and then add them. The problem is that it sounds terribly complicated, inefficient and definitely not elegant. So my question is: can you suggest a better way? (note that I am on a shared hosting, so I can not change the database server or load libraries)

edit (1): Obviously, loading the whole table in PHP and doing the calculations would be possible, but I would rather avoid this because this table is likely to grow quite.

edit (2): Database server is MySQL 5.1

edit (3): My answer below contains the code I just described above.

edit (4): I just found out that using 4 BIGINT to store a hash instead of BINARY (32) gives significant speed improvements (more than 100 times faster). See comments on my answer below.

+22
sql binary-data mysql hash hamming distance
Jan 23 '11 at 10:45
source share
2 answers

It seems that storing data in a BINARY column is an approach related to poor execution. The only quick way to get decent performance is to split the contents of the BINARY column into multiple BIGINT columns, each of which contains an 8-byte substring of source data.

In my case (32 bytes), this would mean using 4 BIGINT columns and using this function:

 CREATE FUNCTION HAMMINGDISTANCE( A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT, B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT ) RETURNS INT DETERMINISTIC RETURN BIT_COUNT(A0 ^ B0) + BIT_COUNT(A1 ^ B1) + BIT_COUNT(A2 ^ B2) + BIT_COUNT(A3 ^ B3); 

Using this approach in my testing is more than 100 times faster than using the BINARY approach.




FWIW, this is the code I hinted at while explaining the problem. It is best to use the same thing, welcome (I especially don't like binary conversions> hex> decimal):

 CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32)) RETURNS INT DETERMINISTIC RETURN BIT_COUNT( CONV(HEX(SUBSTRING(A, 1, 8)), 16, 10) ^ CONV(HEX(SUBSTRING(B, 1, 8)), 16, 10) ) + BIT_COUNT( CONV(HEX(SUBSTRING(A, 9, 8)), 16, 10) ^ CONV(HEX(SUBSTRING(B, 9, 8)), 16, 10) ) + BIT_COUNT( CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^ CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10) ) + BIT_COUNT( CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^ CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10) ); 
+14
Jan 24 '11 at 2:55 april
source share

Interesting question: I found a way to do this for binary(3) , which could work for binary(32) :

 drop table if exists BinaryTest; create table BinaryTest (hash binary(3)); insert BinaryTest values (0xAAAAAA); set @supplied = cast(0x888888 as binary); select length(replace(concat( bin(ascii(substr(hash,1,1)) ^ ascii(substr(@supplied,1,1))), bin(ascii(substr(hash,2,1)) ^ ascii(substr(@supplied,2,1))), bin(ascii(substr(hash,3,1)) ^ ascii(substr(@supplied,3,1))) ),'0','')) from BinaryTest; 

replace removes all zeros, and the length of the remainder is the number of units. (Converting to binary values ​​results in zero, so counting zeros will not work.)

Prints 6 , which corresponds to the number of units in

 0xAAAAAA ^ 0x888888 = 0x222222 = 0b1000100010001000100010 
+1
Jan 23 2018-11-23T00:
source share



All Articles