Can you explain those disturbing anomalies of md5 and modulo?

Well, the name is really very subjective. But this is just what the problem is for me.

It is assumed that I want to evenly distribute hits of static web content over a certain number of cache servers. Also, delivery to customers should be accelerated, because multiple domains are used, and requests do not block each other. I also do not need the classic load balancer, but immediately create the correct links in my html code.

I also want the same URL to always be served by the same server.

So, I just defined a small function that returns the host to use by hashing the request URL and calculates the module by the number of servers used:

function pseudocode_statify($url) { // $url looks like /folder1/folder2/file.jpg return 'http://' . md5($url) % $num_of_servers .'.mydomain.com' . $url; } 

At first I had something like hexadecimal decoding and substring to prevent in-place overflow, but found that it just worked fine above.

However, my problem is that if I run the following test script:

 for($i=0;$i<100000;$i++) { $md5 = md5(uniqid($i).microtime().rand(1,999999999999)); $result[$md5%2]++; } 

I expected a uniform distribution. that $ result [0] will be near the value of $ result [1];

That was not so.

Well, that’s nothing special. I would just agree that md5 is not as evenly distributed as I thought, and would go for another hashing algorithm like sha1 or something like that.

But I tried to reproduce the results and found a sample that I can not explain.

The ratio has always been around 2/1. In fact, this ratio has always been something like 1 / 2.16 to 1 / 2.17

An example of the output of some of the script runs above:

 output was generated by: echo "ratio: ".$result[0]/$result[1]."\n"; ratio: 2.1757121534504 ratio: 2.1729411578062 ratio: 2.1726559360393 ratio: 2.1676895664225 ratio: 2.1667416128848 ratio: 2.1667115284133 ratio: 2.1677791605385 ratio: 2.1658969579688 ratio: 2.1668508131769 ratio: 2.1689292821741 

Now it was strange that the ratio of the sums of% 2, equal to 1, and the sums of% 2, equal to 0, sometimes alternate!

 for($j = 0; $j<100;$j++) { for($i=0;$i<100000;$i++) { $md5 = md5(uniqid($i).microtime().rand(1,999999999999)); $result[$md5%2]++; } var_dump($result); } 

I ran the script from the command line two times at first and interrupted it after 3 starts, and it produced two exits:

 joe@joe-laptop :/home/flimmit/httpdocs$ php test.php PHP Notice: Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6 PHP Notice: Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6 PHP Notice: Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6 array(2) { [0]=> int(68223) [1]=> int(31777) } array(2) { [0]=> int(136384) [1]=> int(63616) } array(2) { [0]=> int(204498) [1]=> int(95502) } ^C joe@joe-laptop :/home/flimmit/httpdocs$ php test.php PHP Notice: Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6 PHP Notice: Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6 PHP Notice: Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6 array(2) { [1]=> int(31612) [0]=> int(68388) } array(2) { [1]=> int(63318) [0]=> int(136682) } array(2) { [1]=> int(94954) [0]=> int(205046) } ^C joe@joe-laptop :/home/flimmit/httpdocs$ 

As you can see in the first, the first record of results is always higher, in the second - vice versa. same script.

The strange thing is that I can ONLY reproduce this behavior when I run the script several times.

I wrote this little script to reproduce the swap and create enough measurement data:

 for($j = 0; $j<100;$j++) { for($i=0;$i<rand(1000,10000);$i++) { $md5 = md5(uniqid($i).microtime().rand(1,99999999)); $result[$md5%2]++; } #var_dump($result); echo "ratio: ".$result[0]/$result[1]." ".(($result[0]<$result[1]) ? "A":"B")."\n"; sleep(rand(2,5)); } 

But here it prints only b, not A. It made me think that there might be a semantic error in the script, but I did not find it.

I am really stuck and it bothers me a lot.

So my questions are:

  • You can recommend any literature / web links that I could read about md5 a little deeper, including distributions, etc.

  • Can you explain / reproduce the behavior? Do I have a mistake here? (actually it is very likely, but I can not find it)

  • Can you recommend any other algorithm that would be useful for my use? It should not be cryptographic or strong, but fast, deterministic and evenly distributed.

+6
php cryptography md5 checksum
source share
3 answers

The md5() function returns a string , not an integer.

This means that this string will be given for an integer to make modulo; and since this string will contain characters in the range 0-9A-F , cast as an integer, you have:

  • 1 chance out of 16 get 0
  • 9 chances out of 16 from 1 to 9
  • 6 out of 16 chances to get between A and F - that will be sent to 0


For example, this:

 $a = md5('plop1'); var_dump($a, (int)$a); $a = md5('plop2'); var_dump($a, (int)$a); $a = md5('plop5'); var_dump($a, (int)$a); 

You will get the following result:

 string 'ac4bf0e466417336599b72a8b2f595da' (length=32) int 0 string 'ed91c463402dd797d0718350f5bd0acd' (length=32) int 0 string '85782b3afb04072c1bf172a6a7e6bb5e' (length=32) int 85782 

I will let you guess the possible impact this may have on the result of the modulo operator; -)

+7
source share
 for ($i = 0; $i < 10; $i++) { srand(crc32('test_url1')); echo rand().'<br />'; } for ($i = 0; $i < 10; $i++) { srand(crc32('test_url2')); echo rand().'<br />'; } 

add a range to the rand function and you have the value of your server.

+1
source share

"The background is that I want to evenly distribute images of static web content across a certain number of cache servers."

Many, many load balancers are already doing this. Squid, nginx, Varnish, HAProxy ...

Please do not write your own load balancer in PHP. You are welcome.

0
source share

All Articles