Well, the name is really very subjective. But this is just what the problem is for me.
It is assumed that I want to evenly distribute hits of static web content over a certain number of cache servers. Also, delivery to customers should be accelerated, because multiple domains are used, and requests do not block each other. I also do not need the classic load balancer, but immediately create the correct links in my html code.
I also want the same URL to always be served by the same server.
So, I just defined a small function that returns the host to use by hashing the request URL and calculates the module by the number of servers used:
function pseudocode_statify($url) { // $url looks like /folder1/folder2/file.jpg return 'http://' . md5($url) % $num_of_servers .'.mydomain.com' . $url; }
At first I had something like hexadecimal decoding and substring to prevent in-place overflow, but found that it just worked fine above.
However, my problem is that if I run the following test script:
for($i=0;$i<100000;$i++) { $md5 = md5(uniqid($i).microtime().rand(1,999999999999)); $result[$md5%2]++; }
I expected a uniform distribution. that $ result [0] will be near the value of $ result [1];
That was not so.
Well, thatβs nothing special. I would just agree that md5 is not as evenly distributed as I thought, and would go for another hashing algorithm like sha1 or something like that.
But I tried to reproduce the results and found a sample that I can not explain.
The ratio has always been around 2/1. In fact, this ratio has always been something like 1 / 2.16 to 1 / 2.17
An example of the output of some of the script runs above:
output was generated by: echo "ratio: ".$result[0]/$result[1]."\n"; ratio: 2.1757121534504 ratio: 2.1729411578062 ratio: 2.1726559360393 ratio: 2.1676895664225 ratio: 2.1667416128848 ratio: 2.1667115284133 ratio: 2.1677791605385 ratio: 2.1658969579688 ratio: 2.1668508131769 ratio: 2.1689292821741
Now it was strange that the ratio of the sums of% 2, equal to 1, and the sums of% 2, equal to 0, sometimes alternate!
for($j = 0; $j<100;$j++) { for($i=0;$i<100000;$i++) { $md5 = md5(uniqid($i).microtime().rand(1,999999999999)); $result[$md5%2]++; } var_dump($result); }
I ran the script from the command line two times at first and interrupted it after 3 starts, and it produced two exits:
joe@joe-laptop :/home/flimmit/httpdocs$ php test.php PHP Notice: Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6 PHP Notice: Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6 PHP Notice: Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6 array(2) { [0]=> int(68223) [1]=> int(31777) } array(2) { [0]=> int(136384) [1]=> int(63616) } array(2) { [0]=> int(204498) [1]=> int(95502) } ^C joe@joe-laptop :/home/flimmit/httpdocs$ php test.php PHP Notice: Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6 PHP Notice: Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6 PHP Notice: Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6 array(2) { [1]=> int(31612) [0]=> int(68388) } array(2) { [1]=> int(63318) [0]=> int(136682) } array(2) { [1]=> int(94954) [0]=> int(205046) } ^C joe@joe-laptop :/home/flimmit/httpdocs$
As you can see in the first, the first record of results is always higher, in the second - vice versa. same script.
The strange thing is that I can ONLY reproduce this behavior when I run the script several times.
I wrote this little script to reproduce the swap and create enough measurement data:
for($j = 0; $j<100;$j++) { for($i=0;$i<rand(1000,10000);$i++) { $md5 = md5(uniqid($i).microtime().rand(1,99999999)); $result[$md5%2]++; }
But here it prints only b, not A. It made me think that there might be a semantic error in the script, but I did not find it.
I am really stuck and it bothers me a lot.
So my questions are:
You can recommend any literature / web links that I could read about md5 a little deeper, including distributions, etc.
Can you explain / reproduce the behavior? Do I have a mistake here? (actually it is very likely, but I can not find it)
Can you recommend any other algorithm that would be useful for my use? It should not be cryptographic or strong, but fast, deterministic and evenly distributed.