There is not one, but they can be emulated .
Here is a copy of the achievement before the link died .. all content
Object Set in PHP: Arrays vs. SplObjectStorage
One of my QueryPath projects performs many tasks that require support for a set of unique objects. In my quest to optimize QueryPath, I explored various ways to efficiently store sets of objects in such a way as to provide effective containment checks. In other words, I need a data structure that stores a list of unique objects and can quickly tell if an object exists in this list. You also need the ability to scroll through the contents of the list.
I recently narrowed the list of candidates to two methods:
Use the good old-fashioned arrays to emulate the hash set. Use the SplObjectStorage system present in PHP 5.2 and later. Before embedding anything directly in QueryPath , I first outlined the design of the two methods, and then used several micro-tests (using Crell help) for the two methods. To say that the results were amazing is an understatement. Tests are likely to change the way we structure future code, both inside and outside Drupal .
Constructions
Before presenting the benchmarks, I want to quickly explain the two projects that I focused on.
Hash Set Arrays
The first method I considered was using a standard PHP () array to emulate a set supported by a hash mapping (a "hash set"). A set is a data structure designed to store a list of unique elements. In my case, I am interested in preserving a unique set of DOM objects. A hash set is a set that is implemented using a hash table, where the key is a unique identifier for the stored value. Although it would usually be possible to write a class to encapsulate this function, I decided to test the implementation as a bare array without any layers of indirection on top. In other words, what I'm going to introduce is the insides of what would be a hash set implementation.
Purpose: To maintain a (unique) set of objects so that they (a) are easy to repeat and (b) are cheap to verify membership. Strategy. Create an associative array in which the key is the hash identifier and the value is an object.
With a fairly good hash function, the strategy described above should work as desired.
Reasonably Good Hashing Function - This was the first loot. How do you create a good hash function for an object of type DOMDocument ? One (bad) way is to serialize the object, and then perhaps take its MD5 hash. This, however, will not work on many objects - in particular, on any object that cannot be serialized. For example, a DOMDocument actually supported by a resource and cannot be serialized.
One must, however, look for such a function. It turns out that there is an object hashing function in PHP 5. It is called spl_object_hash() , and it can take any object (even one that is not native PHP) and generate a hash code for it.
Using spl_object_hash() , we can build a simple data structure that functions like a hash set. This structure looks something like this:
array( $hashcode => $object );
For example, we generate a record like this:
$object = new stdClass(); $hashcode = spl_object_hash($object); $arr = array( $hashcode => $object );
In the above example, the hashcode string is the key of the array, and the object itself is the value of the array. Please note: since the hash code will be the same each time the object is hashed again, it serves not only as a comparison point ("if the object hashkey == object b hashkey, then a == b"), it also functions as a uniqueness constraint. For each array, there can only be one object with the specified hash code, so there is no possibility of two copies (in fact, two links) to the same object that is placed in the array.
With such a data structure, we have many functions available to manage the structure, since we have at our disposal all the functions of the PHP array. So to some extent this is an attractive choice out of the box.
The most important import task, at least in our context, is to determine if an entry exists within the set. There are two possible candidates for this check, and both require the delivery of a hash code:
Check if the key exists with array_key_exists (). Check if the key is set using isset (). To go to chase, isset () is faster than array_key_exists (), and offers the same functions in our context, so we will use this. (The fact that they handle null values differently has nothing to do with us. No null values will ever be inserted into a set.)
Given this, we performed a containment check using something like this:
$object = new stdClass(); $hashkey = spl_object_hash($object); // ... // Check whether $arr has the $object. if (isset($arr[$hashkey])) { // Do something... }
Again, using an array that emulates a hash set allows us to use all the existing functions of the array, and also provides an easy iteration. We can easily drop this into the foreach loop and iterate over the contents. Before looking at how this works, let him consider another possible solution.
Using SplObjectStorage
The second method under consideration uses the new SplObjectStorage class from PHP 5.2+ (it can be in 5.1). This class, which is supported by the C implementation, provides a storage mechanism similar to the set for classes. This provides uniqueness; only one of each object can be saved. It is also available as it implements the Iterable interface. This means that you can use it in loops like foreach. (On the other hand, the version in PHP 5.2 does not provide any index-based random access or access method. The version in PHP 5.3 corrects this drawback.)
Purpose: To maintain a (unique) set of objects so that they (a) are easy to repeat and (b) are cheap to verify membership. Strategy: create an object of class SplObjectStorage and save references to objects inside this.
Creating a new SplObjectStorage is simple:
$objectStore = new SplObjectStorage();
An SplObjectStorage instance not only stores uniqueness information about objects, but objects are also stored in a predictable order. SplObjectStorage is FIFO - First In, First Out.
Adding objects is done using the attach() method:
$objectStore = new SplObjectStorage(); $object = new stdClass(); $objectStore->attach($object);
It should be noted that the application will attach the object only once. If the same object is passed to attach() twice, the second attempt will simply be ignored. For this reason, there is no need to make a call to contains() before calling attach() . It is redundant and expensive.
Checking for an object inside an instance of SplObjectStorage is also simple:
$objectStore = new SplObjectStorage(); $object = new stdClass(); $objectStore->attach($object); // ... if ($objectStore->contains($object)) { // Do something... }
While SplObjectStorage does not have a number of supporting methods that can be accessed with arrays, it allows iteration and somewhat limited access to objects stored inside. In many use cases (including the one I'm studying here) SplObjectStorage provides the required functionality.
Now that we’ve examined two candidate data structures, let’s see how they are implemented.
Comparison
Anyone who has seen Larry's Crell Garfield micro-benchmarks for arrays and SPL ArrayAccess objects is likely to enter this test suite with the same set of expectations that Larry and I have. We expected PHP arrays to throw SplObjectStorage out of the water. After all, arrays are a primitive type in PHP and enjoy many years of optimization. However, the documentation for SplObjectStorage indicates that the search time for the SplObjectStorage object is O (1), which will certainly make it competitive if the base speed is similar to the speed of the array.
My test environments:
iMac (current generation) with 3.06 GHz Intel Core 2 Duo topology and 2 GB of 800 MHz DDR2 memory. MAMP 1.72 (PHP 5.2.5) provides an AMP stack. MacBook Pro with 2.4 GHz Intel Core 2 Duo and 4G of 667 MHz RAM DDR2. MAMP 1.72 (PHP 5.2.5) provides an AMP stack. In both cases, performance tests averaged about the same. The tests in this article are taken from the second system.
Our main testing strategy was to create a simple test that collected information about three things:
The time required to load the data structure The time taken to search the data structure The amount of memory used by the data structure We did everything possible to minimize the influence of other factors on the test. Here is our testing script:
<?php /** * Object hashing tests. */ $sos = new SplObjectStorage(); $docs = array(); $iterations = 100000; for ($i = 0; $i < $iterations; ++$i) { $doc = new DOMDocument(); //$doc = new stdClass(); $docs[] = $doc; } $start = $finis = 0; $mem_empty = memory_get_usage(); // Load the SplObjectStorage $start = microtime(TRUE); foreach ($docs as $d) { $sos->attach($d); } $finis = microtime(TRUE); $time_to_fill = $finis - $start; // Check membership on the object storage $start = microtime(FALSE); foreach ($docs as $d) { $sos->contains($d); } $finis = microtime(FALSE); $time_to_check = $finis - $start; $mem_spl = memory_get_usage(); $mem_used = $mem_spl - $mem_empty; printf("SplObjectStorage:\nTime to fill: %0.12f.\nTime to check: %0.12f.\nMemory: %d\n\n", $time_to_fill, $time_to_check, $mem_used); unset($sos); $mem_empty = memory_get_usage(); // Test arrays: $start = microtime(TRUE); $arr = array(); // Load the array foreach ($docs as $d) { $arr[spl_object_hash($d)] = $d; } $finis = microtime(TRUE); $time_to_fill = $finis - $start; // Check membership on the array $start = microtime(FALSE); foreach ($docs as $d) { //$arr[spl_object_hash($d)]; isset($arr[spl_object_hash($d)]); } $finis = microtime(FALSE); $time_to_check = $finis - $start; $mem_arr = memory_get_usage(); $mem_used = $mem_arr - $mem_empty; printf("Arrays:\nTime to fill: %0.12f.\nTime to check: %0.12f.\nMemory: %d\n\n", $time_to_fill, $time_to_check, $mem_used); ?>
The test above is broken into four separate tests. The first two tests show how well the SplObjectStorage method handles load and contain checks. The second two perform the same test in our data structure of impromptu arrays.
There are two things to note about the test above.
First, the object of choice for our test was DOMDocument . There are several reasons for this. The obvious reason is that this test was performed to optimize QueryPath , which works with elements from the DOM implementation. However, there are two more interesting reasons. First off, DOMDocuments are not easy. Another is that DOMDocuments supported by the C implementation, which makes them one of the most difficult cases when storing objects. (For example, they cannot be conveniently serialized.)
Nevertheless, after observing the result, we repeated the test with the stdClass base objects and found that the performance results are almost identical, and the memory usage is proportional.
Secondly, it is worth mentioning that we used 100,000 iterations for testing. This was due to the upper limit that my PHP configuration allowed before running out of memory. However, in addition to this, the number was chosen arbitrarily. When I ran tests with a lower number of iterations, SplObjectStorage definitely scaled linearly. Array performance was less predictable (larger standard deviation) with smaller data sets, although it seemed average about the same for smaller sizes than it (more predictable) for larger arrays.
results
So how did these two strategies relate to our micro tests? Here is a representative sample of the result generated at startup above:
SplObjectStorage: Time to fill: 0.085041999817. Time to check: 0.073099000000. Memory: 6124624 Arrays: Time to fill: 0.193022966385. Time to check: 0.153498000000. Memory: 8524352
Averaging this over several runs, SplObjectStorage performs fill and check functions twice as fast as the array method presented above. We tried various permutations of the above tests. Here, for example, are the results for the same test with a stdClass object :
SplObjectStorage: Time to fill: 0.082209110260. Time to check: 0.070617000000. Memory: 6124624 Arrays: Time to fill: 0.189271926880. Time to check: 0.152644000000. Memory: 8524360
Not so much. Even adding arbitrary data to the object we saved does not affect the time it takes for SplObjectStorage (although it seems to increase the time for the array a bit).
Our conclusion is that SplObjectStorage indeed the best solution for storing many objects in a collection. Over the past week, I have ported QueryPath to SplObjectStorage (see Quark branch in GitHub - the existing Drupal QueryPath module can use this experimental branch without changes) and most likely will continue benchmarking. However, preliminary results appear to provide a clear idea of the best approach.
As a result of these results, I am much less likely to have arrays as the “best choice” by default, simply because they are basic data types. If the SPL library contains functions that go beyond arrays, they should be used when necessary. From QueryPath to my Drupal modules, I expect these results to affect my code.
Thanks to Krell for his help, and also to Eddie at Frameweld for starting to learn these two methods first.