Fast hash function for `std::vector`

I implemented the following solution to compute a hash value for a `vector<T>`:

    #include <cstddef>
    #include <functional>
    #include <vector>

    // using boost::hash_combine
    template <class T>
    inline void hash_combine(std::size_t& seed, T const& v) {
        seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }

    namespace std {
        template <typename T>
        struct hash<vector<T>> {
            typedef vector<T> argument_type;
            typedef std::size_t result_type;
            result_type operator()(argument_type const& in) const {
                size_t size = in.size();
                size_t seed = 0;
                for (size_t i = 0; i < size; i++)
                    // Combine the hash of the current element with the
                    // hashes of the previous ones
                    hash_combine(seed, in[i]);
                return seed;
            }
        };
    }

But this solution does not scale at all: with a vector<double> of 10 million elements, it takes more than 2.5 s (according to VS).

Is there a quick hash function for this scenario?

Please note that hashing the vector's address is not a valid solution, since the associated unordered_map will be used across different runs, and in addition two vector<double>s with the same content but different addresses would hash differently (undesirable behavior for this application).

+5
2 answers

NOTE: you can get a 25-50x speedup simply by compiling with optimizations enabled. Do that first. Then, if it is still too slow, see below.


I don't think you can do much better. You have to touch all the elements, and that combining function is about as fast as it gets.

One option would be to parallelize the hash function. If you have 8 cores, you can run 8 threads that each hash 1/8th of the vector, then combine the 8 resulting values at the end. The synchronization overhead may only pay off for very large vectors.
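A sketch of that idea, reusing the boost-style combine step from the question (the thread count and final combine order are illustrative; note the parallel result differs from the single-threaded hash, which is fine for an unordered_map as long as it is computed consistently):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Boost-style combine step, as in the question.
template <class T>
inline void hash_combine(std::size_t& seed, T const& v) {
    seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hash a vector<double> with n_threads workers. Each worker hashes one
// contiguous chunk into its own seed; the per-chunk seeds are combined at
// the end. Deterministic for a fixed n_threads and input.
std::size_t parallel_hash(std::vector<double> const& v, unsigned n_threads = 8) {
    std::vector<std::size_t> partial(n_threads, 0);
    std::vector<std::thread> workers;
    std::size_t chunk = (v.size() + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(v.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                hash_combine(partial[t], v[i]);   // each thread owns partial[t]
        });
    }
    for (auto& w : workers) w.join();
    std::size_t seed = 0;
    for (std::size_t p : partial)
        hash_combine(seed, p);   // fold the per-chunk seeds together
    return seed;
}
```

Whether this wins depends on memory bandwidth; measure before committing to it.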

+8

The approach used by MSVC's old hash map was to sample less often.

This means that isolated changes will not show up in your hash, but the thing you are trying to avoid, reading and processing the entire 80 MB of data just to hash your vector, goes away. Not reading some of the elements is pretty much unavoidable if you want that.

The second thing you should do is not specialize std::hash for all vectors: it may make your program ill-formed (as suggested by a defect resolution whose status I do not recall), and at the least it is a bad plan (since std is sure to permit itself to add hash combining and hashing of vectors).

When I write a custom hash, I usually use ADL (Koenig lookup) to make it easy to extend.

    // Requires: <cstddef>, <functional>, <vector>, <array>
    namespace my_utils {
        namespace hash_impl {
            namespace details {
                namespace adl {
                    // Fallback: forward to std::hash.
                    template<class T>
                    std::size_t hash(T const& t) {
                        return std::hash<T>{}(t);
                    }
                }
                template<class T>
                std::size_t hasher(T const& t) {
                    using adl::hash;   // fallback visible, ADL can still win
                    return hash(t);
                }
            }
            struct hash_tag {};
            template<class T>
            std::size_t hash(hash_tag, T const& t) { return details::hasher(t); }
            template<class T>
            std::size_t hash_combine(hash_tag, std::size_t seed, T const& t) {
                seed ^= hash(hash_tag{}, t) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
                return seed;
            }
            template<class Container>
            std::size_t fash_hash_random_container(hash_tag, Container const& c) {
                std::size_t size = c.size();
                std::size_t stride = 1 + size/10;   // ~10 samples regardless of size
                std::size_t r = hash(hash_tag{}, size);
                for (std::size_t i = 0; i < size; i += stride) {
                    r = hash_combine(hash_tag{}, r, c.data()[i]);
                }
                return r;
            }
            // std specializations go here:
            template<class T, class A>
            std::size_t hash(hash_tag, std::vector<T,A> const& v) {
                return fash_hash_random_container(hash_tag{}, v);
            }
            template<class T, std::size_t N>
            std::size_t hash(hash_tag, std::array<T,N> const& a) {
                return fash_hash_random_container(hash_tag{}, a);
            }
            // etc
        }
        struct my_hasher {
            template<class T>
            std::size_t operator()(T const& t) const {
                return hash_impl::hash(hash_impl::hash_tag{}, t);
            }
        };
    }

Now my_hasher is a universal hasher. It hashes things using either the hashes declared in my_utils::hash_impl (which cover std types) or free functions named hash that can hash a given type. Failing that, it falls back to std::hash<T>. If that fails too, you get a compile-time error.

Writing a free hash function in the namespace of the type you want to hash is, in my experience, far less annoying than having to break open namespace std and specialize std::hash.
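A minimal, self-contained illustration of that extension point (the `generic` and `myapp` names are made up for this example): the generic hasher names a fallback via a block-scope using-declaration, which does not suppress ADL, so a free `hash` in the type's own namespace is still found and, being a non-template, wins overload resolution.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

namespace generic {
    // Fallback: forward to std::hash. Made visible inside the call
    // site via "using generic::hash;".
    template <class T>
    std::size_t hash(T const& t) { return std::hash<T>{}(t); }

    struct hasher {
        template <class T>
        std::size_t operator()(T const& t) const {
            using generic::hash;   // fallback in scope...
            return hash(t);        // ...but ADL can pick a better overload
        }
    };
}

namespace myapp {
    struct point { int x, y; };

    // Free function in the type's own namespace -- no std specialization
    // needed; generic::hasher finds it through ADL.
    std::size_t hash(point const& p) {
        std::size_t seed = std::hash<int>{}(p.x);
        seed ^= std::hash<int>{}(p.y) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
}
```

With this, `generic::hasher{}(myapp::point{1, 2})` dispatches to `myapp::hash`, while `generic::hasher{}(std::string("hi"))` falls back to `std::hash<std::string>`.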

It understands vectors and arrays recursively. Handling tuples and pairs requires a bit more work.

It samples said vectors and arrays at roughly 10 elements each, no matter how large they are.

(Note: hash_tag is both a bit of a joke and a way to force ADL and avoid having to forward-declare the hash specializations in the hash_impl namespace, because that requirement sucks.)

The price of sampling is that you can get more collisions.


Another approach, if you have a huge amount of data, is to hash it once and keep track of when it changes. To do this, use a copy-on-write monad-style interface for your type that keeps track of whether the hash is up to date. Now each vector gets hashed once; if you modify it, the cached hash is discarded.
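A sketch of that caching idea, under the assumption that all writes go through the wrapper (the class name and the combine step are illustrative, not part of the original answer):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// "Hash once, invalidate on write": all mutation goes through set(), which
// only marks the cached hash dirty; hash() recomputes lazily when needed.
class cached_hash_vector {
    std::vector<double> data_;
    mutable std::size_t hash_ = 0;
    mutable bool dirty_ = true;
public:
    explicit cached_hash_vector(std::vector<double> v) : data_(std::move(v)) {}

    double get(std::size_t i) const { return data_[i]; }

    void set(std::size_t i, double v) {   // write path invalidates the cache
        data_[i] = v;
        dirty_ = true;
    }

    std::size_t hash() const {
        if (dirty_) {
            std::size_t seed = 0;
            for (double d : data_)   // boost-style combine, as elsewhere
                seed ^= std::hash<double>()(d) + 0x9e3779b9
                      + (seed << 6) + (seed >> 2);
            hash_ = seed;
            dirty_ = false;
        }
        return hash_;   // repeated lookups between writes are O(1)
    }
};
```

This pays the full O(n) hash only once per burst of modifications, rather than once per map lookup.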

You can go further and have a random-access hash (where it is easy to predict what happens to the hash when you edit a given value) and mediate all access to the vector. That is tricky.
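One way to sketch such a random-access hash (names and mixing constants are illustrative assumptions, not the answer's code): make each slot's contribution depend only on its index and value and XOR the contributions together, so editing one element means XORing out the old contribution and XORing in the new one, in O(1).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Fold the index into the element hash so the overall hash stays
// order-sensitive even though contributions are XORed together.
inline std::size_t mix(std::size_t i, double v) {
    std::size_t h = std::hash<double>()(v) + 0x9e3779b9;
    return h ^ (i * 0x9e3779b97f4a7c15ULL);
}

class incr_hash_vector {
    std::vector<double> data_;
    std::size_t hash_ = 0;
public:
    explicit incr_hash_vector(std::vector<double> v) : data_(std::move(v)) {
        for (std::size_t i = 0; i < data_.size(); ++i)
            hash_ ^= mix(i, data_[i]);
    }
    double get(std::size_t i) const { return data_[i]; }
    void set(std::size_t i, double v) {
        hash_ ^= mix(i, data_[i]);   // remove old contribution, O(1)
        hash_ ^= mix(i, v);          // add new contribution
        data_[i] = v;
    }
    std::size_t hash() const { return hash_; }   // always up to date
};
```

The XOR structure is weaker than a chained hash (e.g. swapping two equal-contribution edits cancels out), which is part of the trade-off the answer alludes to.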

You could also multi-thread the hashing, but I would guess your code is probably memory-bandwidth bound, so multi-threading won't help much. It is worth a try.

You could use a fancier structure than a flat vector (something tree-like), where changes to values bubble up in a hash-like way to a root hash value. This would add lg(n) overhead to all element access. Again, you would have to wrap the raw data up in controls that keep the hashing up to date (or keep track of which ranges are dirty and need updating).
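A segment-tree sketch of that bubbling idea, assuming a non-empty vector (the class name and combine function are illustrative): leaves hold per-element hashes, internal nodes combine their children, and a write touches only the O(lg n) nodes on the path to the root.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Tree of hashes: a write recomputes only lg(n) internal nodes, and the
// whole-vector hash is just the root. Assumes a non-empty vector.
class hash_tree {
    std::size_t n_;
    std::vector<std::size_t> tree_;   // 1-based heap layout, leaves at n_..2n_-1

    static std::size_t combine(std::size_t a, std::size_t b) {
        return a ^ (b + 0x9e3779b9 + (a << 6) + (a >> 2));
    }
public:
    explicit hash_tree(std::vector<double> const& v)
        : n_(v.size()), tree_(2 * v.size(), 0) {
        for (std::size_t i = 0; i < n_; ++i)
            tree_[n_ + i] = std::hash<double>()(v[i]);
        for (std::size_t i = n_ - 1; i > 0; --i)
            tree_[i] = combine(tree_[2 * i], tree_[2 * i + 1]);
    }
    void set(std::size_t i, double v) {       // O(lg n) per write
        i += n_;
        tree_[i] = std::hash<double>()(v);
        for (i /= 2; i >= 1; i /= 2)          // bubble up to the root
            tree_[i] = combine(tree_[2 * i], tree_[2 * i + 1]);
    }
    std::size_t root_hash() const { return tree_[1]; }   // O(1) to read
};
```

Compared with the flat caching wrapper, this keeps the hash exact after every write at the cost of lg(n) work per mutation.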

Finally, since you are working with 10 million elements at a time, consider moving over to a serious large-scale storage solution, like a database or what have you. Using 80-megabyte keys in a map seems strange to me.

+4
