Compiling a consistent hash of char types

I want to implement a kind of Symbol in the same way ruby ​​does.

To do this, I created a custom literal that returned std::hash corresponding std::basic_string<T> .

The code was wonderful, but since I read somewhere , the hash function may not be consistent across multiple executions of the same program. Moreover, I would like to do this calculation at compile time, which was 1) not supported by std::hash and 2) if std::hash changed the return value, I would have broken the code.

So, I wrote the following implementation based on java.lang.String.hashCode .

 typedef size_t symbol; template<typename CharT> constexpr size_t constant_hash(const CharT* p, size_t h = 0) noexcept { return (*p == 0) ? h : constant_hash(p + 1, h * 31 + static_cast<size_t>(*p)); } constexpr symbol operator "" _sym (const char* p, size_t n) noexcept { return constant_hash(p); } 

My question is : are there any problems with this implementation?

I can only test it on GCC 4.7.1, and I would like to know if it complies with the standard and should also work with another compiler.

I ask this because the previous implementation worked on GCC, but called segfault if the binary was compiled using clang ++ (the problem is undefined behavior with index operators, which I think).

Thanks in advance

Edit

Work with clang ++ (thanks KennyTM)

+4
source share
3 answers

No UB, it works fine if the string has a '\0' termination. Note that evaluating constexpr cannot cause UB; arithmetic or pointer operations that invoke UB at runtime are necessary to create a compilation error in the context of a constant expression.

Note that static_cast not required; the operand char will be assigned the value size_t .

In addition, at first glance, the hash function does not look very good, because h * 31 is just ( h << 5 ) - h . You can choose a larger number with 1 randomly distributed over the entire binary value. But, on the other hand, they can try to be smart, since low 5 bits of ASCII have the greatest entropy, and this eliminates the possibility of collision of short lines of different lengths.

+2
source

Note: n3333 is a proposal for C ++ 17. Although I do not believe that for C ++ 11 there is a requirement that the hash produces the same result on several runs, in practice I believe that all current implementations are executed.

+2
source

In the current active C ++ standard, the definition of a hash function is generally written in such a way as to provide more domain-specific implementation capabilities, rather than requiring that the hashes be executed in a specific way. For example, it allows you to create empty lines and use the memory location of the pool instance as a hash value (by the way, this is how Ruby executes its lines and hashes, which led to some interesting questions ). If you calculate your hash on data, not a link, then the value will be stable - if you have not found any form of mathematics where there are no constant expressions.

In principle, β€œmay not be,” in this case, gives permission for things to behave in a certain way, as opposed to a statement about what might happen.

However, if you use std::hash , you cannot guarantee that the values ​​will always be the same between executions (and in the future, if n3333 is accepted, any code that relies on this will break), so it’s better to define your own own stable hash function if you need stable hashing.

+1
source

All Articles