Creating a sortable hash of a string

Is there any way to create hashes of strings such that sorting the hashes gives the same order as sorting the strings themselves?

+6
language-agnostic sorting hash
6 answers

This is not possible, at least if you allow strings longer than the hash. You have 256^(maximum string length) possible strings mapped onto 256^(hash size) hash values, so some distinct strings must share a hash, and those strings cannot be ordered by their hashes.

Imagine the simplest hash: truncating each string to (hash size) bytes.
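A minimal sketch of that truncation hash, assuming byte strings and a hypothetical 4-byte hash size (the name `prefix_hash` is made up for illustration). Truncation never reverses the order of two strings, but any strings that share their first 4 bytes collide, so their relative order is lost:

```python
def prefix_hash(s: bytes, size: int = 4) -> bytes:
    """Hypothetical 'hash': keep only the first `size` bytes."""
    return s[:size]


words = sorted([b"bandage", b"apple", b"band", b"apricot", b"bandana"])

# Order is never reversed: a <= b implies prefix_hash(a) <= prefix_hash(b),
# so the hashes of an already sorted list are themselves sorted.
hashes = [prefix_hash(w) for w in words]
order_kept = hashes == sorted(hashes)

# But distinct strings collide, so the hash alone cannot order them.
collide = prefix_hash(b"bandage") == prefix_hash(b"bandana")
```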

+7

Yes. It's called using the entire input string as the hash.

+5

As others have pointed out, it is not practical to do exactly what you asked. You would have to use the string itself as the hash, which would limit the length of the strings that could be "hashed", and so on.

The obvious approach to maintaining a "sorted hash" data structure would be to maintain both a sorted collection (such as a heap or a binary tree) and a hashed mapping of the data. Insertions and deletions would be O(log(n)), while retrievals would be O(1). Offhand, I'm not sure how often this would be worth the extra complexity and overhead.
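A rough sketch of such a hybrid, assuming a plain `dict` for the hashed mapping and a sorted Python list (via `bisect`) standing in for the tree. The class name `SortedHash` and its methods are made up for illustration. Note that `bisect` finds the position in O(log(n)), but inserting into or deleting from a list is O(n); a balanced tree would be needed to actually get the O(log(n)) updates described above.

```python
import bisect


class SortedHash:
    """Hypothetical hybrid: dict for O(1) lookups, sorted key list for ordering."""

    def __init__(self):
        self._data = {}   # hashed mapping: key -> value
        self._keys = []   # the same keys, kept in sorted order

    def insert(self, key, value):
        if key not in self._data:
            # Binary search for the slot; the list shift itself is O(n).
            bisect.insort(self._keys, key)
        self._data[key] = value

    def remove(self, key):
        del self._data[key]                      # raises KeyError if absent
        i = bisect.bisect_left(self._keys, key)  # key is known to be present
        del self._keys[i]

    def get(self, key):
        return self._data[key]                   # O(1) on average

    def items_sorted(self):
        return [(k, self._data[k]) for k in self._keys]
```

Every update touches both structures, which is the doubled maintenance cost this answer is concerned with.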

If you had a particularly large data set that was mostly read-only, and for which logarithmic-time retrieval was too expensive, then I suppose this might be useful. Note that the cost of an update is actually the sum of the constant-time (hash) and logarithmic-time (binary tree or heap) operations. However, O(1) + O(log(n)) reduces to the larger of the two terms in asymptotic analysis. (The underlying cost is still there, and it matters to any implementation effort regardless of its theoretical irrelevance.)

For a significant range of data set sizes, the cost of maintaining this hypothetical hybrid data structure could be estimated as "twice" the cost of maintaining either of the pure ones. (In other words, many binary tree implementations can scale to billions of elements (2^~32 or so) at a time cost comparable to that of typical hash functions.) So I would be hard-pressed to convince myself that the added code complexity and run-time cost of a hybrid data structure would actually benefit a given project.

(Note: I saw that Python 3.1.1 added the notion of "ordered" dictionaries. This resembles sorting, but is not quite the same. From what I gather, an ordered dictionary preserves the order in which the elements were inserted into the collection. I also seem to remember some talk of "views": objects in the language that can access the keys of a dictionary in some particular manner (sorted, reversed, reverse sorted, ...) at a (possibly) lower cost than passing the set of keys through the built-in "sorted()" and "reversed()". I haven't used these, nor have I looked at the implementation details. I would guess that one of these "views" would be something like a lazily evaluated index, performing the necessary sorting on call and storing the results along with some sort of flag or trigger (observer or listener pattern) that is reset when the underlying collection is updated. Under that scheme a call to the "view" would update its index, and subsequent calls would be able to reuse those results so long as no insertions or deletions had been made to the dictionary. Any call to the view after a key change would incur the cost of updating the view. However, this is all pure speculation on my part. I mention it because it might also give some insight into alternative ways to approach the question.)
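To make the speculation above concrete, here is a small sketch of a lazily sorted key index with a "dirty" flag. It is not how Python's dictionary views are implemented; the class `LazySortedKeys` and its `resorts` counter are hypothetical names used only to show the caching idea.

```python
class LazySortedKeys:
    """Hypothetical dict wrapper that re-sorts its keys only after a change."""

    def __init__(self):
        self._data = {}
        self._sorted = []     # cached sorted keys
        self._dirty = False   # set when keys are added or removed
        self.resorts = 0      # counts how often the cache was rebuilt

    def __setitem__(self, key, value):
        if key not in self._data:
            self._dirty = True   # a new key invalidates the cached order
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]
        self._dirty = True

    def __getitem__(self, key):
        return self._data[key]

    def sorted_keys(self):
        if self._dirty:
            # Pay the O(n log n) sort only when the key set has changed.
            self._sorted = sorted(self._data)
            self._dirty = False
            self.resorts += 1
        return list(self._sorted)  # copy, so callers cannot corrupt the cache
```

Repeated calls to `sorted_keys()` with no intervening insertions or deletions reuse the cached result; changing only the value of an existing key does not trigger a re-sort.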

+2

Not unless there are fewer strings than hash values and the hash is perfect. Even then you still have to ensure that the hash order matches the string order, which is probably not possible unless you know all the strings ahead of time.
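When all the strings are known ahead of time, one simple way to get such a perfect, order-preserving hash is to use each string's rank in the sorted list. A minimal sketch, where `build_rank_hash` is a hypothetical helper name:

```python
def build_rank_hash(strings):
    """Map each known string to its position in sorted order."""
    return {s: rank for rank, s in enumerate(sorted(set(strings)))}


known = ["pear", "apple", "fig", "banana"]
rank = build_rank_hash(known)

# Sorting by the small integer hash matches sorting the strings themselves.
by_hash = sorted(known, key=rank.__getitem__)
```

This only works for the fixed set it was built from: a string that was not in `known` has no hash, and adding one may require renumbering the others.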

+1

No. The hash would have to contain the same amount of information as the string it replaces. Otherwise, if two strings mapped to the same hash value, how could you possibly sort them?

Another way to think about it: if I have two strings, "a" and "b", and I hash both of them with this sort-preserving hash function, I get f(a) and f(b). However, there are infinitely many strings that are greater than "a" but less than "b". So this would require hashing the strings to arbitrary-precision values. In the end, you would basically just have the string encoded as a number.
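As an illustration of "the string encoded as a number", here is one possible order-preserving encoding, assuming byte strings compared bytewise. It treats each byte, shifted by one, as a digit after the point in base 257 and uses exact fractions for arbitrary precision. The function name `encode` and the base-257 scheme are assumptions for this sketch, not something from the answer.

```python
from fractions import Fraction


def encode(s: bytes) -> Fraction:
    """Encode bytes as an exact fraction in [0, 1) that preserves byte order."""
    value = Fraction(0)
    scale = Fraction(1)
    for byte in s:
        scale /= 257
        # Digits run from 1 to 256, so a string always sorts after its prefixes.
        value += (byte + 1) * scale
    return value


strings = [b"b", b"a", b"ab", b"aa", b"", b"a\x00"]
by_number = sorted(strings, key=encode)
```

The denominator grows with the string length, so the number carries as much information as the string did and nothing is saved.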

+1

You are essentially asking whether you can compress key strings into smaller keys while preserving their sort order. So it depends on your data. For example, if your strings consist only of hexadecimal digits, each character can be replaced with a 4-bit code.

But in the general case this is not possible. You end up "hashing" every source key to itself.
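For the hexadecimal case mentioned above, a minimal sketch, assuming every key is lowercase hex with the same even length (mixed case or varying lengths would need extra handling). `pack_hex` is a hypothetical name; `bytes.fromhex` is the standard library call doing the work.

```python
def pack_hex(key: str) -> bytes:
    """Pack a hex string into bytes: two 4-bit codes per byte."""
    return bytes.fromhex(key)


keys = ["ffee", "00ff", "1234", "0a10"]

# Each packed key is half the size of the original string...
packed = sorted(pack_hex(k) for k in keys)

# ...and sorting the packed keys gives the same order as sorting the strings.
same_order = [p.hex() for p in packed] == sorted(keys)
```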

+1