Correct data structure for a bidirectional key ↔ value table

I have a set of key (character) ↔ hash (integer) associations in R. I would like to store these associations in one structure that lets me look up a key/hash pair by either the key or the hash.

So something like

"hello" <-> 1234 

in the db variable.

And access it roughly like this (the exact access syntax does not matter):

 db["hello"] -> 1234
 db[1234] -> "hello"

I tried using a data frame with the keys as row names. But then I cannot look up the key string from its integer hash. If I use the hash integers as row names, then I cannot look up by key, and so on.

My current solution is to maintain two dbs as two data frames. One has the hashes as row names, the other has the keys as row names. This works, but it seems awkward and repetitive to maintain two data frames that are identical except for their row names.

I would like it to be very fast in both directions :). I think that means O(log(n)) for the character direction and O(1) for the integer direction, but I'm not a data structure/algorithm specialist. O(log(n)) in the integer direction is probably OK, but I think O(n) (a solution that has to walk the entire db) in either direction will slow things down too much.

The db is also bijective. That is, each key maps to exactly one value, and each value maps to exactly one key.

EDIT: Thanks for the posts:

After running several tests, the match technique is certainly slower than a keyed data.table. As Martin noted, this is due solely to the time match needs to build its hash table. That is, the look-up itself is fast in both cases (a hash look-up for match, a binary search for a keyed data.table). Even so, match is too slow for my needs when returning a single value. So I will code up the data.table solution and post it.

 > system.time(match(1,x))
    user  system elapsed
   0.742   0.054   0.792
 > system.time(match(1,x))
    user  system elapsed
   0.748   0.064   0.806
 > system.time(match(1e7,x))
    user  system elapsed
   0.747   0.067   0.808
 > system.time(x.table[1])
    user  system elapsed
       0       0       0
 > system.time(x.table[1e7])
    user  system elapsed
   0.001   0.001   0.000
 > system.time(x.table[1e7])
    user  system elapsed
   0.005   0.000   0.005
 > system.time(x.table[1])
    user  system elapsed
   0.001   0.000   0.000
 > system.time(x.table[1])
    user  system elapsed
   0.020   0.001   0.038

EDIT2:

I went with the fmatch solution and a named vector. I liked the simplicity of the match approach, but I do repeated look-ups on the db, so the performance hit of rebuilding the hash table on every match call was too large.

fmatch has the same interface as match, works on the same named-vector data type, and so on. It simply caches the hash table it creates, so that subsequent calls on the same vector only have to perform the hash look-up. All of this is hidden from the caller, so fmatch is just a drop-in replacement for match.
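To illustrate the drop-in claim, here is a small sketch that compares fmatch with match on a toy vector. It assumes the fastmatch package may or may not be installed, so the call is guarded; `db`, `tbl`, `first` and `second` are names made up for this example.

```r
db  <- c(hello = 1234, hi = 123, hey = 321)
tbl <- names(db)   # the table that is searched

if (requireNamespace("fastmatch", quietly = TRUE)) {
  # Same arguments as match(); the first call builds the hash for `tbl`.
  first  <- fastmatch::fmatch(c("hey", "hello"), tbl)
  # A later call on the same table is expected to reuse that hash.
  second <- fastmatch::fmatch("hi", tbl)
  # The results agree with base match().
  print(identical(first, match(c("hey", "hello"), tbl)))
}
```

If the package is missing, the block simply does nothing, which keeps the example runnable with base R alone.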

Simple wrapper code for bidirectional search:

 library(fastmatch)  # provides fmatch()

 # key(s) -> hash(es)
 getChunkHashes = function(chunks, db) {
   return(db[fmatch(chunks, names(db))])
 }

 # hash(es) -> key(s)
 getChunks = function(chunkHashes, db) {
   return(names(db[fmatch(chunkHashes, db)]))
 }
+7
3 answers

Given:

The db is also bijective. That is, each key maps to exactly one value, and each value maps to exactly one key.

Then I would suggest a hash solution (such as the hash package), the fastmatch package, or data.table::chmatch. A keyed join in data.table is more for ordered multi-column keys and/or grouped data, which is not really the problem you have.
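As a rough sketch of the hash idea using only base R, two hashed environments can hold one direction each. This is not the hash package's API; `make_bimap`, `key_to_hash` and `hash_to_key` are hypothetical helper names, and the sketch assumes the hashes are stored as R integers.

```r
# Hypothetical helper: build one hashed environment per direction.
make_bimap <- function(keys, hashes) {
  stopifnot(is.character(keys), is.integer(hashes),
            length(keys) == length(hashes),
            !anyDuplicated(keys), !anyDuplicated(hashes))  # bijective map
  fwd <- new.env(hash = TRUE)   # key  -> hash
  bwd <- new.env(hash = TRUE)   # hash -> key
  for (i in seq_along(keys)) {
    assign(keys[i], hashes[i], envir = fwd)
    # Environment names must be strings, so the integer hash is converted.
    assign(as.character(hashes[i]), keys[i], envir = bwd)
  }
  list(fwd = fwd, bwd = bwd)
}

key_to_hash <- function(bm, key) {
  get(key, envir = bm$fwd, inherits = FALSE)
}

hash_to_key <- function(bm, hash) {
  # as.integer() first, so that 1234 and 1234L give the same string "1234".
  get(as.character(as.integer(hash)), envir = bm$bwd, inherits = FALSE)
}

bm <- make_bimap(c("hello", "hi", "hey"), c(1234L, 123L, 321L))
key_to_hash(bm, "hello")   # 1234
hash_to_key(bm, 1234L)     # "hello"
```

Each look-up here handles a single key or hash; for many look-ups at once, a vectorized approach such as fastmatch is likely the better fit.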

+3

The basic approach is to use a named vector:

 db <- c(hello = 1234, hi = 123, hey = 321) 

To go from key(s) to value(s), use [:

 db[c("hello", "hey")]
 # hello   hey
 #  1234   321

Going from value(s) to key(s) is a little trickier:

 names(db)[match(c(321, 123), db)]
 # [1] "hey" "hi"

(Note that match(x, y) returns the index of the first match of x in y, so this approach only works well if your map is injective, which you did not specify in your question.)

If you find the last usage too "heavy", you can certainly wrap it in your own function.
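One possible shape for such wrappers, as a sketch built on the two look-ups shown above. `value_of` and `key_of` are hypothetical names, and the duplicate check is an added safeguard for the injectivity caveat.

```r
db <- c(hello = 1234, hi = 123, hey = 321)

# match() only returns the first hit, so confirm the map is one-to-one.
stopifnot(!anyDuplicated(names(db)), !anyDuplicated(db))

# key(s) -> value(s), dropping the names from the result
value_of <- function(db, keys) unname(db[keys])

# value(s) -> key(s); unknown values come back as NA
key_of <- function(db, values) names(db)[match(values, db)]

value_of(db, c("hello", "hey"))   # 1234 321
key_of(db, c(321, 123))           # "hey" "hi"
key_of(db, 999)                   # NA
```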

Note: as pointed out, this approach is potentially slow in the value-to-key direction, so it may not be ideal for repeated bidirectional access to a large map. In its defense, it is easy to implement, does not require any package other than base, and will do a very decent job for 99% of people's needs. If nothing else, it can be used here as a baseline to compare faster alternatives against.

+4

Expanding on @claytonstanley's comment on @flodel's answer: match builds a hash of one of its arguments and then looks up the other in it. The cost is in creating the hash, not in the look-up

 > n = 1e7; x = seq_len(n)
 > system.time(match(1, x))
    user  system elapsed
   1.156   0.064   1.222
 > system.time(match(n, x))
    user  system elapsed
   1.152   0.068   1.221

and it is amortized over the number of look-ups performed

 > y = sample(x)
 > system.time(match(y, x))
    user  system elapsed
   2.112   0.052   2.167

so you definitely want the look-up to be "vectorized".
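A small sketch of what "vectorized" means here, on a smaller vector than the 1e7 used above so that it runs quickly. The names `queries`, `vectorized` and `looped` are made up for the example, and no timings are claimed because they depend on the machine and the R version.

```r
n <- 1e5
x <- seq_len(n)
queries <- sample(x, 100)

# One vectorized call: `x` is processed once for all 100 queries.
vectorized <- match(queries, x)

# One call per query: every call has to process `x` from scratch.
looped <- vapply(queries, function(q) match(q, x), integer(1))

# Both give the same positions; only the amount of work differs.
identical(vectorized, looped)
```

Wrapping each variant in system.time() shows the difference on a given setup.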

+2