Random hash table enumeration in OCaml

Question

Random hash table enumeration in OCaml

Sorry for the long question. I decided to first explain the context of the problem, as there may be other solutions to my problem. If in a hurry, just read the QUESTION below.

(EDITED - At the same time, I added some attempts to solve the problem. The fourth has my final conclusion, you can go straight to it.)

CONTEXT

I have a hash table filled with about 20k pairs (key (i), value (i)). I want to generate random lists like this

[(key(213),value(213));(key(127),value(127));(key(89),value(89));...]

The limitation is that as soon as I select key (213) to be the first element of the list, not all of its keys can follow it (I have some other “solve” function that can decide if the next the key is to be the next one on the list or not). So, I would like to select a random next element and check if it matches - in the example above, the key (127) was selected. If this element is rejected by my "solve" function, I would like to choose another randomly. But I would not want to choose the same thing that was rejected, because I know that it will be rejected again, and this will not only be ineffective, but also a risk, if only a few keys can be next, take a long time until I find suitable key. Please note that may be repeated, for example

 [(key(213),value(213));(key(213),value(213));(key(78),value(78));...]

This is normal if the “decide” function accepts key (213) as the next one in the list. So, all I need is a way to randomly list pairs (key, value) in a hash table. Whenever I have to select a key, I create an enumeration that I use, checking each new element with the "solve" function (therefore, no repetitions occur), and when I find it, I add it to the list and continue to increase the list, The thing is, I don’t want this hash table enumeration to be the same every time. I want this to be random. (This is due to the structure of the search space, which I have in my specific problem, which does not matter here.)

I can, of course, realize this by creating random integers and using only lists - what I am doing now. But, since this is something that I come across quite often, I wonder if there is any random enumeration facility for hash tables somewhere.

QUESTION

Is there any random enumeration function for hash tables somewhere? I know the BatHashtbl.enum function (Batteries library), but I think that it will always give me the same enumeration for the same hash table (is this true?). In addition, there seems to be nothing like this in this BatHashtbl module. I would be interested in something like

 random_enum: ('a, 'b) t -> int -> ('a * 'b) Enum.t

which, if provided with a hash table and some integer as the starting number for a random generator, will give another random enumeration of the hash table. Any ideas?

Thanks for any help!

Best, Surikator.

FIRST LAST

Following Niki's suggestion in the comments and looking at the Batteries library in more detail, I came up with this

 let rand_enum ht n = BatRandom.init n; let hte = BatHashtbl.enum ht in let s = BatRandom.shuffle hte (* This returns*) in Array.to_list s

which is of type

 val rand_enum : ('a,'b) BatHashtbl.t -> int -> ('a*'b) list

It uses the Fisher-Yeiss algorithm for shuffling, which works in O (n). It returns a list instead of an enumeration, and this is very annoying because it means that even if I am satisfied with the third element of the list obtained using rand_enum, the function will still calculate random numbering for all 20k elements in the hash table.

Best, Surikator

SECOND LAST

I defined the RndHashtblEnum module as

 (* Random Hashtable Enumeration Module *) type ('a,'b) t = { ht:('a,'b) BatHashtbl.t; mutable ls:('a*'b) list; f: (('a,'b) BatHashtbl.t -> ('a*'b) list)} let shuffle ht = let hte = BatHashtbl.enum ht in let s = BatRandom.shuffle hte in Array.to_list s let create ht n = (BatRandom.init n; {ht=ht;ls=shuffle ht;f=shuffle}) let rec next re = match re.ls with | [] -> re.ls<-(re.f re.ht);next re | h::t -> re.ls<-t; h

It has a new type t for random hash table enumerations. This type stores the hash table that we want to enumerate, the list we will enumerate, and the function to calculate the new enumerated list (from the hash table) after the list is completed. As soon as the list ends, when we ask for a new random item in the hash table, type t automatically places a new random list created from the hash table.

So, using the above module, if we want to randomly list a hash table, we simply do:

 let re = RndHashtblEnum.create ht 1236

to create a random enumeration of the ht hash table with a random seed of 1236 (in this code I assume that the hash table was defined earlier), and we can then write

 let (k,v) = RndHashtblEnum.next re

to get the next (k, v) pair from a random census.

One question we can ask is whether this is a fair coincidence, because I will use the remaining list to randomly list the hash table the next time I need a random listing. Well, that is not so. If my hash table says 1000 elements and after extracting 5 random, I am satisfied with the result, I know that in the next 995 (second set of exceptions) none of these 5 elements will be extracted. So this is an unfair accident. This is even worse. It is possible that in the next 1000 selections (995 from this list, 5 from the next listing list) some elements will not be covered. On average, the algorithm is fair, but it's not fair all the time.

Best, Surikator.

THIRD ATTEMPT

Hi,

Including Niki’s proposal to use BatArray.enum and a fundamental change to the stochastic part of the algorithm, I came up with a new and improved version of the RndHashtblEnum module. This sentence:

 (* Improved Random Hashtable Enumeration Module *) type ('a,'b) t = {ht:('a,'b) BatHashtbl.t; mutable enum:('a*'b) BatEnum.t; enum0: ('a*'b) BatEnum.t} let shuffle ht = let hte = BatHashtbl.enum ht in let s = BatRandom.shuffle hte in BatArray.enum s let create ht n = let e = shuffle ht in (BatRandom.init n; {ht=ht;enum=BatEnum.clone e;enum0=e}) let rec next re = match BatEnum.get re.enum with | None -> re.enum<-re.enum0; next re | Some e -> e

This new module gets rid of the (stupid) costs of passing the array to the list and only uses the Fisher-Yates algorithm once at the beginning - so in the end we can consider the Fisher contribution - There should be O (1) for the bit.

The new version is now fair in terms of randomness. It wasn’t easy to see, and I needed to understand it. Suppose a hash table has 1000 entries. In the new version, we always use the same enumeration (enum0 - fixed when we create a random enumeration using the "create" function). This means that when trying to find the next element in our final list, since some key in the hash table will have to satisfy the "solve" function (otherwise we could not continue the algorithm, and we would just stop), it will do it somewhere- then between the 0th and 999th records. Suppose the input is 300. Now, given that we selected this key to select the next key in the final list, our listing will continue with the remaining 700 elements, and then move on to the next 300 in a copy of the same listing. So, 700 + 300 is 1000 in the hash table. This means that we will always consider each entry in the hash table once and only once. Another thing is that every time we try to find a key that can be found in a list that can be found in entry 300, as well as in item 734 or something else, because the decision function actually depends on which previous keys were selected prior to this item in the final list. So, every time we start a new search for the item for the final list in the hash table, we start with a random item in the hash table.

Sorry if this is not very clear. Hard to explain. =)

Thanks for all the comments.

Best, Surikator.

FOURTH AND FINAL ATTEMPT - THIS IS MY SUGGESTED DECISION

Hello again,

The gas exchange is worried about the variable fields and listings in general and all the strange side effects that can come from there, I decided to forget about ready-made solutions using the available hash table table libraries, and wrote my materials using simple lists. I also made laziness cope in order to avoid generating random lists from which only a part will be used (in this way, useful lazy things that you suggested, Nicky were used).

I created a type

 type 'a node_t = | ENil | ECons of 'a * 'a list * 'at and 'at = ('a node_t) Lazy.t

for lazy random listing listings. Each enumeration is either empty (ENIL) or not (ECons), in which case it has three parts: (1) the element currently in focus, (2) the remaining available elements for the enumeration, (3) another enumeration for continuation of this listing.

Then a random list rewrite can be obtained using the create function

 let rec create ls = lazy( match ls with | [] -> ENil | h::t -> let n = Random.int (List.length ls) in let newx,rest=remove ls n in ECons(newx,rest,create t))

where the auxiliary function remove defined to extract the nth element of the list and return the pair (x,ls) , where x is the selected element and ls is the new list without the extracted element. For completeness, I also add remove function code.

 let rec remove ls n = let rec remove_ ls acc kn = match ls with | [] -> raise (Failure "remove") | h::t -> if k=n then h, List.rev_append acc t else remove_ t (h::acc) (k+1) n in remove_ ls [] 0 n

Now we can define very simple functions to generate the next random census state and get the actual item in each enumeration state. it

 exception End_of_enum let next e = match Lazy.force e with | ENil -> raise End_of_enum | ECons(x,ls,t) -> t let rec get e = match Lazy.force e with | ENil -> raise End_of_enum | ECons(x,ls,t) -> x

OK, so far, I just listed lists randomly. If we want to list a hash table, we can use

 let rand_enum ht = let ls = Hashtbl.fold (fun kv acc -> (k, v) :: acc) ht [] in create ls

to get a random listing of pairs in a hash table, and we can use the following and get to get pairs (key, value). fold , but this is just a way to get all the pairs (key, value) of the hash table in the list (thanks Pascal for your answer in this question ).

This completes the entire enumerated hash table. For completeness, I also add a solution to the general problem that I was trying to solve, described in "The Context" above. The problem, if you remember, is to randomly create a list of pairs (key, value) from (1) the hash table and (2) the decide function, which can determine if (key, value) can be added to a specific list of pairs. Since the entire generation process can never end in order to provide an end, I thought it would make sense to have a third argument, which is a function that indicates whether to stop the process or not (and that we need to make sure that at some point we return true to complete the whole process).

The generate function may be something like

 let generate ht d stop = let rec gen1 d fst e = if d (List.rev fst) (get e) then (get e)::fst else gen1 d fst (next e) in let rec generate_ ht d stop acc = let e = rand_enum ht in if stop acc then acc else try generate_ ht d stop (gen1 d acc e) with End_of_enum -> generate_ ht d stop (List.tl acc) in generate_ ht d stop []

Thanks to everyone who contributed helpful comments. It was really helpful.

All the best, Surikator.

+7

hashtable enumeration random ocaml

Surikator Oct 29 '10 at 12:41

source share

4 answers

Niki Yoshiuchi · Answer 1 · 2010-10-29T17:32:50+0000

I have two suggestions. The first is to change your rand_enum function rand_enum that it returns Enum.t:

 let rand_enum ht n = BatRandom.init n; let hte = BatHashtbl.enum ht in Array.enum (BatRandom.shuffle hte)

which is not much different (it still computes a random enumeration for all 20k), but closer to what you originally wanted.

Alternatively, you can always take the HashTbl source code and recompile it using the rand_enum function. However, it also probably won't be that different, since HashTbl is implemented as an array, and if you want to avoid bad duplicates, you will probably end up using shuffle.

gasche · Answer 2 · 2010-10-29T18:25:25+0000

What is the density of the potential next element? What is the cost of your decide function?

All of your current solution has an O (n) value. Fisher-Yates is O (n) (and it makes no sense to try to adapt it for Enums, as this will require a forced enumeration), and Array.to_list alos is O (n).

If your decide function is fast enough and your density is low enough, I think the easiest way is to build a list / array of all suitable elements (call decide for each element of the table), then randomly select one of them.

If the density is high enough and decide expensive, I think your first idea is to select keys in a random order and save a list of keys that are already encountered. You can select the first counter element (the optimal number of calls to decide ). This way of listing the sequence becomes costly “at the end” when all the elements are already selected, but if your density is high, you will not run into this case.

If you don’t know, it may be interesting to start with the “high density” hypothesis and change your mind when you see this part of the table and find nothing.

Finally: if you do not need to add / remove elements while generating your sequence, it would be interesting to convert your hash table into an array once and forall (somewhere else a different key table → an array is stored somewhere), since all such problems easier when indexing is contiguous.

gasche · Answer 3 · 2010-10-29T19:13:18+0000

Your implementations) (second and third) are too complex. I don't like mutable , and I don't like Enum . Combining both of them is the best way to shoot in the foot with uncontrollable side effects.

I also believe that your specific problem is too specific to be solved using the universal form of "shuffle something and what it is." Trying to find a domain-independent feature that also solves your domain-related problem might be why your sequential implementation becomes more complex and complex with every attempt.

Creating a random stream from a Hashtable is simple: BatHashtbl.enum |- BatRandom.shuffle |- BatArray.enum . The rest of your code should be about using the decide function.

ygrek · Answer 4 · 2010-10-29T14:42:04+0000

I doubt that such a function is set by the interface open by Hashtbl. An obvious approach, such as getting all the values in an array and searching in Array.get a (Random.int (Array.length a)) , works fine for me.

Random hash table enumeration in OCaml

More articles: