Choose N random items from a list efficiently (without toArray and without changing the list)

As the title says, I want to use the Knuth-Fisher-Yates shuffle algorithm to select N random items from a list, but without using List.toArray and without changing the list. Here is my current code:

    public List<E> getNElements(List<E> list, Integer n) {
        List<E> rtn = null;
        if (list != null && n != null && n > 0) {
            int lSize = list.size();
            if (lSize > n) {
                rtn = new ArrayList<E>(n);
                E[] es = (E[]) list.toArray();
                // Knuth-Fisher-Yates shuffle algorithm
                for (int i = es.length - 1; i > es.length - n - 1; i--) {
                    int iRand = rand.nextInt(i + 1);
                    E eRand = es[iRand];
                    es[iRand] = es[i];
                    // This is not necessary here as we do not really need the final shuffle result.
                    // es[i] = eRand;
                    rtn.add(eRand);
                }
            } else if (lSize == n) {
                rtn = new ArrayList<E>(n);
                rtn.addAll(list);
            } else {
                log("list.size < nSub! ", lSize, n);
            }
        }
        return rtn;
    }

It uses list.toArray() to create a new array so that the original list is not changed. However, my problem is that my list can be very large, up to 1 million items, and then list.toArray() is too slow. My n can range from 1 to 1 million. When n is small (say 2), the function is very inefficient, since it still has to call list.toArray() on a list of 1 million elements.

Can someone help improve the code above to make it more efficient for large lists? Thanks.

Here I assume that the Knuth-Fisher-Yates shuffle is the best algorithm for selecting n random items from a list. Am I right? I would be very happy if there were algorithms better than Knuth-Fisher-Yates for this task in terms of both speed and quality of results (guaranteed uniform randomness).

Update:

Here are some of my results:

When choosing n out of 1,000,000 items.

When n < 1,000,000 / 4, the fastest way is to use Daniel Lemire's bitmap function to first select n random indices, then fetch the elements at those indices:

    public List<E> getNElementsBitSet(List<E> list, int n) {
        List<E> rtn = new ArrayList<E>(n);
        int[] ids = genNBitSet(n, 0, list.size());
        for (int i = 0; i < ids.length; i++) {
            rtn.add(list.get(ids[i]));
        }
        return rtn;
    }

genNBitSet uses the generateUniformBitmap code from https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2013/08/14/java/UniformDistinct.java
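For reference, here is a minimal sketch of the bitmap idea (my own reconstruction based on the linked code, not a copy of it; the class and method names are mine): draw random indices and use a BitSet to reject duplicates until n distinct ids have been collected.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Random;

public class BitmapIds {
    // Sketch (an assumption, not Lemire's exact code): collect n distinct
    // random ids in [min, max), rejecting duplicates via a BitSet.
    static int[] genNBitSet(Random rnd, int n, int min, int max) {
        if (n > max - min) {
            throw new IllegalArgumentException("cannot pick " + n + " distinct ids");
        }
        BitSet seen = new BitSet(max - min);
        int[] ids = new int[n];
        int count = 0;
        while (count < n) {
            int candidate = min + rnd.nextInt(max - min);
            if (!seen.get(candidate - min)) { // skip already-chosen ids
                seen.set(candidate - min);
                ids[count++] = candidate;
            }
        }
        Arrays.sort(ids); // sorted ids make the later list.get() calls cache-friendly
        return ids;
    }
}
```

Rejection sampling like this stays cheap as long as n is well below max - min, which matches the n < size / 4 regime measured above.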

For n > 1,000,000 / 4, the reservoir sampling method is faster.

So I created a function that combines these two methods.
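A sketch of such a combined function (the size / 4 crossover is taken from the benchmarks above; both helper implementations here are my own stand-ins, not the OP's code):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class CombinedSampler {
    // Hypothetical combiner: dispatch on n relative to list.size().
    static <E> List<E> getNElements(Random rnd, List<E> list, int n) {
        if (n * 4L < list.size()) {
            return pickByRandomIndices(rnd, list, n);
        }
        return reservoirSample(rnd, list, n);
    }

    // Small n: draw n distinct random indices, then fetch those elements.
    static <E> List<E> pickByRandomIndices(Random rnd, List<E> list, int n) {
        Set<Integer> ids = new LinkedHashSet<>();
        while (ids.size() < n) {
            ids.add(rnd.nextInt(list.size())); // duplicates are ignored by the set
        }
        List<E> out = new ArrayList<>(n);
        for (int id : ids) {
            out.add(list.get(id));
        }
        return out;
    }

    // Large n: one pass over the list, O(1) extra work per element.
    static <E> List<E> reservoirSample(Random rnd, List<E> list, int n) {
        List<E> r = new ArrayList<>(list.subList(0, n));
        for (int i = n; i < list.size(); i++) {
            int j = rnd.nextInt(i + 1);
            if (j < n) {
                r.set(j, list.get(i));
            }
        }
        return r;
    }
}
```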

java algorithm random
5 answers

You might be looking for something like Reservoir Sampling.

Start by filling the reservoir with the first k elements, then replace its entries with new elements with decreasing probability:

Java as pseudo-code:

    E[] r = new E[k]; // not really: cannot create an array of a generic type, just pseudo-code
    int i = 0;
    for (E e : list) {
        // assign the first k elements:
        if (i < k) {
            r[i++] = e;
            continue;
        }
        // replace a random reservoir slot with decreasing probability:
        i++;
        int j = random(i); // a number from 0 to i - 1 inclusive
        if (j < k)
            r[j] = e;
    }
    return r;

This requires a single pass over the data with very cheap operations per iteration, and the space consumption is linear in the required output size.


If n is very small compared to the length of the list, take an empty set of ints and keep adding random indices until the set has the desired size.

If n is comparable to the length of the list, do the same for the complement: build a set of indices to exclude, then return the items whose indices are not in the set.
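A sketch of that complement trick (the class and method names are mine): when n is close to list.size(), it is cheaper to draw the few indices you want to drop than the many you want to keep.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class ExclusionSample {
    // Pick (size - n) indices to *exclude*, then copy everything else.
    static <E> List<E> sampleByExclusion(Random rnd, List<E> list, int n) {
        int size = list.size();
        Set<Integer> excluded = new HashSet<>();
        while (excluded.size() < size - n) {
            excluded.add(rnd.nextInt(size)); // duplicates are ignored by the set
        }
        List<E> result = new ArrayList<>(n);
        for (int i = 0; i < size; i++) {
            if (!excluded.contains(i)) {
                result.add(list.get(i));
            }
        }
        return result;
    }
}
```

Note that, unlike a shuffle, this returns the kept elements in their original list order.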

In between, you can iterate over the list and randomly select items based on the number of items you have seen and the number you have already returned. In pseudo-code, if you want k elements out of N:

    for i = 0 to N-1
        if random(N - i) < k
            add item[i] to the result
            k -= 1
        end
    end

Here, random (x) returns a random number between 0 (inclusive) and x (exception).

This produces a uniformly random selection of k elements. You might also consider creating an iterator instead of building a result list, to save memory, assuming the list is not changed while you iterate over it.
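A sketch of that iterator idea (my own code, assuming the list is not modified during iteration): it performs the same selection sampling lazily, one element per call, in original list order.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public class SampleIterator {
    // Lazily yields k uniformly chosen elements without materializing
    // the whole result list.
    static <E> Iterator<E> sampleIterator(Random rnd, List<E> list, int k) {
        Iterator<E> it = list.iterator();
        int[] state = { list.size(), k }; // {elements left to scan, elements still needed}
        return new Iterator<E>() {
            E next = advance();

            E advance() {
                while (it.hasNext() && state[1] > 0) {
                    E e = it.next();
                    // keep e with probability (still needed) / (left to scan)
                    if (rnd.nextInt(state[0]--) < state[1]) {
                        state[1]--;
                        return e;
                    }
                }
                return null;
            }

            public boolean hasNext() {
                return next != null;
            }

            public E next() {
                E r = next;
                next = advance();
                return r;
            }
        };
    }
}
```

Because the keep-probability reaches 1 whenever the remaining elements are all needed, the iterator always yields exactly k elements.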

With profiling you can determine the crossover point where it makes sense to switch from the naive set-building method to the iteration method.


It suffices to generate n pairwise-distinct random indices out of m, and then look up the corresponding elements in the collection. If you do not need the order of the selected elements to be random, you can use the algorithm due to Robert Floyd:

    Random r = new Random();
    Set<Integer> s = new HashSet<Integer>();
    for (int j = m - n; j < m; j++) {
        int t = r.nextInt(j + 1); // uniform in 0 .. j inclusive
        s.add(s.contains(t) ? j : t);
    }

If you want the order to be random, you can run Fisher-Yates where, instead of an array, you use a HashMap that stores only those mappings in which the key and value differ. Assuming that hashing takes constant time, both of these algorithms are asymptotically optimal (although, obviously, if you want to randomly select a large fraction of the array, there are data structures with better constants).
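A sketch of that map-backed Fisher-Yates (my own reconstruction of the idea; names are mine): the HashMap plays the role of the array, storing only the slots a real shuffle would have overwritten, so picking n of m indices in random order takes O(n) expected time and space.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class SparseShuffle {
    // Fisher-Yates over a "virtual" array holding 0..m-1.
    static List<Integer> sampleInRandomOrder(Random rnd, int m, int n) {
        Map<Integer, Integer> virtual = new HashMap<>();
        List<Integer> result = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            int j = i + rnd.nextInt(m - i);             // swap partner in i..m-1
            int chosen = virtual.getOrDefault(j, j);    // current value at slot j
            virtual.put(j, virtual.getOrDefault(i, i)); // move slot i's value into j
            result.add(chosen);
        }
        return result;
    }
}
```

To sample list elements rather than indices, map each returned index through list.get.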


Just for convenience: an MCVE with an implementation of the reservoir sampling proposed by amit (possible upvotes should go to his answer; I am just hacking up some code).

It really does seem to be an algorithm that neatly covers both the case where the number of elements to choose is small compared to the list size and the case where it is large (assuming that the randomness properties stated on the Wikipedia page hold).

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Map.Entry;
    import java.util.Random;
    import java.util.TreeMap;

    public class ReservoirSampling {
        public static void main(String[] args) {
            example();
            //test();
        }

        private static void test() {
            List<String> list = new ArrayList<String>();
            list.add("A");
            list.add("B");
            list.add("C");
            list.add("D");
            list.add("E");

            int size = 2;
            int runs = 100000;
            Map<String, Integer> counts = new TreeMap<String, Integer>();
            for (int i = 0; i < runs; i++) {
                List<String> sample = sample(list, size);
                String s = createString(sample);
                Integer count = counts.get(s);
                if (count == null) {
                    count = 0;
                }
                counts.put(s, count + 1);
            }
            for (Entry<String, Integer> entry : counts.entrySet()) {
                System.out.println(entry.getKey() + " : " + entry.getValue());
            }
        }

        private static String createString(List<String> list) {
            Collections.sort(list);
            StringBuilder sb = new StringBuilder();
            for (String s : list) {
                sb.append(s);
            }
            return sb.toString();
        }

        private static void example() {
            List<String> list = new ArrayList<String>();
            for (int i = 0; i < 26; i++) {
                list.add(String.valueOf((char) ('A' + i)));
            }
            for (int i = 1; i <= 26; i++) {
                printExample(list, i);
            }
        }

        private static <T> void printExample(List<T> list, int size) {
            System.out.printf("%3d elements: " + sample(list, size) + "\n", size);
        }

        private static final Random random = new Random(0);

        private static <T> List<T> sample(List<T> list, int size) {
            List<T> result = new ArrayList<T>(Collections.nCopies(size, (T) null));
            int i = 0;
            for (T element : list) {
                if (i < size) {
                    result.set(i, element);
                    i++;
                    continue;
                }
                i++;
                int j = random.nextInt(i);
                if (j < size) {
                    result.set(j, element);
                }
            }
            return result;
        }
    }

If n is much smaller than size, you can use this algorithm, which is unfortunately quadratic in n but does not depend on the size of the array at all.

Example with size = 100 and n = 4.

    choose a random number from 0 to 99, let's say 42, and add it to the result.
    choose a random number from 0 to 98, let's say 39, and add it to the result.
    choose a random number from 0 to 97, let's say 41; since 41 >= 39,
        increment it by 1 to get 42, but 42 >= 42 as well, so increment
        again and add 43 to the result.
    ...

In short, you choose from the numbers that remain and then shift the choice past the numbers you have already selected. I would use a linked list for this, but maybe there are better data structures.
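A sketch of this scheme (my own code, using an ArrayList kept sorted rather than a linked list; names are mine): each draw is uniform over the indices not yet taken, and the while loop maps it onto the corresponding unused index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SkipSample {
    // Pick n distinct indices out of [0, size) without touching the array.
    static List<Integer> sampleIndices(Random rnd, int size, int n) {
        List<Integer> chosen = new ArrayList<>(); // kept in ascending order
        for (int i = 0; i < n; i++) {
            int candidate = rnd.nextInt(size - i); // uniform over the size - i free slots
            int pos = 0;
            while (pos < chosen.size() && chosen.get(pos) <= candidate) {
                candidate++; // shift past an index that is already taken
                pos++;
            }
            chosen.add(pos, candidate); // insertion keeps the list sorted
        }
        return chosen;
    }
}
```

Both the shifting loop and the sorted insertion are O(n) per draw, which is where the overall O(n^2) cost (independent of size) comes from.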

