As in the title, I want to use the Shuffle Knuth-Fisher-Yates algorithm to select N random items from a list, but without using List.toArray and changing the list. Here is my current code:
public List<E> getNElements(List<E> list, Integer n) { List<E> rtn = null; if (list != null && n != null && n > 0) { int lSize = list.size(); if (lSize > n) { rtn = new ArrayList<E>(n); E[] es = (E[]) list.toArray(); //Knuth-Fisher-Yates shuffle algorithm for (int i = es.length - 1; i > es.length - n - 1; i--) { int iRand = rand.nextInt(i + 1); E eRand = es[iRand]; es[iRand] = es[i]; //This is not necessary here as we do not really need the final shuffle result. //es[i] = eRand; rtn.add(eRand); } } else if (lSize == n) { rtn = new ArrayList<E>(n); rtn.addAll(list); } else { log("list.size < nSub! ", lSize, n); } } return rtn; }
It uses list.toArray () to create a new array to avoid changing the original list. However, now my problem is that my list can be very large, can have 1 million items. Then list.toArray () is too slow. And my n could range from 1 to 1 million. When n is small (say 2), the function is very inefficient, since it still needs to do list.toArray () for a list of 1 million elements.
Can someone help improve the code above to make it more efficient when working with large lists. Thanks.
Here, I assume that Knuth-Fisher-Yates shuffle is the best algorithm for performing a selection of n random items from a list. I'm right? I would be very happy if other algorithms were better than Knuth-Fisher-Yates to complete the task in terms of speed and quality of results (guarantee real randomness).
Update:
Here are some of my results:
When choosing n out of 1,000,000 items.
When n <1000000/4 is the fastest way to use the Daniel Lemire bitmap function to first select n random identifiers, then get the elements with these identifiers:
public List<E> getNElementsBitSet(List<E> list, int n) { List<E> rtn = new ArrayList<E>(n); int[] ids = genNBitSet(n, 0, list.size()); for (int i = 0; i < ids.length; i++) { rtn.add(list.get(ids[i])); } return rtn; }
GenNBitSet uses the generateUniformBitmap code from https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2013/08/14/java/UniformDistinct.java
For n> 1,000,000 / 4, the collector sampling method is faster.
So, I created a function to combine these two methods.