How can I iterate over a large list of Java integers in less time?

    int[] records = job.getTargetSearchIDs();
    topology.applyMatcherSearchWeight(records);
    int[] mIDs = topology.getMatcherIds();
    SystemResponse[] sysResponse = new SystemResponse[mIDs.length];
    Map<Integer, SearchCommand> mrCmdsMap = new HashMap<Integer, SearchCommand>();

mIDs has 250 entries, and records holds 7.5 million integers. I want this loop to finish in under 3 seconds on a server with an Intel Xeon X5355 (eight cores), 64-bit Linux (Ubuntu), and 32-bit Java.

    for (int mID : mIDs) {
        List<Integer> recIDsToMatch = new LinkedList<Integer>();
        Matcher matcher = topology.getMatcherById(mID);
        for (int record : records) {
            if (matcher.getRange().isInRange(record))
                recIDsToMatch.add(record);
        }
        if (recIDsToMatch.size() > 0) {
            SearchCommand command = new SearchCommand(job.getMatchParameters(), job.getRequestType(),
                    job.getId(), job.getMatchParameters().getEngineProperties(), recIDsToMatch);
            command.setTimeout(searchTimeout, TimeUnit.SECONDS);
            mrCmdsMap.put(mID, command);
        }
    }

What improvements come to mind when you read this code? What data structure and/or algorithm changes would help?

+4
5 answers

If isInRange() really just checks whether a given integer falls inside a specific range, it may be better to put the records into a data structure that performs this operation more efficiently.

For example, try putting the records into a TreeSet, and then use subSet to find the records in a range.
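A minimal sketch of the subSet idea; the record values here are made up for illustration:

```java
import java.util.*;

public class TreeSetRangeDemo {
    // Collect all record IDs that fall inside [lo, hi] with one log-time
    // subSet call instead of scanning every record.
    static SortedSet<Integer> idsInRange(TreeSet<Integer> records, int lo, int hi) {
        // subSet(lo, true, hi, true) is a live view of the inclusive range [lo, hi]
        return records.subSet(lo, true, hi, true);
    }

    public static void main(String[] args) {
        TreeSet<Integer> records = new TreeSet<>(Arrays.asList(3, 7, 12, 25, 40));
        System.out.println(idsInRange(records, 5, 30));  // [7, 12, 25]
    }
}
```

Building the TreeSet costs O(n log n) once; after that, each matcher's query is O(log n) plus the size of its result.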

Another way is to build something like a TreeMap<Integer, List<Matcher>>, where each value is the list of Matchers whose ranges span the interval between the current key and the next key. This may be even better, because the number of Matchers is much smaller than the number of records.
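One way to build such a map is a small sweep over the range boundaries. The int[]{lo, hi} pairs below stand in for the question's Matcher/Range objects, whose API is assumed, not shown:

```java
import java.util.*;

public class RangeIndexDemo {
    // Build a TreeMap whose value at each boundary key is the list of matcher
    // IDs active between that key and the next one. Ranges are inclusive
    // int[]{lo, hi} pairs (a stand-in for the question's Matcher ranges).
    static TreeMap<Integer, List<Integer>> build(Map<Integer, int[]> ranges) {
        TreeSet<Integer> bounds = new TreeSet<>();
        for (int[] r : ranges.values()) { bounds.add(r[0]); bounds.add(r[1] + 1); }
        TreeMap<Integer, List<Integer>> index = new TreeMap<>();
        for (int b : bounds) {
            List<Integer> active = new ArrayList<>();
            for (Map.Entry<Integer, int[]> e : ranges.entrySet())
                if (e.getValue()[0] <= b && b <= e.getValue()[1]) active.add(e.getKey());
            index.put(b, active);
        }
        return index;
    }

    // One floorEntry lookup per record instead of testing all matchers.
    static List<Integer> matchersFor(TreeMap<Integer, List<Integer>> index, int record) {
        Map.Entry<Integer, List<Integer>> e = index.floorEntry(record);
        return e == null ? Collections.emptyList() : e.getValue();
    }

    public static void main(String[] args) {
        Map<Integer, int[]> ranges = new LinkedHashMap<>();
        ranges.put(1, new int[]{5, 10});
        ranges.put(2, new int[]{8, 20});
        TreeMap<Integer, List<Integer>> index = build(ranges);
        System.out.println(matchersFor(index, 9));   // [1, 2]
        System.out.println(matchersFor(index, 15));  // [2]
        System.out.println(matchersFor(index, 3));   // []
    }
}
```

With 250 matchers the index has at most 500 boundary keys, so the per-record lookup is a single cheap floorEntry instead of 250 isInRange calls.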

+3

A single loop does not take advantage of the multiple cores; it would be better to break the iteration into subsets and process them on separate threads.

For example: divide the array into 6 pieces and use one thread for each part.
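A sketch of the chunk-per-thread idea using an ExecutorService; the simple bounds check here is a stand-in for matcher.getRange().isInRange(), and the sizes are made up:

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelScanDemo {
    // Split records into one chunk per worker and count in-range IDs in
    // parallel; the [lo, hi] test stands in for the real isInRange call.
    static long countInRange(int[] records, int lo, int hi, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = (records.length + threads - 1) / threads;
        List<Future<Long>> parts = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int from = t * chunk;
            final int to = Math.min(records.length, from + chunk);
            parts.add(pool.submit(() -> {
                long n = 0;
                for (int i = from; i < to; i++)
                    if (records[i] >= lo && records[i] <= hi) n++;
                return n;
            }));
        }
        long total = 0;
        for (Future<Long> f : parts) total += f.get();  // wait for all chunks
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        int[] records = new int[1_000_000];
        for (int i = 0; i < records.length; i++) records[i] = i;
        System.out.println(countInRange(records, 100, 199, 6));  // 100
    }
}
```

Each worker touches a disjoint slice of the array, so no synchronization is needed until the final reduction over the Futures.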

+2

If you have large data sets and want speed and simplicity, consider using a text search engine such as Lucene, which can index millions of documents and retrieve hits for fairly complex matching criteria in a few milliseconds.

+1

You are iterating over a whole set (rather than looking up a single element in it), so you are working in at least O(n), i.e. linear, time. The nested loop multiplies that by the number of matchers, giving O(n·m), which behaves quadratically as both grow.

Make sure you do not perform any redundant work inside the loop, and hoist whatever you can (any initialization, etc.) out of it.

If all you need is to iterate over the set and then over a subset of its elements, there is not much more you can do beyond what you already have.

0

You have left us to guess the data structures underlying the problem, but there are three possibilities:

  • You really do need to examine 40 billion records in 3 seconds (about 13 G records/second). I do not think your memory system can deliver that bandwidth; if this is really the requirement, you need more hardware. (But I am fairly sure it is not.)

  • You just want to know whether a single number falls in any of 40 million ranges, and most numbers do not. Then you want an interval tree. Many implementations are floating around; unfortunately, neither Apache Commons nor Guava has one.

  • You just want to know which of 40 million numbers fall into roughly 1000 distinct ranges. Sort the 40 million numbers (once), then binary-search for the endpoints of each range; everything between them is a match.

If case 2 or 3 describes your problem, it should take only a fraction of a second on one core.
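The sort-once-then-binary-search approach from case 3 might look like the following sketch (lowerBound is a helper written here, not a library call):

```java
import java.util.*;

public class SortedRangeScan {
    // Sort the records once, then find each range's slice with two binary
    // searches; everything between the two positions is in range.
    static int[] idsIn(int[] sortedRecords, int lo, int hi) {
        int from = lowerBound(sortedRecords, lo);
        int to = lowerBound(sortedRecords, hi + 1);   // exclusive end
        return Arrays.copyOfRange(sortedRecords, from, to);
    }

    // First index whose value is >= key (rewinds past duplicates, since
    // Arrays.binarySearch makes no promise about which duplicate it finds).
    static int lowerBound(int[] a, int key) {
        int i = Arrays.binarySearch(a, key);
        if (i >= 0) { while (i > 0 && a[i - 1] == key) i--; return i; }
        return -i - 1;  // binarySearch encodes the insertion point as -(pos) - 1
    }

    public static void main(String[] args) {
        int[] records = {40, 3, 12, 7, 25};
        Arrays.sort(records);                         // once, O(n log n)
        System.out.println(Arrays.toString(idsIn(records, 5, 30)));  // [7, 12, 25]
    }
}
```

The one-time sort dominates; after that, each of the ~1000 ranges costs only two O(log n) searches plus the copy of its slice.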

0
