Best Java Big Set Data Structure

I need to find spaces in a large Integer set filled with a read loop through files, and I want to know if something already done for this purpose exists to avoid a simple Set object with the risk of heap overflow.

To better explain my question, I must tell you how my ticketing software works. Each ticket has a global progressive number stored in a daily log file, with different information. I have to write a verification procedure to check if there are spaces in the numbers inside the daily log files.

The first idea was to create a reading cycle with all the log files, read each line, get the ticket number and save it in the Integer TreeSet, and then find the spaces in this set. The problem is that the ticket number can be very high and can saturate the space of the heap of memory, and I also want a good solution if I need to switch to Long objects. Set's solution is wasting a lot of memory, because if I find that there are no spaces in the first number 100, it makes no sense to store them in Set.

How can i decide? Can I use some data structure already made for this purpose?

+5
source share
5 answers

I assume that (A) the spaces you are looking for are the exception, not the rule, and (B) the log files you process are mostly sorted by ticket number (although some entries are out of sequence OK).

If so, then I would consider moving your own data structure for this. Here is a brief example of what I mean (a lot remains for the reader).

Basically, it implements Set , but actually saves it as a Map , with each record representing a range of contiguous values ​​in the set.

The add method is overridden to maintain Map support accordingly. For example, if you add 5 to a set and already have a range containing 4, then it simply extends that range instead of adding a new record.

Note that the reason for the “mostly sorted” premise is that for completely unsorted data this approach will still use a lot of memory: the support map will grow (as unsorted records will be added all over the place) to (as additional records fill spaces, allowing you to combine adjacent entries).

Here is the code:

 package com.matt.tester; import java.util.Collection; import java.util.Comparator; import java.util.Iterator; import java.util.Map; import java.util.SortedSet; import java.util.TreeMap; public class SE { public class RangeSet<T extends Long> implements SortedSet<T> { private final TreeMap<T, T> backingMap = new TreeMap<T,T>(); @Override public int size() { // TODO Auto-generated method stub return 0; } @Override public boolean isEmpty() { // TODO Auto-generated method stub return false; } @Override public boolean contains(Object o) { if ( ! ( o instanceof Number ) ) { throw new IllegalArgumentException(); } T n = (T) o; // Find the greatest backingSet entry less than n Map.Entry<T,T> floorEntry = backingMap.floorEntry(n); if ( floorEntry == null ) { return false; } final Long endOfRange = floorEntry.getValue(); if ( endOfRange >= n) { return true; } return false; } @Override public Iterator<T> iterator() { throw new IllegalAccessError("Method not implemented. Left for the reader. (You'd need a custom Iterator class, I think)"); } @Override public Object[] toArray() { throw new IllegalAccessError("Method not implemented. Left for the reader."); } @Override public <T> T[] toArray(T[] a) { throw new IllegalAccessError("Method not implemented. Left for the reader."); } @Override public boolean add(T e) { if ( (Long) e < 1L ) { throw new IllegalArgumentException("This example only supports counting numbers, mainly because it simplifies printGaps() later on"); } if ( this.contains(e) ) { // Do nothing. Already in set. } final Long previousEntryKey; final T eMinusOne = (T) (Long) (e-1L); final T nextEntryKey = (T) (Long) (e+1L); if ( this.contains(eMinusOne ) ) { // Find the greatest backingSet entry less than e Map.Entry<T,T> floorEntry = backingMap.floorEntry(e); final T startOfPrecedingRange; startOfPrecedingRange = floorEntry.getKey(); if ( this.contains(nextEntryKey) ) { // This addition will join two previously separated ranges T endOfRange = backingMap.get(nextEntryKey); backingMap.remove(nextEntryKey); // Extend the prior entry to include the whole range backingMap.put(startOfPrecedingRange, endOfRange); return true; } else { // This addition will extend the range immediately preceding backingMap.put(startOfPrecedingRange, e); return true; } } else if ( this.backingMap.containsKey(nextEntryKey) ) { // This addition will extend the range immediately following T endOfRange = backingMap.get(nextEntryKey); backingMap.remove(nextEntryKey); // Extend the prior entry to include the whole range backingMap.put(e, endOfRange); return true; } else { // This addition is a new range, it doesn't touch any others backingMap.put(e,e); return true; } } @Override public boolean remove(Object o) { throw new IllegalAccessError("Method not implemented. Left for the reader."); } @Override public boolean containsAll(Collection<?> c) { throw new IllegalAccessError("Method not implemented. Left for the reader."); } @Override public boolean addAll(Collection<? extends T> c) { throw new IllegalAccessError("Method not implemented. Left for the reader."); } @Override public boolean retainAll(Collection<?> c) { throw new IllegalAccessError("Method not implemented. Left for the reader."); } @Override public boolean removeAll(Collection<?> c) { throw new IllegalAccessError("Method not implemented. Left for the reader."); } @Override public void clear() { this.backingMap.clear(); } @Override public Comparator<? super T> comparator() { throw new IllegalAccessError("Method not implemented. Left for the reader."); } @Override public SortedSet<T> subSet(T fromElement, T toElement) { throw new IllegalAccessError("Method not implemented. Left for the reader."); } @Override public SortedSet<T> headSet(T toElement) { throw new IllegalAccessError("Method not implemented. Left for the reader."); } @Override public SortedSet<T> tailSet(T fromElement) { throw new IllegalAccessError("Method not implemented. Left for the reader."); } @Override public T first() { throw new IllegalAccessError("Method not implemented. Left for the reader."); } @Override public T last() { throw new IllegalAccessError("Method not implemented. Left for the reader."); } public void printGaps() { Long lastContiguousNumber = 0L; for ( Map.Entry<T, T> entry : backingMap.entrySet() ) { Long startOfNextRange = (Long) entry.getKey(); Long endOfNextRange = (Long) entry.getValue(); if ( startOfNextRange > lastContiguousNumber + 1 ) { System.out.println( String.valueOf(lastContiguousNumber+1) + ".." + String.valueOf(startOfNextRange - 1) ); } lastContiguousNumber = endOfNextRange; } System.out.println( String.valueOf(lastContiguousNumber+1) + "..infinity"); System.out.println("Backing map size is " + this.backingMap.size()); System.out.println(backingMap.toString()); } } public static void main(String[] args) { SE se = new SE(); RangeSet<Long> testRangeSet = se.new RangeSet<Long>(); // Start by putting 1,000,000 entries into the map with a few, pre-determined, hardcoded gaps for ( long i = 1; i <= 1000000; i++ ) { // Our pre-defined gaps... if ( i == 58349 || ( i >= 87333 && i <= 87777 ) || i == 303998 ) { // Do not put these numbers in the set } else { testRangeSet.add(i); } } testRangeSet.printGaps(); } } 

And the result:

 58349..58349 87333..87777 303998..303998 1000001..infinity Backing map size is 4 {1=58348, 58350=87332, 87778=303997, 303999=1000000} 
+3
source

Well, either you store everything in memory, and you run the risk of heap overflow, or you do not store it in memory, and you need to calculate a lot.

I would suggest something in between - to store the minimum necessary information needed during processing. You can save the endpoints of a known non-slot sequence in a class with two long fields. And all these types of sequence data can be stored in a sorted list. When you find a new number, go to the list to see if it is adjacent to one of the endpoints. If so, change the endpoint to a new integer and check if you can combine adjacent objects in the sequence (and therefore delete one of the objects). If not, create a new sequence object in a properly sorted place.

In the end, it will be O(n) in memory usage and O(n) in processor usage. But using any data structure that stores information about all numbers will be just n in memory usage and O(n*lookuptime) in the processor if the search time is not performed in constant time.

+1
source

I think this is the perfect moment to get to know bloom-filter . This is a wonderful probabilistic data structure that can be used to immediately prove that an element is not in the set.

How it works? The idea is quite simple, acceleration is becoming more complex, and an implementation can be found in Guava .

Idea

Initialize a filter, which will be an array of bits of length, which will allow you to keep the maximum value of the hash function . When adding an item to a set, calculate its hash. Determine which bit is 1 and make sure that they all switch to 1 in the filter (array). If you want to check if an element is in a set, just calculate its hash, and then check if all bits 1 in the hash, 1 in the filter. If any of these bits is 0 in the filter, the element is definitely not in the set. If all of them are set to 1 , the element may be in the filter, so you will need to iterate over all the elements. Boost

A simple probabilistic model answers the question of how large the filter should be (and the hash range) to provide an optimal chance for false positive , which is a situation where all bits are 1 , but this element is not in the set.

Implementation

The Guava implementation provides the following bloom-filter constructor: create(Funnel funnel, int expectedInsertions, double falsePositiveProbability) . You can configure the filter yourself, depending on expectedInsertions and falsePositiveProbability .

False positive

Some people are aware of bloom-filters due to a false positive opportunity. You can use the Bloom filter so that you don’t rely on the mightBeInFilter flag. If so, you should skip all the elements and check one by one if the element is in the set or not.

Possible Use In your case, I would create a filter for the set, and then after all the tickets have been added, just skip all the numbers (since you still need to loop) and check if they are filter#mightBe int set. If you set falsePositiveProbability to 3%, you will achieve complexity around O(n^2-0.03m*n) , where m means the number of spaces. Correct me if I am mistaken in the assessment of complexity.

0
source

Read as many ticket numbers as you can fit in available memory.

Sort them and write the sorted list to a temporary file. Depending on the expected number of spaces, this can save time and space in order to use the path length and ndash encoding scheme when writing sorted numbers.

After all the ticket numbers have been sorted into temporary files, you can combine them into one, sorted stream of ticket numbers, look for spaces.

If this leads to the fact that too many temporary files open immediately for merging, groups of files can be combined into intermediate files, etc., keeping the total number below the permissible limit. However, this additional copying can significantly slow down the process.

Old tape drive algorithms are still relevant.

0
source

Here's the idea: if you know your range of numbers in advance, then

pre-compute the sum of all the numbers you expect there. 2. Then continue to read your numbers and make up the sum of all numbers read, as well as the number of your numbers. 3. If the amount you come up with is the same as the previously calculated, then there are no spaces. 4. If the amount differs and the number of your numbers is short only by one of the expected number, then the pre-calculated amount - the actual amount will give you the missing number. 5. If the number of your numbers is shorter than more than one, then you will find out how many digits are missing and what is their sum.

The best part is that you do not need to store a collection of your numbers in memory.

0
source

All Articles