Fastest way to check if List <String> contains a unique string
Basically, I have about 1,000,000 rows, for each request I have to check whether the row belongs to the list or not.
I'm worried about performance, and what's the best method? ArrayList ? Hash?
It is best to use a HashSet and check if a string exists in the set using the contains() method. HashSets are created for quick access using the Object hashCode() and equals() methods. The Javadoc for HashSet states:
This class offers consistent performance over time for basic operations (add, delete, contain and size)
A HashSet stores objects in hash codes , that is, the value returned by the hashCode method determines which bucket the object stores. Thus, the number of equality checks that a HashSet must perform using the equals() method reduces only to other objects in the same hash bucket .
To use HashSets and HashMaps effectively, you must comply with the equals and hashCode contract specified in javadoc . In the case of java.lang.String these methods are already implemented for this.
In general, a HashSet will give you better performance, since it does not have to look at each element and compare how ArrayList does, but usually compares no more than a few elements where hash codes are equal.
However, for 1M strings, the performance of hashSet may still not be optimal. A lot of misses in the cache will slow down the set search. If all lines are equally likely, this is inevitable. However, if some lines are more often requested than others, then you can put the general lines in a small hash set and check this first before checking the larger set. A small hash must have a size corresponding to the size of the cache (for example, several hundred K). Hits to a smaller hash will be very fast, and hits to a larger hash will continue at a speed limited by the memory bandwidth.
Before you move on, think about this: why are you worried about performance? How often is this check called?
Regarding possible solutions:
If the list is already sorted, you can use
java.util.Collections.binarySearch, which offers the same performance characteristics asjava.util.TreeSet.Otherwise, you can use
java.util.HashSet, which is an O (1) performance characteristic. Note that computing a hash code for a string that has not yet been calculated is an O (m) operation with m =string.length(). Also keep in mind that hashtables only work until you reach the specified load factor, that is, hashtables will use more memory than regular lists. The default load factor used by the HashSet is 0.75, which means that an array with 1.3e6 entries will be used inside the HashSet for 1e6 objects.If a HashSet does not work for you (for example, because there are many hash collisions, because the memory is dense, or because there are many inserts) than consider using Trie . Searching in Trie has the worst complexity O (m), where m =
string.length(). Trie also has some additional advantages that may be useful to you: for example, it can give you the most suitable for the search bar. But keep in mind that the best code is not code, so just roll your own Trie implementation if the benefits are due to the high costs.Consider using a database if you need more complex queries, for example. match for a substring or regular expression.
I would use Set , in most cases a HashSet is fine.
With so many lines, I immediately think of Trie . It works better with a more limited set of characters (e.g. letters) and / or at the beginning of multiple line overlays.
Running the exercise here is my results.
private static final int TEST_CYCLES = 4000; private static final long RAND_ELEMENT_COUNT = 1000000l; private static final int RAND_STR_LEN = 20; //Mean time /* Array list:18.55425 Array list not contains:17.113 Hash set:5.0E-4 Hash set not contains:7.5E-4 */ I think the numbers speak for themselves. Hash set lookup time is wayyyy way faster.
If you have such a large number of rows, the best option for you is to use a database. Find MySQL.
This may not be necessary for your case, but I think itβs useful to know that there are some probability-ineffective algorithms. For example, a Bloom filter .
Not only for String, you can use Set for any occasion when you need unique elements.
If the element type is primitive or wrapper, you might not care. But if it is a class, you must override two methods:
- hash code ()
- is equal to ()
Sometimes you want to check if an object is in a list / set and at the same time you want the list / set to be ordered. If you also want to easily retrieve objects without using an enumeration or an iterator, you can consider using both ArrayList<String> and HashMap<String, Integer> . The list is supported by the map.
An example from some work I recently did:
public class NodeKey<K> implements Serializable, Cloneable{ private static final long serialVersionUID = -634779076519943311L; private NodeKey<K> parent; private List<K> children = new ArrayList<K>(); private Map<K, Integer> childrenToListMap = new HashMap<K, Integer>(); public NodeKey() {} public NodeKey(Collection<? extends K> c){ List<K> childHierarchy = new ArrayList<K>(c); K childLevel0 = childHierarchy.remove(0); if(!childrenToListMap.containsKey(childLevel0)){ children.add(childLevel0); childrenToListMap.put(childLevel0, children.size()-1); } ... In this case, the parameter K will be String for you. The map ( childrenToMapList ) stores the Strings inserted in the list ( children ) as a key, and map values ββare the position of the index in the list.
The reason for the list and map is that you can get indexed list values ββwithout having to iterate over the HashSet<String> .