Fastest way to check if List <String> contains a unique string

Basically, I have about 1,000,000 rows, for each request I have to check whether the row belongs to the list or not.

I'm worried about performance, and what's the best method? ArrayList ? Hash?

+61
java performance string contains list
Jul 22 2018-10-22T00:
source share
10 answers

It is best to use a HashSet and check if a string exists in the set using the contains() method. HashSets are created for quick access using the Object hashCode() and equals() methods. The Javadoc for HashSet states:

This class offers consistent performance over time for basic operations (add, delete, contain and size)

A HashSet stores objects in hash codes , that is, the value returned by the hashCode method determines which bucket the object stores. Thus, the number of equality checks that a HashSet must perform using the equals() method reduces only to other objects in the same hash bucket .

To use HashSets and HashMaps effectively, you must comply with the equals and hashCode contract specified in javadoc . In the case of java.lang.String these methods are already implemented for this.

+90
Jul 22 2018-10-22T00:
source share

In general, a HashSet will give you better performance, since it does not have to look at each element and compare how ArrayList does, but usually compares no more than a few elements where hash codes are equal.

However, for 1M strings, the performance of hashSet may still not be optimal. A lot of misses in the cache will slow down the set search. If all lines are equally likely, this is inevitable. However, if some lines are more often requested than others, then you can put the general lines in a small hash set and check this first before checking the larger set. A small hash must have a size corresponding to the size of the cache (for example, several hundred K). Hits to a smaller hash will be very fast, and hits to a larger hash will continue at a speed limited by the memory bandwidth.

+11
Jul 22 '10 at 10:01
source share

Before you move on, think about this: why are you worried about performance? How often is this check called?

Regarding possible solutions:

  • If the list is already sorted, you can use java.util.Collections.binarySearch , which offers the same performance characteristics as java.util.TreeSet .

  • Otherwise, you can use java.util.HashSet , which is an O (1) performance characteristic. Note that computing a hash code for a string that has not yet been calculated is an O (m) operation with m = string.length() . Also keep in mind that hashtables only work until you reach the specified load factor, that is, hashtables will use more memory than regular lists. The default load factor used by the HashSet is 0.75, which means that an array with 1.3e6 entries will be used inside the HashSet for 1e6 objects.

  • If a HashSet does not work for you (for example, because there are many hash collisions, because the memory is dense, or because there are many inserts) than consider using Trie . Searching in Trie has the worst complexity O (m), where m = string.length() . Trie also has some additional advantages that may be useful to you: for example, it can give you the most suitable for the search bar. But keep in mind that the best code is not code, so just roll your own Trie implementation if the benefits are due to the high costs.

  • Consider using a database if you need more complex queries, for example. match for a substring or regular expression.

+8
Jul 22 '10 at 10:41
source share

I would use Set , in most cases a HashSet is fine.

+5
Jul 22 2018-10-22T00:
source share

With so many lines, I immediately think of Trie . It works better with a more limited set of characters (e.g. letters) and / or at the beginning of multiple line overlays.

+2
Jul 22. '10 at 14:29
source share

Running the exercise here is my results.

 private static final int TEST_CYCLES = 4000; private static final long RAND_ELEMENT_COUNT = 1000000l; private static final int RAND_STR_LEN = 20; //Mean time /* Array list:18.55425 Array list not contains:17.113 Hash set:5.0E-4 Hash set not contains:7.5E-4 */ 

I think the numbers speak for themselves. Hash set lookup time is wayyyy way faster.

+2
Apr 11 '15 at 2:44
source share

If you have such a large number of rows, the best option for you is to use a database. Find MySQL.

+1
Jul 22 2018-10-22T00:
source share

This may not be necessary for your case, but I think it’s useful to know that there are some probability-ineffective algorithms. For example, a Bloom filter .

+1
Mar 25 '16 at 20:39
source share

Not only for String, you can use Set for any occasion when you need unique elements.

If the element type is primitive or wrapper, you might not care. But if it is a class, you must override two methods:

  • hash code ()
  • is equal to ()
0
Jul 22 '10 at 13:45
source share

Sometimes you want to check if an object is in a list / set and at the same time you want the list / set to be ordered. If you also want to easily retrieve objects without using an enumeration or an iterator, you can consider using both ArrayList<String> and HashMap<String, Integer> . The list is supported by the map.

An example from some work I recently did:

 public class NodeKey<K> implements Serializable, Cloneable{ private static final long serialVersionUID = -634779076519943311L; private NodeKey<K> parent; private List<K> children = new ArrayList<K>(); private Map<K, Integer> childrenToListMap = new HashMap<K, Integer>(); public NodeKey() {} public NodeKey(Collection<? extends K> c){ List<K> childHierarchy = new ArrayList<K>(c); K childLevel0 = childHierarchy.remove(0); if(!childrenToListMap.containsKey(childLevel0)){ children.add(childLevel0); childrenToListMap.put(childLevel0, children.size()-1); } ... 

In this case, the parameter K will be String for you. The map ( childrenToMapList ) stores the Strings inserted in the list ( children ) as a key, and map values ​​are the position of the index in the list.

The reason for the list and map is that you can get indexed list values ​​without having to iterate over the HashSet<String> .

0
May 7 '14 at 15:47
source share



All Articles