Faster data structure for string search

I have this code that determines if a word (ignoring the case) is included in the wordList text file. However, a wordList text file can have 65,000 lines of ++, and just finding a word using my implementation below takes almost a minute. Could you think of any better implementation?

Thanks!

import java.io.*; import java.util.*; public class WordSearch { LinkedList<String> lxx; FileReader fxx; BufferedReader bxx; public WordSearch(String wordlist) throws IOException { fxx = new FileReader(wordlist); bxx = new BufferedReader(fxx); lxx = new LinkedList<String>(); String word; while ( (word = bxx.readLine()) != null) { lxx.add(word); } bxx.close(); } public boolean inTheList (String theWord) { for(int i =0 ; i < lxx.size(); i++) { if (theWord.compareToIgnoreCase(lxx.get(i)) == 0) { return true; } } return false; } } 
+8
java data-structures
source share
7 answers

Use the HashSet in which you place the line version of each word. Verifying that the HashSet contains the specified string is, on average, a constant-time operation (read: extremely fast).

+12
source share

As you search, you may want to sort the list before searching; then you can perform a binary search, which is much faster than a linear search. This can help if you fulfill several requests in one list, otherwise the penalty that you pay for sorting the list should not be searched only once.

In addition, when performing a linear search on a linked list using "lxx.get (i)", a problem occurs. LinkedList.get () - O (n). You can use Iterator (easy way: for (String s: lxx)) or switch to a list type that has O (1) access time, like ArrayList.

+2
source share

Each search is done through l in the O (n) operation, so it will be quite expensive if you have thousands of words. Use a HashSet instead:

 Set<String> lxx; ... lxx = new HashSet<String>(); while ( (word = bxx.readLine()) != null) { lxx.add(word.toLowerCase()); } bxx.close(); 

and then use lxx.contains(theWord.toLowerCase()) to check if that word is in the file. Each search in a HashSet is an O (1) operation, so the time it takes is (almost) regardless of the size of your file.

0
source share

First of all, do not declare your variable as a LinkedList, declare its List (parts of the code that are not related to the remote list:

 public class WordSearch { List<String> lxx; public WordSearch(String wordlist) throws IOException { lxx = new LinkedList<String>(); } } 

Next, do not call get on the list, using LinkedList get will be VERY slow. Use an iterator instead ... better still use the new stype for the loop that the iterator uses for you:

  public boolean inTheList (String theWord) { for(String word : lxx) { if (theWord.compareToIgnoreCase(word) == 0) { return true; } } return false; } 

Then change the new LinkedList to the new ArrayList:

lxx = new ArrayList ();

This code should be faster, but you can still do better.

Since you don't need duplicate words, use Set instead of List and use HashSet instead of ArrayList.

This will greatly speed up the program.

Your source code, using LinkedList with get, should start at the top of the list every time you search for the next word in the list. Using Iterator (through a new style for each loop) stops this.

Using LinkedList means that every time you need to go to the next word in the list, there is a related search, ArrayList does not have this overhead.

The use of the HashSet is completed (possibly) using a tree structure that has a very fast search.

0
source share

Here, my implementation runs for 50 ms.

First you need to download the file and save it in memory.

You can download it however you want, but if you load it in large pieces, it will be easier for you.

My input was a byte in a python book (a variant with one HTML file was uploaded) and Java Language Specification (load html and create one file from all html pages)

To create a list in a large file, I used the same program (see comment).

As soon as I have a large file with a volume of about 300 thousand words, I started the program with this output:

 C:\Users\oreyes\langs\java\search>dir singlelineInput.txt El volumen de la unidad C no tiene etiqueta. El nรบmero de serie del volumen es: 22A8-203B Directorio de C:\Users\oreyes\langs\java\search 04/03/2011 09:37 pm 3,898,345 singlelineInput.txt 1 archivos 3,898,345 bytes C:\Users\oreyes\langs\java\search>javac WordSearch.java C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "great" Loaded 377381 words in 2844 ms true in 31 ms C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "great" Loaded 377381 words in 2812 ms true in 31 ms C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "awesome" Loaded 377381 words in 2813 ms false in 47 ms C:\Users\oreyes\langs\java\search>gvim singlelineInput.txt C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "during" Loaded 377381 words in 2813 ms true in 15 ms C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "specification" Loaded 377381 words in 2875 ms true in 47 ms C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "<href" Loaded 377381 words in 2844 ms false in 47 ms C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "<br>" Loaded 377381 words in 2829 ms true in 15 ms 

Always less than 50 ms.

Here is the code:

  import java.io.*; import java.util.*; class WordSearch { String inputFile; List<String> words; public WordSearch(String file ) { inputFile = file; } public void initialize() throws IOException { long start = System.currentTimeMillis(); File file = new File( inputFile ); ByteArrayOutputStream baos = new ByteArrayOutputStream(( int ) file.length()); FileInputStream in = new FileInputStream( file ); copyLarge( in, baos, (int)file.length() ); Scanner scanner = new Scanner( new ByteArrayInputStream( baos.toByteArray() )); words = new LinkedList<String>(); while( scanner.hasNextLine() ) { String l = scanner.nextLine().trim(); //for( String s : l.split("\\s+")){ //System.out.println( s ); words.add( l.toLowerCase() ); //} } Collections.sort( words ); for( String s : words ) { //System.out.println( s ); } System.out.println("Loaded " + words.size() + " words in "+ ( System.currentTimeMillis() - start ) + " ms" ); } public boolean contains( String aWord ) { return words.contains( aWord.toLowerCase() ); } // taken from: http://stackoverflow.com/questions/326390/how-to-create-a-java-string-from-the-contents-of-a-file/326413#326413 public static long copyLarge(InputStream input, OutputStream output, int size ) throws IOException { byte[] buffer = new byte[size];// something biggie long count = 0; int n = 0; while (-1 != (n = input.read(buffer))) { output.write(buffer, 0, n); count += n; } return count; } public static void main( String ... args ) throws IOException { WordSearch ws = new WordSearch( args[0] ); ws.initialize(); long start = System.currentTimeMillis(); System.out.println( ws.contains( args[1] ) ); System.out.println("in "+ ( System.currentTimeMillis() - start ) +" ms "); } } 

The tough part was getting a sample input: P

0
source share

Guess that using the HashMap returns in no time:

The version has been changed here, and it always ends at 0 ms.

  import java.io.*; import java.util.*; class WordSearch { String inputFile; //List<String> words; Set<String> words; public WordSearch(String file ) { inputFile = file; } public void initialize() throws IOException { long start = System.currentTimeMillis(); File file = new File( inputFile ); ByteArrayOutputStream baos = new ByteArrayOutputStream(( int ) file.length()); FileInputStream in = new FileInputStream( file ); copyLarge( in, baos, (int)file.length() ); Scanner scanner = new Scanner( new ByteArrayInputStream( baos.toByteArray() )); words = new HashSet<String>(); while( scanner.hasNextLine() ) { String l = scanner.nextLine().trim(); //for( String s : l.split("\\s+")){ //System.out.println( s ); words.add( l.toLowerCase() ); //} } //Collections.sort( words ); for( String s : words ) { System.out.println( s ); } System.out.println("Loaded " + words.size() + " words in "+ ( System.currentTimeMillis() - start ) + " ms" ); } public boolean contains( String aWord ) { return words.contains( aWord.toLowerCase() ); } public static long copyLarge(InputStream input, OutputStream output, int size ) throws IOException { byte[] buffer = new byte[size];// something biggie long count = 0; int n = 0; while (-1 != (n = input.read(buffer))) { output.write(buffer, 0, n); count += n; } return count; } public static void main( String ... args ) throws IOException { WordSearch ws = new WordSearch( args[0] ); ws.initialize(); long start = System.currentTimeMillis(); System.out.println( ws.contains( args[1] ) ); System.out.println("in "+ ( System.currentTimeMillis() - start ) +" ms "); } } 

Now I know for sure :)

0
source share

two suggestions: Both data structures give you better performance.

  • Directional Acyclic Word Graph (DAWG)
  • Dictionary data structure. n-tree
0
source share

All Articles