Best way to store and access 120,000 words in Java

I am programming a Java application that reads plain text files (.txt). These files can contain more than 120,000 words.

The application should store all 120,000+ words. It should refer to them as word_1, word_2, etc. And it also needs to access these words in order to perform various methods on them.

All of the methods operate on strings. For example, one method will be called to report how many letters are in word_80. Another method will be called to report which specific letters appear in word_2200.

In addition, some methods will compare two words. For example, a method will be called to compare word_80 with word_2200 and return whichever has more letters. Another method will be called to compare word_80 with word_2200 and return the specific letters that differ between the two words.
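
For illustration, here is a rough sketch of the kind of methods I have in mind, assuming the words end up stored as plain String values (the class and method names are just placeholders):

import java.util.HashSet;
import java.util.Set;

// Rough sketch only: the class and method names are placeholders.
class WordOps {
    // How many letters are in a word?
    static int letterCount(String word) {
        return word.length();
    }

    // Which of two words has more letters?
    static String longerWord(String a, String b) {
        return (a.length() >= b.length()) ? a : b;
    }

    // Which letters appear in one word but not in the other (symmetric difference)?
    static Set<Character> differingLetters(String a, String b) {
        Set<Character> lettersA = new HashSet<Character>();
        Set<Character> lettersB = new HashSet<Character>();
        for (char c : a.toCharArray()) lettersA.add(c);
        for (char c : b.toCharArray()) lettersB.add(c);
        Set<Character> result = new HashSet<Character>(lettersA);
        result.addAll(lettersB);               // union of both letter sets...
        Set<Character> common = new HashSet<Character>(lettersA);
        common.retainAll(lettersB);            // ...minus the letters they share
        result.removeAll(common);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(longerWord("letter", "word"));        // letter
        System.out.println(differingLetters("letter", "word"));  // letters not shared by both words
    }
}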

My question is: since I work almost exclusively with Strings, is it better to store these words in one big ArrayList? In a few smaller ArrayLists? Or should I use one of the many other storage options, such as Vectors, HashSets, or LinkedLists?

My two main concerns are: 1) access speed, and 2) having the largest possible number of built-in methods at my disposal.

Thank you for your help!


Wow! Thanks to everyone for the quick answers to my question. All of your suggestions have really helped me. I am pondering and weighing all of the options presented in your answers.

Please forgive any fuzziness on my part, and let me answer your questions:

  • Q) English? A) The text files are actually books written in English. The appearance of a word in another language would be rare, but not impossible. I'd put the percentage of non-English words in the text files at .0001%.

  • Q) Homework?
    A) I looked at this question with a smile. Yes, it does read like a school assignment. But no, this is not homework.

  • Q) Duplicates?
    A) Yes. Probably one in every five or so words, considering conjunctions, articles, etc.

  • Q) Access? A) Both random and sequential. It is certainly possible that a method will fetch a word at random. It is equally possible that a method will want to search sequentially through the words from word_1 to word_120000. Which leads to the last question ...

  • Q) Iterate over the whole list?
    A) Yes.

In addition, I plan to extend this program to perform many other methods on the words. Again, I apologize for my fuzziness. (Details make a world of difference, don't they?)

Hooray!

+5
java storage
11 answers

I would save them in one big ArrayList and worry about (possibly unnecessary) optimizations later.

Being inherently lazy, I don't think it is a good idea to optimize unless there is a proven need. Otherwise, you are just spending effort that could be better spent elsewhere.

In fact, if you can set an upper limit on your word count and you don't need any of the fancy List operations, I would choose a plain array of String objects along with an integer holding the actual count. That will probably be faster than a class-based approach.

This gives you maximum speed in accessing individual elements, while maintaining the ability to perform all these wonderful string manipulations.

Note: I have not benchmarked native arrays against ArrayLists. They may well be as fast as native arrays, so you should check for yourself if you have less blind faith in my abilities than I do :-).

If they prove to be as fast (or even close), the added benefits (extensibility for one) may be enough to justify their use.
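
For illustration, a bare-bones sketch of that approach (the class name WordStore and the MAX_WORDS limit are made up, and I'm assuming whitespace-separated words):

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

// Illustrative sketch only: names like WordStore and MAX_WORDS are invented here.
class WordStore {
    private static final int MAX_WORDS = 200000; // assumed upper limit
    private final String[] words = new String[MAX_WORDS];
    private int count = 0;                       // actual number of words read

    void load(File textFile) throws FileNotFoundException {
        Scanner in = new Scanner(textFile);      // default: split on whitespace
        while (in.hasNext() && count < MAX_WORDS) {
            words[count++] = in.next();
        }
        in.close();
    }

    // word_1 in the question is words[0] here (0-based storage)
    String getWord(int n) {
        return words[n - 1];
    }

    int size() {
        return count;
    }
}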

+16

Just confirming pax's assumption, with a very naive benchmark:

import java.util.ArrayList;
import java.util.Date;
import java.util.Random;

public class AccessBenchmark {
    public static void main(String[] args) {
        int size = 120000;
        String[] arr = new String[size];
        ArrayList<String> al = new ArrayList<String>(size);
        // fill both structures with the same 120,000 strings
        for (int i = 0; i < size; i++) {
            String put = Integer.toHexString(i);
            // System.out.print(put + " ");
            al.add(put);
            arr[i] = put;
        }

        Random rand = new Random();

        // time 10,000,000 random reads from the plain array
        Date start = new Date();
        for (int i = 0; i < 10000000; i++) {
            int get = rand.nextInt(size);
            String fetch = arr[get];
        }
        Date end = new Date();
        long diff = end.getTime() - start.getTime();
        System.out.println("array access took " + diff + " ms");

        // time 10,000,000 random reads from the ArrayList
        start = new Date();
        for (int i = 0; i < 10000000; i++) {
            int get = rand.nextInt(size);
            String fetch = al.get(get);
        }
        end = new Date();
        diff = end.getTime() - start.getTime();
        System.out.println("array list access took " + diff + " ms");
    }
}

and the output:
array access took 578 ms
array list access took 907 ms

Running it several times, the actual timings vary, but array access is usually between 200 and 400 ms faster, over 10,000,000 iterations.

+3

If you access these strings sequentially, a LinkedList would be the best option.

For random access, an ArrayList has a good balance of memory usage and access speed.
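
A rough sketch of the distinction (the list names are just placeholders):

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class AccessPatterns {
    public static void main(String[] args) {
        List<String> sequentialWords = new LinkedList<String>(); // good for walking front to back
        List<String> randomWords = new ArrayList<String>();      // good for get(i) by position

        // iterating over the whole list is cheap on either structure
        for (String w : sequentialWords) {
            System.out.println(w.length());
        }

        // an indexed lookup like "word_80" is O(1) on an ArrayList but O(n) on a LinkedList
        if (randomWords.size() >= 80) {
            String word80 = randomWords.get(79);
            System.out.println(word80);
        }
    }
}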

+2

My take:

For a single-threaded program, an ArrayList is always faster and simpler.

For a multithreaded program, java.util.concurrent.ConcurrentHashMap<Integer, String> or java.util.concurrent.ConcurrentSkipListMap<Integer, String> is excellent. You may later want to allow multiple threads to make simultaneous requests against this huge thing.
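
A minimal sketch of the threaded variant, assuming the words are keyed by their position (the names and sample words are made up):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class ConcurrentWords {
    public static void main(String[] args) {
        final ConcurrentMap<Integer, String> words = new ConcurrentHashMap<Integer, String>();
        words.put(80, "example");
        words.put(2200, "placeholder");

        // several threads can safely query the map at the same time
        Runnable compareTask = new Runnable() {
            public void run() {
                String a = words.get(80);
                String b = words.get(2200);
                System.out.println(a.length() > b.length() ? a : b); // the longer word
            }
        };
        new Thread(compareTask).start();
        new Thread(compareTask).start();
    }
}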

+1

If you are going for fast lookup as well as compact size, use a DAWG (Directed Acyclic Word Graph). This data structure takes the idea of a trie and improves upon it by detecting and factoring out common suffixes as well as common prefixes.

http://en.wikipedia.org/wiki/Directed_acyclic_word_graph

+1

Use a hashtable? This will give you the best search speed.
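
For example (a toy sketch with made-up words), a HashMap keyed by word number gives constant-time lookups:

import java.util.HashMap;
import java.util.Map;

class HashLookup {
    public static void main(String[] args) {
        Map<Integer, String> wordsByNumber = new HashMap<Integer, String>();
        wordsByNumber.put(1, "the");              // word_1
        wordsByNumber.put(2, "quick");            // word_2
        System.out.println(wordsByNumber.get(2)); // O(1) average-case lookup
    }
}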

0

ArrayList/Vector if the order matters (it seems like it does, since you are calling the words "word_xxx"), or Hashtable/HashMap if it does not.

I will leave the exercise of figuring out why you would choose ArrayList over Vector, or Hashtable over HashMap, to you, because I have a sneaking suspicion that this is your homework. Check out the Javadocs.

You will not get any methods of the kind you asked for in the examples above out of the Collections Framework classes, since none of them perform string comparison operations. Unless you also want to order the words alphabetically or something, in which case you should use one of the Tree implementations in the Collections Framework.
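
For instance, a toy sketch of the alphabetical-ordering case (the sample words are made up; note that a Set silently drops duplicate words):

import java.util.Arrays;
import java.util.TreeSet;

class AlphabeticalWords {
    public static void main(String[] args) {
        TreeSet<String> sorted = new TreeSet<String>(
                Arrays.asList("pedantic", "apple", "zebra"));
        System.out.println(sorted);          // [apple, pedantic, zebra]
        System.out.println(sorted.first());  // alphabetically first word
    }
}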

0

What about a radix tree or Patricia trie?

http://en.wikipedia.org/wiki/Radix_tree

0

The only advantage of a linked list over an array or an ArrayList would be if there were inserts and deletes at arbitrary places. I do not think that is the case here: you read in the document and build the list in order.

I THINK that when the original poster talked about looking up "word_2200", he simply meant the 2200th word in the document, and not that there are any labels associated with each word. If so, then all he needs is indexed access to all of the words, hence an array or an ArrayList. If it really is something more complicated, if one word can be labeled "word_2200" and the next word labeled "foobar_42" or something like that, then yes, he will need a more sophisticated structure.

Hey, do you want to give us a hint as to WHY you want to do any of this? I'm hard pressed to remember the last time I said to myself, "Hey, I wonder if the 1237th word in this document I'm reading is longer or shorter than the 842nd word?"

0

Depends on the problem - speed or memory.

If it is memory, the minimal solution is to write a getWord(n) function that scans the whole file every time it is called and retrieves word n.

Now, that is not a great solution. A better one is to decide how much memory you are willing to use: say, 1000 entries. Scan the file for words once at application startup and save a series of bookmarks, each containing a word number and the position in the file where that word is located, spacing the bookmarks more or less evenly through the file.

Then open the file for random access. The getWord(n) function now searches the bookmarks for the largest word # <= n (please use a binary search), seeks to the recorded position, and scans the file from there, counting words until it reaches the requested one.
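
Roughly, the bookmark idea might look like this. This is a sketch only: it assumes plain single-byte (ASCII) text with whitespace-separated words, uses 0-based word numbers, and the class name WordFile is made up.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

class WordFile {
    private final RandomAccessFile file;
    private final List<long[]> bookmarks = new ArrayList<long[]>(); // {wordNumber, byteOffset}

    WordFile(String path, int bookmarkEvery) throws IOException {
        file = new RandomAccessFile(path, "r");
        // One startup scan: remember the byte offset of every bookmarkEvery-th word.
        long offset = 0;
        int wordNumber = 0;
        boolean inWord = false;
        int c;
        file.seek(0);
        while ((c = file.read()) != -1) {
            boolean isSpace = Character.isWhitespace((char) c);
            if (!isSpace && !inWord) {                       // a word starts at this byte
                if (wordNumber % bookmarkEvery == 0) {
                    bookmarks.add(new long[] { wordNumber, offset });
                }
                wordNumber++;
                inWord = true;
            } else if (isSpace) {
                inWord = false;
            }
            offset++;
        }
    }

    // getWord(n): binary-search the bookmarks for the largest word # <= n,
    // seek there, then count words forward until word n is reached.
    // Note: word numbers here are 0-based, so the question's word_1 is getWord(0).
    String getWord(int n) throws IOException {
        int lo = 0, hi = bookmarks.size() - 1, best = 0;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (bookmarks.get(mid)[0] <= n) { best = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        long[] mark = bookmarks.get(best);
        file.seek(mark[1]);
        int current = (int) mark[0];
        String word;
        while ((word = readNextWord()) != null) {
            if (current == n) return word;
            current++;
        }
        return null; // n is past the end of the file
    }

    private String readNextWord() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = file.read()) != -1 && Character.isWhitespace((char) c)) { /* skip whitespace */ }
        if (c == -1) return null;
        do { sb.append((char) c); } while ((c = file.read()) != -1 && !Character.isWhitespace((char) c));
        return sb.toString();
    }
}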

An even faster solution, which uses quite a lot of memory, is to build some kind of block cache, based on the fact that getWord() requests usually come in clusters. You can arrange things so that if someone asks for word #X and it is not bookmarked, you go find it and then bookmark it, conserving memory by discarding whichever other bookmark was least recently used.

And so on. Really, it depends on what the problem is: what retrieval patterns are likely.

-1

I don't understand why so many people are suggesting an ArrayList or the like, since you don't mention ever having to iterate over the whole list. Also, it seems you want to access the words as key/value pairs ("word_348" => "pedantic").

For quick access, I would use a TreeMap, which will perform a binary search to find your keys. Its only drawback is that it is not synchronized, but that is not a problem for your application.

http://java.sun.com/javase/6/docs/api/java/util/TreeMap.html
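
A toy sketch of that idea, keyed by word number (the sample words are made up):

import java.util.TreeMap;

class TreeMapWords {
    public static void main(String[] args) {
        TreeMap<Integer, String> words = new TreeMap<Integer, String>();
        words.put(348, "pedantic");
        words.put(80, "example");

        // each get() walks a red-black tree, O(log n) per lookup
        System.out.println(words.get(348));

        // as a bonus, the keys come back in sorted order
        System.out.println(words.firstKey() + " .. " + words.lastKey());
    }
}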

-2
