I'm not sure if I understand your requirements and situation.
You have about 2,500 files of roughly 3,000 words each (or 400?). Many of the words occur in more than one file.
Now someone asks you which words file-345 and file-765 have in common.
You can build a hashmap that maps each word to the list of files in which it occurs.
When the query for file-345 and file-765 comes in, you take the 3,000 (400?) words of file-345, look each one up in the hashmap, and check whether file-765 is listed for it.
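A minimal sketch of that lookup idea in Scala, just as an illustration: it assumes the files are already tokenised into word lists, and the names buildIndex and commonWords are my own, not anything from your code.

    // Inverted index: word -> set of file names that contain it.
    def buildIndex(files: Map[String, Seq[String]]): Map[String, Set[String]] =
      files.toSeq
        .flatMap { case (name, words) => words.map(w => (w, name)) }
        .groupBy(_._1)
        .map { case (word, pairs) => word -> pairs.map(_._2).toSet }

    // Words common to files a and b: walk the words of a and keep those
    // whose file set also contains b.
    def commonWords(index: Map[String, Set[String]],
                    files: Map[String, Seq[String]],
                    a: String, b: String): Set[String] =
      files(a).toSet.filter(w => index.getOrElse(w, Set.empty).contains(b))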
However, 2 × 3,000 words is not that much. If I create two lists of random strings in Scala (which runs on the JVM):

    val r = new scala.util.Random
    val g1 = (1 to 3000).map(x => "" + r.nextInt(10000))
    val g2 = (1 to 3000).map(x => "" + r.nextInt(10000))
and build the intersection
    g1.intersect(g2)
I get the result (678 elements) in almost no time on an 8-year-old laptop.
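If you want to check the timing on your own machine, a quick measurement (just a sketch) could look like this:

    val start = System.nanoTime()
    val common = g1.intersect(g2)
    println(s"${common.size} common elements in ${(System.nanoTime() - start) / 1e6} ms")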
So how many queries will you have to answer? How often does the file content change? If it changes rarely, then reading the two files may be the critical part.
How many unique words do you have? Perhaps it is not a problem to keep them all in memory.
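As a rough upper bound: 2,500 files × 3,000 words is about 7.5 million word occurrences, and the number of unique words will be far smaller because of the duplication. Even if each index entry (string, hashmap overhead, file list) cost a few hundred bytes, which is my assumption rather than a measurement, the whole index would be on the order of a few hundred megabytes at worst and very likely much less, so keeping it in memory looks feasible.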