How to read words in Java

I am looking for an algorithm, a hint, or any source code that can solve the following problem.

I have a folder containing a lot of text files. I read them and stored all the text in a String. Now I want to determine, for each word, whether it appears in other files or not. (I know this is unclear, so let me give an example.)

For example, I have three documents:

  • Doc A => "brown fox jump"
  • Doc B => "dog not jump"
  • Doc C => "dog jumping fox"

Say my program reads the first document and takes its first word, "brown". It checks whether this word appears in any other document. It does not, so the answer is 0. It then checks the second word, "fox", and concludes that yes, it appears in Doc C, and so on. Next it reads Doc B and checks whether "dog" appears in another document. The answer is Doc C, and so on.

Any advice or pseudo code?

Hint: this is related to inverse document frequency (IDF). I know what IDF is.

+4
6 answers

As GregS said, use a HashMap. I am not posting code because I think this is homework and I want to give you the chance to write it yourself, but here is the outline:

  • Open a new document.
  • For each word, check whether it already exists in your HashMap. If it does not, create a new key for the word and store the document (file name) under it. If it does, just add the document's file name to the existing entry.

For example, if you have:

  • DocA: brown fox jump
  • DocB: fox jump dog

You open DocA and go through its contents. "brown" is not in your HashMap, so you add a new entry with the key "brown" and the value "DocA". The same goes for "fox" and "jump". Then you open DocB. "fox" is already in your HashMap, so you add DocB to its value (the value becomes "DocA DocB"). Using an ArrayList (in Java) for the value may help.

+6

Hint: use a HashMap that maps Strings to lists of files.
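One way to read that hint is as a map from each word to the documents that contain it. The following is only a minimal sketch under some assumptions: the class and method names (`InvertedIndexSketch`, `addDocument`) are made up for illustration, words are split on whitespace and lower-cased, a `Set` is used instead of a list so a document is recorded once per word, and the sample texts are the question's documents with "jump" used in Doc C.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, not taken from any answer in this thread.
class InvertedIndexSketch {

    // Adds every word of one document to the index (word -> names of documents containing it).
    static void addDocument(Map<String, Set<String>> index, String docName, String text) {
        for (String word : text.trim().toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // an empty document yields a single empty token
            }
            // Create the entry the first time the word is seen, then record this document.
            index.computeIfAbsent(word, k -> new LinkedHashSet<>()).add(docName);
        }
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = new HashMap<>();
        addDocument(index, "DocA", "brown fox jump");
        addDocument(index, "DocB", "dog not jump");
        addDocument(index, "DocC", "dog jump fox"); // assumed sample text

        System.out.println(index.get("brown")); // [DocA]
        System.out.println(index.get("fox"));   // [DocA, DocC]
        System.out.println(index.get("jump"));  // [DocA, DocB, DocC]
    }
}
```

A word whose set contains more than one name appears in several documents. Real text would probably also need punctuation stripped, which this sketch does not attempt.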

+5

Perhaps it would help to think about the problem in terms of "I have the set of all words across all documents" and "for each of those words, I record which documents it appears in". Given that representation of your data, it is easy to determine whether a given word appears in several documents. Others here have indicated how to do this.
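With the data in that shape, the question "does this word appear in any other document?" becomes a lookup. Here is a rough sketch, assuming the word-to-documents map has already been built somehow. The name `otherDocuments` and the hand-filled sample index are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class for illustration only.
class OtherDocumentsSketch {

    // Returns the documents, other than currentDoc, that contain the word.
    static Set<String> otherDocuments(Map<String, Set<String>> index, String word, String currentDoc) {
        // Copy the stored set so the index itself is not modified.
        Set<String> others = new TreeSet<>(index.getOrDefault(word, Collections.<String>emptySet()));
        others.remove(currentDoc);
        return others;
    }

    public static void main(String[] args) {
        // Sample index filled in by hand; in practice it would come from reading the files.
        Map<String, Set<String>> index = new HashMap<>();
        index.put("brown", new TreeSet<>(Arrays.asList("DocA")));
        index.put("fox", new TreeSet<>(Arrays.asList("DocA", "DocC")));
        index.put("dog", new TreeSet<>(Arrays.asList("DocB", "DocC")));

        System.out.println(otherDocuments(index, "brown", "DocA")); // []
        System.out.println(otherDocuments(index, "fox", "DocA"));   // [DocC]
        System.out.println(otherDocuments(index, "dog", "DocB"));   // [DocC]
    }
}
```

An empty result corresponds to the "answer will be 0" case in the question.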

+2

Here is another idea, different from the other valuable answers. I admit the hash approach looks better; I just wanted to look at the problem from a different angle.

I would sort the words in each document and then compare the documents with one another.

For example:

  • docA → brown, fox, jump
  • docB → doc, jump, not
  • docC → dog, fox, jump

Comparing them goes like this:

  until only a single document still has words
    get the first element of each document
    take the smallest of those first elements; if it exists in more than one document, reserve it
    throw away that smallest element (in my case)

So in the first comparison:

  • docA → fox, jump
  • docB → doc, jump, not
  • docC → dog, fox, jump

In the second comparison:

  • docA → fox, jump
  • docB → jump, not
  • docC → dog, fox, jump

In the third comparison:

  • docA → fox, jump
  • docB → jump, not
  • docC → fox, jump

Then fox is reserved in the 4th comparison and jump in the 5th.
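The sorting idea could be sketched roughly as follows. This is one possible reading of the pseudocode, not the answerer's own code: each document becomes a sorted queue of its distinct words, the smallest front element is removed on every pass, and it is reserved when more than one queue starts with it. The names `SortedCompareSketch` and `sharedWords` are invented, and the sample input assumes the question's documents with "dog" and "jump" spelled the same way everywhere.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper class for illustration only.
class SortedCompareSketch {

    // Returns the words that occur in more than one of the given documents.
    static List<String> sharedWords(List<String> documents) {
        // Sort each document's distinct words into its own queue.
        List<Deque<String>> queues = new ArrayList<>();
        for (String text : documents) {
            TreeSet<String> sorted = new TreeSet<>(Arrays.asList(text.trim().split("\\s+")));
            sorted.remove(""); // produced by an empty document
            queues.add(new ArrayDeque<>(sorted));
        }

        List<String> shared = new ArrayList<>();
        while (true) {
            // Find the smallest first element among the documents that still have words.
            String smallest = null;
            int nonEmpty = 0;
            for (Deque<String> queue : queues) {
                if (queue.isEmpty()) {
                    continue;
                }
                nonEmpty++;
                if (smallest == null || queue.peekFirst().compareTo(smallest) < 0) {
                    smallest = queue.peekFirst();
                }
            }
            if (nonEmpty < 2) {
                break; // a word cannot be shared once only one document is left
            }
            // Remove that element from every document starting with it, counting the holders.
            int holders = 0;
            for (Deque<String> queue : queues) {
                if (!queue.isEmpty() && queue.peekFirst().equals(smallest)) {
                    queue.pollFirst();
                    holders++;
                }
            }
            if (holders > 1) {
                shared.add(smallest); // "reserve" the word
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("brown fox jump", "dog not jump", "dog jump fox");
        System.out.println(sharedWords(docs)); // [dog, fox, jump]
    }
}
```

Sorting costs more than the HashMap approach, but the merge step only ever looks at the front of each document.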

+2

Use a HashMap that maps Strings to Integers. Integers are immutable, so there is a little fuss involved in "incrementing" one, but not too much. You could override the put() method.
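On Java 8 or later, `Map.merge` takes care of the "increment" without overriding anything. The sketch below is one possible use of that idea, counting in how many documents each word occurs (the document frequency behind IDF). The names `DocumentFrequencySketch` and `documentFrequency` and the sample texts are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class for illustration only.
class DocumentFrequencySketch {

    // Maps each word to the number of documents that contain it.
    static Map<String, Integer> documentFrequency(List<String> documents) {
        Map<String, Integer> frequency = new HashMap<>();
        for (String text : documents) {
            // Count each word at most once per document.
            Set<String> distinct = new HashSet<>(Arrays.asList(text.trim().toLowerCase().split("\\s+")));
            distinct.remove(""); // produced by an empty document
            for (String word : distinct) {
                // Stores 1 for a new word, otherwise replaces the old count with old + 1.
                frequency.merge(word, 1, Integer::sum);
            }
        }
        return frequency;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("brown fox jump", "dog not jump", "dog jump fox");
        Map<String, Integer> frequency = documentFrequency(docs);
        System.out.println(frequency.get("brown")); // 1
        System.out.println(frequency.get("dog"));   // 2
        System.out.println(frequency.get("jump"));  // 3
    }
}
```

A count of 1 means the word occurs in no other document.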

+1

This code returns a map with each distinct word as the key and the number of times that word occurs in the sentence as the value. Just build a String from a file or from the command line and pass it in.

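  import java.util.HashMap;
  import java.util.Map;

  // Counts how many times each word occurs in the given text.
  public Map<String, Integer> getWordsWithCount(String sentences) {
      Map<String, Integer> wordsWithCount = new HashMap<String, Integer>();
      // Split on runs of whitespace so repeated spaces or line breaks do not produce empty words.
      String[] words = sentences.trim().split("\\s+");
      for (String word : words) {
          if (wordsWithCount.containsKey(word)) {
              // Seen before: store the old count plus one.
              wordsWithCount.put(word, wordsWithCount.get(word) + 1);
          } else {
              // First occurrence of this word.
              wordsWithCount.put(word, 1);
          }
      }
      return wordsWithCount;
  }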
+1