The most efficient way to find the number of matches with a string against an array of words?

let's say i have a string

String test = "This is a test string and I have some stopwords in here"; 

and I want to see how many times the words in the array below match my string

psudocode

 array = a,and,the,them,they,I 

so the answer will be "3"

just wondering what is the most efficient way to do this in java?

+7
java
source share
5 answers

I would probably save the words in the input to the HashSet, and then iterate over the array and see if each one is in the array. contained in a set.

Here it is in the code ... entrance " Around the world in 80 days ."

 import java.io.File; import java.io.FileNotFoundException; import java.util.ArrayList; import java.util.Arrays; import java.util.HashSet; import java.util.List; import java.util.Scanner; import java.util.Set; public class Main { public static void main(final String[] argv) throws FileNotFoundException { final File file; final String[] wordsToFind; file = new File(argv[0]); wordsToFind = getWordsToFind(file); a(file, wordsToFind); b(file, wordsToFind); c(file, wordsToFind); d(file, wordsToFind); } // this just reads the file into the disk cache private static String[] getWordsToFind(final File file) throws FileNotFoundException { final Scanner scanner; final Set<String> words; scanner = new Scanner(file); words = new HashSet<String>(); while(scanner.hasNext()) { final String word; word = scanner.next(); words.add(word); } return (words.toArray(new String[words.size()])); } // bad way, read intpo a list and then iterate over the list until you find a match private static void a(final File file, final String[] wordsToFind) throws FileNotFoundException { final long start; final long end; final long total; final Scanner scanner; final List<String> words; int matches; scanner = new Scanner(file); words = new ArrayList<String>(); while(scanner.hasNext()) { final String word; word = scanner.next(); words.add(word); } start = System.nanoTime(); { matches = 0; for(final String wordToFind : wordsToFind) { for(final String word : words) { if(word.equals(wordToFind)) { matches++; break; } } } System.out.println(matches); } end = System.nanoTime(); total = end - start; System.out.println("a: " + total); } // slightly better way, read intpo a list and then iterate over the set (which reduces the number of things you progbably // have to read until you find a match), until you find a match private static void b(final File file, final String[] wordsToFind) throws FileNotFoundException { final long start; final long end; final long total; final Scanner scanner; final Set<String> words; int matches; scanner = new Scanner(file); words = new HashSet<String>(); while(scanner.hasNext()) { final String word; word = scanner.next(); words.add(word); } start = System.nanoTime(); { matches = 0; for(final String wordToFind : wordsToFind) { for(final String word : words) { if(word.equals(wordToFind)) { matches++; break; } } } System.out.println(matches); } end = System.nanoTime(); total = end - start; System.out.println("b: " + total); } // my way private static void c(final File file, final String[] wordsToFind) throws FileNotFoundException { final long start; final long end; final long total; final Scanner scanner; final Set<String> words; int matches; scanner = new Scanner(file); words = new HashSet<String>(); while(scanner.hasNext()) { final String word; word = scanner.next(); words.add(word); } start = System.nanoTime(); { matches = 0; for(final String wordToFind : wordsToFind) { if(words.contains(wordToFind)) { matches++; } } System.out.println(matches); } end = System.nanoTime(); total = end - start; System.out.println("c: " + total); } // Nikita Rybak way private static void d(final File file, final String[] wordsToFind) throws FileNotFoundException { final long start; final long end; final long total; final Scanner scanner; final Set<String> words; int matches; scanner = new Scanner(file); words = new HashSet<String>(); while(scanner.hasNext()) { final String word; word = scanner.next(); words.add(word); } start = System.nanoTime(); { words.retainAll(new HashSet<String>(Arrays.asList(wordsToFind))); matches = words.size(); System.out.println(matches); } end = System.nanoTime(); total = end - start; System.out.println("d: " + total); } } 

(after several runs, each run is almost the same):

 12596 a: 2440699000 12596 b: 2531635000 12596 c: 4507000 12596 d: 5597000 

If you change it by adding "XXX" to each of the words in getWordsToFind (so no words found):

 0 a: 7415291000 0 b: 4688973000 0 c: 2849000 0 d: 7981000 

And, for completeness, I tried just to find the word "I", and the results:

 1 a: 235000 1 b: 351000 1 c: 75000 1 d: 10725000 
+5
source share

Something like that? Not sure about the β€œmost effective,” but simple enough.

 Set<String> s1 = new HashSet<String>(Arrays.asList("This is a test string and I have some stopwords in here".split("\\s"))); Set<String> s2 = new HashSet<String>(Arrays.asList("a", "and", "the", "them", "they", "I")); s1.retainAll(s2); System.out.println(s1.size()); 

Simple intersection of two sets of words.

+5
source share

the most efficient thing is to sort both the "test" and the "array", and then iterate over both: n.log (n) + n

test β†’ ['a', 'and', 'have', 'here', in, is, ..., 'This'] array β†’ ['a', 'and', 'the', 'them', 'they', 'I']

array checks' a '' a '1' a '' and '1' and '' and '2' and '' have '2' 'here' 2 '' '' in '2' '' 'is' 2. ..

+3
source share

A slight variation on Nikita's answer (up 1 for Nikita). If you use the list for s1, you get the number of entries (in case the word appears several times in the sentence).

 List<String> s1 = new ArrayList<String>(Arrays.asList("This is a test string and I have some stopwords in here".split("\\s"))); Set<String> s2 = new HashSet<String>(Arrays.asList("a", "and", "the", "them", "they", "I")); s1.retainAll(s2); System.out.println(s1.size()); 
0
source share

save your lines in a hash table (HashMap of (String and Integer)), then an iterator over the text and increase the integer value for the corresponding word in the hash table. then a hash table iterator and sum all integer values.

0
source share

All Articles