How to get the most frequent word in a map and its corresponding frequency using Java 8 streams?

Question

I have an IndexEntry class that looks like this:

    public class IndexEntry implements Comparable<IndexEntry> {
        private String word;
        private int frequency;
        private int documentId;
        ...
        // Simple getters for all properties
        public int getFrequency() { return frequency; }
        ...
    }

I store objects of this class in a Guava SortedSetMultimap (which allows multiple values per key), where I map a word String to some IndexEntry objects. Behind the scenes, it maps each word to a SortedSet<IndexEntry>.

I am trying to implement a kind of index of the words in a set of documents, together with how often each word appears in each document.

I know how to get the count of the most common word, but I can't seem to get the word itself.

Here is what I have for getting the count of the most common word, where entries is the SortedSetMultimap, along with the helper methods:

    public int mostFrequentWordFrequency() {
        return entries
                .keySet()
                .stream()
                .map(this::totalFrequencyOfWord)
                .max(Comparator.naturalOrder()).orElse(0);
    }

    public int totalFrequencyOfWord(String word) {
        return getEntriesOfWord(word)
                .stream()
                .mapToInt(IndexEntry::getFrequency)
                .sum();
    }

    public SortedSet<IndexEntry> getEntriesOfWord(String word) {
        return entries.get(word);
    }

I am trying to learn the Java 8 features because they seem really useful. However, I cannot get the stream to work the way I want. I want to have both the word and its frequency at the end of the stream; failing that, if I have the word, I can easily get the total number of occurrences of that word.

Currently, I keep ending up with a Stream<SortedSet<IndexEntry>>, which I cannot do anything with. I don’t know how to get the most frequent word without the frequencies, but once I have a frequency, I can’t keep track of the corresponding word. I tried creating a WordFrequencyPair POJO class to hold both, but then I just had a Stream<SortedSet<WordFrequencyPair>>, and I could not figure out how to map that into something useful.

What am I missing?

+8

java multimap java-8 java-stream

Cache staheli May 13, '17 at 6:12

2 answers

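A JDK-only solution:

    // Assumes static imports of Collectors.groupingBy and Collectors.summingInt.
    // The stream runs over the IndexEntry values, not over the String keys.
    entries.values().stream()
        .collect(groupingBy(IndexEntry::getWord, summingInt(IndexEntry::getFrequency)))
        .values().stream()
        .max(Comparator.naturalOrder())
        .orElse(0);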

Or with StreamEx:

    // Same idea, grouping the IndexEntry values by word and summing their frequencies.
    StreamEx.of(entries.values())
        .groupingBy(IndexEntry::getWord, summingInt(IndexEntry::getFrequency))
        .values().stream()
        .max(Comparator.naturalOrder())
        .orElse(0);
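
The snippets above yield only the highest total, not the word it belongs to. One way to keep both is to take the maximum over the grouped map's entries instead of its values. The following is a minimal sketch under stated assumptions: the class `MostFrequentWordSketch` and its nested `IndexEntry` are hypothetical stand-ins for the question's types, and a plain collection stands in for `entries.values()`.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

class MostFrequentWordSketch {

    // Hypothetical stand-in for the question's IndexEntry (only what is needed here).
    static final class IndexEntry {
        private final String word;
        private final int frequency;
        private final int documentId;

        IndexEntry(String word, int frequency, int documentId) {
            this.word = word;
            this.frequency = frequency;
            this.documentId = documentId;
        }

        String getWord() { return word; }
        int getFrequency() { return frequency; }
        int getDocumentId() { return documentId; }
    }

    // Sums the frequencies per word, then picks the map entry with the largest total.
    // The result is empty when there are no index entries at all.
    static Optional<Map.Entry<String, Integer>> mostFrequent(Collection<IndexEntry> values) {
        return values.stream()
                .collect(Collectors.groupingBy(IndexEntry::getWord,
                        Collectors.summingInt(IndexEntry::getFrequency)))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue());
    }

    public static void main(String[] args) {
        List<IndexEntry> values = Arrays.asList(
                new IndexEntry("java", 3, 1),
                new IndexEntry("stream", 5, 1),
                new IndexEntry("java", 4, 2),
                new IndexEntry("guava", 1, 2));

        // prints "java -> 7"
        mostFrequent(values).ifPresent(e ->
                System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```

If several words share the highest total, this sketch returns just one of them, and which one is not specified.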

0

user_3380739 May 18, '17 at 0:39

Jacob G. · Accepted Answer · 2017-05-13T06:41:54+0000

I think it would be better to use documentId as the TreeMultimap key rather than word:

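    import com.google.common.collect.*;

    public class Main {
        // static so that the static methods below can use it.
        // Keys use natural ordering: Ordering.arbitrary() compares by identity,
        // which is unreliable for boxed Integer keys.
        static TreeMultimap<Integer, IndexEntry> entries =
                TreeMultimap.<Integer, IndexEntry>create(Ordering.natural(), Ordering.natural().reverse());

        public static void main(String[] args) {
            // Add elements to `entries`

            // Get the most frequent word in document #1
            String mostFrequentWord = entries.get(1).first().getWord();
        }
    }

    class IndexEntry implements Comparable<IndexEntry> {
        private String word;
        private int frequency;
        private int documentId;

        public String getWord() { return word; }
        public int getFrequency() { return frequency; }
        public int getDocumentId() { return documentId; }

        @Override
        public int compareTo(IndexEntry i) {
            // Order by frequency; break ties by word so that two different words with
            // the same frequency in one document are not dropped as duplicates.
            int byFrequency = Integer.compare(frequency, i.frequency);
            return byFrequency != 0 ? byFrequency : word.compareTo(i.word);
        }
    }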

Then you can implement the methods that you had before:

    // Imports needed by the methods below (which belong inside Main):
    import java.util.List;
    import java.util.stream.Collectors;
    import javafx.util.Pair;

    // Sums the frequency of the given word across all documents.
    public static int totalFrequencyOfWord(String word) {
        return entries.values()
                .stream()
                .filter(i -> word.equals(i.getWord()))
                .mapToInt(IndexEntry::getFrequency)
                .sum();
    }

    /**
     * This method iterates through the values of the {@link TreeMultimap},
     * searching for {@link IndexEntry} objects which have their {@code word}
     * field equal to the parameter, word.
     *
     * @param word
     *     The word to search for in every document.
     * @return
     *     A {@code List<Pair<Integer, Integer>>} where each {@link Pair}
     *     will hold the document ID as its first element and the frequency
     *     of the word in the document as its second element.
     *
     * Note that the {@link Pair} object is defined in javafx.util.Pair
     */
    public static List<Pair<Integer, Integer>> totalWordUses(String word) {
        return entries.values()
                .stream()
                .filter(i -> word.equals(i.getWord()))
                .map(i -> new Pair<>(i.getDocumentId(), i.getFrequency()))
                .collect(Collectors.toList());
    }
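
If the multimap stays keyed by word, as in the question, the word and its total can travel through the stream together as a Map.Entry, so neither is lost along the way. The sketch below is only an illustration under assumptions: a plain `Map<String, List<Integer>>` stands in for the view that Guava's `asMap()` returns (with each IndexEntry reduced to its frequency), and the names `WordKeyedSketch` and `mostFrequent` are hypothetical.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class WordKeyedSketch {

    // For each word, pairs the word with the sum of its per-document frequencies,
    // then picks the pair with the largest sum. Empty when the map is empty.
    static Optional<Map.Entry<String, Integer>> mostFrequent(Map<String, List<Integer>> frequenciesByWord) {
        return frequenciesByWord.entrySet().stream()
                .<Map.Entry<String, Integer>>map(e -> new SimpleImmutableEntry<>(
                        e.getKey(),
                        e.getValue().stream().mapToInt(Integer::intValue).sum()))
                .max(Map.Entry.comparingByValue());
    }

    public static void main(String[] args) {
        // Assumed sample data: word -> frequency of that word in each document.
        Map<String, List<Integer>> frequenciesByWord = new HashMap<>();
        frequenciesByWord.put("java", Arrays.asList(3, 4));
        frequenciesByWord.put("stream", Arrays.asList(5));
        frequenciesByWord.put("guava", Arrays.asList(1));

        // prints "java -> 7"
        mostFrequent(frequenciesByWord).ifPresent(e ->
                System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```

With the real types, the inner sum would use `mapToInt(IndexEntry::getFrequency)`. If only the word is needed, `entries.keySet().stream().max(Comparator.comparingInt(this::totalFrequencyOfWord))` is a shorter alternative, at the cost of recomputing the totals during comparison.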
