I need a solution to identify invalid chapter titles in a book.
We are developing an ingestion system for books that performs all sorts of checks, such as spell checking and offensive-language filtering. Now we would like to flag chapter headings that seem inaccurate given the body of the chapter. For example, if the title is “Spleen Function,” I would not expect the chapter to be about the liver.
I am familiar with fuzzy string matching algorithms, but this seems more like an NLP or classification problem. If I could match (or closely match) the phrase "spleen function" in the text, that would give high confidence. Otherwise, a high frequency of both “function” and “spleen” in the text also gives some confidence. And, of course, the closer they are to each other, the better.
This must be done in memory, on the fly, and in Java.
My current naive approach is to simply tokenize all the words, remove the noise words (e.g. prepositions), stem what is left, and then count the number of matches. At a minimum, I expect each word in the title to appear at least once in the text.
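To make the naive approach concrete, here is a rough sketch of it in plain Java (9+), in memory and with no external libraries. The class name `TitleCheck`, the tiny stop-word list, and the `stem` method are all placeholders I made up for illustration; in particular, `stem` only strips a trailing "s" and is a stand-in for a real stemmer such as Porter's.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class TitleCheck {

    // Placeholder stop-word list; a real one would be much larger.
    private static final Set<String> STOP_WORDS = Set.of(
            "a", "an", "and", "the", "of", "in", "on", "to", "for", "with", "by", "is");

    // Crude stand-in for a real stemmer: strips one trailing "s" from longer words.
    static String stem(String word) {
        boolean plural = word.length() > 3 && word.endsWith("s") && !word.endsWith("ss");
        return plural ? word.substring(0, word.length() - 1) : word;
    }

    // Tokenize on anything that is not a letter or digit, drop noise words, stem the rest.
    static List<String> terms(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+"))
                .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
                .map(TitleCheck::stem)
                .collect(Collectors.toList());
    }

    // For each stemmed title term, count how often it occurs in the chapter body.
    static Map<String, Integer> titleTermCounts(String title, String body) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : terms(title)) {
            counts.put(t, 0);
        }
        for (String w : terms(body)) {
            counts.computeIfPresent(w, (k, v) -> v + 1);
        }
        return counts;
    }

    // Minimum requirement: every title term appears at least once in the body.
    static boolean allTitleTermsPresent(String title, String body) {
        Map<String, Integer> counts = titleTermCounts(title, body);
        return !counts.isEmpty() && counts.values().stream().allMatch(c -> c > 0);
    }

    public static void main(String[] args) {
        String title = "Spleen Function";
        String body = "The spleen filters blood. Its functions include storing platelets.";
        System.out.println(titleTermCounts(title, body));
        System.out.println(allTitleTermsPresent(title, body));
    }
}
```

This only counts occurrences. It knows nothing about how close the title words are to each other or what order they appear in, which is exactly the gap I am trying to close.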
Is there another approach, ideally one that takes into account such things as proximity and order?