Synonym text lookup and text parsing

We have a client who is looking for a way to import and categorize large amounts of text data. It was suggested that the easiest way to classify each entry would be to look at its description field and try to match the words in it to see whether a category can be determined for that entry.

It was believed that the best way to do this was to match the words against the keywords held against each category and, if that was unsuccessful, to use some kind of synonym lookup to see whether a synonym matched instead. So, for example, if a synonym of “car” (say, “automobile”) appeared in a particular record, a synonym lookup could map it to the word “car”, which would be held against the category “car”.
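To make the idea concrete, here is a minimal sketch of the two-pass lookup described above: direct keyword match first, synonym fallback second. The category names, keyword sets, synonym table, and the `categorize` function are all made up for illustration; a real synonym table would come from a thesaurus or similar resource rather than a hand-written dictionary.

```python
import re

# Hypothetical keyword lists held against each category (illustrative only).
CATEGORY_KEYWORDS = {
    "car": {"car"},
    "boat": {"boat"},
}

# Hypothetical hand-written synonym table mapping a synonym to a keyword.
SYNONYMS = {
    "automobile": "car",
    "sedan": "car",
    "yacht": "boat",
}


def find_category(word):
    """Return the category whose keyword set contains word, or None."""
    for category, keywords in CATEGORY_KEYWORDS.items():
        if word in keywords:
            return category
    return None


def categorize(description):
    """Return a category for the description, or None if nothing matches."""
    words = re.findall(r"[a-z]+", description.lower())

    # Pass 1: direct keyword match.
    for word in words:
        category = find_category(word)
        if category is not None:
            return category

    # Pass 2: only if no keyword matched, try each word's synonym.
    for word in words:
        canonical = SYNONYMS.get(word)
        if canonical is not None:
            category = find_category(canonical)
            if category is not None:
                return category

    return None
```

With these tables, `categorize("Used automobile, low mileage")` falls through to the synonym pass and lands in the "car" category, while a description with no known word returns `None`.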

Does anyone know of a web service or other dictionary lookup tool for finding synonyms of a given word? The project manager offered to buy a Google Enterprise Search license for this, but from what I can make out, it does not offer what the client is looking for.

Any suggestions for getting the client what they are looking for would be greatly appreciated.


Thanks! I will look into WordNet.

Are you aware of any other text classification software products? I see some discussion of using Bayesian algorithms for this, but I do not see any real-world examples.

+4
3 answers

The first thing that comes to mind is WordNet. WordNet is a human-curated database of words and related words, including synonyms. The Wikipedia entry for WordNet lists several interfaces to it, and I believe some of these are web services. You can also roll your own: Manning and Schütze, chapter 5 (free PDF), shows ways to do this.

Having said that, are you solving the right problem? How do you build the list of categories? Is it a hierarchy? A tag cloud? See Clay Shirky's "Ontology is Overrated" for a critique of hierarchical categories. I believe synonyms matter less if you base your classification on sets of words (e.g. Naive Bayes) rather than on single words.
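As a rough illustration of classifying on sets of words, here is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing, written from scratch with the standard library only. The `NaiveBayes` class, its method names, and the four training sentences are invented for this example; a real project would train on many labelled records.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> training docs
        self.vocabulary = set()

    def train(self, text, category):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most probable category, or None if untrained."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, float("-inf")
        for category in self.doc_counts:
            # Log prior: share of training documents in this category.
            score = math.log(self.doc_counts[category] / total_docs)
            total_words = sum(self.word_counts[category].values())
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[category][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Made-up training data, for illustration only.
model = NaiveBayes()
model.train("used automobile with low mileage and new tyres", "car")
model.train("family car with diesel engine", "car")
model.train("sailing boat with new mast and sails", "boat")
model.train("motor yacht moored in the harbour", "boat")
```

Because every word in a description contributes evidence, "diesel automobile with new tyres" is classified as "car" even though the literal word "car" never appears in it, which is why an explicit synonym table matters less here.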

+6

You should look at using WordNet. You can visit the project website, http://wordnet.princeton.edu/, for more information, and there are libraries available to integrate with it in many languages.

Try their online tool to see it in action: http://wordnetweb.princeton.edu/perl/webwn. If you look up a word and then click the "S" next to each definition, you will get a list of semantically related words for that definition.

I also think you should check out software that lets you perform "document clustering". Here is an example: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview. This should help you get started on creating the categories.
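To give a feel for what document clustering does, here is a toy single-pass sketch that groups descriptions by word overlap (Jaccard similarity). This is not CLUTO's algorithm, only a simple stand-in; the function names and the `threshold` value of 0.2 are assumptions chosen for illustration.

```python
import re


def word_set(text):
    """Return the set of lower-cased alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(documents, threshold=0.2):
    """Greedily assign each document to the most similar existing cluster.

    A document starts a new cluster when no existing cluster reaches the
    similarity threshold. Returns a list of lists of document indices.
    """
    clusters = []  # each entry: {"words": set of words, "members": [indices]}
    for index, text in enumerate(documents):
        words = word_set(text)
        best, best_sim = None, threshold
        for candidate in clusters:
            sim = jaccard(words, candidate["words"])
            if sim >= best_sim:
                best, best_sim = candidate, sim
        if best is None:
            clusters.append({"words": set(words), "members": [index]})
        else:
            best["words"] |= words
            best["members"].append(index)
    return [c["members"] for c in clusters]
```

Run over a few car and boat descriptions, this separates them into two groups without any predefined category list, and those groups can then be inspected and named by hand.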

I think this will take you a long way toward what you want!

+1

To classify the text, you can take a look at Apache Mahout.

0
