I am trying to develop a method that can classify a given set of English words into 2 sets - "rare" and "common" - the reference being how much they are used in the language.
The number of words I would like to classify is bounded - currently about 10,000 - and includes everything from articles to proper nouns that may be borrowed from other languages (and would therefore be classified as "rare"). I have done a frequency analysis within my own corpus, and I have a distribution for these words (ranging from 1 occurrence up to peaks of around 100).
My intuition for such a system was to use word lists (for example, the BNC word frequency corpus, WordNet, the internal corpus frequency) and assign a weight to a word's appearance in each of them.
For example, a word that has a mid-level frequency in my corpus (say 50), but also appears in a word list W, could be considered common, since it is one of the most frequent words in the entire language. My question is: what's the best way to create a weighted score for something like this? Should I go discrete or continuous? In either case, what kind of classification system would work best for this?
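To make this more concrete, here is a rough sketch of the kind of weighted score I have in mind (the source names, weights, and threshold below are placeholders I made up, not a settled design):

```python
# Rough sketch of the weighted-score idea; the weights, source names and
# threshold are placeholders for illustration only.

def commonness_score(word, bnc_words, wordnet_words, corpus_freq,
                     w_bnc=0.5, w_wordnet=0.2, w_corpus=0.3):
    """Combine evidence from several sources into a single score in [0, 1]."""
    score = 0.0
    if word in bnc_words:          # word is among the top-N BNC frequency list
        score += w_bnc
    if word in wordnet_words:      # word has an entry in WordNet
        score += w_wordnet
    # Normalise the internal-corpus count against the observed peak (~100).
    score += w_corpus * min(corpus_freq.get(word, 0), 100) / 100.0
    return score

def classify(word, bnc_words, wordnet_words, corpus_freq, threshold=0.5):
    score = commonness_score(word, bnc_words, wordnet_words, corpus_freq)
    return "common" if score >= threshold else "rare"

# Example usage (tiny fake data, for illustration only):
bnc_words = {"the", "house", "run"}
wordnet_words = {"house", "run", "zygote"}
corpus_freq = {"the": 100, "house": 50, "zygote": 1, "quixotic": 2}

for w in ["the", "house", "zygote", "quixotic"]:
    print(w, classify(w, bnc_words, wordnet_words, corpus_freq))
```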
Or do you recommend an alternative method?
Thanks!
EDIT:
To answer Vinko's question about the intended use of the classification -
These words are tokenized from a phrase (for example, a book title) - and the goal is to figure out a strategy for generating a search query string for the phrase, to be run against a text corpus. The query string can support several parameters, such as proximity, etc. - so if a word is common, those parameters can be tweaked.
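As a toy example of what I mean by tweaking these parameters (the parameter names and the classify_word callback are placeholders, not anything from a real search API):

```python
# Toy sketch of how a word's commonness could drive query parameters; the
# parameter names (proximity_slop, boost, required) are invented for
# illustration and not tied to any particular search engine.

def query_params(phrase, classify_word):
    """Assign per-term search parameters based on the common/rare label."""
    params = {}
    for word in phrase.lower().split():
        if classify_word(word) == "common":
            # Common words carry little information: match loosely, weight low.
            params[word] = {"proximity_slop": 10, "boost": 0.2, "required": False}
        else:
            # Rare words are the discriminative ones: match tightly, weight high.
            params[word] = {"proximity_slop": 1, "boost": 2.0, "required": True}
    return params

# e.g. query_params("the selfish gene", my_classifier) might yield a loose
# entry for "the" and tight entries for "selfish" and "gene", assuming
# my_classifier labels "the" as common and the others as rare.
```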
To answer Igor's questions -
(1) How big is your corpus? Currently the list is limited to 10k tokens, but this is just a training set. It could go up to a few hundred thousand once I start running it against a test set.
(2) Do you have an expected proportion of common/rare words in the corpus? Hmm, I do not.