Classifying English words as rare or common

I am trying to develop a method that classifies a given set of English words into two groups, "rare" and "common", according to how frequently they are used in the language.

The number of words I would like to classify is limited - currently about 10,000 - and includes everything from articles to proper nouns that may be borrowed from other languages (and should therefore be classified as "rare"). I have done a frequency analysis within my corpus, and I have a distribution for these words (ranging from 1 occurrence up to peaks around 100).

My intuition for such a system was to use word lists (for example, the BNC word frequency list, WordNet, the internal corpus frequencies) and assign a weight to a word's appearance in each of them.

For example, a word that has a mid-level frequency in the corpus (say 50) but also appears in word list W could be considered common, since it is one of the most frequent words in the language as a whole. My question is: what is the best way to build a weighted score for something like this? Should I go discrete or continuous? Either way, which classification scheme would work best?
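To illustrate what I mean by a weighted score, here is a rough sketch (the weights and the names corpus_frequencies, bnc_frequencies and wordnet_words are placeholders I made up, not a design I have settled on):

    # Rough sketch only: the weights and the resource names
    # (corpus_frequencies, bnc_frequencies, wordnet_words) are placeholders.

    def commonness_score(word, corpus_frequencies, bnc_frequencies, wordnet_words):
        """Higher score = more 'common'."""
        score = 0.0

        # Internal corpus frequency, normalised against the ~100-use peak.
        score += 0.4 * min(corpus_frequencies.get(word, 0) / 100.0, 1.0)

        # External frequency list (e.g. BNC), assumed normalised to 0..1.
        score += 0.4 * bnc_frequencies.get(word, 0.0)

        # Membership in a lexical resource such as WordNet.
        if word in wordnet_words:
            score += 0.2

        return score

    # A word would then be "common" if the score passes some threshold, but
    # choosing that threshold (or staying continuous) is exactly my question.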

Or do you recommend an alternative method?

Thanks!


EDIT:

To answer Vinko’s question about the intended use of classification -

These words are extracted from a phrase (for example, the title of a book), and the goal is to work out a strategy for generating a search query string for that phrase when searching within a body of text. The query string can support several parameters, such as proximity, etc., so if a word is common, those parameters can be adjusted.
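As a rough illustration (the "term"~N proximity syntax is made up, Lucene-like, and is_common() stands in for whatever classifier comes out of this question), the idea is something like:

    # Rough illustration only: the query syntax is invented, and is_common()
    # represents the rare/common classifier I am asking about.

    def build_query(phrase_words, is_common):
        parts = []
        for word in phrase_words:
            if is_common(word):
                # Common word: loosen the match, e.g. allow more proximity slop.
                parts.append(f'"{word}"~5')
            else:
                # Rare word: require an exact match, it carries more signal.
                parts.append(f'"{word}"')
        return " AND ".join(parts)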

To answer Igor’s question -

(1) How big is your corpus? Currently the list is limited to 10,000 tokens, but this is just a training set. It could grow to a few hundred thousand once I start running it against a test set.

(2) Do you have any expected proportion of common/rare words in the corpus? Hmm, I do not.

3 answers

Assuming you have a way to evaluate the classification, you can use the machine learning approach known as "boosting". Boosting combines a set of weak classifiers into a single strong classifier.

Suppose that, in addition to your corpus, you have K external word lists you can use. Pick N frequency thresholds. For example, you might have 10 thresholds: 0.1%, 0.2%, ..., 1.0%. For your corpus and for each of the external word lists, create N "experts", one expert per threshold per word list/corpus, N * (K + 1) experts in total. Each expert is a weak classifier with a very simple rule: if the frequency of a word is above its threshold, it considers that word to be "common". Each expert has a weight.

The training process works as follows: assign a weight of 1 to every expert. Have the experts vote on every word in your corpus. Sum the votes: +1 * weight(i) for "common" votes and -1 * weight(i) for "rare" votes. If the result is positive, mark the word as common.

The general idea now is to evaluate the classification, increase the weight of the experts that were right and decrease the weight of the experts that were wrong. Then repeat the process until your evaluation is good enough.

The specifics of the weight adjustment depend on how you evaluate the classification. For example, if you have no per-word evaluation, you may still be able to judge the classification as having "too many common" or "too many rare" words. In the first case, promote all the experts that voted "rare" and demote all the experts that voted "common", or vice versa.
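If it helps to make this concrete, here is a minimal sketch under the assumptions that each source (your corpus plus the K external lists) is a dict mapping word to relative frequency and that evaluate() returns the coarse feedback described above; the names and the re-weighting step factor are illustrative only:

    # Minimal sketch of the expert-voting scheme described above.
    # Assumption: evaluate(labels) returns "too_many_common",
    # "too_many_rare" or "ok".

    THRESHOLDS = [i / 1000.0 for i in range(1, 11)]   # 0.1% ... 1.0%

    def make_experts(sources):
        """One expert per (source, threshold): N * (K + 1) experts in total."""
        return [{"source": src, "threshold": t, "weight": 1.0}
                for src in sources for t in THRESHOLDS]

    def classify(word, experts):
        """Weighted vote: a positive total means 'common'."""
        total = 0.0
        for e in experts:
            vote = 1 if e["source"].get(word, 0.0) > e["threshold"] else -1
            total += vote * e["weight"]
        return "common" if total > 0 else "rare"

    def boost(words, experts, evaluate, rounds=20, step=1.1):
        """Classify, get coarse feedback, re-weight the experts, repeat."""
        for _ in range(rounds):
            labels = {w: classify(w, experts) for w in words}
            feedback = evaluate(labels)
            if feedback == "ok":
                break
            for e in experts:
                # Does this expert mostly vote "common" over the word list?
                common_rate = sum(1 for w in words
                                  if e["source"].get(w, 0.0) > e["threshold"]) / len(words)
                leans_common = common_rate > 0.5
                if feedback == "too_many_common":
                    # Demote experts leaning "common", promote the others.
                    e["weight"] *= (1 / step) if leans_common else step
                else:  # "too_many_rare"
                    e["weight"] *= step if leans_common else (1 / step)
        return experts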


Your distribution is most likely a Pareto distribution (a superset of Zipf's law, as mentioned above). I am shocked that your most common word is used only 100 times - does that include words such as "a" and "the"? You must have a small corpus if that is the case.

In any case, you will have to choose a cutoff between "rare" and "common". One potential choice is the mean expected number of appearances (see the linked wiki article above for how to compute the mean). Because of the "fat tail" of the distribution, a fairly small number of words will have appearances above the mean - these are the "common" ones. The rest are "rare". The effect is that many more words will be rare than common. Not sure if that is what you are aiming for, but you can simply move the cutoff up and down to get the distribution you want (say, all words with > 50% of the expected value are "common").
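As a minimal sketch of that cutoff (assuming counts is a dict of word -> number of occurrences in your corpus):

    # Mean-frequency cutoff: counts maps word -> occurrences in the corpus.

    def split_by_mean(counts, fraction=1.0):
        mean = sum(counts.values()) / len(counts)
        cutoff = fraction * mean
        common = {w for w, c in counts.items() if c > cutoff}
        rare = set(counts) - common
        return common, rare

    # split_by_mean(counts)       -> cutoff at the mean
    # split_by_mean(counts, 0.5)  -> cutoff at 50% of the expected value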


Although this is not an answer to your question, you should know that you are reinventing the wheel here. Information retrieval experts have devised ways to weight search words according to their frequency. A very popular weighting scheme is TF-IDF, which uses a word's frequency in the document and its frequency in the corpus. TF-IDF is also explained here.
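For illustration, a minimal TF-IDF sketch (one common smoothed variant; real systems differ in the exact tf and idf formulas, and docs here is just a list of token lists):

    import math
    from collections import Counter

    def tf_idf(term, doc, docs):
        """doc is a list of tokens, docs is a list of such documents."""
        tf = Counter(doc)[term] / len(doc)        # frequency within the document
        df = sum(1 for d in docs if term in d)    # documents containing the term
        idf = math.log(len(docs) / (1 + df))      # smoothed inverse document frequency
        return tf * idf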

An alternative score is Okapi BM25, which uses similar factors.
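For reference, a sketch of the per-term BM25 contribution with typical default parameters (this follows the textbook formula, not any particular library's implementation):

    import math

    def bm25_term(tf, df, n_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
        """Per-term Okapi BM25 contribution; sum this over the query terms."""
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))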

See also the Lucene Similarity documentation for how TF-IDF is implemented in that popular search library.

