Can stop words be determined automatically?

In NLP, removing stop words is a typical preprocessing step, and it is usually done empirically, based on a list of words we consider to be stop words.

But in my opinion we should generalize the concept: stop words may differ from domain to domain. I wonder whether stop words can be determined mathematically, for example by their statistical characteristics, so that they can be extracted automatically from a corpus for a specific domain.

Has there been any work or progress on this question? Can anyone shed some light?

+8
machine-learning nlp data-mining text-mining
5 answers

Usually stop words are much more frequent than other, semantically meaningful words. So when building my application I used a combination of both: a fixed list and a statistical method. I used NLTK, which already ships a list of common stop words, so I first removed the words that appear on that list; but of course that did not remove all the stop words, since, as you mentioned, stop words differ from corpus to corpus.

I then computed the frequency of each word in the corpus and removed the words whose frequency was above a chosen threshold. That threshold was a value I settled on after inspecting the frequencies of all the words, so it too depends on the corpus, but you can determine it easily enough once you look through the list of all words sorted by frequency. This statistical step ensures that you also remove stop words that do not appear in the standard lists. After that, I used POS tagging to refine the data further and removed the proper nouns that still remained after the first two steps.
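A minimal sketch of that combined approach, assuming plain text documents and NLTK (the helper name clean_tokens and the 0.01 frequency threshold are only illustrative choices; the threshold would have to be tuned per corpus, and the NLTK download names can vary slightly between versions):

    # A fixed list plus a statistical method, as described above: NLTK's stop
    # word list, a corpus-frequency threshold, and POS tagging to drop proper
    # nouns. The 0.01 threshold is only a guess and would have to be tuned by
    # inspecting the ranked frequency list.
    from collections import Counter
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    def clean_tokens(documents, freq_threshold=0.01):
        # documents: list of raw text strings; keep the original case so the
        # POS tagger can still recognise proper nouns
        tokenized = [nltk.word_tokenize(doc) for doc in documents]

        # 1) NLTK's fixed English stop word list
        fixed = set(stopwords.words("english"))

        # 2) words whose relative corpus frequency exceeds the threshold
        counts = Counter(tok.lower() for doc in tokenized for tok in doc)
        total = sum(counts.values())
        frequent = {w for w, c in counts.items() if c / total > freq_threshold}

        # 3) POS-tag each document; the final filter drops fixed-list words,
        #    over-frequent words, and proper nouns (tags NNP / NNPS)
        cleaned = []
        for doc in tokenized:
            tagged = nltk.pos_tag(doc)
            cleaned.append([w for w, tag in tagged
                            if w.lower() not in fixed
                            and w.lower() not in frequent
                            and tag not in ("NNP", "NNPS")])
        return cleaned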

+3

I am not an expert, but I hope that my answer makes sense.

Statistically extracting stop words from a corpus sounds interesting! I would consider computing inverse document frequency, as suggested in the other answers, in addition to using a regular stop word list such as the one in NLTK. Stop words not only vary from corpus to corpus, they can also vary from problem to problem. For example, in one problem I worked on I used a corpus of news articles, where you find a lot of time-sensitive and location-sensitive words. That is important information, and statistically removing words such as "today", "here", etc. would have hurt my results, because news articles describe not only one specific event but also similar events that happened in the past or elsewhere.

In conclusion, I want to say that you will need to consider the problem at hand, and not just the corpus.

Thanks, Ramya

+3

Stop words are ubiquitous. They appear in every (or almost every) document. A good way to determine stop words mathematically, for corpora from different domains, is to compute the inverse document frequency (IDF) of each word.

IDF is a better measure than raw frequency for finding stop words, because simple frequency counts are skewed by a few specialized documents that contain a particular term many times. This approach has been used to learn stop words automatically in foreign languages (see the linked work on machine learning using SVM and other kernel methods).
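A rough sketch of the idea, assuming the documents are already tokenized: compute idf(w) = log(N / df(w)) and flag words whose IDF is close to zero. The function name low_idf_words and the 0.3 cutoff are only illustrative.

    # Flag words with very low inverse document frequency as stop word
    # candidates. idf(w) = log(N / df(w)), where df(w) is the number of
    # documents containing w; words occurring in (almost) every document
    # have an IDF near zero.
    import math
    from collections import Counter

    def low_idf_words(tokenized_docs, idf_cutoff=0.3):
        # tokenized_docs: list of lists of tokens
        n_docs = len(tokenized_docs)
        df = Counter()
        for doc in tokenized_docs:
            df.update(set(tok.lower() for tok in doc))

        idf = {w: math.log(n_docs / count) for w, count in df.items()}
        return sorted(w for w, value in idf.items() if value < idf_cutoff)

    docs = [["the", "cat", "sat"], ["the", "dog", "ran"],
            ["the", "cat", "and", "the", "dog"]]
    print(low_idf_words(docs))   # ['the'] on this toy corpus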

+1

In fact, the common approach to building a stop word list is simply to take the most frequent words (frequent across documents, i.e. by DF). Build a list of the top 100, 200, or 1000 words and look through it. Just scroll down the list until you hit a word that you think should not be a stop word, then consider either skipping that word or cutting the list off at that point.

In many data sets you will also have domain-specific stop words. If you use StackOverflow, for example, "java" and "c#" could very well be stop words (and removing them will not actually hurt, in particular if you also use the tags). Other domain-specific stop words could be "code", "implementation", or "program".
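A small sketch of that workflow, assuming tokenized documents: rank words by document frequency and print the top of the list so you can decide by eye where to cut it off (the helper name top_df_words is just for illustration).

    # Rank words by document frequency (DF) and print the top N for manual
    # review, as suggested above. Cut the list wherever the entries stop
    # looking like stop words for your domain.
    from collections import Counter

    def top_df_words(tokenized_docs, n=200):
        df = Counter()
        for doc in tokenized_docs:
            df.update(set(tok.lower() for tok in doc))
        return df.most_common(n)

    # for word, count in top_df_words(my_docs, n=200):
    #     print(f"{count:6d}  {word}")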

+1

Yes, stop words can be detected automatically.

Word frequencies in general

One way is to look at the frequency of words in general.

Calculate the frequency of all words in the combined texts. Sort them in descending order and remove the top 20% or so.

You can also remove the bottom 5%. Those are not stop words, but for a lot of machine learning purposes they are irrelevant, and may even be typos.
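A short sketch of one reading of this: apply the percentages to the vocabulary ranked by total frequency (the 20% and 5% figures are the ones mentioned above, but this exact interpretation and the helper name trim_vocabulary are assumptions).

    # Drop the most frequent ~20% and least frequent ~5% of distinct words.
    # The percentages apply to the ranked vocabulary, not to token counts,
    # and are illustrative only.
    from collections import Counter

    def trim_vocabulary(tokenized_docs, top_frac=0.20, bottom_frac=0.05):
        counts = Counter(tok.lower() for doc in tokenized_docs for tok in doc)
        ranked = [w for w, _ in counts.most_common()]        # descending frequency
        n = len(ranked)
        drop = set(ranked[:int(n * top_frac)])                # very common words
        drop |= set(ranked[n - int(n * bottom_frac):])        # very rare words
        return [[tok for tok in doc if tok.lower() not in drop]
                for doc in tokenized_docs]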

Word frequencies per "document"

Another way is to look at word frequencies per "document".

In a set of documents, stop words can be detected by searching for words that exist in a large number of documents. They would be useless for categorizing or clustering documents in this particular set.

E.g. a machine learning system that classifies scientific articles might, after analysis, mark the word "abstract" as a stop word, even though it may occur only once per document, because in all likelihood it occurs in nearly all of them.

The same can be said of words that are found in only a very limited number of documents. They are probably misspelled, or so unique that they will never be seen again.

However, in this case it is important that the documents in the training set are evenly distributed across groups; otherwise a set split into one large and one small group risks losing all of its significant words (since they may occur in too many documents, or in too few).

Another way to avoid the problem of unevenly distributed groups in the training set is to remove only the words that exist in all, or almost all, documents. (For example, our favorite stop words like "a", "it", "the", "an", etc. will occur in virtually every English text.)
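A sketch of this more conservative rule, assuming tokenized documents: only treat a word as a stop word if it appears in nearly every document (the 0.95 ratio and the helper name near_universal_words are illustrative assumptions).

    # Mark a word as a stop word only if it occurs in at least min_doc_ratio
    # of all documents (0.95 here is an illustrative choice).
    from collections import Counter

    def near_universal_words(tokenized_docs, min_doc_ratio=0.95):
        n_docs = len(tokenized_docs)
        df = Counter()
        for doc in tokenized_docs:
            df.update(set(tok.lower() for tok in doc))
        return {w for w, count in df.items() if count / n_docs >= min_doc_ratio}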

Zipf's law

When I was studying machine learning, Zipf's law came up in a discussion of stop words. Today, however, I could not tell you exactly how or why, but perhaps it is a general principle or mathematical foundation you would like to look into ...

I googled "Automatic Zipf Law Stop Word Detection", and a quick skim turned up two PDF files that might be of interest ...
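For reference, Zipf's law says that a word's frequency is roughly inversely proportional to its frequency rank, so frequency times rank stays roughly constant; the head of that ranked list is where the usual stop words sit. A quick, illustrative check on your own corpus (the helper name zipf_check is made up):

    # Zipf's law predicts frequency * rank is roughly constant, so printing
    # that product for the ranked vocabulary shows how sharply the most
    # common words (the usual stop words) dominate a corpus.
    from collections import Counter

    def zipf_check(tokenized_docs, top_n=20):
        counts = Counter(tok.lower() for doc in tokenized_docs for tok in doc)
        for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1):
            print(f"{rank:4d}  {word:15s}  freq={freq:6d}  freq*rank={freq * rank}")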

+1
