Automatically select tags from context using Python

How can I select tags from an article or a user's post using Python?

Is the following method OK?

  • Create a list of words from the text and sort them by frequency.

  • Remove some common words and select the top 10 words in the list as tags.

If the above method is OK, which library can determine which of these words are common (for example "the", "if", "you") and which are descriptive?
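The method described above can be sketched in a few lines using only the standard library. This is a minimal illustration, not a recommended implementation: the STOP_WORDS set is a hypothetical, deliberately tiny list, and the name suggest_tags is made up for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real one would be far longer.
STOP_WORDS = {"the", "if", "you", "a", "an", "and", "or", "of", "to",
              "in", "is", "from", "as", "them", "some"}


def suggest_tags(text, limit=10):
    """Return up to `limit` of the most frequent non-stop words in `text`."""
    # Lowercase the text and keep runs of letters only.
    words = re.findall(r"[a-z]+", text.lower())
    # Count every word that is not in the stop word list.
    counts = Counter(word for word in words if word not in STOP_WORDS)
    # most_common sorts by descending count.
    return [word for word, _ in counts.most_common(limit)]


if __name__ == "__main__":
    print(suggest_tags("Create a list of words from the text and sort the words."))
```

Swapping in a fuller stop word list only requires replacing STOP_WORDS; the counting logic stays the same.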

+4
5 answers

Here's an article about removing stop words. The link to the list of stop words in the article is broken, but here is another.

+4

The Natural Language Toolkit offers a wide selection of methods for this kind of thing. I can't give you practical advice, since I am not familiar with this topic, but I think it's worth reading a few articles about it before you start: just picking words directly from the text will not get you very far. I think you should probably try to find words similar to those that already exist. And, of course, you need to filter out common words such as "and" and so on. Again, this Python library can help you with that, at least for several common languages.

+3

I suggest you download the Stack Overflow data dump. It gives you many real-world posts with appropriate tags for testing various tag selection algorithms.

But in general, I doubt that it will work too well. For your own question, “words” is the clear winner in the word count, followed by a list of words with two occurrences each, such as “common”, “list”, “method”, “select” and “tags”. Which ones would you automatically choose as tags? The tags that you selected manually also include "python" and "context", neither of which appears with a high word frequency.

+2

Train a Bayes or Fisher filter on data that is already tagged (for example, the Stack Overflow data dump suggested by sth) and use it to classify new posts. I would recommend reading Toby Segaran's excellent book Programming Collective Intelligence for more information and Python examples on this topic.
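To make the idea concrete, here is a rough sketch of a naive Bayes tagger in plain Python. It is not the code from the book, and it does not implement the Fisher method; the class name NaiveBayesTagger, the tokenize helper, and the add-one smoothing are assumptions made for this example. In practice you would train it on a large tagged corpus rather than a handful of sentences.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTagger:
    """Hypothetical minimal naive Bayes classifier: one tag per post."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # tag -> word -> count
        self.doc_counts = Counter()              # tag -> number of training posts
        self.vocabulary = set()                  # every word seen in training

    def train(self, text, tag):
        words = tokenize(text)
        self.word_counts[tag].update(words)
        self.doc_counts[tag] += 1
        self.vocabulary.update(words)

    def score(self, text, tag):
        """Log of P(tag) times the product of P(word | tag), add-one smoothed."""
        total_docs = sum(self.doc_counts.values())
        log_prob = math.log(self.doc_counts[tag] / total_docs)
        tag_total = sum(self.word_counts[tag].values())
        denominator = tag_total + len(self.vocabulary)
        for word in tokenize(text):
            count = self.word_counts[tag][word]  # 0 for unseen words
            log_prob += math.log((count + 1) / denominator)
        return log_prob

    def classify(self, text):
        """Return the trained tag with the highest score for `text`."""
        return max(self.doc_counts, key=lambda tag: self.score(text, tag))


if __name__ == "__main__":
    tagger = NaiveBayesTagger()
    tagger.train("python decorators and generators", "python")
    tagger.train("football goal keeper and striker", "football")
    print(tagger.classify("how do generators work"))
```

Working in log space avoids underflow when posts are long, and the smoothing keeps a single unseen word from zeroing out a tag's score.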

+1

Instead of blacklisting words that shouldn't be tags, why not create a whitelist of words that would make good tags?

Start with a few tags that you would like to have, for example Python, off-topic, football, rickroll or whatnot (depending on the type of site you are building!), and have the system just suggest from among them. Then let users pick the appropriate tags, and also let them enter their own.

When a sufficient number of users propose a tag, it moves into the pool of "known good" tags for automatic suggestions, maybe after some moderation, so you can flag silly tags like the or lolol, or run-together variants such as objectoriented when you already have object-oriented.
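A rough sketch of such a pool, under some assumptions of mine: the class name TagPool, the promote_after threshold, and the explicit approve/reject moderation step are all hypothetical, and matching is done on whole words only.

```python
import re
from collections import defaultdict


class TagPool:
    """Hypothetical whitelist of tags with user proposals and moderation."""

    def __init__(self, seed_tags, promote_after=3):
        self.known_good = {tag.lower() for tag in seed_tags}
        self.promote_after = promote_after  # assumed threshold of distinct users
        self.proposers = defaultdict(set)   # candidate tag -> users who proposed it
        self.pending_review = set()         # candidates waiting for a moderator

    def propose(self, user, tag):
        """Record that `user` entered `tag`; queue it once enough users agree."""
        tag = tag.strip().lower()
        if not tag or tag in self.known_good:
            return
        self.proposers[tag].add(user)
        if len(self.proposers[tag]) >= self.promote_after:
            self.pending_review.add(tag)

    def approve(self, tag):
        """Moderator accepts a queued tag into the known-good pool."""
        if tag in self.pending_review:
            self.pending_review.discard(tag)
            self.known_good.add(tag)

    def reject(self, tag):
        """Moderator drops a queued tag, for example "lolol"."""
        self.pending_review.discard(tag)
        self.proposers.pop(tag, None)

    def suggest(self, text, limit=5):
        """Suggest known-good tags that occur as words in `text`."""
        words = set(re.findall(r"[a-z0-9-]+", text.lower()))
        return sorted(tag for tag in self.known_good if tag in words)[:limit]
```

Counting distinct users rather than raw proposals means one person cannot push a tag through by entering it repeatedly.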

Show only a few suggestions. Offer autocomplete. Limit the number of tags per item. If this is coding-related, maybe some kind of language detection system (the Linux file command is not too bad at this) will help your suggestion system.
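Two of these points, autocomplete and a cap on tags per item, are simple enough to sketch. The function names and the default limits here are made up for illustration; a real site would likely do the prefix lookup in the database or a search index.

```python
def autocomplete(prefix, known_tags, limit=5):
    """Return up to `limit` known tags starting with `prefix`, alphabetically."""
    prefix = prefix.strip().lower()
    if not prefix:
        return []
    matches = sorted(tag for tag in known_tags if tag.startswith(prefix))
    return matches[:limit]


def add_tag(item_tags, tag, max_tags=5):
    """Append `tag` unless it is a duplicate or the cap is reached.

    Returns True if the tag was added, False otherwise.
    """
    if tag in item_tags or len(item_tags) >= max_tags:
        return False
    item_tags.append(tag)
    return True
```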

0