Python: Search Engine Keyword Clustering

Python: Keyword Clustering in a Search Engine

Hi, I have a CSV, up to 20,000 lines (I had 100,000+ for different sites), each line contains the referring keyword (i.e. the keyword that someone typed in the search engine to find the corresponding web site) and several visits.

What I want to do is cluster these keywords into clusters of "similar value" and create a hierarchy of clusters (structured in the order of summing the total number of queries per cluster).

An example cluster - “women's clothing” - would ideally contain keywords on these lines: women's clothing, 1000 women's clothing, 300 women's clothing, 50 women's clothing, 6 women's clothing, 2

I could use something like the Python Natural Language Toolkit: http://www.nltk.org/ and WordNet, but I assume that for some websites the referring keywords will be words / phrases about which WordNet is nothing does not know. For example, if the website is a celebrity website, WordNet is unlikely to know anything about "Lady Gaga," which is worse if the site is a news site.

So, I also assume that the solution should be one that tries to use only the source data.

My query is very similar to the one that was raised in How to copy keywords in a search engine? I'm just looking to start somewhere, but using Python instead of Java.

I also wondered if Google Predict and / or Google Refine could be useful.

, / ,

,

+5
2

, . , - nltk wordnet ( )

( )

/

POS, ( tagger ) , , wordnet . , lavenshtein, B/K ..

/.

, Python (, PyML, Reverend ..) . , Google ngram

0

All Articles