What algorithm can I use to find common adjacent words / recurring patterns?

I have a large table in my database with many words from different texts, stored in textual order. I want to find the number of times / frequency with which certain sets of words appear together.

An example: suppose these 4 words appear together in many texts: United | States | of | America. I would get a result like:

USA : 50
United States : 45
States of America : 40

(This is just an example with 4 words; there may be fewer or more than 4.)

Is there any algorithm that can do this or something like this?

Edit: Some R or SQL code showing how to do this is welcome. I need a practical example of what I need to do.

Table structure

I have two tables: Token, which has id and text. The text is UNIQUE, and each entry in this table represents a different word.

TextBlockHasToken is the table in which the order of the text is stored. Each row represents a word in the text.

It stores textblockid, the block of text to which the token belongs; sentence, the sentence in which the token occurs; position, the position of the token within that sentence; and tokenid, a reference to the Token table.
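
To make the structure concrete, here is a minimal R sketch of the two tables as toy data frames; the column names follow the description above, but the contents are made up and the real schema may differ:

    # Toy versions of the two tables described above (hypothetical contents).
    token <- data.frame(
      id   = 1:4,
      text = c("united", "states", "of", "america"),
      stringsAsFactors = FALSE
    )

    text_block_has_token <- data.frame(
      textblockid = 1,
      sentence    = 1,
      position    = 1:4,
      tokenid     = 1:4
    )

    # Reconstruct one sentence in textual order by joining the two tables.
    merged <- merge(text_block_has_token, token, by.x = "tokenid", by.y = "id")
    merged <- merged[order(merged$textblockid, merged$sentence, merged$position), ]
    paste(merged$text, collapse = " ")   # "united states of america"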

+7
4 answers

These are called N-grams; in your case, 4-grams. They can indeed be obtained as a by-product of a Markov chain, but you can also use a sliding window (of size 4) to walk over the linear text while updating a 4-gram "histogram".
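
For illustration, a minimal R sketch of the sliding-window idea, assuming the text has already been tokenized into a character vector of words (the function name is invented here):

    # Count all n-grams of size n with a sliding window over a word vector.
    count_ngrams <- function(words, n = 4) {
      if (length(words) < n) return(table(character(0)))
      grams <- sapply(seq_len(length(words) - n + 1), function(i) {
        paste(words[i:(i + n - 1)], collapse = " ")
      })
      sort(table(grams), decreasing = TRUE)   # the n-gram "histogram"
    }

    words <- c("united", "states", "of", "america",
               "united", "states", "of", "america")
    count_ngrams(words, n = 4)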

UPDATE 2011-11-22: A Markov chain is a way of modeling the probability of switching to a new state based on the current state. It is the stochastic equivalent of a "state machine". In the natural-language case, the "state" is formed by the "previous N words", which implies that you consider the prior probability (of everything before those N words) to be equal to one. Computer people will most likely use a tree to implement Markov chains in the NLP case. The "state" is simply the path taken from the root to the current node, and the probabilities of the words that follow are the probabilities of the current node's children. But every time we choose a new child node, we actually shift down the tree and "forget" the root node; our window is only N words wide, which corresponds to N levels deep into the tree.

You can easily see that if you walk along the Markov chain / tree, at any time the probability before the first word is 1, the probability after the first word is P(w1), after the second word P(w2 | w1), and so on. So, while processing the corpus you build the Markov tree (i.e., update the frequencies in the nodes); at the end of the walk you can estimate the probability of a given word choice as freq(word) / SUM(freq(siblings)). For a word 5 levels deep in the tree, this is the probability of the word given the previous 4 words. If you want N-gram probabilities, you want the product of all the probabilities along the path from the root to the last word.
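
As a rough illustration of that chain rule (not part of the original answer), here is one way to estimate those probabilities directly from raw counts in R; count_phrase() and ngram_prob() are made-up helper names:

    # Count how often `phrase` occurs as a contiguous run inside `words`.
    count_phrase <- function(words, phrase) {
      n <- length(phrase)
      if (length(words) < n) return(0)
      hits <- vapply(seq_len(length(words) - n + 1),
                     function(i) all(words[i:(i + n - 1)] == phrase),
                     logical(1))
      sum(hits)
    }

    # Chain rule: P(w1..wk) = P(w1) * P(w2 | w1) * ... * P(wk | w1..wk-1),
    # each conditional estimated as freq(prefix + word) / freq(prefix),
    # i.e. the freq(word) / SUM(freq(siblings)) rule described above.
    ngram_prob <- function(words, phrase) {
      prob <- 1
      for (i in seq_along(phrase)) {
        prefix <- phrase[seq_len(i - 1)]
        num <- count_phrase(words, c(prefix, phrase[i]))
        den <- if (i == 1) length(words) else count_phrase(words, prefix)
        prob <- prob * num / den
      }
      prob
    }

    words <- c("united", "states", "of", "america", "united", "states")
    ngram_prob(words, c("united", "states"))   # (2/6) * (2/2) = 1/3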

+14

This is a typical use case for Markov chains. Estimate a Markov model from your text base and look for high probabilities in the transition table. Since these indicate the probability that one word will follow another, phrases will show up as high transition probabilities.

By counting how often the word at the beginning of a phrase appears in the texts, you can also derive absolute counts.
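
A rough R sketch of this approach, estimating first-order (bigram) transition probabilities from a toy word vector; the threshold and names are arbitrary choices for illustration:

    # Estimate bigram transition probabilities P(next word | current word).
    words <- c("the", "united", "states", "of", "america", "and",
               "the", "united", "nations")
    pairs <- data.frame(from = head(words, -1), to = tail(words, -1))

    counts <- table(pairs$from, pairs$to)     # raw transition counts
    trans  <- prop.table(counts, margin = 1)  # row-normalize to probabilities

    # Transitions with high probability hint at multi-word phrases.
    high <- which(trans > 0.9, arr.ind = TRUE)
    data.frame(from = rownames(trans)[high[, 1]],
               to   = colnames(trans)[high[, 2]],
               prob = trans[high])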

+4

Here is a small snippet that computes all combinations / n-grams of a text for a given set of target words. To cope with larger data sets it uses the hash library, although it is probably still quite slow...

    require(hash)

    get.ngrams <- function(text, target.words) {
      text <- tolower(text)
      # Split on non-word characters to get the individual words.
      split.text <- strsplit(text, "\\W+")[[1]]
      ngrams <- hash()
      current.ngram <- ""
      for (i in seq_along(split.text)) {
        word <- split.text[i]
        word_i <- i
        # Grow the n-gram while consecutive words are target words,
        # counting every prefix ("united", "united states", ...).
        while (word %in% target.words) {
          if (current.ngram == "") {
            current.ngram <- word
          } else {
            current.ngram <- paste(current.ngram, word)
          }
          if (has.key(current.ngram, ngrams)) {
            ngrams[[current.ngram]] <- ngrams[[current.ngram]] + 1
          } else {
            ngrams[[current.ngram]] <- 1
          }
          word_i <- word_i + 1
          word <- split.text[word_i]
        }
        current.ngram <- ""
      }
      ngrams
    }

So the following input...

    some.text <- "He states that he loves the United States of America, and I agree it is nice in the United States."
    some.target.words <- c("united", "states", "of", "america")
    usa.ngrams <- get.ngrams(some.text, some.target.words)

... will result in the following hash:

    > usa.ngrams
    <hash> containing 10 key-value pair(s).
      america : 1
      of : 1
      of america : 1
      states : 3
      states of : 1
      states of america : 1
      united : 2
      united states : 2
      united states of : 1
      united states of america : 1

Note that this function is case insensitive and counts any permutation of the target words, for example:

    some.text <- "States of united America are states"
    some.target.words <- c("united", "states", "of", "america")
    usa.ngrams <- get.ngrams(some.text, some.target.words)

... leads to:

    > usa.ngrams
    <hash> containing 10 key-value pair(s).
      america : 1
      of : 1
      of united : 1
      of united america : 1
      states : 2
      states of : 1
      states of united : 1
      states of united america : 1
      united : 1
      united america : 1
+2

I'm not sure whether it helps you, but here is a small Python program I wrote about a year ago that counts N-grams (well, only mono-, bi- and trigrams) and also calculates the entropy of each N-gram. I used it to count those N-grams in a large text. Link
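
For completeness, a tiny R sketch (not the Python program mentioned above) showing how the entropy of an n-gram frequency distribution can be computed:

    # Entropy (in bits) of an n-gram frequency distribution.
    ngram_entropy <- function(freqs) {
      p <- freqs / sum(freqs)
      -sum(p * log2(p))
    }

    bigram_counts <- c("united states" = 2, "states of" = 1, "of america" = 1)
    ngram_entropy(bigram_counts)   # 0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits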

+1
