I'm trying to develop a Python script to examine every sentence in Barack Obama's second inaugural address and find similar sentences in past inaugurals. I've developed a very crude fuzzy match, and I'm hoping to improve it.
I start by reducing all inaugurals to lists of sentences with stop words removed. Then I build a frequency index.
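A minimal sketch of that preprocessing, for illustration only. The names `STOP_WORDS`, `to_sentences`, and `build_frequencies` are hypothetical, the stop list is a tiny stand-in, sentence splitting is a naive regex, and lemmatization is left out entirely:

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; a real run would use a fuller one).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "that", "is", "we", "our", "it", "by"}

def to_sentences(text):
    """Split text into sentences, each a list of lowercase non-stop-word tokens."""
    sentences = []
    # Naive split on sentence-ending punctuation; abbreviations are not handled.
    for raw in re.split(r"[.!?]+", text):
        tokens = [t for t in re.findall(r"[a-z']+", raw.lower()) if t not in STOP_WORDS]
        if tokens:
            sentences.append(tokens)
    return sentences

def build_frequencies(addresses):
    """Count token occurrences across all addresses (dict of name -> sentence lists)."""
    counts = Counter()
    for sentences in addresses.values():
        for sentence in sentences:
            counts.update(sentence)
    return counts
```

Here `addresses` would map something like a speaker/year label to the output of `to_sentences` for that address, and the resulting counter plays the role of the `frequencies` dict.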
Then I compare each sentence in Obama's 2013 address to each sentence of every other address and evaluate the similarity like so:
    # Compare two lemmatized sentences. Assumes stop words already removed.
    # frequencies is a dict of word frequencies across all inaugurals.
    def compare(sentA, sentB, frequencies):
        intersect = [x for x in sentA if x in sentB]
        N = [frequencies[x] for x in intersect]
Finally, I filter the results based on arbitrary cutoffs for n and c.
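The listing above stops after computing `N`, so the exact definitions of n and c are not shown. As a rough sketch only, assuming n is a rarity-weighted sum over the shared words and c is the share of matched words relative to the combined sentence length, the scoring and filtering could look like this. The names `score_pair` and `filter_matches` and the cutoff values are hypothetical:

```python
def score_pair(sentA, sentB, frequencies):
    """Return (intersect, n, c) for two stop-word-free token lists."""
    intersect = [x for x in sentA if x in sentB]
    # n: rarer shared words contribute more (assumed weighting, not the original formula).
    n = sum(1.0 / (frequencies.get(x, 0) + 1) for x in intersect)
    # c: matched tokens relative to combined length (assumed definition),
    # which penalizes very long sentences that match by sheer size.
    total = len(sentA) + len(sentB)
    c = len(intersect) / total if total else 0.0
    return intersect, n, c

def filter_matches(target, candidates, frequencies, min_n=0.5, min_c=0.1):
    """Keep candidates whose scores clear both (arbitrary, hypothetical) cutoffs."""
    matches = []
    for cand in candidates:
        intersect, n, c = score_pair(target, cand, frequencies)
        if n >= min_n and c >= min_c:
            matches.append((cand, intersect, n, c))
    return matches
```

The cutoffs would need tuning by hand against known allusions; nothing here makes them less arbitrary.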
It works better than one might think, identifying sentences that share uncommon words and not just common ones.
For example, it picked up these matches:
Obama, 2013: For history tells us that while these truths may be self-evident, they've never been self-executing; that while freedom is a gift from God, it must be secured by His people here on Earth.
Kennedy, 1961: With a good conscience our only sure reward, with history the final judge of our deeds, let us go forth to lead the land we love, asking His blessing and His help, but knowing that here on earth God's work must truly be our own.
Obama, 2013: Through blood drawn by lash and blood drawn by sword, we learned that no union founded on the principles of liberty and equality could survive half-slave and half-free.
Lincoln, 1865: Yet, if God wills that it continue until all the wealth piled by the bondsman's two hundred and fifty years of unrequited toil shall be sunk, and until every drop of blood drawn with the lash shall be paid by another drawn with the sword, as was said three thousand years ago, so still it must be said "the judgments of the Lord are true and righteous altogether."
Obama, 2013: This generation of Americans has been tested by crises that steeled our resolve and proved our resilience.
Kennedy, 1961: Since this country was founded, each generation of Americans has been summoned to give testimony to its national loyalty.
But this is very crude.
I don't have the chops for a major machine learning project, but I do want to apply more theory if possible. I understand bigram searching, but I'm not sure it will work here: it's not so much exact bigrams we're interested in as the general proximity of two words that are shared between the quotes. Is there a fuzzy sentence comparison that looks at the probability and distribution of words without being too rigid? The nature of allusion is that it's very approximate.
Current effort available on Cloud9IDE
UPDATE, 1/24/13: Per the accepted answer, here is a simple Python function for bigram windows:
    def bigrams(tokens, blur=1):
        grams = []
        for c in range(len(tokens) - 1):
            # Pair each token with the next `blur` tokens after it.
            for i in range(c + 1, min(c + blur + 1, len(tokens))):
                grams.append((tokens[c], tokens[i]))
        return grams
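One way the windowed bigrams might be used to compare two sentences, sketched under assumptions: the helper `shared_bigrams` is hypothetical, and treating a pair as unordered (so "lash ... blood" matches "blood ... lash") is a choice made here, not something the accepted answer specifies. `bigrams` is repeated so the snippet runs on its own:

```python
def bigrams(tokens, blur=1):
    """Pair each token with the next `blur` tokens that follow it."""
    grams = []
    for c in range(len(tokens) - 1):
        for i in range(c + 1, min(c + blur + 1, len(tokens))):
            grams.append((tokens[c], tokens[i]))
    return grams

def shared_bigrams(sentA, sentB, blur=2):
    """Return windowed word pairs found in both sentences, ignoring order within a pair."""
    pairs_a = {tuple(sorted(g)) for g in bigrams(sentA, blur)}
    pairs_b = {tuple(sorted(g)) for g in bigrams(sentB, blur)}
    return pairs_a & pairs_b

# Stop-word-free tokens from the Obama and Lincoln sentences quoted above.
obama = ["blood", "drawn", "lash", "blood", "drawn", "sword"]
lincoln = ["drop", "blood", "drawn", "lash", "paid", "another", "drawn", "sword"]
print(sorted(shared_bigrams(obama, lincoln)))
```

A larger `blur` tolerates more intervening words at the cost of more chance matches; the shared pairs could then be weighted by word rarity in the same way as single words.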