Match strings containing permutations

Question

Match strings containing permutations

Say you have a large table containing a varchar column.

How would you match strings containing the word "preferred" in varchar col, but the data is somewhat noisy and contains random spelling errors, for example:

['$2.10 Cumulative Convertible Preffered Stock, $25 par value', '5.95% Preferres Stock', 'Class A Preffered', 'Series A Peferred Shares', 'Series A Perferred Shares', 'Series A Prefered Stock', 'Series A Preffered Stock', 'Perfered', 'Preffered C']

The permutations of the word “preferable” in the spelling errors above seem to show a family resemblance , but very little that they all have in common. Note that splitting each word and doing levenshtein on each word on each line will be overly expensive.

UPDATE:

There are several more similar examples, for example. with "limited":

 ['Resticted Stock Plan', 'resticted securities', 'Ristricted Common Stock', 'Common stock (restrticted, subject to vesting)', 'Common Stock (Retricted)', 'Restircted Stock Award', 'Restriced Common Stock',]

+4

nlp information-retrieval

ʞɔıu Apr 08 '09 at 10:12

source share

4 answers

tpdi · Answer 1 · 2009-04-08T22:31:53+0000

Create two more tables, spelling and possible wording:

- you can define types

 create table spelling ( id, word ) ; create table possible_spelling ( id, spelling_id references spelling(id), spelling ) -- possible spelling also includes the correct spelling -- all values are lowercase insert into spelling( word ) values ('preferred'); insert into possible_spelling( spelling_id, spelling ) select 1, '%preferred%' union select 1, '%prefered%' union ....; select * from bigtable a join possible_spelling b on (lower(a.data) like b.spelling ) join spelling c on (b.spelling_id = c.id) where c.word = 'preferred';

Objection: this will be slow and requires installation. Answer: not so slow, and it should be one time to classify and correct your data. Once for customization, once for each row to classify.

Simon nickerson · Answer 2 · 2009-04-08T22:18:23+0000

Could you try to study it on a small sample of your table to find possible spelling errors (using split + Levenshtein), and then use the resulting list of words in the full table?

bytebender · Answer 3 · 2009-04-08T22:22:54+0000

Trying to do it in TSQL or in what language?

You can hit most of them with regex.

some change next

 "p(er|re|e)f{1,2}er{1,2}ed" "r(e|i)s?t(ri|ir|rti|i)ct?ed"

you want to make sure that this cap is not sensitive ...

unmounted · Answer 4 · 2009-04-10T07:59:34+0000

I could do something like this: if you can leave with Levenshtein once - here is a wonderful spelling implementation from Peter Norvig

 import re, collections def words(text): return re.findall('[az]+', text.lower()) def train(features): model = collections.defaultdict(lambda: 1) for f in features: model[f] += 1 return model NWORDS = train(words(file('big.txt').read())) alphabet = 'abcdefghijklmnopqrstuvwxyz' def edits1(word): s = [(word[:i], word[i:]) for i in range(len(word) + 1)] deletes = [a + b[1:] for a, b in s if b] transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1] replaces = [a + c + b[1:] for a, b in s for c in alphabet if b] inserts = [a + c + b for a, b in s for c in alphabet] return set(deletes + transposes + replaces + inserts) def known_edits2(word): return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS) def known(words): return set(w for w in words if w in NWORDS) def correct(word): candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word] return max(candidates, key=NWORDS.get)

He creates a training set here: http://norvig.com/big.txt Here is an example output:

 >>> correct('prefferred') 'preferred' >>> correct('ristricted') 'restricted' >>> correct('ristrickted') 'restricted'

In your case, you can copy the original column to a new one, but skip it through the spellcheck when you do. Then place the fulltext pointer in the column with the correct entry and match your queries with it, but return the results from the original column. You only need to do this once, as opposed to calculating distances each time. You can also check the input or check the corrected version only as a backup. In any case, it is worth exploring the example of Norwig.

Match strings containing permutations

More articles: