Effective Mass String Search Problem

Problem: A large static list of strings is provided. A pattern string consisting of data and wildcard elements (* and?). The idea is to return all the strings matching the pattern - quite simple.

Current solution: I am currently using a linear approach to scan a large list and smooth out each entry by template.

My question is: Are there any suitable data structures in which I can store a large list so that the search complexity is less than O (n)?

Perhaps something similar to the suffix-trie? I also examined the use of bi-and three-grams in a hash table, but the logic needed for conformity assessment, based on merging the list of returned words and the template, is a nightmare, in addition, I'm not sure if it is the right approach.

+5
source share
6 answers

You can create a regular trie and add border frames. then your complexity will be O (n), where n is the length of the pattern. You first need to replace the runs **with *in in the template (also operation O (n)).

If the list of words was , I’m a bull , then a trie would look something like this:

  (I ($ [I])
   a (m ($ [am])
      n ($ [an])
      ? ($ [am an])
      * ($ [am an]))
   o (x ($ [ox])
      ? ($ [ox])
      * ($ [ox]))
   ? ($ [I]
      m ($ [am])
      n ($ [an])
      x ($ [ox])
      ? ($ [am an ox])
      * ($ [I am an ox]
         m ($ [am]) ...)
   * ($ [I am an ox]
      I ...
    ...

python:

import sys

def addWord(root, word):
    add(root, word, word, '')

def add(root, word, tail, prev):
    if tail == '':
        addLeaf(root, word)
    else:
        head = tail[0]
        tail2 = tail[1:]
        add(addEdge(root, head), word, tail2, head)
        add(addEdge(root, '?'), word, tail2, head)
    if prev != '*':
        for l in range(len(tail)+1):
           add(addEdge(root, '*'), word, tail[l:], '*')

def addEdge(root, char):
    if not root.has_key(char):
        root[char] = {}
    return root[char]

def addLeaf(root, word):
    if not root.has_key('$'):
        root['$'] = []
    leaf = root['$']
    if word not in leaf:
        leaf.append(word)

def findWord(root, pattern):
    prev = ''
    for p in pattern:
        if p == '*' and prev == '*':
            continue
        prev = p
        if not root.has_key(p):
            return []
        root = root[p]
    if not root.has_key('$'):
        return []
    return root['$']

def run():
    print("Enter words, one per line terminate with a . on a line")
    root = {}
    while 1:
        line = sys.stdin.readline()[:-1]
        if line == '.': break
        addWord(root, line)
    print(repr(root))
    print("Now enter search patterns. Do not use multiple sequential '*'s")
    while 1:
        line = sys.stdin.readline()[:-1]
        if line == '.': break
        print(findWord(root, line))

run()
+1

, trie - , , , , . Theyre . , .

, parallelism. .

+1

, , , , , ['hello', 'world'], store :

[('d'    , 'world'),
 ('ello' , 'hello'),
 ('hello', 'hello'),
 ('ld'   , 'world'),
 ('llo'  , 'hello'),
 ('lo'   , 'hello'),
 ('o'    , 'hello'),
 ('orld' , 'world'),
 ('rld'  , 'world'),
 ('world', 'world')]

, .

, *or*, ('orld' , 'world'), or, .

, , h*o, h o .

0

, . - ? blah* , bl?h ( , ) ?

O (1), , .

0

, . b b abba*, . , , , ; , , . , n- .

, , , ( n-) . .

, , . . , , , .

0

. Google " ", . "" ( , : Wu-Manber, : Horspool). Aho-Corasick - , () , , . , Snort, , " ". , Aho-Corasick, ACISM - Aho-Corasick.

0
source

All Articles