Is there an efficient algorithm for performing inverted full-text search?

I have a limited list of thousands of keywords (one or more words in each keyword) in the database. I want to effectively find which of these keywords are listed in the input text, without having to check each of these keywords one by one (full table scan). Allowing the matching of some words with errors in the text would be even better, but not essential. Any suggestions on an algorithm / article to solve this problem?

+5
source share
6 answers

I think some of the answers still do not understand the given problem. I understand that you have a (rather large) list of words and (rather large) text. You want to know what words are on both lists, is this correct?

If so, this is not at all a problem with the full text. Basically, you only have two word lists (source keywords and a list of words from the input text). If you sort both lists, you can view both lists at the same time and extract common words.

Assuming the keyword list has already been sorted, you can extract and sort words from the text text for O (n logn) time, and then simultaneously view both lists: O (n + m) (where n is the number of words in the text and m - the number of words in the list of keywords).

+4
source

, , , Oracle, Oracle Text ( ). , CONTAINS Oracle Text, , , " ". CONTAINS , Levenshtein . Oracle Text FUZZY.

Oracle .

, , - . , :

, / , ( )

: , , , . Oracle Text CONTAINS, MATCHES (. 4.1.3), : , , . , ( ):

create table queries (
  query_id      number,
  query_string  varchar2(80)
);

// Here we populate the table with the keywords
insert into queries values (1, 'oracle');
insert into queries values (2, 'larry or ellison');
insert into queries values (3, 'oracle and text');
insert into queries values (4, 'market share');

create index queryx on queries(query_string)
  indextype is ctxsys.ctxrule;

// This query will return the ids of the matched keywords
select query_id from queries
 where matches(query_string, 
               'Oracle announced that its market share in databases 
                increased over the last year.')>0

, , .

Edit2: , , .

+1

, Apache Lucene Apache Solr ( HTTP, Lucene) .

Apache Lucene API , .

, :

  • Lucene: , . , Lucene . , , Lucene .
  • Solr: . , Solr - . , Lucene, . HTTP XML , .

, Solr ""

/:

  • Solr: Wiki - , .
  • Lucene: StackOverflow ,
+1

, ( )

, :

  • ( , ..)
  • ( , , )

.

keyphraseID: 1 : " "

wordID: 1 Word:

wordID: 2 :

mapID: 1 keyphraseID: 1 wordID: 1 : 1

mapID: 2 keyphraseID: 1 wordID: 2 : 2

, - .

/ , , , - , , , .

Edit: -

, , , . "fix". .

+1

- Automaton.

( ) , .

, , , .

So, from a purely theoretical point of view:

  • Create an automaton from a set of keywords
  • Scan text due to Automaton

The last part is O (n), the first may be a little more complicated.

0
source

Internet search of the Aho-Korasik algorithm.

0
source

All Articles