Searching multiple files for multiple strings

Question


I know efficient ways to find one string in a file (KMP) or several strings in a file (a trie).

But for years I have wondered whether there is a way (and have sometimes suspected it is not possible) to search multiple files for multiple strings.

Let's say I have a million files, and I want to answer queries like "find the files that contain the strings "banana", "motorboat" and "the white fox"". What would be an efficient algorithm? Is there one?

Of course, such a search can be done in time linear in the total size of the files. But that seems impractical for a large number of large files. The existence of Google suggests that there actually is a very fast algorithm for this, perhaps even one where the cost of each query depends only on the size of the query and not on the size of the text database (of course, such an algorithm would involve some preprocessing of the input files).

I think there must be such an algorithm (Google does it!), but my searches did not turn up anything.


algorithm complexity-theory search full-text-search

josinalvo Feb 11 '14 at 16:37


5 answers

, . . .

0

Jan Nielsen 23 . '14 4:32

GhostGambler · Answer 1 · 2014-02-23T12:13:04+0000

Concurrent programming

: , , . , Google. . ( Google.) .

"MapReduce"

Google , , MapReduce, . ( ) . .

:

. . , node. .
: .

( , " grep", .)

, , , " ", ., , Rabin-Karp -- ( -). , .

. , , Google File System (GFS), . .

, .

.

MapReduce: , , . MapReduce, parallelism , .

.

(, , MapReduce, ).
, , ( ) .
, .

, , , , -, , , , .
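The map/reduce idea named in this answer can be illustrated on a single machine. The following is a minimal sketch, not this answer's own code: a thread pool stands in for cluster nodes, and the names `map_file` and `parallel_search` are hypothetical. The map step reports which search strings occur in one file; the reduce step keeps the files where all of them were found.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def map_file(path, needles):
    """Map step: report which of the needles occur in one file."""
    text = Path(path).read_text(errors="ignore")
    return path, {n for n in needles if n in text}


def parallel_search(paths, needles, workers=4):
    """Run the map step in a worker pool, then keep files that matched every needle."""
    needles = set(needles)
    # Assumption: threads stand in for the nodes of a real cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(lambda p: map_file(p, needles), paths)
        # Reduce step: filter on "all needles found".
        return sorted(str(p) for p, found in mapped if found == needles)
```

This still reads every file on every query, so the total work stays linear in the size of the corpus; parallelism only divides that work across workers.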

blubb · Answer 2 · 2014-02-22T15:00:15+0000

( . . , .)

Model

...

f ,
w ,
d (d - , ),
q
r .

, q < < d < f < w (.. " ", ), q , O(1). , , O(f) O(w), .

, , O(r), , .

, - , :

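from collections import defaultdict

index = defaultdict(set)  # word -> set of files that contain it
for file in files:        # files: iterable of file paths
  with open(file) as fh:
    for word in fh.read().split():
      index[word].add(file)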

O(w), ( ). , query, :

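def search(index, query):
  # Start from the rarest word so the working set stays small.
  wordWithLeastFilesMatching = min(query, key=lambda word: len(index.get(word, ())))
  result = set(index.get(wordWithLeastFilesMatching, ()))
  for word in query:
    result = result.intersection(index.get(word, ()))
  return result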

q, . , , O(log(f)) , . O(log(f)).

, , O(f) , (, , r) . - O(f).

Mark Setchell · Answer 3 · 2014-02-11T17:38:59+0000

, , , - .

, , 1000 000 , 250 000 , , 4 , .

- , , ".txt":

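#!/bin/bash
# Print every .txt file that contains all three strings.
# Each grep -l only runs on files that passed the previous one.
find . -name "*.txt" | while IFS= read -r a
do
  grep -l banana "$a" | while IFS= read -r b
  do
    grep -l motorboat "$b" | while IFS= read -r c
    do
      grep -l "the white fox" "$c"
    done
  done
done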

.

, awk 3 , , .
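A single pass per file that checks all three strings at once can be sketched in Python. This is an illustration rather than code from this answer; the function name `files_with_all` and its parameters are hypothetical. Like `grep`, it matches line by line, so a string that is split across a line break is not found.

```python
from pathlib import Path

NEEDLES = ("banana", "motorboat", "the white fox")


def files_with_all(root=".", needles=NEEDLES, pattern="*.txt"):
    """Yield each file under root that contains every needle."""
    for path in sorted(Path(root).rglob(pattern)):
        remaining = set(needles)
        with open(path, errors="ignore") as fh:
            for line in fh:
                # Drop the needles that this line satisfies.
                remaining = {n for n in remaining if n not in line}
                if not remaining:  # all found: stop reading this file early
                    yield path
                    break
```

Each file is opened once instead of up to three times, and reading stops as soon as the last missing string has been seen.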

, , , . , " ", , , , . , , , .

, ".txt" , . , , , ( ) script :

#!/usr/bin/perl
use strict;
use warnings;

my %words;

# Load all files ending in ".txt"
my @files=<*.txt>;
foreach my $file (@files){
   print "Loading: $file\n";
   open my $fh, '<', $file or die "Could not open $file: $!";
   while (my $line = <$fh>) {
     chomp $line;
     foreach my $str (split /\s+/, $line) {
        $words{$str}{$file}=1;
     }
   }
   close($fh);
}

foreach my $str1 (keys %words) {
  print "Word: \"$str1\" is in : ";
  foreach my $str2 (keys %{ $words{$str1} }) {
    print "$str2 ";
  }
  print "\n";
}

, , :

./go
Loading: a.txt
Loading: b.txt
Loading: c.txt
Loading: d.txt
Word: "the" is in : c.txt d.txt 
Word: "motorboat" is in : b.txt d.txt 
Word: "white" is in : c.txt d.txt 
Word: "banana" is in : c.txt d.txt a.txt 
Word: "fox" is in : c.txt d.txt
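To show how such a word index could answer the original query, here is a minimal Python sketch. The file contents are hypothetical, chosen only so that they reproduce the index printed above, and `search` is an illustrative name. Because the index stores single words, a multi-word string such as "the white fox" is first reduced to its words to find candidate files, and the full string is then confirmed in those candidates only.

```python
from collections import defaultdict

# Hypothetical contents, consistent with the index printed above.
docs = {
    "a.txt": "banana",
    "b.txt": "motorboat",
    "c.txt": "white banana the fox",
    "d.txt": "banana motorboat the white fox",
}

# Build the word -> files index, splitting on whitespace like the Perl script.
index = defaultdict(set)
for name, text in docs.items():
    for word in text.split():
        index[word].add(name)


def search(strings):
    """Return the files containing every string, using the index to narrow candidates."""
    words = {w for s in strings for w in s.split()}
    # Intersect the posting sets, smallest first, to get candidate files.
    postings = sorted((index.get(w, set()) for w in words), key=len)
    candidates = set.intersection(*postings) if postings else set(docs)
    # The index only knows single words, so confirm each full string.
    return sorted(f for f in candidates if all(s in docs[f] for s in strings))


print(search(["banana", "motorboat", "the white fox"]))  # ['d.txt']
print(search(["the white fox"]))  # ['d.txt']; c.txt has the words but not the phrase
```

Only the candidate files are read in the final check, so the cost of a query is driven by the sizes of the posting sets rather than by the size of the whole corpus.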

korhner · Answer 4 · 2014-02-12T21:08:53+0000

, , trie, trie ?

, , , , . , google -, - .
