C# code / algorithm to search text for terms

We have 5 MB of typical text (just plain words). We have 1000 words/phrases to use as search terms against this text.

What is the most efficient way to do this in .NET (ideally C#)?

Our ideas include regex (a single one, or lots of them), plus even the String.Contains approach.

The input is a text string ranging in size from 2 to 5 MB, all plain text. Multiple hits are fine: for every term (of the 1000) that matches, we want to know about it. Performance is the only concern; we do not care about memory footprint. The current algorithm takes 60+ seconds using naive String.Contains. We do not want "cat" to match "category" or even "cats" (i.e. the whole term must match a whole word; no stemming).

We expect a <5% hit ratio in the text. Ideally the results would simply be the terms that matched (we do not need position or frequency yet). We get a new 2-5 MB string every 10 seconds, so we cannot assume we can index the input. The 1000 terms are dynamic, although they change at a rate of about 1 change per hour.
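For reference, a minimal sketch of the kind of naive String.Contains loop described above (illustrative names, not the actual code); it shows both problems at once: one full scan of the string per term, and substring matches instead of whole words.

using System.Collections.Generic;

static List<string> NaiveSearch(string bigText, IEnumerable<string> terms)
{
    // One Contains check per term: up to 1000 full scans of the 2-5 MB string.
    // Also matches substrings ("cat" hits "category"), which is what we want to avoid.
    var hits = new List<string>();
    foreach (var term in terms)
    {
        if (bigText.Contains(term))
            hits.Add(term);
    }
    return hits;
}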

+3
9 answers

A naive String.Contains check with 762 words (taken from the final page) against the full text of "War and Peace" (3.3 MB) runs in about 10 seconds for me. Switching to 1000 GUIDs as the search terms takes about 5.5 seconds.

Regex.IsMatch with the same 762 words took about 0.5 seconds, and about 2.5 seconds with the 1000 GUIDs.

So on these numbers, per-term Regex.IsMatch already looks fast enough to be worth benchmarking against your real data.
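A rough sketch of how such a timing comparison might be set up (not the original benchmark code); the file path and term list are placeholders, and these patterns do plain substring matching rather than whole-word matching.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

static void TimeSearches(string path, List<string> terms)
{
    string text = File.ReadAllText(path);   // e.g. the War and Peace text

    var sw = Stopwatch.StartNew();
    int containsHits = terms.Count(t => text.Contains(t));
    Console.WriteLine($"String.Contains: {containsHits} hits in {sw.ElapsedMilliseconds} ms");

    sw.Restart();
    int regexHits = terms.Count(t => Regex.IsMatch(text, Regex.Escape(t)));
    Console.WriteLine($"Regex.IsMatch:   {regexHits} hits in {sw.ElapsedMilliseconds} ms");
}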

+3
+3

A few questions to narrow this down:

  • Is partial matching a problem or not? A plain string.Contains check for "cat" returns true against "concatinate"; is that acceptable, or does it have to be whole words only? (See the sketch after this list.)

  • What about word forms? If a term is "cat", should "cats" match? If plural or inflected forms need to be reduced to a root (cats → cat), you are looking at a stemming algorithm.

  • Are you heading toward NLP (Natural Language Processing)? If so, very common words such as "a, have, you, I, me, some, to" are unlikely to be useful search terms; could you filter them out up front?

" # ", 10 000 , 10 000 x .

" # today" < - , , .t.

.
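On the first bullet, a tiny sketch of the difference between substring matching and whole-word matching (the sample text is made up):

using System;
using System.Text.RegularExpressions;

static void Demo()
{
    string text = "how to concatinate strings when my cats are asleep";

    Console.WriteLine(text.Contains("cat"));              // True:  substring of "concatinate" and "cats"
    Console.WriteLine(Regex.IsMatch(text, @"\bcat\b"));   // False: no whole word "cat" in the text
    Console.WriteLine(Regex.IsMatch(text, @"\bcats\b"));  // True:  "cats" appears as a whole word
}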

+2

Split the incoming text into words, put the words into a hash set, and then test each of the 1000 terms against that set. The tokenising pass is linear in the text, and each term lookup is O(1).
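A minimal sketch of that idea, assuming single-word terms (multi-word phrases would need extra handling) and a simple \w+ tokenizer:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static List<string> HashSetSearch(string bigText, IEnumerable<string> terms)
{
    // One pass to collect the distinct words of the text...
    var words = new HashSet<string>(
        Regex.Matches(bigText, @"\w+").Cast<Match>().Select(m => m.Value),
        StringComparer.OrdinalIgnoreCase);

    // ...then each of the 1000 terms is a single O(1) lookup.
    return terms.Where(words.Contains).ToList();
}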

0

Have you profiled to see where the time actually goes? The real cost of the naive approach is that checking the terms one at a time means scanning the whole 2-5 MB string once per term: 1000 terms means 1000 passes over the text.

Rather than per-term Regex or KMP (which handle one pattern per pass), what you want is a multi-pattern algorithm that finds all of the terms in a single pass over the text.

See the Wu-Manber multi-pattern search paper: http://webglimpse.net/pubs/TR94-17.pdf

0

Another option: index the text as it arrives, with something like:

class Word
{
    public string Text;          // the word itself
    public List<int> Positions;  // every position where it occurs in the input
}

Split the incoming string into words as you receive it and build one of these entries per distinct word, recording every position in Positions (you do not need the positions yet, but they are cheap to keep and may be useful later).

Keep the entries in something you can search quickly, for example a SortedDictionary keyed by the word, or a sorted List<Word> that you binary-search.

Each of the 1000 terms then becomes a single lookup: O(log n) with the sorted structures, or O(1) if you use a hash-based dictionary instead. If you later need positions or frequencies, they are already sitting in Positions.
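A sketch of that index using a plain Dictionary instead of a SortedDictionary (so lookups are O(1) rather than O(log n)); the \w+ tokenizer and all names are illustrative:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Word -> character positions of every occurrence, built in one pass over the text.
static Dictionary<string, List<int>> BuildIndex(string bigText)
{
    var index = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);
    foreach (Match m in Regex.Matches(bigText, @"\w+"))
    {
        if (!index.TryGetValue(m.Value, out var positions))
            index[m.Value] = positions = new List<int>();
        positions.Add(m.Index);
    }
    return index;
}

// Checking the 1000 terms is then one dictionary lookup per term.
static List<string> FindTerms(Dictionary<string, List<int>> index, IEnumerable<string> terms)
{
    return terms.Where(index.ContainsKey).ToList();
}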

0

Is performance really such a problem? How often do you actually need to search a 5 MiB string? For exactly this kind of problem there are algorithms that run in O(n + m), where n is the length of the text and m is the combined length of the search terms; they process the text only once, regardless of how many terms there are.

Look at a multi-pattern algorithm such as Wu-Manber; there are C++ implementations around that could be ported or wrapped.
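Not Wu-Manber itself, but a compact sketch of the same single-pass idea using an Aho-Corasick automaton, which also runs in O(n + m). It reports substring hits, so whole-word filtering would still need a check of the characters around each match; all names here are illustrative.

using System.Collections.Generic;

// Builds a trie of all patterns, adds failure links with a breadth-first pass,
// then scans the text once, collecting every pattern that occurs in it.
class AhoCorasick
{
    private class Node
    {
        public Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public List<string> Output = new List<string>();
    }

    private readonly Node _root = new Node();

    public AhoCorasick(IEnumerable<string> patterns)
    {
        // 1. Trie of all patterns.
        foreach (var pattern in patterns)
        {
            var node = _root;
            foreach (var c in pattern)
            {
                if (!node.Next.TryGetValue(c, out var child))
                    node.Next[c] = child = new Node();
                node = child;
            }
            node.Output.Add(pattern);
        }

        // 2. Failure links, set in breadth-first (shallowest-first) order.
        var queue = new Queue<Node>();
        foreach (var child in _root.Next.Values)
        {
            child.Fail = _root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            foreach (var pair in node.Next)
            {
                var c = pair.Key;
                var child = pair.Value;
                var fail = node.Fail;
                while (fail != null && !fail.Next.ContainsKey(c))
                    fail = fail.Fail;
                child.Fail = fail == null ? _root : fail.Next[c];
                child.Output.AddRange(child.Fail.Output);  // inherit patterns ending here
                queue.Enqueue(child);
            }
        }
    }

    // Single pass over the text; returns every pattern that occurs in it.
    public HashSet<string> Search(string text)
    {
        var found = new HashSet<string>();
        var node = _root;
        foreach (var c in text)
        {
            while (node != _root && !node.Next.ContainsKey(c))
                node = node.Fail;
            if (node.Next.TryGetValue(c, out var next))
                node = next;
            foreach (var match in node.Output)
                found.Add(match);
        }
        return found;
    }
}

Usage would be new AhoCorasick(terms).Search(bigText), returning the set of terms that occur; the automaton only needs rebuilding when the term list changes (about once an hour, per the question).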

0

Why not just do it with a regex per term, something like this:

foreach (var term in allTerms)
{
    // \b is a word-boundary anchor, so "cat" will not match "category" or "cats".
    string pattern = @"\b" + Regex.Escape(term) + @"\b";
    Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
    if (regex.IsMatch(bigTextToSearchForTerms))
    {
        result.Add(term);
    }
}

(Untested!) Instead of running 1000 separate regexes, you could also combine all 1000 terms into one pattern of the form "\bterm1\b|\bterm2\b|\btermN\b" and use a single regex.Matches call (or regex.Matches.Count) over the text.
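A sketch of that combined-pattern variant: it escapes each term, joins them with "|", wraps the whole thing in word boundaries, and makes one pass with a single compiled regex. Note that terms ending in a non-word character (for example "C#") interact awkwardly with \b and would need special handling.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static HashSet<string> FindTermsWithOneRegex(string text, IEnumerable<string> terms)
{
    // \b(?:term1|term2|...)\b, with every term escaped so regex metacharacters are literal.
    string pattern = @"\b(?:" + string.Join("|", terms.Select(Regex.Escape)) + @")\b";
    var regex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // The matched text is collected; with IgnoreCase it may differ in case from the original term.
    return new HashSet<string>(
        regex.Matches(text).Cast<Match>().Select(m => m.Value),
        StringComparer.OrdinalIgnoreCase);
}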

0

Have you considered LINQ? You could use it like this, for example...

List<String> allTerms = new List<String> { "string1", "string2", "string3", "string4" };
List<String> matches = allTerms.Where(item => Regex.IsMatch(bigTextToSearchForTerms, item, RegexOptions.IgnoreCase)).ToList();

You could also use FindAll instead of LINQ:

static string bigTextToSearchForTerms; // the 2-5 MB input

static bool Match(string checkItem)
{
  return Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase);
}

static void Main(string[] args)
{
  List<String> allTerms = new List<String> { "string1", "string2", "string3", "string4" };
  List<String> matches = allTerms.FindAll(Match);
}

Or, if you prefer a lambda over a separate method (still FindAll, no LINQ):

List<String> allTerms = new List<String> { "string1", "string2", "string3", "string4" };
List<String> matches = allTerms.FindAll(checkItem => Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase));

I have not tested any of these for performance, but they all implement your idea of iterating through the search list with a regular expression. These are just different ways of implementing it.

0
