C# code / algorithm to search text for terms

We have 5 MB of typical text (just plain words). We have 1000 words/phrases to use as search terms against this text.

What is the most efficient way to do this in .NET (ideally C#)?

Our ideas include regex (a single one, or lots of them), plus even the String.Contains approach.

The input is a text string ranging in size from 2 to 5 MB, all plain text. Multiple hits are fine: for every term (of the 1000) that matches, we want to know about it. Performance is the only concern; we do not care about memory footprint. The current algorithm takes 60+ seconds using naive String.Contains. We do not want "cat" to match "category" or even "cats" (i.e. the whole term must match a whole word; no stemming).

We expect a <5% hit ratio in the text. Ideally the results would simply be the terms that matched (we do not need position or frequency yet). We get a new 2-5 MB string every 10 seconds, so we cannot assume we can index the input. The 1000 terms are dynamic, although they change at a rate of about 1 change per hour.
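For reference, a minimal sketch of the kind of naive String.Contains loop described above (illustrative names, not the actual code); it shows both problems at once: one full scan of the string per term, and substring matches instead of whole words.

using System.Collections.Generic;

static List<string> NaiveSearch(string bigText, IEnumerable<string> terms)
{
    // One Contains check per term: up to 1000 full scans of the 2-5 MB string.
    // Also matches substrings ("cat" hits "category"), which is what we want to avoid.
    var hits = new List<string>();
    foreach (var term in terms)
    {
        if (bigText.Contains(term))
            hits.Add(term);
    }
    return hits;
}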

+3
9 answers

A naive String.Contains check with 762 words (taken from the final page) against the full text of "War and Peace" (3.3 MB) runs in about 10 seconds for me. Switching to 1000 GUIDs as the search terms takes about 5.5 seconds.

Regex.IsMatch with the same 762 words took about 0.5 seconds, and about 2.5 seconds with the 1000 GUIDs.

So on these numbers, per-term Regex.IsMatch already looks fast enough to be worth benchmarking against your real data.
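A rough sketch of how such a timing comparison might be set up (not the original benchmark code); the file path and term list are placeholders, and these patterns do plain substring matching rather than whole-word matching.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

static void TimeSearches(string path, List<string> terms)
{
    string text = File.ReadAllText(path);   // e.g. the War and Peace text

    var sw = Stopwatch.StartNew();
    int containsHits = terms.Count(t => text.Contains(t));
    Console.WriteLine($"String.Contains: {containsHits} hits in {sw.ElapsedMilliseconds} ms");

    sw.Restart();
    int regexHits = terms.Count(t => Regex.IsMatch(text, Regex.Escape(t)));
    Console.WriteLine($"Regex.IsMatch:   {regexHits} hits in {sw.ElapsedMilliseconds} ms");
}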

+3
+3

A few questions to narrow this down:

  • Is partial matching a problem or not? A plain string.Contains check for "cat" returns true against "concatinate"; is that acceptable, or does it have to be whole words only? (See the sketch after this list.)

  • What about word forms? If a term is "cat", should "cats" match? If plural or inflected forms need to be reduced to a root (cats → cat), you are looking at a stemming algorithm.

  • Are you heading toward NLP (Natural Language Processing)? If so, very common words such as "a, have, you, I, me, some, to" are unlikely to be useful search terms; could you filter them out up front?

" # ", 10 000 , 10 000 x .

" # today" < - , , .t.

.
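On the first bullet, a tiny sketch of the difference between substring matching and whole-word matching (the sample text is made up):

using System;
using System.Text.RegularExpressions;

static void Demo()
{
    string text = "how to concatinate strings when my cats are asleep";

    Console.WriteLine(text.Contains("cat"));              // True:  substring of "concatinate" and "cats"
    Console.WriteLine(Regex.IsMatch(text, @"\bcat\b"));   // False: no whole word "cat" in the text
    Console.WriteLine(Regex.IsMatch(text, @"\bcats\b"));  // True:  "cats" appears as a whole word
}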

+2

Split the incoming text into words, put the words into a hash set, and then test each of the 1000 terms against that set. The tokenising pass is linear in the text, and each term lookup is O(1).
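A minimal sketch of that idea, assuming single-word terms (multi-word phrases would need extra handling) and a simple \w+ tokenizer:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static List<string> HashSetSearch(string bigText, IEnumerable<string> terms)
{
    // One pass to collect the distinct words of the text...
    var words = new HashSet<string>(
        Regex.Matches(bigText, @"\w+").Cast<Match>().Select(m => m.Value),
        StringComparer.OrdinalIgnoreCase);

    // ...then each of the 1000 terms is a single O(1) lookup.
    return terms.Where(words.Contains).ToList();
}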

0

Have you profiled to see where the time actually goes? The real cost of the naive approach is that checking the terms one at a time means scanning the whole 2-5 MB string once per term: 1000 terms means 1000 passes over the text.

Rather than per-term Regex or KMP (which handle one pattern per pass), what you want is a multi-pattern algorithm that finds all of the terms in a single pass over the text.

See the Wu-Manber multi-pattern search paper: http://webglimpse.net/pubs/TR94-17.pdf

0

Another option: index the text as it arrives, with something like:

class Word
{
    public string Text;          // the word itself
    public List<int> Positions;  // every position where it occurs in the input
}

Split the incoming string into words as you receive it and build one of these entries per distinct word, recording every position in Positions (you do not need the positions yet, but they are cheap to keep and may be useful later).

Keep the entries in something you can search quickly, for example a SortedDictionary keyed by the word, or a sorted List<Word> that you binary-search.

Each of the 1000 terms then becomes a single lookup: O(log n) with the sorted structures, or O(1) if you use a hash-based dictionary instead. If you later need positions or frequencies, they are already sitting in Positions.
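A sketch of that index using a plain Dictionary instead of a SortedDictionary (so lookups are O(1) rather than O(log n)); the \w+ tokenizer and all names are illustrative:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Word -> character positions of every occurrence, built in one pass over the text.
static Dictionary<string, List<int>> BuildIndex(string bigText)
{
    var index = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);
    foreach (Match m in Regex.Matches(bigText, @"\w+"))
    {
        if (!index.TryGetValue(m.Value, out var positions))
            index[m.Value] = positions = new List<int>();
        positions.Add(m.Index);
    }
    return index;
}

// Checking the 1000 terms is then one dictionary lookup per term.
static List<string> FindTerms(Dictionary<string, List<int>> index, IEnumerable<string> terms)
{
    return terms.Where(index.ContainsKey).ToList();
}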

0

Is performance really such a problem? How often do you actually need to search a 5 MiB string? For exactly this kind of problem there are algorithms that run in O(n + m), where n is the length of the text and m is the combined length of the search terms; they process the text only once, regardless of how many terms there are.

Look at a multi-pattern algorithm such as Wu-Manber; there are C++ implementations around that could be ported or wrapped.
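Not Wu-Manber itself, but a compact sketch of the same single-pass idea using an Aho-Corasick automaton, which also runs in O(n + m). It reports substring hits, so whole-word filtering would still need a check of the characters around each match; all names here are illustrative.

using System.Collections.Generic;

// Builds a trie of all patterns, adds failure links with a breadth-first pass,
// then scans the text once, collecting every pattern that occurs in it.
class AhoCorasick
{
    private class Node
    {
        public Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public List<string> Output = new List<string>();
    }

    private readonly Node _root = new Node();

    public AhoCorasick(IEnumerable<string> patterns)
    {
        // 1. Trie of all patterns.
        foreach (var pattern in patterns)
        {
            var node = _root;
            foreach (var c in pattern)
            {
                if (!node.Next.TryGetValue(c, out var child))
                    node.Next[c] = child = new Node();
                node = child;
            }
            node.Output.Add(pattern);
        }

        // 2. Failure links, set in breadth-first (shallowest-first) order.
        var queue = new Queue<Node>();
        foreach (var child in _root.Next.Values)
        {
            child.Fail = _root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            foreach (var pair in node.Next)
            {
                var c = pair.Key;
                var child = pair.Value;
                var fail = node.Fail;
                while (fail != null && !fail.Next.ContainsKey(c))
                    fail = fail.Fail;
                child.Fail = fail == null ? _root : fail.Next[c];
                child.Output.AddRange(child.Fail.Output);  // inherit patterns ending here
                queue.Enqueue(child);
            }
        }
    }

    // Single pass over the text; returns every pattern that occurs in it.
    public HashSet<string> Search(string text)
    {
        var found = new HashSet<string>();
        var node = _root;
        foreach (var c in text)
        {
            while (node != _root && !node.Next.ContainsKey(c))
                node = node.Fail;
            if (node.Next.TryGetValue(c, out var next))
                node = next;
            foreach (var match in node.Output)
                found.Add(match);
        }
        return found;
    }
}

Usage would be new AhoCorasick(terms).Search(bigText), returning the set of terms that occur; the automaton only needs rebuilding when the term list changes (about once an hour, per the question).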

0

Why not just do it with a regex per term, something like this:

foreach (var term in allTerms)
{
    // \b is a word-boundary anchor, so "cat" will not match "category" or "cats".
    string pattern = @"\b" + Regex.Escape(term) + @"\b";
    Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
    if (regex.IsMatch(bigTextToSearchForTerms))
    {
        result.Add(term);
    }
}

(Untested!) Instead of running 1000 separate regexes, you could also combine all 1000 terms into one pattern of the form "\bterm1\b|\bterm2\b|\btermN\b" and use a single regex.Matches call (or regex.Matches.Count) over the text.
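A sketch of that combined-pattern variant: it escapes each term, joins them with "|", wraps the whole thing in word boundaries, and makes one pass with a single compiled regex. Note that terms ending in a non-word character (for example "C#") interact awkwardly with \b and would need special handling.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static HashSet<string> FindTermsWithOneRegex(string text, IEnumerable<string> terms)
{
    // \b(?:term1|term2|...)\b, with every term escaped so regex metacharacters are literal.
    string pattern = @"\b(?:" + string.Join("|", terms.Select(Regex.Escape)) + @")\b";
    var regex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // The matched text is collected; with IgnoreCase it may differ in case from the original term.
    return new HashSet<string>(
        regex.Matches(text).Cast<Match>().Select(m => m.Value),
        StringComparer.OrdinalIgnoreCase);
}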

0

Have you considered LINQ? You could use it like this, for example...

List<String> allTerms = new List<String> { "string1", "string2", "string3", "string4" };
List<String> matches = allTerms.Where(item => Regex.IsMatch(bigTextToSearchForTerms, item, RegexOptions.IgnoreCase)).ToList();

You could also use FindAll instead of LINQ:

static string bigTextToSearchForTerms; // the 2-5 MB input

static bool Match(string checkItem)
{
  return Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase);
}

static void Main(string[] args)
{
  List<String> allTerms = new List<String> { "string1", "string2", "string3", "string4" };
  List<String> matches = allTerms.FindAll(Match);
}

Or, if you prefer a lambda over a separate method (still FindAll, no LINQ):

List<String> allTerms = new List<String> { "string1", "string2", "string3", "string4" };
List<String> matches = allTerms.FindAll(checkItem => Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase));

I have not tested any of these for performance, but they all implement your idea of iterating through the search list with a regular expression. These are just different ways of implementing it.

0
