Algorithm to see if keywords exist inside a string

Let's say I have a set of keywords in the array {"olympic games", "best sports tennis", "tennis", "tennis rules"}.

Then I have a list of lines (actually tweets, up to 50 of them), so each line is no more than 140 characters.

I want to look at each line and see what keywords are there. In the case where a keyword consists of several words, such as "best sports tennis", the words do not have to be together in a line, but they all should appear.

I am having trouble finding an algorithm that does this efficiently.

Do you have any suggestions for this? Thanks!

Edit: To explain a little better, each keyword has an identifier associated with it, so {1: "olympics", 2: "best sports tennis", 3: "tennis", 4: "tennis rules"}

I want to go through the list of lines/tweets and see which keywords each one matches. The result should be something like "this tweet belongs to keyword #4". (Several matches are possible, so everything that matches keyword 2 will also match keyword 3, since both contain "tennis".)

When a keyword has multiple words, for example "best sports tennis", the words do not have to appear together, but all of them must appear. For example, this line should match: "I just played tennis, I like sports, its best". Since the line contains "best", "sports" and "tennis", it matches and is associated with that keyword's ID (which is 2 in this example).

Edit 2: Matching should be case-insensitive.

+5
6 answers
IEnumerable<string> tweets, keywords;

var x = tweets.Select(t => new
                           {
                               Tweet = t,
                               Keywords = keywords.Where(k => k.Split(' ')
                                                               .All(t.Contains))
                                                  .ToArray()
                           });
+6

For matching many keywords at once, look at multi-pattern search algorithms such as Aho-Corasick (trie-based) or Wu and Manber.
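As a rough illustration of the Aho-Corasick idea, here is a minimal sketch adapted to the question: the individual words of all keywords go into one automaton, each tweet is scanned once, and a keyword ID matches when all of its words were seen. The class and method names (`AhoCorasick`, `KeywordIndex`, `FindAll`, `Match`) are made up for this example and are not from any library. It does case-insensitive substring matching, like `Contains` in the other answers; Wu and Manber is not shown.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: a small Aho-Corasick automaton over lower-cased patterns.
public sealed class AhoCorasick
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public readonly List<string> Outputs = new List<string>();
        public Node Fail;
    }

    private readonly Node _root = new Node();

    public AhoCorasick(IEnumerable<string> patterns)
    {
        // Phase 1: insert every pattern into the trie.
        foreach (string pattern in patterns)
        {
            string lower = pattern.ToLowerInvariant();
            Node node = _root;
            foreach (char c in lower)
            {
                if (!node.Next.TryGetValue(c, out Node child))
                {
                    child = new Node();
                    node.Next[c] = child;
                }
                node = child;
            }
            node.Outputs.Add(lower);
        }

        // Phase 2: breadth-first pass that sets the failure links.
        var queue = new Queue<Node>();
        foreach (Node first in _root.Next.Values)
        {
            first.Fail = _root;
            queue.Enqueue(first);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in node.Next)
            {
                Node fail = node.Fail;
                while (fail != null && !fail.Next.ContainsKey(edge.Key)) fail = fail.Fail;
                edge.Value.Fail = fail == null ? _root : fail.Next[edge.Key];
                // Inherit patterns that end at the failure target (proper suffixes).
                edge.Value.Outputs.AddRange(edge.Value.Fail.Outputs);
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Returns every pattern that occurs somewhere in the text, in one pass.
    public HashSet<string> FindAll(string text)
    {
        var found = new HashSet<string>();
        Node node = _root;
        foreach (char c in text.ToLowerInvariant())
        {
            while (node != _root && !node.Next.ContainsKey(c)) node = node.Fail;
            if (node.Next.TryGetValue(c, out Node next)) node = next;
            found.UnionWith(node.Outputs);
        }
        return found;
    }
}

// Hypothetical wrapper: maps keyword IDs to their words and builds the automaton once.
public sealed class KeywordIndex
{
    private readonly AhoCorasick _automaton;
    private readonly Dictionary<int, string[]> _wordsById = new Dictionary<int, string[]>();

    public KeywordIndex(IDictionary<int, string> keywords)
    {
        var allWords = new HashSet<string>();
        foreach (KeyValuePair<int, string> kv in keywords)
        {
            string[] words = kv.Value.ToLowerInvariant()
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            _wordsById[kv.Key] = words;
            allWords.UnionWith(words);
        }
        _automaton = new AhoCorasick(allWords);
    }

    // Returns the sorted IDs of all keywords whose words all occur in the tweet.
    public List<int> Match(string tweet)
    {
        HashSet<string> found = _automaton.FindAll(tweet);
        List<int> ids = _wordsById
            .Where(kv => kv.Value.All(w => found.Contains(w)))
            .Select(kv => kv.Key)
            .ToList();
        ids.Sort();
        return ids;
    }
}
```

With the question's keywords {1: "olympics", 2: "best sports tennis", 3: "tennis", 4: "tennis rules"}, calling `new KeywordIndex(keywords).Match("I just played tennis, I like sports, its best")` should yield IDs 2 and 3. The index is meant to be built once and reused for every tweet.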


+1


        string[] keywords = new string[] {"olympics", "sports tennis best", "tennis", "tennis rules"};

        string testString = "I like sports and the olympics and think tennis is best.";

        string[] usedKeywords = keywords.Where(keyword => keyword.Split(' ').All(s => testString.Contains(s))).ToArray();
+1


' Maps each tweet to the first keyword found in it.
Dim matchingStrings As New Dictionary(Of String, String)()
For Each stringToSearch As String In tweetList
    For Each keyword As String In keywordList
        If stringToSearch.Contains(keyword) Then
            matchingStrings.Add(stringToSearch, keyword)
            Exit For
        End If
    Next
Next


EDIT: The same in C#:

Dictionary<string, string> matchingString = new Dictionary<string, string>();
foreach (string stringToSearch in tweetList)
{
    foreach (string keyword in keywordList)
    {
        // A keyword matches when every one of its words occurs in the tweet
        // (this also covers the case where the whole keyword occurs verbatim).
        bool allWordsFound = true;
        foreach (string word in keyword.Split(' '))
        {
            if (!stringToSearch.Contains(word))
            {
                allWordsFound = false;
                break;
            }
        }
        if (allWordsFound)
        {
            // Records only the first matching keyword per tweet.
            matchingString.Add(stringToSearch, keyword);
            break;
        }
    }
}

0


  foreach (var s in strings)
  {
      foreach (var keywordList in keywordSet) 
      {
          if (s.ContainsAll(keywordList))
          {
              // hit!
          }
      }
  }

...

// Extension methods must be static and declared in a static class.
private static bool ContainsAll(this string s, string keywordList)
{    
    foreach (var singleWord in keywordList.Split(' '))
    {
        if (!s.Contains(singleWord)) return false;
    }
    return true;
}
0

There are ways to pre-process the strings to make the search more efficient, but I believe the overhead outweighs the gain for such short strings. This is not much data, so I would simply loop through the lines:

foreach (string tweet in tweets) {
  foreach (string keywords in theArray) {
    string[] keyword = keywords.Split(' ');
    bool found = true;
    foreach (string word in keyword) {
      if (tweet.IndexOf(word) == -1) {
        found = false;
        break;
      }
    }
    if (found) {
      // all words exist in the tweet
    }
  }
}
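For comparison, the pre-processing mentioned above could look roughly like the following sketch: each keyword and each tweet is split once into a set of lower-cased words, and a keyword matches when its word set is a subset of the tweet's. The names `KeywordMatcher`, `Tokenize`, `Prepare` and `Match` are invented for this example. Note the assumptions: it matches whole words only (so "tennis" would not match inside "tennisball", unlike `IndexOf`), it is case-insensitive, and it reports keyword IDs as described in the question's edit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper for word-set based keyword matching.
public static class KeywordMatcher
{
    // Splits text into lower-cased words; any character that is not a
    // letter or digit is treated as a separator.
    public static HashSet<string> Tokenize(string text)
    {
        var words = new HashSet<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                words.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0) words.Add(current.ToString());
        return words;
    }

    // Pre-processes the keywords once: keyword ID -> set of its words.
    public static Dictionary<int, HashSet<string>> Prepare(IDictionary<int, string> keywords)
    {
        return keywords.ToDictionary(kv => kv.Key, kv => Tokenize(kv.Value));
    }

    // Returns the sorted IDs of all keywords whose words all occur in the tweet.
    public static List<int> Match(string tweet, Dictionary<int, HashSet<string>> prepared)
    {
        HashSet<string> tweetWords = Tokenize(tweet);
        return prepared
            .Where(kv => kv.Value.IsSubsetOf(tweetWords))
            .Select(kv => kv.Key)
            .OrderBy(id => id)
            .ToList();
    }
}
```

For the question's keywords and the tweet "I just played tennis, I like sports, its best", `Match` should return IDs 2 and 3. Whether this beats the plain loops for 50 tweets is doubtful, as noted above; the main gain is the whole-word and case-insensitive behaviour.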
0
