Matching two large sets of strings in C #

Here is the situation:

I have a webpage that I cleared as a string.

I have several fields in an MSSQL database. For example, a car model, it has an identifier and a name, for example, Mustang or Civic. It is pre-populated with most car models.

I want to find any match for any row in my model table. Therefore, if I have Civic, Mustang and E350 in my model table, I want to find any information about any of the three on the page that I cleared.

What is an effective way to do this in C #. I am using LINQ to SQL to interact with db.

Does the dictionary of all models, page tokenization, and iteration through tokens make sense? Or should I just iterate over the tokens and use the WHERE clause and query the database if there is a match?

//Dictionary dic contains all models from the DB, with the name being the key and the id being the value... foreach(string pageToken in pageTokens) { if(dic.ContainsKey(pageToken)) { //Do what I need to do } } 

Both of these methods seem terrible to me. Any suggestions on what I should do? Something with an established intersection, I would think maybe nice?

None of these methods relate to what happens when the model name has more than one word ... for example, "F150 Extended Cab". Thoughts on this?

+4
source share
4 answers

Finding multiple lines in a larger text is a well-understood problem, and significant research has been done to speed up the work. The two most popular and effective methods for this are the Aho-Corasick algorithm (I would recommend this), and the Rabin-Karp algorithm . They use a little preprocessing, but several orders of magnitude smaller and faster than the naive method (the naive method has the worst case O (m * n ^ 2 * p), where m is the length of the long line [web page where you scraped] , and n is the average length of the needles, and p is the number of needles). Aho-Korsaik is linear. A C # implementation can be found in CodeProject for free.

Edit: Unfortunately, I was mistaken in the complexity of Aho-Corasick - it is linear in the number and length of input lines + the size of the analyzed line [scraped text] plus the number of matches. But it is still linear and linear is much better than cubic :-).

+5
source

My first approach would be super simple:

 foreach(string carModel in listOfCarModelsFromDatabase) { if(pageText.Contains(carModel) { // do something } } 

I would just start to worry about making it faster if the above was not fast enough. The list of car models simply cannot be so large (<10000?), And this is only one page of text.

+3
source

You should use Regex, not space-based tokenization.

With Regex, you can use spaces and be fine, and I believe that it will be faster than tokenization and a loop through a list of possible values.

How you construct this is Regex, although I'm not sure.

Most likely, you can just create a regex with each model, for example

 (Model 1|Model 2|Model 3) 

But I'm sure there are more efficient ways to do this in regex.

0
source

For a really simple solution that matches a substring (which should do pretty well), you can use a parameterized SQL query, for example:

 select ModelID, ModelName from Model where ? like '%' + ModelName + '%' 

where ? is a parameter that is replaced by the text of the entire web page.

0
source

All Articles