Effectively check a string for one of several hundred possible suffixes

I need to write a C / C ++ function that would quickly check if a line ends with one of ~ 1000 predefined suffixes. In particular, the string is the host name, and I need to check if it belongs to one of several hundred predefined second-level domains.

This function will be called a lot, so you need to write it as efficiently as possible. Bitwise hacks, etc. Everything goes as long as it turns out quickly.

The set of suffixes is predefined at compile time and does not change.

I’m thinking about implementing the Rabin-Karp variant or writing a tool that will generate a function with nested ifs and switches that will be configured specifically for a specific set of suffixes. Since the application in question is 64-bit to speed up comparisons, I could store suffixes up to 8 bytes in length as an array sorted by lines and perform a binary search inside it.

Are there any other reasonable options?

+6
c ++ string algorithm url 64bit
source share
7 answers

After some research and discussion, I decided to go with a tripartite / state machine.

A string is parsed starting with the last character going back using TRIE if the part of the suffix that has been parsed so far can match multiple suffixes. At some point, we either press the first character of one of the possible suffixes, which means that we have a match, we get into a dead end, which means that there are no more possible matches or getting into a situation where there is only one candidate suffix. In this case, we simply compare the remainder of the suffix.

Since trie lookups are constant time, the worst difficulty is o (maximum suffix length). The function turned out to be pretty fast. At the 2.8 GHz Core i5, it can check 33,000,000 lines per second for 2K possible suffixes. 2K suffixes for a total of 18 kilobytes, expanded to 320kb trie / state machine table. I suppose I could store it more efficiently, but this solution still works quite well.

Since the list of suffixes was so large, I did not want it in a code way manually, so I ended up writing a C # application that generated C code for the suffix check function:

public static uint GetFourBytes(string s, int index) { byte[] bytes = new byte[4] { 0, 0, 0, 0}; int len = Math.Min(s.Length - index, 4); Encoding.ASCII.GetBytes(s, index, len, bytes, 0); return BitConverter.ToUInt32(bytes, 0); } public static string ReverseString(string s) { char[] chars = s.ToCharArray(); Array.Reverse(chars); return new string(chars); } static StringBuilder trieArray = new StringBuilder(); static int trieArraySize = 0; static void Main(string[] args) { // read all non-empty lines from input file var suffixes = File .ReadAllLines(@"suffixes.txt") .Where(l => !string.IsNullOrEmpty(l)); var reversedSuffixes = suffixes .Select(s => ReverseString(s)); int start = CreateTrieNode(reversedSuffixes, ""); string outFName = @"checkStringSuffix.debug.h"; if (args.Length != 0 && args[0] == "--release") { outFName = @"checkStringSuffix.h"; } using (StreamWriter wrt = new StreamWriter(outFName)) { wrt.WriteLine( "#pragma once\n\n" + "#define TRIE_NONE -1000000\n"+ "#define TRIE_DONE -2000000\n\n" ); wrt.WriteLine("const int trieArray[] = {{{0}\n}};", trieArray); wrt.WriteLine( "inline bool checkSingleSuffix(const char* str, const char* curr, const int* trie) {\n"+ " int len = trie[0];\n"+ " if (curr - str < len) return false;\n"+ " const char* cmp = (const char*)(trie + 1);\n"+ " while (len-- > 0) {\n"+ " if (*--curr != *cmp++) return false;\n"+ " }\n"+ " return true;\n"+ "}\n\n"+ "bool checkStringSuffix(const char* str, int len) {\n" + " if (len < " + suffixes.Select(s => s.Length).Min().ToString() + ") return false;\n" + " const char* curr = (str + len - 1);\n"+ " int currTrie = " + start.ToString() + ";\n"+ " while (curr >= str) {\n" + " assert(*curr >= 0x20 && *curr <= 0x7f);\n" + " currTrie = trieArray[currTrie + *curr - 0x20];\n" + " if (currTrie < 0) {\n" + " if (currTrie == TRIE_NONE) return false;\n" + " if (currTrie == TRIE_DONE) return true;\n" + " return checkSingleSuffix(str, curr, trieArray - currTrie - 1);\n" + " }\n"+ " --curr;\n"+ " }\n" + " return false;\n"+ "}\n" ); } } private static int CreateTrieNode(IEnumerable<string> suffixes, string prefix) { int retVal = trieArraySize; if (suffixes.Count() == 1) { string theSuffix = suffixes.Single(); trieArray.AppendFormat("\n\t/* {1} - {2} */ {0}, ", theSuffix.Length, trieArraySize, prefix); ++trieArraySize; for (int i = 0; i < theSuffix.Length; i += 4) { trieArray.AppendFormat("0x{0:X}, ", GetFourBytes(theSuffix, i)); ++trieArraySize; } retVal = -(retVal + 1); } else { var groupByFirstChar = from s in suffixes let first = s[0] let remainder = s.Substring(1) group remainder by first; string[] trieIndexes = new string[0x60]; for (int i = 0; i < trieIndexes.Length; ++i) { trieIndexes[i] = "TRIE_NONE"; } foreach (var g in groupByFirstChar) { if (g.Any(s => s == string.Empty)) { trieIndexes[g.Key - 0x20] = "TRIE_DONE"; continue; } trieIndexes[g.Key - 0x20] = CreateTrieNode(g, g.Key + prefix).ToString(); } trieArray.AppendFormat("\n\t/* {1} - {2} */ {0},", string.Join(", ", trieIndexes), trieArraySize, prefix); retVal = trieArraySize; trieArraySize += 0x60; } return retVal; } 

So, it generates the code as follows:

  inline bool checkSingleSuffix(const char* str, const char* curr, const int* trie) { int len = trie[0]; if (curr - str < len) return false; const char* cmp = (const char*)(trie + 1); while (len-- > 0) { if (*--curr != *cmp++) return false; } return true; } bool checkStringSuffix(const char* str, int len) { if (len < 5) return false; const char* curr = (str + len - 1); int currTrie = 81921; while (curr >= str) { assert(*curr >= 0x20 && *curr <= 0x7f); currTrie = trieArray[currTrie + *curr - 0x20]; if (currTrie < 0) { if (currTrie == TRIE_NONE) return false; if (currTrie == TRIE_DONE) return true; return checkSingleSuffix(str, curr, trieArray - currTrie - 1); } --curr; } return false; } 

Since there was no more than 9 for my specific len data set in checkSingleSuffix, I tried to replace the comparison loop with the switches (len) and hard comparison programs that compared up to 8 bytes of data at a time, but this is not so, t affect the overall performance anyway.

Thanks to everyone who contributed their ideas!

0
source share

If the suffixes do not contain any extensions / rules (e.g. regex), you can create Trie suffixes in reverse order, and then match the string based on this. For example,

 suffixes: foo bar bao reverse order suffix trie: o -ab (matches bao) -of (matches foo) rab (matches bar) 

Then they can be used to match your string:

 "mystringfoo" -> reverse -> "oofgnirtsym" -> trie match -> foo suffix 
+11
source share

You mentioned that you only look at second-level domain names, so even without knowing the exact set of matching domains, you can extract the corresponding part of the input string.

Then just use the hash table. Measure it so that there are no collisions, so you do not need buckets; the search will be exactly O (1). For small hash types (e.g. 32 bits) you want to check if the lines really match. For a 64-bit hash, the probability that another domain that encounters one of the hashes in your table is already so low (of the order of 10 ^ -17) that you can probably live with it.

+4
source share

I would cancel all suffix lines, build their prefix tree, and then check the inverse of your IP string.

+3
source share

I think creating your own automata would be the most efficient way. This is a kind of second solution, according to which, starting with a finite set of suffixes, he creates an automaton installed for these suffixes.

I think you can easily use flex to take care of the reverse input or processing in a special way that you are only looking for suffixes (effective questions only).

By the way, using the Rabin-Karp approach would be effective as your suffixes would be short. You can set the hash with all the necessary suffixes, and then

  • take a string
  • accept suffix
  • calculate suffix hash
  • check if there is a suffix in the table
+2
source share

Just create an array of 26x26 from a set of domains. for example thisArray [0] [0] will be domains that end in 'aa', thisArray [0] [1] is all domains that end in 'ab, etc ...

After that, just find your array for this [2nd last char hostname] [last char hostname] array to get the possible domains. If there is more than one at this point, just go through the rest.

0
source share

I think the solution should be very different depending on the type of input lines. If strings are some type of string class that can be iterated from the end (for example, stl strings), this is much simpler than if they were C strings with NULL termination.

Row class

Iterate the line back (do not make a reverse copy - use some kind of reverse iterator). Create a Trie, where each node consists of two 64-bit words, one pattern and one bitmask. Then check 8 characters at a time in each level. The mask is used if you want to combine less than 8 characters - for example. deny "* .org" will give a mask with 32 bits. A mask is also used as completion criteria.

Lines C

Create an NDFA that matches the rows in one pass above them. This way, you do not need to iterate over to the end first, but instead use it in a single pass. NDFA can be converted to DFA, which is likely to make the implementation more efficient. Both NDFA design and DFA conversion are likely to be so complex that you will have to write tools for this.

0
source share

All Articles