Search speed for byte [] in byte [] and string in string - why is the latter faster?

Question

Search speed for byte [] in byte [] and string in string - why is the latter faster?

I have a task where I need to search for sequences in a file. When running the test application, I read the file as a string (File.ReadAllText) and used string.IndexOf to find the sequence. When I tried to implement the same algorithm with bytes (reading a file as a byte array and searching for a byte array in a byte array), I noticed that finding byte [] in byte [] is about 3 times slower than finding a string in a string, I be sure to check it and the exact same code using byte [] and others using a string three times as long to execute - for example, 16 seconds for a byte vs ~ 5 for a string.

To search for byte arrays, I used the methods described here byte [] search for array patterns . To search for strings, I used the built-in IndexOf function of the string class. Here is one of the IndexOf implementations for the byte [] I tried:

public int IndexOf(byte[] source, byte[] pattern, int startpos = 0) { int search_limit = source.Length - pattern.Length; for (int i = startpos; i < search_limit; i++) { if (source[i] == pattern[0]) { bool found = true; for (int j = 1; j < pattern.Length; j++) { if (source[i + j] != pattern[j]) { found = false; break; } } if (found) return i; } } return -1; }

Basically, finding the next match for bytes in bytes in bytes takes three times, while finding the next match for a sequence of characters in a string. My question is WHY?

Does anyone know how .Net processes strings in strings, what kind of optimization does it do, what algorithm does it use? Is there a faster algorithm than the one I use here? Maybe someone has an idea what I'm doing wrong here to take more time than necessary? I really don’t understand how searching a string in a string can be 3 times faster than finding bytes [] in bytes [] ...

UPDATE: I tried an unsafe algorithm as suggested. It was as follows:

 public static unsafe long IndexOfFast(byte[] Haystack, byte[] Needle, long startpos = 0) { long i = startpos; fixed (byte* H = Haystack) fixed (byte* N = Needle) { for (byte* hNext = H + startpos, hEnd = H + Haystack.LongLength; hNext < hEnd; i++, hNext++) { bool Found = true; for (byte* hInc = hNext, nInc = N, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ; if (Found) return i; } return -1; } } }

strange thing, it actually turned out to be twice as slow! I changed it to add my personal setting (checking the first letter before trying to iterate over the needle), and now it looks like this:

 public static unsafe long IndexOfFast(byte[] Haystack, byte[] Needle, long startpos = 0) { long i = startpos; fixed (byte* H = Haystack) fixed (byte* N = Needle) { for (byte* hNext = H + startpos, hEnd = H + Haystack.LongLength; hNext < hEnd; i++, hNext++) { if (*hNext == *N) { bool Found = true; for (byte* hInc = hNext+1, nInc = N+1, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ; if (Found) return i; } } return -1; } }

Now it takes exactly the same amount of time to execute in safe mode. My question is again - any ideas why? Isn’t it faster because it is unsafe and works with pointers compared to safe ones?

+7

string c # search byte bytearray

Isstrebitel May 31, '13 at 13:01

source share

2 answers

Eric Lippert · Answer 1 · 2013-05-31T16:41:22+0000

Basically, finding the next match for bytes in bytes in bytes takes three times, while finding the next match for a sequence of characters in a string. My question is WHY?

Because the string search algorithm has been highly optimized; it is written in hard unmanaged code that does not waste time checking the boundaries of an array. If you optimized the byte search algorithm in the same way, it would be just as fast; The string search algorithm uses the same naive algorithm that you use.

Your algorithm is fine - this is the standard “naive” search, and Kevin claims, despite this, the naive algorithm in practice is almost always the fastest according to typical data. A flaming array looking for bytes works incredibly fast on modern hardware. It depends on the size of your problem; if you're looking for a long strand of DNA in the middle of the human genome, then Boyer-Moore is totally worth the cost of pre-processing. If you are looking for 0xDEADBEEF in a file with 20 KB, you are not going to beat the naive algorithm if it is effectively implemented.

For maximum speed, you must disable the security system and write code using unsafe pointer arithmetic.

Kevin · Answer 2 · 2013-05-31T13:10:44+0000

Your search algorithm for bytes is extremely inefficient!

The original algorithm with which all other search strings are compared is Boyer-Moore . I would promise that search queries in .NET use his or his variant. There are others , but the Boyer-Moore implementation for bytes will give you much better performance.

Edit: SO to the rescue! Here is a simple C # Boyer-Moore implementation for byte arrays

Edit using temporary numbers: I really liked Eric's comments because I always heard that searching on the .Net string uses Boyer-Moore. I really appreciated the one who actually knew that I was saying this differently. It makes sense to think about it. I decided to make a choice of time to search for Boyer-Moore vs Naive and find out that Eric is absolutely right for a small needle and a small haystack, a naive search more than 13 times faster. However, I was surprised, although the breakeven point was much lower than I expected. Boyer-Moore significantly improves the size of the pattern (or the size of the needle in my timings), so the larger the pattern you are looking for, the faster it is looking. For an 8-byte needle, Naive searches against Boyer-Moore were linked by a search through a haystack of 500-600 bytes. For a larger haystack, Boyer-Moore is significantly better, especially with a larger needle. For a stack of hay 20 kB and a 64-byte needle, Boyer-Moore was 10 times faster.

Full temporary numbers below for all interested.

All tests used the simple Boyer-Moore, linked above, and the naive byte search algorithm Op, which he published after iterating through 1M.

 1000000 iterations, haystack size = 20 bytes, needle size = 8 bytes 20ms total : Naive Search 268ms total : Boyer-Moore Search 1000000 iterations, haystack size = 600 bytes, needle size = 8 bytes 608ms total : Naive Search 582ms total : Boyer-Moore Search 1000000 iterations, haystack size = 2000 bytes, needle size = 8 bytes 2011ms total : Naive Search 1339ms total : Boyer-Moore Search 1000000 iterations, haystack size = 2000 bytes, needle size = 32 bytes 1865ms total : Naive Search 563ms total : Boyer-Moore Search 1000000 iterations, haystack size = 2000 bytes, needle size = 64 bytes 1883ms total : Naive Search 466ms total : Boyer-Moore Search 1000000 iterations, haystack size = 20000 bytes, needle size = 8 bytes 18899ms total : Naive Search 10753ms total : Boyer-Moore Search 1000000 iterations, haystack size = 20000 bytes, needle size = 32 bytes 18639ms total : Naive Search 3114ms total : Boyer-Moore Search 1000000 iterations, haystack size = 20000 bytes, needle size = 64 bytes 18866ms total : Naive Search 1807ms total : Boyer-Moore Search

Search speed for byte [] in byte [] and string in string - why is the latter faster?

More articles: