I have a task where I need to search for sequences in a file. When running the test application, I read the file as a string (File.ReadAllText) and used string.IndexOf to find the sequence. When I tried to implement the same algorithm with bytes (reading a file as a byte array and searching for a byte array in a byte array), I noticed that finding byte [] in byte [] is about 3 times slower than finding a string in a string, I be sure to check it and the exact same code using byte [] and others using a string three times as long to execute - for example, 16 seconds for a byte vs ~ 5 for a string.
To search for byte arrays, I used the methods described here byte [] search for array patterns . To search for strings, I used the built-in IndexOf function of the string class. Here is one of the IndexOf implementations for the byte [] I tried:
public int IndexOf(byte[] source, byte[] pattern, int startpos = 0) { int search_limit = source.Length - pattern.Length; for (int i = startpos; i < search_limit; i++) { if (source[i] == pattern[0]) { bool found = true; for (int j = 1; j < pattern.Length; j++) { if (source[i + j] != pattern[j]) { found = false; break; } } if (found) return i; } } return -1; }
Basically, finding the next match for bytes in bytes in bytes takes three times, while finding the next match for a sequence of characters in a string. My question is WHY?
Does anyone know how .Net processes strings in strings, what kind of optimization does it do, what algorithm does it use? Is there a faster algorithm than the one I use here? Maybe someone has an idea what I'm doing wrong here to take more time than necessary? I really donβt understand how searching a string in a string can be 3 times faster than finding bytes [] in bytes [] ...
UPDATE: I tried an unsafe algorithm as suggested. It was as follows:
public static unsafe long IndexOfFast(byte[] Haystack, byte[] Needle, long startpos = 0) { long i = startpos; fixed (byte* H = Haystack) fixed (byte* N = Needle) { for (byte* hNext = H + startpos, hEnd = H + Haystack.LongLength; hNext < hEnd; i++, hNext++) { bool Found = true; for (byte* hInc = hNext, nInc = N, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ; if (Found) return i; } return -1; } } }
strange thing, it actually turned out to be twice as slow! I changed it to add my personal setting (checking the first letter before trying to iterate over the needle), and now it looks like this:
public static unsafe long IndexOfFast(byte[] Haystack, byte[] Needle, long startpos = 0) { long i = startpos; fixed (byte* H = Haystack) fixed (byte* N = Needle) { for (byte* hNext = H + startpos, hEnd = H + Haystack.LongLength; hNext < hEnd; i++, hNext++) { if (*hNext == *N) { bool Found = true; for (byte* hInc = hNext+1, nInc = N+1, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ; if (Found) return i; } } return -1; } }
Now it takes exactly the same amount of time to execute in safe mode. My question is again - any ideas why? Isnβt it faster because it is unsafe and works with pointers compared to safe ones?