This is just a personal project I'm digging into. Basically, I'm parsing a text file (anywhere from 20 MB to 1 GB) using StreamReader. The performance is pretty solid, but still... I was itching to see what would happen if I parsed it in binary form. Don't get me wrong, I'm not prematurely optimizing; I'm deliberately micro-optimizing just to "see."
So, I'm reading the text file via byte arrays. It turns out newlines can be the (Windows) standard CR/LF, or CR alone, or LF alone... pretty messy. I was hoping to use Array.IndexOf to find a CR and then skip past any following LF. Instead I found myself writing code very similar to IndexOf itself, scanning byte by byte and returning the line arrays as needed.
Bottom line: even with code very similar to IndexOf, my version ends up insanely slower. To put it in perspective, against an 800 MB file:
- Using Array.IndexOf to search for CR (then skipping the LF): ~320 MB/s
- Using StreamReader and ReadLine: ~180 MB/s
- Using a for loop that replicates the IndexOf scan: ~150 MB/s
Here is the code with the for loop (~150 MB/s):
IEnumerator<byte[]> IEnumerable<byte[]>.GetEnumerator()
{
    using (FileStream fs = new FileStream(_path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, _bufferSize))
    {
        byte[] buffer = new byte[_bufferSize];
        int bytesRead;
        int overflowCount = 0;
        while ((bytesRead = fs.Read(buffer, overflowCount, buffer.Length - overflowCount)) > 0)
        {
            int bufferLength = bytesRead + overflowCount;
            int lastPos = 0;
            for (int i = 0; i < bufferLength; i++)
            {
                // treat either CR (13) or LF (10) as a line terminator
                if (buffer[i] == 13 || buffer[i] == 10)
                {
                    int length = i - lastPos;
                    if (length > 0)
                    {
                        byte[] line = new byte[length];
                        Array.Copy(buffer, lastPos, line, 0, length);
                        yield return line;
                    }
                    lastPos = i + 1;
                }
            }
            // carry the partial line at the end of the buffer over to the next read
            if (lastPos > 0)
            {
                overflowCount = bufferLength - lastPos;
                Array.Copy(buffer, lastPos, buffer, 0, overflowCount);
            }
        }
    }
}
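For context, the enumerator lives in a small wrapper class and I consume it with a plain foreach. Here's roughly how (a sketch; BinaryLineReader is a stand-in name for the wrapper class, which isn't shown above; it just stores _path and _bufferSize from its constructor):

// BinaryLineReader is a hypothetical name for the class that holds
// the GetEnumerator above, constructed from a path and a buffer size.
var lines = new BinaryLineReader(@"C:\data\input.txt", 128 * 1024);
long count = 0;
foreach (byte[] line in lines)  // each yielded array is one line, CR/LF stripped
    count++;
Console.WriteLine(count);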
And here is the faster IndexOf-based block (~320 MB/s):
while ((bytesRead = fs.Read(buffer, overflowCount, buffer.Length - overflowCount)) > 0)
{
    int bufferLength = bytesRead + overflowCount;
    int pos = 0;
    int lastPos = 0;
    // let Array.IndexOf find each CR for us
    while (pos < bufferLength && (pos = Array.IndexOf<byte>(buffer, 13, pos)) != -1)
    {
        int length = pos - lastPos;
        if (length > 0)
        {
            byte[] line = new byte[length];
            Array.Copy(buffer, lastPos, line, 0, length);
            yield return line;
        }
        // if an LF immediately follows the CR, skip it too
        if (pos < bufferLength - 1 && buffer[pos + 1] == 10)
            pos++;
        lastPos = ++pos;
    }
    // carry the partial line at the end of the buffer over to the next read
    if (lastPos > 0)
    {
        overflowCount = bufferLength - lastPos;
        Array.Copy(buffer, lastPos, buffer, 0, overflowCount);
    }
}
(No, this is not production-ready; certain cases will make it blow up. I use a 128 KB buffer to paper over most of them.)
So my big question is: why is Array.IndexOf so much faster? It is essentially the same thing, a for loop walking through an array. Is there something special about how mscorlib code gets executed? Even modifying the loop above to truly replicate IndexOf (searching only for CR and then skipping the LF, just as I do in the IndexOf version) doesn't help. Errr... I've been through various permutations, and it's late enough that maybe there's some glaring bug I'm missing?
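To take file I/O out of the picture, I'd reduce the comparison to something like this (a minimal sketch of a repro, not my actual harness; the buffer size and CR spacing are made up):

using System;
using System.Diagnostics;

class IndexOfBench
{
    static void Main()
    {
        // Fill a buffer with 'a' bytes and a CR every 64 bytes,
        // roughly mimicking line lengths in a text file.
        byte[] buffer = new byte[128 * 1024 * 1024];
        for (int i = 0; i < buffer.Length; i++)
            buffer[i] = (byte)(i % 64 == 63 ? 13 : 97);

        // scan via Array.IndexOf
        var sw = Stopwatch.StartNew();
        int pos = 0, hits = 0;
        while ((pos = Array.IndexOf<byte>(buffer, 13, pos)) != -1)
        {
            hits++;
            pos++;
        }
        sw.Stop();
        Console.WriteLine("IndexOf:  {0} hits in {1} ms", hits, sw.ElapsedMilliseconds);

        // scan via a manual for loop
        sw.Restart();
        hits = 0;
        for (int i = 0; i < buffer.Length; i++)
            if (buffer[i] == 13)
                hits++;
        sw.Stop();
        Console.WriteLine("for loop: {0} hits in {1} ms", hits, sw.ElapsedMilliseconds);
    }
}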
By the way, I looked inside ReadLine and noticed it uses a switch block rather than an if block... strangely, when I do the same thing, it improves throughput by about 15 MB/s. That's another question for another time (why is switch faster than if?), but I figured I'd mention that I did look into it.
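The change was roughly of this shape (a sketch, not my exact code):

// the if-based check I started with
static bool IsNewlineIf(byte b)
{
    return b == 13 || b == 10;
}

// the switch-based version, closer to what ReadLine does internally
static bool IsNewlineSwitch(byte b)
{
    switch (b)
    {
        case 13:
        case 10:
            return true;
        default:
            return false;
    }
}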
Also, I'm testing release builds outside of Visual Studio, so there's no debugger attached.