What is the optimal (fastest) way to parse a large (>4 GB) text file with many lines?

I am trying to determine the fastest way to read a large text file with many lines, do some processing on each line, and write the results to a new file. StreamReader in C#/.NET seemed like it would be the quickest way to do this, but when I use it to read the file line by line, it runs at about 1/3 of Python's I/O speed (which bothers me, because I keep hearing that Python 2.6's I/O was relatively slow).

If there is no faster .NET solution for this, would it be possible to write something faster than StreamReader by hand, or does it already use buffering and other optimizations sophisticated enough that I shouldn't hope to beat it?
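For reference, a minimal sketch of the kind of line-by-line loop being described; the file names and the per-line processing are placeholders, not the actual code in question:

    using System.IO;

    class LineByLine
    {
        static void Main()
        {
            using (var reader = new StreamReader("input.txt"))   // placeholder path
            using (var writer = new StreamWriter("output.txt"))  // placeholder path
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // ... do some processing on `line` here ...
                    writer.WriteLine(line);
                }
            }
        }
    }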

+6
c# text parsing buffer
7 answers

Do you have sample code of what you are doing, or of the format of the file you are reading?

Another good question: how much of the stream do you store in memory at a time?

+3

StreamReader is pretty good. How were you reading the file in Python? It is possible that specifying a simpler encoding (e.g. ASCII) will speed things up. How much CPU is the process using?

You can increase the buffer size by using the appropriate StreamReader constructor, but I have no idea how much difference that is likely to make.
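For example, a sketch combining both suggestions; the 1 MB buffer size and the ASCII encoding are only guesses to experiment with, not known-good values:

    using System.IO;
    using System.Text;

    // StreamReader(path, encoding, detectEncodingFromByteOrderMarks, bufferSize)
    var reader = new StreamReader(
        "input.txt",      // placeholder path
        Encoding.ASCII,   // simpler encoding: no multi-byte decoding work
        false,            // skip BOM detection
        1 << 20);         // 1 MB buffer instead of the default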

+2

If your own code examines one character at a time, you want to use a sentinel to mark the end of a buffer or the end of the file, so that you have only one test in your inner loop. In your case that one test will be for end-of-line, so you want to temporarily stick a newline at the end of each buffer, for example.

The Wikipedia article on sentinels is no help at all; it does not describe this case. You can find a description in any of Robert Sedgewick's algorithms textbooks.

You might also look at re2c, which can generate very fast code for scanning text data. It generates C code, but you may be able to adapt it, and you can certainly learn the technique by reading their paper on re2c.
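A rough sketch of the sentinel idea in C# (the chunk size and the line-counting task are invented for illustration): after each read, a newline is written just past the valid bytes, so the inner scan needs only one test per character instead of also checking for end-of-buffer.

    using System;
    using System.IO;

    class SentinelScan
    {
        const int ChunkSize = 1 << 20; // 1 MB chunks; tune for your data

        static long CountLines(string path)
        {
            long lines = 0;
            var buffer = new byte[ChunkSize + 1]; // one extra byte reserved for the sentinel

            using (var stream = File.OpenRead(path))
            {
                int read;
                while ((read = stream.Read(buffer, 0, ChunkSize)) > 0)
                {
                    buffer[read] = (byte)'\n'; // sentinel: guarantees the inner loop stops

                    int i = 0;
                    while (true)
                    {
                        while (buffer[i] != (byte)'\n') i++; // single test per character
                        if (i == read) break;                // hit the sentinel: chunk done
                        lines++;                             // a real newline
                        i++;
                    }
                }
            }
            return lines;
        }
    }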

+2

General note:

  • High-performance streaming itself is not the complicated part; what usually has to change is the logic that consumes the streamed data, and that is where the work is.

That is really all there is to it.

0

Sorry, I'm no .NET guru, but in C/C++, given nice big buffers, you should be able to parse with an LL(1) parser at not much slower than you can scan the bytes. I can give more details if you want.
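To illustrate the idea in C# (the asker's context): a toy hand-written LL(1)-style parser working directly over a byte buffer, deciding what to do next from a single byte of lookahead. The grammar here (comma- or newline-separated unsigned integers) is invented purely for the example.

    using System;
    using System.IO;

    class BufferParse
    {
        static void Main()
        {
            // Note: a single byte[] is capped at 2 GB in .NET, so a real
            // > 4 GB file would have to be processed in chunks.
            byte[] buf = File.ReadAllBytes("numbers.txt"); // hypothetical input
            int pos = 0;
            long sum = 0;

            while (pos < buf.Length)
            {
                // number := digit { digit }   (decided by one byte of lookahead)
                long value = 0;
                while (pos < buf.Length && buf[pos] >= (byte)'0' && buf[pos] <= (byte)'9')
                    value = value * 10 + (buf[pos++] - (byte)'0');
                sum += value;

                // separator := ',' | '\n'    (consume one)
                if (pos < buf.Length) pos++;
            }
            Console.WriteLine(sum);
        }
    }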

0

Try BufferedReader and BufferedWriter to speed up processing.
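BufferedReader and BufferedWriter are Java classes, not .NET ones; the closest .NET analogue would be wrapping the file streams in BufferedStream, roughly as below. Buffer sizes are guesses, and since StreamReader/StreamWriter already buffer internally, this may not help much.

    using System.IO;

    class Buffered
    {
        static void Main()
        {
            using (var reader = new StreamReader(
                       new BufferedStream(File.OpenRead("input.txt"), 1 << 16)))
            using (var writer = new StreamWriter(
                       new BufferedStream(File.Create("output.txt"), 1 << 16)))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                    writer.WriteLine(line);
            }
        }
    }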

0

The default buffer sizes used by StreamReader/FileStream may not be optimal for the record lengths in your data, so you can try tweaking them. Both FileStream and the StreamReader that wraps it have constructors that let you override the default buffer length; you should probably make them the same size.
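For instance (64 KB is only an example value to experiment with, not a recommendation):

    using System.IO;
    using System.Text;

    const int BufSize = 1 << 16; // 64 KB for both layers

    var stream = new FileStream("input.txt", FileMode.Open, FileAccess.Read,
                                FileShare.Read, BufSize);
    var reader = new StreamReader(stream, Encoding.UTF8, false, BufSize);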

0
