Parsing a large data file from disk is much slower than parsing in memory?

When writing a simple library for parsing game data files, I noticed that reading the entire data file into memory and parsing it there was much faster (up to 15x: 106 s vs. 7 s).

Parsing is usually sequential, but from time to time a seek is made to read some data stored elsewhere in the file, linked by an offset.

I understand that parsing from memory will certainly be faster, but something seems wrong when the difference is this large. I wrote some code to simulate this:

using System;
using System.Diagnostics;
using System.IO;

public static void Main(string[] args)
{
    Stopwatch n = new Stopwatch();
    n.Start();

    byte[] b = File.ReadAllBytes(@"D:\Path\To\Large\File");
    using (MemoryStream s = new MemoryStream(b, false))
        RandomRead(s);
    n.Stop();
    Console.WriteLine("Memory read done in {0}.", n.Elapsed);

    b = null;
    n.Reset();
    n.Start();

    using (FileStream s = File.Open(@"D:\Path\To\Large\File", FileMode.Open))
        RandomRead(s);
    n.Stop();
    Console.WriteLine("File read done in {0}.", n.Elapsed);

    Console.ReadLine();
}

private static void RandomRead(Stream s)
{
    // simulate a mostly sequential, but sometimes random, read
    using (BinaryReader br = new BinaryReader(s))
    {
        long l = s.Length;
        Random r = new Random();
        int c = 0;
        while (l > 0)
        {
            l -= br.ReadBytes(r.Next(1, 5)).Length;
            if (c++ <= r.Next(10, 15))
                continue;

            // simulate seeking
            long o = s.Position;
            s.Position = r.Next(0, (int)s.Length);
            l -= br.ReadBytes(r.Next(1, 5)).Length;
            s.Position = o;
            c = 0;
        }
    }
}

As input I used one of the game data files. The file was about 102 MB and produced this result: Memory read done in 00:00:03.3092618. File read done in 00:00:32.6495245. Reading from memory was about ten times faster than reading from the file.

The memory read was performed before the file read, in the hope that the OS file cache would speed up the subsequent file read. It is still much slower.

I tried increasing and decreasing the FileStream buffer size; nothing gave significantly better results, and going too far in either direction only made things slower.
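For reference, a minimal sketch of how an explicit buffer size can be passed when opening the FileStream; the OpenForParsing helper name, the 64 KB value, and the SequentialScan hint are illustrative assumptions, not settings from the original test:

using System.IO;

static class BufferSizeSketch
{
    // Opens the file for sequential reading with an explicit buffer size.
    // 64 KB and FileOptions.SequentialScan are example choices only.
    public static FileStream OpenForParsing(string path)
    {
        return new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read,
                              bufferSize: 64 * 1024, options: FileOptions.SequentialScan);
    }
}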

Is there something I'm doing wrong, or is this to be expected? Is there any way to at least make the slowdown less significant?

Why is reading the whole file at once and then parsing it so much faster than reading and parsing it at the same time?

I did compare with a similar library written in C++ that uses the native Windows CreateFileMapping and MapViewOfFile APIs to read files, and it is very fast. Could the constant switching from managed to unmanaged code and the marshaling involved be what causes this?

I also tried MemoryMappedFile in .NET 4; the gain was only one second.

+4
3 answers

Is there something I'm doing wrong, or is this to be expected?

You are not doing anything wrong; this is entirely expected. That disk access is an order of magnitude slower than memory access is more than reasonable.


Update:

That reading the file once and then processing it is faster than processing it while reading is also expected.

Performing one large I/O operation and then processing in memory will be faster than fetching a bit from the disk, processing it, hitting the disk again (and waiting for the I/O to complete), processing that bit, and so on.
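A minimal sketch of that pattern, assuming a hypothetical process callback and an arbitrary 1 MB block size (neither comes from the original answer):

using System;
using System.IO;

static class ChunkedReadSketch
{
    // Reads the file in large blocks and hands each block to an in-memory
    // parsing step, instead of issuing many tiny reads against the disk.
    public static void ProcessInLargeBlocks(string path, Action<byte[], int> process)
    {
        byte[] block = new byte[1 << 20]; // 1 MB per I/O operation (example size)
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                              FileShare.Read, block.Length, FileOptions.SequentialScan))
        {
            int read;
            while ((read = fs.Read(block, 0, block.Length)) > 0)
                process(block, read); // all parsing happens in memory
        }
    }
}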

+3

Is there something I'm doing wrong, or is this to be expected?

Compared to RAM, a hard drive has a huge access time. Sequential reads are fairly fast, but as soon as the head has to move (because the data is fragmented), it takes many milliseconds to fetch the next bit of data, during which your application is idle.

Is there any way to at least make the slowdown less significant?

Buy an SSD.

You could also look at memory-mapped files for .NET:

MemoryMappedFile.CreateFromFile().
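A minimal usage sketch of that API (requires .NET 4 and System.IO.MemoryMappedFiles); the ReadSomeValues name, the path, and the offsets read here are placeholders:

using System;
using System.IO.MemoryMappedFiles;

static class MemoryMappedSketch
{
    // Maps the file and reads a couple of values by offset.
    public static void ReadSomeValues(string path)
    {
        using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(path))
        using (MemoryMappedViewAccessor view = mmf.CreateViewAccessor())
        {
            byte firstByte = view.ReadByte(0);     // random access by offset, no Seek calls
            int someValue = view.ReadInt32(1024);  // the OS pages data in on demand
            Console.WriteLine("{0} {1}", firstByte, someValue);
        }
    }
}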


As for your edit: I would go along with @Oded and say that reading the whole file beforehand adds a penalty. However, that alone should not make the method that first reads the entire file seven times faster than process-as-you-read.

+2

I decided to do some tests comparing different ways of reading a file in C++ and C#. First I created a 256 MB file. In the C++ tests, "buffered" means that I first copied the entire file into a buffer and then read the data from that buffer. All benchmarks, directly or indirectly, read the file bytes sequentially and compute a checksum. All times are measured from the moment the file is opened until I am completely done and the file is closed. All tests were run several times to allow for OS file caching.

C++
Memory-mapped file, unbuffered: 300 ms
Memory-mapped file, buffered: 500 ms
fread, unbuffered: 23,000 ms
fread, buffered: 500 ms
ifstream, unbuffered: 26,000 ms
ifstream, buffered: 700 ms

C#
MemoryMappedFile: 112,000 ms
FileStream: 2,800 ms
MemoryStream: 2,300 ms
ReadAllBytes: 600 ms

Interpret the data as you like. C#'s memory-mapped files are slower than even the slowest C++ code, while C++'s memory-mapped files are the fastest of all. C#'s ReadAllBytes is decently fast, only about twice as slow as C++'s memory-mapped files. So if you want the best performance, I recommend using ReadAllBytes and then accessing the data directly from the array, without a stream.
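A minimal sketch of that recommendation; the Checksum helper and the simple additive checksum are only stand-ins for whatever parsing you actually do, not part of the original benchmark:

using System.IO;

static class ReadAllBytesSketch
{
    // Reads the whole file once, then works on the byte[] directly.
    public static long Checksum(string path)
    {
        byte[] data = File.ReadAllBytes(path);
        long sum = 0;
        for (int i = 0; i < data.Length; i++)
            sum += data[i]; // direct array access, no Stream or BinaryReader overhead
        return sum;
    }
}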

0

Source: https://habr.com/ru/post/1411776/

