It seems to me that you will need to load the file into memory if you want to avoid IO contention between the threads. The operating system will do some buffering, but if you find that this is not enough, you will have to do it yourself.
Do you really need 32 threads? Presumably you don't have many cores, so use fewer threads and you will get fewer context switches, etc.
Do all your threads process the file from start to finish? If so, can you split the file into chunks? Read the first (say) 10 MB of data into memory, let all the threads process it, then move on to the next 10 MB, and so on.
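A minimal sketch of that chunked approach, assuming each thread's work can be expressed as a handler that sees the blocks in file order. ChunkedProcessor, BlockHandler and process are hypothetical names made up for illustration, and the block size is whatever you pass in (10 MB would be 10 * 1024 * 1024):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

class ChunkedProcessor {
    /** Hypothetical per-thread work: called once per block, in file order. */
    interface BlockHandler {
        void handle(byte[] data, int length);
    }

    static void process(InputStream in, int blockSize,
                        List<BlockHandler> handlers, ExecutorService pool)
            throws IOException, InterruptedException, ExecutionException {
        byte[] block = new byte[blockSize];
        int filled;
        while ((filled = readBlock(in, block)) > 0) {
            final int length = filled;
            List<Callable<Void>> tasks = new ArrayList<>();
            for (BlockHandler h : handlers) {
                tasks.add(() -> { h.handle(block, length); return null; });
            }
            // invokeAll returns only once every task has completed, so the
            // single buffer can safely be refilled on the next iteration.
            for (Future<Void> f : pool.invokeAll(tasks)) {
                f.get(); // rethrows any failure from a handler
            }
        }
    }

    /** Fills the buffer as far as possible; returns 0 at end of stream. */
    static int readBlock(InputStream in, byte[] block) throws IOException {
        int total = 0;
        while (total < block.length) {
            int n = in.read(block, total, block.length - total);
            if (n < 0) break;
            total += n;
        }
        return total;
    }
}
```

Only one thread ever touches the disk here, and the handlers treat the shared buffer as read-only. For the pool, something like Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()) keeps the thread count in line with the number of cores. Note that a record straddling a block boundary would need extra handling that this sketch does not attempt.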
If this does not work for you, how much memory do you have compared with the file size? If you have plenty of memory but do not want to allocate one huge array, you can still read the entire file into memory, just into many separate smaller byte arrays. You then need to write an input stream that spans all these byte arrays, but that should be doable.
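Such a stream might look roughly like the sketch below. MultiArrayInputStream is a hypothetical name, not a JDK class, and the code assumes the arrays have already been filled from the file and are never modified afterwards:

```java
import java.io.InputStream;
import java.util.List;

/** Hypothetical stream that reads sequentially across several byte arrays. */
class MultiArrayInputStream extends InputStream {
    private final List<byte[]> chunks;
    private int chunkIndex = 0; // which array we are currently in
    private int offset = 0;     // next position within that array

    MultiArrayInputStream(List<byte[]> chunks) {
        this.chunks = chunks;
    }

    /** Skips exhausted (or empty) arrays; returns false at end of data. */
    private boolean advance() {
        while (chunkIndex < chunks.size()
                && offset >= chunks.get(chunkIndex).length) {
            chunkIndex++;
            offset = 0;
        }
        return chunkIndex < chunks.size();
    }

    @Override
    public int read() {
        if (!advance()) return -1;
        return chunks.get(chunkIndex)[offset++] & 0xFF; // unsigned byte value
    }

    @Override
    public int read(byte[] dest, int destOff, int len) {
        if (destOff < 0 || len < 0 || len > dest.length - destOff) {
            throw new IndexOutOfBoundsException();
        }
        if (len == 0) return 0;
        if (!advance()) return -1;
        // Copy at most up to the end of the current array; callers are
        // expected to loop, as with any InputStream.
        byte[] current = chunks.get(chunkIndex);
        int n = Math.min(len, current.length - offset);
        System.arraycopy(current, offset, dest, destOff, n);
        offset += n;
        return n;
    }
}
```

Each thread would create its own MultiArrayInputStream over the same list, so every thread has an independent read position while the data itself is held in memory only once. If you would rather not write a class, chaining ByteArrayInputStream instances with java.io.SequenceInputStream should give a similar effect.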
Jon Skeet