To test compression, I need to be able to create large files, ideally in text, binary and mixed formats.
- The contents of the files should be neither completely random nor uniform. A binary file of all zeros is no good. A binary file of completely random data is not good either. For text, a file of completely random ASCII sequences is no good: text files should have patterns and frequencies that mimic natural language or source code (XML, C#, etc.). Pseudo-real text.
- The size of each individual file is not critical, but for the set of files I need the total to be ~8 GB.
- I would like to keep the number of files at a manageable level, say on the order of 10.
To create binary files, I can allocate a large buffer, fill it with System.Random.NextBytes, and then call FileStream.Write in a loop, for example:
    Int64 bytesRemaining = size;
    byte[] buffer = new byte[sz];
    using (Stream fileStream = new FileStream(Filename, FileMode.Create, FileAccess.Write))
    {
        while (bytesRemaining > 0)
        {
            int sizeOfChunkToWrite = (bytesRemaining > buffer.Length) ? buffer.Length : (int)bytesRemaining;
            if (!zeroes) _rnd.NextBytes(buffer);
            fileStream.Write(buffer, 0, sizeOfChunkToWrite);
            bytesRemaining -= sizeOfChunkToWrite;
        }
        fileStream.Close();
    }
With a sufficiently large buffer, say 512 KB, this is relatively fast, even for files larger than 2 or 3 GB. But the content is completely random, which is not what I want.
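To illustrate what "neither random nor uniform" could mean for the binary case, here is a minimal sketch that starts from a random buffer and then flattens some stretches of it into runs of a single byte value. The class name `SemiRandomBuffer`, the method `Fill`, the `maxRunLength` parameter and the 50/50 split between noisy and flat runs are all assumptions made for this illustration, and the result is not claimed to resemble any real binary format.

```csharp
using System;

// Hypothetical helper, not part of the original code.
static class SemiRandomBuffer
{
    // Fills the buffer with random bytes, then overwrites roughly half of the
    // runs so that each of those runs repeats a single byte value.
    public static void Fill(byte[] buffer, Random rnd, int maxRunLength)
    {
        if (maxRunLength < 1)
            throw new ArgumentOutOfRangeException(nameof(maxRunLength));

        rnd.NextBytes(buffer); // start from fully random data

        int pos = 0;
        while (pos < buffer.Length)
        {
            // Pick a run length in [1, maxRunLength], clipped to the buffer end.
            int run = Math.Min(rnd.Next(1, maxRunLength + 1), buffer.Length - pos);

            // About half of the runs become a repeated byte (compressible);
            // the others keep their random content (incompressible).
            if (rnd.Next(2) == 0)
            {
                byte value = buffer[pos];
                for (int i = 1; i < run; i++)
                    buffer[pos + i] = value;
            }
            pos += run;
        }
    }
}
```

In the loop above, `SemiRandomBuffer.Fill(buffer, _rnd, 256)` would take the place of `_rnd.NextBytes(buffer)`. The ratio of flat to noisy runs controls how compressible the output is.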
For text files, the approach I have taken is to use Lorem Ipsum and write it repeatedly to a text file through a StreamWriter. The content is non-random and non-uniform, but it has many identical repeated blocks, which is unnatural. Also, because the Lorem Ipsum block is so small (<1 KB), it takes many loop iterations and a very, very long time.
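The text approach described above can be sketched roughly as follows. The class name `RepeatedTextWriter`, the method `WriteRepeated`, the ASCII encoding and the 64 KB writer buffer are assumptions for this sketch, not the original code. A larger writer buffer may reduce the per-call overhead, but it does nothing about the repeated-block problem.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original code.
static class RepeatedTextWriter
{
    // Writes `block` again and again until exactly `targetChars` characters
    // have been written. With ASCII encoding, one character is one byte.
    public static long WriteRepeated(string path, string block, long targetChars)
    {
        if (string.IsNullOrEmpty(block))
            throw new ArgumentException("block must not be empty", nameof(block));

        // Convert once so the loop does not allocate on every iteration.
        char[] chars = block.ToCharArray();
        long written = 0;

        using (var writer = new StreamWriter(path, false, Encoding.ASCII, 1 << 16))
        {
            while (written < targetChars)
            {
                // The last block is truncated so the file hits the target size.
                int count = (int)Math.Min(chars.Length, targetChars - written);
                writer.Write(chars, 0, count);
                written += count;
            }
        }
        return written;
    }
}
```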
Neither approach is quite satisfactory to me.
I have seen the answers to "Quickly create a large file on a Windows system?". Those approaches are very fast, but I think they just fill the file with zeros or random data, neither of which is what I want. I have no problem with launching an external process such as contig or fsutil, if necessary.
Testing is done on Windows.
Rather than creating new files, does it make more sense to just use files that already exist on the file system? I don't know of any that are big enough.
What about starting from a single existing file (perhaps c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterprisesec.config.cch for a text file) and replicating its contents many times? This would work with either a text or a binary file.
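A minimal sketch of that replication idea, assuming the seed file is small enough to hold in memory. The class name `FileReplicator`, the method `Replicate` and the 1 MB stream buffer are assumptions for illustration. Because it works on raw bytes, it treats text and binary seeds the same way, but it shares the drawback of the Lorem Ipsum approach: the output is made of identical repeated blocks.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the original code.
static class FileReplicator
{
    // Copies the bytes of `sourcePath` into `destPath` repeatedly until
    // exactly `targetBytes` bytes have been written.
    public static long Replicate(string sourcePath, string destPath, long targetBytes)
    {
        // Read the seed once; the loop then only copies from memory.
        byte[] source = File.ReadAllBytes(sourcePath);
        if (source.Length == 0)
            throw new ArgumentException("source file is empty", nameof(sourcePath));

        long written = 0;
        using (var output = new FileStream(destPath, FileMode.Create, FileAccess.Write,
                                           FileShare.None, 1 << 20))
        {
            while (written < targetBytes)
            {
                // The final copy is truncated to land on the target size.
                int count = (int)Math.Min(source.Length, targetBytes - written);
                output.Write(source, 0, count);
                written += count;
            }
        }
        return written;
    }
}
```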
I currently have an approach that works, but it takes too long to run.
Has anyone else solved this?
Is there a faster way to write a text file than through StreamWriter?
Suggestions?
EDIT: I like the idea of a Markov chain to produce more natural text. The speed problem still needs to be solved, though.
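For reference, a word-level, first-order Markov chain can be sketched as below. This is only an illustration of the idea under simple assumptions (whitespace tokenization, one word of context, a random restart when a word has no recorded follower); the names `MarkovText`, `BuildChain` and `Generate` are hypothetical, and the sketch does not address the speed question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the original code.
static class MarkovText
{
    // Maps each word of the seed text to the list of words that follow it.
    // Repeated followers are kept, so frequent pairs are picked more often.
    public static Dictionary<string, List<string>> BuildChain(string seedText)
    {
        string[] words = seedText.Split(new[] { ' ', '\r', '\n', '\t' },
                                        StringSplitOptions.RemoveEmptyEntries);
        var chain = new Dictionary<string, List<string>>();
        for (int i = 0; i + 1 < words.Length; i++)
        {
            List<string> followers;
            if (!chain.TryGetValue(words[i], out followers))
            {
                followers = new List<string>();
                chain[words[i]] = followers;
            }
            followers.Add(words[i + 1]);
        }
        return chain;
    }

    // Writes space-separated words to `output` until at least `targetChars`
    // characters have been written; returns the number actually written.
    public static long Generate(Dictionary<string, List<string>> chain,
                                TextWriter output, long targetChars, Random rnd)
    {
        if (chain.Count == 0)
            throw new ArgumentException("seed text needs at least two words", nameof(chain));

        var starts = new List<string>(chain.Keys);
        string current = starts[rnd.Next(starts.Count)];
        long written = 0;

        while (written < targetChars)
        {
            output.Write(current);
            output.Write(' ');
            written += current.Length + 1;

            // Follow the chain; restart at a random word on a dead end.
            List<string> followers;
            current = chain.TryGetValue(current, out followers)
                ? followers[rnd.Next(followers.Count)]
                : starts[rnd.Next(starts.Count)];
        }
        return written;
    }
}
```

The `output` argument can be a StreamWriter over the target file. The seed text could be any reasonably large sample of natural language or source code.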