How do I quickly create large (> 1 GB) text + binary files with "natural" content? (C#)

To test compression, I need to be able to create large files, ideally in text, binary and mixed formats.

  • The contents of the files should be neither completely random nor completely uniform.
    A binary file that is all zeroes is no good. A binary file with totally random data is also not very good. For text, a file of completely random sequences of ASCII is not good either - the text files should have the patterns and frequencies that simulate natural language or source code (XML, C#, etc.). Pseudo-real text.
  • The size of each individual file is not critical, but for the set of files, I need the total to be ~8 GB.
  • I'd like to keep the number of files at a manageable level, say O(10).

To create binary files, I can allocate a large buffer and call System.Random.NextBytes followed by FileStream.Write in a loop, something like this:

// size = total number of bytes to write, sz = buffer size,
// zeroes = write all zeroes instead of random data, _rnd = a System.Random instance
Int64 bytesRemaining = size;
byte[] buffer = new byte[sz];
using (Stream fileStream = new FileStream(Filename, FileMode.Create, FileAccess.Write))
{
    while (bytesRemaining > 0)
    {
        // write at most one buffer's worth per iteration
        int sizeOfChunkToWrite = (bytesRemaining > buffer.Length) ? buffer.Length : (int)bytesRemaining;
        if (!zeroes) _rnd.NextBytes(buffer);
        fileStream.Write(buffer, 0, sizeOfChunkToWrite);
        bytesRemaining -= sizeOfChunkToWrite;
    }
    fileStream.Close();
}

With a sufficiently large buffer, say 512 KB, this is relatively fast, even for files over 2 or 3 GB. But the content is totally random, which is not what I want.

For text files, the approach I have taken is to use Lorem Ipsum and repeatedly write it through a StreamWriter into a text file. The content is non-random and non-uniform, but it has many identical repeated blocks, which is unnatural. Also, because the Lorem Ipsum block is so small (< 1 KB), it takes many loops and a very, very long time.

Neither of these is quite satisfactory for me.

I have seen the answers to Quickly create a large file on a Windows system?. Those approaches are very fast, but I think they just fill the file with zeroes or random data, neither of which is what I want. I have no problem running an external process like contig or fsutil, if necessary.

The testing is done on Windows.
Rather than create new files, does it make more sense to just use files that already exist in the filesystem? I don't know of any that are big enough.

What about starting with a single existing file (maybe c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterpriseec.config.cch for a text file) and replicating its contents many times? This would work with either a text or binary file.

Currently I have an approach that sort of works, but it takes too long to run.

Has anyone else solved this?

Is there a faster way to write a text file than through StreamWriter?

Suggestions?

EDIT: I like the idea of a Markov chain for generating more natural text. Still, the issue of speed remains, though.

+7
c# windows filesystems testing
8 answers

I think you may be looking for something like a Markov chain process to generate this data. It is both stochastic (randomised) and structured, and Markov chains have been used to generate semi-realistic-looking text in human languages (see the properties of Markov chains on the linked page). Hopefully you can see how to design one from that; to implement it, it is actually quite a simple concept. Your best bet is probably to create a framework for a generic Markov process and then analyse either natural language or source code (whichever you want your random data to imitate) in order to "train" your Markov process. In the end, this should give you very high quality data in terms of your requirements. Well worth the effort if you need these enormous amounts of test data.
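For illustration, here is a minimal sketch of such a generator in C#: an order-1, word-level Markov chain. The class name, the word-level granularity, and the whitespace tokenisation are all assumptions for the example, not something prescribed by the answer above.

using System;
using System.Collections.Generic;
using System.Text;

class MarkovTextGenerator
{
    private readonly Dictionary<string, List<string>> _transitions =
        new Dictionary<string, List<string>>();
    private readonly Random _rnd = new Random();

    // Build the transition table: for each word, remember every word that followed it.
    public void Train(string corpus)
    {
        string[] words = corpus.Split(new[] { ' ', '\t', '\r', '\n' },
                                      StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < words.Length - 1; i++)
        {
            List<string> followers;
            if (!_transitions.TryGetValue(words[i], out followers))
            {
                followers = new List<string>();
                _transitions[words[i]] = followers;
            }
            followers.Add(words[i + 1]);   // duplicates preserve the observed frequencies
        }
    }

    // Walk the chain, picking a random observed successor at each step.
    public string Generate(int wordCount)
    {
        if (_transitions.Count == 0)
            throw new InvalidOperationException("Call Train() with a corpus first.");
        var keys = new List<string>(_transitions.Keys);
        string current = keys[_rnd.Next(keys.Count)];
        var sb = new StringBuilder();
        for (int i = 0; i < wordCount; i++)
        {
            sb.Append(current).Append(' ');
            List<string> followers;
            if (_transitions.TryGetValue(current, out followers))
                current = followers[_rnd.Next(followers.Count)];
            else
                current = keys[_rnd.Next(keys.Count)];   // dead end: restart anywhere
        }
        return sb.ToString();
    }
}

Usage would be something like gen.Train(File.ReadAllText("corpus.txt")) followed by gen.Generate(150000), which gives roughly a megabyte of text with word frequencies resembling the training corpus.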

+4

For text, you could use the community data dump; there is about 300 MB of data there. It only takes about 6 minutes to load it into a db with the app I wrote, and probably about the same time to dump all the posts out to text files, which would easily get you somewhere between 200 thousand and 1 million text files, depending on your approach (with the added bonus of having source and xml mixed in).

You could also use something like the Wikipedia dump; it seems to come in MySQL format, which would make it easier to work with.

If you are looking for a large file that you can split up, for binary purposes you could use a VM vmdk or a DVD ripped locally.

Edit

Mark mentions downloading the Project Gutenberg collection; it is also a very good source of text (and audio), and it is available for download via BitTorrent.

+14

You can always code yourself a little web crawler ...

UPDATE: Calm down guys, this would be a good answer if he hadn't said that he already has a solution that "takes too long."

A quick check here suggests that downloading 8 GB of anything is going to take a relatively long time.

+10

I think the Windows directory will probably be a good enough source for your needs. If you are after text, I would recurse through each of the directories looking for .txt files and loop through them, copying them to your output file as many times as needed to get the desired file size.

You could then use a similar approach for binary files by looking for .exes or .dlls.
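A rough sketch of that idea, with an assumed 1 GB target and a made-up output name. Note that Directory.GetFiles can throw UnauthorizedAccessException on protected subdirectories under the Windows folder, so a hand-rolled recursive walk with try/catch is more robust in practice.

using System;
using System.IO;

class ExistingFileConcatenator
{
    static void Main()
    {
        const long targetSize = 1L * 1024 * 1024 * 1024;   // 1 GB per output file
        string windowsDir = Environment.GetFolderPath(Environment.SpecialFolder.Windows);
        // Swap "*.txt" for "*.dll" or "*.exe" to get binary content instead of text.
        string[] sources = Directory.GetFiles(windowsDir, "*.txt", SearchOption.AllDirectories);
        byte[] buffer = new byte[512 * 1024];

        using (var output = new FileStream("big-sample.txt", FileMode.Create, FileAccess.Write))
        {
            while (output.Length < targetSize)
            {
                long before = output.Length;
                foreach (string path in sources)
                {
                    try
                    {
                        using (var input = File.OpenRead(path))
                        {
                            int n;
                            while ((n = input.Read(buffer, 0, buffer.Length)) > 0)
                                output.Write(buffer, 0, n);
                        }
                    }
                    catch (IOException) { }                 // skip locked files
                    catch (UnauthorizedAccessException) { } // skip protected files
                    if (output.Length >= targetSize) break;
                }
                if (output.Length == before) break;         // nothing readable; avoid spinning forever
            }
        }
    }
}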

+3

For text files, you might have some success taking an English word list and simply pulling words from it at random. This won't produce real English text, but I would guess it would produce a letter frequency similar to what you might find in English.
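A quick sketch of that, with an assumed word-list file (one word per line) and a made-up target size:

using System;
using System.IO;
using System.Text;

class RandomWordText
{
    static void Main()
    {
        string[] words = File.ReadAllLines("wordlist.txt");   // assumed input: one word per line
        var rnd = new Random();
        const long target = 100L * 1024 * 1024;               // ~100 MB of text
        long written = 0;
        var sb = new StringBuilder();

        using (var writer = new StreamWriter("random-words.txt"))
        {
            while (written < target)
            {
                sb.Length = 0;
                for (int i = 0; i < 1000; i++)                // build a chunk of 1000 random words
                    sb.Append(words[rnd.Next(words.Length)]).Append(' ');
                writer.Write(sb.ToString());
                written += sb.Length;                         // close enough for ASCII text
            }
        }
    }
}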

For a more structured approach you could use a Markov chain trained on some large, free English text.

+1

Why don't you just take Lorem Ipsum and build a long string in memory before writing it out? The text should expand at a rate of O(log n) if you double the amount of text you have each time. You can even calculate the total length of the data beforehand, which lets you avoid having to copy the contents into a new string/array.

Since your buffer is only 512 KB, or whatever you set it to, you only need to generate that much data before writing it, since that is all you can push to the file at a time. You are going to be writing the same text over and over, so just reuse the original 512 KB you created the first time.
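A sketch of that doubling idea, under assumed sizes (a 512 KB block, an 8 GB target) and a placeholder seed string:

using System;
using System.IO;
using System.Text;

class RepeatedBlockWriter
{
    static void Main()
    {
        string loremIpsum = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. ";
        const int blockSize = 512 * 1024;                    // one 512 KB block, built once
        const long targetSize = 8L * 1024 * 1024 * 1024;     // ~8 GB total

        // Double the text until it covers the block size: O(log n) concatenations.
        var sb = new StringBuilder(loremIpsum);
        while (sb.Length < blockSize)
            sb.Append(sb.ToString());

        byte[] block = Encoding.ASCII.GetBytes(sb.ToString(0, blockSize));

        using (var fs = new FileStream("lorem-8gb.txt", FileMode.Create, FileAccess.Write))
        {
            long written = 0;
            while (written < targetSize)
            {
                fs.Write(block, 0, block.Length);            // reuse the same block every time
                written += block.Length;
            }
        }
    }
}

Building the block costs only a handful of doublings, so the remaining cost is pure sequential IO.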

+1

Wikipedia is excellent for compression testing of mixed text and binary data. If you need benchmark comparisons, the Hutter Prize site can provide a high-water mark for the first 100 MB of Wikipedia. The current record is a ratio of 6.26, i.e. 16 MB.

+1

Thanks for all the quick input. I decided to consider the problems of speed and "naturalness" separately. For generating natural-ish text, I combined a couple of ideas.

  • To generate the text, I start with a few text files from the Project Gutenberg catalog, as suggested by Mark Rushakoff.
  • I randomly select and load one document from that subset.
  • I then apply a Markov process, as suggested by Noldorin, using that loaded text as input.
  • I wrote a new Markov chain in C#, using Pike's economical Perl implementation as a model. It generates text one word at a time.
  • For efficiency, rather than using the pure Markov chain to generate 1 GB of text one word at a time, the code generates ~1 MB of random text and then repeatedly takes arbitrary segments of that and splices them together (see the sketch after this list).
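Here is a rough sketch of that recombination step, assuming a generator like the MarkovTextGenerator sketched under the first answer, plus made-up sizes and file names; it is not the actual code used:

using System;
using System.IO;

class SegmentRecombiner
{
    static void Main()
    {
        var rnd = new Random();
        var markov = new MarkovTextGenerator();               // the generator sketched earlier
        markov.Train(File.ReadAllText("gutenberg-sample.txt"));

        string seed = markov.Generate(150000);                // roughly 1 MB of seed text
        const long targetSize = 1L * 1024 * 1024 * 1024;      // 1 GB output

        using (var writer = new StreamWriter("markov-1gb.txt"))
        {
            long written = 0;
            while (written < targetSize)
            {
                // Take a random slice of the seed and append it to the output.
                int start = rnd.Next(seed.Length / 2);
                int length = rnd.Next(4 * 1024, 64 * 1024);
                if (start + length > seed.Length) length = seed.Length - start;
                writer.Write(seed.Substring(start, length));
                written += length;
            }
        }
    }
}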

UPDATE: As for the second problem, speed - I took the approach of eliminating as much IO as possible, since this is being done on my poor laptop with a 5400 rpm mini-spindle. That led me to redefine the problem entirely - rather than generating a FILE with random content, what I really want is the random content itself. Using a Stream wrapped around the Markov chain, I can generate the text in memory and stream it straight to the compressor, eliminating 8 GB of writes and 8 GB of reads. For this particular test I don't need to verify the compression/decompression round trip, so I don't need to retain the original content. The streaming approach worked well to speed things up massively; it cut the time required by about 80%.
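As an illustration of that streaming idea, here is a minimal read-only Stream wrapper around such a generator; the class name, the refill size, and the generator type are assumptions of the sketch, not the actual code used.

using System;
using System.IO;
using System.Text;

class GeneratedTextStream : Stream
{
    private readonly MarkovTextGenerator _generator;          // the generator sketched earlier
    private readonly long _length;                            // the pretend "file size"
    private long _position;
    private byte[] _pending = new byte[0];
    private int _pendingOffset;

    public GeneratedTextStream(MarkovTextGenerator generator, long length)
    {
        _generator = generator;
        _length = length;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (_position >= _length) return 0;                   // simulated end of file
        if (_pendingOffset >= _pending.Length)
        {
            // Refill the internal buffer with freshly generated text.
            _pending = Encoding.ASCII.GetBytes(_generator.Generate(10000));
            _pendingOffset = 0;
        }
        int n = Math.Min(count, _pending.Length - _pendingOffset);
        n = (int)Math.Min(n, _length - _position);
        Array.Copy(_pending, _pendingOffset, buffer, offset, n);
        _pendingOffset += n;
        _position += n;
        return n;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { return _length; } }
    public override long Position
    {
        get { return _position; }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

A compressor can then consume it directly, for example by copying this stream into a GZipStream in a loop, without the 8 GB ever being written to or read from disk.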

I haven't figured out how to do the binary generation yet, but it will most likely be something analogous.

Thanks to everyone, again, for all the useful ideas.

0
