How to produce a huge amount of data?

I'm doing some tests with Nutch and Hadoop and I need a huge amount of data. I want to start with 20 GB, move to 100 GB and 500 GB, and eventually reach 1-2 TB.

The problem is that I don’t have that much data, so I’m thinking about how to create it.

The data itself can be of any kind. One idea is to take the original dataset and duplicate it. But that alone is not enough, because the files need to differ from each other (identical files are ignored).

Another idea is to write a program that will create files with dummy data.

Any other idea?

+8
java hadoop nutch bigdata
5 answers

This may be a better question for the Statistics StackExchange site (see, for example, my question there on best practices for generating synthetic data).

However, if you are interested not in the statistical properties of the data but only in the infrastructure for storing and processing it, you can ignore the statistics site. In particular, if you are not focused on statistical aspects and just want big data, we can focus on how to create a large pile of data.

I can offer some answers:

  • If you are only interested in random numeric data, generate a large stream from your favorite Mersenne Twister implementation. There is also /dev/random (see this Wikipedia entry for more information). I prefer a well-known random number generator, since the results can be reproduced ad nauseam by anyone else.

  • For structured data, you can map random numbers to indices into lookup tables that associate indices with, say, strings, numbers, etc., as you might do when generating a database of names, addresses, and so on. If you have a sufficiently large table or a sufficiently rich mapping, you can reduce the risk of collisions (e.g., identical names), though perhaps you would like a few collisions, since those occur in reality too.

  • Keep in mind that with any generative method you do not need to store the entire data set before you begin. As long as you record the state (e.g., of the RNG), you can pick up where you left off.

  • For text data, you can look at simple random string generators. You can set your own probability estimates for strings of different lengths or with different characteristics. The same goes for sentences, paragraphs, documents, etc. - just decide which properties you want to emulate, create a "blank" object, and fill it with text.
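A minimal sketch of the seeded-generator idea above (the file name `dummy.bin`, the target size, and the seed are all arbitrary choices for illustration; `java.util.Random` is a simple LCG rather than a Mersenne Twister, but it is seedable, so the output is reproducible by anyone):

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Random;

public class DummyDataGenerator {
    public static void main(String[] args) throws IOException {
        long targetBytes = 1L * 1024 * 1024; // start small; scale the same code up to GB/TB
        long seed = 42L;                     // fixed seed => anyone can regenerate the same file
        Random rng = new Random(seed);

        byte[] buf = new byte[64 * 1024];    // generate and write in 64 KiB chunks
        try (BufferedOutputStream out =
                 new BufferedOutputStream(new FileOutputStream("dummy.bin"))) {
            long written = 0;
            while (written < targetBytes) {
                rng.nextBytes(buf);          // fill the chunk with pseudorandom bytes
                int n = (int) Math.min(buf.length, targetBytes - written);
                out.write(buf, 0, n);
                written += n;
            }
        }
    }
}
```

Because the generator state is just the seed plus the byte count, you can also restart an interrupted run by fast-forwarding the RNG rather than keeping the whole data set around.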

+7

If you only need to avoid exact duplicates, you can try a combination of your two ideas - create corrupted copies of a relatively small data set. "Corruption" operations could include: substitution, insertion, deletion, and swapping of characters.
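One way to sketch this corruption idea (the class name, edit count, and sample text are made up for illustration; a seeded RNG keeps each copy reproducible while different seeds yield different copies):

```java
import java.util.Random;

public class Corruptor {
    // Produce a corrupted copy of the input by applying random substitutions,
    // insertions, deletions, and adjacent-character swaps.
    static String corrupt(String input, long seed, int edits) {
        Random rng = new Random(seed);
        StringBuilder sb = new StringBuilder(input);
        for (int i = 0; i < edits && sb.length() > 1; i++) {
            int pos = rng.nextInt(sb.length());
            switch (rng.nextInt(4)) {
                case 0: sb.setCharAt(pos, (char) ('a' + rng.nextInt(26))); break; // substitution
                case 1: sb.insert(pos, (char) ('a' + rng.nextInt(26))); break;    // insertion
                case 2: sb.deleteCharAt(pos); break;                              // deletion
                default:                                                          // swap
                    if (pos + 1 < sb.length()) {
                        char c = sb.charAt(pos);
                        sb.setCharAt(pos, sb.charAt(pos + 1));
                        sb.setCharAt(pos + 1, c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String seedDoc = "the quick brown fox jumps over the lazy dog";
        // Different seeds give different corrupted copies of the same source.
        System.out.println(corrupt(seedDoc, 1L, 5));
        System.out.println(corrupt(seedDoc, 2L, 5));
    }
}
```

Applied repeatedly with different seeds to each file in a small seed corpus, this inflates the data set while keeping every copy distinct at the byte level.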

+1

I would write a simple program for this. The program does not need to be very clever, since disk write speed is likely to be your bottleneck.
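To see whether the disk really is the limiting factor, a sketch like the following (file name, chunk size, and target size are arbitrary) reuses one pregenerated buffer so the CPU does almost no work, and reports the achieved write rate:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class WriteBench {
    public static void main(String[] args) throws IOException {
        byte[] chunk = new byte[8 * 1024 * 1024];  // one pregenerated 8 MiB chunk
        Arrays.fill(chunk, (byte) 'x');            // constant filler: we only measure the disk
        long target = 64L * 1024 * 1024;           // total bytes to write; scale up as needed

        long start = System.nanoTime();
        try (FileOutputStream out = new FileOutputStream("filler.bin")) {
            for (long written = 0; written < target; written += chunk.length) {
                out.write(chunk);                  // generation is free; I/O dominates
            }
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("wrote %d MiB in %.2f s (%.1f MiB/s)%n",
                target >> 20, secs, (target >> 20) / secs);
    }
}
```

If the rate here is close to what you get when the chunks are freshly randomized per write, the generator's cleverness indeed doesn't matter.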

0

A late comment: I recently expanded a disk partition, and I know well how long it takes to move or create a large number of files. It would be much faster to ask the OS for a whole range of free disk space and then create a new FAT entry for that range without writing any content (reusing whatever bytes happened to be there before). This would serve your purpose (since you do not care about the file contents) and would be as fast as deleting a file.

The problem is that this may be difficult to achieve in Java. I found an open-source library called fat32-lib, but since it does not resort to native code, I do not think it is useful here. For a given file system, using a lower-level language (like C), and given the time and motivation, I think it would be possible.

0

Take a look at TPC.org ; they have various database benchmarks with data generators and predefined queries.

The generators have a scale factor that lets you set the size of the target data.

There is also the Myriad research project ( paper ), which focuses on generating "big data" with realistic distributions. Myriad has a steep learning curve, so you may need to contact the software's authors for help.

0