What changes when your input is gigabytes or terabytes in size?

Today I took my first baby steps into real scientific computing, when I was shown a data set where the smallest file is 48,000 fields by 1,600 rows (haplotypes for several people, for chromosome 22). And it is considered tiny.

I write Python, so I've spent the past few hours reading about HDF5, NumPy, and PyTables, but I still feel like I don't really grasp what a terabyte-sized dataset actually means for me as a programmer.

For example, someone pointed out that with large data sets it becomes impossible to read everything into memory, not because the machine lacks RAM, but because the architecture lacks address space! It blew my mind.

What other assumptions that I've relied on in the classroom just don't work with input this big? What do I need to start doing or thinking about differently? (This doesn't have to be Python-specific.)

+21
python scientific-computing
Jun 10 '10
4 answers

I'm currently engaged in high-performance computing in a small corner of the oil industry and regularly work with datasets of the order of magnitude you're worried about. Here are a few points to bear in mind:

  • Databases don't have much traction in this domain. Almost all of our data is kept in files; some of those files are based on tape file formats designed in the 70s. I think part of the reason for not using databases is historical: 10, even 5, years ago I don't think Oracle and its kin were up to the task of managing single datasets of O(TB), let alone a database of 1000s of such datasets.

    Another reason is the conceptual mismatch between the normalization rules for effective database analysis and design and the nature of scientific data sets.

    I think (though I'm not sure) that the performance reason is much less persuasive today. And the concept-mismatch reason is probably also less pressing now that most of the major databases available can cope with spatial data sets, which are generally a much closer conceptual fit to other scientific datasets. I have seen an increasing use of databases for storing metadata, with some sort of reference to the file(s) containing the sensor data.

    However, I'd still be looking at, in fact am looking at, HDF5. It has a couple of attractions for me: (a) it's just another file format, so I don't have to install a DBMS and wrestle with its complexities, and (b) with the right hardware I can read/write an HDF5 file in parallel. (Yes, I know that I can read and write databases in parallel too.)

  • Which takes me to the second point: when dealing with very large data sets, you really need to be thinking about parallel computation. I work mostly in Fortran; one of its strengths is its array syntax, which suits a lot of scientific computing very well, and another is its good support for parallelization. I believe Python has all sorts of parallelization support too, so it's probably not a bad choice for you.

    Sure, you can add parallelism onto sequential systems, but it's much better to start out designing for parallelism. To take just one example: the best sequential algorithm for a problem is very often not the best candidate for parallelization. You might be better off with a different algorithm, one that scales better across many processors. Which leads neatly to the next point.

  • I also think you may have to come to terms with surrendering any attachments you have (if you have them) to lots of clever algorithms and data structures that work well when all your data is resident in memory. Very often, trying to adapt them to the situation where you can't get the data into memory all at once is much harder (and less performant) than brute force and treating the entire file as one large array (see the memmap sketch after this list).

  • Performance starts to matter in a serious way, both the execution performance of programs and developer productivity. It's not that a 1 TB dataset requires 10 times as much code as a 1 GB dataset so you have to work faster; it's that some of the ideas you will need to implement will be crazily complex, and will probably have to be written by the domain specialists, i.e. the scientists you are working with. Here the domain specialists write in Matlab.
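As a rough illustration of the brute-force "whole file as one large array" approach, here is a minimal Python sketch using numpy.memmap; the file name, shape and dtype are invented for the example, and in practice you would match them to your actual haplotype files.

    import numpy as np

    # Hypothetical binary file of haplotype calls: 1600 rows x 48000 columns of int8.
    # np.memmap maps the file into the process's address space instead of reading it
    # into RAM; the OS pages data in and out as slices are touched.
    shape = (1600, 48000)
    data = np.memmap("chr22_haplotypes.dat", dtype=np.int8, mode="r", shape=shape)

    # Work on manageable slices; only the pages actually touched are read from disk.
    column_means = np.empty(shape[1])
    chunk = 4096
    for start in range(0, shape[1], chunk):
        stop = min(start + chunk, shape[1])
        column_means[start:stop] = data[:, start:stop].mean(axis=0)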

But this is going on too long; I'd better get back to work.

+18
Jun 10 '10 at 7:55

In a nutshell, the main differences IMO:

  • You need to know beforehand what your bottleneck will be (I/O or CPU) and focus on the best algorithm and infrastructure to address it. I/O is quite frequently the bottleneck.
  • The choice and fine-tuning of an algorithm often dominates any other choice made.
  • Even modest changes to algorithms and access patterns can affect performance by orders of magnitude. You will be micro-optimizing a lot. The "best" solution will be system-dependent.
  • Talk to your colleagues and other scientists to profit from their experience with these data sets. A lot of tricks cannot be found in textbooks.
  • Pre-computing and storing results can be extremely effective.

Bandwidth and I/O

Initially, bandwidth and I/O are often the bottleneck. To give you some perspective: at the theoretical limit for SATA 3, it takes about 30 minutes just to read 1 TB. If you need random access, read the data several times, or write it, you want to do this in memory most of the time, or you need something substantially faster (e.g. iSCSI over InfiniBand). Your system should ideally be able to do parallel I/O to get as close as possible to the theoretical limit of whichever interface you are using. For example, simply accessing different files in parallel from different processes, or HDF5 on top of MPI-2 I/O, is quite common. Ideally you also do computation and I/O in parallel so that one of the two comes "for free."
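A minimal sketch of what parallel HDF5 I/O can look like from Python, assuming h5py was built with MPI support and mpi4py is installed (run with something like mpiexec -n 4 python write_slabs.py); the file name and dataset shape are invented for the example.

    from mpi4py import MPI
    import h5py
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n_rows, n_cols = 1600, 48000          # illustrative sizes only

    # Every rank opens the same file collectively via the MPI-IO driver.
    with h5py.File("haplotypes.h5", "w", driver="mpio", comm=comm) as f:
        dset = f.create_dataset("haplotypes", (n_rows, n_cols), dtype="i1")

        # Each rank writes its own contiguous slab of rows in parallel.
        rows_per_rank = n_rows // size
        start = rank * rows_per_rank
        stop = n_rows if rank == size - 1 else start + rows_per_rank
        dset[start:stop, :] = np.random.randint(
            0, 2, size=(stop - start, n_cols), dtype="i1"
        )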

Clusters

Depending on your case, either I/O or the CPU can be the bottleneck. Whichever it is, clusters can bring huge performance gains if you can distribute your tasks efficiently (see MapReduce, for example). This may require totally different algorithms than the typical textbook examples; development time spent here is often time best spent.
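To make the map/reduce idea concrete on a single multi-core machine (a real cluster framework follows the same shape, just distributed across nodes), here is a toy sketch using Python's multiprocessing; the input file and the counting task are invented for the example.

    from multiprocessing import Pool
    from collections import Counter

    def map_chunk(lines):
        # "map" step: count allele symbols in one chunk of lines
        counts = Counter()
        for line in lines:
            counts.update(line.split())
        return counts

    def reduce_counts(partials):
        # "reduce" step: merge the per-chunk counters
        total = Counter()
        for c in partials:
            total.update(c)
        return total

    if __name__ == "__main__":
        with open("chr22_haplotypes.txt") as f:   # hypothetical input file
            lines = f.readlines()
        chunks = [lines[i::8] for i in range(8)]  # 8 roughly equal chunks
        with Pool(processes=8) as pool:
            partials = pool.map(map_chunk, chunks)
        print(reduce_counts(partials).most_common(5))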

Algorithms

When choosing between algorithms, big-O complexity is very important, but algorithms with similar big-O can differ dramatically in performance depending on locality. The less local an algorithm is (i.e. the more cache misses and main-memory misses it causes), the worse it will perform: access to storage is usually an order of magnitude slower than main memory. Classic examples of improvements are tiling for matrix multiplication and loop interchange.
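You can see the locality effect without leaving Python: the two functions below do the same amount of arithmetic on a C-ordered array, but one walks contiguous memory while the other strides across it. The array size is arbitrary and exact timings depend on your cache hierarchy.

    import numpy as np
    import time

    a = np.random.rand(5000, 5000)      # C order: rows are contiguous in memory

    def sum_row_major(x):
        s = 0.0
        for i in range(x.shape[0]):
            s += x[i, :].sum()          # contiguous access: cache-friendly
        return s

    def sum_col_major(x):
        s = 0.0
        for j in range(x.shape[1]):
            s += x[:, j].sum()          # strided access: many more cache misses
        return s

    for f in (sum_row_major, sum_col_major):
        t0 = time.perf_counter()
        f(a)
        print(f.__name__, time.perf_counter() - t0)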

Computer, language, specialized tools

If your bottleneck is I / O, this means that algorithms for large data sets can benefit from more memory (e.g. 64-bit) or programming languages ​​/ data structures with less memory consumption (e.g. in Python, __slots__ may be useful ), since more memory may mean less I / O per processor time. BTW, systems with TB main memory are not unheard of (e.g. HP Superdomes ).

Similarly, if your bottleneck is the CPU, then faster machines, and languages and compilers that let you exploit special architecture features (e.g. SIMD instruction sets such as SSE), can boost performance by an order of magnitude.
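As a rough illustration of how much a CPU-bound inner loop can gain from compiled, vectorized code (NumPy's kernels can use SIMD units such as SSE where available), compare a pure-Python loop with the equivalent NumPy call; the array size is arbitrary and the exact speed-up is machine-dependent.

    import numpy as np
    import time

    x = np.random.rand(10_000_000)

    t0 = time.perf_counter()
    s_python = sum(v * v for v in x)    # interpreted, one element at a time
    t_python = time.perf_counter() - t0

    t0 = time.perf_counter()
    s_numpy = np.dot(x, x)              # compiled loop, SIMD-capable
    t_numpy = time.perf_counter() - t0

    print("speed-up: %.0fx" % (t_python / t_numpy))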

How you find and access your data, and how you store meta-information, can be very important for performance. You will often end up using flat files or domain-specific non-standard packages to store data (e.g. not a relational DB directly) that give you more efficient access. For example, kdb+ is a specialized database for large time series, and ROOT uses its TTree object to access data efficiently. The PyTables you mention would be another example.
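Since PyTables came up in the question, here is a minimal sketch of the kind of out-of-core access it offers; the table layout and column names are invented for the example, and the query is evaluated chunk by chunk rather than by loading the whole table into RAM.

    import numpy as np
    import tables

    class Variant(tables.IsDescription):
        position = tables.Int64Col()
        frequency = tables.Float32Col()

    # Write a toy table.
    with tables.open_file("variants.h5", mode="w") as h5:
        table = h5.create_table("/", "variants", Variant)
        row = table.row
        for pos in range(100000):
            row["position"] = pos
            row["frequency"] = np.random.rand()
            row.append()
        table.flush()

    # Query it out-of-core: rows matching the condition are streamed.
    with tables.open_file("variants.h5", mode="r") as h5:
        table = h5.root.variants
        rare = [r["position"] for r in table.where("frequency < 0.01")]
        print(len(rare))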

+5
Jun 10 '10 at 9:15

While some languages naturally have less memory overhead in their types than others, it really doesn't matter for data of this size: you don't keep your entire data set in memory regardless of the language you use, so the "expense" of Python doesn't matter here. As you pointed out, there simply isn't enough address space even to reference all this data, let alone hold it.

What this normally means is either a) storing your data in a database, or b) adding resources in the form of additional computers, which adds to your available address space and memory. Realistically you are going to end up doing both of these things. The key thing to bear in mind when using a database is that a database isn't just a place to put your data while you aren't using it: you can do WORK in the database, and you should try to do so. The database technology you use has a big impact on the kind of work you can do, but an SQL database, for example, is well suited to doing a lot of set math, and doing it efficiently (of course, this means that schema design becomes a very important part of your overall architecture). Don't just suck the data out and manipulate it only in memory; try to leverage the computational query capabilities of your database to do as much work as possible before you ever put the data into memory in your process.
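A small illustration of "do the work in the database," using Python's built-in sqlite3 purely as a stand-in for whatever DBMS you actually use; the table and column names are invented. The point is that only one small row per sample crosses into your process, instead of millions of raw records to aggregate in Python.

    import sqlite3

    conn = sqlite3.connect("genotypes.db")   # hypothetical database file
    cur = conn.cursor()

    # Let the database do the aggregation.
    cur.execute("""
        SELECT sample_id, AVG(quality) AS mean_quality, COUNT(*) AS n_calls
        FROM genotype_calls
        GROUP BY sample_id
        HAVING COUNT(*) > 1000
    """)
    for sample_id, mean_quality, n_calls in cur.fetchall():
        print(sample_id, mean_quality, n_calls)

    conn.close()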

+1
Jun 10 '10 at 7:13

The main assumptions are about how much CPU/cache/RAM/storage/bandwidth you can have on one machine at an acceptable price. There are lots of answers on StackOverflow that are still based on the old assumptions of a 32-bit machine with 4 GB of RAM, about a terabyte of storage, and a 1 Gb network. With 16 GB DDR3 memory modules at 220 euros, a machine with 512 GB of RAM and 48 cores can be built at a reasonable price. The other big change is the move from hard drives to SSDs.

0
Apr 01 '12


