In a nutshell, the main differences IMO:
- You need to know up front whether your bottleneck will be I/O or CPU, and focus on the best algorithm and infrastructure to address that. I/O is quite often the bottleneck.
- Careful selection and tuning of the algorithm often dominates every other choice.
- Even small changes to algorithms and access patterns can change performance by orders of magnitude. You will be optimizing a lot. The "best" solution depends on the system at hand.
- Talk to your colleagues and other scientists to profit from their experience with these data sets. Many tricks cannot be found in textbooks.
- Precomputing and storing results can pay off hugely (a small sketch follows this list).
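For instance, a minimal sketch of precomputing and caching an expensive derived quantity to disk; the file name and the pairwise-distance computation are just hypothetical placeholders:

```python
import os

import numpy as np

CACHE_PATH = "pairwise_distances.npy"  # hypothetical cache file

def expensive_computation(data):
    # Stand-in for any costly derived quantity; here, all pairwise distances.
    diff = data[:, None, :] - data[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def get_or_compute(data):
    # Reuse the precomputed result from disk instead of recomputing it.
    if os.path.exists(CACHE_PATH):
        return np.load(CACHE_PATH)
    result = expensive_computation(data)
    np.save(CACHE_PATH, result)
    return result
```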
Bandwidth and I/O
Initially, bandwidth and I/O are often the bottleneck. To give you some perspective: at the theoretical limit of SATA 3 (about 600 MB/s), reading 1 TB takes roughly 30 minutes. If you need random access, need to read the data several times, or need to write, you want to work in memory most of the time or need something considerably faster (e.g. iSCSI over InfiniBand). Ideally, your system should be able to do parallel I/O to get as close as possible to the theoretical limit of whatever interface you are using. For example, simply accessing different files in parallel from different processes, or using HDF5 on top of MPI-2 I/O, is quite common. Ideally, you also do computation and I/O in parallel so that one of the two is "free."
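As a hedged illustration of the "different files in parallel from different processes" idea, here is a minimal Python sketch using the standard library's process pool; the chunk file names are hypothetical, and a real setup might instead use HDF5 with MPI-IO as mentioned above:

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def load_chunk(path):
    # Each worker process reads (and could preprocess) one file.
    return np.load(path)

def load_all(paths, workers=4):
    # Several processes issue reads at once, which helps keep the storage
    # interface busy instead of waiting on a single sequential stream.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_chunk, paths))

if __name__ == "__main__":
    # Hypothetical file names for the data chunks.
    chunks = load_all([f"chunk_{i:03d}.npy" for i in range(8)])
```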
Clusters
Depending on your case, either I/O or the CPU can be the bottleneck. Whichever it is, clusters can deliver huge performance gains if you can distribute your tasks efficiently (e.g. MapReduce; a toy sketch follows below). This may require totally different algorithms than the typical textbook examples. Spending development time here is often time best spent.
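A toy sketch of the map/reduce pattern on a single machine, just to show the shape of the computation; the sum-of-squares task is an arbitrary placeholder, and on a real cluster the map and reduce steps would run on different nodes via a framework such as Hadoop or MPI:

```python
from functools import reduce
from multiprocessing import Pool

def map_step(chunk):
    # Map: each worker summarizes its own chunk independently.
    return sum(x * x for x in chunk)

def reduce_step(a, b):
    # Reduce: partial results are combined with an associative operation.
    return a + b

def sum_of_squares(chunks, workers=4):
    with Pool(workers) as pool:
        partials = pool.map(map_step, chunks)
    return reduce(reduce_step, partials, 0)

if __name__ == "__main__":
    data = [list(range(i, i + 1000)) for i in range(0, 8000, 1000)]
    print(sum_of_squares(data))
```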
Algorithms
When choosing between algorithms, their big-O behavior is very important, but algorithms with similar big-O can differ dramatically in performance depending on locality. The less local an algorithm is (i.e. the more cache misses and main-memory misses it incurs), the worse its performance will be - access to storage is usually an order of magnitude slower than access to main memory. Classic examples of locality improvements are tiling for matrix multiplication and loop interchange (a tiling sketch follows below).
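An illustrative sketch of tiling (blocking) for matrix multiplication; in practice you would call an optimized BLAS via `numpy.dot`, which blocks internally, so the explicit loops below are only there to make the access pattern visible:

```python
import numpy as np

def matmul_tiled(A, B, tile=64):
    # Blocked (tiled) matrix multiplication: working on tile-sized sub-blocks
    # keeps the active data small enough to stay in cache.
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # Each block product touches only ~3 tiles of data at a time.
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, p:p + tile] @ B[p:p + tile, j:j + tile]
                )
    return C
```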
Computer, language, specialized tools
If I/O is your bottleneck, algorithms for large data sets can benefit from more memory (e.g. 64-bit machines) or from programming languages / data structures with lower memory consumption (e.g. in Python, __slots__ might be useful; see the sketch below), since more memory can mean less I/O per unit of CPU time. BTW, systems with terabytes of main memory are not unheard of (e.g. HP Superdomes).
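A small sketch of the __slots__ idea: declaring the attribute names up front removes the per-instance `__dict__`, which saves memory when you hold millions of small objects (the Point class and the exact savings are illustrative):

```python
import sys

class PointPlain:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointSlots:
    # __slots__ removes the per-instance __dict__, so each object stores
    # only the declared attributes.
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

plain, slotted = PointPlain(1.0, 2.0), PointSlots(1.0, 2.0)
print(sys.getsizeof(plain) + sys.getsizeof(plain.__dict__))  # object plus its dict
print(sys.getsizeof(slotted))                                # slotted object, no dict
```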
Similarly, if CPU is your bottleneck, faster machines, and languages and compilers that let you use special features of an architecture (e.g. SIMD instruction sets like SSE), can increase performance by an order of magnitude.
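In Python itself you do not write SSE directly; a hedged illustration of the same idea is replacing an interpreted loop with a vectorized NumPy call, which dispatches to compiled kernels that can use SIMD where the hardware supports it:

```python
import numpy as np

def dot_python(a, b):
    # Interpreted loop: one element at a time, no chance for SIMD.
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def dot_numpy(a, b):
    # Vectorized call: delegates to compiled BLAS kernels that can use
    # SIMD instructions such as SSE/AVX where available.
    return float(np.dot(a, b))

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
# dot_numpy(a, b) is typically orders of magnitude faster than dot_python(a, b).
```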
The way you find and access data, and store meta-information, can be very important for performance. You will often use flat files or domain-specific non-standard packages to store data (e.g. not a relational database directly) that give you more efficient access. For example, kdb+ is a specialized database for large time series, and ROOT uses its TTree object to access data efficiently. PyTables, which you mentioned, would be another example (a small sketch follows below).
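A minimal PyTables sketch of the kind of out-of-core storage and querying meant here; the file, table, and column names are hypothetical:

```python
import tables

class Tick(tables.IsDescription):
    # Hypothetical record layout for a large time series.
    timestamp = tables.Float64Col()
    price = tables.Float64Col()
    volume = tables.Int64Col()

# Write the data to an HDF5 file on disk.
with tables.open_file("ticks.h5", mode="w") as h5:
    table = h5.create_table("/", "ticks", Tick, "tick data")
    row = table.row
    for t in range(1000):
        row["timestamp"] = float(t)
        row["price"] = 100.0 + 0.01 * t
        row["volume"] = 10 + (t % 5)
        row.append()
    table.flush()

# Query out of core: only rows matching the condition are materialized.
with tables.open_file("ticks.h5", mode="r") as h5:
    table = h5.root.ticks
    prices = [r["price"] for r in table.where("volume > 12")]
```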