Concepts and tools needed to scale algorithms

I would like to start thinking about how I can scale up the algorithms I write for data analysis so that they can be applied to arbitrarily large data sets. I wonder what the relevant concepts are (threads, concurrency, immutable data structures, recursion) and the tools (Hadoop/MapReduce, Terracotta, and Eucalyptus), and how exactly these concepts and tools relate to each other. I have a rudimentary background in R, Python, and bash scripting, as well as C and Fortran programming, though I am also familiar with some basic concepts of functional programming. Do I need to change the way I program, use another language (Clojure, Haskell, etc.), or simply (or not so simply!) adapt something like R/Hadoop (RHIPE), or write wrappers for Python to enable multithreading or Hadoop access? I understand that this may also involve requirements for additional hardware, and I would like to get a general idea of the possibilities/options. My apologies for this rather large and still vague question, but I'm just trying to get started - thanks in advance!

+5
2 answers

Although languages and the associated technologies/frameworks matter for scaling, they tend to pale in comparison with the importance of algorithms, data structures, and architecture. Forget about threads: the number of cores you can exploit that way is too limited - you want separate processes exchanging messages, so you can scale out to at least a small cluster of servers on a fast LAN (and ideally a large cluster as well!).
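
To give a feel for the message-passing idea, here is a minimal single-machine sketch in Python - my own toy illustration, not any particular framework's API; the worker function and the chunked data are placeholders for real analysis work:

```python
# Toy sketch: worker processes exchanging messages with the parent via queues,
# instead of sharing memory across threads.
from multiprocessing import Process, Queue

def worker(inbox: Queue, outbox: Queue) -> None:
    # Each worker owns its data; the only sharing is via messages.
    for chunk in iter(inbox.get, None):          # None is the shutdown signal
        outbox.put(sum(x * x for x in chunk))    # stand-in for real analysis

if __name__ == "__main__":
    inbox, outbox = Queue(), Queue()
    workers = [Process(target=worker, args=(inbox, outbox)) for _ in range(4)]
    for w in workers:
        w.start()

    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]      # naive partitioning
    for chunk in chunks:
        inbox.put(chunk)
    for _ in workers:
        inbox.put(None)                          # tell each worker to stop

    total = sum(outbox.get() for _ in chunks)    # collect one result per chunk
    for w in workers:
        w.join()
    print(total)
```

The same shape - independent processes, explicit messages, no shared state - is what lets the work move from one box to a cluster later.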

" " - , . - , ? (, ), - / .

- " " ... Google , . -. (, , 8 ), ? " " - , , , 95%.

, - " " - 100% . / , , 95% , 99,99% ( , 100.00%!), .

, , " " "" - , . ( ), ( !!!), , ; , , exabytes , Stats 201 - ( ITunes University, YouTube , blip.tv, - ).

Whether you end up writing this in Python, R, C++, or anything else matters far less than the points above: settle the concepts first - message-passing processes, sampling, approximate answers, perhaps a "map/reduce" framework if you genuinely need one (do you, or would something simpler do...?) - and let the language and tool choices follow from them.

+9

Since you already know Python, look for a tool that gives you a pythonic interface to this kind of distributed processing, rather than switching languages just to scale.
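
For example, one pythonic way to spread per-record work over all local cores is a process pool - shown here as a generic sketch, not any specific framework's interface; the analyze function is a placeholder:

```python
# Sketch: Pool.map parallelizes an ordinary function over an iterable.
from multiprocessing import Pool

def analyze(record: int) -> int:
    return record * record                    # placeholder for real per-record work

if __name__ == "__main__":
    with Pool() as pool:
        results = pool.map(analyze, range(1_000_000), chunksize=10_000)
    print(sum(results))
```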

+3
