The fastest language for FOR loops

I am trying to work out the best programming language for the analytic model I am building. The primary consideration is the speed at which it will execute FOR loops.

Some details:

  • The model must perform numerous (~ 30 passes, over ~ 12 cycles) operations on sets of elements of an array - the array has ~ 300 thousand rows and ~ 150 columns. Most of these operations are logical in nature, for example: if place(i) = 1, then j(i) = 2.
  • I built an earlier version of this model in Octave - it takes ~ 55 hours to run on an Amazon EC2 m2.xlarge instance (and it uses ~ 10 GB of memory, but I'm quite happy to throw more memory at it). Octave / Matlab won't perform these elementwise logical operations without loops, so a large number of FOR loops are required - I am fairly confident I have vectorized as much as possible, and the loops that remain are necessary. I got octave-multicore working with this code, which gives some improvement (~ 30% reduction in runtime when I run it on 8 EC2 cores), but in the end it is unstable with the file locking, etc. I have also looked hard at the runtime - I know that using actual Matlab could give ~ 50% improvement on some benchmarks, but that is not an option. The eventual plan was to run a Monte Carlo with this, which at 55 hours per run is completely impractical.
  • The next version of this will be a complete rebuild from scratch (for IP reasons, if nothing else, that I won't go into), so I am completely open to any programming language. I am most familiar with Octave / Matlab, but have dabbled in R, C, C++ and Java. I am also comfortable with SQL if the solution involves storing the data in a database. I will learn any language for this - it is not complex functionality we are after, and we do not interact with other programs, etc., so I am not too concerned about the learning curve.
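To make the workload concrete, here is a minimal C++ sketch of the kind of row-wise conditional update described above ("if place(i) = 1, then j(i) = 2"). The array names `place` and `j` are hypothetical stand-ins for two columns of the model, not the actual code:

```cpp
#include <vector>
#include <cstddef>

// Hypothetical stand-ins for two columns of the ~300k-row model array.
// The update "if place(i) = 1, then j(i) = 2" is a single branch per row
// in a compiled language.
void conditional_update(const std::vector<int>& place, std::vector<int>& j) {
    for (std::size_t i = 0; i < place.size(); ++i) {
        if (place[i] == 1) {
            j[i] = 2;
        }
    }
}
```

In a compiled language this whole pass is a handful of instructions per row, which is the crux of several of the answers below.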

So: what is the fastest programming language for FOR loops? From searching SO and Google, Fortran and C bubble to the top, but I am looking for a few more opinions before diving into one or the other.

Thanks!

+6
performance loops matlab
14 answers

In terms of absolute speed, probably Fortran, followed by C, and then C++. In practice, well-written code in any of the three, compiled with a decent compiler, should be fast enough.

Edit: Generally, you will see much better performance with looping or branching code (for example, if statements) in a compiled language than in an interpreted one.

To give an example: in a recent project I worked on, the data sizes and operations were about 3/4 the size of what you are talking about here, but like your code there was very little room for vectorization, so significant looping was required. After converting the code from MATLAB to C++, the execution time went from 16-18 hours to about 25 minutes.

+3

A for loop looks no more complicated than this when it hits the CPU:

for(int i = 0; i != 1024; i++) translates to

 mov r0, 0          ;; start the counter
 top:
 ;; some processing
 add r0, r0, 1      ;; increment the counter by 1
 jne top: r0, 1024  ;; jump to the loop top if we haven't done all 1024 elements
 ;; continue on

As you can see, this is quite simple - you cannot optimize it very much [1]. Rather, rethink at the algorithm level.

The first avenue of attack is to look at cache locality. Look up the classic example of matrix multiplication and the swapping of the i and j indices.
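To spell out that classic example: with row-major storage, the order of the loops determines whether the innermost loop walks memory sequentially or strides across rows. A sketch of the cache-friendly i-k-j ordering (my illustration, not from the answer itself):

```cpp
#include <vector>
#include <cstddef>

// Multiply two n x n row-major matrices. The i-k-j loop order keeps the
// innermost loop walking both b and c sequentially in memory, which is
// far friendlier to the cache than the naive i-j-k order, where the
// innermost loop strides down a column of b.
std::vector<double> multiply_ikj(const std::vector<double>& a,
                                 const std::vector<double>& b,
                                 std::size_t n) {
    std::vector<double> c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            double aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
    return c;
}
```

Both orderings compute the same product; only the memory access pattern, and therefore the cache behavior, differs.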

edit: As a second avenue, I would suggest evaluating the algorithm for data dependencies between iterations and between locations in your data "matrix". It may be a good candidate for parallelization.

[1] Some micro-optimizations are possible, but they will not produce the acceleration you are looking for.

+6

~300k * ~150 * ~30 * ~12 = ~16G iterations, right? That number of primitive operations should complete in a matter of minutes (if not seconds) in any compiled language on any decent processor. Fortran and C/C++ should do it about equally well. Even managed languages such as Java and C# would only lag behind by a small margin (if at all).

If you have a problem with ~16G iterations that runs for 55 hours, the iterations are very far from primitive (80k per second? That is ridiculous), so perhaps we need to know more. (As suggested elsewhere, is disk access the limiting factor? Is it network access?)

+5

As @Rotsor said, 16G operations / 55 hours is around 80,000 operations per second, or one operation every 12.5 microseconds. That is a lot of time per operation.

This means that your loops themselves are not the cause of the poor performance; it is whatever is in the innermost loop that takes the time. And Octave is an interpreted language - that in itself means a slowdown.

If you need speed, you first need a compiled language. Then you need to do performance tuning (aka profiling), or simply single-step it in a debugger at the instruction level. That will tell you what it is actually doing in its heart of hearts. Once you have got it to where it does not waste cycles, fancier hardware, more cores, CUDA, etc. will give you a parallelism speedup. But it is silly to do that while your code is wasting cycles. (Here's an example of performance tuning - a 43x speedup just by trimming the fat.)

I can't believe the number of responders talking about Matlab, APL and other vectorized languages. Those are interpreters. They give you concise source code, which is not at all the same thing as fast execution. When it comes down to the bare metal, they are stuck with the same hardware as every other language.


Added: to show you what I mean, I just ran this C++ code, which performs 16G operations, on this moldy old laptop, and it took 94 seconds, or about 6 ns per iteration. (I can't believe you babysat that thing for 2 whole days.)

 void doit() {
     double sum = 0;
     for (int i = 0; i < 1000; i++) {
         for (int j = 0; j < 16000000; j++) {
             sum += j * 3.1415926;
         }
     }
 }
+5

For what you are describing, Fortran is probably your first choice. The closest second is probably C++. Some C++ libraries use "expression templates" to gain some speed over C for this kind of task. It is not entirely certain that would benefit you, but C++ can be at least as fast as C, and possibly a bit faster.

At least in theory, there is no reason Ada could not be competitive as well, but it has been so long since I used it for anything like this that I hesitate to recommend it - not because it is not good, but because I just have not kept track of current Ada compilers well enough to comment on them intelligently.

+3

Any compiled language should execute the loop itself on roughly equal terms.

If you can formulate your problem in its terms, you could look at CUDA or OpenCL and run your matrix code on the GPU - though this is less useful for code with lots of conditionals.

If you want to stay on conventional CPUs, you could formulate your problem in terms of SSE scatter/gather and bitmask operations.
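One way to phrase a conditional update in that mask-friendly form: replace the branch with arithmetic on a 0/1 mask, which compilers can typically auto-vectorize into SSE-style blend instructions. A branchless sketch using the question's hypothetical "if place(i) = 1, then j(i) = 2" update (names are mine):

```cpp
#include <vector>
#include <cstddef>

// Branchless form of "if place[i] == 1 then j[i] = 2": build a 0/1 mask
// and blend the old and new values arithmetically. With no branch in the
// loop body, compilers can usually auto-vectorize this with SIMD.
void masked_update(const std::vector<int>& place, std::vector<int>& j) {
    for (std::size_t i = 0; i < j.size(); ++i) {
        int mask = (place[i] == 1);           // 1 where the condition holds
        j[i] = mask * 2 + (1 - mask) * j[i];  // select 2, or keep j[i]
    }
}
```

Whether this actually beats the branched version depends on the data and the compiler, so it is worth measuring both.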

+3

Probably assembly language, for whatever your platform is. But compilers (especially special-purpose ones that target a single platform, e.g. Analog Devices or TI DSPs) are often as good as or better than humans. Also, compilers often know about tricks that you don't. For example, the aforementioned DSPs support zero-overhead loops, and the compiler will know how to optimize the code to use them.

+2

Matlab will do elementwise logical operations, and they are generally quite fast.

Here is a brief example on my computer (AMD Athlon 2.3GHz w/ 3GB):

 d = rand(300000,150);
 d = floor(d*10);
 >> numel(d(d==1))
 ans =
     4501524
 >> tic; d(d==1)=10; toc;
 Elapsed time is 0.754711 seconds.
 >> numel(d(d==1))
 ans =
     0
 >> numel(d(d==10))
 ans =
     4501524

In general, I have found that matlab's elementwise operators are very fast; you just need to find ways to express your algorithm directly in terms of matrix operators.

+1

How is the data stored? Your runtime is probably more dependent on I/O (especially disk, or worse, the network) than on your language.

Assuming the row operations are orthogonal, I would go with C# and use PLINQ to exploit all the parallelism I could.

0

Might you not be better off inserting hand-coded assembler? Assuming, of course, that you don't need portability.

That, and an optimized algorithm, should help (and perhaps restructuring the data?).

You can also try several algorithms and profile them.

0

APL.

Even though it is interpreted, its primitive operators work on arrays natively, so you rarely need explicit loops. You write the same code whether the data is scalar or an array, and the interpreter takes care of whatever looping is needed internally, with minimal overhead - the loops themselves are in a compiled language and will have been heavily optimized for the specific processor architecture it runs on.

Here is an example of the ease of handling arrays in APL:

       A <- 2 3 4 5 6 8 10
       ((2|A)/A) <- 0
       A
 2 0 4 0 6 8 10

The first line sets A to a vector of numbers. The second line replaces all the odd numbers in the vector with zeros. The third line asks for the new value of A, and the fourth line is the resulting output.

Note that no explicit loop is required, since scalar operators such as '|' (remainder) automatically propagate across arrays as needed. APL also has built-in primitives for searching and sorting, which will probably be faster than writing custom loops for those operations.

Wikipedia has a good APL article that also provides links to vendors such as IBM and Dyalog.

0

Any modern compiled or JIT language will generate almost identical machine code, giving a loop overhead of 10 nanoseconds or less per iteration on modern processors.

Quote from @Rotsor:

If you have a problem with ~ 16G iterations that run for 55 hours, that means they are very far from primitive (80k per second? This is ridiculous), so maybe we need to know more.

80 thousand operations per second is around 12.5 microseconds each - over 1000 times more than expected from loop overhead alone.

Assuming your 55-hour runtime is single-threaded, and if your individual operations are as simple as suggested, you should (conservatively) be able to achieve a 100x speedup and quickly get it down to less than an hour.

If you want it even faster, you will need to look at a multithreaded solution, in which case a language with good support for that will matter. I would lean toward PLINQ and C# 4.0, but that is because I already know C# - YMMV.

0

C++ is not fast when doing matrix work with for loops. C, in fact, is especially bad at it. See Good Math, Bad Math.

I have heard that C99 has restrict pointers that help, but I don't know much about it.
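For reference, the C99 keyword is spelled `restrict`; `__restrict` is the common extension that most C++ compilers accept. It promises the compiler that the pointers never alias, so it can keep loads in registers and vectorize without re-reading after every store. A minimal illustration (my sketch, not from the answer):

```cpp
#include <cstddef>

// __restrict promises the compiler that a, b and out never alias, so it
// can load a[i] and b[i] once, vectorize the loop, and skip reloading
// them after the store to out[i]. (Spelled "restrict" in C99; __restrict
// is the usual C++ compiler extension.)
void add_arrays(const double* __restrict a,
                const double* __restrict b,
                double* __restrict out,
                std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Breaking the no-aliasing promise (e.g. passing overlapping arrays) is undefined behavior, so this is only safe when the arrays really are distinct.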

Fortran is still the go-to language for numerical computing.

0

How about a lazy functional language like Clojure? It is a Lisp, and like most Lisp dialects it has no for loop, but it has many other forms that operate more idiomatically over sequences. It can also help with your scaling issues, since operations are thread-safe and, the language being functional, there are fewer side effects. If you wanted to find all the "i" elements in a list in order to operate on them, you could do something like this:

 (def mylist ["i" "j" "i" "i" "j" "i"])
 (map #(= "i" %) mylist)

result

(true false true true false true)

-1
