There is no "universal" physics library. For example, can you imagine a useful soft-tissue surgery simulation that takes relativistic effects into account? You could come up with dozens more examples.
You mention both scientific and rigid-body simulation, so it is unclear how realistic you need it to be. A rigid body is an approximation: nothing is perfectly rigid. But if nothing deforms much in your simulation, you are fine with a bunch of unrealistic approximations to friction and fast motion (common to all video games), and you want an off-the-shelf solution, I suspect Havok running on a modern CPU will give you the best performance.
The PS3 is past its prime. Although I really enjoyed writing physics for it back in the day, I have to admit that a modern 6-core i7 gives you more performance, both in theory and in practice, than a single Cell.
CUDA is not yet proven for physics. I haven't written any CUDA physics myself, but I would be very interested to read about it :) Writing physics in CUDA is a pretty nontrivial problem if you want to get anywhere near the IPC (instructions per cycle) of a modern CPU, and I don't know of anyone who has done it successfully. And if you don't get close to CPU-level IPC, there is no point in CUDA physics, since it takes more effort.
Just do the math: the $500 Kepler GPU has 1,536 cores @ 1 GHz = 1.5 teraflops. The $590 Sandy Bridge CPU has 6 cores / 12 hyperthreads with 8-wide AVX @ 3.8 GHz = 0.36 teraflops. So if your GPU code is within roughly a factor of 4 to 5 of the CPU code's efficiency (i.e., on average it spends 4 or 5 GPU core-cycles on the work the CPU does in one lane-cycle), your theoretical CUDA physics runs at the same speed as CPU physics. Using 12 hyperthreads and 8-wide AVX SIMD effectively is not easy, mind you. But parallelizing physics across 1,536 (!) CUDA threads, which must execute very coherently and access memory in a much more controlled way, is a big feat too. I'm not saying it is impossible (and I would love to try, but I have a day job and other pet projects :)), but it will take some time before the physics community comes up with something that scales to thousands of threads.
And in the end, the speedup is only 5x or so... :)
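To make that back-of-the-envelope arithmetic explicit, here is a small Python sketch. It uses only the hardware figures quoted above and the same simplifications the estimate makes: one operation per lane per clock, no FMA or dual-issue, and each hyperthread counted as if it had its own 8-wide AVX unit. `peak_tflops` is a hypothetical helper written for illustration, not part of any library.

```python
# Peak-throughput estimate using the figures quoted above.
# Simplifying assumptions (same as the text): 1 op per lane per clock,
# FMA ignored, every hyperthread treated as a full 8-wide AVX unit.

def peak_tflops(units: int, simd_width: int, clock_ghz: float) -> float:
    """Peak throughput in tera-ops/s (10**12) for `units` execution
    units, each `simd_width` lanes wide, running at `clock_ghz`."""
    return units * simd_width * clock_ghz * 1e9 / 1e12

gpu = peak_tflops(units=1536, simd_width=1, clock_ghz=1.0)  # Kepler: 1,536 CUDA cores
cpu = peak_tflops(units=12, simd_width=8, clock_ghz=3.8)    # Sandy Bridge: 12 HT x 8-wide AVX
ratio = gpu / cpu

print(f"GPU peak: {gpu:.2f} TFLOPS")
print(f"CPU peak: {cpu:.2f} TFLOPS")
print(f"GPU/CPU peak ratio: {ratio:.1f}x")

# If the GPU code uses its peak only `efficiency` times as well as the CPU
# code uses its own, the real speedup is ratio * efficiency; at
# efficiency = 1/ratio the two break even.
for efficiency in (1.0, 0.5, 1.0 / ratio):
    print(f"relative efficiency {efficiency:.2f} -> speedup {ratio * efficiency:.1f}x")
```

With these inputs the peak ratio is about 4.2, which is the whole upside: it is only reached if the CUDA code is as efficient per lane as well-vectorized CPU code, and it shrinks to nothing once the GPU code is about 4x less efficient.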
In any case, if you are writing a sim yourself and you don't need general rigid-body simulation, CUDA may be your friend. For example, if you want to simulate the motion of all the stars in the Milky Way, with relativity but without supernovae and other discrete effects, it's clear how to distribute that across 1,536 (or more) threads. But if you want a pile of rigid bodies, simulated the same way games do it today, you are out of luck.
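The reason the star simulation parallelizes so easily is that, within one time step, each body's acceleration can be computed from the shared positions without touching any other body's result, so "one thread per body" falls out naturally. Rigid-body contact solving is different, since bodies in contact are coupled through the constraint solver. A minimal all-pairs sketch in plain Python (Newtonian gravity only, so it deliberately ignores the relativistic effects mentioned above; the `softening` length and function names are arbitrary choices for illustration) shows the independent per-body work that would become a CUDA kernel:

```python
import math

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2

def acceleration_on(i, positions, masses, softening=1.0e3):
    """Newtonian acceleration on body i from all other bodies.

    Reads the shared arrays and writes nothing shared, so calls for
    different i are independent: on a GPU this would be one thread.
    `softening` (metres, arbitrary) keeps the force finite up close.
    """
    xi, yi, zi = positions[i]
    ax = ay = az = 0.0
    for j, (xj, yj, zj) in enumerate(positions):
        if j == i:
            continue
        dx, dy, dz = xj - xi, yj - yi, zj - zi
        r2 = dx * dx + dy * dy + dz * dz + softening * softening
        s = G * masses[j] / (r2 * math.sqrt(r2))  # G * m_j / r^3
        ax += s * dx
        ay += s * dy
        az += s * dz
    return (ax, ay, az)

def step(positions, velocities, masses, dt):
    """Advance all bodies by dt with semi-implicit (symplectic) Euler.

    The comprehension over i plays the role of a kernel launch with one
    thread per body; no body's result depends on another's in this step.
    """
    acc = [acceleration_on(i, positions, masses) for i in range(len(positions))]
    velocities = [
        (vx + ax * dt, vy + ay * dt, vz + az * dt)
        for (vx, vy, vz), (ax, ay, az) in zip(velocities, acc)
    ]
    positions = [
        (x + vx * dt, y + vy * dt, z + vz * dt)
        for (x, y, z), (vx, vy, vz) in zip(positions, velocities)
    ]
    return positions, velocities
```

This brute-force version does O(n^2) work per step; for something the size of the Milky Way you would need a hierarchical method on top, but the per-body independence is what makes the thread mapping straightforward.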