Without additional information about the nature of your simulation I can only guess, so here are my assumptions.
Have you considered the issue of load balancing? I assume the loop distributes particles among threads, but if you have some finite-range (cut-off) interaction potential, then the computational cost per particle may vary across different areas of the simulation volume, depending on the local spatial density. This is a very common problem in molecular dynamics and it is very hard to solve in distributed-memory codes (MPI in most cases). Fortunately, with OpenMP every thread has direct access to all the particles, and therefore load balancing is much easier to achieve. It is not only easier but also built in, so to speak: simply change the scheduling of the for directive with the schedule(dynamic,chunk) clause, where chunk is a small number whose optimal value may vary from simulation to simulation. You can make chunk an input parameter of the program, or you can write schedule(runtime) instead and then experiment with different scheduling classes by setting the OMP_SCHEDULE environment variable to values such as "static", "dynamic,1", "dynamic,10", "guided", etc.
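A minimal sketch of what that could look like, assuming a typical per-particle force loop (force_on_particle here is a hypothetical placeholder for your own force kernel, not anything from your code):

```c
#include <omp.h>

/* Hypothetical placeholder for the actual force kernel. */
void force_on_particle(int i, const double *pos, double *force);

/* The work per particle depends on how many neighbours it has within the
 * cut-off, so a static schedule can leave some threads idle while others
 * are still busy; a small dynamic chunk evens this out. */
void compute_forces(int n_particles, const double *pos, double *force)
{
    #pragma omp parallel for schedule(dynamic, 32)  /* or schedule(runtime) */
    for (int i = 0; i < n_particles; i++)
        force_on_particle(i, pos, force);
}
```

With schedule(runtime) you could then try OMP_SCHEDULE="dynamic,10" (or "guided", etc.) without recompiling.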
Another possible source of performance degradation is false sharing and true sharing of data. False sharing occurs when your data layout is not suitable for concurrent modification. For example, if you store the 3D position and velocity of each particle (let's say you use the velocity Verlet integrator), then with IEEE 754 double precision each coordinate/velocity triplet takes 24 bytes. This means that a single 64-byte cache line contains 2 complete triplets and 2/3 of a third one. The consequence is that no matter how you distribute the particles among the threads, there will always be at least two threads that share a cache line. Suppose these threads run on different physical cores. If one thread writes to its copy of the cache line (for example, it updates a particle's position), the cache coherence protocol kicks in and invalidates the cache line in the other core, which then has to re-read it from a slower cache level or even from main memory. When the second thread updates its particle, this invalidates the cache line in the first core, and so on. The solution to this problem is proper padding and a proper choice of the chunk size, so that no two threads ever share a single cache line. For example, if you add a superfluous 4th dimension (you can use it to store the potential energy of the particle in the 4th element of the position vector and its kinetic energy in the 4th element of the velocity vector), then each position/velocity quadruplet takes 32 bytes, and the data of exactly two particles fits in one cache line. If you then distribute an even number of particles per thread, you automatically get rid of possible false sharing.
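For illustration, a padded layout along those lines might look like the following sketch (the type and field names, and the allocation helper, are my assumptions, not your actual code):

```c
#include <stdlib.h>

/* Each position (or velocity) entry is padded to 4 doubles = 32 bytes, so a
 * 64-byte cache line holds exactly two particles' worth of data. The spare
 * 4th components can hold the potential and kinetic energies. */
typedef double vec4[4];

typedef struct {
    vec4 *pos;  /* pos[i][0..2] = coordinates, pos[i][3] = potential energy */
    vec4 *vel;  /* vel[i][0..2] = velocity,    vel[i][3] = kinetic energy   */
    int   n;
} particles_t;

int particles_alloc(particles_t *p, int n)
{
    /* 64-byte alignment so that particle pairs coincide with cache lines;
     * the size is rounded up to a multiple of the alignment as C11 requires */
    size_t bytes = ((size_t)n * sizeof(vec4) + 63) & ~(size_t)63;
    p->n   = n;
    p->pos = aligned_alloc(64, bytes);
    p->vel = aligned_alloc(64, bytes);
    return (p->pos && p->vel) ? 0 : -1;
}
```

Handing out an even number of consecutive particles per thread (e.g. an even chunk in the schedule clause) then keeps each cache line private to a single thread.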
True sharing occurs when threads concurrently work on the same data structure and the parts of the structure modified by different threads overlap. In molecular dynamics simulations this happens very often, since we want to exploit Newton's third law to cut the computation time in half when dealing with pairwise interaction potentials. While one thread computes the force acting on particle i by enumerating its neighbours j, computing the force that j exerts on i automatically gives you the force that i exerts on j, so that contribution can be added to the total force on j. But j may belong to another thread and may be modified at the same time, so atomic operations have to be used for both updates (both, since another thread could update i if it happens to be a neighbour of one of that thread's own particles). Atomic updates on x86 are implemented with locked instructions. This is not as terribly slow as it is often presented, but still slower than a regular update. It also suffers from the same cache-line invalidation effect as false sharing. To get around this, at the price of increased memory usage, you can use per-thread local arrays to store partial force contributions and then perform a reduction at the end. The reduction itself has to be performed either serially or in parallel with locked instructions, so it may turn out that not only is there no benefit from this approach, it may even be slower. Proper particle sorting and a clever distribution among the processing elements that minimises the interface regions can alleviate the problem.
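Here is a rough sketch of the atomic-update variant under those assumptions (num_neighbours, neighbour and pair_force are made-up placeholders for your own neighbour list and pair kernel):

```c
#include <omp.h>

/* Hypothetical neighbour-list interface and pair-force kernel. */
int  num_neighbours(int i);
int  neighbour(int i, int k);
void pair_force(int i, int j, const double *pos, double f_ij[3]);

void compute_forces(int n, const double *pos, double *force)
{
    #pragma omp parallel for schedule(dynamic, 32)
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < num_neighbours(i); k++) {
            int j = neighbour(i, k);
            if (j <= i) continue;      /* handle each pair once (Newton's 3rd law) */

            double f[3];
            pair_force(i, j, pos, f);  /* force exerted by j on i */

            for (int d = 0; d < 3; d++) {
                /* i and j may be "owned" by other threads, hence the atomics */
                #pragma omp atomic
                force[3*i + d] += f[d];
                #pragma omp atomic
                force[3*j + d] -= f[d];
            }
        }
    }
}
```

The per-thread buffer variant would instead give each thread its own force array of length 3*n, accumulate into it without atomics, and sum the buffers into force at the end.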
Another thing I would like to touch upon is memory bandwidth. Depending on your algorithm, there is a certain ratio between the number of data elements fetched and the number of floating-point operations performed in each iteration of the loop. Each processor has only a limited bandwidth available for memory fetches, and if your data does not fit in the CPU caches, it may happen that the memory bus cannot deliver enough data to keep that many threads on one socket busy. Your Core i3-2370M has only 3 MiB of L3 cache, so if you explicitly store the position, velocity and force of each particle, you can only keep about 43,000 particles in the L3 cache and about 3,600 particles in the L2 cache (or about 1,800 particles per hyperthread).
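For reference, the arithmetic behind those numbers (assuming 8-byte doubles and unpadded 3-component vectors, with nothing else competing for the cache):

```
position + velocity + force = 3 vectors x 3 components x 8 bytes = 72 bytes per particle
L3:   3 MiB = 3,145,728 B / 72 B ≈ 43,000 particles
L2: 256 KiB =   262,144 B / 72 B ≈  3,600 particles per core
            ≈ 1,800 particles per hyperthread when two hyperthreads share one L2
```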
The last thing is hyper-threading. As High Performance Mark has already noted, hyperthreads share a great deal of the core hardware. For example, there is only one AVX vector FPU engine, shared between both hyperthreads. If your code is not vectorised, you lose a great deal of the raw processing power available in your CPU. If your code is vectorised, then both hyperthreads will get into each other's way as they fight for control over the AVX engine. Hyper-threading is useful only when it can hide memory latency by overlapping computation (in one hyperthread) with memory loads (in the other hyperthread). With dense numerical codes that perform many register operations before they load/store memory, hyper-threading gives no advantage, and you are better off running with half the number of threads and explicitly binding them to different cores so that the OS scheduler does not run them as hyperthreads. The scheduler on Windows is particularly dumb in this respect; see here for an example rant. Intel's OpenMP implementation supports various binding strategies controlled by environment variables (e.g. KMP_AFFINITY), and so does GNU's (via GOMP_CPU_AFFINITY); I do not know how to control thread binding (a.k.a. affinity masks) in Microsoft's OpenMP implementation.
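If you want to check where your threads actually end up, a quick Linux-specific test like the following can help (sched_getcpu() is a glibc extension); run it with, e.g., OMP_NUM_THREADS=2 OMP_PROC_BIND=true and verify that the two threads report different physical cores:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* sched_getcpu() reports the logical CPU the calling thread runs on */
        printf("thread %d of %d runs on logical CPU %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}
```

The mapping of logical CPUs to physical cores varies between systems; the "core id" and "physical id" fields in /proc/cpuinfo show which logical CPUs are hyperthread siblings of the same core.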