I donโt know what information you have, but parallelizing an irregular, strictly synchronized algorithm with small tasks is one of the most difficult parallel problems that may arise. Research teams can devote themselves to such tasks and get limited acceleration or nowhere with them. Often, such algorithms only work on certain architectures that are adapted for parallelization, and debilitating overhead, such as spurious exchanges, have been eliminated by properly designing data structures.
Such an algorithm requires a lot of time and effort for profiling, measuring and reviewing. See, for example, this article.
ww2.cs.fsu.edu/~flin/ppq_report.pdf
Now, to your direct question, since your algorithm is highly synchronized and tasks are small, you experience a side effect of data calculations. To remove them from your parallel algorithm, it will be very difficult, and no one will be able to do this for you.
So, your first point of call is to look at tools to help you detect data races such as Valgrind and Intel thread checking.
Jon e source share