For anyone who finds this question:
We were never able to fully diagnose the problem. Our hunch is that connecting to the database simply fails every once in a while, for whatever reason. From what we have read about distributed computing, this is a common problem that has to be handled explicitly.
In the end, we made our system resilient to database connection failures by catching OperationFailure and similar exceptions and then restoring the database connection. This solved the problem for us, as it has for a number of other people in the same situation.
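
A minimal sketch of this catch-and-reconnect pattern, assuming Python. The names `run_with_reconnect`, `connect`, `operation`, `attempts` and `delay` are hypothetical, and the local `OperationFailure` class is only a stand-in so the snippet runs without a driver installed. With PyMongo you would probably pass `pymongo.errors.OperationFailure` (and likely `pymongo.errors.AutoReconnect`) in `retry_on` instead.

```python
import time


class OperationFailure(Exception):
    """Stand-in for the driver's exception; an assumption for this sketch."""


def run_with_reconnect(operation, connect, retry_on=(OperationFailure,),
                       attempts=3, delay=0.0):
    """Call operation(conn); on a listed exception, reconnect and try again.

    connect   -- hypothetical zero-argument factory returning a connection
    operation -- hypothetical callable that takes the connection
    """
    conn = connect()
    for attempt in range(1, attempts + 1):
        try:
            return operation(conn)
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: let the caller see the failure
            time.sleep(delay)  # brief pause before trying again
            conn = connect()   # restore the database connection
```

Every database call is then routed through the wrapper, for example `run_with_reconnect(lambda conn: conn.find_user(42), connect)`, where `find_user` is again just an illustrative name. Retrying blindly is only safe for operations that can be repeated without side effects, so writes may need extra care.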