I have a problem with MongoDB becoming unresponsive and refusing connections in production.
During testing, I can play the “lock” sequentially, sending a lot of requests to my application.
Here is an example output from mongostat:
0 0 0 0 0 1 0 1.3g 1.8g 177m 0 0 0 0|0 0|0 62b 1k 18 19:40:11 0 0 0 0 0 1 0 1.3g 1.8g 177m 0 0 0 0|0 0|0 62b 1k 18 19:40:12 --------------LOAD TEST BEGINS----------------------- 1 56 0 0 1 58 0 1.3g 1.81g 177m 0 0 0 0|0 0|0 10k 302k 18 19:40:13 10 116 0 0 2 127 0 1.3g 1.81g 177m 0 0 0 0|0 1|0 32k 263k 19 19:40:14 insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time 206 431 0 0 2 638 0 1.3g 1.82g 177m 0 1.1 0 0|0 0|0 206k 339k 22 19:40:15 331 649 0 0 2 983 0 1.3g 1.81g 177m 0 1 0 0|0 0|0 317k 521k 21 19:40:16 367 723 0 0 10 1090 0 1.3g 1.81g 177m 0 0.9 0 0|1 1|1 297k 1m 20 19:40:17 500 961 0 0 11 1461 0 1.3g 1.81g 177m 0 1.2 0 0|0 0|0 390k 1m 19 19:40:18 543 1073 0 0 10 1618 0 1.3g 1.82g 177m 0 1.2 0 0|0 2|0 433k 1m 21 19:40:19 461 909 0 0 11 1370 0 1.3g 1.81g 177m 0 1.1 0 0|0 2|0 365k 1m 20 19:40:20 347 685 0 0 6 1033 0 1.3g 1.81g 177m 0 0.8 0 0|0 1|0 411k 962k 21 19:40:21 482 938 0 0 9 1421 0 1.3g 1.81g 177m 0 1.1 0 0|0 0|0 492k 1m 20 19:40:22 444 877 0 0 10 1322 0 1.3g 1.81g 177m 0 1.1 0 0|0 0|1 479k 1m 22 19:40:23 528 1019 0 0 9 1549 0 1.3g 1.81g 177m 0 1.2 0 0|0 0|0 446k 1m 21 19:40:24 insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time 520 1033 0 0 10 1554 0 1.3g 1.81g 177m 0 1.2 0 0|0 1|0 418k 1m 22 19:40:25 448 878 0 0 10 1326 0 1.3g 1.81g 177m 0 1.2 0 0|0 1|0 349k 1m 22 19:40:26 503 979 0 0 8 1483 0 1.3g 1.82g 177m 0 1.3 0 0|0 1|0 397k 1m 21 19:40:27 379 743 0 0 10 1123 0 1.3g 1.81g 177m 0 1 0 0|0 0|0 315k 1m 22 19:40:28 456 896 0 0 7 1354 0 1.3g 1.81g 177m 0 1.1 0 0|0 0|0 436k 1m 22 19:40:29 371 715 0 0 10 1086 0 1.3g 1.81g 175m 0 0.9 0 0|0 0|0 299k 1m 20 19:40:30 505 993 0 0 7 1499 0 1.3g 1.81g 175m 0 1.2 0 0|0 2|0 394k 1m 21 19:40:31 493 1023 0 0 10 1519 0 1.3g 1.82g 175m 0 1.1 0 0|0 1|0 392k 1m 22 19:40:32 426 868 0 0 11 1295 0 1.3g 1.81g 170m 0 1 0 0|0 1|0 363k 1m 21 19:40:33 352 772 0 0 7 1125 0 1.3g 1.81g 163m 19 4.4 0 0|1 0|1 368k 1m 22 19:40:34 insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time 330 656 0 0 5 985 0 1.3g 1.81g 159m 4 1.5 0 0|0 0|0 325k 876k 19 19:40:35 398 817 0 0 8 1218 0 1.3g 1.81g 155m 9 1.3 0 0|0 1|0 372k 1m 21 19:40:36 464 1015 0 0 8 1479 0 1.3g 1.81g 155m 2 1.2 0 0|0 1|0 381k 1m 21 19:40:37 457 915 0 0 11 1372 0 1.3g 1.81g 155m 0 1.1 0 0|0 0|0 356k 1m 22 19:40:38 458 960 0 0 7 1419 0 1.3g 1.83g 155m 0 1.1 0 0|0 0|0 376k 1m 22 19:40:39 391 760 0 0 11 1153 0 1.3g 1.81g 155m 0 1 0 0|0 1|0 321k 1m 21 19:40:40 315 639 0 0 5 956 0 1.3g 1.81g 155m 0 0.8 0 0|0 0|0 325k 877k 21 19:40:41 303 696 0 0 7 998 0 1.3g 1.81g 155m 0 0.8 0 0|0 0|0 279k 1m 21 19:40:42 388 772 0 0 8 1163 0 1.3g 1.81g 155m 0 1 0 0|0 0|0 308k 1m 20 19:40:43 273 623 0 0 7 895 0 1.3g 1.81g 155m 0 0.7 0 0|0 2|0 232k 1m 20 19:40:44 insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time 397 796 0 0 7 1194 0 1.3g 1.81g 155m 0 0.9 0 0|0 0|0 311k 1m 19 19:40:45 347 754 0 0 4 1103 0 1.3g 1.81g 155m 0 0.9 0 0|0 0|0 329k 771k 20 19:40:46 336 700 0 0 8 1042 0 1.3g 1.82g 160m 20 0.8 0 2|2 3|2 370k 1m 23 19:40:47 272 525 0 0 5 805 0 1.3g 1.79g 167m 188 0.6 0 0|0 0|0 306k 818k 18 19:40:48 103 246 0 0 1 350 0 1.3g 1.78g 166m 0 0.4 0 0|0 0|0 136k 234k 15 19:40:49 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:50 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:51 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:52 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:53 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:54 insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:55 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:56 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:57 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:58 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:59 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:00 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:01 0 3 0 0 0 4 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 775b 4k 15 19:41:02 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:03 0 0 0 0 0 1 1 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:04 insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:05 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:06 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:07 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:08 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:09 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:10 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:11 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:12 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:13 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:14 insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:15 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:16 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:17 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:18 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:19 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:20 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:21 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:22 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:23 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:24 insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:25 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:26 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:27 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:28 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:29 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:30 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:31 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:32 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:33 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:34 insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:35 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:36 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:37 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:38 0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:39 0 28 0 0 0 29 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 5k 27k 15 19:41:40 86 234 0 0 1 322 0 1.3g 1.79g 166m 0 0.2 0 0|0 1|0 145k 217k 17 19:41:41 216 445 0 0 5 663 0 1.3g 1.79g 167m 0 0.5 0 0|0 0|0 220k 807k 16 19:41:42 302 612 0 0 6 916 0 1.3g 1.79g 167m 0 0.6 0 0|0 0|0 236k 950k 16 19:41:43 113 247 0 0 1 362 0 1.3g 1.79g 167m 0 0.2 0 0|0 0|0 95k 192k 17 19:41:44 insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time 38 63 0 0 2 103 0 1.3g 1.79g 167m 0 0 0 0|0 0|0 26k 289k 16 19:41:45 --------------LOAD TEST ENDS----------------------- 0 0 0 0 0 2 0 1.3g 1.79g 167m 0 0 0 0|0 0|0 124b 2k 16 19:41:46 0 0 0 0 0 2 0 1.3g 1.79g 167m 0 0 0 0|0 0|0 124b 2k 16 19:41:47 0 0 0 0 0 2 0 1.3g 1.79g 167m 0 0 0 0|0 0|0 124b 2k 16 19:41:48 0 0 0 0 0 2 0 1.3g 1.79g 167m 0 0 0 0|0 0|0 124b 2k 16 19:41:49
As you can see, in the middle of the load test, reading and writing are falling almost nothing. During this time, I cannot connect locally to MongoDB via mongo or mongostat (the connection was refused). As you can see, the connections are pretty low, and at max. Connections for Mongo There are 16,000 connections.
I can reproduce it sequentially. However, each time the “lock” has a different length. I watched 51 seconds, 20 seconds, 28 seconds. In each case, I am still connected to mongo via mongostat (and getting the output), but new connections will be refused.
user@myhostname:~
What could be the problem? I am running MongoDB v1.8.2.
Aside, I can reproduce a similar problem where the teams do not drop out completely, but have a low volume and fluctuate wildly.
mongodb
Homer6
source share