RS102 MongoDB on ReplicaSet

I installed a replica set with 4 servers.

For testing purposes, I wrote a script to populate my database with up to ~150 million photos stored in GridFS. Each photo is about ~15 KB. (Using GridFS for files this small shouldn't be a problem, should it?)

A few hours later, about 50 million photos had been inserted, but I found this message in the logs:

replSet error RS102 too stale to catch up, at least from 192.168.0.1:27017 

And here is the replSet status:

  rs.status();
  {
      "set" : "rsdb",
      "date" : ISODate("2012-07-18T09:00:48Z"),
      "myState" : 1,
      "members" : [
          {
              "_id" : 0,
              "name" : "192.168.0.1:27017",
              "health" : 1,
              "state" : 1,
              "stateStr" : "PRIMARY",
              "optime" : { "t" : 1342601552000, "i" : 245 },
              "optimeDate" : ISODate("2012-07-18T08:52:32Z"),
              "self" : true
          },
          {
              "_id" : 1,
              "name" : "192.168.0.2:27018",
              "health" : 1,
              "state" : 3,
              "stateStr" : "RECOVERING",
              "uptime" : 64770,
              "optime" : { "t" : 1342539026000, "i" : 5188 },
              "optimeDate" : ISODate("2012-07-17T15:30:26Z"),
              "lastHeartbeat" : ISODate("2012-07-18T09:00:47Z"),
              "pingMs" : 0,
              "errmsg" : "error RS102 too stale to catch up"
          },
          {
              "_id" : 2,
              "name" : "192.168.0.3:27019",
              "health" : 1,
              "state" : 3,
              "stateStr" : "RECOVERING",
              "uptime" : 64735,
              "optime" : { "t" : 1342539026000, "i" : 5188 },
              "optimeDate" : ISODate("2012-07-17T15:30:26Z"),
              "lastHeartbeat" : ISODate("2012-07-18T09:00:47Z"),
              "pingMs" : 0,
              "errmsg" : "error RS102 too stale to catch up"
          },
          {
              "_id" : 3,
              "name" : "192.168.0.4:27020",
              "health" : 1,
              "state" : 3,
              "stateStr" : "RECOVERING",
              "uptime" : 65075,
              "optime" : { "t" : 1342539085000, "i" : 3838 },
              "optimeDate" : ISODate("2012-07-17T15:31:25Z"),
              "lastHeartbeat" : ISODate("2012-07-18T09:00:46Z"),
              "pingMs" : 0,
              "errmsg" : "error RS102 too stale to catch up"
          }
      ],
      "ok" : 1
  }

The set still accepts writes, but with 3 of my servers effectively down (stuck in RECOVERING), how should I repair them? Is there a better way than deleting their data and re-syncing (which would work, but is slow)?

And above all: did this happen because my script was hammering the set too hard? Does that mean it would almost never happen in production?

mongodb gridfs
1 answer

You do not need to run a repair; just do a full resync of the stale members.

On each stale secondary, you can:

  • stop the stale mongod
  • delete all the data in its dbpath (including subdirectories)
  • restart it and it will automatically resync itself (a sketch of these steps follows below)
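
A minimal sketch of that procedure, assuming the stale member is 192.168.0.2:27018 and that its dbpath is /data/db (both assumptions; adjust to your own configuration):

  // 1. From a mongo shell connected to the stale secondary, stop it cleanly:
  use admin
  db.shutdownServer()

  // 2. On that host's OS shell (not the mongo shell), wipe the dbpath:
  //      rm -rf /data/db/*

  // 3. Restart mongod with the same --replSet rsdb option; it will perform
  //    an initial sync from the primary automatically.

  // 4. From the primary, watch the member move through STARTUP2/RECOVERING
  //    back to SECONDARY:
  rs.status()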

Follow the instructions here.

What happened in your case is that your secondaries went stale, i.e. there is no longer a common point between their oplogs and the oplog on the primary. Take a look at this document, which explains the various member states. Writes to the primary must be replicated to the secondaries, and yours could not keep up until they eventually went stale. You will need to resize the oplog.
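
To see how much time the primary's oplog currently covers and how far behind each secondary is, the standard shell helpers are enough (a sketch; run them against your own members):

  // On the primary: shows the oplog size and the time range it covers.
  db.printReplicationInfo()

  // On the primary: shows how far each secondary's optime lags behind.
  db.printSlaveReplicationInfo()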

As for the size of the oplog, it depends on how much data you insert/update over time. I would choose a size that gives you many hours, or even days, of oplog.

Also, I'm not sure which O/S you are using, but on 64-bit Linux, Solaris, and FreeBSD systems MongoDB allocates 5% of the free disk space to the oplog; if that amount is smaller than a gigabyte, it allocates 1 gigabyte. On 64-bit OS X systems MongoDB allocates 183 megabytes to the oplog, and on 32-bit systems about 48 megabytes.
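
If the default is too small for your write rate, note that the oplog size is fixed when the member's mongod starts, so it has to be set explicitly at startup (growing it later requires a resize procedure or a resync). A sketch; the 10240 MB value is only an illustration:

  // Start (or restart) the member with an explicit oplog size, in MB:
  //   mongod --replSet rsdb --oplogSize 10240
  // or, in the config file:
  //   oplogSize = 10240

  // Then confirm the resulting size from the shell:
  db.getReplicationInfo().logSizeMB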

How big should you make it? That depends on whether this insert load is typical of your workload or an abnormal burst that you were only using for testing.

For example, at 2,000 documents per second with 1 KB documents, you generate roughly 120 MB of oplog per minute, so a 5 GB oplog covers about 40 minutes. That means that if a secondary is ever down for 40 minutes, or falls more than 40 minutes behind, it goes stale and has to do a full resync.
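
That estimate is plain arithmetic, so you can redo it for your own workload. A small sketch in the shell, using the assumed figures from the example above (not measurements from your cluster):

  var docsPerSec = 2000;             // inserts per second
  var docSizeKB  = 1;                // ~1 KB per document
  var oplogMB    = 5 * 1024;         // a 5 GB oplog

  var mbPerMin  = docsPerSec * docSizeKB * 60 / 1024;   // ~117 MB per minute
  var windowMin = oplogMB / mbPerMin;                    // ~44 minutes
  print("oplog covers roughly " + Math.round(windowMin) + " minutes");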

I also recommend reading the document here. You have 4 members in your replica set, which is not recommended: you should have an odd number of voting members for primary elections, so either add an arbiter, add another secondary, or remove one of your secondaries.
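
For example, adding an arbiter is a one-liner on the primary once an arbiter mongod is running somewhere; the host below is a hypothetical placeholder:

  // Run on the primary; 192.168.0.5:27021 stands in for the host where you
  // started the arbiter mongod (with --replSet rsdb).
  rs.addArb("192.168.0.5:27021")
  rs.status()   // the new member should report stateStr: "ARBITER"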

Finally, here is a detailed RS administration document.
