I installed a replica set with 4 servers.
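For reference, here is a minimal sketch of how a 4-member set like mine can be initiated (assuming Python with pymongo; the set name "rsdb" and the hosts are the ones that appear in the rs.status() output further down, and the same config document could be passed to rs.initiate() in the mongo shell):

    # Minimal sketch (assuming pymongo): initiate a 4-member replica set "rsdb".
    # Equivalent to running rs.initiate(config) in the mongo shell on member 0.
    from pymongo import MongoClient

    # Connect straight to the node that will become member 0.
    client = MongoClient("192.168.0.1", 27017, directConnection=True)

    config = {
        "_id": "rsdb",
        "members": [
            {"_id": 0, "host": "192.168.0.1:27017"},
            {"_id": 1, "host": "192.168.0.2:27018"},
            {"_id": 2, "host": "192.168.0.3:27019"},
            {"_id": 3, "host": "192.168.0.4:27020"},
        ],
    }
    client.admin.command("replSetInitiate", config)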
For testing purposes, I wrote a script that populates my database with up to ~150 million photos stored via GridFS. Each photo is about ~15 KB. (Using GridFS for files this small shouldn't be a problem, should it?!)
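Roughly, the script does something along these lines (a minimal sketch, assuming Python with pymongo and its gridfs module; photo_bytes() is just a placeholder for loading one ~15 KB image, and "testdb" is a placeholder database name):

    # Minimal sketch of the load loop (assuming pymongo and its gridfs module).
    # photo_bytes() is a placeholder for producing one ~15 KB image.
    import os

    import gridfs
    from pymongo import MongoClient

    client = MongoClient("192.168.0.1", 27017)     # writes go to the primary
    fs = gridfs.GridFS(client.testdb)              # "testdb" is a placeholder name

    def photo_bytes():
        return os.urandom(15 * 1024)               # stands in for a real ~15 KB photo

    for i in range(150 * 1000 * 1000):
        fs.put(photo_bytes(), filename="photo_%d.jpg" % i)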
A few hours later, about 50 million photos had been inserted, but this message appeared in the logs:
replSet error RS102 too stale to catch up, at least from 192.168.0.1:27017
And here is the replSet status:
rs.status();
{
    "set" : "rsdb",
    "date" : ISODate("2012-07-18T09:00:48Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "192.168.0.1:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "optime" : {
                "t" : 1342601552000,
                "i" : 245
            },
            "optimeDate" : ISODate("2012-07-18T08:52:32Z"),
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "192.168.0.2:27018",
            "health" : 1,
            "state" : 3,
            "stateStr" : "RECOVERING",
            "uptime" : 64770,
            "optime" : {
                "t" : 1342539026000,
                "i" : 5188
            },
            "optimeDate" : ISODate("2012-07-17T15:30:26Z"),
            "lastHeartbeat" : ISODate("2012-07-18T09:00:47Z"),
            "pingMs" : 0,
            "errmsg" : "error RS102 too stale to catch up"
        },
        {
            "_id" : 2,
            "name" : "192.168.0.3:27019",
            "health" : 1,
            "state" : 3,
            "stateStr" : "RECOVERING",
            "uptime" : 64735,
            "optime" : {
                "t" : 1342539026000,
                "i" : 5188
            },
            "optimeDate" : ISODate("2012-07-17T15:30:26Z"),
            "lastHeartbeat" : ISODate("2012-07-18T09:00:47Z"),
            "pingMs" : 0,
            "errmsg" : "error RS102 too stale to catch up"
        },
        {
            "_id" : 3,
            "name" : "192.168.0.4:27020",
            "health" : 1,
            "state" : 3,
            "stateStr" : "RECOVERING",
            "uptime" : 65075,
            "optime" : {
                "t" : 1342539085000,
                "i" : 3838
            },
            "optimeDate" : ISODate("2012-07-17T15:31:25Z"),
            "lastHeartbeat" : ISODate("2012-07-18T09:00:46Z"),
            "pingMs" : 0,
            "errmsg" : "error RS102 too stale to catch up"
        }
    ],
    "ok" : 1
}
The set still accepts writes, but with 3 of my servers stuck in RECOVERING, how should I proceed with the repair? (Other than deleting their data and resyncing them from scratch, which would work, but is there a better way?)
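As I understand it, a member becomes "too stale" once its last applied optime (the optimeDate above) falls behind the oldest entry still kept in the primary's oplog; here is a rough sketch (assuming Python with pymongo) of how that oplog window can be measured:

    # Rough sketch (assuming pymongo): measure the primary's oplog window.
    # A secondary whose last applied optime is older than the first entry
    # below can no longer catch up and needs a full resync.
    from pymongo import MongoClient

    client = MongoClient("192.168.0.1", 27017)      # the current primary
    oplog = client.local["oplog.rs"]

    first = oplog.find_one(sort=[("$natural", 1)])  # oldest retained operation
    last = oplog.find_one(sort=[("$natural", -1)])  # newest operation

    window = last["ts"].time - first["ts"].time     # "ts" is a BSON Timestamp
    print("oplog window: %.1f hours" % (window / 3600.0))

In my case the secondaries' optimeDate (2012-07-17T15:30) is more than 17 hours behind the primary's (2012-07-18T08:52), which I assume is well past that window.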
And above all: is this happening because my script writes too aggressively, meaning it would almost never occur under a normal production load?