I created a collection in MongoDB consisting of 11,446,615 documents.
Each document has the following form:
{ "_id" : ObjectId("4e03dec7c3c365f574820835"), "httpReferer" : "http://www.somewebsite.pl/art.php?id=13321&b=1", "words" : ["SEX", "DRUGS", "ROCKNROLL", "WHATEVER"], "howMany" : 3 }
httpReferer: the source URL only.
words: words parsed from the page at that URL; each list holds between 15 and 90 words.
I plan to use this database to get a list of web pages that have similar content.
I will query this collection by the words field, so I have created (or rather have started creating) an index on that field:
db.my_coll.ensureIndex({words: 1})
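For context, a lookup by a single word against this index would look roughly like the sketch below; the search term is just one of the values from the sample document above:

// "words" holds an array, so this is a multikey index: a scalar
// match finds every document whose words array contains the term.
db.my_coll.find({words: "ROCKNROLL"})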
Creating this collection takes a lot of time. I tried two approaches (the tests below were done on my laptop):
- Insert first, then index: inserting takes 5.5 hours, mainly due to intensive data preprocessing; indexing took another 30 hours (this approach is sketched after the list).
- Index first, then insert: it takes several days to insert all the data into the collection.
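A rough mongo shell sketch of the first approach, where docs is a hypothetical array of already-preprocessed documents:

// Insert everything while no index exists, so each insert
// pays no per-document index-maintenance cost.
docs.forEach(function (doc) {
    db.my_coll.insert(doc);
});
// Then build the multikey index in one pass over the data.
db.my_coll.ensureIndex({words: 1})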
My main focus is on reducing the time it takes to build the collection. I don't need replication (at least not yet), and queries don't have to be fast either.
Now, time for the question:
I have only one machine with a single disk on which I can run my application. Does it make sense to run more than one database instance and shard my data between them?