Before diving into my question, I would like to point out that I am doing this partly to familiarize myself with Node and Mongo. I realize there are probably better ways to accomplish my ultimate goal, but what I want to get out of this is a general methodology that can be applied to other situations.
Purpose:
I have a CSV file containing 6 million geo-ip records. Each record contains only 4 fields, and the file is approximately 180 MB.
I want to process this file and insert each record into a MongoDB collection called "Blocks". Each "Block" will hold the 4 fields from the CSV file.
My current approach
I use mongoose to create the "Block" model and a ReadStream to process the file line by line. The code that processes the file and extracts the records works, and I can make it print each record to the console if I want.
For each record in the file, it calls a function that creates a new Block object (using mongoose), fills in the fields, and saves it.
This is the code inside the function that is called every time a line is read and parsed. The variable "rec" contains an object representing one record from the file.
    block = new Block();
    block.ipFrom = rec.startipnum;
    block.ipTo = rec.endipnum;
    block.location = rec.locid;
    connections++;
    block.save(function(err){
        if(err) throw err;
        //console.log('.');
        records_inserted++;
        if( --connections == 0 ){
            mongoose.disconnect();
            console.log( records_inserted + ' records inserted' );
        }
    });
Problem
Since the file is read asynchronously, several lines are processed at the same time, and the file is read much faster than MongoDB can write. The whole process stalls at about 282,000 records, with 5k+ concurrent Mongo connections open. It does not crash; it just sits there doing nothing and does not seem to recover, and the number of items in the Mongo collection stops increasing.
What I am looking for here is a general approach to solving this problem. How do I limit the number of concurrent Mongo connections? I would like to take advantage of being able to insert multiple records at the same time, but I am missing a way to regulate the flow.
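To make the kind of flow control I am after more concrete, here is a minimal sketch of the shape I have in mind: pause the read stream once a fixed number of saves are pending, and resume it when one completes. This is only an illustration under assumptions. `fakeSave` is a hypothetical stand-in for `block.save`, `insertWithLimit` and `MAX_IN_FLIGHT` are names I made up, and the limit of 5 is arbitrary. I have not tried this against the real file or database.

```javascript
// Sketch only: fakeSave stands in for block.save(); it is not a mongoose API.
const { Readable } = require('stream');

const MAX_IN_FLIGHT = 5; // hypothetical limit, would need tuning for a real database

// Hypothetical stand-in for block.save(callback): completes asynchronously.
function fakeSave(rec, callback) {
  setTimeout(() => callback(null), 1);
}

// Consumes records from an object-mode stream and never lets more than
// `limit` saves be pending at once. Calls done(err, stats) exactly once.
function insertWithLimit(source, save, limit, done) {
  let inFlight = 0;   // saves started but not yet finished
  let inserted = 0;   // saves finished successfully
  let peak = 0;       // highest value inFlight ever reached
  let ended = false;  // true once the stream has delivered every record
  let failed = false; // true once done(err) has been called

  function finishIfIdle() {
    if (ended && inFlight === 0 && !failed) done(null, { inserted, peak });
  }

  source.on('data', (rec) => {
    inFlight++;
    peak = Math.max(peak, inFlight);
    // Stop reading until at least one pending save has completed.
    if (inFlight >= limit) source.pause();

    save(rec, (err) => {
      if (failed) return;
      if (err) { failed = true; source.destroy(); return done(err); }
      inFlight--;
      inserted++;
      if (!ended) source.resume(); // room for another record
      finishIfIdle();
    });
  });

  source.on('end', () => { ended = true; finishIfIdle(); });
  source.on('error', (err) => { if (!failed) { failed = true; done(err); } });
}
```

In my case the stream would be the parsed CSV stream and `save` would wrap the `block.save` call from above, with `mongoose.disconnect()` moved into the `done` callback.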
Thanks in advance.
Suited sloth