I have the following JSON file:
sensorlogs.json

{"arr":[
{"UTCTime":10000001,"s1":22,"s2":32,"s3":42,"s4":12},
{"UTCTime":10000002,"s1":23,"s2":33,"s4":13},
{"UTCTime":10000003,"s1":24,"s2":34,"s3":43,"s4":14},
{"UTCTime":10000005,"s1":26,"s2":36,"s3":44,"s4":16},
{"UTCTime":10000006,"s1":27,"s2":37,"s4":17},
{"UTCTime":10000004,"s1":25,"s2":35,"s4":15},
...
{"UTCTime":12345678,"s1":57,"s2":35,"s3":77,"s4":99}
]}
Sensors s1, s2, s3, etc. all transmit at different frequencies (note that s3 transmits every 2 seconds, and timestamps may be out of order).
How can I achieve output like the following?
Analyzing s1:
s = [[10000001, 22], [10000002, 23], ... [12345678, 57]]
s1 had 2 missing entries

Analyzing s2:
s = [[10000001, 32], [10000002, 33], ... [12345678, 35]]
s2 had 0 missing entries

Analyzing s3:
s = [[10000001, 42], [10000003, 43], ... [12345678, 77]]
s3 had 0 missing entries

Analyzing s4:
s = [[10000001, 12], [10000003, 13], ... [12345678, 99]]
s4 had 1 missing entry
sensorlogs.json is 16 GB.
Missing entries can be detected from the difference between successive UTC timestamps, since each sensor transmits at a known frequency.
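To make the rule concrete: once one sensor's readings are collected and sorted by timestamp, the missing count could be computed like this (countMissing and the sample values are illustrative, not part of my actual code):

```javascript
// Sketch: count missing entries for one sensor, assuming its readings
// have already been collected as [timestamp, value] pairs and sorted
// by time. "period" is the sensor's known transmission interval in
// seconds (e.g. 1 for s1/s2/s4, 2 for s3 in the sample above).
function countMissing(samples, period) {
  var missing = 0;
  for (var i = 1; i < samples.length; i++) {
    var delta = samples[i][0] - samples[i - 1][0];
    // A gap of exactly one period means nothing is missing;
    // each extra period in the gap is one missing entry.
    missing += Math.round(delta / period) - 1;
  }
  return missing;
}

// Example: with a 1-second period, timestamp 10000003 is missing here.
console.log(countMissing([[10000001, 22], [10000002, 23], [10000004, 25]], 1)); // 1
```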
I cannot keep several large arrays in memory at once, so I will have to make several passes over the same JSON log file, using only one large array at a time for the analysis.
What I have so far is:
var fs = require('fs');
var readline = require('readline');
var async = require('async');

var tmpObj = {};   // set of sensor names seen so far
var arrStrm = [];  // list of sensor names
var currSensor;
var result = [];   // the single large array, reused per sensor

//1. Extract all the keys from the log file
console.log("Extracting keys...\n");
var stream = fs.createReadStream(filePath);
var lineReader = readline.createInterface({ input: stream });
lineReader.on('line', function (line) {
    getKeys(line); // extract all the keys from the JSON
});
lineReader.on('close', function () {
    //obj -> arr
    for (var key in tmpObj)
        arrStrm.push(key);

    //2. Validate individual sensors
    console.log("Validating the sensor data...\n");
    // Process the sensors one at a time (eachSeries, not each, so the
    // passes run sequentially and can safely share the globals)
    async.eachSeries(arrStrm, function (key, callback) {
        currSensor = key;
        console.log("validating " + currSensor + "...\n");
        stream = fs.createReadStream(filePath);
        lineReader = readline.createInterface({ input: stream });
        lineReader.on('line', function (line) {
            processLine(line); // build the array for the current sensor
        });
        lineReader.on('close', function () {
            processSensor(currSensor); // process the data for the current sensor
            callback();                // start the next sensor's pass
        });
    });
});

function getKeys(line) {
    if (line.indexOf('[') >= 0 || line.indexOf(']') >= 0)
        return;
    if (line[line.length - 1] == '\r')
        line = line.substr(0, line.length - 1); // discard CR (0x0D)
    if (line[line.length - 1] == ',')
        line = line.substr(0, line.length - 1); // discard trailing comma
    if (line.length > 1) {                      // ignore empty lines
        var obj = JSON.parse(line);             // parse the JSON
        for (var key in obj) {
            if (key != "debug" && tmpObj[key] == undefined)
                tmpObj[key] = [];
        }
    }
}
Of course, this does not work as intended, and I cannot find anything online that explains how to implement this.
Note: I can choose any language for developing this tool (C/C++, C#, Java, Python), but I leaned towards JavaScript because of how easily it parses JSON (and my interest in improving my JS). Would someone suggest an alternative language if JavaScript is not the best fit for such a tool?
Edit: Some important information that was either unclear or missing earlier, but seems important to include in the question -
- The data in the JSON log is not streamed in real time; it is a saved JSON file on the hard disk.
- The data is not stored in chronological order, i.e. the timestamps may be out of order. Therefore each sensor's data must be sorted by timestamp after it has been collected into the array.
- I cannot use a separate array for each sensor (that would amount to holding all 16 GB of JSON in RAM); to save memory, only one array should be in use at a time. And yes, there are more than 4 sensors in my log; this is just a sample (around 20, to give an idea).
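One detail for the sorting step mentioned above: JavaScript's default Array.prototype.sort compares elements as strings, so a numeric comparator on the timestamp is needed:

```javascript
// Timestamps must be compared numerically; the default sort would
// compare the pairs as strings instead of by timestamp value.
var s = [[10000005, 26], [10000001, 22], [10000003, 24]];
s.sort(function (a, b) { return a[0] - b[0]; });
console.log(s); // [[10000001, 22], [10000003, 24], [10000005, 26]]
```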
I have edited my JSON sample and expected output above.
One solution could be to make several passes over the JSON file, storing one sensor's data with timestamps in the array at a time, then sorting the array, and finally analyzing the data for corruption and gaps. That is what I am attempting in the code above.