Check for duplicate Javascript objects

TL DR version: I want to avoid adding duplicate Javascript objects to an array of similar objects, some of which can be really big. What is the best approach?

I have an application in which I load large amounts of JSON data into a Javascript data structure. Although this is a bit more complicated, suppose I load JSON into an array of Javascript objects from the server through a series of AJAX requests, for example:

var myObjects = []; function processObject(o) { myObjects.push(o); } for (var x=0; x<1000; x++) { $.getJSON('/new_object.json', processObject); } 

To complicate matters, JSON:

  • is in an unknown pattern
  • has an arbitrary length (probably not huge, but can be in the range of 100-200 kb).
  • may contain duplicates for different requests.

My initial thought is to have an additional object to store the hash of each object (via JSON.stringify ?) And check it on every load, for example:

 var myHashMap = {}; function processObject(o) { var hash = JSON.stringify(o); // is it in the hashmap? if (!(myHashMap[hash])) { myObjects.push(o); // set the hashmap key for future checks myHashMap[hash] = true; } // else ignore this object } 

but I am worried about the presence of property names in myHashMap , the length of which can be 200 kb. So my questions are:

  • Is there a better approach for this problem than the idea of ​​hashmap?
  • If not, is there a better way to make a hash function for a JSON object of arbitrary length and schema than JSON.stringify ?
  • What are the possible problems with super-long property names in an object?
+3
source share
2 answers

I suggest you create an MD5 hash of JSON.stringify (o) and save it in your hash map with a link to your saved object as data for the hash. And to make sure that in JSON.stringify() there are no restrictions on the order of the keys of objects, you need to create a copy of the object that orders the keys.

Then, when each new object arrives, you check it on the hash map. If you find a match on the hash map, then you compare the incoming object with the actual object that you saved to see if they really are duplicated (since there may be MD5 hash collisions). So you have a managed hash table (it only has MD5 hashes).

Here's the code for creating a canonical string representation of an object (including nested objects or objects inside arrays) that processes the keys of objects, which may be in a different order if you just called JSON.stringify ().

 // Code to do a canonical JSON.stringify() that puts object properties // in a consistent order // Does not allow circular references (child containing reference to parent) JSON.stringifyCanonical = function(obj) { // compatible with either browser or node.js var Set = typeof window === "object" ? window.Set : global.Set; // poor man Set polyfill if (typeof Set !== "function") { Set = function(s) { if (s) { this.data = s.data.slice(); } else { this.data = []; } }; Set.prototype = { add: function(item) { this.data.push(item); }, has: function(item) { return this.data.indexOf(item) !== -1; } }; } function orderKeys(obj, parents) { if (typeof obj !== "object") { throw new Error("orderKeys() expects object type"); } var set = new Set(parents); if (set.has(obj)) { throw new Error("circular object in stringifyCanonical()"); } set.add(obj); var tempObj, item, i; if (Array.isArray(obj)) { // no need to re-order an array // but need to check it for embedded objects that need to be ordered tempObj = []; for (i = 0; i < obj.length; i++) { item = obj[i]; if (typeof item === "object") { tempObj[i] = orderKeys(item, set); } else { tempObj[i] = item; } } } else { tempObj = {}; // get keys, sort them and build new object Object.keys(obj).sort().forEach(function(item) { if (typeof obj[item] === "object") { tempObj[item] = orderKeys(obj[item], set); } else { tempObj[item] = obj[item]; } }); } return tempObj; } return JSON.stringify(orderKeys(obj)); } 

And, the algorithm

 var myHashMap = {}; function processObject(o) { var stringifiedCandidate = JSON.stringifyCanonical(o); var hash = CreateMD5(stringifiedCandidate); var list = [], found = false; // is it in the hashmap? if (!myHashMap[hash] { // not in the hash table, so it a unique object myObjects.push(o); list.push(myObjects.length - 1); // put a reference to the object with this hash value in the list myHashMap[hash] = list; // store the list in the hash table for future comparisons } else { // the hash does exist in the hash table, check for an exact object match to see if it really a duplicate list = myHashMap[hash]; // get the list of other object indexes with this hash value // loop through the list for (var i = 0; i < list.length; i++) { if (stringifiedCandidate === JSON.stringifyCanonical(myObjects[list[i]])) { found = true; // found an exact object match break; } } // if not found, it not an exact duplicate, even though there was a hash match if (!found) { myObjects.push(o); myHashMap[hash].push(myObjects.length - 1); } } } 

A test case for jsonStringifyCanonical() is here: https://jsfiddle.net/jfriend00/zfrtpqcL/

+4
source
  • May be. For example, if you know which subject of an object is coming, you can write a better indexing and search system than the keys of JS objects. But you can only do this using JavaScript, and object keys are written in C ...
  • Should your hashing be lossless or not? If possible, try to lose compression (MD5). I assume that you will lose some speed and gain some memory. By the way, do JSON.stringify(o) guarantees the same key order. Since {foo: 1, bar: 2} and {bar: 2, foo: 1} are equal to objects, but not like strings.
  • Cost memory

One possible optimization:

Instead of using getJSON use $.get and pass "text" as dataType param. Than you can use the result as your hash and then convert it to an object.

Actually, having written the last sentence, I want to say about another solution:

  • Collect all results with $.get in an array
  • Sort with buildin (c speed) Array.sort
  • Now you can easily detect and remove duplicates with a single for

Again, different JSON strings can create the same JavaScript object.

+2
source

All Articles