Is there an established understanding of the difference between two JSON?

Are there any established or existing formats or conventions for representing the difference between two JSON documents?

Suppose that two remote nodes (or server / client) have some data that is presented as potentially complex JSON, whose structure is unknown until runtime. One wants to send the update to another, but not sending the whole state as one big JSON. Instead, it's just a delta. What would be a good way to represent the delta (or difference) between any two JSON documents? They are likely to be very similar (one small change), but may not be.

+7
source share
2 answers

JSON documents are essentially trees with leaves containing name / value pairs.

What you want to do is pass a minimal tree delta: the smallest set of changes that converts one tree to another.

Computed tree deltas are a bit of art, partly because it depends on the types of deltas that you allow (just insert / remove swap sheets, move subtree? Duplicate subtree? Rename names? Replace values?). You must also consider semantic equivalences; if you change the position of two subtrees, is the result semantically different? (Your delta detector can see such a set of trees, checking semantic identification may exclude it as uninteresting). If you duplicate a subtree, is the answer semantically different? (I think the effective answer is no for JSON).

You need something like a dynamic programming algorithm to determine such a minimum delta; You can breathe inspiration from Levenshteinโ€™s distance from the strings.

This is a common problem of interest to programmers regarding source code. Think of the JSON document as source code and see the answers https://stackoverflow.com/q/5779160/120163 for further discussion.

+7
source

As Iraq pointed out, there are some options in the lines of Levenshtein, but you would look at the serialization of your object and compare it lexicographically, since, as Ira said, the JSON-specific language that you will not be taken into account (two trees can be identical to JSON, but very differ in Levenshtein distance). What you want is the tree editing distance.

So, to add some details around the tree editing distance, the well-known algorithms used in this space are typical, for example, Zhang and Shasha or Klein, and you can find the Python implementation of Zhang and Shashi. These algorithms will receive the minimum number of changes to convert one tree to another, thereby providing your diff. However, they are somewhat slow O (n ^ 2) in the absolute best possible way, so if you compare a large number of objects or JSON files, you will find yourself improving your golf game, making dishes, underwear, bathing your pets and other household assembly.

And in fact, the art that Iraq is talking about actually consists in the fact that these algorithms are complex and computationally expensive. So what you can do is be creative. One of the methods narrows the number of objects that need to be compared. For example, why calculate the editing distance between two JSON objects that are clearly more like intermediate ones than each other? Do not calculate the editing distance on objects that are identical using lexicographic comparison. If two objects are slightly or very different from each other, perhaps forget about the difference and just say that there should be a direct replacement.

To apply the "art" of tree editing distance, which saves you from unnecessary processor cycles, you need a way to provide a metric around what is meant by "somewhat similar" or "dramatically different."

To this end, I wrote an implementation of the PQ-Gram tree approach algorithm ( http://www.vldb2005.org/program/paper/wed/p301-augsten.pdf ), which you can find on github for use in Node.js or in a browser ( https://github.com/hoonto/jqgram.git ) based on existing PyGram Python code ( https://github.com/Sycondaman/PyGram ).

PQ-Gram is much, much faster than true editing distance algorithms working in O (n log n) and O (n), where n is the number of nodes.

Therefore, I recommend using jqgram very quickly to understand what you are looking at in terms of distance metrics for editing JSON objects. Determine which JSON objects should be compared, instead of just replacing, and then when you want the true distance to get diff, use Klein or Zhang and Shasha to get the actual diff.

Here the jqgram JSON tree object edits an example of approximation of distance taken directly from README to implement jqgram on github:

var jq = require("jqgram").jqgram; var root1 = { "thelabel": "a", "thekids": [ { "thelabel": "b", "thekids": [ { "thelabel": "c" }, { "thelabel": "d" } ]}, { "thelabel": "e" }, { "thelabel": "f" } ] } var root2 = { "name": "a", "kiddos": [ { "name": "b", "kiddos": [ { "name": "c" }, { "name": "d" }, { "name": "y" } ]}, { "name": "e" }, { "name": "x" } ] } jq.distance({ root: root1, lfn: function(node){ return node.thelabel; }, cfn: function(node){ return node.thekids; } },{ root: root2, lfn: function(node){ return node.name; }, cfn: function(node){ return node.kiddos; } },{ p:2, q:3, depth:10 }, function(result) { console.log(result.distance); }); 

The lfn and cfn parameters determine how each JSON tree should determine the label names of the node and the child array for each root of the tree, so that you can do funny things, for example, compare JSON objects from different sources. All you have to do is provide these functions along with each root, and jqgram will do the rest by calling your provided lfn and cfn functions to create the trees.

+5
source

All Articles