Context-sensitive implementation of diff

Question

Context-sensitive implementation of diff

Consolidated and core issue

Using MS Access 2010 and VBA (sigh ..)

I am trying to implement a specialized Diff function that can list changes in different ways depending on what has changed. I need to create a short list of changes to submit for our records.

I would like to use something like html tags, for example These are references 1, 6 so that I can view the changes using the code and adjust how the text of the change is displayed. Or something else to complete my task.

I see this as a way to provide an extensible way to customize the output and possibly move things to a more robust platform and actually use html / css.

Does anyone know of a similar project that can point me in the right direction?

My task

I have an access database with work operations tables - usually 200-300 operations, many of which vary from one version to another. I am currently implementing a function that iterates through tables, finds instructions that have changed and compared them.

Note that each operation command is usually a pair of sentences with two lines at the end with some links to documents.

My algorithm is based on the "difference algorithm (OD) and its variations" , and it works great.

Access supports the text "Rich", which simply glorifies simple html, so I can easily generate the full text with formatted additions and deletions, i.e. add tags, for example This text has been removed . The main conclusion of the Diff procedure is the full text of the operation, which includes immutable, deleted, and pasted text within each other. The diff procedure adds the <del> and <ins> tags, which are later replaced by formatting text later (the result is similar to the representation of changes from exchange changes in the stack).

However, as I said, I need the changes listed in a humanoid format. This has proven difficult due to the ambiguity that many changes create.

for example: If the type of chemical changes from "class A" to "class C", the text of the change that is easily generated is "Change" A "to" C ", which is not very useful for someone viewing the changes. More common are links to documents at the end: adding SOP 3 to a list, such as "SOP 1, 2, 3", generates the text "Add" 3 ". Clearly also not helpful.

What would be most useful is a custom output for text labeled “SOP”, so the output will be “Add link to SOP 3”.

I started with the following solution:

Group words together, for example. process text such as "SOP 1, 2, 3" as one token for comparison. This generates the text “Change“ SOP 1, 2 ”to“ SOP 1, 2, 3. "This is clogged when there is a large list and you are trying to determine what has actually changed.

Where am I now

Now I am trying to add additional html tags before running the diff algorithm. For example, I ran the text through a “preprocessor” that converts “SOP 1, 2” to SOP 1, 2

As soon as the Diff procedure returns the full text of the change, I look at it, marking the current "class" of text, and when there is <del> or <ins> , I commit the text between the tags and use SELECT CASE lock the class for each change.

Actually, this works fine, but there are many problems that I have to work with, for example, add Diff, deciding that the shortest way is to remove certain opening tags and insert others. This creates a script in which there are two  tags, but only one  .

Final question

I want to advise either to continue the direction I started, or to try something else before investing much more time in a non-optimal solution.

Thanks to everyone in advance.

Also note:

A typical run takes between 1.5 and 2.5 seconds when I try to use more interesting things and a bunch of debug.prints. Thus, running through an extra pass or two would not be a killer.

+7

html algorithm vba access-vba diff

ptpaterson Dec 23 '13 at 21:54

source share

3 answers

It is clear that reporting the difference in the smallest changes in the structures that you have is not what you want; You want to report some context.

To do this, you must determine what context there is for the report, so you can decide how much of this is interesting. You sketched an idea in which you combined some elements of your structure (for example, "SOP" '1' '2' into 'SOP 1 2'), but this seems to me wrong. What he does is resize the smallest structural elements, not improve the context.

Although I'm not sure if this is the right approach, I would try to characterize your structures using a grammar such as BNF. For example, some grammar rules you may have:

  action = 'SOP' sop_item_list ; sop_item_list = NATURAL ; sop_item_list = sop_item_list ',' NATURAL ;

Now the actual SOP element can be characterized by an abstract syntax tree (show nested children indexed by constants to get subtrees):

  t1: action('SOP',sop_item_list(sop_item_list(NATURAL[1]),',',NATURAL[2]))

You still want to calculate the difference using something like the dynamic programming algorithm you proposed, but now you want a minimal tree delta. Done correctly (my company makes tools that do this for grammars for common computer languages, and you can find public diff tree algorithms), you can get a delta like:

  replace t1[2] with op_item_list(sop_item_list(sop_item_list(NATURAL[1]),',',NATURAL[2]),',',NATURAL[3]

which is essentially what you got when gluing (SOP, 1,2) into one element, but without the external harassment that you personally decide for this.

The real value in this, I think, is that you can use the t1 tree to report on the context. In particular, you start at the root of the tree and print a summary of the subtrees (obviously, you do not want to print the full subtrees, as this will simply return you the full source text). By printing a subtree to a depth of 1 or two levels and drawing something deep (for example, presenting the list as "..." and one subtree as "_"), you can print something like:

 replace 'SOP' ... with 'SOP',...,3

which, I think, is what you are looking for in your specific example.

No, this is not an algorithm; This is a sketch of an idea. The fact that we have delta tree algorithms that compute useful deltas and a general idea (taken from the LISP debuggers, frankly) suggests that this will probably generalize something useful, or at least lead you into a new direction.

Having received your answer in terms of AST, you should also make it relatively easy to create HTML as you wish. (People work with XML all the time, and XML is basically a tree view).

+4

Ira Baxter Dec 27 '13 at 3:17

source share

In case you decide to continue what you have already achieved (your IMO is already quite far), you can make two steps of the difference.

Group words together, for example. process text such as "SOP 1, 2, 3" as one token for comparison.

This is a good start; You have already managed to make the context understandable to the user.

This generates the text “Change“ SOP 1, 2 ”to“ SOP 1, 2, 3. "This is clogged when there is a large list and you are trying to determine what really changed.

How about making another diff pass through the tokens found (i.e. compare "SOP 1, 2" with "SOP 1, 2, 3"), this time without grouping the words to create additional information? This will make the complete output something like this:

Change "SOP 1, 2" to "SOP 1, 2, 3"
Edit data: Add '3'

The text is a bit cryptic, so you might want to clarify a bit there. I would also suggest truncating long tokens in the first line ("SOP 1, 2, 3, ..."), since the second line should already contain enough details.

I am not sure about the effectiveness of this second pass; in a large text with many changes, you may encounter many calls to diff functionality. You can optimize by copying the changes from the first pass into one “change document”, run the second pass through it, then stitch the results together.

NTN.

+3

Ruud helderman Dec 27 '13 at 10:13

source share

Gene · Accepted Answer · 2013-12-31T05:44:32+0000

Try thinking of Prolog-style rewriting rules that convert your instructions into canonical form, which causes the diff algorithm to generate what you need. The task you indicated will be solved using this rule:

 SOP i1, i2, ... iN -> SOP j1, SOP j2, ... SOP jN where j = sorted(i)

In other words, “distribute” the SOP over the sorted list of next integers. This will cause the diff algorithm to give a complete report of the “Add SOP 3” change.

Rules are applied by looking for input for matches on the left side and replacing the corresponding right.

You may already be doing this, but you will get a more robust analysis if the input token is marked: "SOP" should be considered as one "character" for the diff algorithm. Spaces can be reduced to tokens for space and line breaks if they are significant or ignored.

You can perform a different kind of diff at the symbol level to check the "fuzzy" equality of tokens, to take into account typographical errors when matching the left sides. “SIP” and “SOP” will be considered a “coincidence” because their editing distance is only 1 (and I and O are just one key on the QUERTY keyboard!).

If you consider all the quirks in the output that you are getting now, and you can correct each of them as a rule of rewriting, which takes input into a form in which the diff algorithm creates what you need, then the rest is to implement the rewriting system. Doing this in general, which is effective, so changing and adding rules is not associated with a lot of special coding, this is a difficult problem, but it has been studied. Interestingly, @Ira Baxter mentioned lisp as it excels as a tool for this kind of thing.

The willow syntax clause fits naturally into the rewrite rule method. For example, suppose SOPs have sections and paragraphs:

 SOP 1 section 3, paras 2,1,3

is a hierarchy that should be rewritten as

 SOP 1 section 3 para 1, SOP 1 section 3 para 2, SOP 1 section 3 para 3

Rewrite rules

 paras i1, i2, ... iN -> para j1, para j2, ... para jN where j = sorted(i) section K para i1, ... para iN ->s section K para j1, ... section K para j1 SOP section K para i1, ... section K para i1 -> SOP section K para j1, ... SOP section K para j1

when applied over three passes, a result similar to “Section SOP 1, section 4, paragraph 4” will be obtained.

Although there are many strategies for implementing rules and rewriting, including encoding each of them as a procedure in VB (argh ...), there are other ways. The prologue is a great attempt to do it as best as possible. There is a free implementation. There are others. I used TXL to get useful correspondence. The only problem with TXL is that it assumes you have a pretty strict grammar for inputs, which is not like the case of your problem.

If you publish more examples of quirks in your current findings, I can follow this in more detail with rewriting rules.

Context-sensitive implementation of diff

Consolidated and core issue

My task

I started with the following solution:

Where am I now

Final question

Also note:

More articles: