Is there a way to speed up this code that finds data changes in two XML files?

The following code compares two XML texts and returns data changes between them.

This code works fine , but should be as resource intensive as possible.

Is there a faster way to do this in LINQ, for example. without creating two collections of XElements and comparing each of their fields for differences?

using System; using System.Collections.Generic; using System.Linq; using System.Xml.Linq; namespace TestXmlDiff8822 { class Program { static void Main(string[] args) { XDocument xdoc1 = XDocument.Parse(GetXml1()); XDocument xdoc2 = XDocument.Parse(GetXml2()); List<HistoryFieldChange> hfcList = GetHistoryFieldChanges(xdoc1, xdoc2); foreach (var hfc in hfcList) { Console.WriteLine("{0}: from {1} to {2}", hfc.FieldName, hfc.ValueBefore, hfc.ValueAfter); } Console.ReadLine(); } static public List<HistoryFieldChange> GetHistoryFieldChanges(XDocument xdoc1, XDocument xdoc2) { List<HistoryFieldChange> hfcList = new List<HistoryFieldChange>(); var elements1 = from e in xdoc1.Root.Elements() select e; var elements2 = from e in xdoc2.Root.Elements() select e; for (int i = 0; i < elements1.Count(); i++) { XElement element1 = elements1.ElementAt(i); XElement element2 = elements2.ElementAt(i); if (element1.Value != element2.Value) { HistoryFieldChange hfc = new HistoryFieldChange(); hfc.EntityName = xdoc1.Root.Name.ToString(); hfc.FieldName = element1.Name.ToString(); hfc.KindOfChange = "fieldDataChange"; hfc.ObjectReference = (xdoc1.Descendants("Id").FirstOrDefault()).Value; hfc.ValueBefore = element1.Value; hfc.ValueAfter = element2.Value; hfcList.Add(hfc); } } return hfcList; } public static string GetXml1() { return @" <Customer> <Id>111</Id> <FirstName>Sue</FirstName> <LastName>Smith</LastName> </Customer> "; } public static string GetXml2() { return @" <Customer> <Id>111</Id> <FirstName>Sue2</FirstName> <LastName>Smith-Thompson</LastName> </Customer> "; } } public class HistoryFieldChange { public string EntityName { get; set; } public string FieldName { get; set; } public string ObjectReference { get; set; } public string KindOfChange { get; set; } public string ValueBefore { get; set; } public string ValueAfter { get; set; } } } 
+4
source share
3 answers

You should be able to get all elements that have different values ​​using the linq connection in the element name.

 var name = xdoc1.Root.Name.ToString(); var id = (xdoc1.Descendants("Id").FirstOrDefault()).Value; var diff = from o in xdoc1.Root.Elements() join n in xdoc2.Root.Elements() on o.Name equals n.Name where o.Value != n.Value select new HistoryFieldChange() { EntityName = name, FieldName = o.Name.ToString(), KindOfChange = "fieldDataChange", ObjectReference = id, ValueBefore = o.Value, ValueAfter = n.Value, }; 

One of the advantages of this method is that it is easy to parallelize for multi-core machines, just use PLinq and the AsParallel extension method.

 var diff = from o in xdoc1.Root.Elements() join n in xdoc2.Root.Elements().AsParallel() on o.Name equals n.Name where o.Value != n.Value ... 

Voila, if the request can be parallelized on your computer, then PLinq will automatically process it. This will speed up work with large documents, but if your documents are small, you can get better acceleration by parallelizing the outer loop that calls GetHistoryFieldChanges , using something like Parallel.For.

Another advantage is that you can simply return IEnumerable from GetHistoryFieldChanges , no need to waste time distributing the list, items will be returned as they are listed, and the Linq query will not be executed until then.

 IEnumerable<HistoryFieldChange> GetHistoryFieldChanges(...) 

Here are a few examples of iterations of the original, Yannick In-order and My non-parallel Linq implementations. Run on my 2.8ghz laptop with this code .

 Elapsed Orig 3262ms All Linq 1761ms In Order Only 2383ms 

One interesting thing I noticed ... Run the code in debug mode and then release the mode, it is amazing how much the compiler can optimize a clean version of Linq. I think returning IEnumerable helps a lot to the compiler here.

+9
source

Edit:

I must admit that I did not expect the LINQ implementation to be faster than the naive iterative approach with data in order. But this suggests that the performance bottleneck is not always in obvious places.

As I said, I am assuming that you are comparing this for some reason, so this implementation may perhaps be useful. Of course, this is not as clear as the joshperry implementation. But in terms of performance, he has to take the crown.

 static public IEnumerable<HistoryFieldChange> GetHistoryFieldChanges2(XDocument xdoc1, XDocument xdoc2) { string id = xdoc1.Descendants("Id").FirstOrDefault().Value; string name = xdoc1.Root.Name.ToString(); IEnumerator<XElement> enumerator1 = xdoc1.Root.Elements().GetEnumerator(); IEnumerator<XElement> enumerator2 = xdoc2.Root.Elements().GetEnumerator(); for (; enumerator1.MoveNext() && enumerator2.MoveNext(); ) { XElement element1 = enumerator1.Current; XElement element2 = enumerator2.Current; if (element1.Value != element2.Value) { HistoryFieldChange hfc = new HistoryFieldChange(); hfc.EntityName = name; hfc.FieldName = element1.Name.ToString(); hfc.KindOfChange = "fieldDataChange"; hfc.ObjectReference = id; hfc.ValueBefore = element1.Value; hfc.ValueAfter = element2.Value; yield return hfc; } } } 
+3
source

Do you consider using XmlDiff ?

+1
source

All Articles