Detection of objects having identical children

Question

Detection of objects having identical children

I have two entities, Class and Student , connected to each other.

When data is imported from an external application, unfortunately, some classes are created in duplicate. Duplicate classes have different names, but the same subject and the same students.

For instance:

{Id = 341, title = '10rs / PE1a', SubjectId = 60, Students = {Jack, Bill, Sarah}}
{Id = 429, title = '10rs / PE1b', SubjectId = 60, Students = {Jack, Bill, Sarah}}

There is no general rule for matching the names of these repeating classes, so the only way to determine if two classes are duplicates is that they have the same SubjectId and Students .

I would like to use LINQ to detect all duplicates (and ultimately merge them). So far I have tried:

 var sb = new StringBuilder(); using (var ctx = new Ctx()) { ctx.CommandTimeout = 10000; // Because the next line takes so long! var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id); foreach (var c in allClasses) { var duplicates = allClasses.Where(o => o.SubjectId == c.SubjectId && o.Id != c.Id && o.Students.Equals(c.Students)); foreach (var d in duplicates) sb.Append(d.LongName).Append(" is a duplicate of ").Append(c.LongName).Append("<br />"); } } lblResult.Text = sb.ToString();

This is not good, because I get the error:

NotSupportedException . Unable to create a constant value of type "TeachEDM.Student". In this context, only primitive types (such as Int32, String, and Guid) are supported.

Obviously, I don't like the fact that I'm trying to map o.SubjectId == c.SubjectId in LINQ.

Moreover, this seems like a terrible method overall and very slow. A call to the database takes more than 5 minutes.

I would really appreciate some advice.

+4

c # .net linq linq-to-entities entity-framework

James Oct 15 '11 at 20:30

source share

1 answer

Slauma · Accepted Answer · 2011-10-15T21:59:31+0000

Comparing SubjectId not a problem because c.SubjectId is a primitive type value ( int , I think). An exception complains about Equals(c.Students) . c.Students is a constant (as requested by duplicates ), but not a primitive type.

I would also try to do the comparison in memory, not in the database. In any case, you load all the data into memory when you start your first foreach : it executes an allClasses request. Then inside the loop, you extend IQueryable allClasses to IQueryable duplicates , which is executed, and then in the inner foreach . This is one database query for each element of your outer loop! This may explain the poor performance of the code.

So, I will try to execute the contents of the first foreach in memory. To compare the Students list, you need to compare element by element, and not links to student collections, because they are exactly different.

 var sb = new StringBuilder(); using (var ctx = new Ctx()) { ctx.CommandTimeout = 10000; // Perhaps not necessary anymore var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id) .ToList(); // executes query, allClasses is now a List, not an IQueryable // everything from here runs in memory foreach (var c in allClasses) { var duplicates = allClasses.Where( o => o.SubjectId == c.SubjectId && o.Id != c.Id && o.Students.OrderBy(s => s.Name).Select(s => s.Name) .SequenceEqual(c.Students.OrderBy(s => s.Name).Select(s => s.Name))); // duplicates is an IEnumerable, not an IQueryable foreach (var d in duplicates) sb.Append(d.LongName) .Append(" is a duplicate of ") .Append(c.LongName) .Append("<br />"); } } lblResult.Text = sb.ToString();

Ordering a sequence by name is necessary because, I believe, SequenceEqual compares the length of the sequence, then element 0 with element 0, then element 1 with element 1, etc.

Change To your comment that the first request is still slow.

If you have 1300 classes with 30 students, the performance on loading ( Include ) may suffer from the multiplication of data that is transferred between the database and the client. This is explained here: How much Include can I use in an ObjectSet in an EntityFramework to maintain performance? . The query is complex because it requires a JOIN between classes and students, and materializing material objects is also complicated because EF must filter out duplicate data when creating objects.

An alternative approach is to load only classes without students in the first query, and then load students one by one inside the loop explicitly. It will look like this:

 var sb = new StringBuilder(); using (var ctx = new Ctx()) { ctx.CommandTimeout = 10000; // Perhaps not necessary anymore var allClasses = ctx.Classes.OrderBy(o => o.Id).ToList(); // <- No Include! foreach (var c in allClasses) { // "Explicite loading": This is a new roundtrip to the DB ctx.LoadProperty(c, "Students"); } foreach (var c in allClasses) { // ... same code as above } } lblResult.Text = sb.ToString();

In this example, you will have 1 + 1300 database queries, not just one, but you will not have the data multiplication that occurs at high load, and the queries are simpler (no JOIN between classes and students).

The loading explanation is explained here:

http://msdn.microsoft.com/en-us/library/bb896272.aspx
For POCOs (also works for EntityObject derived objects): http://msdn.microsoft.com/en-us/library/dd456855.aspx
For EntityObject derived objects, you can also use the Load EntityCollection method: http://msdn.microsoft.com/en-us/library/bb896370.aspx

If you are working with Lazy Loading, the first foreach with LoadProperty will not be needed, as the Students collections will be loaded the first time you access it. This should lead to the same 1300 additional requests, for example, when loading the identifier.

Detection of objects having identical children

More articles: