LINQ routine optimization

I run the build system. A datawise simplified description will be that I have Configurations, and each config has 0..n Builds. Artifacts are currently being collected, and some of them are stored on the server. What I am doing is writing a rule that sums all the bytes created for each configuration and checks to see if this is too much.

Currently, the routine code is:

private void CalculateExtendedDiskUsage(IEnumerable<Configuration> allConfigurations) { var sw = new Stopwatch(); sw.Start(); // Lets take only confs that have been updated within last 7 days var items = allConfigurations.AsParallel().Where(x => x.artifact_cleanup_type != null && x.build_cleanup_type != null && x.updated_date > DateTime.UtcNow.AddDays(-7) ).ToList(); using (var ctx = new LocalEntities()) { Debug.WriteLine("Context: " + sw.Elapsed); var allBuilds = ctx.Builds; var ruleResult = new List<Notification>(); foreach (var configuration in items) { // all builds for current configuration var configurationBuilds = allBuilds.Where(x => x.configuration_id == configuration.configuration_id) .OrderByDescending(z => z.build_date); Debug.WriteLine("Filter conf builds: " + sw.Elapsed); // Since I don't know which builds/artifacts have been cleaned up, calculate it manually if (configuration.build_cleanup_count != null) { var buildCleanupCount = "30"; // default if (configuration.build_cleanup_type.Equals("ReserveBuildsByDays")) { var buildLastCleanupDate = DateTime.UtcNow.AddDays(-int.Parse(buildCleanupCount)); configurationBuilds = configurationBuilds.Where(x => x.build_date > buildLastCleanupDate) .OrderByDescending(z => z.build_date); } if (configuration.build_cleanup_type.Equals("ReserveBuildsByCount")) { var buildLastCleanupCount = int.Parse(buildCleanupCount); configurationBuilds = configurationBuilds.Take(buildLastCleanupCount).OrderByDescending(z => z.build_date); } } if (configuration.artifact_cleanup_count != null) { // skipped, similar to previous block } Debug.WriteLine("Done cleanup: " + sw.Elapsed); const int maxDiscAllocationPerConfiguration = 1000000000; // 1GB // Sum all disc usage per configuration var confDiscSizePerConfiguration = configurationBuilds .GroupBy(c => new {c.configuration_id}) .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration)) .Select(groupedBuilds => new { configurationId = groupedBuilds.FirstOrDefault().configuration_id, configurationPath = groupedBuilds.FirstOrDefault().configuration_path, Total = groupedBuilds.Sum(c => c.artifact_dir_size), Average = groupedBuilds.Average(c => c.artifact_dir_size) }).ToList(); Debug.WriteLine("Done db query: " + sw.Elapsed); ruleResult.AddRange(confDiscSizePerConfiguration.Select(iter => new Notification { ConfigurationId = iter.configurationId, CreatedDate = DateTime.UtcNow, RuleType = (int) RulesEnum.TooMuchDisc, ConfigrationPath = iter.configurationPath })); Debug.WriteLine("Finished loop: " + sw.Elapsed); } // find owners and insert... } } 

This does exactly what I want, but I think I can do it faster. Currenly I see:

 Context: 00:00:00.0609067 // first round Filter conf builds: 00:00:00.0636291 Done cleanup: 00:00:00.0644505 Done db query: 00:00:00.3050122 Finished loop: 00:00:00.3062711 // avg round Filter conf builds: 00:00:00.0001707 Done cleanup: 00:00:00.0006343 Done db query: 00:00:00.0760567 Finished loop: 00:00:00.0773370 

SQL generated by .ToList() looks very dirty. (Everything used in WHERE is covered by the index in the database)

I am testing 200 configurations, so this adds up to 00: 00: 18.6326722. I have only ~ 8 thousand elements that need to be processed daily (so the whole procedure takes more than 10 minutes).

I accidentally searched this Internet, and it seems to me that the Entitiy Framework not very well versed in parallel processing. Knowing that I still decided to try this attempt (the first time I tried it, so sorry for any nonsense).

Basically, if I translate all processing out of scope, for example:

 foreach (var configuration in items) { var confDiscSizePerConfiguration = await GetData(configuration, allBuilds); ruleResult.AddRange(confDiscSizePerConfiguration.Select(iter => new Notification { ... skiped } 

and

 private async Task<List<Tmp>> GetData(Configuration configuration, IQueryable<Build> allBuilds) { var configurationBuilds = allBuilds.Where(x => x.configuration_id == configuration.configuration_id) .OrderByDescending(z => z.build_date); //..skipped var confDiscSizePerConfiguration = configurationBuilds .GroupBy(c => new {c.configuration_id}) .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration)) .Select(groupedBuilds => new Tmp { ConfigurationId = groupedBuilds.FirstOrDefault().configuration_id, ConfigurationPath = groupedBuilds.FirstOrDefault().configuration_path, Total = groupedBuilds.Sum(c => c.artifact_dir_size), Average = groupedBuilds.Average(c => c.artifact_dir_size) }).ToListAsync(); return await confDiscSizePerConfiguration; } 

For some reason, this reduces the execution time for 200 elements from 18 → 13 seconds. Anyway, from what I understand, since I am await entering each .ToListAsync() , it is still processed sequentially, is that correct?

So, the query “cannot be processed in parallel” begins to appear when I replace foreach (var configuration in items) with Parallel.ForEach(items, async configuration => . Performing this change will result in:

The second operation started in this context until the previous asynchronous operation is completed. Use "wait" so that asynchronous operations complete before calling another method in this context. Any instance members are not guaranteed to be safe threads.

At first it was a bit confusing to me, since I await almost anywhere the compiler allows it, but maybe the data is sown quickly.

I tried to overcome this by being less greedy and added new ParallelOptions {MaxDegreeOfParallelism = 4} to this parallel loop. It was assumed that the peasant assumption was that the default connection pool size is 100, and all I want to use is 4, there should be a lot. But he still fails.

I also tried to create new DbContext inside the GetData method, but it still doesn't work. If I remember correctly (I can’t check now), I got

Failed to open basic connection.

What are the options to speed up this procedure?

+5
source share
3 answers

Before you go in parallel, it is worth optimizing the request itself. Here are some tips that can improve your time:

1) When working with GroupBy use Key . This can solve the problem of a complex and nested SQL query, since you instruct Linq to use the same keys as in GROUP BY , and not to create a subselect.

  var confDiscSizePerConfiguration = configurationBuilds .GroupBy(c => new { ConfigurationId = c.configuration_id, ConfigurationPath = c.configuration_path}) .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration)) .Select(groupedBuilds => new { configurationId = groupedBuilds.Key.ConfigurationId, configurationPath = groupedBuilds.Key.ConfigurationPath, Total = groupedBuilds.Sum(c => c.artifact_dir_size), Average = groupedBuilds.Average(c => c.artifact_dir_size) }) .ToList(); 

2) It seems that you were bitten by problem N + 1. In simple words - you execute one SQL query to get all the configurations and N others to get information about the assembly. In total there would be ~ 8k small queries, where 2 large queries would be sufficient. If memory usage is not a limitation, retrieve all assembly data into memory and optimize it for quick retrieval using ToLookup .

 var allBuilds = ctx.Builds.ToLookup(x=>x.configuration_id); 

Later you can search for assemblies:

 var configurationBuilds = allBuilds[configuration.configuration_id].OrderByDescending(z => z.build_date); 

3) You do OrderBy on configurationBuilds several times. Filtering does not affect the recording order, so you can safely remove additional calls to OrderBy :

 ... configurationBuilds = configurationBuilds.Where(x => x.build_date > buildLastCleanupDate); ... configurationBuilds = configurationBuilds.Take(buildLastCleanupCount); ... 

4) It makes no sense to do GroupBy , since assemblies are already filtered for one configuration.

UPDATE:

I took another step and created a code that will get the same results as your provided code, with one request. It should be more productive and use less memory.

 private void CalculateExtendedDiskUsage() { using (var ctx = new LocalEntities()) { var ruleResult = ctx.Configurations .Where(x => x.build_cleanup_count != null && ( (x.build_cleanup_type == "ReserveBuildsByDays" && ctx.Builds.Where(y => y.configuration_id == x.configuration_id).Where(y => y.build_date > buildLastCleanupDate).Sum(y => y.artifact_dir_size) > maxDiscAllocationPerConfiguration) || (x.build_cleanup_type == "ReserveBuildsByCount" && ctx.Builds.Where(y => y.configuration_id == x.configuration_id).OrderByDescending(y => y.build_date).Take(buildCleanupCount).Sum(y => y.artifact_dir_size) > maxDiscAllocationPerConfiguration) ) ) .Select(x => new Notification { ConfigurationId = x.configuration_id, ConfigrationPath = x.configuration_path CreatedDate = DateTime.UtcNow, RuleType = (int)RulesEnum.TooMuchDisc, }) .ToList(); } } 
+3
source

First create a new context for each parallel. Before you are going to go on this route. But you need to write a request that receives all the necessary data in one trip. To speed up ef u, you can also turn off change tracking or proxies in context when reading ur data.

0
source

There are many places to optimize ...

There are places where you should put .ToArray () to avoid multiple server requests ...

I did a lot of refactoring, but I can’t check, due to the lack of additional information.

Perhaps this may lead to a better solution ...

private void CalculateExtendedDiskUsage (IEnumerable allConfigurations) {var sw = new Stopwatch (); sw.Start ();

  using (var ctx = new LocalEntities()) { Debug.WriteLine("Context: " + sw.Elapsed); var allBuilds = ctx.Builds; var ruleResult = GetRulesResult(sw, allConfigurations, allBuilds); // Clean Code!!! // find owners and insert... } } private static IEnumerable<Notification> GetRulesResult(Stopwatch sw, IEnumerable<Configuration> allConfigurations, ICollection<Configuration> allBuilds) { // Lets take only confs that have been updated within last 7 days var ruleResult = allConfigurations .AsParallel() // Check if you really need this right here... .Where(IsConfigElegible) // Clean Code!!! .SelectMany(x => CreateNotifications(sw, allBuilds, x)) .ToArray(); Debug.WriteLine("Finished loop: " + sw.Elapsed); return ruleResult; } private static bool IsConfigElegible(Configuration x) { return x.artifact_cleanup_type != null && x.build_cleanup_type != null && x.updated_date > DateTime.UtcNow.AddDays(-7); } private static IEnumerable<Notification> CreateNotifications(Stopwatch sw, IEnumerable<Configuration> allBuilds, Configuration configuration) { // all builds for current configuration var configurationBuilds = allBuilds .Where(x => x.configuration_id == configuration.configuration_id); // .OrderByDescending(z => z.build_date); <<< You should order only when needed (most at the end) Debug.WriteLine("Filter conf builds: " + sw.Elapsed); configurationBuilds = BuildCleanup(configuration, configurationBuilds); // Clean Code!!! configurationBuilds = ArtifactCleanup(configuration, configurationBuilds); // Clean Code!!! Debug.WriteLine("Done cleanup: " + sw.Elapsed); const int maxDiscAllocationPerConfiguration = 1000000000; // 1GB // Sum all disc usage per configuration var confDiscSizePerConfiguration = configurationBuilds .OrderByDescending(z => z.build_date) // I think that you can put this even later (or not to have anyway) .GroupBy(c => c.configuration_id) // No need to create a new object, just use the property .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration)) .Select(CreateSumPerConfiguration); Debug.WriteLine("Done db query: " + sw.Elapsed); // Extracting to variable to be able to return it as function result var notifications = confDiscSizePerConfiguration .Select(CreateNotification); return notifications; } private static IEnumerable<Configuration> BuildCleanup(Configuration configuration, IEnumerable<Configuration> builds) { // Since I don't know which builds/artifacts have been cleaned up, calculate it manually if (configuration.build_cleanup_count == null) return builds; const int buildCleanupCount = 30; // Why 'string' if you always need as integer? builds = GetDiscartBelow(configuration, buildCleanupCount, builds); // Clean Code (almost) builds = GetDiscartAbove(configuration, buildCleanupCount, builds); // Clean Code (almost) return builds; } private static IEnumerable<Configuration> ArtifactCleanup(Configuration configuration, IEnumerable<Configuration> configurationBuilds) { if (configuration.artifact_cleanup_count != null) { // skipped, similar to previous block } return configurationBuilds; } private static SumPerConfiguration CreateSumPerConfiguration(IGrouping<object, Configuration> groupedBuilds) { var configuration = groupedBuilds.First(); return new SumPerConfiguration { configurationId = configuration.configuration_id, configurationPath = configuration.configuration_path, Total = groupedBuilds.Sum(c => c.artifact_dir_size), Average = groupedBuilds.Average(c => c.artifact_dir_size) }; } private static IEnumerable<Configuration> GetDiscartBelow(Configuration configuration, int buildCleanupCount, IEnumerable<Configuration> configurationBuilds) { if (!configuration.build_cleanup_type.Equals("ReserveBuildsByDays")) return configurationBuilds; var buildLastCleanupDate = DateTime.UtcNow.AddDays(-buildCleanupCount); var result = configurationBuilds .Where(x => x.build_date > buildLastCleanupDate); return result; } private static IEnumerable<Configuration> GetDiscartAbove(Configuration configuration, int buildLastCleanupCount, IEnumerable<Configuration> configurationBuilds) { if (!configuration.build_cleanup_type.Equals("ReserveBuildsByCount")) return configurationBuilds; var result = configurationBuilds .Take(buildLastCleanupCount); return result; } private static Notification CreateNotification(SumPerConfiguration iter) { return new Notification { ConfigurationId = iter.configurationId, CreatedDate = DateTime.UtcNow, RuleType = (int)RulesEnum.TooMuchDisc, ConfigrationPath = iter.configurationPath }; } } internal class SumPerConfiguration { public object configurationId { get; set; } // public object configurationPath { get; set; } // I did use 'object' cause I don't know your type data public int Total { get; set; } public double Average { get; set; } } 
0
source

All Articles