I run the build system. In simplified data terms: I have Configurations, and each Configuration has 0..n Builds. Artifacts are collected for the builds, and some of them are stored on the server. What I am writing is a rule that sums up all the bytes produced for each configuration and checks whether the total is too large.
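For reference, the relevant entity shapes look roughly like this. I am only showing the properties the rule actually touches, and the property types are approximate:

    public class Configuration
    {
        public int configuration_id { get; set; }
        public string build_cleanup_type { get; set; }   // e.g. "ReserveBuildsByDays" / "ReserveBuildsByCount"
        public int? build_cleanup_count { get; set; }
        public string artifact_cleanup_type { get; set; }
        public int? artifact_cleanup_count { get; set; }
        public DateTime updated_date { get; set; }
    }

    public class Build
    {
        public int configuration_id { get; set; }
        public string configuration_path { get; set; }
        public DateTime build_date { get; set; }
        public long artifact_dir_size { get; set; }
    }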
Currently, the routine code is:
    private void CalculateExtendedDiskUsage(IEnumerable<Configuration> allConfigurations)
    {
        var sw = new Stopwatch();
        sw.Start();

        // Lets take only confs that have been updated within last 7 days
        var items = allConfigurations.AsParallel().Where(x =>
            x.artifact_cleanup_type != null &&
            x.build_cleanup_type != null &&
            x.updated_date > DateTime.UtcNow.AddDays(-7)
        ).ToList();

        using (var ctx = new LocalEntities())
        {
            Debug.WriteLine("Context: " + sw.Elapsed);
            var allBuilds = ctx.Builds;
            var ruleResult = new List<Notification>();

            foreach (var configuration in items)
            {
                // all builds for current configuration
                var configurationBuilds = allBuilds
                    .Where(x => x.configuration_id == configuration.configuration_id)
                    .OrderByDescending(z => z.build_date);
                Debug.WriteLine("Filter conf builds: " + sw.Elapsed);

                // Since I don't know which builds/artifacts have been cleaned up, calculate it manually
                if (configuration.build_cleanup_count != null)
                {
                    var buildCleanupCount = "30"; // default
                    if (configuration.build_cleanup_type.Equals("ReserveBuildsByDays"))
                    {
                        var buildLastCleanupDate = DateTime.UtcNow.AddDays(-int.Parse(buildCleanupCount));
                        configurationBuilds = configurationBuilds
                            .Where(x => x.build_date > buildLastCleanupDate)
                            .OrderByDescending(z => z.build_date);
                    }
                    if (configuration.build_cleanup_type.Equals("ReserveBuildsByCount"))
                    {
                        var buildLastCleanupCount = int.Parse(buildCleanupCount);
                        configurationBuilds = configurationBuilds
                            .Take(buildLastCleanupCount)
                            .OrderByDescending(z => z.build_date);
                    }
                }

                if (configuration.artifact_cleanup_count != null)
                {
                    // skipped, similar to previous block
                }

                Debug.WriteLine("Done cleanup: " + sw.Elapsed);

                const int maxDiscAllocationPerConfiguration = 1000000000; // 1GB
                // Sum all disc usage per configuration
                var confDiscSizePerConfiguration = configurationBuilds
                    .GroupBy(c => new { c.configuration_id })
                    .Where(c => c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration)
                    .Select(groupedBuilds => new
                    {
                        configurationId = groupedBuilds.FirstOrDefault().configuration_id,
                        configurationPath = groupedBuilds.FirstOrDefault().configuration_path,
                        Total = groupedBuilds.Sum(c => c.artifact_dir_size),
                        Average = groupedBuilds.Average(c => c.artifact_dir_size)
                    })
                    .ToList();
                Debug.WriteLine("Done db query: " + sw.Elapsed);

                ruleResult.AddRange(confDiscSizePerConfiguration.Select(iter => new Notification
                {
                    ConfigurationId = iter.configurationId,
                    CreatedDate = DateTime.UtcNow,
                    RuleType = (int) RulesEnum.TooMuchDisc,
                    ConfigrationPath = iter.configurationPath
                }));
                Debug.WriteLine("Finished loop: " + sw.Elapsed);
            }

            // find owners and insert...
        }
    }
This does exactly what I want, but I think I can make it faster. Currently I see:
Context: 00:00:00.0609067
The SQL generated by .ToList() looks very dirty. (Everything used in the WHERE clause is covered by an index in the database.)
I am testing with 200 configurations, which adds up to 00:00:18.6326722. There are ~8,000 elements that need to be processed daily, so the whole procedure takes more than 10 minutes.
I did some searching on the Internet, and it seems that Entity Framework is not very good at parallel processing. Knowing that, I still decided to give it a try (it is my first attempt, so excuse any nonsense).
Basically, if I move all the per-configuration processing out into its own method, for example:
    foreach (var configuration in items)
    {
        var confDiscSizePerConfiguration = await GetData(configuration, allBuilds);
        ruleResult.AddRange(confDiscSizePerConfiguration.Select(iter => new Notification { ... skipped }
and
    private async Task<List<Tmp>> GetData(Configuration configuration, IQueryable<Build> allBuilds)
    {
        var configurationBuilds = allBuilds
            .Where(x => x.configuration_id == configuration.configuration_id)
            .OrderByDescending(z => z.build_date);
        // ... the rest is the same per-configuration logic as above, ending with .ToListAsync()
For some reason, this reduces the execution time for 200 elements from 18 to 13 seconds. Anyway, as far as I understand, since I await each .ToListAsync(), everything is still processed sequentially; is that correct?
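To illustrate what I mean (a minimal sketch, not my actual code): the await inside the loop means the queries run one after another, while actually overlapping them would require starting all the tasks before awaiting, which is exactly where the context trouble described next comes in.

    // Sequential: each configuration's query completes before the next one starts.
    foreach (var configuration in items)
    {
        var data = await GetData(configuration, allBuilds);
        // ... build notifications from data ...
    }

    // Overlapping: start every query first, then await them all together.
    // This can only work if each GetData call gets its own context, because a
    // single DbContext allows just one in-flight operation at a time.
    var tasks = items.Select(c => GetData(c, allBuilds)).ToList();
    var results = await Task.WhenAll(tasks);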
So the “can this be processed in parallel” part begins when I replace foreach (var configuration in items) with Parallel.ForEach(items, async configuration => ...). Making this change results in:
A second operation started on this context before a previous asynchronous operation completed. Use 'await' to ensure that any asynchronous operations have completed before calling another method on this context. Any instance members are not guaranteed to be thread safe.
At first this was a bit confusing to me, since I await almost everywhere the compiler allows it, but perhaps the data just comes back too fast.
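Concretely, the changed loop looks roughly like this (reconstructed from memory, not the exact code); allBuilds and ruleResult are the same variables as in the original method:

    Parallel.ForEach(items, async configuration =>
    {
        // Parallel.ForEach takes an Action<T>, so this async lambda is effectively
        // async void and the loop does not wait for the awaited call to finish.
        var confDiscSizePerConfiguration = await GetData(configuration, allBuilds);
        ruleResult.AddRange(confDiscSizePerConfiguration.Select(iter => new Notification
        {
            ConfigurationId = iter.configurationId,
            CreatedDate = DateTime.UtcNow,
            RuleType = (int) RulesEnum.TooMuchDisc,
            ConfigrationPath = iter.configurationPath
        }));
    });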
I tried to overcome this by being less greedy and adding new ParallelOptions { MaxDegreeOfParallelism = 4 } to the parallel loop. My naive assumption was that since the default connection pool size is 100 and I only want to use 4 connections, there should be plenty of room. But it still fails.
I also tried creating a new DbContext inside the GetData method, but it still doesn't work. If I remember correctly (I can't check right now), I got:
The underlying connection failed to open.
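For reference, that variant looked roughly like this (reconstructed, not the exact code I ran; Tmp is a small DTO holding the four projected fields, and the cleanup filtering is elided):

    private async Task<List<Tmp>> GetData(Configuration configuration)
    {
        const int maxDiscAllocationPerConfiguration = 1000000000; // 1GB

        // Each call owns its own context, and therefore its own connection.
        using (var ctx = new LocalEntities())
        {
            var configurationBuilds = ctx.Builds
                .Where(x => x.configuration_id == configuration.configuration_id)
                .OrderByDescending(z => z.build_date);

            // ... same build/artifact cleanup filtering as in the original loop ...

            return await configurationBuilds
                .GroupBy(c => new { c.configuration_id })
                .Where(g => g.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration)
                .Select(g => new Tmp
                {
                    configurationId = g.FirstOrDefault().configuration_id,
                    configurationPath = g.FirstOrDefault().configuration_path,
                    Total = g.Sum(c => c.artifact_dir_size),
                    Average = g.Average(c => c.artifact_dir_size)
                })
                .ToListAsync();
        }
    }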
What are the options to speed up this procedure?