Using ToList() on IEnumerable LINQ query results for large data sets - performance issue?

I use LINQ queries in the application I am currently writing, and in one situation I need to convert the results of a LINQ query into a List for further processing (I have my reasons for wanting lists).

I would like to understand better what happens in this conversion and whether it introduces any inefficiency, since I now use it in a couple of places. So, suppose I am executing a line like the following:

var matches = (from x in list1 join y in list2 on x equals y select x).ToList(); 

Questions:

  • Is there any overhead besides creating a new list and filling it with references to the elements of the IEnumerable returned by the query?

  • Would you consider this inefficient?

  • Is there a way to get the LINQ query to produce a list directly, so the conversion is not needed?

+4
5 answers

Well, it creates a copy of the data. That may be inefficient - but it depends on what happens next. If you need a List<T> at the end, a List<T> is usually about as efficient as you can get. The only exception is if you are going to perform a conversion and the source is already a list; then using ConvertAll will be more efficient, since it can create the backing array at the right size to start with.
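That ConvertAll point can be sketched like this (hypothetical data; note that newer framework versions also optimize some Select/ToList cases, so measure on your own runtime):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var source = new List<int> { 1, 2, 3, 4, 5 };

// The general-purpose route: Select(...).ToList(). Depending on the
// framework version, the query may hide the source count, forcing the
// destination list to start small and grow by doubling.
List<string> viaLinq = source.Select(x => x.ToString()).ToList();

// List<T>.ConvertAll knows it is working on a list, so it can allocate
// the backing array at exactly the right size (5 elements) up front.
List<string> viaConvertAll = source.ConvertAll(x => x.ToString());

Console.WriteLine(viaLinq.SequenceEqual(viaConvertAll)); // True
```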

If you only need to stream the data - e.g. you are just going to foreach over it and take actions that do not affect the original data sources - then calling ToList is definitely a potential source of inefficiency. It forces the whole of list1 to be evaluated - and if that is a lazily evaluated sequence (e.g. "the first 1,000,000 values from a random number generator"), then that is not good. Note that since you are doing a join, list2 will be evaluated anyway as soon as you try to pull the first value out of the sequence (whether or not you fill a list).
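A minimal sketch of that laziness, using a hypothetical generator (the Console lines only exist to make the evaluation order visible):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static IEnumerable<int> Numbers(int count)
{
    for (int i = 0; i < count; i++)
    {
        Console.WriteLine($"producing {i}"); // shows when evaluation happens
        yield return i;
    }
}

var query = Numbers(3).Where(n => n % 2 == 0); // deferred: nothing printed yet

int first = query.First();      // pulls values only until the first match
List<int> all = query.ToList(); // forces the whole sequence to be generated

Console.WriteLine(all.Count); // 2
```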

You might want to read the Edulinq post on ToList to see what happens - in at least one possible implementation - behind the scenes.

+5
  • No overhead exists beyond the ones you already mentioned (creating the new list and copying references into it).

  • I would say yes, but it depends on the specific application scenario. In general, it is better to avoid extra allocations and copies (I think that is obvious).

  • I'm afraid not. A LINQ query returns a sequence of data, which can even be an infinite sequence. Converting it to a List<T> makes it concrete and gives you indexed access, which a plain sequence or stream cannot provide.

Suggestion: avoid situations where you need a List<T>. If you do need one, pull in only as much data as you need at the moment.
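One way to read that advice, sketched with a hypothetical oversized source:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Instead of materializing the whole (possibly huge) sequence into a
// List<T>, take just the slice you are about to process.
IEnumerable<int> hugeSequence = Enumerable.Range(0, int.MaxValue);

// Materializing everything would need gigabytes here:
// var everything = hugeSequence.ToList();

// Materializing only the page you need right now is cheap:
List<int> page = hugeSequence.Skip(100).Take(10).ToList();

Console.WriteLine(page.Count); // 10
Console.WriteLine(page[0]);    // 100
```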

Hope this helps.

+1

In addition to what has been said, if the two lists you are joining are already large, creating a third one (the "intersection" of the two) can cause out-of-memory errors. If you simply iterate over the result of the LINQ statement instead, you will reduce memory usage significantly.

+1

Enumerable.ToList(source) is essentially just a call to new List<T>(source) .

This constructor checks whether the source implements ICollection<T> , and if it does, allocates an array of the appropriate size up front. Otherwise - that is, in most cases where the source is a LINQ query - it allocates an array with the default initial capacity (four elements) and grows it, doubling the capacity as necessary. Each time the capacity doubles, a new array is allocated and the old one is copied into the new one.
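The growth pattern is easy to observe directly (the exact capacities are an implementation detail, but the doubling is visible):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var list = new List<int>();
int lastCapacity = -1;

foreach (int i in Enumerable.Range(0, 100))
{
    list.Add(i);
    if (list.Capacity != lastCapacity)
    {
        // Typically prints capacities 4, 8, 16, 32, 64, 128: each
        // reallocation copies the old array into one twice the size.
        Console.WriteLine($"count={list.Count,3}  capacity={list.Capacity}");
        lastCapacity = list.Capacity;
    }
}

// An ICollection<T> source avoids all of that: the size is known up
// front, so the backing array is allocated exactly once.
var presized = new List<int>(list);
Console.WriteLine(presized.Capacity); // 100
```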

This can lead to some overhead in cases where your list has a lot of items (we are probably talking tens of thousands at least). The overhead can become significant once the list grows past 85 KB, because it is then allocated on the Large Object Heap, which is not compacted and can suffer from memory fragmentation. Note that I am referring to the array inside the list. If T is a reference type, this array contains only references, not the actual objects, so those objects do not count toward the 85 KB limit.

You can remove some of this overhead if you can estimate the size of your sequence accurately (and it is better to overestimate a bit than to underestimate a bit). For example, if you only apply the .Select() operator to something that implements ICollection<T> , you know the size of the output list.

In such cases, this extension method will reduce this overhead:

    public static List<T> ToList<T>(this IEnumerable<T> source, int initialCapacity)
    {
        // parameter validation omitted for brevity
        var result = new List<T>(initialCapacity);
        foreach (T item in source)
        {
            result.Add(item);
        }
        return result;
    }
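A hypothetical usage of that overload, rewritten here as a local function named ToListWithCapacity so the sketch compiles as a single self-contained program:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Local stand-in for the answer's extension method.
static List<T> ToListWithCapacity<T>(IEnumerable<T> source, int initialCapacity)
{
    // parameter validation omitted for brevity
    var result = new List<T>(initialCapacity);
    foreach (T item in source)
    {
        result.Add(item);
    }
    return result;
}

ICollection<int> source = Enumerable.Range(0, 1000).ToArray();

// .Select() preserves the element count, so source.Count is an exact
// size estimate: the backing array is allocated once, at the right size.
List<int> squares = ToListWithCapacity(source.Select(x => x * x), source.Count);

Console.WriteLine(squares.Count);    // 1000
Console.WriteLine(squares.Capacity); // 1000
```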

In some cases, the list you create will simply replace a list that already exists, e.g. from a previous run. In those cases you can avoid quite a few memory allocations by reusing the old list. That only works if you do not have concurrent access to the old list, and I would not do it if the new lists will typically be much smaller than the old ones. If that fits your case, you can use this extension method:

    public static void CopyToList<T>(this IEnumerable<T> source, List<T> destination)
    {
        // parameter validation omitted for brevity
        destination.Clear();
        foreach (T item in source)
        {
            destination.Add(item);
        }
    }
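A sketch of the reuse pattern (again with the method as a local function so the example is self-contained; the buffer and sizes are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Local stand-in for the answer's CopyToList extension method.
static void CopyToList<T>(IEnumerable<T> source, List<T> destination)
{
    // parameter validation omitted for brevity
    destination.Clear();
    foreach (T item in source)
    {
        destination.Add(item);
    }
}

var buffer = new List<int>(1000);             // allocated once, then reused
CopyToList(Enumerable.Range(0, 800), buffer); // first "run"
CopyToList(Enumerable.Range(0, 900), buffer); // next run reuses the array

Console.WriteLine(buffer.Count);    // 900
Console.WriteLine(buffer.Capacity); // 1000 - Clear() keeps the capacity
```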

So, is .ToList() inefficient? Not if you have the memory and you are going to use the list repeatedly, either by indexing into it randomly many times or by iterating over it more than once.

Now back to your specific example:

 var matches = (from x in list1 join y in list2 on x equals y select x).ToList(); 

It may be more efficient to do this in another way, for example:

 var matches = list1.Intersect(list2).ToList(); 

which will give the same results provided list1 and list2 do not contain duplicates, and is very efficient if list2 is small.
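The duplicate caveat in concrete form, with hypothetical data:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var list1 = new List<int> { 1, 2, 2, 3 };
var list2 = new List<int> { 2, 3, 3, 4 };

// join keeps every matching pair, so duplicates multiply...
var viaJoin = (from x in list1 join y in list2 on x equals y select x).ToList();

// ...while Intersect is a set operation and returns distinct values.
var viaIntersect = list1.Intersect(list2).ToList();

Console.WriteLine(string.Join(",", viaJoin));      // 2,2,3,3
Console.WriteLine(string.Join(",", viaIntersect)); // 2,3
```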

The only way to find out for sure, as usual, is to measure with typical workloads.

0
  • Most of the overhead happens before the list is created: connecting to the database, fetching the data through the adapter, and - for a var - letting .NET decide what type / data structure to use...

  • Efficiency is very relative. For a programmer who is not very strong in SQL, LINQ with the overhead detailed above can still be more efficient and faster to work with than old-style ADO.

  • On the other hand, LINQ can call stored procedures in the database itself, which are faster still. I suggest the following test:

    • Run the program on the maximum amount of data and measure the time.
    • Use a database procedure to export the data to a file (e.g. XML, CSV, ...), build your list from that file, and measure the time. Then you can see whether the difference is significant. The second method is less convenient for the programmer, but it can reduce the execution time.
0
