Can I change the search method in LINQ?

I have a CSV file with 30,000 rows. I have to select many values based on many conditions, so I decided to use LINQ instead of a lot of loops and "if" statements. I wrote a class to read the CSV file. It implements IEnumerable so that it can be used with LINQ. This is my enumerator:

    class CSVEnumerator : IEnumerator
    {
        private CSVReader _csv;
        private int _index;

        public CSVEnumerator(CSVReader csv)
        {
            _csv = csv;
            _index = -1;
        }

        public void Reset() { _index = -1; }

        public object Current
        {
            get { return new CSVRow(_index, _csv); }
        }

        public bool MoveNext()
        {
            return ++_index < _csv.TotalRows;
        }
    }

It works, but it is slow. Let's say I want to select the maximum value in column A for the rows with IDs between 100 and 150.

 max = (from CSVRow r in csv where r.ID > 100 && r.ID < 150 select r).Max(y=>y["A"]); 

This works, but LINQ searches all 30,000 rows for the maximum instead of just the 49 that match. As I said, I could use a loop, but this is only an example; the real conditions are "cruel" :)

Is there a way to override how LINQ searches the collection? Something like: inspect the query inside my enumerator, check whether any of the conditions in the "where" clause filter on the row ID, and return different data based on that.

I do not want to copy part of the data to another array or collection, and the problem is not in my CSV reader. Access to each row by ID is fast; the only problem is that all 30,000 of them get accessed. Any help appreciated :-)

+6
3 answers

If you want to use LINQ for this efficiently, you will need to work with expression trees in a way similar to (but much simpler than) what the various LINQ providers for SQL databases do. Although that is possible, I think it would take a lot of code for such a simple task.
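To give a feel for what the expression-tree route involves, here is a minimal sketch that pulls ID bounds out of a predicate. Everything in it is an assumption for illustration: the stand-in CSVRow with only an ID property, and the hypothetical IdRange helper. It only understands comparisons of the form r.ID > constant and r.ID < constant joined by &&, with literal constants; captured variables, || and other operators are ignored.

```csharp
using System;
using System.Linq.Expressions;

// Stand-in for the question's row type; only ID matters for this sketch.
public class CSVRow
{
    public int ID { get; set; }
}

// Hypothetical helper: extracts exclusive ID bounds from a predicate.
public static class IdRange
{
    public static (int? Lower, int? Upper) FromPredicate(
        Expression<Func<CSVRow, bool>> predicate)
    {
        int? lower = null, upper = null;
        Collect(predicate.Body, ref lower, ref upper);
        return (lower, upper);
    }

    private static void Collect(Expression e, ref int? lower, ref int? upper)
    {
        if (!(e is BinaryExpression b)) return;

        if (b.NodeType == ExpressionType.AndAlso)
        {
            // Both sides of && must hold, so bounds from either side apply.
            Collect(b.Left, ref lower, ref upper);
            Collect(b.Right, ref lower, ref upper);
        }
        else if (IsIdAccess(b.Left)
                 && b.Right is ConstantExpression c
                 && c.Value is int bound)
        {
            if (b.NodeType == ExpressionType.GreaterThan)
                lower = lower.HasValue ? Math.Max(lower.Value, bound) : bound;
            else if (b.NodeType == ExpressionType.LessThan)
                upper = upper.HasValue ? Math.Min(upper.Value, bound) : bound;
        }
        // Anything else (||, method calls, ...) is left for normal filtering.
    }

    // True for "r.ID" where r is the lambda parameter.
    private static bool IsIdAccess(Expression e) =>
        e is MemberExpression m
        && m.Member.Name == nameof(CSVRow.ID)
        && m.Expression is ParameterExpression;
}
```

For the query in the question, IdRange.FromPredicate(r => r.ID > 100 && r.ID < 150) would yield the bounds 100 and 150, which could then drive a direct row lookup. The reader would also have to expose a Where method (or an IQueryable implementation) that accepts an Expression<Func<CSVRow, bool>> rather than a plain delegate, and still apply the remaining conditions, which is where most of the extra code comes from.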

Because of this, I believe that the best solution would be to use a separate method to select the desired rows (and then, possibly, LINQ to work with the result).

In addition, many operations that return collections (including your original code and my modification) can be simplified by using iterator methods, that is, methods that use yield return.

So your code might look something like this:

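    // Extension methods must be declared in a static class.
    public static IEnumerable<CSVRow> GetRows(
        this CSVReader reader, int idGreaterThan, int idLessThan)
    {
        // Both bounds are exclusive, matching the where clause in the question.
        for (int i = idGreaterThan + 1; i < idLessThan; i++)
        {
            // Argument order follows the CSVRow constructor used in the question.
            yield return new CSVRow(i, reader);
        }
    }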

Here it is an extension method on CSVReader, but another approach (e.g. a regular instance method on that class) may suit you better.

Your example would look something like this:

 max = csvReader.GetRows(100, 150).Max(y => y["A"]); 

(Also, it seems strange to me that when you have limits of 100 and 150, you really need lines between 101 and 149. But I assume that you have a reason for this, so I did the same.)
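The remark about iterator methods applies to the hand-written enumerator in the question as well. The following is a minimal sketch, assuming a TotalRows property and a CSVRow(index, reader) constructor as in the question; the class bodies are placeholder stand-ins, not the real reader.

```csharp
using System.Collections;
using System.Collections.Generic;

// Placeholder row type; the real CSVRow would read its cells from the reader.
public class CSVRow
{
    public int ID { get; }
    private readonly CSVReader _csv;

    public CSVRow(int index, CSVReader csv)
    {
        ID = index;
        _csv = csv;
    }
}

// Placeholder reader; only TotalRows and enumeration are sketched here.
public class CSVReader : IEnumerable<CSVRow>
{
    public int TotalRows { get; }

    public CSVReader(int totalRows)
    {
        TotalRows = totalRows;
    }

    // The compiler builds the enumerator from yield return,
    // so a separate CSVEnumerator class is no longer needed.
    public IEnumerator<CSVRow> GetEnumerator()
    {
        for (int i = 0; i < TotalRows; i++)
            yield return new CSVRow(i, this);
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```

Implementing the generic IEnumerable<CSVRow> also means queries no longer need the explicit CSVRow type in the from clause.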

+2

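As far as LINQ is concerned, r.ID is just a value being filtered on, so all 30k rows are examined before the Max operation. If it is actually the row index, as seems to be the case here, you can use Skip and Take to avoid comparing against all 30k rows. Assuming ID is the zero-based row index, IDs 101 through 149 correspond to skipping 101 rows and taking 49. If csv only implements the non-generic IEnumerable, as the enumerator in the question suggests, a Cast<CSVRow>() is needed first.

    max = csv.Cast<CSVRow>().Skip(101).Take(49).Max(y => y["A"]);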
+1

@DougM is right about the order of evaluation, but in this case I would take a one-time hit at initialization and build a lookup for any "index" fields: basically, precompute a map (dictionary) from index value to row. However, this is only worthwhile if you run many repeated queries against a given index field.
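A minimal sketch of such a precomputed lookup follows. The Row and ColumnIndex types are hypothetical stand-ins, not part of the question's reader, and ToLookup is used instead of a plain dictionary so that several rows can share the same value.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal row: an ID plus named string cells.
public class Row
{
    public int ID { get; }
    private readonly Dictionary<string, string> _cells;

    public Row(int id, Dictionary<string, string> cells)
    {
        ID = id;
        _cells = cells;
    }

    public string this[string column] => _cells[column];
}

// Hypothetical index over one column, built once and queried many times.
public class ColumnIndex
{
    private readonly ILookup<string, Row> _lookup;

    // One full pass over the rows; this is the one-time initialization cost.
    public ColumnIndex(IEnumerable<Row> rows, string column)
    {
        _lookup = rows.ToLookup(r => r[column]);
    }

    // Rows whose indexed column equals value; empty if there are none.
    public IEnumerable<Row> Find(string value) => _lookup[value];
}
```

After the index is built, each Find call is a hash lookup rather than a scan of all rows, which only pays off when the same column is queried repeatedly.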

0
