The most efficient way to handle a large CSV in .NET

Forgive me for being a novice, but I just need some guidance, and I cannot find another question that answers this. I have a fairly large CSV file (~300k lines), and for a given input I need to determine whether any line in the CSV starts with that input. I have sorted the CSV alphabetically, but I don't know:

1) how to handle the lines in the CSV - should I read them into a list/collection, or use OLEDB, or an embedded database, or something else?

2) how to search the alphabetical list efficiently (taking advantage of the fact that it is sorted to speed up the search, rather than scanning the whole list)

+8
Tags: c#, search, csv
10 answers

You don't give enough details for a specific answer, but...


IF the CSV file changes frequently, then use OLEDB and just modify the SQL query based on your input.

string sql = @"SELECT * FROM [" + fileName + "] WHERE Column1 LIKE 'blah%'"; using(OleDbConnection connection = new OleDbConnection( @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + fileDirectoryPath + ";Extended Properties=\"Text;HDR=" + hasHeaderRow + "\"")) 

IF the CSV file does not change often, and you run a lot of "queries" against it, load it once into memory and search it quickly each time.

IF you want your search to be an exact match on a column, use a Dictionary where the key is the column you want to match on and the value is the row data.

    Dictionary<long, string> Rows = new Dictionary<long, string>();
    ...
    if (Rows.ContainsKey(search)) ...

IF you want your search to be a partial match, such as StartsWith, then keep one array containing your searchable data (i.e. the first column) and another list or array containing your row data. Then use C#'s built-in binary search: http://msdn.microsoft.com/en-us/library/2cy9f6wb.aspx

    string[] SortedSearchables;       // filled with the searchable column, in sorted order
    List<string> SortedRows = new List<string>();
    ...
    string result = null;
    int foundIdx = Array.BinarySearch<string>(SortedSearchables, searchTerm);
    if (foundIdx < 0)
    {
        foundIdx = ~foundIdx;
        if (foundIdx < SortedRows.Count && SortedSearchables[foundIdx].StartsWith(searchTerm))
        {
            result = SortedRows[foundIdx];
        }
    }
    else
    {
        result = SortedRows[foundIdx];
    }

NOTE The code was written inside the browser window and may contain syntax errors since it has not been tested.

+7

If you can cache the data in memory, and you only need to search the list on a single primary-key column, I would recommend storing the data in memory as a Dictionary object. The Dictionary class stores data as key/value pairs in a hash table. You could use the primary-key column as the key in the dictionary, and the remaining columns as the value. Looking up items by key in a hash table is typically very fast.

For example, you could load the data into a dictionary like this:

    // TextFieldParser is in the Microsoft.VisualBasic.FileIO namespace
    Dictionary<string, string[]> data = new Dictionary<string, string[]>();
    using (TextFieldParser parser = new TextFieldParser(@"C:\test.csv"))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        while (!parser.EndOfData)
        {
            try
            {
                string[] fields = parser.ReadFields();
                data[fields[0]] = fields;
            }
            catch (MalformedLineException ex)
            {
                // ...
            }
        }
    }

And then you can get the data for any item like this:

    string[] fields = data["key I'm looking for"];
+4

If you are only doing this once per program run, this seems pretty fast. (Updated to use a StreamReader instead of a FileStream based on the comments below.)

    static string FindRecordBinary(string search, string fileName)
    {
        using (StreamReader fs = new StreamReader(fileName))
        {
            long min = 0; // TODO: What about the header row?
            long max = fs.BaseStream.Length;
            while (min <= max)
            {
                long mid = (min + max) / 2;
                fs.BaseStream.Position = mid;
                fs.DiscardBufferedData();
                if (mid != 0) fs.ReadLine(); // skip the partial line we landed in the middle of
                string line = fs.ReadLine();
                if (line == null) { min = mid + 1; continue; }

                int compareResult;
                if (line.Length > search.Length)
                    compareResult = String.Compare(line, 0, search, 0, search.Length, false);
                else
                    compareResult = String.Compare(line, search);

                if (0 == compareResult) return line;
                else if (compareResult > 0) max = mid - 1;
                else min = mid + 1;
            }
        }
        return null;
    }

This runs in 0.007 seconds on a 600,000-record test file that is 50 MB. For comparison, a full file scan averages over half a second, depending on where the record is located (a 100-fold difference).

Obviously, if you do this more than once, caching will speed it up. One simple way to partially cache it would be to keep the StreamReader open and reuse it, just resetting min and max each time; that saves you from holding 50 MB in memory the whole time.
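For illustration, a sketch of that caching idea (my own, not from the original answer): take an already-open StreamReader as a parameter and reset min and max on every call.

    // Sketch: the caller opens the StreamReader once and reuses it for every lookup.
    static string FindRecordBinary(string search, StreamReader fs)
    {
        long min = 0;                            // reset the bounds on every call
        long max = fs.BaseStream.Length;
        while (min <= max)
        {
            long mid = (min + max) / 2;
            fs.BaseStream.Position = mid;
            fs.DiscardBufferedData();
            if (mid != 0) fs.ReadLine();         // skip the partial line we landed in
            string line = fs.ReadLine();
            if (line == null) { min = mid + 1; continue; }

            int compareResult = line.Length > search.Length
                ? String.Compare(line, 0, search, 0, search.Length, false)
                : String.Compare(line, search);

            if (compareResult == 0) return line;
            if (compareResult > 0) max = mid - 1; else min = mid + 1;
        }
        return null;
    }

    // Usage: open once, search many times, dispose when done.
    // using (var reader = new StreamReader(fileName))
    // {
    //     string first = FindRecordBinary("firstTerm", reader);
    //     string second = FindRecordBinary("secondTerm", reader);
    // }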

EDIT: Added knaki02's suggested fix.

+4

Given that the CSV is sorted - if you can load the whole thing into memory (and the only processing you need to do is a .StartsWith() on each line) - you can use a binary search for exceptionally fast searching.

Maybe something like this (NOT TESTED!):

    var csv = File.ReadAllLines(@"c:\file.csv").ToList();
    var exists = csv.BinarySearch("StringToFind", new StartsWithComparer());
    // exists >= 0 when a line starting with "StringToFind" was found

...

    public class StartsWithComparer : IComparer<string>
    {
        public int Compare(string x, string y)
        {
            if (x.StartsWith(y))
                return 0;
            else
                return x.CompareTo(y);
        }
    }
+3

If your file is already in memory (for example, because you sorted it) and you keep it as an array of strings (lines), you can use a simple bisection search. You could start from the code on this topic on Code Review, changing the comparer to work with string instead of int and to check only the beginning of each line.
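For illustration only (a sketch under my own assumptions, not the Code Review code), a prefix-only bisection over a sorted array of lines might look like this; it assumes the lines are sorted with an ordinal comparison:

    // Sketch: binary search over sorted lines, comparing only the first prefix.Length characters.
    static bool ContainsLineStartingWith(string[] sortedLines, string prefix)
    {
        int lo = 0, hi = sortedLines.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            int cmp = string.Compare(sortedLines[mid], 0, prefix, 0, prefix.Length,
                                     StringComparison.Ordinal);
            if (cmp == 0) return true;   // this line starts with the prefix
            if (cmp < 0) lo = mid + 1;
            else hi = mid - 1;
        }
        return false;
    }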

If you need to re-read the file every time, because it may be changed or saved/sorted by another program, then the simplest algorithm is the best one:

    using (var stream = File.OpenText(path))
    {
        string line;
        while ((line = stream.ReadLine()) != null)
        {
            // Replace this with your comparison / CSV splitting
            if (line.StartsWith("..."))
            {
                // The file contains a line with the required input
            }
        }
    }

Of course, you can read the entire file into memory (to use LINQ or List<T>.BinarySearch()) every time, but this is far from optimal (you will read everything even when you may only need to examine a few lines), and the file itself could even be too large to fit in memory.

If you really need something more, and you don't have the file in memory because of the sorting (but you should profile your actual performance against your requirements), you would have to implement a better search algorithm, for example the Boyer-Moore algorithm.

+1

The OP stated they only need to search by line.

The question is then whether to hold the lines in memory or not.

If a line is 1 KB, that is 300 MB of memory.
If a line is 1 MB, that is 300 GB of memory.

StreamReader.ReadLine will have a low memory profile.
Since the file is sorted, you can stop looking as soon as the current line is greater than the input.
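A rough sketch of that streaming approach (my own illustration; the method and file name are placeholders, and the file is assumed to be sorted ordinally):

    // Sketch: stream the file line by line and stop as soon as we are past any possible match.
    static bool AnyLineStartsWith(string path, string input)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.StartsWith(input, StringComparison.Ordinal))
                    return true;                 // found a line starting with the input
                if (string.Compare(line, input, StringComparison.Ordinal) > 0)
                    return false;                // sorted: no later line can start with the input
            }
        }
        return false;
    }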

If you hold it in memory, then a simple List<String> with LINQ will work.
LINQ is not smart enough to take advantage of the sort, but against 300K lines it will still be pretty fast.

BinarySearch, on the other hand, does take advantage of the sort.
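For illustration, both options might look roughly like this (my own sketch, not from the answer; `input` is a placeholder and the file is assumed to be sorted ordinally):

    // Assumes the whole file fits in memory.
    List<string> lines = File.ReadAllLines("data.csv").ToList();   // file name is illustrative

    // LINQ: a linear scan that ignores the sort, but is simple and still fast for ~300K lines.
    bool foundLinq = lines.Any(l => l.StartsWith(input, StringComparison.Ordinal));

    // BinarySearch: exploits the sort. A negative result is the bitwise complement of the
    // insertion point, i.e. the index of the first line >= input, where a prefix match would sit.
    int idx = lines.BinarySearch(input, StringComparer.Ordinal);
    bool foundBinary = idx >= 0
        || (~idx < lines.Count && lines[~idx].StartsWith(input, StringComparison.Ordinal));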

+1

I wrote this quickly for work; it can be improved...

Define the column numbers:

    private enum CsvCols
    {
        PupilReference = 0,
        PupilName = 1,
        PupilSurname = 2,
        PupilHouse = 3,
        PupilYear = 4,
    }

Define the model:

    public class ImportModel
    {
        public string PupilReference { get; set; }
        public string PupilName { get; set; }
        public string PupilSurname { get; set; }
        public string PupilHouse { get; set; }
        public string PupilYear { get; set; }
    }

Import and populate the list of models:

    var rows = File.ReadLines(csvfilePath).Select(p => p.Split(',')).Skip(1).ToArray();

    var pupils = rows.Select(x => new ImportModel
    {
        PupilReference = x[(int)CsvCols.PupilReference],
        PupilName = x[(int)CsvCols.PupilName],
        PupilSurname = x[(int)CsvCols.PupilSurname],
        PupilHouse = x[(int)CsvCols.PupilHouse],
        PupilYear = x[(int)CsvCols.PupilYear],
    }).ToList();

Returns a list of strongly typed objects
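To relate this back to the question, a StartsWith lookup over that list could then be as simple as this (my own sketch; `input` is a placeholder):

    // Does any row's PupilReference begin with the given input?
    bool exists = pupils.Any(p => p.PupilReference.StartsWith(input, StringComparison.Ordinal));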

+1

Try the free CSV Reader. No need to reinvent the wheel over and over ;)

1) If you do not need to store the results, just iterate over the CSV - process each line and forget it. If you need to process all the lines again and again, store them in a List or a Dictionary (with a good key, of course).

2) Try generic extension methods like these:

    var list = new List<string>() { "a", "b", "c" };

    string oneA = list.FirstOrDefault(entry =>
        !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));

    IEnumerable<string> allAs = list.Where(entry =>
        !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));
0

Here is my VB.NET code. It is for a quote-qualified CSV, so for a regular CSV change Let n = P.Split(New Char() {""","""}) to Let n = P.Split(New Char() {","}).

    Dim path as String = "C:\linqpad\Patient.txt"
    Dim pat = System.IO.File.ReadAllLines(path)
    Dim Patz = From P in pat _
               Let n = P.Split(New Char() {""","""}) _
               Order by n(5) _
               Select New With { .Doc = n(1), _
                                 .Loc = n(3), _
                                 .Chart = n(5), _
                                 .PatientID = n(31), _
                                 .Title = n(13), _
                                 .FirstName = n(9), _
                                 .MiddleName = n(11), _
                                 .LastName = n(7), .StatusID = n(41) _
                               }
    Patz.dump
0

I would usually recommend finding a dedicated CSV parser (like this or this). However, I noticed this line in your question:

I need to determine for a given input whether any line in the CSV starts with that input.

This tells me that any computer time spent parsing the CSV data before that determination is made is wasted time. You just need code that matches text against text, and you can do that with a plain string comparison as easily as anything else.

In addition, you mention that the data is sorted. This should let you speed things up tremendously... but you need to be aware that to take advantage of it you will have to write your own code to make seek calls on low-level file streams. This will be by far the best-performing option, but it will also by far require the most initial work and maintenance.

I recommend an engineering-based approach, where you set a performance goal, build something relatively simple, and measure the results against that goal. In particular, start with the 2nd link above. A CSV reader that only loads one record into memory at a time should perform pretty well, and it is easy to get started with. Build something that uses that reader, and measure the results. If they meet your goal, stop there.

If they do not meet your goal, adapt the code from the link so that, as you read each line, you do the string comparison first (before bothering to parse the CSV data), and only parse the CSV for the lines that match. This should perform better, but only do the work if the first option does not meet your goals. When it is ready, measure the performance again.
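A rough sketch of that adaptation (my own illustration; ParseCsvLine is a hypothetical stand-in for whichever CSV parser you chose, and `input` is a placeholder):

    // Sketch: compare the raw line to the input first; only parse lines that actually match.
    using (var reader = new StreamReader("data.csv"))              // file name is illustrative
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (!line.StartsWith(input, StringComparison.Ordinal))
                continue;                                          // skip without any CSV parsing
            string[] fields = ParseCsvLine(line);                  // hypothetical helper for your CSV library
            // ... use fields ...
        }
    }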

Finally, if you still haven't reached your performance goal, then you are into the territory of writing low-level code to do a binary search on your file stream using seeks. This is probably the best you can do performance-wise, but it will be very messy, error-prone code to write, so you only want to go there if you absolutely cannot meet your goals with the earlier steps.

Remember, performance is a feature, and like any other feature you need to evaluate how you build it against real design goals. "As fast as possible" is not a reasonable design goal. Something like "respond to a user search within 0.25 seconds" is a real design goal, and if the simpler but slower code still meets that goal, you should stop there.

0
