How to split a huge file into words?

How can I read a very long line from a text file and then process it (split into words)?

I tried the StreamReader.ReadLine() method, but I get an OutOfMemoryException. Apparently my lines are very long. This is my code to read the file:

    using (var streamReader = File.OpenText(_filePath))
    {
        int lineNumber = 1;
        string currentString = String.Empty;
        while ((currentString = streamReader.ReadLine()) != null)
        {
            ProcessString(currentString, lineNumber);
            Console.WriteLine("Line {0}", lineNumber);
            lineNumber++;
        }
    }

And the code that breaks the string into words:

    var wordPattern = @"\w+";
    var matchCollection = Regex.Matches(text, wordPattern);
    var words = (from Match word in matchCollection
                 select word.Value.ToLowerInvariant()).ToList();
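(Worth noting: the ToList() above materializes every word at once, while the MatchCollection itself is evaluated lazily. A sketch of the lazy version, using hypothetical sample text, is below; the line string itself must still fit in memory, which is the real limit here.)

```csharp
using System;
using System.Text.RegularExpressions;

class LazyMatches
{
    static void Main()
    {
        var text = "Hello, World! Hello";
        int count = 0;
        // MatchCollection is evaluated lazily, so each word can be
        // processed one at a time instead of building a full List<string>.
        foreach (Match word in Regex.Matches(text, @"\w+"))
        {
            var w = word.Value.ToLowerInvariant();
            if (w == "hello") count++;
        }
        Console.WriteLine(count); // 2
    }
}
```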
3 answers

With yield return you can read the file one character at a time, building words as you go; the iterator is deferred, so you never have to load the whole file at once:

    private static IEnumerable<string> ReadWords(string filename)
    {
        using (var reader = new StreamReader(filename))
        {
            var builder = new StringBuilder();
            while (!reader.EndOfStream)
            {
                char c = (char)reader.Read();
                // Mimics regex \w - almost.
                if (char.IsLetterOrDigit(c) || c == '_')
                {
                    builder.Append(c);
                }
                else if (builder.Length > 0)
                {
                    yield return builder.ToString();
                    builder.Clear();
                }
            }
            if (builder.Length > 0)
            {
                yield return builder.ToString();
            }
        }
    }

The code reads the file character by character; when it encounters a non-word character, it yields the word accumulated so far (only once, on the first non-word character of a run). A StringBuilder is used to build up each word.

Char.IsLetterOrDigit() behaves almost like the regex word character \w, except that \w also matches underscores (among others), which is why the code checks for '_' separately. If your input contains further characters you want to treat as part of a word, you will have to extend the if().
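As a usage sketch (the temp-file setup and frequency counting are illustrative, not part of the answer), the iterator can be consumed lazily, so word counts are built without ever holding the whole file in memory:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class WordCount
{
    // Same streaming word reader as in the answer above.
    static IEnumerable<string> ReadWords(string filename)
    {
        using (var reader = new StreamReader(filename))
        {
            var builder = new StringBuilder();
            while (!reader.EndOfStream)
            {
                char c = (char)reader.Read();
                if (char.IsLetterOrDigit(c) || c == '_')
                {
                    builder.Append(c);
                }
                else if (builder.Length > 0)
                {
                    yield return builder.ToString();
                    builder.Clear();
                }
            }
            if (builder.Length > 0)
                yield return builder.ToString();
        }
    }

    static void Main()
    {
        // Illustrative input; in practice pass your real file path.
        var path = Path.GetTempFileName();
        File.WriteAllText(path, "the cat and the hat");

        var counts = new Dictionary<string, int>();
        foreach (var word in ReadWords(path))
        {
            var key = word.ToLowerInvariant();
            counts[key] = counts.TryGetValue(key, out var n) ? n + 1 : 1;
        }
        Console.WriteLine(counts["the"]); // 2
        File.Delete(path);
    }
}
```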


Divide it into bite-sized sections. Instead of reading the 4 GB at once, read it in smaller pieces of a few hundred thousand characters each, and that should help.
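A minimal sketch of that idea (the tiny buffer and sample text are only for demonstration): read the file in fixed-size character chunks with StreamReader.Read, and carry any partial word at the end of one chunk over into the next:

```csharp
using System;
using System.IO;
using System.Text;

class ChunkedReader
{
    static void Main()
    {
        // Illustrative input file; in practice use your real path.
        var path = Path.GetTempFileName();
        File.WriteAllText(path, "alpha beta gamma");

        var buffer = new char[4];        // tiny for demonstration; use e.g. 64 KB in practice
        var word = new StringBuilder();  // partial word carried across chunk boundaries
        int wordCount = 0;

        using (var reader = new StreamReader(path))
        {
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    if (char.IsLetterOrDigit(buffer[i]) || buffer[i] == '_')
                    {
                        word.Append(buffer[i]);
                    }
                    else if (word.Length > 0)
                    {
                        wordCount++;     // a complete word ended inside this chunk
                        word.Clear();
                    }
                }
            }
            if (word.Length > 0) wordCount++;  // flush the final word
        }

        Console.WriteLine(wordCount); // 3
        File.Delete(path);
    }
}
```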


Garbage collection may be a solution; I am not sure that this is the problem. But if it is, a simple GC.Collect is often insufficient and, for performance reasons, should only be called when really necessary. Try the following procedure, which forces a garbage collection when available memory gets too low (below the threshold passed as a parameter).

    int charReadSinceLastMemCheck = 0;

    using (var streamReader = File.OpenText(_filePath))
    {
        int lineNumber = 1;
        string currentString = String.Empty;
        while ((currentString = streamReader.ReadLine()) != null)
        {
            ProcessString(currentString, lineNumber);
            Console.WriteLine("Line {0}", lineNumber);
            lineNumber++;
            charReadSinceLastMemCheck += currentString.Length;
            if (charReadSinceLastMemCheck > 1000000)
            {
                // Check memory every ~1 MB read, and collect garbage if required.
                CollectGarbage(100);
                charReadSinceLastMemCheck = 0;
            }
        }
    }

    internal static void CollectGarbage(int SizeToAllocateInMo)
    {
        long[,] TheArray;
        try
        {
            // Try to allocate SizeToAllocateInMo megabytes
            // (each row of 125,000 longs is 1 MB).
            TheArray = new long[SizeToAllocateInMo, 125000];
        }
        catch
        {
            // Allocation failed: memory is low, so force a full collection.
            TheArray = null;
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();
        }
        TheArray = null;
    }
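As an alternative to allocating an array and catching the failure, .NET offers System.Runtime.MemoryFailPoint, which checks whether roughly the given number of megabytes could be satisfied without actually committing the memory. A sketch under that assumption (the helper name and the 100 MB threshold are illustrative):

```csharp
using System;
using System.Runtime;

class MemoryCheck
{
    // Returns true if roughly sizeInMegabytes of memory appears available.
    // MemoryFailPoint reserves an estimate rather than allocating a real array.
    static bool MemoryAvailable(int sizeInMegabytes)
    {
        try
        {
            using (new MemoryFailPoint(sizeInMegabytes))
            {
                return true;
            }
        }
        catch (InsufficientMemoryException)
        {
            return false;
        }
    }

    static void Main()
    {
        if (!MemoryAvailable(100))
        {
            // Low on memory: force a full collection, as in the answer above.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();
        }
        Console.WriteLine("checked");
    }
}
```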
