Efficient analysis of a large text file in C#

I need to read a large, space-delimited text file and count the number of instances of each code in the file. Essentially, these are the results of an experiment run hundreds of thousands of times. The system spits out a text file that looks something like this:

A7PS A8PN A6PP23 ... 

There are literally hundreds of thousands of these records, and I need to count the occurrences of each code.

I suppose I could just open a StreamReader, go through it line by line, splitting on the space character, check whether each code has already been seen, and add 1 to that code's count. However, that seems rather naive given the size of the data.

Does anyone know of an efficient algorithm to handle this kind of processing?

UPDATE:

OK, so the consensus seems to be that my approach is along the right lines.

I would be interested to hear thoughts on things like: which is more efficient - StreamReader, TextReader, or BinaryReader?

What is the best structure for storing my results - Hashtable, SortedList, or HybridDictionary?

If there are no line breaks in the file (I haven't been given a sample yet), will just splitting the whole thing on spaces be inefficient?

Essentially, I am looking to make it as fast as possible.

Thanks again.

+6
c# algorithm parsing text-processing
8 answers

Your approach looks good.

  • Read the file line by line
  • Split each line on the space character
  • If a code is not yet in the dictionary, add it; if it is, increment its count (a minimal sketch follows below)
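
A minimal sketch of that approach, assuming the input lives at a placeholder path "codes.txt":

  using System;
  using System.Collections.Generic;
  using System.IO;

  class CodeCounter
  {
      static void Main()
      {
          var counts = new Dictionary<string, int>();

          // Read line by line, split on spaces, and tally each code.
          using (var reader = new StreamReader("codes.txt"))
          {
              string line;
              while ((line = reader.ReadLine()) != null)
              {
                  foreach (var code in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
                  {
                      int n;
                      counts.TryGetValue(code, out n); // n stays 0 for a code seen for the first time
                      counts[code] = n + 1;
                  }
              }
          }

          foreach (var pair in counts)
              Console.WriteLine(pair.Key + ": " + pair.Value);
      }
  }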
+5

I would say that in general your approach is right, but there is an opportunity for parallelism. I would suggest starting several threads or Tasks (in .NET 4), each parsing a part/chunk of the file. Also, instead of reading line by line, reading in chunks of bytes will give better disk I/O performance.

Edit: here is an outline of the solution.

  • Let's say we process M chunks of N characters at a time (because we want to limit the amount of memory needed and the number of threads used).
  • Allocate an N*M character buffer. We will use this buffer cyclically.
  • Use the producer-consumer pattern. The producer fills the buffer, trying to find a word boundary near each chunk boundary (i.e. near every Nth character). So we end up with M chunks of approximately N characters each, with start and end indexes into the buffer.
  • Now start M worker threads to process the chunks. Each worker uses its own dictionary for word counting - this avoids the need for thread synchronization.
  • Aggregate the results at the end of each iteration. The process is repeated until the entire file has been read.

Of course, I am assuming really huge files to justify this approach. I would probably use old-style character searching in the buffer to find the word-boundary marks, possibly marking the code unsafe to avoid bounds checks.
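
A simplified sketch of the scheme, substituting Tasks over copied chunk strings for the cyclic buffer (so memory and the number of in-flight tasks are not bounded here the way they would be in the full design; ChunkSize is an arbitrary choice of mine):

  using System;
  using System.Collections.Generic;
  using System.IO;
  using System.Threading.Tasks;

  class ParallelCounter
  {
      const int ChunkSize = 1 << 20; // N: one million characters per chunk

      static Dictionary<string, int> CountCodes(string path)
      {
          var tasks = new List<Task<Dictionary<string, int>>>();

          using (var reader = new StreamReader(path))
          {
              var buffer = new char[ChunkSize];
              string carry = ""; // partial code spilling over from the previous chunk
              int read;
              while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
              {
                  // Back up to the last separator so no code is split in half.
                  int end = read;
                  while (end > 0 && buffer[end - 1] != ' ' && buffer[end - 1] != '\n') end--;
                  if (end == 0) { carry += new string(buffer, 0, read); continue; }

                  string chunk = carry + new string(buffer, 0, end);
                  carry = new string(buffer, end, read - end);

                  // Each task counts into its own dictionary - no synchronization needed.
                  tasks.Add(Task.Factory.StartNew(() => CountChunk(chunk)));
              }
              if (carry.Length > 0) tasks.Add(Task.Factory.StartNew(() => CountChunk(carry)));
          }

          // Merge the per-chunk dictionaries into the final result.
          var totals = new Dictionary<string, int>();
          foreach (var task in tasks)
              foreach (var pair in task.Result)
              {
                  int n;
                  totals.TryGetValue(pair.Key, out n);
                  totals[pair.Key] = n + pair.Value;
              }
          return totals;
      }

      static Dictionary<string, int> CountChunk(string chunk)
      {
          var counts = new Dictionary<string, int>();
          foreach (var code in chunk.Split(new[] { ' ', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
          {
              int n;
              counts.TryGetValue(code, out n);
              counts[code] = n + 1;
          }
          return counts;
      }
  }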

+4

I agree with PoweRoy's comment: why not just try it? Maybe there won't be any problem in practice.

If you need something more, you could try writing some code that takes a Stream and returns an IEnumerable<string>. It would read characters from its input one at a time - if you need buffering for efficiency, you can always wrap the FileStream you actually pass in a BufferedStream - and check whether each one is a space (or maybe an end-of-line?). If it isn't, it appends the character to a string buffer (a StringBuilder, perhaps?); if it is, it yield returns the current buffer and clears it.
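
Something along these lines (the class and method names are mine; I treat both spaces and newlines as separators):

  using System.Collections.Generic;
  using System.IO;
  using System.Text;

  static class TokenReader
  {
      // Lazily yields the space/newline-separated tokens of a stream.
      public static IEnumerable<string> ReadCodes(Stream input)
      {
          using (var reader = new StreamReader(input))
          {
              var buffer = new StringBuilder();
              int c;
              while ((c = reader.Read()) != -1)
              {
                  if (c == ' ' || c == '\r' || c == '\n')
                  {
                      if (buffer.Length > 0)
                      {
                          yield return buffer.ToString();
                          buffer.Length = 0;
                      }
                  }
                  else
                  {
                      buffer.Append((char)c);
                  }
              }
              if (buffer.Length > 0)
                  yield return buffer.ToString(); // last token when the file doesn't end in a separator
          }
      }
  }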

After that, you can simply foreach over the result of calling this code on the file's contents, and you will get the codes from the file one by one.

Then you can use a data structure such as a Dictionary<string,int> to count the occurrences of each code, storing the code as the key and the count as the value. But that step would be the same if you read the file line by line and used string.Split to break the lines on spaces.

+1

If you want to try something else, you could use a BinaryReader and read the stream byte by byte, incrementing a counter every time you encounter a space.
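
A rough sketch of that, extended to tally the codes as well as count separators, assuming the codes are plain ASCII and using "codes.txt" as a stand-in path:

  using System;
  using System.Collections.Generic;
  using System.IO;
  using System.Text;

  class ByteCounter
  {
      static void Main()
      {
          var counts = new Dictionary<string, int>();
          var current = new StringBuilder();

          using (var reader = new BinaryReader(File.OpenRead("codes.txt")))
          {
              long remaining = reader.BaseStream.Length;
              while (remaining-- > 0)
              {
                  char c = (char)reader.ReadByte();
                  if (c == ' ' || c == '\r' || c == '\n')
                  {
                      if (current.Length > 0) { Tally(counts, current.ToString()); current.Length = 0; }
                  }
                  else
                  {
                      current.Append(c);
                  }
              }
          }
          if (current.Length > 0) Tally(counts, current.ToString()); // trailing code, if any
      }

      static void Tally(Dictionary<string, int> counts, string code)
      {
          int n;
          counts.TryGetValue(code, out n);
          counts[code] = n + 1;
      }
  }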

+1

Hundreds of thousands of records is not that many. I would use a Dictionary<string,int> to store the keys and counts.

But if you run into memory problems, why not use a database - even a file-based one such as SQL Compact or SQLite? Create a table with records containing a key and a count.

Keeping the data in memory is fastest for small amounts of data, but once you hit the machine's memory limits, a database will be faster.
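
A rough sketch of that table and the counting logic, assuming the System.Data.SQLite package (the connection string, table, and column names are my own choices):

  using System.Collections.Generic;
  using System.Data.SQLite;

  static class DbCounter
  {
      static void CountIntoDatabase(IEnumerable<string> codes)
      {
          using (var conn = new SQLiteConnection("Data Source=counts.db"))
          {
              conn.Open();
              new SQLiteCommand(
                  "CREATE TABLE IF NOT EXISTS counts (code TEXT PRIMARY KEY, n INTEGER NOT NULL)",
                  conn).ExecuteNonQuery();

              // One transaction around the whole loop; committing per row would be far slower.
              using (var tx = conn.BeginTransaction())
              {
                  var insert = new SQLiteCommand(
                      "INSERT OR IGNORE INTO counts (code, n) VALUES (@code, 0)", conn);
                  var update = new SQLiteCommand(
                      "UPDATE counts SET n = n + 1 WHERE code = @code", conn);
                  insert.Parameters.Add(new SQLiteParameter("@code"));
                  update.Parameters.Add(new SQLiteParameter("@code"));

                  foreach (var code in codes)
                  {
                      insert.Parameters["@code"].Value = code;
                      update.Parameters["@code"].Value = code;
                      insert.ExecuteNonQuery();
                      update.ExecuteNonQuery();
                  }
                  tx.Commit();
              }
          }
      }
  }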

+1

At the most basic level, I would start with a Dictionary<string, int>, string.Split the document on spaces, and keep count by simply iterating over that data.

string.Split is a fairly robust method that (someone will certainly correct me if I am mistaken) was built to handle regular expressions, and is vastly more complex than what you need for this scenario.

Writing your own split method will likely be a more performant solution than the built-in one. That said, I suggest using the off-the-shelf version first, as described above, and only rewriting your own if you find performance to be a problem.

Yang

0

Unless there are other constraints, you have to read the whole file, as you described.

To store the codes and counts, you should use a data structure that allows O(log n) search and insertion; SortedDictionary will do this in C#.

EDIT:

What is the best structure for storing my results - Hashtable, SortedList, or HybridDictionary?

Since sorted order does not seem to be required, a HybridDictionary or a Dictionary will perform better in most cases. SortedList will probably be the slowest solution, because inserts take O(n). You should run some tests with the different implementations if performance is that important.

0
  // Requires: using System; using System.Collections.Generic; using System.Diagnostics;
  static string LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
  static string NUMBERS = "1234567890";
  static Random rdGen = new Random();
  static Dictionary<string, int> myDic = new Dictionary<string, int>();

  static void WriteTest(int max)
  {
      myDic = new Dictionary<string, int>();
      Stopwatch sw = new Stopwatch();
      sw.Start();

      for (int i = 0; i < max; i++)
      {
          // Build a random code of the form letter-digit-letter-letter, e.g. "A7PS".
          string code =
              LETTERS[rdGen.Next(0, 26)].ToString() +
              NUMBERS[rdGen.Next(0, 10)].ToString() +
              LETTERS[rdGen.Next(0, 26)].ToString() +
              LETTERS[rdGen.Next(0, 26)].ToString();

          if (myDic.ContainsKey(code))
              myDic[code]++;
          else
              myDic[code] = 1;
      }

      sw.Stop();
      Console.WriteLine(max.ToString() + " iterations: " + sw.ElapsedMilliseconds.ToString());
  }

  WriteTest(10000000); // takes 7.5 seconds

It seems pretty efficient to me.

0
