Multicore text file parsing

I have a quad-core computer, and I would like to write code that parses a text file using all four cores. The text file basically contains one entry per line.

Multithreading is not my forte, so I'm wondering if anyone could give me some patterns I could use to parse the file in an optimal manner.

My first thought is to read all the lines into some kind of queue and then spin up threads to pull lines off the queue and process them, but that means the whole queue has to sit in memory, and these are fairly large files, so I'm not keen on that idea.

My next thought is to have some kind of controller that reads in lines and assigns each one to a thread for processing, but I'm not sure whether the controller will become a bottleneck if the threads process the lines faster than it can read and assign them.

I suspect there may be another, simpler solution than either of these, but at the moment I just don't see it.

+6
multithreading c#
7 answers

I would go with your original idea. If you are concerned that the queue may become too large, build in a buffer zone for it (that is, if it gets above 100 lines, stop reading the file, and if it drops below 20, start reading again; you will need to do some testing to find the optimal thresholds). Make it so that any of the threads can potentially be the "reader thread": since a thread has to lock the queue to pull an item off anyway, it can also check whether the "low buffer" mark has been crossed and start reading again itself. While it is doing this, the other threads can work through the rest of the queue.
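A minimal sketch of this bounded-queue idea in C#. It assumes a later .NET runtime where `BlockingCollection<T>` is available; the capacity bound gives you the high-water-mark throttling for free, since `Add` blocks the reader while the queue is full. `entries.txt` and `ProcessLine` are illustrative stand-ins for your real file and per-line work:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class BoundedQueueParser
{
    static int processed;

    // Stand-in for the real per-line work.
    static void ProcessLine(string line) => Interlocked.Increment(ref processed);

    static void Main()
    {
        File.WriteAllLines("entries.txt",
            new[] { "a", "b", "c", "d", "e" });   // sample input

        // boundedCapacity acts as the high-water mark: Add() blocks
        // until the consumers drain the queue below capacity.
        var queue = new BlockingCollection<string>(boundedCapacity: 100);

        var workers = new Task[3];
        for (int i = 0; i < workers.Length; i++)
            workers[i] = Task.Run(() =>
            {
                foreach (var line in queue.GetConsumingEnumerable())
                    ProcessLine(line);
            });

        foreach (var line in File.ReadLines("entries.txt"))
            queue.Add(line);        // blocks while the queue is full
        queue.CompleteAdding();     // tells GetConsumingEnumerable to finish

        Task.WaitAll(workers);
        Console.WriteLine(processed);  // prints 5
    }
}
```

With the bound in place there is no separate "low buffer" check to write: the blocking collection handles both watermarks itself.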

Or, if you prefer, have a single reader thread assign lines to three other processing threads (via their own queues) and implement a work-stealing strategy. I have never done this, so I don't know how hard it is.

+8

Mark's answer is the simpler, more elegant solution. Why build a complicated program with inter-thread communication if you don't need it? Spawn 4 threads. Each thread computes size-of-file/4 to determine its start point (and its stop point). Each thread can then work entirely independently.

The only reason to add a dedicated reader thread is if you expect some lines to take a very long time to process and you expect those lines to be clustered in one part of the file. Adding inter-thread communication when you don't need it is a very bad idea. You greatly increase the chance of introducing an unexpected bottleneck and/or synchronization bugs.

+9

This eliminates the bottleneck of having a single thread do the reading:

open file
for each thread n = 0, 1, 2, 3:
    seek to file offset n * filesize / 4
    scan forward to the next complete line
    process all lines in your part of the file
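A sketch of the pseudocode above in C#, under the convention that a line belongs to the chunk containing its first byte (so each worker skips the partial line at its start and may read slightly past its end offset to finish the last line it owns). It reads bytes directly rather than through `StreamReader`, so the consumed position is always exact; `ChunkedParser`, `ProcessLine`, and `entries.txt` are illustrative names:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

class ChunkedParser
{
    static int processed;

    static void ProcessLine(string line) => Interlocked.Increment(ref processed);

    // Reads one '\n'-terminated line from the stream's current position;
    // returns null at end of file. Byte-oriented so Position stays exact.
    static string ReadLineBytes(Stream s)
    {
        var bytes = new List<byte>();
        int b;
        while ((b = s.ReadByte()) != -1 && b != '\n')
            bytes.Add((byte)b);
        if (b == -1 && bytes.Count == 0) return null;
        return Encoding.UTF8.GetString(bytes.ToArray()).TrimEnd('\r');
    }

    static void ParseChunk(string path, long start, long end)
    {
        using var fs = new BufferedStream(
            new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read));
        // Back up one byte: if the byte before `start` is '\n', the skip
        // below consumes only it, so a line starting exactly at `start`
        // is not lost.
        fs.Seek(Math.Max(0, start - 1), SeekOrigin.Begin);
        if (start > 0)
            ReadLineBytes(fs);   // remainder of a line owned by the previous chunk
        string line;
        while (fs.Position < end && (line = ReadLineBytes(fs)) != null)
            ProcessLine(line);
    }

    static void Main()
    {
        const string path = "entries.txt";
        File.WriteAllLines(path,
            new[] { "one", "two", "three", "four", "five", "six", "seven" });

        long size = new FileInfo(path).Length;
        const int n = 4;
        var tasks = new Task[n];
        for (int i = 0; i < n; i++)
        {
            long start = size * i / n;
            long end = (i == n - 1) ? size : size * (i + 1) / n;
            tasks[i] = Task.Run(() => ParseChunk(path, start, end));
        }
        Task.WaitAll(tasks);
        Console.WriteLine(processed);  // prints 7
    }
}
```

The boundary handling is the only subtle part: every line is processed exactly once no matter where the quarter boundaries fall.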
+3

My experience is with Java, not C #, so I apologize if these solutions do not apply.

The immediate solution I can come up with off the top of my head would be to have an executor that runs 3 threads (using Executors.newFixedThreadPool, say). For each line/record read from the input file, submit a job to the executor (using ExecutorService.submit). The executor will queue up the requests and distribute them among the 3 threads.

There are probably better solutions, but hopefully this works. :-)

ETA: Sounds the same as Wolfbyte's second solution. :-)

ETA2: System.Threading.ThreadPool sounds like a very similar idea in .NET. I have never used it, but it may be worth your time.
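For reference, the .NET equivalent of the submit-per-line pattern might look like the sketch below, which queues each line to System.Threading.ThreadPool and uses a CountdownEvent (available in later .NET versions) to wait for all work items to finish; `ProcessLine` and `entries.txt` are illustrative names:

```csharp
using System;
using System.IO;
using System.Threading;

class ThreadPoolParser
{
    static int processed;

    // Stand-in for the real per-line work.
    static void ProcessLine(string line) => Interlocked.Increment(ref processed);

    static void Main()
    {
        File.WriteAllLines("entries.txt", new[] { "a", "b", "c" });

        using var done = new CountdownEvent(1);   // initial count released below
        foreach (var line in File.ReadLines("entries.txt"))
        {
            done.AddCount();
            ThreadPool.QueueUserWorkItem(state =>
            {
                ProcessLine((string)state);
                done.Signal();
            }, line);
        }
        done.Signal();   // release the initial count: no more items coming
        done.Wait();     // block until every queued line has been processed

        Console.WriteLine(processed);  // prints 3
    }
}
```

Like the Java executor, the thread pool handles the queuing and distribution for you, so there is no hand-written controller to become a bottleneck.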

+1

Since the bottleneck when working with files is usually the processing rather than the reading, I would go with the producer-consumer pattern. To avoid locking, I would look at lock-free lists. Since you are using C#, you can take a look at Julian Bucknall's lock-free list code.
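A sketch of a lock-free producer-consumer in C#, substituting `ConcurrentQueue<T>` (a lock-free queue available in later .NET versions) for the linked list code mentioned above; `ProcessLine` and `entries.txt` are illustrative names. The `doneReading` flag is written after the last enqueue, so a worker only exits once the flag is set and the queue is empty:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class LockFreeParser
{
    static int processed;
    static bool doneReading;

    static void ProcessLine(string line) => Interlocked.Increment(ref processed);

    static void Main()
    {
        File.WriteAllLines("entries.txt", new[] { "a", "b", "c", "d" });

        var queue = new ConcurrentQueue<string>();   // lock-free internally

        var workers = new Task[3];
        for (int i = 0; i < workers.Length; i++)
            workers[i] = Task.Run(() =>
            {
                // Exit only when the producer is finished AND the queue is
                // drained; checking the flag first makes the exit race-free.
                while (!Volatile.Read(ref doneReading) || !queue.IsEmpty)
                {
                    if (queue.TryDequeue(out var line))
                        ProcessLine(line);
                    else
                        Thread.Yield();   // spin politely while empty
                }
            });

        foreach (var line in File.ReadLines("entries.txt"))
            queue.Enqueue(line);
        Volatile.Write(ref doneReading, true);   // after the last Enqueue

        Task.WaitAll(workers);
        Console.WriteLine(processed);  // prints 4
    }
}
```

The trade-off versus a blocking queue is that idle workers spin instead of sleeping, which is fine when the queue is rarely empty.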

+1

@lomaxx

@Derek and Mark: I wish I could accept 2 answers. I'm going to have to go with Wolfbyte's solution, because if I split the file into n sections there is a chance a thread will hit a batch of "slow" transactions; however, if I were processing a file where every entry was guaranteed to take an equal amount of processing, then I really like your solution of just splitting the file into chunks, assigning each chunk to a thread, and being done with it.

No worries. If clustered "slow" transactions are a problem, then the queuing solution is the way to go. Depending on how fast or slow the average transaction is, you might also look at assigning multiple lines at a time to each worker. That cuts down on synchronization overhead. Likewise, you might need to tune your buffer size. Of course, both of these are optimizations you should probably only make after profiling. (No point worrying about synchronization if it isn't a bottleneck.)
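The multiple-lines-at-a-time suggestion can be sketched by queuing fixed-size batches instead of single lines, so each worker synchronizes once per batch; `Batches`, `ProcessLine`, the batch size of 4, and `entries.txt` are all illustrative choices, not anything from the thread:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class BatchingParser
{
    static int processed;
    static void ProcessLine(string line) => Interlocked.Increment(ref processed);

    // Group lines into fixed-size batches so workers touch the shared
    // queue once per batch instead of once per line.
    static IEnumerable<string[]> Batches(IEnumerable<string> lines, int size)
    {
        var batch = new List<string>(size);
        foreach (var line in lines)
        {
            batch.Add(line);
            if (batch.Count == size) { yield return batch.ToArray(); batch.Clear(); }
        }
        if (batch.Count > 0) yield return batch.ToArray();  // final partial batch
    }

    static void Main()
    {
        File.WriteAllLines("entries.txt",
            Enumerable.Range(1, 10).Select(i => "line " + i));

        var queue = new BlockingCollection<string[]>(boundedCapacity: 8);
        var workers = Enumerable.Range(0, 3).Select(_ => Task.Run(() =>
        {
            foreach (var batch in queue.GetConsumingEnumerable())
                foreach (var line in batch)
                    ProcessLine(line);
        })).ToArray();

        foreach (var batch in Batches(File.ReadLines("entries.txt"), 4))
            queue.Add(batch);
        queue.CompleteAdding();

        Task.WaitAll(workers);
        Console.WriteLine(processed);  // prints 10
    }
}
```

As the answer says, profile first: batching only pays off when per-line work is so cheap that queue contention dominates.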

0

If the text you are processing consists of repeated strings and tokens, split the file into chunks, and for each chunk have one thread pre-parse it into tokens consisting of keywords, "punctuation", identifier strings, and values. The string comparisons and lookups can be quite expensive, so handing them off to several worker threads can speed up the purely logical/semantic part of the code, since it no longer has to do the string lookups and comparisons itself.

The pre-parsed chunks of data (in which all the string comparisons have already been done and everything is tokenized) can then be passed to the part of the code that actually looks at the semantics and ordering of the tokenized data.
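One hedged way to sketch this two-stage split in C# is with PLINQ: the string-heavy tokenizing runs in parallel, and a single ordered pass then does the order-sensitive semantic work. The toy "set/get" command format is purely illustrative:

```csharp
using System;
using System.Linq;

class TokenizePipeline
{
    static void Main()
    {
        string[] lines = { "set x 1", "set y 2", "get x" };

        // Stage 1: tokenize in parallel (the expensive string work),
        // preserving original line order for the sequential stage.
        var tokenized = lines
            .AsParallel().AsOrdered()
            .Select(line => line.Split(' '))
            .ToArray();

        // Stage 2: a sequential pass over token arrays does the
        // semantic work without touching raw strings again.
        int commands = tokenized.Count(t => t[0] == "set" || t[0] == "get");
        Console.WriteLine(commands);  // prints 3
    }
}
```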

Also, you mentioned you are concerned about the large size of your file taking up memory. There are a couple of things you could do to cut your memory budget.

Split the file into chunks and parse them. Read in only as many chunks as you are working on at a time, plus a few for "read-ahead", so that you don't stall on disk when you finish processing one chunk before moving on to the next.

Alternatively, large files can be memory-mapped and loaded "on demand". If you have more threads working on processing the file than CPUs (usually threads = 1.5-2x CPUs is a good number for demand-paging applications), the threads that stall on IO for the memory-mapped file will be halted by the OS automatically until their memory is ready, and the other threads will keep processing.
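A minimal sketch of the memory-mapping approach, assuming a .NET version with System.IO.MemoryMappedFiles available; `entries.txt` is an illustrative name. The view is read through an accessor sized to the file's exact length, since mapped views are rounded up to page size and would otherwise show trailing zero bytes:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

class MappedReader
{
    static void Main()
    {
        File.WriteAllText("entries.txt", "alpha\nbeta\ngamma\n");
        long size = new FileInfo("entries.txt").Length;

        // The OS pages the file in on demand; a thread touching an
        // unmapped page simply blocks until the data arrives.
        using var mmf = MemoryMappedFile.CreateFromFile("entries.txt", FileMode.Open);
        using var accessor = mmf.CreateViewAccessor(0, size, MemoryMappedFileAccess.Read);

        var bytes = new byte[size];
        accessor.ReadArray(0, bytes, 0, bytes.Length);

        int lines = Encoding.UTF8.GetString(bytes)
            .Split('\n', StringSplitOptions.RemoveEmptyEntries).Length;
        Console.WriteLine(lines);  // prints 3
    }
}
```

In a real multithreaded parser, each worker would open its own view over its byte range and let the OS page cache do the read-ahead.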

0
