Fastest way to parse large strings (multithreaded)

I'm about to start a project that will take in blocks of text, parse a lot of data from each one into some object that can then be serialized, stored, and queried for statistics/data. It needs to be as fast as possible, since I have >10,000,000 blocks of text to start with and will receive around 100,000 more per day.

I'll be running this on a system with a 12-core Xeon with Hyper-Threading. I also have access to, and know a little about, CUDA programming, but I don't think it's suitable for string work. From each string I need to parse a lot of data: some of it sits at exact positions I know, while the rest will need a regex or something smarter to find.

So, consider something like this:

    object[] ParseAll(string[] stringsToParse)
    {
        // parallel foreach: Parse(stringsToParse[n])
    }

    object Parse(string s)
    {
        // try to use exact positions / Substring etc. here instead of regexes
    }

So my questions are:

  • How much slower are regular expressions than substring operations?
  • Is .NET significantly slower than other languages?
  • What optimizations (if any) can be done to maximize parallelism?
  • Anything else I have not considered?

Thanks for any help! Sorry if this is long.

+4
4 answers

How much slower are regular expressions than substring operations?
If you are looking for an exact string, a substring search will be faster. However, regular expressions are highly optimized. They (or at least parts of them) are compiled to IL, and you can even save those compiled versions in a separate assembly using Regex.CompileToAssembly . See http://msdn.microsoft.com/en-us/library/9ek5zak6.aspx for more details.
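As a minimal sketch (the pattern and input format here are made up for illustration), a regex you apply to millions of strings can be created once with RegexOptions.Compiled so the IL-compilation cost is paid only once:

```csharp
using System;
using System.Text.RegularExpressions;

class CompiledRegexExample
{
    // Hypothetical pattern: extract an integer id from lines like "id=12345;...".
    // RegexOptions.Compiled emits IL for the pattern once; reusing this static
    // instance across millions of inputs amortizes that cost.
    static readonly Regex IdPattern =
        new Regex(@"id=(\d+);", RegexOptions.Compiled);

    static void Main()
    {
        string line = "id=12345;name=example";
        Match m = IdPattern.Match(line);
        Console.WriteLine(m.Success ? m.Groups[1].Value : "no match"); // prints "12345"
    }
}
```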

What you really need to do is take measurements. Using something like Stopwatch is by far the easiest way to check whether one code construct is faster than another.
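A sketch of such a measurement, timing a compiled regex against exact-position extraction on the same (made-up) input:

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class BenchmarkSketch
{
    static void Main()
    {
        // Hypothetical input; in practice use a representative sample of your data.
        string line = "id=12345;name=example;value=42";
        var regex = new Regex(@"id=(\d+);", RegexOptions.Compiled);
        const int iterations = 100000;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            regex.Match(line);
        sw.Stop();
        Console.WriteLine($"regex:  {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (int i = 0; i < iterations; i++)
        {
            // Equivalent extraction with IndexOf / Substring.
            int start = line.IndexOf("id=", StringComparison.Ordinal) + 3;
            int end = line.IndexOf(';', start);
            string id = line.Substring(start, end - start);
        }
        sw.Stop();
        Console.WriteLine($"substr: {sw.ElapsedMilliseconds} ms");
    }
}
```

The absolute numbers depend on your data; what matters is comparing the two loops on realistic input.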

What optimizations (if any) can be done to maximize parallelism?
Using Task.Factory.StartNew , you can schedule tasks to run on the thread pool. You can also take a look at the TPL (Task Parallel Library, of which Task is a part). It has many constructs that help you parallelize work, such as Parallel.ForEach() for iterating over a collection on multiple threads. See http://msdn.microsoft.com/en-us/library/dd460717.aspx for more details.
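A sketch of the parallel-parse loop from the question using Parallel.ForEach (the Parse body and result type are placeholders):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ParallelParseSketch
{
    // Stand-in for the real parse; returns a dummy value for illustration.
    static int Parse(string s) => s.Length;

    static void Main()
    {
        string[] blocks = { "alpha", "bravo", "charlie" };
        var results = new ConcurrentBag<int>();

        // Parallel.ForEach partitions the array across thread-pool threads;
        // MaxDegreeOfParallelism can be tuned to your core count.
        Parallel.ForEach(
            blocks,
            new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
            block => results.Add(Parse(block)));

        Console.WriteLine(results.Count); // prints "3"
    }
}
```

Note the thread-safe ConcurrentBag for collecting results; an ordinary List<T> is not safe to Add to from multiple threads.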

Anything else I have not considered?
One thing that can trip you up with this much data is memory management. A few things to consider:

  • Limit memory allocations: try to reuse the same buffers within a single document instead of copying when you only need part of a string. Say you need to work with the character range 1000 to 2000: don't copy that range into a new buffer; write your code to work within that range of the original. This makes your code more complicated, but it saves memory allocations;

  • StringBuilder is an important class. If you do not already know about it, take a look.
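A sketch of both points, with illustrative (made-up) offsets: parsing within a range of the original string without copying it, and using StringBuilder to build output without repeated string allocations:

```csharp
using System;
using System.Text;

class RangeParseSketch
{
    // Parse within [start, start + length) of the original string without
    // allocating a copy of that range. The positions are illustrative.
    static int CountCommas(string s, int start, int length)
    {
        int count = 0;
        for (int i = start; i < start + length; i++)
            if (s[i] == ',') count++;
        return count;
    }

    static void Main()
    {
        string doc = "header,junk|a,b,c,d|trailer";
        // Work only on the "a,b,c,d" region (offset 12, length 7), no Substring.
        Console.WriteLine(CountCommas(doc, 12, 7)); // prints "3"

        // StringBuilder avoids one new string per concatenation.
        var sb = new StringBuilder();
        for (int i = 0; i < 3; i++) sb.Append(i).Append(';');
        Console.WriteLine(sb.ToString()); // prints "0;1;2;"
    }
}
```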

+4
source

I don't know what kind of processing you are doing here, but if you're really talking hundreds of thousands of strings a day, that seems like a pretty small number. Suppose you get 1 million new strings to process every day, and you can fully use 10 of those 12 Xeon cores. That's 100,000 strings per core per day. There are 86,400 seconds in a day, so that's 0.864 seconds per string: plenty of time for parsing.
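The back-of-the-envelope arithmetic above, spelled out:

```csharp
using System;

class ThroughputCheck
{
    static void Main()
    {
        double rowsPerDay = 1_000_000;  // assumed daily volume
        int cores = 10;                 // usable cores out of 12
        double secondsPerDay = 86_400;

        double rowsPerCorePerDay = rowsPerDay / cores;           // 100,000
        double secondsPerRow = secondsPerDay / rowsPerCorePerDay; // 0.864
        Console.WriteLine(secondsPerRow);
    }
}
```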

I'll echo the recommendations made by @Pieter, especially his suggestion to take measurements and see how long the processing actually takes. It's best to get something up and working, then figure out how to make it faster if you need to. I think you'll be surprised at how often you don't need to optimize. (I know that's heresy to optimization wizards, but CPU time is cheap and programmer time is expensive.)

How much slower are regular expressions than substring operations?

That depends entirely on how complex your regular expressions are. As @Pieter said, if you are looking for a single string, String.Contains will probably be faster. You can also use String.IndexOfAny if you are looking for any of several constants. You don't need regular expressions unless you are looking for patterns that cannot be represented as constant strings.
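A quick sketch of both (made-up input; note that the built-in String.IndexOfAny overloads take characters — searching for any of several *strings* needs an extension or your own loop, as in the NLib answer below):

```csharp
using System;

class SearchSketch
{
    static void Main()
    {
        string line = "name=example;value=42";

        // Contains: a single constant substring, no regex needed.
        Console.WriteLine(line.Contains("value=")); // prints "True"

        // IndexOfAny: first occurrence of any of several delimiter characters.
        int i = line.IndexOfAny(new[] { ';', '|', '\t' });
        Console.WriteLine(i); // prints "12"
    }
}
```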

Is .NET significantly slower than other languages?

For CPU-intensive applications, .NET can be slower than native code. Sometimes. If it is, the difference is usually in the range of 5 to 20 percent, most often 7 to 12 percent. And that's just the code executing in isolation: you also have to consider other factors, such as how long it would take you to build the program in that other language, and how hard it is to share data between a native application and the rest of your system.

+1

Google recently announced an internal text-processing language (it looks like a subset of Python/Perl, designed for highly parallel processing).

http://code.google.com/p/szl/ - Sawzall

0

If you want to parse strings quickly in C#, you might want to look at the new NLib project. It contains string extensions to help with fast string searching, such as IndexOfAny(string[]) and IndexOfNotAny, including overloads that take a StringComparison argument.

0
