Regex performance degrades

Question

I am writing a C# application that runs a series of regular expressions (~10) over many (~25 million) lines. I tried to search for this, but any search for regex "slow down" is full of tutorials on how backreferences and the like slow down regular expressions. I assume that is not my problem, because my regular expressions start out fast and then slow down.

For the first million lines, running the regular expressions takes about 60 ms per 1000 lines. Towards the end, it has slowed to the point where it takes about 600 ms per 1000 lines. Does anyone know why?

It was worse, but I improved it by using Regex instances instead of the cached version, and by compiling the expressions that I could.

Some of my regular expressions need to vary. For example, depending on the user's name, the pattern might be mike said (\w*) or john said (\w*).

My understanding is that it is not possible to compile these regular expressions and pass in parameters (e.g. saidRegex.Match(inputString, userName)).

Does anyone have any suggestions?

[Edited to accurately reflect speed: per 1000 lines, not per line]

+6

performance c# regex

mike1952 Feb 11 '13 at 17:17

2 answers

Regex takes time to compute. However, you can make the patterns more compact using some tricks. You can also use the string functions in C# to avoid regex altogether.

The code will be longer, but it may improve performance. String has several functions for cutting and extracting characters, as well as for simple pattern matching, e.g. IndexOfAny, LastIndexOf, Contains.

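    // Requires: using System;
    string str = "mon";
    string[] str2 = new string[] { "mon", "tue", "wed" };
    // Array.IndexOf returns -1 when the value is not in the array.
    if (Array.IndexOf(str2, str) >= 0)
    {
        // success code
    }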
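
As a rough illustration of the same idea applied to the "mike said (\w*)" pattern from the question, here is a minimal sketch that takes the user name as a parameter and uses only string functions. The class and method names (SaidParser, TryGetWordAfterSaid) are made up for this example, and the word-character test only approximates what \w matches in .NET.

```csharp
using System;

public static class SaidParser
{
    // Hypothetical helper: looks for "<userName> said " in the line and, if found,
    // returns the run of word characters that follows it (possibly empty).
    public static bool TryGetWordAfterSaid(string line, string userName, out string word)
    {
        word = string.Empty;
        string marker = userName + " said ";

        int start = line.IndexOf(marker, StringComparison.Ordinal);
        if (start < 0)
        {
            return false; // the marker does not occur in this line
        }

        start += marker.Length;
        int end = start;
        // Approximation of \w: letters, digits and underscore only.
        while (end < line.Length && (char.IsLetterOrDigit(line[end]) || line[end] == '_'))
        {
            end++;
        }

        word = line.Substring(start, end - start);
        return true;
    }
}
```

Because the user name is an ordinary argument, nothing has to be rebuilt or recompiled per user. Whether this is actually faster than a Regex instance for your data is something you would need to measure.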

0

Arshad Mar 15 '13 at 11:48

Troy Alford · Accepted Answer · 2013-02-21T23:26:19+0000

This may not be a direct answer to your question about the RegEx performance degradation, which is somewhat intriguing. However, after reading all of the commentary and discussion above, I would suggest the following:

Parse the data once, splitting the matched data out into a database table. It looks like you are trying to capture the following fields:

 Player_Name | Monetary_Value

If you were to create a database table containing these values for each row, and then catch each new row as it is created, parse it and append it to the table, you could easily run any analysis or calculation against the data without having to parse 25M rows over and over (which is a waste).

In addition, on the first run, if you were to split the 25M records into blocks of 100,000 records and run the algorithm 250 times (100,000 x 250 = 25,000,000), you could keep the early performance you describe without the slowdown, because you are chunking up the work.

In other words, consider the following:

Create the database table as follows:

    CREATE TABLE PlayerActions (
        RowID          INT PRIMARY KEY IDENTITY,
        Player_Name    VARCHAR(50) NOT NULL,
        Monetary_Value MONEY NOT NULL
    )

Create an algorithm that breaks your 25M rows into 100k chunks. Here is an example that assumes LINQ / EF5.

    // Requires: using System; using System.Collections.Generic;
    //           using System.Linq; using System.Text.RegularExpressions;
    // "db" is assumed to be your Entity Framework context.

    public void ParseFullDataSet(IEnumerable<String> dataSource)
    {
        const int setSize = 100000;
        var rowCount = dataSource.Count();
        // Integer ceiling division: a partial final set adds exactly one more set.
        var setCount = (rowCount + setSize - 1) / setSize;

        for (int i = 0; i < setCount; i++)
        {
            var set = dataSource.Skip(i * setSize).Take(setSize);
            ParseSet(set);
        }
    }

    public void ParseSet(IEnumerable<String> dataSource)
    {
        // Assume here that the method reflects your RegEx generator.
        String regex = RegexFactory.Generate();

        foreach (String data in dataSource)
        {
            Match match = Regex.Match(data, regex);
            if (match.Success)
            {
                String playerName = match.Groups[1].Value;
                // Might want to add error handling here.
                decimal monetaryValue = Convert.ToDecimal(match.Groups[2].Value);

                db.PlayerActions.Add(new PlayerAction()
                {
                    // ID = ..., // Set at DB layer by the IDENTITY column
                    Player_Name = playerName,
                    Monetary_Value = monetaryValue
                });
            }
        }

        // Save once per set rather than once per row.
        // If not using Entity Framework, use another method to insert
        // the rows into your database table.
        db.SaveChanges();
    }
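
One caveat with Count() followed by Skip(...).Take(...): if dataSource is read lazily (for example, streamed from a file), each of those calls walks the sequence again from the start. A minimal sketch of a single-pass alternative is below. The Batching class and InBatches method are names invented for this illustration and are not part of LINQ.

```csharp
using System;
using System.Collections.Generic;

public static class Batching
{
    // Hypothetical helper: yields consecutive lists of at most batchSize items,
    // enumerating the source exactly once.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(batchSize));
        }

        var batch = new List<T>(batchSize);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize); // start a fresh list for the next batch
            }
        }

        // Emit the final, partially filled batch if there is one.
        if (batch.Count > 0)
        {
            yield return batch;
        }
    }
}
```

With this in place, the loop in ParseFullDataSet could become something like foreach (var set in Batching.InBatches(dataSource, 100000)) ParseSet(set); and no row count is needed up front.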

Run the above once to load all of the pre-existing data.
Create a hook somewhere that allows you to detect the addition of a new row. Each time a new row is created, call:
```
 ParseSet(new List<String>() { newValue }); 
```
or, if multiple rows are created at once, call:
```
 ParseSet(newValues); // Where newValues is an IEnumerable<String> 
```

Now you can do any computational analysis or data mining you want on the data, without having to worry about the performance of processing 25M rows on the fly.
