Can this regex be further optimized?

Question

Can this regex be further optimized?

I wrote this regular expression to parse records from srt files.

(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$

I don't know if this matters, but it is done using the Scala programming language (Java Engine, but literal strings, so I don't need to double the backslash).

s{1,2} used because some files will have line breaks \n and others will have line breaks and carriage returns \n\r The first (?s) turns on DOTALL mode, so the third capture group can also correspond to a break lines.

My program basically splits the srt file using \n\r?\n as a delimiter and uses Scala's pattern matching function to read each record for further processing:

 val EntryRegex = """(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$""".r def apply(string: String): Entry = string match { case EntryRegex(start, end, text) => Entry(0, timeFormat.parse(start), timeFormat.parse(end), text); }

Example entries:

One line:

 1073 01:46:43,024 --> 01:46:45,015 I am your father.

Two lines:

 160 00:20:16,400 --> 00:20:19,312 <i>Help me, Obi-Wan Kenobi. You're my only hope.</i>

The fact is that the profiler shows me that this method of parsing is by far the most time-consuming operation in my application (which does intensive math of time and can even transcode a file several times faster than what is required for reading and analyzing a record )

So any regular expression wizard can help me optimize it? Or maybe I should sacrifice a regular expression / pattern match and try the old java.util.Scanner approach?

Greetings

+4

java optimization scala regex

Anthony accioly Aug 18 '11 at 23:32

source share

4 answers

 (?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$

In Java, $ means the end of an input or the beginning of a line break immediately preceding the end of an input. \z means an unambiguous end to the input, so if it is also semantics in Scala, then \r?$ is redundant, and $ will do the same. If you really want only CR at the end, not CRLF, then \r?\z might be better.

(?s) should also do (.+)\r? redundant, since + is greedy,. must always expand to include \r . If you don't want \r be included in this third capture group, make the match lazy: (.+?) Instead of (.+) .

Can

 (?s)^\d++\s\s?(.{12}) --> (.{12})\s\s?(.+?)\r?\z

Other small high-performance alternatives to regular expressions that will run inside the JVM & | CLRs include JavaCC and ANTLR . For a Scala solution, see http://jim-mcbeath.blogspot.com/2008/09/scala-parser-combinators.html

+4

Mike samuel Aug 18 '11 at 23:47

source share

Check this:

 (?m)^\d++\r?+\n(.{12}) --> (.{12})\r?+\n(.++(?>\r?+\n.++)*+)$

This regex matches the full .srt file entry. You do not need to break the contents first into line breaks; what a huge waste of resources.

The regular expression uses the fact that there is exactly one line separator ( \n or \r\n ) that separates the lines within the record (line separators are used to separate the records from each other). Using \r?+\n instead of \s{1,2} means that you can never accidentally match two line separators ( \n\n ) when you only want to match them.

This way you do not need to rely on mode . in (?s) . @Jacob was right about this: he doesn't really help you, and he kills your work. But (?m) mode is useful for both correctness and performance.

You mentioned java.util.Scanner ; this regex will work very well with findWithinHorizon(0) . But I would be surprised if Scala does not offer a good, idiomatic way to use it.

+2

Alan moore Aug 19 '11 at 0:39

source share

I would not use java.util.Scanner or even strings. Everything you do works fine on a byte stream, if you can assume the UTF-8 encoding of your files (or lack of unicode). You should be able to speed up your work by at least 5 times.

Edit: this is just multi-level processing of bytes and indexes. Something here is based on the things I did before that seem about 2-5 times faster, depending on file size, caching, etc. I do not parse the date, I just return the strings, and I assume the files are small enough to fit in one block of memory (i.e. <2G). This is quite meticulous; if you know, for example, that the format of the date string is always in order, then the parsing can be even faster (just count the characters after the first string of numbers).

 import java.io._ abstract class Entry { def isDefined: Boolean def date1: String def date2: String def text: String } case class ValidEntry(date1: String, date2: String, text: String) extends Entry { def isDefined = true } object NoEntry extends Entry { def isDefined = false def date1 = "" def date2 = "" def text = "" } final class Seeker(f: File) { private val buffer = { val buf = new Array[Byte](f.length.toInt) val fis = new FileInputStream(f) fis.read(buf) fis.close() buf } private var i = 0 private var d1,d2 = 0 private var txt,n = 0 def isDig(b: Byte) = ('0':Byte) <= b && ('9':Byte) >= b def nextNL() { while (i < buffer.length && buffer(i) != '\n') i += 1 i += 1 if (i < buffer.length && buffer(i) == '\r') i += 1 } def digits() = { val zero = i while (i < buffer.length && isDig(buffer(i))) i += 1 if (i==zero || i >= buffer.length || buffer(i) != '\n') { nextNL() false } else { nextNL() true } } def dates(): Boolean = { if (i+30 >= buffer.length) { i = buffer.length false } else { d1 = i while (i < d1+12 && buffer(i) != '\n') i += 1 if (i < d1+12 || buffer(i)!=' ' || buffer(i+1)!='-' || buffer(i+2)!='-' || buffer(i+3)!='>' || buffer(i+4)!=' ') { nextNL() false } else { i += 5 d2 = i while (i < d2+12 && buffer(i) != '\n') i += 1 if (i < d2+12 || buffer(i) != '\n') { nextNL() false } else { nextNL() true } } } } def gatherText() { txt = i while (i < buffer.length && buffer(i) != '\n') { i += 1 nextNL() } n = i-txt nextNL() } def getNext: Entry = { while (i < buffer.length) { if (digits()) { if (dates()) { gatherText() return ValidEntry(new String(buffer,d1,12), new String(buffer,d2,12), new String(buffer,txt,n)) } } } return NoEntry } }

Now that you see this, aren't you glad the regex solution was encoded so fast?

+1

Rex kerr Aug 19 '11 at 15:06

source share

Jacob eggsers · Accepted Answer · 2011-08-19T00:10:22+0000

I am not an optimist, but here are two things to try:

you can do, move (?s) just before you need it.
delete \ r? $ and use greedy text .++ for text .+
^ \ d ++ \ s {1,2} (. {12}) → (. {12}) \ s {1,2} (? s) (. ++) $

To really get good performance, I would reorganize the code and regular expression to use findAllIn . The current code performs regular expression for each Entry in your file. I guess the single findAllIn will work better ... But maybe not ...

Can this regex be further optimized?

More articles: