I wrote this regular expression to parse records from srt files.
(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$
I don't know if it matters, but this is written in Scala (Java regex engine, but with raw string literals, so I don't need to double the backslashes).
\s{1,2} is used because some files have plain line feeds (\n) as line breaks and others have carriage returns plus line feeds (\r\n). The leading (?s) turns on DOTALL mode, so the third capture group can also match line breaks.
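To illustrate the two points above, here is a minimal sketch that runs the same pattern against a record with \n line breaks, one with \r\n, and one whose text spans two lines. The object name SrtRegexDemo and the helper parts are hypothetical names introduced only for this illustration, and the position of the line break inside the two-line text is an assumption.

```scala
object SrtRegexDemo {
  // Same pattern as above; triple quotes keep the backslashes literal.
  val EntryRegex = """(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$""".r

  // Hypothetical helper: returns the three captured groups,
  // or None if the record does not match the pattern.
  def parts(record: String): Option[(String, String, String)] = record match {
    case EntryRegex(start, end, text) => Some((start, end, text))
    case _                            => None
  }

  def main(args: Array[String]): Unit = {
    val unixRecord    = "1073\n01:46:43,024 --> 01:46:45,015\nI am your father."
    val windowsRecord = "1073\r\n01:46:43,024 --> 01:46:45,015\r\nI am your father."
    // Assumed position of the line break inside the subtitle text.
    val twoLineRecord =
      "160\n00:20:16,400 --> 00:20:19,312\n<i>Help me, Obi-Wan Kenobi.\nYou're my only hope.</i>"

    // \s{1,2} absorbs both "\n" and "\r\n"; (?s) lets (.+) run across the inner line break.
    Seq(unixRecord, windowsRecord, twoLineRecord)
      .foreach(record => println(parts(record).isDefined))
  }
}
```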
My program basically splits the SRT file using \n\r?\n as the delimiter and uses Scala's pattern matching to read each record for further processing:
val EntryRegex = """(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$""".r

def apply(string: String): Entry = string match {
  case EntryRegex(start, end, text) =>
    Entry(0, timeFormat.parse(start), timeFormat.parse(end), text)
}
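For context, the whole read path can be sketched roughly as follows. This is only an illustration under assumptions: the Entry case class, the HH:mm:ss,SSS timestamp format, and the names SrtSplitSketch, parseEntry and parseAll are stand-ins invented here, since the real definitions are not shown. Unlike the apply method above, this version returns an Option instead of throwing a MatchError on a malformed record.

```scala
import java.text.SimpleDateFormat
import java.util.Date

object SrtSplitSketch {
  // Hypothetical stand-in for the Entry type used above.
  final case class Entry(id: Int, start: Date, end: Date, text: String)

  // Assumed timestamp format. SimpleDateFormat is not thread-safe,
  // so a fresh instance is created for each call.
  private def timeFormat = new SimpleDateFormat("HH:mm:ss,SSS")

  val EntryRegex = """(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$""".r

  // Parses one record; None when the record does not match the pattern.
  def parseEntry(record: String): Option[Entry] = record match {
    case EntryRegex(start, end, text) =>
      val fmt = timeFormat
      Some(Entry(0, fmt.parse(start), fmt.parse(end), text))
    case _ => None
  }

  // Splits the file content on blank lines (\n\n or \n\r\n) and parses each record.
  def parseAll(content: String): Seq[Entry] =
    content.split("""\n\r?\n""").toSeq.flatMap(record => parseEntry(record))
}
```

One thing this sketch makes visible: with \r\n files the split leaves a trailing \r on each record, and because (.+) is greedy under DOTALL, that \r ends up inside the third group rather than being consumed by the final \r?.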
Example entries:
One line:
1073
01:46:43,024 --> 01:46:45,015
I am your father.
Two lines:
160
00:20:16,400 --> 00:20:19,312
<i>Help me, Obi-Wan Kenobi.
You're my only hope.</i>
The problem is that the profiler shows this parsing method to be by far the most time-consuming operation in my application (which does intensive time arithmetic and can even transcode a file several times faster than it takes to read and parse the records).
So, can any regex wizard help me optimize it? Or should I give up on the regular expression and pattern matching and try the old java.util.Scanner approach?
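For comparison, a regex-free alternative could look roughly like the sketch below. It is not a Scanner-based solution, just a plain string-splitting one, and the names SrtManualParseSketch and parts are hypothetical. It assumes each record has the index on the first line, the timing on the second, and the text on the remaining lines. It does not check that the first line is numeric, and whether it is actually faster would have to be measured.

```scala
object SrtManualParseSketch {
  // Hypothetical helper: splits one record into (start, end, text) without a regex pattern match.
  def parts(record: String): Option[(String, String, String)] = {
    // A limit of 3 keeps any line breaks inside the subtitle text intact.
    val lines = record.split("\n", 3)
    if (lines.length < 3) None
    else {
      val timing = lines(1).trim // drops a trailing \r left by \r\n files
      val arrow  = timing.indexOf(" --> ")
      if (arrow < 0) None
      else {
        val start = timing.substring(0, arrow)
        val end   = timing.substring(arrow + " --> ".length)
        Some((start, end, lines(2).trim))
      }
    }
  }
}
```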
Regards