I am creating a syntax highlighter and I am using String.split to create tokens from the input string. The first problem is that String.split produces a large number of empty strings, which makes everything slower than it needs to be.
For example, "***".split(/(\*)/) → ["", "*", "", "*", "", "*", ""]. Is there any way to avoid this?
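For reference, here is a minimal snippet that reproduces the first problem. It assumes a plain JavaScript engine, and the variable names are only for illustration:

```javascript
// Splitting on a capturing group keeps the delimiters in the result,
// but it also yields an empty string around every adjacent delimiter.
const parts = "***".split(/(\*)/);
console.log(parts);

// Count the empty strings to see how much of the result is noise.
const emptyCount = parts.filter((s) => s === "").length;
console.log(emptyCount + " of " + parts.length + " entries are empty");
```

More than half of the entries are empty here, and that ratio stays roughly the same as the input grows.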
Another problem is the precedence of the alternatives within the regular expression itself. Let's say I'm trying to parse a C-style multi-line comment, that is, /* comment */. Now suppose the input string is "/****/". If I used the following regular expression, it would work, but it would produce many additional tokens (and all those empty strings!).
/(\/\*|\*\/|\*)/
It would be better to read the /*'s and */'s, and then read all the remaining *'s in one token. That is, the best result for the string above is ["/*", "**", "*/"]. However, when I use a regular expression that should do this, I get bad results. The regular expression looks like this: /(\/\*|\*\/|\*+)/.
The result of this expression (ignoring the empty strings), however, is as follows: ["/*", "***", "/"]. I guess this is because the last alternative is greedy, so it steals the * that the closing */ needs.
The only solution I found is to add a negative lookahead, like this:
/(\/\*|\*\/|\*+(?!\/))/
This gives the expected result, but it is very slow compared to the others, and the difference is noticeable for large strings.
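To make the comparison concrete, here is a small sketch that runs the three patterns on the same input. The helper `tokens` and the variable names are hypothetical, and the empty strings are filtered out only so the tokens are easier to compare:

```javascript
// Hypothetical helper: split on the pattern and drop the empty strings
// so that only the real tokens remain.
function tokens(text, pattern) {
  return text.split(pattern).filter((s) => s !== "");
}

const sample = "/****/";

// One * per token: correct, but produces many small tokens.
const oneStarAtATime = tokens(sample, /(\/\*|\*\/|\*)/);

// Greedy \*+ consumes the * that belongs to the closing */.
const greedyStars = tokens(sample, /(\/\*|\*\/|\*+)/);

// The negative lookahead forces \*+ to back off by one *.
const withLookahead = tokens(sample, /(\/\*|\*\/|\*+(?!\/))/);

console.log(oneStarAtATime); // ["/*", "*", "*", "*/"]
console.log(greedyStars);    // ["/*", "***", "/"]
console.log(withLookahead);  // ["/*", "**", "*/"]
```

The lookahead version works because \*+ first matches all three remaining stars, fails the lookahead when it sees the /, and then backtracks by one star. I suspect that backtracking is where the extra time goes.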
Is there a solution to any of these problems?