I need help completing my idea for regular expressions.
Introduction
There was a question on SE about a more verbose syntax for regular expressions, but I do not think I would use the free-spacing syntax. It is probably nice for beginners, but with a complex regex you just replace a line of gibberish with a whole page of slightly better gibberish. I like the approach where a regex is composed of smaller parts. That solution is readable, but hand-made: it proposes a smart way to build a complex regex by hand rather than a class that supports doing so.
I am trying to do this with a class that is used something like this (an example first):
    final MyPattern pattern = MyPattern.builder()
        .caseInsensitive()
        .define("numberOfPoints", "\\d+")
        .define("numberOfNights", "\\d+")
        .define("hotelName", ".*")
        .define(' ', "\\s+")
        .build("score `numberOfPoints` for `numberOfNights` nights? at `hotelName`");
    MyMatcher m = pattern.matcher("Score 400 FOR 2 nights at Minas Tirith Airport");
    System.out.println(m.group("numberOfPoints"));
where the syntax for combining regular expressions works as follows:
- define named patterns and use them by enclosing them in backticks
  - `name` creates a named group (mnemonic: the shell captures the result of a command enclosed in backticks)
  - `:name` creates a non-capturing group (mnemonic: similar to (?: ... ))
  - `-name` creates a backreference (mnemonic: the dash connects it to a previous occurrence)
- redefine individual characters and use them everywhere unless quoted
  - only some characters are allowed here (e.g. ~ @#% ")
    - redefining + or ( would be very confusing, so it is not allowed
    - redefining the space to mean any whitespace is very natural in the example above
  - redefining a character can make the pattern more compact, which is good unless overused
    - for example, using something like define('#', "\\\\") to match backslashes can make the pattern more readable
- redefine some quoted sequences like \s or \w
Named patterns serve as a kind of local variable, helping to decompose a complex expression into small, easy-to-understand parts. A properly named pattern often makes a comment unnecessary.
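To make the backtick rules above concrete, here is a minimal sketch of how such references could be expanded into a plain java.util.regex pattern. It is not the actual MyPattern implementation: the class NamedPatternSketch and its expand method are hypothetical names, character redefinition is left out (so whitespace is written as \s+ by hand), and cyclic definitions are not detected.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the MyPattern class from the question.
final class NamedPatternSketch {
    // Matches `name`, `:name` or `-name`; group 1 is the optional prefix, group 2 the name.
    private static final Pattern REFERENCE = Pattern.compile("`([:-]?)(\\w+)`");

    private final Map<String, String> definitions = new LinkedHashMap<>();

    NamedPatternSketch define(String name, String regex) {
        definitions.put(name, regex);
        return this;
    }

    // Replaces every backtick reference in the template by standard regex syntax.
    // Note: java.util.regex group names may contain only ASCII letters and digits,
    // and using the same `name` twice yields a duplicate-group error at compile time.
    String expand(String template) {
        Matcher m = REFERENCE.matcher(template);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(template, last, m.start());
            String prefix = m.group(1);
            String name = m.group(2);
            if (prefix.equals("-")) {
                out.append("\\k<").append(name).append(">"); // backreference
            } else {
                String body = definitions.get(name);
                if (body == null) {
                    throw new IllegalArgumentException("Undefined pattern: " + name);
                }
                // Definitions may themselves reference other definitions.
                String inner = expand(body);
                if (prefix.equals(":")) {
                    out.append("(?:").append(inner).append(")");
                } else {
                    out.append("(?<").append(name).append(">").append(inner).append(")");
                }
            }
            last = m.end();
        }
        out.append(template, last, template.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String regex = new NamedPatternSketch()
            .define("numberOfPoints", "\\d+")
            .define("numberOfNights", "\\d+")
            .define("hotelName", ".*")
            .expand("score\\s+`numberOfPoints`\\s+for\\s+`numberOfNights`\\s+nights?\\s+at\\s+`hotelName`");
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
            .matcher("Score 400 FOR 2 nights at Minas Tirith Airport");
        if (m.matches()) {
            System.out.println(m.group("numberOfPoints")); // prints 400
        }
    }
}
```

Expanding to an ordinary pattern string keeps the existing regex engine untouched; the price is that the debugger shows the expanded text rather than the names.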
Questions
The above should not be difficult to implement (I have already done most of it), and I hope it can be really useful. Do you think so?
However, I am not sure how it should behave inside brackets (character classes). Sometimes it makes sense to apply the definitions there and sometimes it does not. For example, in
    .define(' ', "\\s") // a blank character
    .define('~', "/\\*[^*]+\\*/") // an inline comment (simplified)
    .define("something", "[ ~\\d]")
expanding the space to \s makes sense, but expanding the tilde does not. Maybe there should be a separate syntax to somehow define your own character classes?
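One possible way to handle the bracket problem, sketched here purely as an assumption, is to let each character definition say whether it may also be expanded inside a character class. The class CharRedefinitionSketch and the alsoInsideClass flag are hypothetical and not part of the proposed API. The scanner is simplified: it ignores nested classes and a literal ] placed first in a class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating one option; not the MyPattern class itself.
final class CharRedefinitionSketch {
    private final Map<Character, String> everywhere = new HashMap<>();
    private final Map<Character, String> outsideClassOnly = new HashMap<>();

    // alsoInsideClass is an assumed flag: true means the replacement is
    // itself valid inside [...] (for example \s), false means it is not.
    CharRedefinitionSketch define(char c, String regex, boolean alsoInsideClass) {
        (alsoInsideClass ? everywhere : outsideClassOnly).put(c, regex);
        return this;
    }

    String expand(String template) {
        StringBuilder out = new StringBuilder();
        boolean inClass = false;
        for (int i = 0; i < template.length(); i++) {
            char c = template.charAt(i);
            if (c == '\\' && i + 1 < template.length()) {
                // An escaped character is copied verbatim and never expanded.
                out.append(c).append(template.charAt(++i));
            } else if (c == '[' && !inClass) {
                inClass = true;
                out.append(c);
            } else if (c == ']' && inClass) {
                inClass = false;
                out.append(c);
            } else if (everywhere.containsKey(c)) {
                out.append(everywhere.get(c));
            } else if (!inClass && outsideClassOnly.containsKey(c)) {
                out.append(outsideClassOnly.get(c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String expanded = new CharRedefinitionSketch()
            .define(' ', "\\s", true)             // fine inside brackets
            .define('~', "/\\*[^*]+\\*/", false)  // a sequence, so not inside brackets
            .expand("[ ~\\d]+ ~");
        System.out.println(expanded); // prints [\s~\d]+\s/\*[^*]+\*/
    }
}
```

With this rule the tilde inside the brackets stays a literal tilde, while the space becomes \s in both positions. Whether an explicit flag is better than a separate syntax for user-defined character classes remains an open question.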
Can you come up with some examples where a named pattern is very useful, or not useful at all? I need some borderline cases and some ideas for improvement.
Tchrist's response
Comments on his objections
- Lack of multi-line pattern strings.
  - There are no multi-line strings in Java. I would like to change that, but I cannot.
- No freedom from insanely burdensome and error-prone double backslashing ...
  - Again, this is something I cannot change. I can only offer a workaround, see below.
- No compile-time errors for invalid regex literals and no caching of correctly compiled regexes.
  - As regexes are only part of the standard library and not of the language itself, nothing can be done here.
- There are no tools for debugging or profiling.
  - I can't do anything here.
- Lack of compliance with UTS #18.
  - This is easily solved by redefining the corresponding patterns, as I suggested. It is not ideal, since you will see the blown-up replacements in the debugger.
It sounds like you don't like Java. I would be happy to see some syntax improvements, but there is nothing I can do about it. I am looking for something working with current Java.
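As an illustration of the backslash workaround, a substitute character can simply be mapped to a backslash before the pattern is compiled. This is only a guess at what a method like useForBackslash might do internally. The class name BackslashSubstituteSketch is hypothetical, and the sketch offers no way to match the substitute character literally.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the real useForBackslash may work differently.
final class BackslashSubstituteSketch {
    // Every occurrence of the substitute character becomes a backslash,
    // so the Java string literal needs no doubled backslashes.
    static String useForBackslash(String template, char substitute) {
        return template.replace(substitute, '\\');
    }

    public static void main(String[] args) {
        // "#d+#.#d+" becomes the regex \d+\.\d+
        Pattern p = Pattern.compile(useForBackslash("#d+#.#d+", '#'));
        System.out.println(p.matcher("3.14").matches()); // prints true
    }
}
```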
RFC 5322
Your example can be easily written using my syntax:
    final MyPattern pattern = MyPattern.builder()
        .define(" ", "") // ignore spaces
        .useForBackslash('#') // (1): see (2)
        .define("address", "`mailbox` | `group`")
        .define("WSP", "[\u0020\u0009]")
        .define("DQUOTE", "\"")
        .define("CRLF", "\r\n")
        .define("DIGIT", "[0-9]")
        .define("ALPHA", "[A-Za-z]")
        .define("NO_WS_CTL", "[\u0001-\u0008\u000b\u000c\u000e-\u001f\u007f]") // No whitespace control
        ...
        .define("domain_literal", "`CFWS`? #[ (?: `FWS`? `dcontent`)* `FWS`? #] `CFWS`?") // (2): see (1)
        ...
        .define("group", "`display_name` : (?:`mailbox_list` | `CFWS`)? ; `CFWS`?")
        .define("angle_addr", "`CFWS`? < `addr_spec` `CFWS`?")
        .define("name_addr", "`display_name`? `angle_addr`")
        .define("mailbox", "`name_addr` | `addr_spec`")
        .define("address", "`mailbox` | `group`")
        .build("`address`");
Disadvantages
When rewriting your example, I ran into the following problems:
- Since there are no \xdd escape sequences, \udddd must be used
- Using a different character instead of the backslash is a bit strange
- As I prefer to write bottom-up, I had to reverse the order of your lines
- Without much idea of what it does, I have probably made some mistakes
On the bright side:
- Ignoring spaces is not a problem
- Comments are not a problem
- Readability is good
And most importantly: it is plain Java and uses the existing regex engine as is.