Improved regular expression syntax

Question

I need help completing my idea for an improved regular expression syntax.

Introduction

There was a question on SE about a more verbose syntax for regular expressions, but I do not think that free-spacing syntax is the way to go. It may be nice for beginners, but for complex regular expressions you replace a line of gibberish with a whole page of slightly better gibberish. I like the approach where the regex is composed of smaller parts. That solution is readable, but hand-made; it offers a smart way to build a complex regex by hand rather than a class that supports doing so.

I am trying to put this into a class that is used like this (an example first):

    final MyPattern pattern = MyPattern.builder()
        .caseInsensitive()
        .define("numberOfPoints", "\\d+")
        .define("numberOfNights", "\\d+")
        .define("hotelName", ".*")
        .define(' ', "\\s+")
        .build("score `numberOfPoints` for `numberOfNights` nights? at `hotelName`");
    MyMatcher m = pattern.matcher("Score 400 FOR 2 nights at Minas Tirith Airport");
    System.out.println(m.group("numberOfPoints")); // prints 400

where a fluent syntax is used to combine regular expressions, extended as follows:

define named patterns and use them by enclosing them in backticks
- `name` creates a named group
  - Mnemonic: the shell captures the output of a command enclosed in backticks
- `:name` creates a non-capturing group
  - Mnemonic: similar to (?: ... )
- `-name` creates a backreference
  - Mnemonic: the dash connects it to the previous occurrence
redefine individual characters and use the new meaning everywhere unless quoted
- only some characters are allowed here (e.g. ~ @#% ")
  - redefining + or ( would be very confusing, so it is not allowed
  - redefining the space to mean any whitespace is very natural in the example above
  - redefining a character can make the pattern more compact, which is good unless overused
  - for example, using something like define('#', "\\\\") to match backslashes can make the pattern more readable
redefine some escape sequences like \s or \w
- the standard definitions are not Unicode-aware
- sometimes you may have your own idea of what a word or a space is

Named patterns serve as a kind of local variable that helps decompose a complex expression into small, easy-to-understand parts. Proper naming often makes comments unnecessary.
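A minimal sketch of how the backtick expansion described above might work, assuming each reference is replaced textually by a Java named group, a non-capturing group, or a `\k<name>` backreference. The class `NamedPatternSketch` and its methods are hypothetical and are not the `MyPattern` builder from the question; nested definitions, quoting, and character classes are deliberately ignored here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the MyPattern class from the question.
final class NamedPatternSketch {
    private final Map<String, String> named = new HashMap<>();
    private final Map<Character, String> chars = new HashMap<>();

    NamedPatternSketch define(String name, String regex) {
        named.put(name, regex);
        return this;
    }

    NamedPatternSketch define(char c, String regex) {
        chars.put(c, regex);
        return this;
    }

    // Expands backtick references and redefined characters, then compiles.
    Pattern build(String template, int flags) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < template.length()) {
            char c = template.charAt(i);
            if (c == '`') {
                int end = template.indexOf('`', i + 1);
                if (end < 0) {
                    throw new IllegalArgumentException("unclosed backtick at " + i);
                }
                out.append(expand(template.substring(i + 1, end)));
                i = end + 1;
            } else {
                String replacement = chars.get(c);
                out.append(replacement != null ? replacement : String.valueOf(c));
                i++;
            }
        }
        return Pattern.compile(out.toString(), flags);
    }

    // `name` -> (?<name>...), `:name` -> (?:...), `-name` -> \k<name>
    private String expand(String ref) {
        if (ref.startsWith("-")) {
            return "\\k<" + ref.substring(1) + ">";
        }
        boolean capture = !ref.startsWith(":");
        String name = capture ? ref : ref.substring(1);
        String body = named.get(name);
        if (body == null) {
            throw new IllegalArgumentException("undefined pattern: " + name);
        }
        return capture ? "(?<" + name + ">" + body + ")" : "(?:" + body + ")";
    }

    public static void main(String[] args) {
        Pattern p = new NamedPatternSketch()
            .define("numberOfPoints", "\\d+")
            .define("numberOfNights", "\\d+")
            .define("hotelName", ".*")
            .define(' ', "\\s+")
            .build("score `numberOfPoints` for `numberOfNights` nights? at `hotelName`",
                   Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher("Score 400 FOR 2 nights at Minas Tirith Airport");
        if (m.matches()) {
            System.out.println(m.group("numberOfPoints")); // prints 400
        }
    }
}
```

Because the expansion is purely textual, the result is an ordinary `java.util.regex.Pattern`, so the existing engine is used as is.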

Questions

The above should not be difficult to implement (I have already done most of it), and I hope it can be really useful. Do you think so?

However, I'm not sure how it should behave inside brackets. Sometimes it makes sense to use the definitions there and sometimes not, for example in

    .define(' ', "\\s") // a blank character
    .define('~', "/\\*[^*]+\\*/") // an inline comment (simplified)
    .define("something", "[ ~\\d]")

expanding the space to \s makes sense, but expanding the tilde does not. Maybe there should be a separate syntax to somehow define your own character classes?
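One possible policy, sketched under assumptions: outside brackets every redefined character is expanded, while inside `[...]` only characters explicitly marked as safe for character classes are expanded. The class `BracketAwareExpansion`, the `classSafe` parameter, and the policy itself are hypothetical and only illustrate the question; nested classes and a literal `]` placed first in a class are not handled. The sketch uses `Map.of` and `Set.of`, so it assumes Java 9 or later.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of one way to treat redefined characters in brackets.
final class BracketAwareExpansion {

    // Expands redefined characters; inside [...] only "class-safe" ones are expanded.
    static String expand(String template, Map<Character, String> defs, Set<Character> classSafe) {
        StringBuilder out = new StringBuilder();
        boolean inClass = false;
        for (int i = 0; i < template.length(); i++) {
            char c = template.charAt(i);
            if (c == '\\' && i + 1 < template.length()) {
                // Copy escape sequences such as \d or \] untouched.
                out.append(c).append(template.charAt(++i));
                continue;
            }
            if (c == '[') {
                inClass = true;
            } else if (c == ']') {
                inClass = false;
            }
            String replacement = defs.get(c);
            boolean allowed = replacement != null && (!inClass || classSafe.contains(c));
            out.append(allowed ? replacement : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> defs = Map.of(' ', "\\s", '~', "/\\*[^*]+\\*/");
        // The space is expanded inside the class, the tilde is left alone.
        System.out.println(expand("[ ~\\d]", defs, Set.of(' '))); // prints [\s~\d]
    }
}
```

Marking definitions as class-safe by hand avoids having to guess whether a replacement such as `\s` still makes sense inside brackets.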

Can you come up with some examples where a named pattern is very useful or not useful at all? I need some borderline cases and some ideas for improvement.

Response to tchrist's answer

Comments on his objections

Lack of multi-line pattern strings.
- There are no multi-line strings in Java, which I would like to change but cannot.
Freedom from insanely onerous and error-prone double backslashing ...
- This again is something I cannot change. I can only offer a workaround, see below.
No compile-time exceptions for invalid regular expression literals and no caching of properly compiled regular expressions.
- As regular expressions are only part of the standard library and not of the language itself, there is nothing that can be done here.
There are no tools for debugging or profiling.
- I can't do anything here.
Lack of compliance with UTS #18.
- This is easily solved by redefining the corresponding patterns, as I suggested. It is not ideal, because you will see the blown-up replacements in the debugger.

It sounds like you don't like Java. I would be happy to see some syntax improvements, but there is nothing I can do about it. I am looking for something that works with current Java.
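A minimal sketch of the override idea, assuming `\w` and `\s` are rewritten textually into Unicode property classes before compilation. The class `UnicodeOverrideSketch` and the replacement classes are illustrative assumptions and only approximate what UTS #18 asks for. It assumes Java 9 or later for `Map.of`.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration: textual override of \w and \s before compiling.
final class UnicodeOverrideSketch {
    // Illustrative Unicode-aware replacements for the ASCII-only defaults.
    static final Map<Character, String> OVERRIDES = Map.of(
        'w', "[\\p{L}\\p{M}\\p{Nd}\\p{Pc}]",
        's', "[\\p{Z}\\t\\n\\x0B\\f\\r]");

    // Rewrites overridden escapes; every other escape pair is copied unchanged.
    static String override(String regex) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                char next = regex.charAt(++i);
                String replacement = OVERRIDES.get(next);
                out.append(replacement != null ? replacement : "\\" + next);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String word = "na\u00efve"; // contains a non-ASCII letter
        System.out.println(Pattern.matches("\\w+", word));           // prints false
        System.out.println(Pattern.matches(override("\\w+"), word)); // prints true
    }
}
```

The expanded pattern is what shows up in a debugger, which is the drawback mentioned above. On Java 7 and later, the `Pattern.UNICODE_CHARACTER_CLASS` flag is a built-in alternative for the predefined classes.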

RFC 5322

Your example can be easily written using my syntax:

    final MyPattern pattern = MyPattern.builder()
        .define(" ", "") // ignore spaces
        .useForBackslash('#') // (1): see (2)
        .define("address", "`mailbox` | `group`")
        .define("WSP", "[\u0020\u0009]")
        .define("DQUOTE", "\"")
        .define("CRLF", "\r\n")
        .define("DIGIT", "[0-9]")
        .define("ALPHA", "[A-Za-z]")
        .define("NO_WS_CTL", "[\u0001-\u0008\u000b\u000c\u000e-\u001f\u007f]") // No whitespace control
        ...
        .define("domain_literal", "`CFWS`? #[ (?: `FWS`? `dcontent`)* `FWS`? #] `CFWS`?") // (2): see (1)
        ...
        .define("group", "`display_name` : (?:`mailbox_list` | `CFWS`)? ; `CFWS`?")
        .define("angle_addr", "`CFWS`? < `addr_spec` > `CFWS`?")
        .define("name_addr", "`display_name`? `angle_addr`")
        .define("mailbox", "`name_addr` | `addr_spec`")
        .define("address", "`mailbox` | `group`")
        .build("`address`");
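A minimal sketch of what a call like `useForBackslash('#')` might do, assuming it simply substitutes a backslash for every occurrence of the chosen character before the regex is compiled. The helper `withBackslashAlias` is a hypothetical name, not part of any real library.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of a backslash alias to avoid double backslashes.
final class BackslashAliasSketch {

    // Replaces every occurrence of the alias character with a backslash.
    static String withBackslashAlias(String template, char alias) {
        return template.replace(alias, '\\');
    }

    public static void main(String[] args) {
        // Written without a single doubled backslash in the Java source.
        String regex = withBackslashAlias("#[ #d+ (?: #. #d+ )? #]", '#');
        Pattern p = Pattern.compile(regex, Pattern.COMMENTS);
        System.out.println(p.matcher("[3.14]").matches()); // prints true
    }
}
```

The price of this naive substitution is that the alias character can no longer appear literally in the template.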

Disadvantages

When rewriting your example, I ran into the following problems:

Since there are no \xdd escape sequences, \udddd must be used
Using a different character instead of the backslash is a bit strange
As I prefer to write bottom-up, I had to reverse the order of your lines
Not having much idea of what it does, I have probably made some mistakes of my own
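The ordering issue need not exist if references are resolved only when the final pattern is built. The sketch below assumes definitions are expanded textually as non-capturing groups; the class `NestedDefinitionsSketch` is hypothetical. It also rejects cyclic definitions, because `java.util.regex` has no recursion or subroutine calls, so a self-referencing rule (such as the RFC 5322 comment rule, which allows nested comments) cannot be expanded this way.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: order-independent, textual resolution of definitions.
final class NestedDefinitionsSketch {
    private static final Pattern REF = Pattern.compile("`(\\w+)`");
    private final Map<String, String> defs = new HashMap<>();

    NestedDefinitionsSketch define(String name, String regex) {
        defs.put(name, regex);
        return this;
    }

    // Resolves all references in the template into one plain regex string.
    String resolve(String template) {
        return resolve(template, new ArrayDeque<>());
    }

    private String resolve(String template, Deque<String> inProgress) {
        Matcher m = REF.matcher(template);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String name = m.group(1);
            String body = defs.get(name);
            if (body == null) {
                throw new IllegalArgumentException("undefined pattern: " + name);
            }
            if (inProgress.contains(name)) {
                throw new IllegalArgumentException("recursive definition: " + name);
            }
            out.append(template, last, m.start());
            inProgress.push(name);
            out.append("(?:").append(resolve(body, inProgress)).append(')');
            inProgress.pop();
            last = m.end();
        }
        return out.append(template.substring(last)).toString();
    }

    public static void main(String[] args) {
        String regex = new NestedDefinitionsSketch()
            .define("address", "`mailbox`|`group`") // top-down order works too
            .define("mailbox", "m")
            .define("group", "g")
            .resolve("`address`");
        System.out.println(regex); // prints (?:(?:m)|(?:g))
    }
}
```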

On the bright side:
- Ignoring spaces is not a problem
- Comments are not a problem
- Readability is good

And most importantly: it is plain Java and uses the existing regex engine as is.

+7

java regex fluent

maaartinus Feb 06 '11 at 17:04

2 answers

I think that perhaps what is really wanted is not a regular expression but rather something like a parser-combinator library (one that can work on characters and/or embed regular expressions in its constructs).

That is, stepping beyond the realm of regular expressions (as irregular as their implementations may be; tchrist definitely likes the Perl implementation ;-) and into context-free grammars, or at least those that can be represented in LL(n), preferably with minimal backtracking.

Scala: The Magic Behind Parser-Combinators. Note how similar it looks to BNF. Has a nice introduction.

Haskell: Parsec. Same thing.

Some examples in Java: JParsec and JPC.

Java as a language, however, is not as conducive to such seamless DSL extensions as some competitors ;-)

+1

user166390 Feb 07 '11 at 2:30

tchrist · Accepted Answer · 2011-02-06T17:38:04+0000

Named Capture Examples

Can you come up with some examples where a named pattern is very useful or not useful at all?

In response to your question, here is an example where named patterns are especially useful. It's a Perl or PCRE pattern for parsing an RFC 5322 mail address. Firstly, it is in /x mode by virtue of (?x). Secondly, it separates the definitions from the invocation; the named group address is the thing that does the full recursive-descent parse. Its definition follows it in the non-executing (?(DEFINE)…) block.

    (?x) # allow whitespace and comments

    (?&address) # this is the capture we call as a "regex subroutine"

    # the rest is all definitions, in a nicely BNF-style
    (?(DEFINE)

      (?<address> (?&mailbox) | (?&group))
      (?<mailbox> (?&name_addr) | (?&addr_spec))
      (?<name_addr> (?&display_name)? (?&angle_addr))
      (?<angle_addr> (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
      (?<group> (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
      (?<display_name> (?&phrase))
      (?<mailbox_list> (?&mailbox) (?: , (?&mailbox))*)

      (?<addr_spec> (?&local_part) \@ (?&domain))
      (?<local_part> (?&dot_atom) | (?&quoted_string))
      (?<domain> (?&dot_atom) | (?&domain_literal))
      (?<domain_literal> (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)? \] (?&CFWS)?)
      (?<dcontent> (?&dtext) | (?&quoted_pair))
      (?<dtext> (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

      (?<atext> (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])
      (?<atom> (?&CFWS)? (?&atext)+ (?&CFWS)?)
      (?<dot_atom> (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
      (?<dot_atom_text> (?&atext)+ (?: \. (?&atext)+)*)

      (?<text> [\x01-\x09\x0b\x0c\x0e-\x7f])
      (?<quoted_pair> \\ (?&text))

      (?<qtext> (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
      (?<qcontent> (?&qtext) | (?&quoted_pair))
      (?<quoted_string> (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))* (?&FWS)? (?&DQUOTE) (?&CFWS)?)

      (?<word> (?&atom) | (?&quoted_string))
      (?<phrase> (?&word)+)

      # Folding white space
      (?<FWS> (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
      (?<ctext> (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
      (?<ccontent> (?&ctext) | (?&quoted_pair) | (?&comment))
      (?<comment> \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
      (?<CFWS> (?: (?&FWS)? (?&comment))* (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

      # No whitespace control
      (?<NO_WS_CTL> [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

      (?<ALPHA> [A-Za-z])
      (?<DIGIT> [0-9])
      (?<CRLF> \x0d \x0a)
      (?<DQUOTE> ")
      (?<WSP> [\x20\x09])
    )

I strongly recommend not reinventing a perfectly good wheel. Start by becoming PCRE-compatible. If you want to go beyond basic Perl 5 patterns such as the RFC 5322 parser above, there are always Perl 6 patterns to draw upon.

Indeed, it really pays to research existing practice and literature before embarking on an open-ended R&D mission. These problems have long been solved, sometimes quite elegantly.

Java Regex Syntax Enhancement

If you really want better regex syntax ideas for Java, you should first address these flaws in Java's regexes:

Lack of multi-line pattern strings, as shown above.
Lack of freedom from insanely onerous and error-prone double backslashing, as also demonstrated above.
No compile-time exceptions for invalid regular expression literals, and no compile-time caching of correctly compiled regular expression literals.
It is not possible to change something like "foo".matches(pattern) to use a better pattern library, partly but not solely because of final classes that cannot be overridden.
There are no tools for debugging or profiling.
Lack of compliance with UTS #18 Basic Regular Expression Support, the most basic steps required for Java regular expressions to be useful for Unicode. They currently are not. They do not even support Unicode 3.1 properties from a decade ago, which means that you cannot use Java patterns for Unicode in any reasonable way; the basic building blocks are missing.
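To illustrate the third item with plain `java.util.regex`: an invalid pattern is accepted by the compiler as an ordinary string and only fails with a `PatternSyntaxException` when `Pattern.compile` runs, and caching a compiled pattern is left to the programmer, typically through a static field. The class and method names below are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Illustration of runtime-only regex validation and manual caching.
final class RuntimeRegexChecks {
    // Compiled once at class initialization; String.matches would recompile per call.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    static boolean isNumber(String s) {
        return DIGITS.matcher(s).matches();
    }

    // The only way to learn that a regex is invalid is to compile it at runtime.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("(unclosed")); // prints false
        System.out.println(isNumber("400"));           // prints true
    }
}
```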

Of these, the first 3 have been addressed in several JVM languages, including both Groovy and Scala; even Clojure goes part of the way there.

The second set of 3 steps will be tougher, but they are absolutely necessary. The last one, the lack of even the most basic Unicode support in regular expressions, simply kills Java for Unicode work. This late in the game, that is unforgivable. I can provide many examples if necessary, but you should trust me, because I really do know what I'm talking about here.

Only after you have accomplished all of this should you worry about fixing Java's regular expressions so that they can catch up with the current state of the art in pattern matching. Until you take care of these past oversights, you cannot begin to look at the present, let alone the future.
