Golang regular expression to exclude quoted strings

I am trying to implement a removeComments function in Golang, based on this JavaScript implementation. I want to remove all comments from the text. For instance:

 /* this is comments, and should be removed */ However, "/* this is quoted, so it should not be removed*/" 

In the JavaScript implementation, the quoted matches are not captured in groups, so I can easily filter them out. In Golang, however, it does not seem easy to tell whether the matched part was captured in a group or not. So how can I implement the same removeComments logic in Golang as in the JavaScript version?

6 answers

These do not preserve formatting


Preferred way (group 1 is NULL if it did not match)
Works on the Golang playground:

    # https://play.golang.org/p/yKtPk5QCQV
    # fmt.Println(reg.ReplaceAllString(txt, "$1"))
    # (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)

    (?:                          # Comments
         /\*                     # Start /* .. */ comment
         [^*]* \*+
         (?: [^/*] [^*]* \*+ )*
         /                       # End /* .. */ comment
      |
         // [^\n]*               # Start // comment
         (?: \n | $ )            # End // comment
    )
 |
    (                            # (1 start), Non - comments
         "
         [^"\\]*                 # Double quoted text
         (?: \\ [\S\s] [^"\\]* )*
         "
      |
         '
         [^'\\]*                 # Single quoted text
         (?: \\ [\S\s] [^'\\]* )*
         '
      |
         [\S\s]                  # Any other char
         [^/"'\\]*               # Chars which don't start a comment, string, escape,
                                 # or line continuation (escape + newline)
    )                            # (1 end)
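As a minimal sketch of how the pattern above might be used from Go, the following wraps it in two helpers. The names `removeComments` and `countComments` are hypothetical, chosen for illustration. The second helper shows one way to answer the original question of whether group 1 took part in a match: `FindAllStringSubmatchIndex` reports a negative index for a group that did not participate.

```go
package main

import (
	"fmt"
	"regexp"
)

// Comments are matched without a capture group; everything else lands in group 1.
var re = regexp.MustCompile(`(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)`)

// removeComments (hypothetical helper) keeps only what group 1 captured.
// For a comment match, group 1 is unset and "$1" expands to the empty string.
func removeComments(src string) string {
	return re.ReplaceAllString(src, "$1")
}

// countComments (hypothetical helper) counts the matches in which group 1
// did not participate; its start index is then reported as -1.
func countComments(src string) int {
	n := 0
	for _, loc := range re.FindAllStringSubmatchIndex(src, -1) {
		if loc[2] < 0 {
			n++
		}
	}
	return n
}

func main() {
	src := `/* this is comments, and should be removed */ However, "/* this is quoted, so it should not be removed*/"`
	fmt.Printf("%q\n", removeComments(src))
	fmt.Println(countComments(src))
}
```

Note that whitespace around a removed comment is left in place, which is what "do not preserve formatting" refers to.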

Alternative method (group 1 always matches, but may be empty)
Works on the Golang playground:

    # https://play.golang.org/p/7FDGZSmMtP
    # fmt.Println(reg.ReplaceAllString(txt, "$1"))
    # (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))?((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)?)

    (?:                          # Comments
         /\*                     # Start /* .. */ comment
         [^*]* \*+
         (?: [^/*] [^*]* \*+ )*
         /                       # End /* .. */ comment
      |
         // [^\n]*               # Start // comment
         (?: \n | $ )            # End // comment
    )?
    (                            # (1 start), Non - comments
         (?:
              "
              [^"\\]*            # Double quoted text
              (?: \\ [\S\s] [^"\\]* )*
              "
           |
              '
              [^'\\]*            # Single quoted text
              (?: \\ [\S\s] [^'\\]* )*
              '
           |
              [\S\s]             # Any other char
              [^/"'\\]*          # Chars which don't start a comment, string, escape,
                                 # or line continuation (escape + newline)
         )?
    )                            # (1 end)
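To compare the two variants side by side, here is a small sketch that builds both patterns from shared pieces and applies the same `$1` replacement. The constant and function names (`comment`, `nonComment`, `strip`) are assumptions made for this illustration; the pattern text itself is taken from the two listings above.

```go
package main

import (
	"fmt"
	"regexp"
)

const (
	// The comment alternative shared by both variants.
	comment = `(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))`
	// Quoted strings or any other run of characters.
	nonComment = `(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)`
)

var (
	// Group 1 is unset whenever a comment is matched.
	preferred = regexp.MustCompile(comment + `|(` + nonComment + `)`)
	// Group 1 always participates, but may be empty.
	alternative = regexp.MustCompile(comment + `?(` + nonComment + `?)`)
)

// strip (hypothetical helper) replaces every match with the contents of group 1.
func strip(re *regexp.Regexp, src string) string {
	return re.ReplaceAllString(src, "$1")
}

func main() {
	src := `a /* c */ "x /* y */" b // tail`
	fmt.Printf("%q\n", strip(preferred, src))
	fmt.Printf("%q\n", strip(alternative, src))
}
```

On this input both calls should produce the same text, with the quoted `/* y */` left intact.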

The Cadillac - Preserves formatting

(Unfortunately, this cannot be done in Golang, because Golang's regex engine does not support assertions such as lookaheads.)
It is included here in case you switch to another regular expression engine.

    # raw: ((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)
    # delimited: /((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n|[\S\s])[^\/"'\\\s]*)/

    (                                  # (1 start), Comments
         (?:
              (?: ^ [ \t]* )?          # <- To preserve formatting
              (?:
                   /\*                 # Start /* .. */ comment
                   [^*]* \*+
                   (?: [^/*] [^*]* \*+ )*
                   /                   # End /* .. */ comment
                   (?:                 # <- To preserve formatting
                        [ \t]* \r? \n
                        (?=
                             [ \t]*
                             (?: \r? \n | /\* | // )
                        )
                   )?
                |
                   //                  # Start // comment
                   (?:                 # Possible line-continuation
                        [^\\]
                     |  \\
                        (?: \r? \n )?
                   )*?
                   (?:                 # End // comment
                        \r? \n
                        (?=            # <- To preserve formatting
                             [ \t]*
                             (?: \r? \n | /\* | // )
                        )
                     |  (?= \r? \n )
                   )
              )
         )+                            # Grab multiple comment blocks if need be
    )                                  # (1 end)

 |                                     ## OR

    (                                  # (2 start), Non - comments
         "
         [^"\\]*                       # Double quoted text
         (?: \\ [\S\s] [^"\\]* )*
         "
      |
         '
         [^'\\]*                       # Single quoted text
         (?: \\ [\S\s] [^'\\]* )*
         '
      |
         (?: \r? \n | [\S\s] )         # Linebreak or Any other char
         [^/"'\\\s]*                   # Chars which don't start a comment, string, escape,
                                       # or line continuation (escape + newline)
    )                                  # (2 end)

BACKGROUND

The correct way to accomplish the task is to match and capture quoted strings (bearing in mind that they may contain escaped characters), and then match the multi-line comments.

REGEX IN-CODE DEMO

Here is the code for this:

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*")|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)
        txt := `random text /* removable comment */ "but /* never remove this */ one" more random *text*`
        fmt.Println(reg.ReplaceAllString(txt, "$1"))
    }

See the Playground demo

EXPLANATION

The regex I suggest is based on the Best Regex Trick Ever concept and consists of 2 alternatives:

  • ("[^"\\]*(?:\\.[^"\\]*)*") - the double-quoted string literal regex - group 1 (a capturing group formed by the outer pair of unescaped parentheses, later accessible through replacement backreferences), matching double-quoted string literals that may contain escape sequences. This part matches:
    • " - the leading double quote
    • [^"\\]* - 0+ characters other than " and \ (the [^...] construct is a negated character class that matches any character except those listed inside it; * is the zero-or-more quantifier)
    • (?:\\.[^"\\]*)*" - 0+ repetitions (see the last * and the non-capturing group, used only to group subpatterns without creating a capture) of an escape sequence ( \\. matches a literal \ followed by any character) followed by 0+ characters other than " and \ , and then the closing double quote
  • | - or
  • /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ - the multi-line comment regex, which matches without forming a capturing group (so it is not accessible from the replacement pattern via backreferences) and matches:
    • / - a literal slash
    • \* - a literal asterisk
    • [^*]* - zero or more characters other than an asterisk
    • \*+ - 1 or more asterisks ( + is the one-or-more quantifier)
    • (?:[^/*][^*]*\*+)* - 0+ sequences (non-capturing, since we will not use them later) of any character other than / or * (see [^/*] ), followed by 0+ characters other than an asterisk (see [^*]* ), and then followed by 1+ asterisks (see \*+ )
    • / - a literal (trailing, closing) slash.

NOTE: This multi-line comment regex is the fastest I have ever tested. The same applies to the double-quoted string literal regex, since "[^"\\]*(?:\\.[^"\\]*)*" is written with the unroll-the-loop technique: no alternation, only character classes with the * and + quantifiers used in a specific order, which allows the fastest matching.

NOTES ON EXTENDING THE PATTERN

If you plan to extend this to single-quoted literals, nothing could be simpler: just add another alternative to the first capturing group by reusing the double-quoted string literal regex and replacing the double quotes with single quotes:

    reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)
                                                        ^-------------------------^

Here is a demo of the regex that supports single- and double-quoted literals and removes multiline comments

Adding support for single-line comments is similar: just add a //[^\n\r]* alternative at the end:

    reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*[\r\n]*`)
                                                                                                                 ^-----------^

Here is a demo of the regex that supports single- and double-quoted literals and removes both multiline and single-line comments
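The text above names //[^\n\r]* as the single-line alternative, while the snippet uses //.*[\r\n]* instead. The following sketch, with hypothetical names `withLineBreaks`, `keepLineBreaks` and `strip`, runs both on the same input so the difference can be seen: the snippet's form also consumes the line break that ends the comment.

```go
package main

import (
	"fmt"
	"regexp"
)

// Shared part: quoted literals in group 1, then the multi-line comment alternative.
const base = `("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`

var (
	// As in the snippet: the trailing [\r\n]* swallows the line break(s) too.
	withLineBreaks = regexp.MustCompile(base + `|//.*[\r\n]*`)
	// As named in the text: stops right before the line break.
	keepLineBreaks = regexp.MustCompile(base + `|//[^\n\r]*`)
)

// strip (hypothetical helper) keeps quoted literals and drops comments.
func strip(re *regexp.Regexp, src string) string {
	return re.ReplaceAllString(src, "$1")
}

func main() {
	src := "x := 1 // set x\ny := \"http://example.com\" /* c */\n"
	fmt.Printf("%q\n", strip(withLineBreaks, src))
	fmt.Printf("%q\n", strip(keepLineBreaks, src))
}
```

In both cases the `//` inside the quoted URL is untouched, because the string literal is matched (and captured) before the comment alternative gets a chance to start there. Which variant to prefer depends on whether joining the following line onto the commented line is acceptable.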


I have never read or written anything in Go, so bear with me. Fortunately, I know regex. I did a little research on Go regular expressions, and it seems they lack most modern features (such as backreferences).

Despite this, I have developed a regex that seems to be what you are looking for. I am assuming that all strings are on a single line. Here it is:

    reg := regexp.MustCompile(`(?m)^([^"\n]*)/\*([^*]+|(\*+[^/]))*\*+/`)
    txt := `random text /* removable comment */ "but /* never remove this */ one" more random *text*`
    fmt.Println(reg.ReplaceAllString(txt, "${1}"))

Variation: the version above will not remove comments that occur after a quoted string. This version will, but it may need to be run several times.

    reg := regexp.MustCompile(
        `(?m)^(([^"\n]*|("[^"\n]*"))*)/\*([^*]+|(\*+[^/]))*\*+/`
    )
    txt := `
    random text
    what /* removable comment */ hi
    "but /* never remove this */ one" then /*whats here*/ i don't know /*what*/
    more random *text*
    `
    newtxt := reg.ReplaceAllString(txt, "${1}")
    fmt.Println(newtxt)
    newtxt = reg.ReplaceAllString(newtxt, "${1}")
    fmt.Println(newtxt)
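Since each pass removes at most one comment per line, "run several times" can be expressed as a loop that stops once a pass no longer changes the text. The sketch below assumes a hypothetical helper name, `removeAll`; the pattern is the one from the snippet above.

```go
package main

import (
	"fmt"
	"regexp"
)

var re = regexp.MustCompile(`(?m)^(([^"\n]*|("[^"\n]*"))*)/\*([^*]+|(\*+[^/]))*\*+/`)

// removeAll (hypothetical helper) reapplies the replacement until a pass
// changes nothing. Every successful replacement shortens the text, so the
// loop always terminates.
func removeAll(src string) string {
	for {
		next := re.ReplaceAllString(src, "${1}")
		if next == src {
			return src
		}
		src = next
	}
}

func main() {
	fmt.Printf("%q\n", removeAll(`a /* c1 */ b "s /* keep */ t" c /* c2 */ d`))
}
```

This trades extra passes over the input for not having to guess how many comments a line may contain.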

Explanation

  • (?m) means multi-line mode. Regex101 gives a good explanation for this:

    The anchors ^ and $ now match at the beginning / end of each line, respectively, instead of the beginning / end of the entire string.

    The pattern needs to be anchored to the beginning of each line (using ^ ) to ensure that a quoted string has not already started.

  • The first regex contains this: [^"\n]* . Essentially, it matches everything that is not " or \n . I added parentheses because this part is not a comment, so it needs to be put back.

  • The second regex contains this: (([^"\n]*|("[^"\n]*"))*) . It can either match [^"\n]* (as the first regex does), or ( | ) it can match a pair of quotation marks (and the contents between them) using "[^"\n]*" . This is repeated, so it works when there are several pairs of quotes, for example. Note that, as in the simpler regex, this non-comment part is captured.

  • Both regexes use this: /\*([^*]+|(\*+[^/]))*\*+/ . It matches /* followed by any number of:

    • [^*]+ Non- * characters

    or

    • \*+[^/] One or more * that are not followed by / .
  • And then it matches the closing */

  • In the replacement, ${1} refers to the non-comment part that was captured, so it is inserted back into the string.


Demo

See the Golang demo

(The result of each stage is printed, and the final result can be seen by scrolling down.)

Method

A few "tricks" are used to work around Golang's somewhat limited regex syntax:

  1. Replace start quotes and end quotes with unique characters. It is essential that the characters chosen for start quotes and end quotes differ from each other and are extremely unlikely to appear in the text being processed.
  2. Replace every comment starter ( /* ) that is preceded by an unterminated start quote with a unique sequence of one or more characters.
  3. Similarly, replace every comment terminator ( */ ) that is followed by an end quote with no start quote in front of it with a different unique sequence of one or more characters.
  4. Delete all remaining /*...*/ comment sequences.
  5. Unmask the previously masked comment starters and terminators by reversing the replacements made in steps 2 and 3 above.
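The original demo is not reproduced here, so the following is only one possible reading of the steps above, not the answer's actual code. The marker characters (control characters \x01 to \x04), all regexes, and the helper names are assumptions made for this sketch. It also restores the quotes in its last step.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Marker characters are assumptions: control characters that are unlikely
// to occur in ordinary text.
const (
	openQ     = "\x01" // stands in for a start quote
	closeQ    = "\x02" // stands in for an end quote
	maskOpen  = "\x03" // stands in for /* inside a quoted string
	maskClose = "\x04" // stands in for */ inside a quoted string
)

var (
	quoted      = regexp.MustCompile(`"([^"]*)"`)
	openInside  = regexp.MustCompile("(" + openQ + "[^" + closeQ + "]*)/\\*")
	closeInside = regexp.MustCompile("\\*/([^" + openQ + "]*" + closeQ + ")")
	comment     = regexp.MustCompile(`(?s)/\*.*?\*/`)
	unmask      = strings.NewReplacer(openQ, `"`, closeQ, `"`, maskOpen, "/*", maskClose, "*/")
)

// replaceRepeated applies the replacement until the text stops changing,
// because one pass masks only one occurrence per quoted string.
func replaceRepeated(re *regexp.Regexp, s, repl string) string {
	for {
		next := re.ReplaceAllString(s, repl)
		if next == s {
			return s
		}
		s = next
	}
}

// removeComments (hypothetical helper) follows the five steps in order.
func removeComments(src string) string {
	s := quoted.ReplaceAllString(src, openQ+"${1}"+closeQ) // step 1
	s = replaceRepeated(openInside, s, "${1}"+maskOpen)    // step 2
	s = replaceRepeated(closeInside, s, maskClose+"${1}")  // step 3
	s = comment.ReplaceAllString(s, "")                    // step 4
	return unmask.Replace(s)                               // step 5, plus restoring the quotes
}

func main() {
	fmt.Printf("%q\n", removeComments(`a /* c */ "s /* keep */ t" b`))
}
```

Like the approach it illustrates, this sketch does not handle a double quote inside a comment or escaped quotes inside a string.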

Limitations

The current demo does not address the possibility of a double quote appearing inside a comment, e.g. /* Not expected: " */ . Note: my feeling is that this can be handled; I just have not made the effort yet, so let me know if you think it could be a problem and I will look into it.


Just for fun, here is another approach: a minimal lexer implemented as a state machine, inspired by and well described in Rob Pike's talk http://cuddle.googlecode.com/hg/talk/lex.html . The code is more verbose, but more readable, understandable and hackable than a regular expression. It can also work with any Reader and Writer, not only strings, so it does not need to hold the whole input in RAM and may even be faster.

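    package main

    import (
        "io"
        "os"
        "strings"
    )

    // stateFn is one lexer state; it returns the next state, or nil when done.
    type stateFn func(*lexer) stateFn

    func run(l *lexer) {
        for state := lexText; state != nil; {
            state = state(l)
        }
    }

    type lexer struct {
        io.RuneReader
        io.Writer
    }

    // lexText copies plain text and watches for a quote or a comment opener.
    func lexText(l *lexer) stateFn {
        for r, _, err := l.ReadRune(); err != io.EOF; r, _, err = l.ReadRune() {
            switch r {
            case '"':
                l.Write([]byte(string(r)))
                return lexQuoted
            case '/':
                r, _, err = l.ReadRune()
                if err == io.EOF { // a lone "/" at the very end of the input
                    l.Write([]byte("/"))
                    return nil
                }
                if r == '*' {
                    return lexComment
                }
                l.Write([]byte("/"))
                l.Write([]byte(string(r)))
            default:
                l.Write([]byte(string(r)))
            }
        }
        return nil
    }

    // lexQuoted copies everything up to and including the closing quote.
    func lexQuoted(l *lexer) stateFn {
        for r, _, err := l.ReadRune(); err != io.EOF; r, _, err = l.ReadRune() {
            if r == '"' {
                l.Write([]byte(string(r)))
                return lexText
            }
            l.Write([]byte(string(r)))
        }
        return nil
    }

    // lexComment discards input until "*/". Tracking the previous rune also
    // handles comments that end in several asterisks, such as "**/".
    func lexComment(l *lexer) stateFn {
        prev := rune(0)
        for r, _, err := l.ReadRune(); err != io.EOF; r, _, err = l.ReadRune() {
            if prev == '*' && r == '/' {
                return lexText
            }
            prev = r
        }
        return nil
    }

    func main() {
        src := `random text /* removable comment */ "but /* never remove this */ one" more random *text*`
        run(&lexer{RuneReader: strings.NewReader(src), Writer: os.Stdout})
    }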

You can see it working at http://play.golang.org/p/HyvEeANs1u


Try this example.

play golang
