Regex: match a substring in all lines, except when the substring is inside the comments section

Here I go:

I am coding a PHP application and I have a new official domain for it, where all the FAQs are now located. Some of the files in my script include help links to the old FAQ section, so I want to replace them with the new domain. However, links to the old domain should stay untouched when they appear in line comments or comment blocks (I still use the old domain for self-reference and other documentation).

So basically, what I want to achieve is a regular expression that works, given the following:

  • Match all occurrences of example.com on all lines (*).
  • Do not match the entire line, just the string example.com.
    (*) If the line starts with //, /* or " *", do not match any instance of example.com on that line (although this may be a problem if a comment block is closed on the same line it was opened).

I usually write my block comments as follows:

 /* text
  * blah
  * blah
  */

That's why I don't want to match "example.com" if it comes after //, /* or " *".

I figured it would be something like this:

 ^(?:(?!//|/\*|\s\*).?).*example\.com 

But this has one problem: it matches the entire line, not just "example.com" (this causes problems mainly when two or more instances of "example.com" occur on the same line).

Can someone help me fix my regex? Please note: this does not have to be a PHP regex, since I could always use a tool like grepWin to edit all the files locally at once.

Oh, and please let me know if there is a way to handle block comments in general: for example, once /* is found, do not match example.com until */ is found. That would be extremely helpful. Is this possible with general (language-agnostic) regular expressions?

2 answers

A regex that matches example.com only if it is not inside a block comment (it does not handle line comments, so you will have to deal with those separately):

 $result = preg_replace(
     '%example\.com # Match example.com
     (?!            # only if it\'s not possible to match
      (?:           # the following:
       (?!/\*)      # (unless an opening comment starts first)
       .            # any character
      )*            # any number of times
      \*/           # followed by a closing comment.
     )              # End of lookahead
     %sx',
     'newdomain.com', $subject);
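
To make the scope of that lookahead concrete, here is a minimal sketch that runs the same pattern over a made-up three-line sample. The function name replaceOutsideBlockComments and the sample text are my own illustration, not part of the original code.

```php
<?php
// Hypothetical helper; the pattern is the lookahead approach described above.
function replaceOutsideBlockComments(string $subject): string
{
    $pattern = '%example\.com   # match example.com
        (?!                     # unless what follows is
          (?:(?!/\*).)*         # text that contains no opening comment,
          \*/                   # followed by a closing comment
        )%sx';
    return preg_replace($pattern, 'newdomain.com', $subject);
}

// Made-up sample: one block comment, one string literal, one line comment.
$sample = "/* docs: example.com */\n"
        . "\$faq = 'http://example.com/faq';\n"
        . "// old: example.com\n";

echo replaceOutsideBlockComments($sample);
```

The occurrence inside /* ... */ is left alone and the one in the string literal is replaced, but the one after // is replaced as well, which is why line comments need separate handling.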

I would use some kind of tokenizer to distinguish between comments and other language tokens.

Since you are processing PHP files, you should use PHP's token_get_all:

 $tokens = token_get_all($source); 

Then you can iterate over the tokens and handle them according to their token type:

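 foreach ($tokens as &$token) {
     if (!is_array($token)) {
         continue; // single-character token such as ";" or "{"
     }
     if (in_array($token[0], array(T_COMMENT, T_DOC_COMMENT), true)) {
         // comment: leave untouched
     } else {
         // not a comment
         $token[1] = str_replace('example.com', 'example.net', $token[1]);
     }
 }
 unset($token); // drop the reference left behind by foreach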

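At the end, put everything back together by joining the text of each token (for example with implode, after mapping each array token to its second element).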
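
Put together, the whole round trip might look like the sketch below. The function name replaceDomainOutsideComments and the sample source are assumptions of mine. It also assumes the tokenizer extension is available and that the input contains an opening <?php tag, because anything outside PHP tags comes back as a single inline-HTML token.

```php
<?php
// Hypothetical helper: replaces $old with $new in every token that is not a comment.
function replaceDomainOutsideComments(string $source, string $old, string $new): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (is_string($token)) {
            // Single-character tokens (";", "=", "{", ...) are plain strings.
            $out .= $token;
            continue;
        }
        [$id, $text] = $token; // array tokens: [token id, text, line number]
        if ($id !== T_COMMENT && $id !== T_DOC_COMMENT) {
            $text = str_replace($old, $new, $text);
        }
        $out .= $text;
    }
    return $out;
}

// Made-up sample source.
$source = "<?php\n"
        . "// docs: example.com\n"
        . "\$faq = 'http://example.com/faq';\n"
        . "/* example.com */\n";

echo replaceDomainOutsideComments($source, 'example.com', 'example.net');
```

Concatenating the token texts in order reproduces the original source, so only the string literal on the third line changes here.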

For other languages for which you do not have a suitable tokenizer, you can write your own small tokenizer:

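 preg_match_all('~/\*.*?\*/|//[^\n]*|(example\.com)|.~s', $code, $tokens, PREG_SET_ORDER);
 foreach ($tokens as &$token) {
     if (isset($token[1]) && $token[1] !== '') {
         // group 1 matched: example.com outside a comment
         $token = str_replace('example.com', 'example.net', $token[1]);
     } else {
         // a comment or any other single character: keep as is
         $token = $token[0];
     }
 }
 unset($token); // drop the reference left behind by foreach
 $code = implode('', $tokens);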

Note that this is not aware of any other tokens, such as strings. So it will also skip example.com when it appears inside a string that merely looks like a comment:

 'foo /* not a comment example.com */ bar' 
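
One way to reduce that problem is to let the small tokenizer recognise quoted strings as well, so that comment markers inside a string are not mistaken for a comment. The following is only a sketch under assumptions of mine: the function name is made up, strings are taken to be single- or double-quoted with backslash escapes (no heredocs), # comments are ignored, and occurrences inside strings are meant to be replaced.

```php
<?php
// Hypothetical helper: skips comments, rewrites the domain inside strings and bare text.
function replaceOutsideCommentsStringAware(string $code): string
{
    $pattern = <<<'REGEX'
~
  (?<comment> /\*.*?\*/ | //[^\n]* )
| (?<string>  '(?:\\.|[^'\\])*' | "(?:\\.|[^"\\])*" )
| (?<bare>    example\.com )
~sx
REGEX;

    return preg_replace_callback($pattern, function (array $m): string {
        if (isset($m['comment']) && $m['comment'] !== '') {
            return $m[0]; // comment: keep untouched
        }
        if (isset($m['string']) && $m['string'] !== '') {
            // string literal: comment markers inside it are just text
            return str_replace('example.com', 'example.net', $m[0]);
        }
        return 'example.net'; // bare occurrence outside comments and strings
    }, $code);
}

// Made-up sample.
$code = "Docs: example.com\n"
      . "\$a = 'foo /* not a comment example.com */ bar';\n"
      . "// see example.com\n"
      . "\$b = \"http://example.com/faq\";\n"
      . "/* example.com */\n";

echo replaceOutsideCommentsStringAware($code);
```

Because the alternatives are tried from left to right at each position, whichever of a comment or a string starts first wins. A stray apostrophe in plain text would still be read as the start of a string, so a real tokenizer remains the safer choice wherever one exists.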