Sorry, but I have to break the news to you that what you are trying to do is impossible. The main reason is that Java is not a regular language. As we all know, most regex engines provide non-regular features, but Python in particular lacks something like recursion (PCRE) or balancing groups (.NET) that could do the trick. But let's look at it in more detail.
First of all, why are your patterns not as good as you think? (This is for the task of matching public inside those literals; similar problems apply when you reverse the logic.)
As you have already noticed, you will have problems with line breaks (in the case of /*...*/). This can be solved either by using the modifier/option/flag re.S (which changes the behavior of .) or by using [\s\S] instead of . (since that class matches any character, including line breaks).
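A minimal sketch of the two fixes, assuming a simple greedy block-comment pattern (the pattern and the sample input are made up for illustration, they are not your originals):

```python
import re

source = "/* a\npublic b */"

# Without re.S, '.' stops at the line break, so the comment is not found.
no_flag = re.search(r"/\*.*\*/", source)

# With re.S (DOTALL), '.' also matches '\n'.
with_flag = re.search(r"/\*.*\*/", source, re.S)

# [\s\S] matches any character regardless of flags.
char_class = re.search(r"/\*[\s\S]*\*/", source)

print(no_flag, with_flag.group() == source, char_class.group() == source)
```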
But there are other problems. You only look for surrounding occurrences of the string or comment delimiters. You are not actually making sure that they are wrapped around the public in question. I'm not sure how much you can put on a single line in Java, but if you have an arbitrary string, then a public, and then another string on the same line, your regex will match the public because it can find a " before and after it. Even if that is not possible, if you have two block comments in the same input, then any public between the two block comments will cause a match. So you would need a way to assert that your public is really inside "..." or /*...*/, and not just that these delimiters can be found somewhere to the left and right of it.
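To make that concrete, here is a small sketch with two hypothetical stand-in patterns (I am assuming yours behave roughly like these; the names and inputs are invented):

```python
import re

# Hypothetical stand-ins for "find a delimiter somewhere left and right".
between_quotes = re.compile(r'".*public.*"')
between_comments = re.compile(r"/\*[\s\S]*public[\s\S]*\*/")

line = 'String a = "x"; public int y; String b = "z";'
code = "/* first */ public int y; /* second */"

# Neither 'public' is inside a literal, yet both patterns report a hit.
false_hit_string = between_quotes.search(line) is not None
false_hit_comment = between_comments.search(code) is not None
print(false_hit_string, false_hit_comment)
```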
Next: matches cannot overlap. But your match includes everything from the opening delimiter to the closing delimiter. Therefore, if you have "public public", that results in only a single match, and capturing cannot help you here. Usually the trick to avoid this is to use lookarounds (which are not included in the match). But (as we will see later) a lookbehind does not work as nicely as you might think, because it cannot be of arbitrary length (that is only possible in .NET).
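A quick sketch of the overlap problem, again with a hypothetical whole-literal pattern standing in for yours:

```python
import re

text = 's = "public public";'

# A pattern that consumes the whole literal can only report it once.
whole_literal = re.findall(r'"[^"]*public[^"]*"', text)

# Matching only the keyword itself leaves room for both occurrences,
# which is why the checks have to move into lookarounds.
keyword_only = re.findall(r"public", text)

print(len(whole_literal), len(keyword_only))
```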
Now the worst part. What if there is a " inside a comment? That shouldn't count, right? What if a string contains // or /* or */? That shouldn't count either, right? What about ' inside "-strings and " inside '-strings? Worse, what about \" inside a "-string? So, for 100% robustness, you would have to do a similar check for your surrounding delimiters as well. This is usually where regular expressions reach the end of their capabilities, and why you need a proper parser that walks the input string and builds a whole tree of your code.
But say that you never have comment delimiters inside strings, and you never have quotes inside comments (or only matched quotes, because they would form a string, and we don't want public inside strings anyway). So we basically assume that each of the literals in question is correctly matched and that they are never nested. In this case, you can use a lookahead to check whether you are inside or outside one of the literals (in fact, multiple lookaheads). I will get to that shortly.
But one more thing remains. Why does (?<!//).*public.* not work? For this to match, it is enough for (?<!//) to succeed at any single position. For example, if your input were just // public, the engine would try the negative lookbehind right at the beginning of the string (looking to the left of the start), would not find //, then use .* to consume the // and the space, and then match public. What you actually want is (?<!//.*)public. This starts the lookbehind at the starting position of public and looks all the way to the left through the current line. But... this is a variable-length lookbehind, which is only supported by .NET.
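Both halves of that can be checked quickly. This is only a sketch with a made-up input line; the second part relies on Python's re module rejecting variable-width lookbehinds at compile time:

```python
import re

line = "// public"

# The lookbehind is only tested where the match starts (position 0),
# so '.*' is free to swallow the '//' afterwards.
naive = re.search(r"(?<!//).*public.*", line)

# The pattern we would actually want does not even compile in Python.
try:
    re.compile(r"(?<!//.*)public")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(naive.group(), variable_width_ok)
```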
But let's see how we can make sure that we are really outside of a string. We can use a lookahead to look all the way to the end of the input and check that there is an even number of quotes on the way.
public(?=[^"]*("[^"]*"[^"]*)*$)
Now, if we try very hard, we can also ignore escaped quotes inside the string:
public(?=[^"]*("(?:[^"\\]|\\.)*"[^"]*)*$)
So, as soon as we encounter a ", we accept either non-quote, non-backslash characters, or a backslash and whatever follows it (which also allows escaped backslashes, so in "a string\\" we will not treat the closing " as escaped). We can use this with multi-line mode (re.M) so that we don't have to go all the way to the end of the input (because the end of the line is enough):
public(?=[^"\r\n]*("(?:[^"\r\n\\]|\\.)*"[^"\r\n]*)*$)
(re.M is implied for all of the following patterns.)
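Here is a small sketch comparing the three double-quote lookaheads so far. The patterns are the ones given above; the variable names and sample inputs are my own:

```python
import re

BASIC = re.compile(r'public(?=[^"]*("[^"]*"[^"]*)*$)')
ESCAPED = re.compile(r'public(?=[^"]*("(?:[^"\\]|\\.)*"[^"]*)*$)')
PER_LINE = re.compile(
    r'public(?=[^"\r\n]*("(?:[^"\r\n\\]|\\.)*"[^"\r\n]*)*$)', re.M
)

inside = 's = "public";'
outside = r'public x = "a \" b";'  # real keyword, followed by an escaped quote

# The basic version counts the escaped quote and wrongly rejects 'outside'.
basic_hits = [BASIC.search(inside), BASIC.search(outside)]
escaped_hits = [ESCAPED.search(inside), ESCAPED.search(outside)]

# With re.M each line is checked on its own.
per_line_starts = [m.start() for m in PER_LINE.finditer(outside + "\n" + inside)]
print(basic_hits, escaped_hits, per_line_starts)
```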
The same thing works for single-quoted strings:
public(?=[^'\r\n]*('(?:[^'\r\n\\]|\\.)*'[^'\r\n]*)*$)
For block comments it is a bit easier, because we only need to look for /* or the end of the string (this time really the end of the entire input) without encountering */ on the way. This is done with a negative lookahead at every single position until the end of the search:
public(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))
But, as I said, we are stumped by single-line comments for now. In any case, we can combine the last three regular expressions into one, because lookaheads do not actually advance the position of the regex engine in the target string:
public(?=[^"\r\n]*("(?:[^"\r\n\\]|\\.)*"[^"\r\n]*)*$)(?=[^'\r\n]*('(?:[^'\r\n\\]|\\.)*'[^'\r\n]*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))
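A minimal sketch of running the combined pattern, assembled from its three lookaheads for readability. The constant names and the sample source are assumptions of mine, and the caveats above (no nested or mixed literals) still apply:

```python
import re

DQ = r'(?=[^"\r\n]*("(?:[^"\r\n\\]|\\.)*"[^"\r\n]*)*$)'
SQ = r"(?=[^'\r\n]*('(?:[^'\r\n\\]|\\.)*'[^'\r\n]*)*$)"
BLOCK = r"(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))"
PATTERN = re.compile("public" + DQ + SQ + BLOCK, re.M)

source = '''public class A {
    String s = "public";
    /* public */
    public int x;
}'''

# 1-based line numbers of the occurrences that survive all three checks.
hit_lines = [source.count("\n", 0, m.start()) + 1 for m in PATTERN.finditer(source)]

# Single-line comments are not handled yet, so this still matches.
line_comment_hit = PATTERN.search("// public") is not None
print(hit_lines, line_comment_hit)
```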
Now what about those single-line comments? The usual trick to emulate a variable-length lookbehind is to reverse both the string and the pattern, which turns the lookbehind into a lookahead:
cilbup(?!.*//)
Of course, this means that we must also reverse all the other patterns. The good news is that if we don't care about escaping, they look exactly the same (because both quotes and block comments are symmetrical). So you can run this pattern on the reversed input:
cilbup(?=[^"\r\n]*("[^"\r\n]*"[^"\r\n]*)*$)(?=[^'\r\n]*('[^'\r\n]*'[^'\r\n]*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)
You can then find the match positions in your actual input with inputLength - foundMatchPosition - foundMatchLength.
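Putting the reversal and the position arithmetic together might look like the following sketch. The helper name find_public and the sample input are hypothetical; the pattern is the escape-unaware one from above:

```python
import re

REVERSED = re.compile(
    r'cilbup(?=[^"\r\n]*("[^"\r\n]*"[^"\r\n]*)*$)'
    r"(?=[^'\r\n]*('[^'\r\n]*'[^'\r\n]*)*$)"
    r"(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))"
    r"(?!.*//)",
    re.M,
)

def find_public(source):
    """Return start offsets (in the original source) of accepted matches."""
    reversed_source = source[::-1]
    length = len(source)
    starts = [
        length - m.start() - len(m.group())
        for m in REVERSED.finditer(reversed_source)
    ]
    return sorted(starts)

source = "public a; // public b\nint c; /* public */ public d;"
positions = find_public(source)
print(positions)
```

Only the first and the last public in that input should be reported; the one after // and the one inside the block comment are rejected.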
How about escaping? This gets quite unpleasant now, because we need to skip quotes if they are followed by a backslash. Due to some backtracking issues, we need to take care of this in five places: three times when consuming non-quote characters (because we now need to allow "\ as well), and twice when consuming quote characters (using a negative lookahead to make sure there is no backslash after them). Let's look at double quotes:
cilbup(?=(?:[^"\r\n]|"\\)*(?:"(?!\\)(?:[^"\r\n]|"\\)*"(?!\\)(?:[^"\r\n]|"\\)*)*$)
(This looks horrible, but if you compare it to the pattern that ignores escaping, you will notice that there are only a few differences.)
So, incorporating this into the pattern above:
cilbup(?=(?:[^"\r\n]|"\\)*(?:"(?!\\)(?:[^"\r\n]|"\\)*"(?!\\)(?:[^"\r\n]|"\\)*)*$)(?=(?:[^'\r\n]|'\\)*(?:'(?!\\)(?:[^'\r\n]|'\\)*'(?!\\)(?:[^'\r\n]|'\\)*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)
So this might actually do it for many cases. But, as you can see, it is horrible, almost impossible to read, and certainly impossible to maintain.
What were the caveats? No comment delimiters inside strings, no string literals inside strings of the other type, no string literals inside comments. In addition, we have four independent lookaheads, which will probably take some time (at least I think I have avoided most of the backtracking).
In any case, I believe this is as close as you can get with regular expressions.
EDIT:
I just realized that I forgot the condition that public must not be part of a longer identifier. You included spaces, but what if it is the first thing in the input? The easiest way would be to use \b. This matches a position (without including the surrounding characters) that lies between a word character and a non-word character. However, Java identifiers can contain any Unicode letter or digit, and I'm not sure whether Python's \b is Unicode-aware. In addition, Java identifiers may contain $, which would break it anyway. Lookarounds to the rescue! Instead of asserting that there is a space character on each side, let's assert that there is no non-space character. Since we need negative lookarounds for this, we get the advantage of not including these characters in the match for free:
(?<!\S)cilbup(?!\S)(?=(?:[^"\r\n]|"\\)*(?:"(?!\\)(?:[^"\r\n]|"\\)*"(?!\\)(?:[^"\r\n]|"\\)*)*$)(?=(?:[^'\r\n]|'\\)*(?:'(?!\\)(?:[^'\r\n]|'\\)*'(?!\\)(?:[^'\r\n]|'\\)*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)
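To see the difference between \b and the negative lookarounds in isolation, here is a sketch on the unreversed keyword (the sample inputs are invented; the lookarounds are symmetric, so they read the same around cilbup):

```python
import re

word_boundary = re.compile(r"\bpublic\b")
no_nonspace = re.compile(r"(?<!\S)public(?!\S)")

samples = ["public int x;", "$public = 1;", "republic"]

# Maps each sample to (matched by \b version, matched by lookaround version).
results = {
    s: (bool(word_boundary.search(s)), bool(no_nonspace.search(s)))
    for s in samples
}
print(results)
```

The \b version accepts $public, which is a single Java identifier, while the lookaround version only accepts public when it is delimited by whitespace or the ends of the input.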
And since scrolling that one-line pattern to the right does not really convey how ridiculously huge this regular expression is, here it is in free-spacing (re.X) mode with some annotations:
(?<!\S)