Trying to find all instances of a keyword NOT in comments or literals?

Question

I'm trying to find all instances of the keyword "public" in some Java code (with a Python script) that are not in comments or strings, i.e. not found after //, between a /* and a */, or between double or single quotation marks, and that are not part of variable names, i.e. they must be preceded by a space, tab, or newline, and followed by the same.

So here's what I have at the moment:

 //.*\spublic\s.*\n
 /\*.*\spublic\s.*\*/
 ".*\spublic\s.*"
 '.*\spublic\s.*'

Am I messing this up at all?

But this finds exactly what I'm NOT looking for. How can I turn it around and search for the inverse of the union of these four expressions as a single regex?

I've figured out that this probably involves negative look-ahead and look-behind, but I still can't quite piece it together. Also, for the /* */ regex, I'm concerned that .* doesn't match newlines, so it would fail to recognize that this public is in a comment:

 /*
 public
 */

Everything below this point is me thinking on paper and can be disregarded. These thoughts are not fully accurate.

Edit:

I believe that (?<!//).*public.* would match anything not in a single-line comment, so I'm getting the hang of things. I think. But I'm still unsure how to combine everything.

Edit2:

So, following that idea, I joined them all with | to get:

(?<!//).*public.*|(?<!/\*).*public.\*/(?!\*/)|(?<!").*public.*(?!")|(?<!').*public.*(?!')

But I'm not so sure about that. //public will not be matched by the first alternative, but it will be matched by the second. I need to AND the look-aheads and look-behinds, not OR the whole thing.

+7

python regex regex-negation

Aerovistae Dec 11 '12 at 5:31

4 answers

Have you considered replacing all comments and all single- and double-quoted string literals with empty strings using the re.sub() method? Then just do a simple search of the resulting file for the word you're looking for.

This will at least give you line numbers where the word is located. You can use this information to edit the source file.
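A minimal sketch of this approach, assuming Python 3 and lexically well-formed source (no unterminated strings or comments). The names LITERAL_OR_COMMENT, blank_out and find_keyword are made up for illustration. One deliberate variation: instead of substituting empty strings, it overwrites each comment or literal with spaces of the same length, so that offsets and line numbers in the cleaned text still line up with the original file.

```python
import re

# Matches, at whatever position comes first, a line comment, a block comment,
# a double-quoted string, or a single-quoted literal (escapes allowed).
LITERAL_OR_COMMENT = re.compile(
    r"""
      //[^\r\n]*                 # line comment
    | /\*[\s\S]*?\*/             # block comment (may span lines)
    | "(?:[^"\\\r\n]|\\.)*"      # double-quoted string
    | '(?:[^'\\\r\n]|\\.)*'      # single-quoted literal
    """,
    re.VERBOSE,
)


def blank_out(source):
    """Overwrite comments and literals with spaces, keeping line breaks."""
    return LITERAL_OR_COMMENT.sub(
        lambda m: re.sub(r"[^\r\n]", " ", m.group()), source
    )


def find_keyword(source, keyword="public"):
    """Return (line_number, offset) pairs for the whitespace-delimited keyword."""
    cleaned = blank_out(source)
    pattern = re.compile(r"(?<!\S)" + re.escape(keyword) + r"(?!\S)")
    return [
        (cleaned.count("\n", 0, m.start()) + 1, m.start())
        for m in pattern.finditer(cleaned)
    ]
```

Because the scan runs left to right and consumes whole comments and literals, a quote inside a comment or a // inside a string is swallowed by whichever construct started first.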

+2

Don O'Donnell Dec 11 '12 at 9:10

You can use pyparsing to find the public keyword outside a comment or a double-quoted string:

 from pyparsing import Keyword, javaStyleComment, dblQuotedString
 keyword = "public"
 expr = Keyword(keyword).ignore(javaStyleComment | dblQuotedString)

Example

 for [token], start, end in expr.scanString(r"""{keyword} should match
     /* {keyword} should not match "
         */
     // this {keyword} also shouldn't match
     "neither this \" {keyword}"
     but this {keyword} will
     re{keyword} is ignored
     '{keyword}' - also match (only double quoted strings are ignored)
     """.format(keyword=keyword)):
     assert token == keyword and len(keyword) == (end - start)
     print("Found at %d" % start)

Output

 Found at 0
 Found at 146
 Found at 187

To ignore single-quoted strings as well, you could use quotedString instead of dblQuotedString.

To do this with regular expressions alone, see the regex-negation tag on SO, for example "Regular expression to match a string that doesn't contain a word?" or, using even fewer regex features, "Regex: Matching by exclusion, without look-ahead - is it possible?". The simple way would be to use a positive match and skip the matched comments and quoted strings. What remains are the matches you want.
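The "positive match and skip" idea can be sketched with the standard re module alone: put comments and literals first in one alternation and capture only the keyword, then keep the matches where the capture group took part. This is an illustrative sketch, not the pyparsing implementation. TOKEN and keyword_offsets are hypothetical names, and the boundary check assumes identifier characters are \w plus $.

```python
import re

TOKEN = re.compile(
    r"""
      //[^\r\n]*                      # line comment: matched, then skipped
    | /\*[\s\S]*?\*/                  # block comment: matched, then skipped
    | "(?:[^"\\\r\n]|\\.)*"           # double-quoted string: skipped
    | '(?:[^'\\\r\n]|\\.)*'           # single-quoted literal: skipped
    | (?<![\w$])(public)(?![\w$])     # the keyword itself, captured in group 1
    """,
    re.VERBOSE,
)


def keyword_offsets(source):
    """Offsets of 'public' that are outside comments and quoted literals."""
    return [
        m.start(1)
        for m in TOKEN.finditer(source)
        if m.group(1) is not None  # only the last alternative sets group 1
    ]
```

Matches of the first four alternatives are simply discarded, which is what lets a single pass handle quotes inside comments and comment markers inside strings.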

+1

jfs Dec 11 '12 at 18:56

It finds the opposite because that is exactly what you are asking it for. :)

I don't know of a way to match them all in a single regex (though it should be theoretically possible, since regular languages are closed under complement and intersection). But you could definitely find all instances of public and then remove any instances that are matched by one of your "bad" regexes. Try using, for example, set.difference on the match.start and match.end values from re.finditer.
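A rough sketch of the set-difference idea, under the assumption that the "bad" regions are found with one left-to-right scan rather than with the four patterns from the question. KEYWORD, BAD_REGION and keyword_starts_outside are names invented for this example.

```python
import re

# The keyword, delimited by whitespace or the ends of the input.
KEYWORD = re.compile(r"(?<!\S)public(?!\S)")

# Line comments, block comments, double-quoted and single-quoted literals.
BAD_REGION = re.compile(
    r"""//[^\r\n]*|/\*[\s\S]*?\*/|"(?:[^"\\\r\n]|\\.)*"|'(?:[^'\\\r\n]|\\.)*'"""
)


def keyword_starts_outside(source):
    """Start offsets of 'public' that do not fall inside any bad region."""
    all_starts = {m.start() for m in KEYWORD.finditer(source)}
    regions = [m.span() for m in BAD_REGION.finditer(source)]
    bad_starts = {
        start
        for start in all_starts
        for lo, hi in regions
        if lo <= start < hi
    }
    return sorted(all_starts.difference(bad_starts))
```

The containment check is quadratic in the worst case, which is fine for a sketch. For large files, walking the two sorted lists in parallel would be the obvious refinement.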

0

Dougal Dec 11 '12 at 6:08

Martin Ender · Accepted Answer · 2012-12-13T09:16:47+0000

Sorry, but I have to break the news to you that what you are trying to do is impossible. The main reason is that Java is not a regular language. As we all know, most regex engines provide non-regular features, but Python in particular lacks something like recursion (PCRE) or balancing groups (.NET) that could do the trick. But let's look into that in more depth.

First of all, why are your patterns not as good as you think they are? (This is for the task of matching public inside those literals. Similar problems apply when the logic is reversed.)

As you have already recognized, you will have problems with line breaks (in the case of /*...*/). This can be solved either by using the modifier/option/flag re.S (which changes the behavior of .) or by using [\s\S] instead of . (because the former matches any character).
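A quick check of the two fixes, using a simplified block-comment pattern and a made-up three-line comment as input:

```python
import re

comment = "/*\npublic\n*/"

# Without re.S, "." stops at line breaks, so the comment is not recognized.
plain = re.search(r"/\*.*public.*\*/", comment)

# With re.S, "." also matches newlines.
dotall = re.search(r"/\*.*public.*\*/", comment, re.S)

# [\s\S] matches any character regardless of flags.
char_class = re.search(r"/\*[\s\S]*public[\s\S]*\*/", comment)

print(plain is None, dotall is not None, char_class is not None)
```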

But there are other problems. You only check that string or comment delimiters occur somewhere around the keyword. You are not actually making sure that they are wrapped specifically around the public in question. I'm not sure how much you can put on a single line in Java, but if you have an arbitrary string, then a public, and then another string on one line, your regex will match that public because it can find a " before and after it. Even if that is not possible, if you have two block comments in the same input, then any public between those two block comments will cause a match. So you would need to find a way to assert that your public is really inside "..." or /*...*/, and not just that these delimiters can be found somewhere to the left or right of it.

Next: matches cannot overlap. But your match includes everything from the opening delimiter to the closing delimiter. So if you had "public public", that would result in only one match. And capturing cannot help you here. Usually the trick to avoid this is to use lookarounds (which are not included in the match). But (as we will see later) lookbehind does not work as nicely as you might think, because it cannot be of arbitrary length (that is possible only in .NET).

Now the worst part. What if there is a " inside a comment? That shouldn't count, right? What if a string contains // or /* or */? That shouldn't count, right? What about ' inside "-strings and " inside '-strings? Even worse, what about \" inside a "-string? So, for 100% robustness, you would have to do a similar check for your surrounding delimiters as well. And this is usually where regular expressions reach the end of their capabilities, and why you need a proper parser that walks the input string and builds a whole tree of your code.

But say you never have comment delimiters inside strings and you never have quotes inside comments (or only matched quotes, because they would constitute a string, and we don't want public inside strings anyway). So we are basically assuming that each of the literals in question is correctly matched and that they are never nested. In that case, you can use a lookahead to check whether you are inside or outside one of the literals (in fact, multiple lookaheads). I'll get to that shortly.

But there is one more thing left. Why does (?<!//).*public.* not work? For this to match, it is enough for (?<!//) to succeed at any single position. For example, if your input were just // public, the engine would try the negative lookbehind right at the start of the string (looking to the left of the start of the string), would find no //, then use .* to consume // and the space, and then match public. What you actually want is (?<!//.*)public. This would start the lookbehind at the starting position of public and look all the way to the left along the current line. But... this is a variable-length lookbehind, which is only supported by .NET.
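Both observations are easy to confirm with Python's standard re module (the third-party regex module behaves differently and is not covered here). A small sketch:

```python
import re

# The lookbehind only has to succeed once, at the position where the match
# starts. At offset 0 nothing precedes the match, so it succeeds and ".*"
# then happily consumes the "//".
naive = re.compile(r"(?<!//).*public.*")
m = naive.search("// public")
print(repr(m.group()))  # the whole commented line is matched

# The pattern that expresses the real intent is rejected by the re module,
# because its lookbehind does not have a fixed width.
try:
    re.compile(r"(?<!//.*)public")
    variable_lookbehind_ok = True
except re.error:
    variable_lookbehind_ok = False
print(variable_lookbehind_ok)
```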

But let's see how we can make sure that we are really outside of a string. We can use a lookahead to look all the way to the end of the input and check that there is an even number of quotes on the way.

 public(?=[^"]*("[^"]*"[^"]*)*$)

Now, if we try really hard, we can also ignore escaped quotes inside a string:

 public(?=[^"]*("(?:[^"\\]|\\.)*"[^"]*)*$)

So, once we encounter a ", we accept either characters that are neither quotes nor backslashes, or a backslash and whatever follows it (which also allows backslashes themselves to be escaped, so that in "a string\\" we do not treat the closing " as escaped). We can use this with multi-line mode (re.M) to avoid going all the way to the end of the input (because the end of the line is enough):

 public(?=[^"\r\n]*("(?:[^"\r\n\\]|\\.)*"[^"\r\n]*)*$)

(re.M is implied for all of the following patterns.)

This is what it looks like for single-quoted strings:

 public(?=[^'\r\n]*('(?:[^'\r\n\\]|\\.)*'[^'\r\n]*)*$)
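The per-line quote-counting lookahead can be tried out on a few made-up Java lines. This sketch uses the double-quote pattern from above unchanged. OUTSIDE_DOUBLE_QUOTES and the sample lines are assumptions for illustration, and the single-quote variant works the same way.

```python
import re

OUTSIDE_DOUBLE_QUOTES = re.compile(
    r"""public(?=[^"\r\n]*("(?:[^"\r\n\\]|\\.)*"[^"\r\n]*)*$)""",
    re.M,  # "$" matches at the end of each line
)

source = "\n".join([
    'public String a = "x";',                         # outside any string
    'String b = "public";',                           # inside a string
    'String c = "x"; public int d; String e = "y";',  # between two strings
])

# 1-based line numbers of the occurrences the pattern accepts.
matched_lines = [
    source.count("\n", 0, m.start()) + 1
    for m in OUTSIDE_DOUBLE_QUOTES.finditer(source)
]
print(matched_lines)
```

The third line is the case the original patterns got wrong: there is a quote on either side of public, but the number of quotes to its right is even, so it is correctly treated as being outside a string.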

For block comments, this is a bit easier, because we only need to look for /* or the end of the string (this time really the end of the entire input), without ever encountering */ on the way. That is done with a negative lookahead at every single position until the end of the search:

 public(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))
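A small check of the block-comment lookahead on a made-up four-line input (OUTSIDE_BLOCK and src are names chosen for this sketch):

```python
import re

OUTSIDE_BLOCK = re.compile(r"public(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))")

src = "public a;\n/* public\n */\npublic b;"

# Looking right from each "public": reaching "/*" or the end of the input
# before any "*/" means we are not inside a block comment.
starts = [m.start() for m in OUTSIDE_BLOCK.finditer(src)]
print(starts)
```

Only the first and last occurrences are reported. The one on the second line runs into */ before it can reach another /* or the end of the input.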

But, as I said, we are stumped on single-line comments for now. In any case, we can combine the last three regular expressions into one, because lookaheads do not actually advance the position of the regex engine in the target string:

 public(?=[^"\r\n]*("(?:[^"\r\n\\]|\\.)*"[^"\r\n]*)*$)(?=[^'\r\n]*('(?:[^'\r\n\\]|\\.)*'[^'\r\n]*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))

Now what about those single-line comments? The usual trick for emulating a variable-length lookbehind is to reverse both the string and the pattern, which turns the lookbehind into a lookahead:

 cilbup(?!.*//)

Of course, this means that we have to reverse all the other patterns, too. The good news is that, if we don't care about escaping, they look exactly the same (because both quotes and block comments are symmetrical). So you can run this pattern on the reversed input:

 cilbup(?=[^"\r\n]*("[^"\r\n]*"[^"\r\n]*)*$)(?=[^'\r\n]*('[^'\r\n]*'[^'\r\n]*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)

You can then find the match positions in your actual input with inputLength - foundMatchPosition - foundMatchLength.
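Here is a sketch of running the escape-agnostic reversed pattern and mapping the positions back. The pattern is the one given above, split across three string literals only for readability. REVERSED_PATTERN, find_public and the sample source are hypothetical, and all the caveats about nesting and escaping still apply.

```python
import re

REVERSED_PATTERN = re.compile(
    r"""cilbup(?=[^"\r\n]*("[^"\r\n]*"[^"\r\n]*)*$)"""  # even number of "
    r"""(?=[^'\r\n]*('[^'\r\n]*'[^'\r\n]*)*$)"""        # even number of '
    r"""(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)""",  # no block / line comment
    re.M,
)


def find_public(source):
    """Offsets of 'public' in the original (non-reversed) source."""
    input_length = len(source)
    reversed_source = source[::-1]
    return sorted(
        # inputLength - foundMatchPosition - foundMatchLength
        input_length - m.start() - len(m.group())
        for m in REVERSED_PATTERN.finditer(reversed_source)
    )


src = (
    'public int a; // public note\n'
    'String s = "public";\n'
    '/* public\n'
    '*/ public int b;'
)
print(find_public(src))
```

In the reversed text every lookahead effectively looks to the left in the original, which is why the // check becomes the simple (?!.*//) at the end.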

Now how about escaping? This gets quite annoying, because we have to skip quotes if they are followed by a backslash. Because of some backtracking issues, we need to take care of that in five places: three times when consuming non-quote characters (because we now need to allow "\ as well), and twice when consuming quote characters (using a negative lookahead to make sure there is no backslash after them). Let's look at double quotes:

 cilbup(?=(?:[^"\r\n]|"\\)*(?:"(?!\\)(?:[^"\r\n]|"\\)*"(?!\\)(?:[^"\r\n]|"\\)*)*$)

(This looks horrible, but if you compare it with the pattern that disregards escaping, you will notice that there are only a few differences.)

So, incorporating this into the pattern above:

 cilbup(?=(?:[^"\r\n]|"\\)*(?:"(?!\\)(?:[^"\r\n]|"\\)*"(?!\\)(?:[^"\r\n]|"\\)*)*$)(?=(?:[^'\r\n]|'\\)*(?:'(?!\\)(?:[^'\r\n]|'\\)*'(?!\\)(?:[^'\r\n]|'\\)*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)

So this might actually do it for many cases. But, as you can see, it is horrible, almost impossible to read, and definitely impossible to maintain.

What were the caveats? No comment delimiters inside strings, no string literals inside the other type of string, no string literals inside comments. In addition, we have four independent lookaheads, which will probably take some time (at least I think I have eliminated most of the backtracking).

In any case, I believe this is as close as you can get with regular expressions.

EDIT:

I just realized that I forgot the condition that public must not be part of a longer identifier. You included spaces, but what if it is the first thing in the input? The easiest way would be to use \b. That matches a position (without including the surrounding characters) that lies between a word character and a non-word character. However, Java identifiers may contain any Unicode letter or digit, and I'm not sure whether Python's \b is Unicode-aware. In addition, Java identifiers may contain $, which would break that anyway. Lookarounds to the rescue! Instead of asserting that there is a whitespace character on each side, let's assert that there is no non-whitespace character. Since we need negative lookarounds for that, we get the advantage of not including those characters in the match for free:

 (?<!\S)cilbup(?!\S)(?=(?:[^"\r\n]|"\\)*(?:"(?!\\)(?:[^"\r\n]|"\\)*"(?!\\)(?:[^"\r\n]|"\\)*)*$)(?=(?:[^'\r\n]|'\\)*(?:'(?!\\)(?:[^'\r\n]|'\\)*'(?!\\)(?:[^'\r\n]|'\\)*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)
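The difference between \b and the negative lookarounds can be seen in isolation on a tiny made-up input. In Python 3, \b is Unicode-aware for str patterns, but it still treats $ as a non-word character:

```python
import re

sample = "public$x public"

# \b sees a boundary between "c" and "$", so the identifier public$x
# contributes a false hit.
with_word_boundary = re.findall(r"\bpublic\b", sample)

# "No non-whitespace neighbour" rejects public$x and keeps the real keyword.
with_lookarounds = re.findall(r"(?<!\S)public(?!\S)", sample)

# The lookbehind also succeeds at the very start of the input.
at_start = re.search(r"(?<!\S)public(?!\S)", "public class A {}")

print(len(with_word_boundary), len(with_lookarounds), at_start.start())
```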

And since you cannot quite grasp how ridiculously huge this regex is just from scrolling this code snippet to the right, here it is in free-spacing mode (re.X) with some annotations:

 (?<!\S)      # make sure there is no trailing non-whitespace character
 cilbup       # public
 (?!\S)       # make sure there is no leading non-whitespace character
 (?=          # lookahead (effectively lookbehind!) to ensure we are not inside a
              # string
   (?:[^"\r\n]|"\\)*
              # consume everything except for line breaks and quotes, unless the
              # quote is followed by a backslash (preceded in the actual input)
   (?:        # subpattern that matches two (unescaped) quotes
     "(?!\\)  # a quote that is not followed by a backslash
     (?:[^"\r\n]|"\\)*
              # we've seen that before
     "(?!\\)  # a quote that is not followed by a backslash
     (?:[^"\r\n]|"\\)*
              # we've seen that before
   )*         # end of subpattern - repeat 0 or more times (ensures even no. of ")
   $          # end of line (start of line in actual input)
 )            # end of double-quote lookahead
 (?=(?:[^'\r\n]|'\\)*(?:'(?!\\)(?:[^'\r\n]|'\\)*'(?!\\)(?:[^'\r\n]|'\\)*)*$)
              # the same horrible bastard again for single quotes
 (?=          # lookahead (effectively lookbehind) for block comments
   (?:        # subgroup to consume anything except */
     (?![*]/) # make sure there is no */ coming up
     [\s\S]   # consume an arbitrary character
   )*         # repeat
   (?:/[*]|\Z)# require to find either /* or the end of the string
 )            # end of lookahead for block comments
 (?!.*//)     # make sure there is no // on this line
