Parsing VBA Const ... declarations with regex

I am trying to write a VBA parser. To create a ConstantNode, I need to be able to match all the possible variations of a Const declaration.

These work great:

  • Const foo = 123
  • Const foo$ = "123"
  • Const foo As String = "123"
  • Private Const foo = 123
  • Public Const foo As Integer = 123
  • Global Const foo% = 123

But I have 2 problems:

  • If there is a comment at the end of the declaration, it gets captured as part of the value:

     Const foo = 123 'this comment is included as part of the value

  • If two or more constants are declared in the same statement, I fail to match the whole statement:

     Const foo = 123, bar = 456

Here are the regular expressions that I use:

    /// <summary>
    /// Gets a regular expression pattern for matching a constant declaration.
    /// </summary>
    /// <remarks>
    /// Constants declared in class modules may only be <c>Private</c>.
    /// Constants declared at procedure scope cannot have an access modifier.
    /// </remarks>
    public static string GetConstantDeclarationSyntax()
    {
        return @"^((Private|Public|Global)\s)?Const\s(?<identifier>[a-zA-Z][a-zA-Z0-9_]*)(?<specifier>[%&@!#$])?(?<as>\sAs\s(?<reference>(((?<library>[a-zA-Z][a-zA-Z0-9_]*))\.)?(?<identifier>[a-zA-Z][a-zA-Z0-9_]*)))?\s\=\s(?<value>.*)$";
    }

Obviously, both problems are caused by the (?<value>.*)$ part, which matches anything up to the end of the line. I got VariableNode to support multiple declarations in a single statement by wrapping the entire pattern in a capture group and adding an optional comma, but because constants have this value group, doing the same here resulted in the first constant capturing all subsequent declarations as part of its value, which brings me back to problem #1.

I wonder whether problem #1 can be solved with a regular expression at all, given that the value can be a string containing an apostrophe and possibly several escaped (doubled) double quotes.

I think I can solve it in the ConstantNode class itself, in the getter for Value:

    /// <summary>
    /// Gets the constant value. Strings include delimiting quotes.
    /// </summary>
    public string Value
    {
        get
        {
            return RegexMatch.Groups["value"].Value;
        }
    }

I mean, I could implement some additional logic here to do what I cannot do with regex.


If problem #1 can be solved with a regular expression, then I believe problem #2 can be as well... or am I on the wrong track here? Should I ditch the [fairly complex] regex patterns and think of something else? I'm not too familiar with greedy subexpressions, backreferences, and other more advanced regex features. Is that what is limiting me, or am I just using the wrong hammer for this nail?

Note: it does not matter that the patterns potentially match illegal syntax, because this code will only ever run against compilable VBA code.

1 answer

Let me go ahead and add a disclaimer to this. This is absolutely not a good idea (but it was a fun challenge). The regular expressions I'm going to present will parse the test cases in the question, but they are obviously not bulletproof. Using a real parser will save you a lot of headaches later. I tried to find a parser for VBA, but came up empty-handed (and I guess everyone else has too).

Regex

For this to work well, you need to have some control over the VBA code. If you don't, you really should look at a parser instead of using regular expressions. However, judging by what you have already said, you may have a little control, so perhaps this will help.

So, for this I had to split the work into two regular expressions. The reason is that, when named groups are nested inside a repeating group, the .NET Regex library does not tie each nested capture to the iteration it came from, so it is simpler to capture each declaration as a whole and parse it separately.

The first regex matches the whole line and puts each variable (with its value) into one capture group; the second regex then parses each of those captures. Just FYI, the regular expressions use negative lookaheads.
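As a rough illustration of how a named group behaves inside a repeating group in .NET: `Group.Value` only reflects the last iteration, while `Group.Captures` keeps every iteration in order. The pattern and the `RepeatedGroupDemo` class below are hypothetical and heavily simplified (plain `name=digits` pairs), not the pattern from this answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RepeatedGroupDemo
{
    // Hypothetical, simplified pattern: one or more "name=digits" pairs separated by commas.
    private static readonly Regex Pairs =
        new Regex(@"^(?:(?<name>[a-z]+)=(?<val>\d+),?)+$");

    // Group.Value returns only the capture from the final iteration.
    public static string LastName(string input)
    {
        return Pairs.Match(input).Groups["name"].Value;
    }

    // Group.Captures returns one capture per iteration, in match order.
    public static string[] AllNames(string input)
    {
        return Pairs.Match(input).Groups["name"].Captures
            .Cast<Capture>()
            .Select(c => c.Value)
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(LastName("foo=123,bar=456"));                   // prints "bar"
        Console.WriteLine(string.Join(" ", AllNames("foo=123,bar=456"))); // prints "foo bar"
    }
}
```

Once optional groups are involved (a type hint on one constant but not the next, say), the per-group capture lists no longer line up by index, which is one reason to capture each declaration as a whole and re-parse it.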

 ^(?:(?<Accessibility>Private|Public|Global)\s)?Const\s(?<variable>[a-zA-Z][a-zA-Z0-9_]*(?:[%&@!#$])?(?:\sAs)?\s(?:(?:[a-zA-Z][a-zA-Z0-9_]*)\s)?=\s[^',]+(?:(?:(?!").)+")?(?:,\s)?){1,}(?:'(?<comment>.+))?$ 


Here is the regular expression for parsing the individual variables:

 (?<identifier>[a-zA-Z][a-zA-Z0-9_]*)(?<specifier>[%&@!#$])?(?:\sAs)?\s(?:(?<reference>[a-zA-Z][a-zA-Z0-9_]*)\s)?=\s(?<value>[^',]+(?:(?:(?!").)+")?),? 


And here is some C# code you can drop in to test everything. This should make it easier to check any edge cases you have.

    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            List<String> test = new List<string>
            {
                "Const foo = 123",
                "Const foo$ = \"123\"",
                "Const foo As String = \"1'2'3\"",
                "Const foo As String = \"123\"",
                "Private Const foo = 123",
                "Public Const foo As Integer = 123",
                "Global Const foo% = 123",
                "Const foo = 123 'this comment is included as part of the value",
                "Const foo = 123, bar = 456",
                "'Const foo As String = \"123\"",
            };

            foreach (var str in test)
                Parse(str);

            Console.Read();
        }

        // First pass: matches the whole line and captures each declaration (with its value) as one "variable" capture.
        private static Regex parse = new Regex(
            @"^(?:(?<Accessibility>Private|Public|Global)\s)?Const\s(?<variable>[a-zA-Z][a-zA-Z0-9_]*(?:[%&@!#$])?(?:\sAs)?\s(?:(?:[a-zA-Z][a-zA-Z0-9_]*)\s)?=\s[^',]+(?:(?:(?!"").)+"")?(?:,\s)?){1,}(?:'(?<comment>.+))?$",
            RegexOptions.Compiled | RegexOptions.Singleline,
            new TimeSpan(0, 0, 20));

        // Second pass: splits a single declaration into identifier, specifier, reference and value.
        private static Regex variableRegex = new Regex(
            @"(?<identifier>[a-zA-Z][a-zA-Z0-9_]*)(?<specifier>[%&@!#$])?(?:\sAs)?\s(?:(?<reference>[a-zA-Z][a-zA-Z0-9_]*)\s)?=\s(?<value>[^',]+(?:(?:(?!"").)+"")?),?",
            RegexOptions.Compiled | RegexOptions.Singleline,
            new TimeSpan(0, 0, 20));

        public static void Parse(String str)
        {
            Console.WriteLine(String.Format("Parsing: {0}", str));

            var match = parse.Match(str);
            if (match.Success)
            {
                // Private/Public/Global
                var accessibility = match.Groups["Accessibility"].Value;

                // Since we defined this with at least one capture, there should always be something here.
                foreach (Capture variable in match.Groups["variable"].Captures)
                {
                    //Console.WriteLine(variable);
                    var variableMatch = variableRegex.Match(variable.Value);
                    if (variableMatch.Success)
                    {
                        Console.WriteLine(String.Format("Identifier: {0}", variableMatch.Groups["identifier"].Value));

                        if (variableMatch.Groups["specifier"].Success)
                            Console.WriteLine(String.Format("specifier: {0}", variableMatch.Groups["specifier"].Value));

                        if (variableMatch.Groups["reference"].Success)
                            Console.WriteLine(String.Format("reference: {0}", variableMatch.Groups["reference"].Value));

                        Console.WriteLine(String.Format("value: {0}", variableMatch.Groups["value"].Value));
                        Console.WriteLine("");
                    }
                    else
                    {
                        Console.WriteLine(String.Format("FAILED VARIABLE: {0}", variable.Value));
                    }
                }

                if (match.Groups["comment"].Success)
                {
                    Console.WriteLine(String.Format("Comment: {0}", match.Groups["comment"].Value));
                }
            }
            else
            {
                Console.WriteLine(String.Format("FAILED: {0}", str));
            }

            Console.WriteLine("+++++++++++++++++++++++++++++++++++++++++++++");
            Console.WriteLine("");
        }
    }

The C# code is what I used to test my theory, so I apologize for the mess in it.

For completeness, here is a small sample of the output. If you run the code you will get more output, but this shows directly that it can handle the situations you asked about.

    Parsing: Const foo = 123 'this comment is included as part of the value
    Identifier: foo
    value: 123
    Comment: this comment is included as part of the value

    Parsing: Const foo = 123, bar = 456
    Identifier: foo
    value: 123
    Identifier: bar
    value: 456

What it handles

Here are the main cases that I can think of that you are probably interested in. It should still handle everything you had before, as I just added to the regular expression that you provided.

  • Comments
  • Multiple variable declarations on the same line
  • An apostrophe (the comment symbol) inside a string value, i.e. foo = "She's awesome"
  • If a line starts with a comment, the line is ignored

What it does not handle

The only thing I didn't really handle is spacing, but it should not be difficult to add yourself if you need it. For example, when declaring several variables there must be a space after the comma, i.e. (VALID: foo = 123, foobar = 124) (INVALID: foo = 123,foobar = 124)

You will not get much leniency in the format from it, but there is not much you can do about that when using regular expressions.
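One way the spacing restriction might be relaxed is to allow optional whitespace on both sides of the comma, i.e. swapping `(?:,\s)?` for `(?:\s*,\s*)?`. The sketch below uses a deliberately simplified, hypothetical pattern (numeric values only, no type hints or comments) just to show the effect of that one change; `SeparatorDemo`, `Strict` and `Lenient` are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SeparatorDemo
{
    // Simplified stand-in for the full pattern: requires exactly ", " between declarations.
    public static readonly Regex Strict = new Regex(
        @"^Const\s(?<variable>[a-zA-Z][a-zA-Z0-9_]*\s=\s\d+(?:,\s)?)+$");

    // Same pattern, but with optional whitespace on either side of the comma.
    public static readonly Regex Lenient = new Regex(
        @"^Const\s(?<variable>[a-zA-Z][a-zA-Z0-9_]*\s=\s\d+(?:\s*,\s*)?)+$");

    public static void Main()
    {
        string[] inputs =
        {
            "Const foo = 123, foobar = 124",
            "Const foo = 123,foobar = 124",
            "Const foo = 123 ,  foobar = 124",
        };

        foreach (var input in inputs)
        {
            Console.WriteLine("{0} -> strict: {1}, lenient: {2}",
                input, Strict.IsMatch(input), Lenient.IsMatch(input));
        }
    }
}
```

In the full pattern the value part `[^',]+` also swallows any spaces before a comma, so the captured value would probably need a `Trim()` after such a change.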


Hope this helps, and if you need more explanation of how this works, just let me know. Just know that this is a bad idea. You will run into situations that the regular expressions cannot handle. If I were you, I would consider writing a simple parser, which would give you more flexibility in the long run. Good luck.
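To give an idea of what the "simple parser" route could look like, here is a minimal sketch of a hand-written scanner that splits the text after `Const ` into individual declarations and an optional trailing comment, while respecting string literals. `ConstScanner` and `Split` are hypothetical names, and the sketch assumes a single physical line with no line continuations, `Rem` comments or date literals.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ConstScanner
{
    // Splits the text that follows "Const " into declarations, and reports
    // any trailing comment through the out parameter (null if there is none).
    public static List<string> Split(string body, out string comment)
    {
        var declarations = new List<string>();
        var current = new StringBuilder();
        bool inString = false;
        comment = null;

        for (int i = 0; i < body.Length; i++)
        {
            char c = body[i];
            if (c == '"')
            {
                // A doubled quote ("") inside a literal toggles the flag off and
                // straight back on, so plain toggling also covers escaped quotes.
                inString = !inString;
                current.Append(c);
            }
            else if (c == '\'' && !inString)
            {
                // An apostrophe outside a string starts a comment that runs to the end of the line.
                comment = body.Substring(i + 1);
                break;
            }
            else if (c == ',' && !inString)
            {
                // A comma outside a string separates two declarations.
                declarations.Add(current.ToString().Trim());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        string last = current.ToString().Trim();
        if (last.Length > 0)
            declarations.Add(last);

        return declarations;
    }

    public static void Main()
    {
        string comment;
        var parts = Split("foo = \"It's \"\"fine\"\", really\", bar = 456 'note", out comment);
        foreach (var part in parts)
            Console.WriteLine(part);
        Console.WriteLine("Comment: " + comment);
    }
}
```

Each returned declaration (for example `foo = 123`) could then be fed to a much smaller per-declaration regex, since the comment and the comma splitting are already dealt with.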
