How to remove trailing comments via regexp?

Question


For readers who are not familiar with MATLAB: I am not sure which family they belong to, but MATLAB regular expressions are described in full in the MATLAB documentation. MATLAB's comment character is % (percent), and its string delimiter is ' (apostrophe). A string delimiter inside a string is written as a double apostrophe ( 'this is how you write "it''s" in a string.' ). To complicate matters, the matrix transpose operators are also apostrophes ( A' (Hermitian) or A.' (regular)).

Now, for dark reasons (which I will not elaborate on :), I am trying to interpret MATLAB code in MATLAB's own language.

I'm currently trying to remove all trailing comments in a cell array of strings, each of which contains one line of MATLAB code. At first glance, this may seem simple:

    >> str = 'simpleCommand(); % simple trailing comment';
    >> regexprep(str, '%.*$', '')
    ans =
    simpleCommand();

But of course, something like this might come along:

    >> str = ' fprintf(''%d%*c%3.0f\n'', value, args{:}); % Let''s do this! ';
    >> regexprep(str, '%.*$', '')
    ans =
     fprintf('        %// <-- WRONG!

Obviously, we need to exclude from the match all comment characters that sit inside strings, while also taking into account that a single apostrophe (or dot-apostrophe) directly following an expression is a transpose operator, not a string delimiter.

Based on the assumption that the number of string opening/closing characters before the comment character must be even (which I know is incomplete, because of the matrix transpose operator), I put together the following dynamic regular expression to handle this type of case:

    >> str = {
        'myFun( {''test'' ''%''}); % let''s '
        'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '
        'sprintf(str, ''%*8.0f%*s%c%3d\n''); % let''s '
        'sprintf(str, ''%*8.0f%*s%c%3d\n''); '
        'A = A.'';%tight trailing comment'
        };
    >>
    >> C = regexprep(str, '(^.*)(?@mod(sum(\1==''''''''),2)==0;)(%.*$)', '$1')

but

    C =
        'myFun( {'test' '%'}); '                %// success
        'sprintf(str, '%*8.0f%*s%c%3d\n'); '    %// success
        'sprintf(str, '%*8.0f%*s%c%3d\n'); '    %// success
        'sprintf(str, '%*8.0f%*s%c'             %// FAIL
        'A = A.';'                              %// success (although I'm not sure why)

so I'm almost there, but not quite yet :)

Unfortunately, I have exhausted the amount of time I can spend thinking about this and need to move on to other things, so perhaps someone with more time is kind enough to think about these questions:

Are comment characters inside strings the only exception I need to look out for?
What is the correct and/or more efficient way to do this?

+8

string regex parsing matlab

Rody oldenhuis Jun 28 '13 at 7:25


5 answers

How do you feel about using undocumented features? If you don't mind, you can use the mtree function to parse the code and strip the comments. No regular expressions are involved, and we all know that we should not try to parse context-free grammars with regular expressions.

This function is a full parser for MATLAB code, written in pure M-code. As far as I can tell it is an experimental implementation, but MathWorks already uses it in several places (it is the same function that MATLAB Cody and Contests use to measure code length), and it can be used for other useful things.

If the input is a cell array of strings, we do:

    >> str = {..};
    >> C = deblank(cellfun(@(s) tree2str(mtree(s)), str, 'UniformOutput',false))
    C =
        'myFun( { 'test', '%' } );'
        'sprintf( str, '%*8.0f%*s%c%3d\n' );'
        'sprintf( str, '%*8.0f%*s%c%3d\n' );'
        'sprintf( str, '%*8.0f%*s%c%3d\n' );'
        'A = A.';'

If you already have an M-file stored on disk, you can strip the comments simply as:

    s = tree2str(mtree('myfile.m', '-file'))

If you want the comments kept in the output, add the '-comments' option: mtree(.., '-comments')

+5

Amro Jun 28 '13 at 17:10


Look what I found! :)

Comment Removal Tool by Peter J. Acklam.

For m-code, it contains the following regex:

    mainregex = [ ...
        ' (                   ' ... % Grouping parenthesis (content goes to $1).
        '   ( ^ | \n )        ' ... % Beginning of string or beginning of line.
        '   (                 ' ... % Non-capturing grouping parenthesis.
        '                     ' ...
        '' ... % Match anything that is neither a comment nor a string...
        '       (             ' ... % Non-capturing grouping parenthesis.
        '           [\]\)}\w.]' ... % Either a character followed by
        '           ''+       ' ... %    one or more transpose operators
        '         |           ' ... % or else
        '           [^''%]    ' ... %   any character except single quote (which
        '                     ' ... %   starts a string) or a percent sign (which
        '                     ' ... %   starts a comment).
        '       )+            ' ... % Match one or more times.
        '                     ' ...
        '' ... % ...or...
        '     |               ' ...
        '                     ' ...
        '' ... % ...match a string.
        '       ''            ' ... % Opening single quote that starts the string.
        '         [^''\n]*    ' ... % Zero or more chars that are neither single
        '                     ' ... %   quotes (special) nor newlines (illegal).
        '         (           ' ... % Non-capturing grouping parenthesis.
        '           ''''      ' ... % An embedded (literal) single quote character.
        '           [^''\n]*  ' ... % Again, zero or more chars that are neither
        '                     ' ... %   single quotes nor newlines.
        '         )*          ' ... % Match zero or more times.
        '       ''            ' ... % Closing single quote that ends the string.
        '                     ' ...
        '   )*                ' ... % Match zero or more times.
        ' )                   ' ...
        ' [^\n]*              ' ... % What remains must be a comment.
                  ];

    % Remove all the blanks from the regex.
    mainregex = mainregex(~isspace(mainregex));

Which becomes

 mainregex = '((^|\n)(([\]\)}\w.]''+|[^''%])+|''[^''\n]*(''''[^''\n]*)*'')*)[^\n]*'

and should be used as

 C = regexprep(str, mainregex, '$1')

So far, it has passed all my tests, so I think this should solve my problem pretty well :)
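As a quick check of the pattern above, here is a minimal sketch that applies it to a few lines. The variable names `lines` and `stripped` and the third sample line are made up for illustration; only `mainregex` and the `regexprep` call come from the text above.

```matlab
% The compacted pattern from above, written as one MATLAB string literal.
mainregex = '((^|\n)(([\]\)}\w.]''+|[^''%])+|''[^''\n]*(''''[^''\n]*)*'')*)[^\n]*';

% Sample lines (illustrative): a format string full of % signs, a transpose
% directly followed by a comment, and a string with an embedded quote.
lines = {
    'fprintf(''%d%*c%3.0f\n'', value, args{:}); % Let''s do this!'
    'A = A.'';%tight trailing comment'
    'x = ''it''''s 100%''; % note'
    };

% Keep only group 1, i.e. everything that is code or a complete string.
stripped = regexprep(lines, mainregex, '$1');

% Show each result between brackets so trailing blanks are visible.
cellfun(@(s) fprintf('[%s]\n', s), stripped);
```

The pattern keeps whatever whitespace preceded the comment, so `deblank` can be applied to the result if trailing blanks matter.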

+4

Rody oldenhuis Jun 28 '13 at 14:07


I prefer to abuse checkcode (the replacement for the old mlint ) to do the parsing. Here is a suggestion:

    function strNC = removeComments(str)
    if iscell(str)
        % Process each line of a cell array separately.
        strNC = cellfun(@removeComments, str, 'UniformOutput', false);
    elseif regexp(str, '%', 'once')
        err = getCheckCodeId(str);
        % Tentatively cut from the last % to the end of the line.
        strNC = regexprep(str, '%[^%]*$', '');
        errNC = getCheckCodeId(strNC);
        if strcmp(err, errNC),
            % Same diagnostics: the cut was safe, look for further % signs.
            strNC = removeComments(strNC);
        else
            % The cut changed the diagnostics: that % was not a comment.
            strNC = str;
        end
    else
        strNC = str;
    end
    end

    function errid = getCheckCodeId(line)
    % Write the line to a temporary file and return the checkcode message ID.
    fName = 'someTempFileName.m';
    fh = fopen(fName, 'w');
    fprintf(fh, '%s\n', line);
    fclose(fh);
    if exist('checkcode')
        structRep = checkcode(fName, '-id');
    else
        structRep = mlint(fName, '-id');
    end
    delete(fName);
    if isempty(structRep)
        errid = '';
    else
        errid = structRep.id;
    end
    end

For each line, it checks whether trimming the line from the last % to the end of the line introduces an error.

For your example, it returns:

    >> removeComments(str)
    ans =
        'myFun( {'test' '%'}); '
        'sprintf(str, '%*8.0f%*s%c%3d\n'); '
        'sprintf(str, '%*8.0f%*s%c%3d\n'); '
        'sprintf(str, '%*8.0f%*s%c%3d\n'); '
        'A = A.';'

It does not remove the suppression directive, %#ok , so you get:

    >> removeComments('a=1; %#ok')
    ans =
    a=1; %#ok

That is most likely a good thing.

+2

Mohsen nosratinia Jun 28 '13 at 8:52


How about making sure that all apostrophes before the comment come in pairs, like this:

    >> str = {
        'myFun( {''test'' ''%''}); % let''s '
        'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '
        'sprintf(str, ''%*8.0f%*s%c%3d\n''); % let''s '
        'sprintf(str, ''%*8.0f%*s%c%3d\n''); '
        };
    >> C = regexprep(str, '^(([^'']*''[^'']*'')*[^'']*)%.*$', '$1')
    C =
        myFun( {'test' '%'});
        sprintf(str, '%*8.0f%*s%c%3d\n');
        sprintf(str, '%*8.0f%*s%c%3d\n');
        sprintf(str, '%*8.0f%*s%c%3d\n');

+1

ahilsend Jun 28 '13 at 8:23


Mohsen nosratinia · Accepted Answer · 2013-06-28T09:57:17+0000

This handles the conjugate transpose case by checking which characters are allowed directly before the apostrophe:

Digits: 2'
Letters: A'
Dot: A.'
Closing parenthesis, brace and bracket: A(1)' , A{1}' and [1 2 3]'

These are the only cases that I can think of now.

 C = regexprep(str, '^(([^'']*''[^'']*''|[^'']*[\.a-zA-Z0-9\)\}\]]''[^'']*)*[^'']*)%.*$', '$1')

For your example, it returns:

    >> C = regexprep(str, '^(([^'']*''[^'']*''|[^'']*[\.a-zA-Z0-9\)\}\]]''[^'']*)*[^'']*)%.*$', '$1')
    C =
        'myFun( {'test' '%'}); '
        'sprintf(str, '%*8.0f%*s%c%3d\n'); '
        'sprintf(str, '%*8.0f%*s%c%3d\n'); '
        'sprintf(str, '%*8.0f%*s%c%3d\n'); '
        'A = A.';'
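The same rule (an apostrophe directly after a letter, digit, dot or closing bracket is a transpose, otherwise it opens a string) can also be written as an explicit character scan, which may be easier to extend than the regex. This is only a sketch: the function name `stripTrailingComment` is hypothetical, treating an underscore or a preceding apostrophe (as in `A''`) as transpose context is my own assumption, and block comments, `...` continuations and double-quoted strings are not handled.

```matlab
function out = stripTrailingComment(line)
%STRIPTRAILINGCOMMENT Remove a trailing % comment from one line of code.
%   Hypothetical helper (not a MATLAB built-in). LINE is a char row vector.
    out = line;
    inString = false;   % true while inside a '...' literal
    prev = ' ';         % character directly before the current one (outside strings)
    n = numel(line);
    k = 1;
    while k <= n
        c = line(k);
        if inString
            if c == ''''
                if k < n && line(k+1) == ''''
                    k = k + 1;          % '' is an escaped quote: stay in the string
                else
                    inString = false;   % closing quote
                    prev = c;
                end
            end
        elseif c == '%'
            out = line(1:k-1);          % first % outside a string starts the comment
            return
        elseif c == ''''
            % Transpose if the apostrophe directly follows a word character,
            % a closing bracket, a dot or another apostrophe (assumed rule).
            isTranspose = ~isempty(regexp(prev, '[\w\)\]\}\.'']', 'once'));
            inString = ~isTranspose;
            prev = c;
        else
            prev = c;
        end
        k = k + 1;
    end
end
```

For a cell array of lines, it can be applied with `C = cellfun(@stripTrailingComment, str, 'UniformOutput', false)`. Like the regex, it leaves the whitespace in front of the comment in place.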
