Python regex question: removing multi-line comments, but keeping line breaks

I am parsing a source code file and I want to remove all line comments (i.e. those starting with "//") and multi-line comments (i.e. /* ... */). However, if a multi-line comment contains at least one line break (\n), I want the output to have exactly one line break in its place.

For example, the code:

    qwe /* 123
    456
    789 */ asd

should turn into exactly:

    qwe
    asd

and not "qweasd" or:

    qwe

    asd

What would be the best way? Thanks


EDIT: Sample code for testing:

    comments_test = "hello // comment\n"+\
                    "line 2 /* a comment */\n"+\
                    "line 3 /* a comment*/ /*comment*/\n"+\
                    "line 4 /* a comment\n"+\
                    "continuation of a comment*/ line 5\n"+\
                    "/* comment */line 6\n"+\
                    "line 7 /*********\n"+\
                    "********************\n"+\
                    "**************/\n"+\
                    "line ?? /*********\n"+\
                    "********************\n"+\
                    "********************\n"+\
                    "********************\n"+\
                    "********************\n"+\
                    "**************/\n"+\
                    "line ??"

Expected results:

    hello
    line 2
    line 3
    line 4
    line 5
    line 6
    line 7
    line ??
    line ??
+4
5 answers
    import re

    comment_re = re.compile(
        r'(^)?[^\S\n]*/(?:\*(.*?)\*/[^\S\n]*|/[^\n]*)($)?',
        re.DOTALL | re.MULTILINE
    )

    def comment_replacer(match):
        start, mid, end = match.group(1, 2, 3)
        if mid is None:
            # single line comment
            return ''
        elif start is not None or end is not None:
            # multi line comment at start or end of a line
            return ''
        elif '\n' in mid:
            # multi line comment with line break
            return '\n'
        else:
            # multi line comment without line break
            return ' '

    def remove_comments(text):
        return comment_re.sub(comment_replacer, text)

  • (^)? will match if the comment starts at the beginning of a line, as long as the MULTILINE flag is used.
  • [^\S\n] will match any whitespace character except a newline. We do not want to match line breaks if the comment starts on its own line.
  • /\*(.*?)\*/ will match a multi-line comment and capture its content. The match is lazy, so it does not run across two or more comments. The DOTALL flag makes . match newlines.
  • //[^\n]* will match a single-line comment. We cannot use . here because of the DOTALL flag.
  • ($)? will match if the comment ends at the end of a line, as long as the MULTILINE flag is used.
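
The pieces described in the bullets above can be checked in isolation. This is a minimal sketch using only the standard `re` module; the sample strings and variable names are made up for illustration and are not part of the answer's code.

```python
import re

# [^\S\n] means "whitespace, but not a newline": spaces and tabs only.
blanks = re.findall(r'[^\S\n]+', "a \t b\n c")
print(blanks)          # [' \t ', ' ']

sample = "x /* one\ntwo */ y /* three */"

# Without DOTALL, "." stops at a newline, so the first comment is missed.
no_dotall = re.findall(r'/\*(.*?)\*/', sample)
print(no_dotall)       # [' three ']

# With DOTALL the lazy group captures each body, newlines included, and
# stops at the first "*/" rather than running on into the next comment.
with_dotall = re.findall(r'/\*(.*?)\*/', sample, flags=re.DOTALL)
print(with_dotall)     # [' one\ntwo ', ' three ']

# With MULTILINE, "^" and "$" match at every line boundary, so the optional
# groups (^)? and ($)? record whether the comment touches a line edge.
edge = re.compile(r'(^)?/\*.*?\*/($)?', re.DOTALL | re.MULTILINE)
m = edge.search("code\n/* c */ more")
touches_start = m.group(1) is not None
touches_end = m.group(2) is not None
print(touches_start, touches_end)   # True False
```

An optional group that did not take part in the match reports `None`, which is what lets the replacement function tell "comment at a line edge" apart from "comment in the middle of a line".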

Examples:

    >>> s = ("qwe /* 123\n"
    ...      "456\n"
    ...      "789 */ asd /* 123 */ zxc\n"
    ...      "rty // fgh\n")
    >>> print('"' + '"\n"'.join(
    ...     remove_comments(s).splitlines()
    ... ) + '"')
    "qwe"
    "asd zxc"
    "rty"
    >>> comments_test = ("hello // comment\n"
    ...                  "line 2 /* a comment */\n"
    ...                  "line 3 /* a comment*/ /*comment*/\n"
    ...                  "line 4 /* a comment\n"
    ...                  "continuation of a comment*/ line 5\n"
    ...                  "/* comment */line 6\n"
    ...                  "line 7 /*********\n"
    ...                  "********************\n"
    ...                  "**************/\n"
    ...                  "line ?? /*********\n"
    ...                  "********************\n"
    ...                  "********************\n"
    ...                  "********************\n"
    ...                  "********************\n"
    ...                  "**************/\n"
    ...                  "line ??")
    >>> print('"' + '"\n"'.join(
    ...     remove_comments(comments_test).splitlines()
    ... ) + '"')
    "hello"
    "line 2"
    "line 3 "
    "line 4"
    "line 5"
    "line 6"
    "line 7"
    "line ??"
    "line ??"

edits:

  • Updated to new specification.
  • Another example added.
+9

The fact that you even have to ask this question, and that the proposed solutions are, shall we say, a little less than perfectly readable :-) should be a good indication that REs are not the real answer to this question.

You would be much better off, in terms of readability, actually coding this up as a relatively simple parser.

Too often, people try to use REs to be "clever" (I don't mean that disparagingly), thinking that a single line is elegant, but all they end up with is an unmaintainable quagmire of characters. I would rather have a fully commented 20-line solution that I can understand in an instant.
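
One way the "relatively simple parser" idea might look is sketched below. It is an illustration rather than the answerer's own code: `strip_comments` is a hypothetical name, and the sketch assumes the input contains no string literals with `//` or `/*` inside them and that comments do not nest.

```python
def strip_comments(text):
    """Remove // and /* */ comments, keeping at most one line break per comment."""
    out = []                      # output characters collected so far
    i, n = 0, len(text)

    def drop_trailing_blanks():
        # Remove spaces/tabs that were emitted just before a comment.
        while out and out[-1] in " \t":
            out.pop()

    while i < n:
        if text.startswith("//", i):
            drop_trailing_blanks()
            newline = text.find("\n", i)
            i = n if newline == -1 else newline   # keep the newline itself
        elif text.startswith("/*", i):
            drop_trailing_blanks()
            close = text.find("*/", i + 2)
            if close == -1:                       # unterminated: runs to the end
                close = n
            body = text[i + 2:close]
            i = min(close + 2, n)
            while i < n and text[i] in " \t":     # skip blanks after the comment
                i += 1
            at_line_start = not out or out[-1] == "\n"
            at_line_end = i == n or text[i] == "\n"
            if at_line_start or at_line_end:
                pass                              # the existing line break suffices
            elif "\n" in body:
                out.append("\n")                  # exactly one line break
            else:
                out.append(" ")                   # keep the neighbours separated
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(strip_comments("qwe /* 123\n456\n789 */ asd"))
```

The decision logic (line start, line end, contains a newline, otherwise a space) mirrors the cases in the regex answer, but each case is a named condition. Handling string literals would mean adding one more branch to the loop instead of reworking a pattern.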

+5

Is this what you are looking for?

    >>> print(s)
    qwe /* 123
    456
    789 */ asd
    >>> print(re.sub(r'\s*/\*.*\n.*\*/\s*', '\n', s, flags=re.S))
    qwe
    asd

This only handles comments that span more than one line and leaves the others alone.
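
A caveat worth checking before relying on this pattern: both `.*` parts are greedy, and `re.S` lets them cross line breaks, so when the text holds more than one comment the match can stretch from the first `/*` to the last `*/`. A small check with made-up inputs:

```python
import re

pattern = r'\s*/\*.*\n.*\*/\s*'

# A multi-line comment followed by a second comment: "b" is swallowed.
two_comments = "a /* x\ny */ b /* z */ c"
result_one = re.sub(pattern, '\n', two_comments, flags=re.S)
print(repr(result_one))   # 'a\nc'

# Two single-line comments on different lines: the newline between them
# satisfies the \n in the pattern, so both comments and the code in
# between are removed together.
two_lines = "a /* x */ b\nc /* y */ d"
result_two = re.sub(pattern, '\n', two_lines, flags=re.S)
print(repr(result_two))   # 'a\nd'
```

So the pattern behaves as described for a single comment, as in the example above, but would need lazy quantifiers or a stricter body pattern for general input.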

+1

How about this:

    re.sub(r'\s*/\*(.|\n)*?\*/\s*', '\n', s, flags=re.DOTALL).strip()

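It matches leading whitespace, /*, any text (including newlines) up to the first */, and then any whitespace after it. Note that flags must be passed by keyword here, because the fourth positional argument of re.sub is count.

This is a slight twist on sycora's example, but it is also non-greedy on the inside. You might also want to look into the MULTILINE option.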

+1

See can-regular-expressions-be-used-to-match-nested-patterns: if you need to handle nested comments, regular expressions are not the solution.
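
To illustrate the point about nesting: a lazy pattern stops at the first `*/`, so on nested input it leaves the tail of the outer comment behind, while a scanner with a depth counter does not. This is a minimal sketch; `strip_nested` is a hypothetical helper. Block comments in C and C++ do not nest, so this only matters for languages or dialects that allow it.

```python
import re


def strip_nested(text):
    """Remove /* ... */ comments, treating inner /* ... */ pairs as nested."""
    out = []
    depth = 0          # number of comments currently open
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "/*":
            depth += 1
            i += 2
        elif pair == "*/" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(text[i])   # only keep text outside all comments
            i += 1
    return "".join(out)


nested = "a /* outer /* inner */ still outer */ b"
regex_result = re.sub(r'/\*.*?\*/', '', nested, flags=re.DOTALL)
scanner_result = strip_nested(nested)
print(repr(regex_result))    # 'a  still outer */ b'
print(repr(scanner_result))  # 'a  b'
```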

0