Python regex question: removing multi-line comments, but keeping line breaks

Question

I am parsing a source code file and want to remove all line comments (i.e. those starting with "//") and all multi-line comments (i.e. /* ... */). However, if a multi-line comment contains at least one line break (\n), I want the output to contain exactly one line break in its place.

For example, the code:

    qwe /* 123
    456
    789 */ asd

should turn into exactly:

    qwe
    asd

and not "qweasd" or:

    qwe

    asd

What would be the best way? Thanks

EDIT: Sample code for testing:

    comments_test = "hello // comment\n"+\
                    "line 2 /* a comment */\n"+\
                    "line 3 /* a comment*/ /*comment*/\n"+\
                    "line 4 /* a comment\n"+\
                    "continuation of a comment*/ line 5\n"+\
                    "/* comment */line 6\n"+\
                    "line 7 /*********\n"+\
                    "********************\n"+\
                    "**************/\n"+\
                    "line ?? /*********\n"+\
                    "********************\n"+\
                    "********************\n"+\
                    "********************\n"+\
                    "********************\n"+\
                    "**************/\n"+\
                    "line ??"

Expected results:

    hello
    line 2
    line 3
    line 4
    line 5
    line 6
    line 7
    line ??
    line ??

+4

python comments regex parsing

Roee adler May 10, '09 at 4:20

5 answers

The fact that you even have to ask this question, and that the solutions proposed are, shall we say, less than perfectly readable :-) should be a good indication that REs are not the real answer to this question.

You would be far better off, from a readability viewpoint, coding this up as a relatively simple parser.

Too often, people try to use REs to be "clever" (I don't mean that disparagingly), thinking that a single line is elegant, but all they end up with is an unmaintainable morass of characters. I would rather have a fully commented 20-line solution that I can understand in an instant.
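For illustration, here is a minimal sketch of what such a hand-written scanner might look like. The names `strip_comments` and `_drop_trailing_blanks` are hypothetical, and the whitespace rules are one reading of the question rather than anything stated in the thread: blanks around a comment are trimmed, a block comment in the middle of a line becomes a newline if it spans lines and a single space otherwise, and a comment touching the start or end of a line leaves nothing behind. String literals containing "//" or "/*" are not handled.

```python
def _drop_trailing_blanks(out):
    # Remove spaces/tabs already emitted just before a comment.
    while out and out[-1] in ' \t':
        out.pop()


def strip_comments(text):
    out = []                      # output, one character per entry
    i, n = 0, len(text)
    while i < n:
        if text.startswith('//', i):
            # Line comment: skip to the end of the line, keep the newline.
            _drop_trailing_blanks(out)
            end = text.find('\n', i)
            i = n if end == -1 else end
        elif text.startswith('/*', i):
            # Block comment: skip to the closing */ (or to end of input).
            _drop_trailing_blanks(out)
            end = text.find('*/', i + 2)
            stop = n if end == -1 else end
            body = text[i + 2:stop]
            i = min(n, stop + 2)
            # Skip blanks that follow the comment on the same line.
            while i < n and text[i] in ' \t':
                i += 1
            at_line_start = not out or out[-1] == '\n'
            at_line_end = i >= n or text[i] == '\n'
            if not (at_line_start or at_line_end):
                # Mid-line comment: one newline if it spanned lines,
                # otherwise one space to keep the neighbours apart.
                out.append('\n' if '\n' in body else ' ')
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)


print(repr(strip_comments("qwe /* 123\n456\n789 */ asd")))  # 'qwe\nasd'
```

The loop is longer than a one-line regex, but each rule sits in its own branch, so changing the whitespace policy means editing one clearly labelled place.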

+5

paxdiablo May 10, '09 at 5:07

Is this what you are looking for?

    >>> print(s)
    qwe /* 123
    456
    789 */ asd
    >>> print(re.sub(r'\s*/\*.*\n.*\*/\s*', '\n', s, flags=re.S))
    qwe
    asd

This only affects comments that span more than one line and leaves the others alone.

+1

sykora May 10, '09 at 4:30

How about this:

    re.sub(r'\s*/\*(.|\n)*?\*/\s*', '\n', s, flags=re.DOTALL).strip()

It strips leading whitespace, the /*, any text and newlines up to the first */, and then any whitespace after that.

It is a small twist on sykora's example, but it is also non-greedy on the inside. You might also want to look into the MULTILINE option.
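To make the greedy versus non-greedy difference concrete, here is a small comparison of the two patterns from this thread. The test strings are made up for illustration.

```python
import re

GREEDY = r'\s*/\*.*\n.*\*/\s*'    # pattern from sykora's answer (used with re.S)
LAZY = r'\s*/\*(.|\n)*?\*/\s*'    # non-greedy pattern from this answer

two_comments = "a /* one\ntwo */ b /* three\nfour */ c"

# Greedy: the match runs from the first /* to the last */, swallowing "b".
print(repr(re.sub(GREEDY, '\n', two_comments, flags=re.S)))  # 'a\nc'

# Lazy: each comment is matched separately, so "b" survives.
print(repr(re.sub(LAZY, '\n', two_comments)))                # 'a\nb\nc'

one_liner = "x /* y */ z"

# The greedy pattern requires a newline inside the comment, so this is untouched.
print(repr(re.sub(GREEDY, '\n', one_liner, flags=re.S)))     # 'x /* y */ z'

# The lazy pattern also turns a one-line comment into a line break.
print(repr(re.sub(LAZY, '\n', one_liner)))                   # 'x\nz'
```

Neither pattern alone matches the behaviour asked for in the question: the greedy one can eat code between two comments, and the lazy one inserts a line break even when the comment had none.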

+1

Joseph Pecoraro May 10, '09 at 4:56

See can-regular-expressions-be-used-to-match-nested-patterns: if nested comments are a possibility, regular expressions are not the solution.
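As a rough illustration of why nesting is a problem, the sketch below compares a lazy regex with a simple depth counter. The function name `strip_nested` is hypothetical, and the example assumes a language whose block comments nest (C-style comments do not).

```python
import re


def strip_nested(text):
    """Drop /* ... */ comments, tracking nesting depth with a counter."""
    out = []
    depth = 0
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == '/*':
            depth += 1            # entering one more comment level
            i += 2
        elif pair == '*/' and depth > 0:
            depth -= 1            # leaving one comment level
            i += 2
        else:
            if depth == 0:        # only keep text outside all comments
                out.append(text[i])
            i += 1
    return ''.join(out)


s = "a /* b /* c */ d */ e"

# The regex stops at the first */, leaving the tail of the outer comment behind.
print(repr(re.sub(r'/\*.*?\*/', '', s, flags=re.S)))  # 'a  d */ e'

# The counter only resumes output once every opened comment is closed.
print(repr(strip_nested(s)))                          # 'a  e'
```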

0

gimel May 10, '09 at 4:57

Markus jarderot · Accepted Answer · 2009-05-10T05:01:02+0000

    import re

    comment_re = re.compile(
        r'(^)?[^\S\n]*/(?:\*(.*?)\*/[^\S\n]*|/[^\n]*)($)?',
        re.DOTALL | re.MULTILINE
    )

    def comment_replacer(match):
        start, mid, end = match.group(1, 2, 3)
        if mid is None:
            # single line comment
            return ''
        elif start is not None or end is not None:
            # multi line comment at start or end of a line
            return ''
        elif '\n' in mid:
            # multi line comment with line break
            return '\n'
        else:
            # multi line comment without line break
            return ' '

    def remove_comments(text):
        return comment_re.sub(comment_replacer, text)

(^)? will match if the comment starts at the beginning of a line, provided the MULTILINE flag is used.
[^\S\n] will match any whitespace character except a newline. We do not want to consume line breaks if the comment starts on its own line.
/\*(.*?)\*/ will match a multi-line comment and capture its content. The match is lazy, so it does not span two or more comments. The DOTALL flag makes . match newlines.
//[^\n]* will match a single-line comment. We cannot use . here because of the DOTALL flag.
($)? will match if the comment ends at the end of a line, provided the MULTILINE flag is used.
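The individual pieces described above can be checked in isolation. The following is a small sketch with made-up test strings; the names `blanks` and `probe` are only for illustration.

```python
import re

# [^\S\n] reads as "not non-whitespace and not newline",
# i.e. any whitespace character except '\n'.
blanks = re.compile(r'[^\S\n]+')
print(blanks.findall("a \t b\n c"))            # [' \t ', ' ']

# (^)? captures '' when the match starts at the beginning of a line
# and stays None otherwise, which is what comment_replacer tests for.
probe = re.compile(r'(^)?/\*', re.MULTILINE)
starts = [m.group(1) for m in probe.finditer("/* a */ x /* b */\n/* c */")]
print(starts)                                  # ['', None, '']

# With DOTALL, a lazy .*? can cross line breaks inside a block comment.
print(re.findall(r'/\*(.*?)\*/', "/* a\nb */", flags=re.DOTALL))  # [' a\nb ']

# With DOTALL, '//.*' would run to the end of the text,
# while '//[^\n]*' stops at the end of the line.
print(repr(re.sub(r'//.*', '', "a // c\nb", flags=re.DOTALL)))     # 'a '
print(repr(re.sub(r'//[^\n]*', '', "a // c\nb", flags=re.DOTALL))) # 'a \nb'
```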

Examples:

    >>> s = ("qwe /* 123\n"
    ...      "456\n"
    ...      "789 */ asd /* 123 */ zxc\n"
    ...      "rty // fgh\n")
    >>> print('"' + '"\n"'.join(
    ...     remove_comments(s).splitlines()
    ... ) + '"')
    "qwe"
    "asd zxc"
    "rty"
    >>> comments_test = ("hello // comment\n"
    ...                  "line 2 /* a comment */\n"
    ...                  "line 3 /* a comment*/ /*comment*/\n"
    ...                  "line 4 /* a comment\n"
    ...                  "continuation of a comment*/ line 5\n"
    ...                  "/* comment */line 6\n"
    ...                  "line 7 /*********\n"
    ...                  "********************\n"
    ...                  "**************/\n"
    ...                  "line ?? /*********\n"
    ...                  "********************\n"
    ...                  "********************\n"
    ...                  "********************\n"
    ...                  "********************\n"
    ...                  "**************/\n"
    ...                  "line ??")
    >>> print('"' + '"\n"'.join(
    ...     remove_comments(comments_test).splitlines()
    ... ) + '"')
    "hello"
    "line 2"
    "line 3 "
    "line 4"
    "line 5"
    "line 6"
    "line 7"
    "line ??"
    "line ??"

edits:

Updated to new specification.
Another example added.
