Find / replace url in document using Python regex

Question

Find / replace url in document using Python regex

Python regex experts! I am trying to change a string in an XML document. Source line:

<Tag name="low" Value="%hello%\dir"/>

The result I want to see:

 <Tag name="low" Value="C:\art"/>

My unsuccessful direct jump attempt:

 lines = re.sub("%hello%\dir"", "C:\art"/>

This does not work. Changes nothing in the document. Something with%?

For testing purposes, I tried:

 lines = re.sub("dir", "C:\art", a)

And I get:

 <Tag name="low" Value="%hello%\C:BELrt"/>

The problem is that \ a = BEL .

I tried a bunch of other things, but to no avail. How do I solve this problem?

+2

python xml regex

Renata pre Oct 12 '12 at 16:14

source share

3 answers

Philip · Answer 1 · 2012-10-12T16:17:22+0000

In Python, use the r prefix for a literal string to avoid having to flush your slashes. Then avoid the slash to avoid \d matching numbers.

 lines = re.sub(r"%hello%\\dir", r"C:\\art")

cwallenpoole · Answer 2 · 2012-10-12T16:18:45+0000

You problem is that you have some characters that have a specific meaning in regex's.

\d means any digit. %hello%\dir then %hello%[0-9]ir

You need to avoid these slashes / use a raw string to get around this:

 a = '''<Tag name="low" Value="%hello%\dir"/>''' lines = re.sub(r"%hello%\\dir", r"C:\\art", a) print(lines) #<Tag name="low" Value="C:\\art"/>

jfs · Answer 3 · 2012-10-12T16:54:15+0000

That's a good question. It shows three problems with text representation at once:

'\a' Python string literal is a single BELL character.
To enter a backslash followed by the letter 'a' in the Python source code, you need to either use raw literals: r'\a' , or execute the slash '\\a' .
r'\d' (two characters) has special meaning when interpreted as a regular expression ( r'\d' means the match of a digit in a regular expression).
In addition to the rules for Python string literals, you also need to avoid possible regular expression metacharacters. You can use re.escape(your_string) in the general case, or just r'\\d' or '\\\\d' . '\a' in the repl part should also be escaped (twice in your case: r'\\a' or '\\\\a' ):
```
 >>> old, new = r'%hello%\dir', r'C:\art' >>> print re.sub(re.escape(old), new.encode('string-escape'), xml) <Tag name="low" Value="C:\art"/> 
```
btw, you don't need regular expressions at all:
```
 >>> print xml.replace(old, new) <Tag name="low" Value="C:\art"/> 
```
finally, the value of the XML attribute cannot contain specific characters that must also be escaped, for example, '&' , '"' , "<" , etc.

In general, you should not use a regular expression to manipulate XML. Python stdlib has XML parsers.

 >>> import xml.etree.cElementTree as etree >>> xml = r'<Tag name="low" Value="%hello%\dir"/>' >>> tag = etree.fromstring(xml) >>> tag.set('Value', r"C:\art & design") >>> etree.dump(tag) <Tag Value="C:\art &amp; design" name="low" />

Find / replace url in document using Python regex

More articles: