Find / replace url in document using Python regex

Python regex experts! I am trying to change a string in an XML document. Source line:

<Tag name="low" Value="%hello%\dir"/> 

The result I want to see:

 <Tag name="low" Value="C:\art"/> 

My unsuccessful direct jump attempt:

 lines = re.sub("%hello%\dir", "C:\art", a) 

This does not work: nothing in the document changes. Is it something to do with the %?

For testing purposes, I tried:

 lines = re.sub("dir", "C:\art", a) 

And I get:

 <Tag name="low" Value="%hello%\C:BELrt"/> 

The problem is that \a = BEL.

I tried a bunch of other things, but to no avail. How do I solve this problem?

3 answers

In Python, use the r prefix on a string literal so that you do not have to escape your backslashes for Python itself. Then escape the backslash for the regex engine so that \d is not read as "match a digit".

 lines = re.sub(r"%hello%\\dir", r"C:\\art", a) 

Your problem is that some of the characters in your pattern have a special meaning in regexes.

\d means any digit, so %hello%\dir is read as %hello%[0-9]ir.

You need to escape these backslashes and use a raw string to get around this:

 import re
 a = r'<Tag name="low" Value="%hello%\dir"/>'
 lines = re.sub(r"%hello%\\dir", r"C:\\art", a)
 print(lines)  # <Tag name="low" Value="C:\art"/>
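
To make the \d point concrete, here is a small sketch (assuming Python 3; the names text, unescaped and escaped are illustrative only). It suggests that the % signs are not the culprit: the unescaped pattern simply expects a digit where the document has a backslash.

```python
import re

text = r'<Tag name="low" Value="%hello%\dir"/>'

# Unescaped: \d is the digit class, so the pattern wants a digit right
# after %hello% and never matches the literal backslash followed by "d".
unescaped = re.compile(r"%hello%\dir")
print(unescaped.search(text))                       # None
print(unescaped.search("%hello%7ir") is not None)   # True

# Escaped: \\ in the pattern matches exactly one literal backslash.
escaped = re.compile(r"%hello%\\dir")
print(escaped.search(text) is not None)             # True
```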

That's a good question. It shows three problems with text representation at once:

  • The Python string literal '\a' is a single BELL character.

    To enter a backslash followed by the letter 'a' in Python source code, you need either a raw literal, r'\a', or an escaped backslash, '\\a'.

  • r'\d' (two characters) has a special meaning when interpreted as a regular expression: it matches a digit.

    In addition to the rules for Python string literals, you also need to escape any regular expression metacharacters. You can use re.escape(your_string) in the general case, or just r'\\d' or '\\\\d'. The '\a' in the repl part should also be escaped (twice in your case: r'\\a' or '\\\\a'):

     >>> import re
     >>> xml = r'<Tag name="low" Value="%hello%\dir"/>'
     >>> old, new = r'%hello%\dir', r'C:\art'
     >>> print re.sub(re.escape(old), new.encode('string-escape'), xml)
     <Tag name="low" Value="C:\art"/>

    btw, you don't need regular expressions at all:

     >>> print xml.replace(old, new)
     <Tag name="low" Value="C:\art"/>

  • Finally, an XML attribute value cannot contain certain characters literally; they must also be escaped, for example '&', '"', '<', etc.
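
The interactive sessions in this answer use Python 2 syntax (the print statement and the 'string-escape' codec, which Python 3 does not provide). A rough Python 3 sketch of the first two points could look like the following. Passing a function as the replacement is a substitute technique chosen here, not the answer's own method, and the names xml_line, via_regex and via_replace are illustrative.

```python
import re

xml_line = r'<Tag name="low" Value="%hello%\dir"/>'
old, new = r'%hello%\dir', r'C:\art'

# Layer 1: Python string literals.
print(len('\a'))         # 1 -> a single BEL character (0x07)
print(len(r'\a'))        # 2 -> a backslash followed by 'a'
print(r'\a' == '\\a')    # True

# Layer 2: regular expressions.
# re.escape() neutralises metacharacters in the pattern. A function used as
# the replacement has its return value inserted verbatim, so the backslash
# in `new` is not processed as an escape sequence.
via_regex = re.sub(re.escape(old), lambda match: new, xml_line)

# For a fixed string, no regex is needed at all.
via_replace = xml_line.replace(old, new)

print(via_regex)      # <Tag name="low" Value="C:\art"/>
print(via_replace)    # <Tag name="low" Value="C:\art"/>
```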

In general, you should not use a regular expression to manipulate XML. Python stdlib has XML parsers.

 >>> import xml.etree.cElementTree as etree
 >>> xml = r'<Tag name="low" Value="%hello%\dir"/>'
 >>> tag = etree.fromstring(xml)
 >>> tag.set('Value', r"C:\art & design")
 >>> etree.dump(tag)
 <Tag Value="C:\art &amp; design" name="low" />
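
The cElementTree module was removed in Python 3.9, so on a current interpreter the same idea might be sketched with xml.etree.ElementTree instead. The names source and serialized are illustrative, and the attribute order in the serialized output can differ between Python versions, so no exact output is shown for it.

```python
import xml.etree.ElementTree as ET

source = r'<Tag name="low" Value="%hello%\dir"/>'

tag = ET.fromstring(source)
tag.set('Value', r'C:\art & design')

# In memory, the attribute holds the plain, unescaped value.
print(tag.get('Value'))    # C:\art & design

# On output, the serializer escapes '&' as '&amp;' for us.
serialized = ET.tostring(tag, encoding='unicode')
print(serialized)
```

Parsing serialized again gives back the original value, which is the main reason to let the parser handle escaping rather than a regex.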
