Python XPath: tag attribute containing an apostrophe

I am new to XPath and am trying to parse a page with it. I need to get the title attribute from a tag, but the escaped apostrophe in the title breaks everything.

For parsing, I use Grab.

The tag from the source:

<img src='somelink' border='0' alt='commission:Alfred\'s misadventures' title='commission:Alfred\'s misadventures'>

Actual XPath:

 g.xpath('.//tr/td/a[3]/img').get('title') 

Returns

 commission:Alfred\\ 

Is there any way to fix this?

thanks

+7
2 answers

Garbage in, garbage out. Your input is not well-formed because it escapes the single quote character incorrectly. Many programming languages (including Python) use the backslash character to escape quotation marks in string literals. XML does not. You must either 1) surround the attribute value with double quotation marks, or 2) use &apos; to include a single quote.

From the XML specification:

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".
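A quick way to check that both options give the same attribute value is to parse each form with Python's standard-library xml.etree.ElementTree. This is only a sketch: the tag is a cut-down version of the one in the question, self-closed so that it parses as XML, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Option 1: double quotes around the value, so the apostrophe needs no escaping.
double_quoted = '<img title="commission:Alfred\'s misadventures"/>'

# Option 2: single quotes around the value, apostrophe written as &apos;.
entity_escaped = "<img title='commission:Alfred&apos;s misadventures'/>"

title_1 = ET.fromstring(double_quoted).get("title")
title_2 = ET.fromstring(entity_escaped).get("title")

print(title_1)             # commission:Alfred's misadventures
print(title_1 == title_2)  # True
```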

+5

Since the provided "XML" is not a well-formed document, because of the unescaped apostrophe inside a single-quoted attribute value, the XPath expression cannot be evaluated on it.
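As a small illustration, a strict XML parser such as Python's standard-library xml.etree.ElementTree rejects this kind of input outright. The sketch below uses a cut-down, self-closed version of the tag from the question. A lenient HTML parser may instead recover and cut the value at the apostrophe, which would be consistent with the truncated result shown in the question.

```python
import xml.etree.ElementTree as ET

# In XML a backslash does not escape the apostrophe, so the attribute value
# ends right after the backslash and the rest of the tag is invalid markup.
malformed = "<img title='commission:Alfred\\'s misadventures'/>"

try:
    ET.fromstring(malformed)
    parsed = True
except ET.ParseError as err:
    parsed = False
    print("not well-formed:", err)
```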

The malformed text can be corrected to:

 <img src="somelink" border="0" alt="commission:Alfred's misadventures" title="commission:Alfred's misadventures"/> 

If there is some strange requirement not to use double quotation marks, then one correct conversion is:

 <img src='somelink' border='0' alt='commission:Alfred&apos;s misadventures' title='commission:Alfred&apos;s misadventures'/> 

If you are given malformed input like this, then in a language such as C# you can try to convert it to a well-formed equivalent using:

 string correctXml = input.Replace("\\'s", "&apos;s"); 

Perhaps there is a similar way to do the same in Python.
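In Python, str.replace does the same job. The following is a minimal sketch rather than a general repair tool: it assumes that every backslash-apostrophe pair in the input is a wrongly escaped apostrophe inside an attribute value, and it uses the standard-library xml.etree.ElementTree (with the tag self-closed) in place of Grab to check the result.

```python
import xml.etree.ElementTree as ET

# Malformed input as in the question: a backslash-escaped apostrophe inside
# a single-quoted attribute value (self-closed here so it parses as XML).
raw = ("<img src='somelink' border='0' "
       "alt='commission:Alfred\\'s misadventures' "
       "title='commission:Alfred\\'s misadventures'/>")

# Replace every backslash-apostrophe pair with the XML entity &apos;.
fixed = raw.replace("\\'", "&apos;")

img = ET.fromstring(fixed)
title = img.get("title")
print(title)  # commission:Alfred's misadventures
```

The repair has to run on the raw page text before it reaches the parser; once the parser has already cut the attribute at the apostrophe, the rest of the value is lost.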

+1
