UTF-8 Encoding URLs

Information:

I have a program that generates XML sitemaps for Google Webmaster Tools (among other things).
GWT reports errors for some of the sitemaps because the URLs contain character sequences such as ã¾, ã <, ã €, etc. **

GWT says:

We require your Sitemap file to be UTF-8 encoded (you can usually do this when saving the file). As with all XML files, any data values (including URLs) must use entity escape codes for the characters &, ', ", < and >.

Special characters are escaped in the XML files (as HTML entities).
XML file fragment:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>http://domain/folder/listing-&#227;&#129;.shtml</loc>
      ...

Are my URLs UTF-8 encoded?

If not, how do I do that in Java?
Below is the line in my program where I add the URL to the sitemap:

  siteMap.addUrl(StringEscapeUtils.escapeXml(countryName+"/"+twoCharFile.getRelativeFileName().toLowerCase())); 

** = I am not sure which of these causes the error; perhaps the first two examples.

Sorry for all the editing.

+4
4 answers

Try using URLEncoder.encode(stringToBeEncoded, "UTF-8") to encode the URL.
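One caveat with this call: `URLEncoder` implements form encoding, so it also encodes `/` and turns spaces into `+`. A minimal sketch of one way to work around that is to encode each path segment separately. The class name `SitemapPathEncoder`, the helper `encodePath`, and the sample path are hypothetical and only serve as illustration:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SitemapPathEncoder {

    // Percent-encodes each segment of a relative path as UTF-8 and keeps the
    // "/" separators. URLEncoder emits "+" for a space (form encoding), so
    // that is rewritten to "%20"; a literal "+" is already emitted as "%2B".
    static String encodePath(String relativePath) {
        String[] segments = relativePath.split("/", -1);
        StringBuilder out = new StringBuilder();
        try {
            for (int i = 0; i < segments.length; i++) {
                if (i > 0) {
                    out.append('/');
                }
                out.append(URLEncoder.encode(segments[i], "UTF-8").replace("+", "%20"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e3 is "ã"; prints brasil/listing-%C3%A3.shtml
        System.out.println(encodePath("brasil/listing-\u00e3.shtml"));
    }
}
```

The result still has to go through XML escaping before it is written into the sitemap.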

+15

URLs must be percent-encoded according to the URI specification.

For example, the code point U+00E3 (ã) becomes the encoded sequence %C3%A3.

When a URI is emitted in an XML document, it must meet the markup requirements for XML.

For example, the URI http://foo/bar?a=b&x=%C3%A3 becomes http://foo/bar?a=b&amp;x=%C3%A3. The ampersand is an escape character in XML.
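The XML step can be sketched as follows. This is only an illustration under assumptions: `XmlUrlEscaper` and its hand-written `escapeXml` are hypothetical stand-ins so that the example has no dependencies; the question's program uses `StringEscapeUtils.escapeXml` for this job.

```java
public class XmlUrlEscaper {

    // Replaces the five characters reserved by XML with entity references.
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The URI is already percent-encoded; only XML escaping is applied here.
        String uri = "http://foo/bar?a=b&x=%C3%A3";
        // prints <loc>http://foo/bar?a=b&amp;x=%C3%A3</loc>
        System.out.println("<loc>" + escapeXml(uri) + "</loc>");
    }
}
```

The percent sign is not special in XML, so the `%C3%A3` sequence passes through unchanged.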

A detailed discussion of URI encoding can be found here.

+2

Do not confuse percent-encoding of non-ASCII characters in URLs with XML-escaping of characters in URLs. You need to do both when creating XML sitemaps.

To be honest, from reading your original post it seems like something fishy is going on, because the characters you mention look like the result of a character-set conversion gone wrong :)

Are you sure these characters are really part of your URLs when using UTF-8?
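To illustrate the kind of failed conversion meant here, the sketch below decodes UTF-8 bytes as ISO-8859-1. It is a guess at the mechanism, not a diagnosis: the sample character (Japanese "ま", U+307E) and the names `MojibakeDemo` and `misdecode` are assumptions chosen because the result resembles the "ã¾" sequence and the `&#227;&#129;` entities in the question.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1,
    // which mimics reading UTF-8 data with the wrong character set.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+307E is encoded as the three UTF-8 bytes E3 81 BE.
        String garbled = misdecode("\u307e");
        // Prints U+00E3 (ã), U+0081 (a control character) and U+00BE (¾).
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

If the file names go through a step like this before they reach the sitemap code, the fix belongs where the names are read, not in the URL encoding.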

+2

All non-ASCII characters in the URL must be percent-encoded ("x-url-encoding").

Here is a wiki link that explains this: http://en.wikipedia.org/wiki/Percent-encoding

In addition, all special XML characters (&, >, <, etc.) must also be escaped.

Jai's answer shows the correct way to x-url-encode an arbitrary string. Note, however, that it does not perform XML escaping.

+1

Source: https://habr.com/ru/post/925575/

