How to encode ASCII URL characters?

I am using Ruby to extract the URL of a file for uploading and downloading. The file name contains UTF-8 characters, for example:

 www.domain.com/.../ÖÇÄÜ360ÓïÒôÖúÀí.txt 

When I try to download the above URL, it fails. Using URI::escape produces a URL that does not work either:

 www.domain.com/.../%C3%96%C3%87%C3%84%C3%9C360%C3%93%C3%AF%C3%92%C3%B4%C3%96%C3%BA%C3%80%C3%AD.txt 

But if I encode it according to the table at the URL Encoding link, it works:

 www.domain.com/.../%D6%C7%C4%DC360%D3%EF%D2%F4%D6%FA%C0%ED.txt 

I tried to find a function in Ruby that does the same encoding, but I could not find one. Before I write a function that implements the table in the link above, I want to ask whether anyone knows of an existing library that does this. And if I do end up writing it myself, which range of characters should I encode? Obviously not all of them.

I am using JRuby 1.6.2 with RUBY_VERSION => "1.8.7"

1 answer

Oh, the joys of character encodings!

Here is what is happening. Ruby internally stores the string you retrieved as a sequence of bytes, which is the UTF-8 encoding of the file name. When you call URI.escape on it, the non-ASCII bytes are escaped in %xy form, and the resulting string, which now consists solely of bytes in the ASCII range, is used as the URL.

However, the receiving server interprets these bytes (after unescaping them from the %xy form) as if they were in a different encoding, in this case ISO-8859-1, and so the file name it ends up with does not match anything it has.
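The mismatch can be reproduced on a single character. This is only a small sketch, assuming Ruby 1.9 or later (for String#encode and String#force_encoding); the variable names are made up for illustration, and the "server" is simulated by reinterpreting the bytes locally.

```ruby
# One character, two byte sequences, depending on the encoding.
name = "\u00D6" # "Ö", the first character of the file name

utf8_bytes   = name.encode('UTF-8').bytes      # [195, 150], i.e. C3 96
latin1_bytes = name.encode('ISO-8859-1').bytes # [214], i.e. D6

# Simulate the server: take the UTF-8 bytes but read them as ISO-8859-1.
misread = name.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts utf8_bytes.map { |b| format('%%%02X', b) }.join   # prints %C3%96
puts latin1_bytes.map { |b| format('%%%02X', b) }.join # prints %D6
puts misread == name                                   # prints false
```

The server sees two characters (U+00C3 and U+0096) where the client meant one, so the lookup fails.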

Here is a demonstration using Ruby 1.9, since it has better support for encodings.

    1.9.3-p194 :003 > f
     => "ÖÇÄÜ360ÓïÒôÖúÀí.txt"
    1.9.3-p194 :004 > f.encoding
     => #<Encoding:UTF-8>
    1.9.3-p194 :005 > URI.escape f
     => "%C3%96%C3%87%C3%84%C3%9C360%C3%93%C3%AF%C3%92%C3%B4%C3%96%C3%BA%C3%80%C3%AD.txt"
    1.9.3-p194 :006 > g = f.encode 'iso-8859-1'
     => "\xD6\xC7\xC4\xDC360\xD3\xEF\xD2\xF4\xD6\xFA\xC0\xED.txt"
    1.9.3-p194 :007 > g.encoding
     => #<Encoding:ISO-8859-1>
    1.9.3-p194 :008 > URI.escape g
     => "%D6%C7%C4%DC360%D3%EF%D2%F4%D6%FA%C0%ED.txt"

So the solution in this case is to encode the string as ISO-8859-1 before escaping it. In Ruby 1.9 you do this as shown above; in earlier versions you can use Iconv (I'm assuming JRuby includes Iconv, I'm not really familiar with JRuby):

    1.8.7 :001 > f
     => "\303\226\303\207\303\204\303\234360\303\223\303\257\303\222\303\264\303\226\303\272\303\200\303\255.txt"
    1.8.7 :005 > g = Iconv.conv('iso-8859-1', 'utf-8', f)
     => "\326\307\304\334360\323\357\322\364\326\372\300\355.txt"
    1.8.7 :006 > URI.escape f
     => "%C3%96%C3%87%C3%84%C3%9C360%C3%93%C3%AF%C3%92%C3%B4%C3%96%C3%BA%C3%80%C3%AD.txt"
    1.8.7 :007 > URI.escape g
     => "%D6%C7%C4%DC360%D3%EF%D2%F4%D6%FA%C0%ED.txt"
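For readers on a current Ruby, the same two steps (convert, then escape) might be wrapped up as below. This is a sketch under assumptions, not the code from the sessions above: URI.escape was removed in Ruby 3.0, so the percent-escaping is done by hand here, and escape_for_server is a hypothetical helper name. Unlike URI.escape it escapes every byte outside the RFC 3986 unreserved set, so it is meant for a single path segment such as the file name, not a whole URL.

```ruby
# Hypothetical helper: convert a file name to the encoding the server
# expects, then percent-escape every byte that is not an unreserved
# character (letters, digits, "-", ".", "_", "~").
def escape_for_server(name, server_encoding = 'ISO-8859-1')
  bytes = name.encode(server_encoding).b # .b gives a binary (byte-wise) copy
  bytes.gsub(/[^A-Za-z0-9\-._~]/) { |byte| format('%%%02X', byte.ord) }
end

# The file name from the question, written with \u escapes so the
# example does not depend on the encoding of this source file.
name = "\u00D6\u00C7\u00C4\u00DC360\u00D3\u00EF\u00D2\u00F4\u00D6\u00FA\u00C0\u00ED.txt"

puts escape_for_server(name)
# prints %D6%C7%C4%DC360%D3%EF%D2%F4%D6%FA%C0%ED.txt
puts escape_for_server(name, 'UTF-8')
# prints %C3%96%C3%87%C3%84%C3%9C360%C3%93%C3%AF%C3%92%C3%B4%C3%96%C3%BA%C3%80%C3%AD.txt
```

The first result matches the URL that works in the question, the second matches the one that fails.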

Please note that, in general, you cannot depend on the server using any particular encoding. It should be using UTF-8, but evidently it isn't.
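If the server's encoding is not known in advance, one option is to build a candidate URL segment per plausible encoding and try them in turn. The sketch below assumes Ruby 1.9 or later; candidate_escapes, percent_encode_bytes and the default encoding list are all hypothetical choices for illustration. It also shows that a name cannot always be converted: characters outside ISO-8859-1 make String#encode raise.

```ruby
# Percent-escape every byte outside the RFC 3986 unreserved set.
def percent_encode_bytes(str)
  str.b.gsub(/[^A-Za-z0-9\-._~]/) { |byte| format('%%%02X', byte.ord) }
end

# Hypothetical helper: map each candidate encoding to the escaped name,
# skipping encodings that cannot represent the name at all.
def candidate_escapes(name, encodings = %w[UTF-8 ISO-8859-1])
  encodings.each_with_object({}) do |enc, result|
    begin
      result[enc] = percent_encode_bytes(name.encode(enc))
    rescue Encoding::UndefinedConversionError
      # The name contains a character this encoding has no byte for.
    end
  end
end

p candidate_escapes("\u00D6.txt")
# prints {"UTF-8"=>"%C3%96.txt", "ISO-8859-1"=>"%D6.txt"} (exact spacing varies by Ruby version)
p candidate_escapes("\u8BED.txt").keys
# prints ["UTF-8"], because U+8BED does not exist in ISO-8859-1
```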
