WordPress / Apache error: 404 with Unicode characters in image file names

We recently moved our website to a new server and ran into an odd problem: some uploaded images with Unicode characters in the file name return a 404 error.

Through SSH / FTP, we can see that the files definitely exist.

For instance:

http://sjofasting.no/project/adnoy

None of the images work.

The code:

<img class='image-display' title='' src='http://sjofasting.no/wp/wp-content/uploads/2012/03/ådnøy_1_2.jpg' width='685' height='484'/> 

SSH:

-rw-r--r-- 1 xxxxxxxx xxxxxxxx 836813 Aug 3 16:12 ådnøy_1_2.jpg

It is also strange that if you go to the directory, you can even click on the image and it works:

http://sjofasting.no/wp/wp-content/uploads/2012/03/

click "ådnøy_1_2.jpg" and it works.

Somehow WordPress is generating

http://sjofasting.no/wp/wp-content/uploads/2012/03/ådnøy_1_2.jpg

and copying the link from the directory listing gives

http://sjofasting.no/wp/wp-content/uploads/2012/03/a%CC%8Adn%C3%B8y_1_2.jpg

What's happening?


Edit:

If I copy the image URL from the WordPress page source, I get:

http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellg%C3%A5rd-12.jpg

When I copy it from the Apache directory listing, I get:

http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellga%cc%8ard-12.jpg

What could explain the discrepancy between %C3%A5 and a%cc%8a?

1 answer

Unicode normalization.

0xC3 0xA5 is the UTF-8 encoding of U+00E5, a-with-ring.

0xCC 0x8A is the UTF-8 encoding of U+030A, the combining ring above.

U+00E5 is the composed (Normal Form C) way of writing an a-ring; a plain a followed by U+030A is the decomposed (Normal Form D) way of writing it. å vs å: they should look the same, although they may differ slightly depending on font rendering.
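To make the two forms concrete, here is a minimal sketch using Python's standard `unicodedata` module. The variable names are only illustrative.

```python
import unicodedata

composed = "\u00e5"      # a-with-ring as a single code point (Normal Form C)
decomposed = "a\u030a"   # plain a + combining ring above (Normal Form D)

# The strings render alike but are different code point sequences.
print(composed == decomposed)                                  # False

# Normalizing converts one form into the other.
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
print(unicodedata.normalize("NFC", decomposed) == composed)    # True

# The UTF-8 bytes match what shows up percent-encoded in the two URLs.
print(composed.encode("utf-8").hex())     # c3a5
print(decomposed.encode("utf-8").hex())   # 61cc8a
```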

Normally it doesn't matter which one you have, because sane file systems leave file names untouched. If you save a file named [char U+00E5].txt ( å.txt ), it keeps exactly that name under Windows and Linux.

Macs, on the other hand, are insane. The file system prefers Normal Form D, to the extent that any composed characters you pass to it are converted to decomposed ones. If you create a file called [char U+00E5].txt and immediately list the directory, you will find a file named a[char U+030A].txt . You can still open the file as [char U+00E5].txt on the Mac, because it converts that input to Normal Form D before looking it up, but you cannot get back the exact character sequence you put in: this is a lossy conversion.

So if you save your files on a Mac and then move them to a file system where [char U+00E5].txt and a[char U+030A].txt refer to different files, you will get broken links.
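The mismatch can be reproduced without a Mac. The sketch below is only a simulation: a Python set stands in for a byte-exact directory, and the file name is the one from the question.

```python
import unicodedata
from urllib.parse import quote

name = "\u00e5dn\u00f8y_1_2.jpg"   # "ådnøy_1_2.jpg" as typed into the page

# Name as stored after a Mac decomposed it (Normal Form D).
stored_name = unicodedata.normalize("NFD", name)
# Name as written in the page markup (Normal Form C).
linked_name = unicodedata.normalize("NFC", name)

# Stand-in for a file system that compares names exactly as given.
directory = {stored_name}

print(quote(linked_name))         # %C3%A5dn%C3%B8y_1_2.jpg
print(quote(stored_name))         # a%CC%8Adn%C3%B8y_1_2.jpg
print(linked_name in directory)   # False: the link 404s
print(stored_name in directory)   # True: the listing link works
```

Note that ø has no canonical decomposition, which is why only the å differs between the two URLs.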

Either update the pages to point to the Normal Form D versions of the URLs, or re-upload the files from a file system that doesn't egregiously mangle Unicode characters.
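For the first option, a rough sketch of rewriting a URL so that its path is in Normal Form D might look like this. `to_nfd_url` is a hypothetical helper, not part of WordPress or Apache, and it assumes the path is percent-encoded UTF-8 with no encoded slashes.

```python
import unicodedata
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def to_nfd_url(url: str) -> str:
    """Return url with its path converted to Normal Form D (hypothetical helper)."""
    parts = urlsplit(url)
    # Decode the percent-escapes, decompose, then re-encode as UTF-8.
    path = unicodedata.normalize("NFD", unquote(parts.path))
    return urlunsplit(parts._replace(path=quote(path)))

print(to_nfd_url(
    "http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellg%C3%A5rd-12.jpg"
))
# http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellga%CC%8Ard-12.jpg
```

How the stored URLs are actually updated (database, templates, and so on) depends on the site and is left out here.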

Expect this kind of normalization mismatch to cause other incompatibility problems as well.


Source: https://habr.com/ru/post/924185/