PHP scandir() and htmlentities(): problems with encoding and/or special characters

I use jqueryFileTree to show a list of directories on a server with links to download files in a directory. I recently ran into a problem with files that contain special characters:

  • test.pdf: works great
  • tést.pdf: does not work (note the é, with an acute accent, in the file name).

When debugging the jqueryFileTree PHP connector, I see that it calls scandir() on the directory passed through $_GET and then iterates over each file or directory in it. Before putting the file name into a URL, the script runs htmlentities() on the file name, which seems correct. The problem is that this call to htmlentities($file) simply returns an empty string, which according to the PHP docs can happen when the input string contains a byte sequence that is invalid in the given encoding. I tried passing the encoding explicitly by calling:

$file = htmlentities($file,ENT_QUOTES,'UTF-8'); 

But it also returns an empty string.

If I instead call $file = htmlentities($file, ENT_IGNORE, 'UTF-8'); the accented character is simply dropped (so tést.pdf becomes tst.pdf).

When debugging my PHP script with xdebug, I see that the source string contains an unknown character (displayed as a placeholder glyph).
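The symptoms described above can be reproduced without the file server. This is only a sketch under one assumption: that the name comes back from scandir() in a single-byte Windows encoding, so that é is the lone byte 0xE9. The sample string is hypothetical and stands in for the real scandir() result.

```php
<?php
// Assumption: "é" arrives as the single Windows-1252 byte 0xE9,
// not as the UTF-8 pair 0xC3 0xA9.
$file = "t\xE9st.pdf";

// Dump the raw bytes; useful when a debugger only shows a placeholder glyph.
$hex = bin2hex($file);

// Invalid UTF-8 input without ENT_IGNORE or ENT_SUBSTITUTE yields an empty string.
$strict = htmlentities($file, ENT_QUOTES, 'UTF-8');

// ENT_IGNORE silently discards the invalid byte instead.
$ignored = htmlentities($file, ENT_IGNORE, 'UTF-8');

var_dump($hex, $strict, $ignored);
```

Running bin2hex() on the real file name should show whether the accented character is one byte (a single-byte code page) or two (UTF-8).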

So I am at a loss as to how to solve this. Any help would be appreciated.

FYI:

  • My page wrapper is UTF-8 (indicated in metadata)
  • The files are stored on a Windows 2003 file server, and scandir() is called with a UNC path (for example, //fileserver/sharename/sourcedir)
  • The default encoding in my php.ini is set to UTF-8
  • Web server and PHP 5.4.26 run on a Windows 2008 R2 server
php utf-8 character-encoding
1 answer

My best guess is that the file name itself is not encoded as UTF-8, or at least that scandir() does not return it that way.

Maybe mb_detect_encoding() can shed some light?

 var_dump(mb_detect_encoding($filename)); 
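Since mb_detect_encoding() only guesses, a narrower check may be more reliable: whether the name is valid UTF-8 at all. A minimal sketch using mb_check_encoding() (requires the mbstring extension); the sample byte strings are hypothetical and stand in for what scandir() returns:

```php
<?php
// Hypothetical sample: "tést.pdf" with "é" as the single byte 0xE9.
$filename = "t\xE9st.pdf";

// false means the bytes are not valid UTF-8, so another encoding is in play.
$isUtf8 = mb_check_encoding($filename, 'UTF-8');
var_dump($isUtf8);

// The same name encoded as UTF-8 (0xC3 0xA9 for "é") passes the check.
$utf8Name = "t\xC3\xA9st.pdf";
var_dump(mb_check_encoding($utf8Name, 'UTF-8'));
```

This does not tell you which encoding the name uses, only that it is not UTF-8; single-byte encodings such as CP1252 and ISO-8859-1 cannot be told apart from the bytes alone.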

If not, try guessing the encoding (CP1252 or ISO-8859-1 would be my first guess), convert the name to UTF-8, and see whether the output is valid:

 var_dump(mb_convert_encoding($filename, 'UTF-8', 'Windows-1252'));
 var_dump(mb_convert_encoding($filename, 'UTF-8', 'ISO-8859-1'));
 var_dump(mb_convert_encoding($filename, 'UTF-8', 'ISO-8859-15'));

Or using iconv() :

 var_dump(iconv('WINDOWS-1252', 'UTF-8', $filename));
 var_dump(iconv('ISO-8859-1', 'UTF-8', $filename));
 var_dump(iconv('ISO-8859-15', 'UTF-8', $filename));

Then, once you have figured out which encoding is actually in use, your code should look something like this (assuming CP1252):

 $filename = htmlentities(mb_convert_encoding($filename, 'UTF-8', 'Windows-1252'), ENT_QUOTES, 'UTF-8'); 
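If some names might already be valid UTF-8, converting them a second time would garble them. One way to guard against that is to convert only when the UTF-8 check fails. The sketch below assumes that every non-UTF-8 name is Windows-1252; the helper name filenameToHtml and the sample names are hypothetical, not part of jqueryFileTree.

```php
<?php
/**
 * Hypothetical helper: return an HTML-safe version of a file name.
 * Assumption: any name that is not valid UTF-8 is in $fallbackEncoding.
 */
function filenameToHtml($filename, $fallbackEncoding = 'Windows-1252')
{
    // Convert only when the bytes are not already valid UTF-8.
    if (!mb_check_encoding($filename, 'UTF-8')) {
        $filename = mb_convert_encoding($filename, 'UTF-8', $fallbackEncoding);
    }
    return htmlentities($filename, ENT_QUOTES, 'UTF-8');
}

// Sample names standing in for the result of scandir().
$names = array("test.pdf", "t\xE9st.pdf", "t\xC3\xA9st.pdf");
foreach ($names as $name) {
    echo filenameToHtml($name), "\n";
}
```

The check is a heuristic: a Windows-1252 name whose bytes happen to form valid UTF-8 would be left unconverted, though that is unlikely for ordinary accented file names.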