I have a website that receives a CSV file via FTP once a month. For many years it was an ASCII file. Now I get UTF-8 one month, UTF-16BE the next, and UTF-16LE the month after that. Maybe I will get UTF-32 next month. fgets() returns the byte order mark (BOM) at the beginning of the UTF files. How can I get PHP to recognize the character encoding automatically? I tried mb_detect_encoding, and it returned ASCII regardless of the file type. I changed my code to read the BOM and pass the character encoding explicitly to mb_convert_encoding. This worked until the latest file, which is UTF-16LE. For this file, the first line is read correctly, but all subsequent lines are displayed as question marks ("?"). What am I doing wrong?
$fhandle = fopen( $file_in, "r" );
if ( $fhandle === false ) {
    echo "<p class=redbold>Error opening file $file_in.</p>";
    die();
}
$i = 0;
while( ( $line = fgets( $fhandle ) ) !== false ) {
    $i++;
    // Detect encoding on first line. Actual text always begins with string "Document"
    if ( $i == 1 ) {
        $line_start = substr( $line, 0, 4 );
        $line_start_hex = bin2hex( $line_start );
        $utf16_start = 'fffe4400';
        $utf8_start = 'efbbbf44';
        if ( strcmp( $line_start, 'Docu' ) == 0 ) {
            $char_encoding = 'ASCII';
        } elseif ( strcmp( $line_start_hex, 'efbbbf44' ) == 0 ) {
            $char_encoding = 'UTF-8';
            $line = substr( $line, 3 );
        } elseif ( strcmp( $line_start_hex, 'fffe4400' ) == 0 ) {
            $char_encoding = 'UTF-16LE';
            $line = substr( $line, 2 );
        } elseif ( strcmp( $line_start_hex, 'feff4400' ) == 0 ) {
            $char_encoding = 'UTF-16BE';
            $line = substr( $line, 2 );
        } else {
            echo "<p class=redbold>Error, unknown character encoding. Line =<br>", $line_start_hex, '</p>';
            require( '../footer.php' );
            die();
        }
        echo "<p>char_encoding = $char_encoding</p>";
    }
    // Convert UTF
    if ( $char_encoding != 'ASCII' ) {
        $line = mb_convert_encoding( $line, 'ASCII', $char_encoding);
    }
    echo '<p>';
    var_dump( $line );
    echo '</p>';
}
Output:
char_encoding = UTF-16LE
string(101) "DocumentNumber,RecordedTS,Title,PageCount,City,TransTaxAccountCode,TotalTransferTax,Description,Name "
string(83) "???????????????????????????????????????????????????????????????????????????????????"
string(88) "????????????????????????????????????????????????????????????????????????????????????????"
string(84) "????????????????????????????????????????????????????????????????????????????????????"
string(80) "????????????????????????????????????????????????????????????????????????????????"
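A likely explanation for the question marks: fgets() splits on the single byte 0x0A, but in UTF-16LE a newline is the two bytes 0A 00. The trailing 00 therefore lands at the start of the next line, so every line after the first is shifted by one byte and no longer decodes as UTF-16LE. The sketch below avoids this by converting the whole buffer before splitting it into lines. The helper name decode_with_bom and the in-memory sample data are hypothetical, not from the code above, and the sketch converts to UTF-8 rather than ASCII, which is an assumption about what the rest of the site can handle.

```php
<?php
// Hypothetical helper (not part of the original code): detect a BOM,
// strip it, and return the whole buffer converted to UTF-8.
function decode_with_bom(string $raw): string
{
    // Order matters: the UTF-32LE BOM (FF FE 00 00) begins with the
    // UTF-16LE BOM (FF FE), so the 4-byte BOMs are checked first.
    // Assumption: the text never starts with U+0000, which would make a
    // UTF-16LE file look like UTF-32LE.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (strncmp($raw, $bom, strlen($bom)) === 0) {
            $body = substr($raw, strlen($bom));
            return $encoding === 'UTF-8'
                ? $body
                : mb_convert_encoding($body, 'UTF-8', $encoding);
        }
    }
    // No BOM found: assume the bytes are already ASCII or UTF-8.
    return $raw;
}

// In-memory stand-in for a UTF-16LE file with a BOM (made-up sample data).
$sample = "\xFF\xFE" . mb_convert_encoding("Document,Name\r\n1,Ann\r\n", 'UTF-16LE', 'UTF-8');

// Convert first, then split on line endings in the converted text.
$text  = decode_with_bom($sample);
$lines = preg_split('/\r\n|\n|\r/', rtrim($text, "\r\n"));

var_dump($lines);
```

With a real file, $sample would instead come from file_get_contents( $file_in ), which should be fine for a monthly CSV that fits in memory. For very large files, a stream filter or reading in even-sized chunks would be needed instead.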