PHP character encoding when reading a CSV file with fgets

I have a website that receives a CSV file over FTP once a month. For many years it was an ASCII file. Now I get UTF-8 one month, UTF-16BE the next, and UTF-16LE the month after that. Maybe I will get UTF-32 next month. fgets returns the byte order mark (BOM) at the beginning of the UTF files.

How can I get PHP to recognize the character encoding automatically? I tried mb_detect_encoding, and it returned ASCII regardless of the file type. I then changed my code to read the byte order mark and pass the encoding explicitly to mb_convert_encoding. This worked until the latest file, which is UTF-16LE. For this file, the first line is read correctly, and all subsequent lines are displayed as question marks ("?"). What am I doing wrong?

    $fhandle = fopen( $file_in, "r" );
    if ( $fhandle === false ) {
        echo "<p class=redbold>Error opening file $file_in.</p>";
        die();
    }
    $i = 0;
    while ( ( $line = fgets( $fhandle ) ) !== false ) {
        $i++;
        // Detect encoding on first line. Actual text always begins with string "Document"
        if ( $i == 1 ) {
            $line_start = substr( $line, 0, 4 );
            $line_start_hex = bin2hex( $line_start );
            $utf16_start = 'fffe4400';
            $utf8_start = 'efbbbf44';
            if ( strcmp( $line_start, 'Docu' ) == 0 ) {
                $char_encoding = 'ASCII';
            } elseif ( strcmp( $line_start_hex, 'efbbbf44' ) == 0 ) {
                $char_encoding = 'UTF-8';
                $line = substr( $line, 3 );
            } elseif ( strcmp( $line_start_hex, 'fffe4400' ) == 0 ) {
                $char_encoding = 'UTF-16LE';
                $line = substr( $line, 2 );
            } elseif ( strcmp( $line_start_hex, 'feff4400' ) == 0 ) {
                $char_encoding = 'UTF-16BE';
                $line = substr( $line, 2 );
            } else {
                echo "<p class=redbold>Error, unknown character encoding. Line =<br>", $line_start_hex, '</p>';
                require( '../footer.php' );
                die();
            }
            echo "<p>char_encoding = $char_encoding</p>";
        }
        // Convert UTF
        if ( $char_encoding != 'ASCII' ) {
            $line = mb_convert_encoding( $line, 'ASCII', $char_encoding );
        }
        echo '<p>';
        var_dump( $line );
        echo '</p>';
    }

Output:

    char_encoding = UTF-16LE
    string(101) "DocumentNumber,RecordedTS,Title,PageCount,City,TransTaxAccountCode,TotalTransferTax,Description,Name "
    string(83) "???????????????????????????????????????????????????????????????????????????????????"
    string(88) "????????????????????????????????????????????????????????????????????????????????????????"
    string(84) "????????????????????????????????????????????????????????????????????????????????????"
    string(80) "????????????????????????????????????????????????????????????????????????????????"
2 answers

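Pass the candidate encodings explicitly, in order, to mb_detect_encoding and set its strict parameter. Also read the file with file_get_contents: if the file is UTF-16LE, fgets will mangle it, because fgets splits on the single byte 0x0A, while a UTF-16LE newline is the two bytes 0A 00.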

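    <?php
    header( "Content-Type: text/html; charset=utf-8" );

    // Read the whole file so that multi-byte line endings stay intact.
    $input = file_get_contents( $file_in );

    // Strict detection, trying the more restrictive encodings first.
    $encoding = mb_detect_encoding(
        $input,
        array( "UTF-8", "UTF-32", "UTF-32BE", "UTF-32LE", "UTF-16", "UTF-16BE", "UTF-16LE" ),
        TRUE
    );

    if ( $encoding !== "UTF-8" ) {
        $input = mb_convert_encoding( $input, "UTF-8", $encoding );
    }

    echo "<p>$encoding</p>";

    // Split into lines only after conversion, when a newline is a single byte again.
    foreach ( explode( PHP_EOL, $input ) as $line ) {
        var_dump( $line );
    }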
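To see why fgets breaks on UTF-16LE input, here is a minimal sketch that writes a made-up two-line sample into a php://memory stream and reads it back with fgets. The sample text and variable names are assumptions for illustration, not taken from the real file.

```php
<?php
// Two-line sample encoded as UTF-16LE (no BOM, to keep the bytes easy to read).
$utf16 = mb_convert_encoding("ab\ncd\n", 'UTF-16LE', 'UTF-8');

// Put the bytes in an in-memory stream so fgets() can read them like a file.
$stream = fopen('php://memory', 'w+');
fwrite($stream, $utf16);
rewind($stream);

// fgets() stops right after the byte 0x0A, which is only the first half
// of the UTF-16LE newline (0A 00).
$first  = fgets($stream);
$second = fgets($stream);
fclose($stream);

echo bin2hex($first), PHP_EOL;   // 610062000a
echo bin2hex($second), PHP_EOL;  // 00630064000a
```

The second chunk starts with the leftover 0x00, so every 16-bit unit after it is shifted by one byte. Decoding that chunk as UTF-16LE yields characters outside ASCII, which is consistent with the rows of question marks in the question's output.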

The order of the candidate list is important because UTF-8 and UTF-32 are more restrictive, while UTF-16 is extremely permissive: almost any random byte sequence of even length is valid UTF-16.
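Since the files in the question start with a byte order mark, checking the BOM on the whole file contents sidesteps the guessing problem for those files. The following is only a sketch: detectBom is a hypothetical helper, and treating BOM-less input as UTF-8 is an assumption that happens to fit plain ASCII files.

```php
<?php
/**
 * Hypothetical helper: return [encoding, BOM length] for a leading BOM,
 * or null when the data does not start with one.
 */
function detectBom(string $bytes): ?array
{
    // UTF-32LE must be tested before UTF-16LE: its BOM (FF FE 00 00)
    // begins with the UTF-16LE BOM (FF FE). A UTF-16LE file whose first
    // character is U+0000 would be misread, which is rare for text files.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return [$encoding, strlen($bom)];
        }
    }
    return null;
}

// Made-up sample standing in for file_get_contents( $file_in ).
$sample = "\xFF\xFE" . mb_convert_encoding("Document\n", 'UTF-16LE', 'UTF-8');

// Assumption: no BOM means ASCII or UTF-8, so no conversion is needed.
[$encoding, $bomLength] = detectBom($sample) ?? ['UTF-8', 0];

// Strip the BOM, then convert everything to UTF-8 in one pass.
$text = mb_convert_encoding(substr($sample, $bomLength), 'UTF-8', $encoding);

var_dump($encoding, $text);
```

A BOM check like this can run first, with mb_detect_encoding kept as a fallback for files that arrive without a BOM.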

The only way to preserve all the information is to convert to a Unicode encoding such as UTF-8, not to ASCII.
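A small sketch of what is lost when the target is ASCII. The sample name is made up, and the substitute character is set explicitly so the result does not depend on the server's mbstring configuration.

```php
<?php
// Made-up sample containing two non-ASCII letters (ë and ü), written
// with escapes so the example does not depend on this file's encoding.
$original = "Zo\u{EB} M\u{FC}ller";

// Replace unconvertible characters with "?" (0x3F), the usual default.
mb_substitute_character(0x3F);

// Converting to ASCII cannot represent ë or ü, so they are replaced.
$ascii = mb_convert_encoding($original, 'ASCII', 'UTF-8');

// A round trip through another Unicode encoding keeps every character.
$utf16     = mb_convert_encoding($original, 'UTF-16LE', 'UTF-8');
$roundTrip = mb_convert_encoding($utf16, 'UTF-8', 'UTF-16LE');

var_dump($ascii);                    // string(10) "Zo? M?ller"
var_dump($roundTrip === $original);  // bool(true)
```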


My suggestion is to convert everything to either UTF-8 or ASCII (I can't tell from the code you posted which of the two you are aiming for):

 $utf8Line = iconv( mb_detect_encoding( $line ), 'UTF-8', $line ); 

or...

 $asciiLine = iconv( mb_detect_encoding( $line ), 'ASCII', $line ); 

You can let mb_detect_encoding do the heavy lifting for you.
