fgetcsv drops the first character of a string if it is an umlaut

I am importing content from a CSV file created in Excel into an XML document, for example:

    $csv = fopen($csvfile, 'r');
    $words = array();
    while (($pair = fgetcsv($csv)) !== FALSE) {
        array_push($words, array('en' => $pair[0], 'de' => $pair[1]));
    }

The data are pairs of English / German expressions.

I insert these values into the XML structure and output the XML as follows:

    $dictionary = new SimpleXMLElement('<dictionary></dictionary>');
    // do things
    $dom = dom_import_simplexml($dictionary)->ownerDocument;
    $dom->formatOutput = true;
    // <3 UTF-8: the charset belongs in Content-Type (Content-Encoding is for compression)
    // Headers set to correct mime-type for XML output!!!!
    header('Content-Type: text/xml; charset=utf-8');
    echo $dom->saveXML();

This works great, but I ran into one really weird problem. When the first letter of a string is an umlaut (for example, in Österreich or Ägypten), the character is dropped, resulting in gypten or sterreich. If the umlaut is in the middle of a string (Russische Föderation), it comes through correctly. The same goes for characters such as ß or é.

All files are encoded in UTF-8 and served in UTF-8.

This seems rather strange and looks like a bug, but maybe I am missing something; there are a lot of smart people here.

+8
xml php csv character-encoding diacritics
5 answers

So this seems to be a bug in fgetcsv.

I now parse the CSV data myself. It is a little cumbersome, but it works, and I have no encoding problems at all.

This is a (not yet optimized) version of what I am doing:

    $rawCSV = file_get_contents($csvfile);
    // split on line breaks in all operating systems: http://stackoverflow.com/a/7498886/797194
    $lines = preg_split('/$\R?^/m', $rawCSV);
    foreach ($lines as $line) {
        array_push($words, getCSVValues($line));
    }

getCSVValues comes from here and is needed to handle CSV lines like this one, where a quoted field contains commas:

 "I'm a string, what should I do when I need commas?",Howdy there 

It looks like this:

    function getCSVValues($string, $separator=","){
        $elements = explode($separator, $string);
        for ($i = 0; $i < count($elements); $i++) {
            $nquotes = substr_count($elements[$i], '"');
            if ($nquotes %2 == 1) {
                for ($j = $i+1; $j < count($elements); $j++) {
                    if (substr_count($elements[$j], '"') %2 == 1) { // Look for an odd-number of quotes
                        // Put the quoted string pieces back together again
                        array_splice($elements, $i, $j-$i+1,
                            implode($separator, array_slice($elements, $i, $j-$i+1)));
                        break;
                    }
                }
            }
            if ($nquotes > 0) {
                // Remove first and last quotes, then merge pairs of quotes
                $qstr =& $elements[$i];
                $qstr = substr_replace($qstr, '', strpos($qstr, '"'), 1);
                $qstr = substr_replace($qstr, '', strrpos($qstr, '"'), 1);
                $qstr = str_replace('""', '"', $qstr);
            }
        }
        return $elements;
    }

It is quite a workaround, but it seems to work fine.

EDIT:

There is also a filed bug for this; apparently the behavior depends on the locale settings.

+4

If the string comes from Excel (I had problems with the letter ø disappearing when it was at the beginning of a string), this fixes it:

setlocale (LC_ALL, 'en_US.ISO-8859-1');
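
For UTF-8 input, as described in the question, the same idea could be applied with a UTF-8 locale instead. The following is only a sketch under assumptions: readCsvUtf8() is a hypothetical helper name, the candidate locale names are guesses that may not be installed on a given system, and whether the locale matters at all depends on the PHP version in use.

```php
<?php
// Sketch: read a UTF-8 CSV file while a UTF-8 LC_CTYPE locale is active.
// readCsvUtf8() and the default locale names are illustrative assumptions.
function readCsvUtf8(string $path, array $locales = ['en_US.UTF-8', 'de_DE.UTF-8', 'C.UTF-8']): array
{
    // Passing "0" only queries the current setting without changing it.
    $previous = setlocale(LC_CTYPE, '0');
    // setlocale() tries each name in turn; it returns false if none is
    // installed, in which case the current locale simply stays in effect.
    setlocale(LC_CTYPE, $locales);

    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $rows = [];
    try {
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            if ($row === [null]) {
                continue; // fgetcsv() reports a blank line as [null]
            }
            $rows[] = $row;
        }
    } finally {
        fclose($handle);
        if ($previous !== false) {
            setlocale(LC_CTYPE, $previous); // restore the original locale
        }
    }
    return $rows;
}
```

Only LC_CTYPE is changed here rather than LC_ALL, so number and date formatting elsewhere in the script stay untouched, and the previous locale is restored afterwards.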

+3

If the other umlauts in the middle look normal, this is not a basic encoding problem. The fact that it happens at the beginning of a line probably points to some incompatibility with the newline characters. Perhaps the CSV was created with a different newline convention.

This happens when moving files between different operating systems:

  • Windows: \r\n (characters 13 and 10)
  • Linux: \n (character 10)
  • Classic Mac OS (before OS X): \r (character 13)

If I were you, I would definitely check which kind of newline the file uses.

If you are on Linux, run hexdump -C filename | more and check the document.

If this is the case, you can convert the newline characters with a sed expression.
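
The same check can also be done from PHP rather than with hexdump and sed. This is a minimal sketch; countNewlines() and normalizeNewlines() are hypothetical helper names, and the approach assumes the whole file fits in memory.

```php
<?php
// Count how often each newline convention occurs in a string.
function countNewlines(string $data): array
{
    $crlf = substr_count($data, "\r\n");
    return [
        'crlf' => $crlf,                              // Windows
        'cr'   => substr_count($data, "\r") - $crlf,  // lone \r
        'lf'   => substr_count($data, "\n") - $crlf,  // lone \n
    ];
}

// Convert every newline convention to a plain \n.
// str_replace() applies the replacements in order, so \r\n is handled
// before any remaining lone \r.
function normalizeNewlines(string $data): string
{
    return str_replace(["\r\n", "\r"], "\n", $data);
}
```

Running countNewlines(file_get_contents($csvfile)) shows at a glance whether the file mixes conventions; if it does, the normalized string can be written back or parsed directly.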

Hope this helps!

+2

A slightly simpler workaround (but rather dirty):

    // 1. replace delimiter in input string with delimiter + some constant
    $dataLine = str_replace($this->fieldDelimiter, $this->fieldDelimiter . $this->bugFixer, $dataLine);
    // 2. parse
    $parsedLine = str_getcsv($dataLine, $this->fieldDelimiter);
    // 3. remove the constant from resulting strings
    foreach ($parsedLine as $i => $parsedField) {
        $parsedLine[$i] = str_replace($this->bugFixer, '', $parsedField);
    }

+2

There may be some kind of utf8_encode() problem. This comment on the documentation page indicates that encoding an umlaut that is already UTF-8 encoded can cause problems.

Maybe test whether the data is already UTF-8 encoded, using mb_detect_encoding().
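
Such a guard could look roughly like the sketch below. It uses mb_check_encoding(), which validates against one specific encoding, instead of mb_detect_encoding(), which guesses among several. The helper name ensureUtf8() is hypothetical, the fallback source encoding ISO-8859-1 is an assumption about the input, and the mbstring extension must be available.

```php
<?php
// Convert a string to UTF-8 only if it is not valid UTF-8 already,
// which avoids double-encoding characters such as umlauts.
// Assumption: anything that is not valid UTF-8 is ISO-8859-1.
function ensureUtf8(string $value, string $fallbackEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($value, 'UTF-8', $fallbackEncoding);
}
```

Applied to each field returned by the CSV parser, this leaves correctly encoded values alone and converts only the ones that fail the UTF-8 check.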

0