Fixing incorrect UTF-8 encoding

Question


I am in the process of fixing incorrect UTF-8 encoding. I am currently using PHP 5 and MySQL.

There are several cases of bad encodings in my database that print as: ÃƒÂ®

Database collation: utf8_general_ci
PHP sends a proper UTF-8 header
Notepad++ is set to use UTF-8 without BOM
Database management is done in phpMyAdmin
Not all accented characters are broken

I need a function that will map instances of ®®, ,Â, ÃƒÂ¼ and others like them to their accented UTF-8 characters.

+58

php mysql unicode utf-8

Jayrox Aug 28 '09 at 2:14


13 answers

If you have double-encoded UTF-8 characters (various smart quotes, dashes, apostrophes, quotation marks, etc.), in MySQL you can dump the data and then read it back in to fix the broken encoding.

Like this:

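    # Dump with latin1 as the connection charset so the stored bytes come out unchanged.
    # Note: there must be no space between -p and the password.
    mysqldump -h DB_HOST -u DB_USER -pDB_PASSWORD --opt --quote-names \
        --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql

    # Load the same bytes back, this time declaring them as UTF-8.
    mysql -h DB_HOST -u DB_USER -pDB_PASSWORD \
        --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql

This was a 100% fix for my double-encoded UTF-8.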

Source: http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/

+92

jsdalton Dec 16 '10 at 16:05


If you apply utf8_encode() to a string that is already UTF-8, it returns garbled UTF-8 output, and the garbling gets worse each time the string is encoded again.
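A minimal sketch of what goes wrong, assuming the mbstring extension is available. It uses mb_convert_encoding() from ISO-8859-1 to UTF-8, which performs the same conversion as utf8_encode(); the byte strings are illustrative only.

```php
<?php
// "é" correctly encoded as UTF-8 is the two bytes C3 A9.
$utf8 = "\xC3\xA9";

// Treating those bytes as ISO-8859-1 and encoding them again turns
// every byte into its own two-byte UTF-8 sequence.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
$triple = mb_convert_encoding($double, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3a9
echo bin2hex($double), "\n"; // c383c2a9 (displays as "Ã©")
echo bin2hex($triple), "\n"; // eight bytes: the damage doubles each round
```

Comparing bin2hex() output like this is often more reliable than looking at the rendered characters, because the terminal or browser may apply its own decoding.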

I made a toUTF8() function that converts strings to UTF-8.

You do not need to specify what the encoding of your strings is. It can be Latin1 (iso 8859-1), Windows-1252 or UTF8, or a combination of the three.

I used this myself on a feed that mixed encodings within the same string.

Usage:

    $utf8_string = Encoding::toUTF8($mixed_string);
    $latin1_string = Encoding::toLatin1($mixed_string);

My other fixUTF8() function corrects corrupted UTF8 strings if they have been encoded in UTF8 several times.

Usage:

    $utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

    echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

will output:

    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football

Download:

https://github.com/neitanod/forceutf8

+78

Sebastián Grignoli Aug 19 '10 at 11:38


I had a problem with an XML file that had a broken encoding: it declared itself as UTF-8 but contained characters that were not valid UTF-8.
After several rounds of trial and error with mb_convert_encoding(), I managed to fix it with:

 mb_convert_encoding($text, 'Windows-1252', 'UTF-8')
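The call above works because converting from UTF-8 to Windows-1252 maps each garbled character back to the single byte it came from, which removes one layer of double encoding. A cautious sketch of the same idea follows; the helper name undo_one_layer() is hypothetical and the mbstring extension is assumed. It only accepts the conversion when the result is valid UTF-8 and nothing was lost on the way.

```php
<?php
// Hypothetical helper: undo one layer of double encoding, but only when safe.
function undo_one_layer($text)
{
    // Map each UTF-8 character back to a single Windows-1252 byte.
    $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

    // Keep the result only if it is valid UTF-8 and converting it forward
    // again reproduces the input exactly (no character was substituted).
    if (mb_check_encoding($candidate, 'UTF-8')
        && mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') === $text) {
        return $candidate;
    }

    // Otherwise the text was probably not double-encoded; leave it alone.
    return $text;
}

$garbled = "F\xC3\x83\xC2\xA9d\xC3\x83\xC2\xA9ration"; // "Fédération" encoded twice
$fixed = undo_one_layer($garbled);
echo bin2hex($fixed), "\n";
```

Text that is already correct UTF-8 is returned unchanged, so the helper can be run over rows where only some values are broken. Characters that map to undefined Windows-1252 bytes may not round-trip on every PHP version, in which case the input is simply left as it was.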

+11

Celleb Jul 14 '14 at 8:11


As Dan pointed out, you need to convert the data to binary and then convert it to the correct encoding.

For example, for utf8 stored as latin1, the following SQL will fix it:

 UPDATE table SET field = CONVERT( CAST(field AS BINARY) USING utf8) WHERE $broken_field_condition

+10

blueyed Mar 04 '10 at 12:59


I know this is not very elegant, but after it was mentioned that strings can be double-encoded, I made this function:

    function fix_double_encoding($string)
    {
        // Characters that may have been double-encoded (extend the list as needed)
        $utf8_chars = explode(' ', 'À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö');
        $utf8_double_encoded = array();
        foreach ($utf8_chars as $utf8_char) {
            // Apply utf8_encode() twice to build the garbled form to search for
            $utf8_double_encoded[] = utf8_encode(utf8_encode($utf8_char));
        }
        // Replace each garbled sequence with the correct character
        $string = str_replace($utf8_double_encoded, $utf8_chars, $string);
        return $string;
    }

This seems to work just fine for removing the double encoding that I am experiencing. I am probably missing some characters that might be a problem for others. However, for my needs, it works great.

+2

Jayrox Aug 29 '09 at 18:39


The way to do it is to convert to binary and then to the correct encoding.

+2

Dan Nov 24 '09 at 19:09


Another thing to check, which turned out to be my solution, is how the data is returned from your server. In my application, I use PDO to connect from PHP to MySQL. I needed to add a flag to the connection telling it to return the data in UTF-8.

The answer was:

    $dbHandle = new PDO(
        "mysql:host=$dbHost;dbname=$dbName;charset=utf8",
        $dbUser,
        $dbPass,
        array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES 'utf8'")
    );

+1

Luke Madhanga Mar 08 '15 at 17:43


It looks like your utf-8 is being interpreted as iso8859-1 or win-1250 at some point.

When you say, "I have several instances of bad encodings in my database," how did you check this? Through your application, phpMyAdmin, or the command-line client? Are all UTF-8 characters displayed this way, or only some? Is it possible that the data was wrongly converted from ISO 8859-1 to UTF-8 when it was already UTF-8?

0

teambob Aug 28 '09 at 2:58


I had the same problem a long time ago, and it was fixed with:

 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15">

0

Jose De Gouveia Apr 20


I found a solution after several days of searching. My comment will be buried, but anyway ...

I fetch the corrupted data using PHP.
I do not use SET NAMES UTF8.
I apply utf8_decode() to my data.
I update my database with the newly decoded data, still without using SET NAMES UTF8.

and voilà :)

0

David 天宇 Wong Feb 26 '13 at 12:24


This script had a good approach. Converting it to your chosen language should not be too complicated:

http://plasmasturm.org/log/416/

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Encode qw( decode FB_QUIET );

    binmode STDIN, ':bytes';
    binmode STDOUT, ':encoding(UTF-8)';

    my $out;

    while ( <> ) {
        $out = '';
        while ( length ) {
            # consume input string up to the first UTF-8 decode error
            $out .= decode( "utf-8", $_, FB_QUIET );
            # consume one character; all octets are valid Latin-1
            $out .= decode( "iso-8859-1", substr( $_, 0, 1 ), FB_QUIET ) if length;
        }
        print $out;
    }
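A rough PHP sketch of the same idea, for readers who do not use Perl. It is not a line-by-line port: it walks the input byte by byte, keeps every valid UTF-8 sequence, and treats any other byte as Latin-1. The function names are hypothetical and the mbstring extension is assumed.

```php
<?php
// Length a UTF-8 sequence should have, judged by its lead byte (0 = not a lead byte).
function utf8_expected_length($byte)
{
    if ($byte < 0x80) { return 1; }
    if ($byte >= 0xC2 && $byte <= 0xDF) { return 2; }
    if ($byte >= 0xE0 && $byte <= 0xEF) { return 3; }
    if ($byte >= 0xF0 && $byte <= 0xF4) { return 4; }
    return 0;
}

// Convert a string that mixes UTF-8 and Latin-1 bytes into clean UTF-8.
function mixed_to_utf8($bytes)
{
    $out = '';
    $len = strlen($bytes);
    $i = 0;
    while ($i < $len) {
        $n = utf8_expected_length(ord($bytes[$i]));
        $chunk = substr($bytes, $i, $n);
        if ($n > 0 && strlen($chunk) === $n && mb_check_encoding($chunk, 'UTF-8')) {
            // A complete, valid UTF-8 sequence: copy it through unchanged.
            $out .= $chunk;
            $i += $n;
        } else {
            // Not valid UTF-8 here: every single byte is valid Latin-1.
            $out .= mb_convert_encoding($bytes[$i], 'UTF-8', 'ISO-8859-1');
            $i++;
        }
    }
    return $out;
}

// First "é" is a Latin-1 byte (E9), second is UTF-8 (C3 A9).
$mixed = "F\xE9d\xC3\xA9ration";
echo bin2hex(mixed_to_utf8($mixed)), "\n";
```

Like the Perl script, this repairs text that mixes the two encodings; it does not undo double encoding, because double-encoded text is already valid UTF-8.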

0

Erik Aronesty Nov 16 '16 at 14:23


@Sebastián Grignoli, I ran the following example:

    echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football\n");
    echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football\n");
    echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football\n");
    echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football\n");

and got the following result:

    FÃ©dÃ©ration Camerounaise de Football
    FÃ©dÃ©ration Camerounaise de Football
    FÃÃÃ©dÃÃÃ©ration Camerounaise de Football
    FÃ©dÃ©ration Camerounaise de Football

-edit:

The above results were when I wrote to a file, like so:

    fclose(STDOUT);
    $STDOUT = fopen('pathtofile.txt', 'a');
    echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football\n");
    echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football\n");
    echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football\n");
    echo Encoding::fixUTF8("FÃÂÂÂÂ©dÃÂÂÂÂ©ration Camerounaise de Football\n");

although printing to standard output worked fine.

edit 2:

Use print instead of echo when writing to a file; otherwise it will not work.

0

Dmitry Boychev Jan 15 '19 at 17:59


Eli · Accepted Answer · 2009-08-28 17:59

I have had to try to "fix" a number of broken UTF-8 situations in the past, and unfortunately it is never easy and often quite impossible.

Unless you can determine exactly how it was broken, and it was always broken in exactly the same way, it is going to be hard to "undo" the damage.

If you want to try to undo the damage, your best bet is to write some sample code in which you make numerous variations of mb_convert_encoding() calls, to see whether you can find a combination of 'from' and 'to' encodings that fixes your data. In the end, it is often best not to bother fixing the old data because of the pain involved, and instead just fix things going forward.
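A minimal sketch of that kind of trial code, assuming the mbstring extension; the function name try_encoding_pairs() and the list of candidate encodings are illustrative, not a recommendation. It tries every 'from'/'to' pair on a sample and keeps the results that are at least valid UTF-8, so that a human can inspect them.

```php
<?php
// Hypothetical helper: try every from/to pair and collect plausible results.
function try_encoding_pairs($garbled, array $encodings)
{
    $results = array();
    foreach ($encodings as $from) {
        foreach ($encodings as $to) {
            if ($from === $to) {
                continue;
            }
            $candidate = mb_convert_encoding($garbled, $to, $from);
            // Only keep candidates that can still be read as UTF-8.
            if (mb_check_encoding($candidate, 'UTF-8')) {
                $results["$from -> $to"] = $candidate;
            }
        }
    }
    return $results;
}

$sample = "F\xC3\x83\xC2\xA9d"; // "Féd" encoded twice
$results = try_encoding_pairs($sample, array('UTF-8', 'ISO-8859-1', 'Windows-1252'));

foreach ($results as $pair => $text) {
    echo $pair, ': ', bin2hex($text), "\n";
}
```

Being valid UTF-8 does not make a candidate correct (encoding the sample a third time also yields valid UTF-8), so the output still has to be checked by eye against values you know.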

However, before doing any of this, you need to make sure that you fix everything that is causing the problem in the first place. You already mentioned that your database table collation and your editors are set up properly. But there are more places where you need to check that everything is properly UTF-8:

Make sure you are serving your HTML as UTF-8:
- header("Content-Type: text/html; charset=utf-8");
Change your PHP default charset to UTF-8:
- ini_set("default_charset", 'utf-8');
If your database does not ALWAYS talk in UTF-8, you may need to tell it on each connection to use UTF-8. In MySQL you do this by issuing:
- charset utf8
You may need to tell your web server to always try to talk in UTF-8. In Apache the directive is:
- AddDefaultCharset UTF-8
Finally, you need to ALWAYS make sure that you are using PHP functions that are properly UTF-8 compliant. This means always using the mb_* multibyte-aware string functions. It also means that when calling functions such as htmlspecialchars(), you pass the appropriate 'utf-8' charset parameter at the end so that the text is not encoded incorrectly.
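A short illustration of the last point, assuming the mbstring extension is installed. The sample string is written with byte escapes so that the example does not depend on how the source file itself is saved.

```php
<?php
$word = "F\xC3\xA9d\xC3\xA9ration"; // "Fédération" as UTF-8 bytes

// Byte-oriented function: counts bytes, not characters.
$byteLength = strlen($word);                    // 12

// Multibyte-aware equivalents with the encoding stated explicitly.
$charLength = mb_strlen($word, 'UTF-8');        // 10
$upper      = mb_strtoupper($word, 'UTF-8');    // "FÉDÉRATION"
$prefix     = mb_substr($word, 0, 2, 'UTF-8');  // "Fé"

// Pass the charset explicitly so the accented bytes are left intact.
$escaped = htmlspecialchars("<b>$word</b>", ENT_QUOTES, 'UTF-8');

echo $byteLength, ' ', $charLength, "\n";
echo $upper, "\n", $prefix, "\n", $escaped, "\n";
```

Using substr($word, 0, 2) here instead would cut the "é" in half and produce an invalid byte sequence, which is one common way new garbage gets into otherwise clean data.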

If you miss a step anywhere in the process, the encoding can get mangled and problems arise. Once you get into the "groove" of doing UTF-8, though, all of this becomes second nature. And, of course, PHP 6 is supposed to be fully Unicode compliant from the get-go, which will make a lot of this easier (hopefully).
