How to detect and fix character encoding in a MySQL database via PHP?

I have a database containing people's names and other data in French, which means it uses characters like é, è, ö, û, etc. It holds about 3,000 entries.

Apparently, the data was sometimes encoded with utf8_encode() before being stored, and sometimes not. The result is inconsistent: in some places the characters display correctly, in others they do not.

At first, I tried to track down every place in the user interface where the problem occurs and apply utf8_decode() where necessary, but that is really not a practical solution.

I did some testing, and the first conclusion is that utf8_encode() should not be used here at all, so I would rather remove it and just work in UTF-8 everywhere: at the browser, middleware and database levels. Therefore, I need to clean up the database by converting all the incorrectly stored data to its correct form.

Question: is it possible to write a PHP function that checks whether a UTF-8 string is correctly encoded (never passed through utf8_encode()) or not (passed through utf8_encode()), and, in the latter case, restores it to its original state?

In other words: I would like to know how to distinguish UTF-8 content that has been through utf8_encode() from UTF-8 content that has not.

** UPDATE: EXAMPLE **

Here is a good example: take a string full of special characters, then make a copy of it and run the copy through utf8_encode(). The function I am dreaming of takes both strings, leaves the first one untouched, and turns the second one back into the first.

I tried this:

    $loc_fr = setlocale(LC_ALL, 'fr_BE.UTF8', 'fr_BE@euro', 'fr_BE', 'fr', 'fra', 'fr_FR');

    $str1 = "éèöûêïà ";
    $str2 = utf8_encode($str1);

    function convert_charset($str) {
        $charset = mb_detect_encoding($str);
        if ($charset == "UTF-8") {
            return utf8_decode($str);
        } else {
            return $str;
        }
    }

    function correctString($str) {
        echo "\nbefore: $str";
        $str = convert_charset($str);
        echo "\nafter: $str";
    }

    correctString($str1);
    echo('<hr/>' . "\n");
    correctString($str2);

And it gives me:

    before: éèöûêïà
    after:
    before: Ã©Ã¨Ã¶Ã»ÃªÃ¯Ã 
    after: éèöûêïà

Thanks,

Alex

+6
php mysql special-characters character-encoding
5 answers

It is not entirely clear from the question through which character-encoding lens you are currently viewing the data (that depends on your text editor settings, browser headers, database configuration, etc.) or which encoding transformations the data has been through. It is possible that, for example, adjusting the database configuration will fix everything, which would be much better than making piecemeal changes to the data.

It looks like this might be a UTF-8 double-encoding problem. If so, both the original and the corrupted data are valid UTF-8, so detecting the encoding will not give you the information you need. The approach in this case requires making assumptions about which characters can reasonably appear in your data: as far as PHP and MySQL are concerned, "Ã©" is perfectly legal UTF-8, so you have to decide, based on what you know about the data and its authors, that it must be corrupted. These are risky assumptions to make if you are just the technician. Fortunately, since you know the data is in French and there are only 3,000 entries, you can probably make them.
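To make the point about detection concrete, here is a minimal sketch. It assumes the mbstring extension is loaded, and it uses mb_convert_encoding() from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode(), which performs the same conversion. The variable names are only illustrative.

```php
<?php
// Assumption: the mbstring extension is available.

// "é" correctly encoded as UTF-8 (bytes C3 A9), written with escapes so the
// result does not depend on the encoding of this source file.
$original = "\xC3\xA9";

// Same conversion as utf8_encode(): every byte is read as an ISO-8859-1
// character and re-encoded as UTF-8, giving the double-encoded "Ã©".
$double = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo bin2hex($original), "\n"; // prints "c3a9"
echo bin2hex($double), "\n";   // prints "c383c2a9"

// Both strings pass a UTF-8 validity check, so the check cannot tell them apart.
var_dump(mb_check_encoding($original, 'UTF-8')); // bool(true)
var_dump(mb_check_encoding($double, 'UTF-8'));   // bool(true)
```

Since both values are valid UTF-8, any decision about which one is "wrong" has to come from knowledge of the expected content, not from the bytes alone.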

Below is a script that you can adapt, first to check your data, then to fix it, and finally to check it again. All it does is treat each string as UTF-8, break it into characters, and compare the characters against a whitelist of expected French characters. It flags a problem if the string is not valid UTF-8 or contains characters that are not normally expected in French, for example:

 PROBABLY OK Côte d'Azur HAS NON-WHITELISTED CHAR Côte d'Azur 195,180 ô NON-UTF8 C e d'Azur 

Here is the script. You need to download the Unicode functions it depends on from http://hsivonen.iki.fi/php-utf8/

    <?php
    // Download from http://hsivonen.iki.fi/php-utf8/
    require "php-utf8/utf8.inc";

    $my_french_whitelist = array_merge(
        range(0, 127), // throw in all the lower ASCII chars
        array(
            0xE8, // small e-grave
            0xE9, // small e-acute
            0xF4, // small o-circumflex
            //... Will need to add other accented chars,
            // Euro sign, and whatever other chars
            // are normally expected in the data.
        )
    );

    // NB, whether this string literal is in utf8
    // depends on the encoding of the text editor
    // used to write the code
    $str1 = "Côte d'Azur";
    $test_data = array(
        $str1,
        utf8_encode($str1),
        utf8_decode($str1),
    );

    foreach ($test_data as $str) {
        $questionable_chars = non_whitelisted(
            $my_french_whitelist,
            $str
        );
        if ($questionable_chars === true) {
            p("NON-UTF8", $str);
        } else if ($questionable_chars) {
            p(
                "HAS NON-WHITELISTED CHAR",
                $str,
                implode(",", $questionable_chars),
                unicodeToUtf8($questionable_chars)
            );
        } else {
            p("PROBABLY OK", $str);
        }
    }

    function non_whitelisted($whitelist, $utf8_str) {
        $codepoints = utf8ToUnicode($utf8_str);
        if ($codepoints === false) {
            // has non-utf8 char
            return true;
        }
        return array_diff(
            array_unique($codepoints),
            $whitelist
        );
    }

    function p() {
        $args = func_get_args();
        echo implode("\t", $args), "\n";
    }
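
The script above only reports suspect strings. For the fixing step, one possible sketch is shown below. It is not part of the php-utf8 library; it assumes the mbstring extension is loaded, and the function name undo_double_encoding is hypothetical. It relies on the same kind of assumption discussed above: a string that is still valid UTF-8 after one level of encoding is removed was probably double-encoded.

```php
<?php
// Assumption: the mbstring extension is available.
// Hypothetical helper: removes ONE level of utf8_encode()-style double encoding
// when the string looks double-encoded, and returns it unchanged otherwise.
function undo_double_encoding(string $value): string
{
    // Not valid UTF-8 at all: a different problem, leave it alone.
    if (!mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    // utf8_encode() can only produce code points up to U+00FF, so any
    // character above that range means the string was not double-encoded.
    if (preg_match('/[\x{100}-\x{10FFFF}]/u', $value)) {
        return $value;
    }
    // Reverse the ISO-8859-1 -> UTF-8 step.
    $candidate = mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
    // Accept the candidate only if it changed and is itself valid UTF-8.
    if ($candidate !== $value && mb_check_encoding($candidate, 'UTF-8')) {
        return $candidate;
    }
    return $value;
}

// "é" as UTF-8 stays as it is; the double-encoded form is repaired.
echo bin2hex(undo_double_encoding("\xC3\xA9")), "\n";         // prints "c3a9"
echo bin2hex(undo_double_encoding("\xC3\x83\xC2\xA9")), "\n"; // prints "c3a9"
```

A value that legitimately contains a sequence such as "Ã©" would be altered by this heuristic, so it is worth running the check script before and after, and reviewing the changed rows by hand.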
+6

I think you may be taking a more complicated approach than necessary. A few weeks ago I received a Bulgarian database that was dynamically encoded in the database, but when I moved it to another database the text came out garbled.

The way I solved it was to dump the database, set the new database to a utf8 collation, and then import the data as binary. This automatically converted everything to utf8 and no longer gave me garbled characters.

This was in MySQL.

+2

When you connect to the database, remember to always call mysql_set_charset('utf8', $db_connection);

It fixes everything; it solved all my problems.

See this: http://phpanswer.com/store-french-characters-into-mysql-db-and-display/
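
The mysql_* functions were removed in PHP 7, so on current PHP versions the same idea has to be expressed with mysqli or PDO. A minimal sketch with mysqli follows; the function name open_utf8_connection and its parameters are hypothetical, and the charset name 'utf8' is kept from the call above (MySQL's 'utf8mb4' is the variant that covers all of Unicode).

```php
<?php
// Hypothetical helper: opens a mysqli connection and sets the connection
// character set immediately, before any query is sent.
function open_utf8_connection(string $host, string $user, string $password, string $database): mysqli
{
    $db = new mysqli($host, $user, $password, $database);
    if ($db->connect_error) {
        throw new RuntimeException('Connection failed: ' . $db->connect_error);
    }
    // mysqli equivalent of mysql_set_charset('utf8', $db_connection)
    if (!$db->set_charset('utf8')) {
        throw new RuntimeException('Could not set charset: ' . $db->error);
    }
    return $db;
}
```

Setting the connection charset only affects how new data travels between PHP and MySQL; rows that were already stored double-encoded still need to be repaired separately.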

+2

As you said, your data was only sometimes passed through utf8_encode(), so it is encoded either as UTF-8 or as ISO 8859-1 (since utf8_encode() converts from ISO 8859-1 to UTF-8). And since UTF-8 encodes the characters from 128 to 255 as two bytes, the first of which has the bit pattern 1100001x, you just need to check whether your data is UTF-8 and convert it if it is not.

So, scan all your data, check whether each value is already UTF-8 (see the several is_utf8 functions around), and apply utf8_encode() if it is not.
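
A minimal sketch of this check-then-convert step, assuming the mbstring extension is loaded: mb_check_encoding() plays the role of an is_utf8 function, and mb_convert_encoding() from ISO-8859-1 stands in for utf8_encode(). The function name ensure_utf8 is hypothetical.

```php
<?php
// Assumption: the mbstring extension is available.
// Hypothetical helper: returns the value unchanged if it is already valid
// UTF-8, otherwise treats it as ISO-8859-1 and converts it to UTF-8.
function ensure_utf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

// The two possible lead bytes for code points 128..255 match 1100001x.
printf("%08b %08b\n", 0xC2, 0xC3); // prints "11000010 11000011"

echo bin2hex(ensure_utf8("\xE9")), "\n";     // ISO-8859-1 "é": prints "c3a9"
echo bin2hex(ensure_utf8("\xC3\xA9")), "\n"; // UTF-8 "é":      prints "c3a9"
```

This handles a mix of ISO 8859-1 and UTF-8 values. It does not detect text that is valid UTF-8 but double-encoded, because such text passes the validity check.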

0

My problem was that my database somehow ended up with characters like à, é, ê stored sometimes in plain form and sometimes as utf8. After some research, I concluded that some browsers (I don't know whether IE, FF or others) encode the submitted input themselves, since no utf8 encoding had been deliberately declared for handling form submissions. So if I read the data using utf8_encode, I would corrupt the characters that were already fine, and vice versa.

My solution, after studying the answers above:

1. I created a new database with charset utf8.
2. I imported the database AFTER changing the charset definition in the CREATE TABLE statements of the SQL dump file from Latin ... to UTF8.
3. I imported the data from the source database (at this point it may be quite simple to just change the encoding of the existing db and tables; this applies only if the source db is not utf8).
4. I updated the contents of the database directly, replacing the double-encoded utf8 characters with their plain form, for example:

    UPDATE `clients` SET `name` = REPLACE(`name`,"Ã©",'é' ) WHERE `name` LIKE CONVERT( _latin1 '%é%' USING utf8 );

5. I put this line in the db class (for the PHP code) to make sure the connection uses UTF8:

    $this->query('SET CHARSET UTF8');

So how is the update (step 4) done? I built an array with the characters that might be double-encoded:

 $special_chars = array( 'ù','û','ü', 'ÿ', 'à','â','ä','å','æ', 'ç', 'é','è','ê','ë', 'ï','î', 'ô','','ö','ó','ø', 'ü'); 

I built an array of (table, field) pairs that need to be updated:

 $where_to_look = array( array("table_name" , "field_name"), ..... ); 

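Then:

    foreach ($special_chars as $char) {
        foreach ($where_to_look as $pair) {
            // $table = $pair[0]; $field = $pair[1]
            // Find rows where the field contains the double-encoded form of $char
            $sql = "SELECT id , `" . $pair[1] . "` FROM " . $pair[0]
                 . " WHERE `" . $pair[1] . "` LIKE CONVERT( _latin1 '%" . $char . "%' USING utf8 );";
            // Assumption: the SELECT has to be run before num_rows() is read;
            // this call is missing from the snippet as posted.
            $db->query($sql);
            if ($db->num_rows() > 0) {
                // Replace the double-encoded form with the plain character
                $sql1 = "UPDATE " . $pair[0]
                      . " SET `" . $pair[1] . "` = REPLACE(`" . $pair[1] . "`,CONVERT( _latin1 '" . $char . "' USING utf8 ),'" . $char . "' )"
                      . " WHERE `" . $pair[1] . "` LIKE CONVERT( _latin1 '%" . $char . "%' USING utf8 )";
                $db1->query($sql1);
            }
        }
    }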

The main idea is to use MySQL's own encoding functions, to avoid encoding round trips between MySQL, Apache and the browser. NOTE: I did not have PHP functions like mb_... available.
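
For readers who do have the mbstring functions, the same replacement idea can be sketched on the PHP side. This is an alternative illustration, not the code used in this answer. It assumes mbstring is loaded and that the source file is saved as UTF-8; the function name build_mojibake_map is hypothetical. Converting each character from ISO-8859-1 to UTF-8 produces the same double-encoded form as CONVERT( _latin1 '...' USING utf8 ) for the lowercase accented letters listed above.

```php
<?php
// Assumptions: mbstring is available; this file is saved as UTF-8.
// Hypothetical helper: maps the double-encoded form of each character
// back to the correct character, e.g. "Ã©" => "é".
function build_mojibake_map(array $chars): array
{
    $map = [];
    foreach ($chars as $char) {
        if ($char === '') {
            continue; // nothing to map for an empty entry
        }
        // Read the UTF-8 bytes as ISO-8859-1 and re-encode: the double-encoded form.
        $garbled = mb_convert_encoding($char, 'UTF-8', 'ISO-8859-1');
        $map[$garbled] = $char;
    }
    return $map;
}

$map = build_mojibake_map(['é', 'è', 'ê', 'à', 'ç', 'ô']);

// strtr() with an array replaces only the listed sequences and never
// re-scans text it has already replaced.
$fixed = strtr("Fran\xC3\x83\xC2\xA7ois", $map); // double-encoded "François"
echo $fixed, "\n";
```

Because only the sequences in the map are touched, this is more conservative than converting whole strings, at the cost of having to list every expected character.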

Best,

0