How do I convert stored data that was incorrectly encoded?

My Perl application and MySQL database now correctly handle incoming UTF-8 data, but I need to convert existing data. Some of the data appears to have been encoded as CP-1252 and not decoded as such before being encoded as UTF-8 and stored in MySQL. I read the O'Reilly article "Turning MySQL data in latin1 to utf8 utf-8", but although it is often referenced, it is not a definitive solution.

I looked at Encode::DoubleEncodedUTF8 and Encoding::FixLatin, but neither worked on my data.

This is what I have done so far:

use Encode qw(decode encode);

# $bytes is fetched from the DB using BINARY()
my $characters = decode('utf-8', $bytes);
my $good = decode('utf-8', encode('cp-1252', $characters));

This fixes most cases, but if it is run on properly encoded records, it mangles them. I tried using Encode::Guess and Encode::Detect, but they cannot distinguish between correctly encoded and misencoded records. Therefore, I simply undo the conversion if the character \x{FFFD} is found after the conversion.

Some records, however, are only partially converted. Here's an example where the left curly quotes are correctly converted, but the right curly quotes come out mangled.

perl -CO -MEncode -e 'print decode("utf-8", encode("cp-1252", decode("utf-8", "\xC3\xA2\xE2\x82\xAC\xC5\x93four score\xC3\xA2\xE2\x82\xAC\xC2\x9D")))'

And here is an example where the right single quote was not converted:

perl -CO -MEncode -e 'print decode("utf-8", encode("cp-1252", decode("utf-8", "bob\xC3\xAF\xC2\xBF\xC2\xBDs")))'

Am I also dealing with double-encoded data here? What else needs to be done to convert these records?


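From your examples, it looks like you have two kinds of data:

  • data that was cp1252 and was run through a cp1252-to-utf8 conversion
  • data that was already utf8 and was also run through a cp1252-to-utf8 conversion

(The first kind is fine, of course.)

And, if I understand correctly, you want to find and repair the second kind?

First, note that not every byte value maps to a unicode character in cp1252. Some bytes (for example, 0x9D) are simply undefined in cp1252.

Your cp1252-to-utf8 converter, whatever it was, evidently converted such bytes anyway by passing them through as the code point with the same number, even though they are not valid cp1252. So to undo the damage, the reverse conversion has to pass them through too, which a strict encoder will not do. That is why your right curly quotes, whose utf-8 form ends in the byte 0x9D, come out mangled.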
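To see the pass-through idea in isolation, here is a minimal sketch. The helper name encode_cp1252_passthrough is made up for illustration; it relies on Encode accepting a code reference as the CHECK argument, which is called with the ordinal of each character that cannot be encoded.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of Encode): encode characters to cp1252,
# emitting any unmappable code point below 256 as the byte with the same
# number, and dying for anything that cannot fit in a single byte.
sub encode_cp1252_passthrough {
    my ($chars) = @_;
    return encode("cp1252", $chars, sub {
        my ($ord) = @_;
        die sprintf("U+%04X has no single-byte form\n", $ord) if $ord > 255;
        return chr($ord);
    });
}

# U+00E2, U+20AC and U+009D are what a mangled right curly quote decodes to.
my $bytes = encode_cp1252_passthrough("\x{E2}\x{20AC}\x{9D}");
print join(" ", map { sprintf "%02x", ord } split //, $bytes), "\n";
```

Dying on code points above 255 is a design choice: such characters could never have come from a single cp1252 byte, so the input was probably not mangled in the way assumed here.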

Now, let's read your byte sequence as utf-8 and look at the unicode code points we get:

$ perl -CO -MEncode -e '$a=decode("utf-8", 
  "\xC3\xA2\xE2\x82\xAC\xC5\x93" .
  "four score" .
  "\xC3\xA2\xE2\x82\xAC\xC2\x9D");
  for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt

This prints:

e2 20ac 153 66 6f 75 72 20 73 63 6f 72 65 e2 20ac 9d

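("fmt" is a standard unix command that wraps text; it is only there to tidy the output, so leave it off if you don't have it)

Now let's encode that as cp1252, but wherever a unicode character has no cp1252 equivalent, just pass the code point through as a byte. (That only makes sense for code points below 256.) If we are right about what happened, the result should be valid utf8.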

$ perl -CO -MEncode -e '$a=decode("utf-8",
  "\xC3\xA2\xE2\x82\xAC\xC5\x93" .
  "four score" .
  "\xC3\xA2\xE2\x82\xAC\xC2\x9D");
  $a=encode("cp-1252", $a, sub { chr($_[0]) } );
  for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt

The last argument is a fallback sub that is called for each character cp1252 cannot represent.

Result:

e2 80 9c 66 6f 75 72 20 73 63 6f 72 65 e2 80 9d

That looks like utf8. Is it? To check, ask perl to decode it as utf8:

$ perl -CO -MEncode -e '$a=decode("utf-8",
  "\xC3\xA2\xE2\x82\xAC\xC5\x93" .
  "four score" .
  "\xC3\xA2\xE2\x82\xAC\xC2\x9D");
  $a=encode("cp-1252", $a, sub { chr($_[0]) } );
  $a=decode("utf-8", $a, 1);
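  for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt

The "1" as the third argument makes decode die if the input is not valid utf-8, so getting any output means it was valid. It gives: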

201c 66 6f 75 72 20 73 63 6f 72 65 201d

Printed as text:

$ perl -CO -MEncode -e '$a=decode("utf-8",
  "\xC3\xA2\xE2\x82\xAC\xC5\x93" .
  "four score" .
  "\xC3\xA2\xE2\x82\xAC\xC2\x9D");
  $a=encode("cp-1252", $a, sub { chr($_[0]) } );
  $a=decode("utf-8", $a, 1);
  print "$a\n"'
“four score”

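So, to do this in general:

  • Pull the byte stream out of mysql. Assign it to $bytestream.
  • While $bytestream is a valid utf8 byte stream:
    • Assign the current value of $bytestream to $good
    • If $bytestream is all-ASCII (i.e. every byte is below 0x80), break out of this "while... valid utf8" loop.
    • Set $bytestream to the result of "demangle($bytestream)", where demangle is given below. It undoes the cp1252-to-utf8 conversion we think this data has suffered, passing through bytes that cp1252 does not define.
  • Put $good back in the database if it is not undef. If $good was never assigned, assume $bytestream was cp1252 and convert it to utf8. (Of course, optimize and skip the write if the loop in step 2 did not change anything, etc.)

Here is demangle: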

sub demangle {
  my($a) = shift;
  eval { # the non-string form of eval just traps exceptions
         # so that we return undef on exception
    local $SIG{__WARN__} = sub {}; # No warning messages
    $a = decode("utf-8", $a, 1);
    encode("cp-1252", $a, sub {$_[0] <= 255 or die $_[0]; chr($_[0])});
  }
}

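This is based on the assumption that a string that is not all ASCII is very unlikely to be a valid utf-8 byte stream unless it really is utf-8. That is, it is not the sort of thing that happens by accident.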
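The whole procedure can be sketched as one routine. This is only an illustration under the assumptions above: repair_bytes is a made-up name, the helpers are restated so the snippet runs on its own, and fetching from and writing back to mysql is left out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# True if the byte string is strict utf-8. Works on a copy, because
# decode() with a CHECK value may modify the variable passed to it.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return defined eval { decode("utf-8", $bytes, Encode::FB_CROAK) };
}

# Undo one round of cp1252-to-utf8 conversion; returns undef on failure.
sub demangle {
    my ($bytes) = @_;
    return eval {
        local $SIG{__WARN__} = sub { };    # no warning messages
        my $chars = decode("utf-8", $bytes, Encode::FB_CROAK);
        encode("cp1252", $chars, sub {
            my ($ord) = @_;
            die "code point $ord does not fit in one byte\n" if $ord > 255;
            return chr($ord);
        });
    };
}

# Hypothetical driver: returns the repaired utf-8 bytes for one record.
sub repair_bytes {
    my ($bytestream) = @_;
    my $original = $bytestream;
    my $good;
    while (is_valid_utf8($bytestream)) {
        $good = $bytestream;
        last if $bytestream !~ /[^\x00-\x7F]/;    # all ASCII: nothing left to undo
        $bytestream = demangle($bytestream);      # undef ends the loop
    }
    return $good if defined $good;
    # Never valid utf-8: assume plain cp1252 and convert it once.
    return encode("utf-8", decode("cp1252", $original));
}
```

The loop always terminates, because each successful demangle turns every multi-byte character into a single byte, so a non-ASCII string gets strictly shorter. A record that is already correct utf-8 comes back unchanged, since its demangled form is no longer valid utf-8.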

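Edited to add:

Note that this technique does not help much with your "bob's" example, unfortunately. I think that string also went through two rounds of cp1252-to-utf8 conversion, but there was some corruption along the way as well. Using the same technique as before, we first read the byte sequence as utf8 and then look at the sequence of unicode code points we get: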

$ perl -CO -MEncode -e '$a=decode("utf-8",
  "bob\xC3\xAF\xC2\xBF\xC2\xBDs");
  for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt

This gives:

62 6f 62 ef bf bd 73

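Now, it so happens that for ef, bf and bd the unicode code point and the cp1252 byte are the same. So representing this sequence of Unicode code points in cp1252 gives just: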

62 6f 62 ef bf bd 73

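That is, the same sequence of numbers. This is in fact a valid utf-8 byte stream, but what it decodes to may surprise you: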

$ perl -CO -MEncode -e '$a=decode("utf-8",
  "bob\xC3\xAF\xC2\xBF\xC2\xBDs");
  $a=encode("cp-1252", $a, sub { chr(shift) } );
  $a=decode("utf-8", $a, 1);
  for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt

62 6f 62 fffd 73

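That is, the byte stream is legitimate utf-8, but it encodes the character 0xFFFD, which is generally used for "untranslatable character". I suspect that what happened here is that the first *-to-utf8 conversion saw a character it did not recognize and replaced it with "untranslatable". There is no way to recover the original character programmatically after that.

A consequence is that you cannot tell whether a byte stream is valid utf8 (which the algorithm above needs) simply by decoding it and then looking for 0xFFFD. Instead, use something like this: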

sub is_valid_utf8 {
  my $bytes = shift;  # copy: decode() with a check value may modify its argument
  defined(eval { decode("utf-8", $bytes, 1) })
}
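To make the distinction concrete, here is a small sketch that treats validity and the presence of 0xFFFD as two separate questions. The helper has_replacement_char is a made-up name; it could be used to flag records such as the "bob's" one for manual review, since they are valid utf-8 but have already lost a character.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Same check as above, restated so this snippet runs on its own.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return defined eval { decode("utf-8", $bytes, Encode::FB_CROAK) };
}

# Hypothetical helper: true only if the bytes are valid utf-8 AND the
# decoded text contains U+FFFD, i.e. an earlier conversion lost data.
sub has_replacement_char {
    my ($bytes) = @_;
    my $chars = eval { decode("utf-8", $bytes, Encode::FB_CROAK) };
    return (defined($chars) && $chars =~ /\x{FFFD}/) ? 1 : 0;
}

for my $bytes ("bob\xEF\xBF\xBDs", "bob\xE2\x80\x99s", "bob\x92s") {
    printf "valid=%d lost=%d\n",
        is_valid_utf8($bytes) ? 1 : 0,
        has_replacement_char($bytes);
}
```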