Check validity of UTF-8
For most multibyte encodings, invalid byte sequences can be detected programmatically. Since Ruby treats all strings as UTF-8 by default, you can check whether a string is valid UTF-8:
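str = "Partly valid\xE4 UTF-8 encoding: äöüß"
str.valid_encoding?            # => false (\xE4 is not followed by continuation bytes)
str.scrub('').valid_encoding?  # => true (the invalid byte has been removed)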
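To illustrate what #scrub does with the invalid byte, here is a minimal sketch comparing the default replacement, a custom replacement string, and a block. The variable names are chosen for illustration only.

```ruby
str = "Partly valid\xE4 UTF-8 encoding: äöüß"

# Without arguments, scrub substitutes U+FFFD (the Unicode replacement character).
default_scrub = str.scrub

# With a string argument, each invalid sequence is replaced by that string.
marked = str.scrub('?')

# With a block, the invalid bytes are passed in so they can be inspected.
hexed = str.scrub { |bytes| "<#{bytes.unpack1('H*')}>" }

puts default_scrub
puts marked
puts hexed
```

The block form is handy when you want to log which bytes were dropped instead of silently discarding them.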
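If a string is not valid UTF-8, it was written in some encoding other than UTF-8. In practice this is most likely CP1252 (a.k.a. Windows-1252).
The following example writes a CP1252-encoded string to a file, reads it back as UTF-8 (the default), and converts it to UTF-8 if it is not valid: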
test = "String in CP1252 encoding: \xE4\xF6\xFC\xDF"
File.open( 'input_file', 'w' ) {|f| f.write(test)}
str = File.read( 'input_file' )  # read with the default external encoding (UTF-8)
unless str.valid_encoding?
  # reinterpret the bytes as CP1252 and transcode them to UTF-8
  str.encode!( 'UTF-8', 'CP1252', invalid: :replace, undef: :replace, replace: '?' )
end
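The same check-then-convert step can be wrapped in a small method so it works on any string, not just file contents. This is a minimal sketch: `to_utf8` and its `fallback` parameter are hypothetical names, not part of Ruby, and CP1252 as the default fallback is an assumption taken from the text above.

```ruby
# Hypothetical helper (not part of Ruby): returns a valid UTF-8 version of +str+.
# Strings that are already valid are returned unchanged; anything else is
# treated as +fallback+ (assumed to be CP1252 by default) and transcoded.
def to_utf8(str, fallback: 'CP1252')
  return str if str.valid_encoding?

  str.encode('UTF-8', fallback, invalid: :replace, undef: :replace, replace: '?')
end

raw = "String in CP1252 encoding: \xE4\xF6\xFC\xDF"
fixed = to_utf8(raw)
puts fixed
```

The non-bang #encode is used here so that the caller's original string is left untouched.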
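Testing for valid UTF-8 (in Ruby, cf. #valid_encoding?) is a reliable check, because text in another encoding rarely forms valid UTF-8 by accident: for a sequence of length 16, the chance is given as 0.01%.
Text that is not valid UTF-8 is usually CP1252 or ISO-8859-1 (Latin1). If nothing else is known about the source, CP1252 (a.k.a. Windows-1252) is the most likely candidate. Other candidates: ISO-8859-1, ISO-8859-15.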
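The candidate encodings differ in only a few byte values, so the choice matters mainly for characters such as the euro sign. The sketch below decodes the same raw bytes under each candidate; `decode_as` is a hypothetical helper introduced for illustration.

```ruby
# Hypothetical helper: interpret raw bytes under the given encoding
# and transcode the result to UTF-8.
def decode_as(byte_string, encoding)
  byte_string.dup.force_encoding(encoding).encode('UTF-8')
end

euro_cp1252 = decode_as("\x80", 'CP1252')       # euro sign, U+20AC
ctrl_latin1 = decode_as("\x80", 'ISO-8859-1')   # C1 control character, U+0080
euro_latin9 = decode_as("\xA4", 'ISO-8859-15')  # euro sign, U+20AC
curr_latin1 = decode_as("\xA4", 'ISO-8859-1')   # currency sign, U+00A4

[euro_cp1252, ctrl_latin1, euro_latin9, curr_latin1].each do |s|
  puts format('U+%04X', s.ord)
end
```

If the converted text contains unexpected control characters or currency signs where a euro sign should be, a different candidate encoding may be the better guess.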