How to check whether a character is UTF-8 encoded

How can I check whether a set of characters is UTF-8 encoded in Ruby / RoR?

+5
3 answers

There is no definite way to do this, in Ruby or anywhere else, because the same bytes can be valid in several encodings:

str = 'foo' # start with a simple string
# => "foo" 
str.encoding
# => #<Encoding:UTF-8> # which is UTF-8 encoded
str.bytes.to_a
# => [102, 111, 111] # as you can see, it consists of three bytes 102, 111 and 111
str.encode!('us-ascii') # now we will recode the string to 7-bit US-ASCII encoding
# => "foo" 
str.encoding
# => #<Encoding:US-ASCII> 
str.bytes.to_a
# => [102, 111, 111] # see, same three bytes
str.encode!('windows-1251') # let us try some cyrillic
# => "foo" 
str.encoding
# => #<Encoding:Windows-1251> 
str.bytes.to_a
# => [102, 111, 111] # see, the same three again!

Of course, you can apply statistical analysis to the text and rule out encodings under which the bytes are invalid, but in the general case the problem is not solvable.
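
The elimination idea can be sketched roughly as follows. The helper name `plausible_encodings` and the candidate list are assumptions made for illustration, not part of any library. Single-byte encodings accept almost any byte sequence, so they rarely get ruled out, which is why elimination alone usually leaves more than one candidate.

```ruby
# Hypothetical helper (illustration only): keep the candidate encodings
# under which the raw bytes form a structurally valid string.
def plausible_encodings(bytes, candidates = %w[UTF-8 Shift_JIS EUC-JP Windows-1252])
  candidates.select do |name|
    # force_encoding only relabels the bytes; it never converts them
    bytes.dup.force_encoding(name).valid_encoding?
  end
end

raw = "\xE4\xF6\xFC\xDF".b   # the bytes of "äöüß" in CP1252, as binary data
survivors = plausible_encodings(raw)
p survivors                  # UTF-8 is ruled out; single-byte encodings remain
```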

+8

Check UTF-8 validity

For most multibyte encodings, it is possible to detect invalid byte sequences programmatically. Since Ruby treats all strings as UTF-8 by default, you can check whether a string is valid UTF-8:

# encoding: UTF-8
# -------------------------------------------
str = "Partly valid\xE4 UTF-8 encoding: äöüß"

str.valid_encoding?
   # => false

str.scrub('').valid_encoding?
   # => true
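
As a small aside on `String#scrub` (available since Ruby 2.1): besides removing invalid bytes with `scrub('')`, it can substitute a marker or hand the offending bytes to a block. The sample string below is made up for illustration.

```ruby
broken = "abc\xE4def"          # \xE4 is not a complete UTF-8 sequence here

default_fix = broken.scrub       # invalid byte becomes U+FFFD (replacement character)
marked      = broken.scrub('?')  # invalid byte becomes a custom marker
dropped     = broken.scrub('')   # invalid byte is removed
# With a block, the invalid bytes are passed in, e.g. to show them as hex
hexed       = broken.scrub { |bytes| '<' + bytes.unpack('H*')[0] + '>' }

p [default_fix, marked, dropped, hexed]
```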

Furthermore, if a string is not valid UTF-8 but you know its actual encoding, you can convert it to UTF-8.


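Sometimes you know that an input file is encoded in either UTF-8 or CP1252 (a.k.a. Windows-1252).
Check which of the two it is and convert to UTF-8 (if necessary):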

# encoding: UTF-8
# ------------------------------------------------------
test = "String in CP1252 encoding: \xE4\xF6\xFC\xDF"
File.open( 'input_file', 'w' ) {|f| f.write(test)}

str  = File.read( 'input_file' )

unless str.valid_encoding?
  str.encode!( 'UTF-8', 'CP1252', invalid: :replace, undef: :replace, replace: '?' )
end #unless
   # => "String in CP1252 encoding: äöüß"
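
The same check-and-convert step can be wrapped in a helper that works on raw bytes instead of a file. This is a minimal sketch: the name `to_utf8` and the `fallback:` parameter are hypothetical, and it assumes the input is known to be either UTF-8 or the fallback encoding.

```ruby
# Hypothetical helper: return a UTF-8 string for input that is either
# already UTF-8 or in the given fallback encoding (CP1252 by default).
def to_utf8(raw, fallback: 'Windows-1252')
  text = raw.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?   # already valid UTF-8, nothing to convert

  # Not valid UTF-8: reinterpret the bytes as the fallback and transcode
  raw.dup.force_encoding(fallback)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '?')
end

from_cp1252 = to_utf8("\xE4\xF6\xFC\xDF".b)            # CP1252 bytes
from_utf8   = to_utf8("\u00E4\u00F6\u00FC\u00DF".b)    # UTF-8 bytes
p [from_cp1252, from_utf8]
```

For file input, `File.binread('input_file')` would supply the raw bytes without any encoding assumption.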

=======

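  • Valid UTF-8 can be identified programmatically quite reliably (in Ruby, see: #valid_encoding?). The probability that a string of length 16 in another encoding happens to be valid UTF-8 is given as about 0.01%.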

  • It is not possible to programmatically distinguish (in)valid CP1252 from ISO-8859-1. If in doubt, assume CP1252.

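  • If a file is not valid UTF-8 and its content is either CP1252 or Latin1 (ISO-8859-1), the two cannot be told apart reliably, since they differ only in the byte range 0x80-0x9F. Therefore, if in doubt, assume CP1252 (a.k.a. Windows-1252). See also: ISO-8859-1, ISO-8859-15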
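
To illustrate why CP1252 and ISO-8859-1 are hard to tell apart, here is a small sketch with a made-up one-byte input. The same byte decodes without error under both labels and only the resulting character differs, so a validity check has nothing to go on.

```ruby
byte = "\x80".b   # one raw byte in the range where the two encodings differ

as_cp1252 = byte.dup.force_encoding('Windows-1252').encode('UTF-8')
as_latin1 = byte.dup.force_encoding('ISO-8859-1').encode('UTF-8')

p as_cp1252.codepoints   # the euro sign, U+20AC
p as_latin1.codepoints   # a C1 control character, U+0080
```

Real text almost never contains C1 control characters, which is one reason to prefer the CP1252 reading when in doubt.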

+6
"your string".encoding
 # => #<Encoding:UTF-8>

Or, if you want to be pro-active,

"your string".encoding.name == "UTF-8"
 # => true
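
Note that `#encoding` only reports the label attached to the string, not whether the bytes actually conform to it. A minimal sketch that combines the label check with `#valid_encoding?` (the method name `utf8?` is hypothetical):

```ruby
# Hypothetical helper: true only if the string is labelled UTF-8
# AND its bytes are well-formed UTF-8.
def utf8?(str)
  str.encoding == Encoding::UTF_8 && str.valid_encoding?
end

ok     = "caf\u00E9"      # well-formed UTF-8
broken = "caf\xE9"        # labelled UTF-8, but \xE9 is a truncated sequence
binary = "caf\u00E9".b    # same bytes as ok, labelled ASCII-8BIT

p [utf8?(ok), utf8?(broken), utf8?(binary)]
```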
+1
