Cyrillic strings Я̆ Я̄ Я̈ return a length of 2 instead of 1 in Ruby and other programming languages

In Ruby, JavaScript, and Java (I haven't tried others), the Cyrillic characters Я̆ я̆ Я̄ Я̈ have a length of 2. When I check the length of a string containing these characters, I get the wrong value.

"Я̈".mb_chars.length
#=> 2  # should be 1 (Ruby on Rails)

"Я̆".length
#=> 2  # should be 1 (Ruby, JavaScript)

"Ӭ".length
#=> 1  #correct (ruby, javascript)

Note that the strings are encoded in UTF-8, and each char behaves like a single character.

Why does this happen, and how can I get the correct length of a string that contains these characters?

+6
3 answers

This happens because Я̈ is actually two code points, a base letter followed by a combining mark:

'Я̈'.chars
#=> ["Я", "̈"]

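In other words, the diaeresis ( ̈ ) and the breve ( ̆ ) are separate diacritics attached to the preceding letter (unlike Ӭ, which is a single precomposed character).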
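To make the difference visible at the code point level, here is a minimal sketch that writes the strings with `\u` escapes so the combining marks are unambiguous. The base letter Я (U+042F) and the variable names are chosen for illustration only. It also shows that NFC normalization helps only when Unicode defines a precomposed form, which it does for Э with a diaeresis (Ӭ) but not for Я.

```ruby
# Я (U+042F) followed by COMBINING DIAERESIS (U+0308): two code points.
decomposed = "\u042F\u0308"
# Ӭ (U+04EC) is one precomposed code point.
precomposed = "\u04EC"

# String#length counts code points, not user-perceived characters.
decomposed.length                                        #=> 2
decomposed.codepoints.map { |cp| format("U+%04X", cp) }  #=> ["U+042F", "U+0308"]
precomposed.length                                       #=> 1

# Э (U+042D) + diaeresis has a precomposed form, so NFC collapses it to Ӭ.
e_with_diaeresis = "\u042D\u0308"
e_with_diaeresis.unicode_normalize(:nfc) == precomposed  #=> true

# Я + diaeresis has no precomposed form, so NFC leaves two code points.
decomposed.unicode_normalize(:nfc).length                #=> 2
```

Normalization alone therefore cannot be relied on to make `length` return 1 for every letter-plus-diacritic pair.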

A workaround is to strip the diacritics before measuring the length:

'Я̆'.gsub(/\p{Diacritic}/, '')
#=> "Я"
'Я̆'.gsub(/\p{Diacritic}/, '').length
#=> 1

Precomposed characters are not affected by this. With Ӭ, for example:

'Ӭ'.length
#=> 1
'Ӭ'.gsub(/\p{Diacritic}/, '')
#=> "Ӭ" 
'Ӭ'.gsub(/\p{Diacritic}/, '').length
#=> 1 

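Keep in mind that this is a workaround, not a general solution. It changes the string, and Unicode defines other combining sequences that this pattern does not cover.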

+5

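Ruby 2.5 introduced String#each_grapheme_cluster:

'Я̆Я̄Я̈'.each_grapheme_cluster.to_a   #=> ["Я̆", "Я̄", "Я̈"]
'Я̆Я̄Я̈'.each_grapheme_cluster.count  #=> 3

Note that you can't use each_grapheme_cluster.size, which is equivalent to each_char.size, so both would return 6 in the example above. (That looks like a bug, and I've just filed a bug report.)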
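The grapheme-cluster approach can be wrapped in a small helper. This is only a sketch: `grapheme_length` and `grapheme_length_via_regex` are hypothetical names, the sample string assumes the base letter Я, and the regex variant assumes the Ruby version in use matches extended grapheme clusters with `\X`.

```ruby
# Hypothetical helper: counts user-perceived characters (grapheme clusters).
# Uses #count rather than #size, which walks the enumerator.
def grapheme_length(str)
  str.each_grapheme_cluster.count
end

# Alternative sketch for Rubies without each_grapheme_cluster:
# \X matches one extended grapheme cluster per match.
def grapheme_length_via_regex(str)
  str.scan(/\X/).length
end

# Я + breve, Я + macron, Я + diaeresis, written with escapes for clarity.
sample = "\u042F\u0306\u042F\u0304\u042F\u0308"

sample.length                      #=> 6 (code points)
grapheme_length(sample)            #=> 3
grapheme_length_via_regex(sample)  #=> 3
```

Unlike stripping diacritics, this leaves the string itself unchanged.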

+5

Try the unicode-display_width gem, which is built to answer exactly this question:

require "unicode/display_width"
Unicode::DisplayWidth.of("Я̈") #=> 1
0