You can try RuneCountInString from the utf8 package. It is the string counterpart of RuneCount, which
"returns the number of runes in p"
As this script shows, the length of "world" can be 6 (when written in Chinese: "世界"), but its rune count is 2:
    package main

    import "fmt"
    import "unicode/utf8"

    func main() {
        // len counts bytes; RuneCountInString counts code points.
        fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
    }
Phrozen adds in the comments:
Actually, you can apply len() to the runes by simply converting the type. len([]rune("世界")) will print 2. At least in Go 1.3.
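To compare the approaches side by side, here is a minimal sketch. The helper name byteAndRuneCounts is only illustrative and not part of any library; the é is written as the escape \u00e9 so the result does not depend on how the source file was typed.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCounts (hypothetical helper) returns the byte length of s
// and its rune count computed in two ways.
func byteAndRuneCounts(s string) (nBytes, viaUTF8, viaConversion int) {
	return len(s), utf8.RuneCountInString(s), len([]rune(s))
}

func main() {
	for _, s := range []string{"world", "世界", "\u00e9cole"} {
		b, u, c := byteAndRuneCounts(s)
		fmt.Printf("%q: %d bytes, %d runes (utf8), %d runes (conversion)\n", s, b, u, c)
	}
}
```

Both rune counts should always agree; only the byte length differs for non-ASCII text.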
Stefan Steiger points to the blog post "Text normalization in Go":
What is a character?
As mentioned in the strings blog post, characters can span multiple runes.
For example, an "e" and "◌́" (acute, "\u0301") can combine to form "é" ("e\u0301" in NFD). Together, these two runes are one character.
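A small sketch of what this means for counting, using only the standard library. The helper name sizes is hypothetical; the two string literals are the precomposed and decomposed spellings of the same visible letter.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes (hypothetical helper) returns the byte length and rune count of s.
func sizes(s string) (byteLen, runeLen int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	precomposed := "\u00e9" // single code point U+00E9
	decomposed := "e\u0301" // 'e' followed by combining acute accent U+0301

	fmt.Println(sizes(precomposed))        // 2 1
	fmt.Println(sizes(decomposed))         // 3 2
	fmt.Println(precomposed == decomposed) // false: different bytes, same glyph
}
```

So a rune count alone can report 2 for something a reader sees as one character.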
The definition of a character may vary depending on the application.
For normalization, we will define it as:
- a sequence of runes that starts with a starter,
- a rune that does not modify or combine backwards with any other rune,
- followed by a possibly empty sequence of non-starters, that is, runes that do (typically accents).
The normalization algorithm processes one character at a time.
Using the norm package (golang.org/x/text/unicode/norm) and its Iter type, the actual number of "characters" would be:
    package main

    import "fmt"
    import "golang.org/x/text/unicode/norm"

    func main() {
        var ia norm.Iter
        ia.InitString(norm.NFKD, "école")
        nc := 0
        // Each call to Next advances past one normalized character.
        for !ia.Done() {
            nc++
            ia.Next()
        }
        fmt.Printf("Number of chars: %d\n", nc)
    }
This uses the Unicode Normalization Form NFKD ("Compatibility Decomposition").
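If pulling in golang.org/x/text is not an option, the starter/non-starter idea can be roughly approximated with the standard library alone. The sketch below is not the normalization algorithm: it assumes that "non-starter" can be approximated by "nonspacing mark" (Unicode category Mn), whereas the norm package works from canonical combining classes. The function name countApproxChars is hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// countApproxChars (hypothetical helper) counts the runes of s that are
// not nonspacing marks, treating each such rune as the start of a character.
// Assumption: category Mn stands in for "non-starter"; this is only an
// approximation of what golang.org/x/text/unicode/norm does.
func countApproxChars(s string) int {
	n := 0
	for _, r := range s {
		if !unicode.Is(unicode.Mn, r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countApproxChars("e\u0301cole")) // 5: the accent is not counted
	fmt.Println(countApproxChars("\u00e9cole"))  // 5
}
```

A mark that appears with no preceding base rune is not counted at all here, so for anything beyond simple accented text the Iter-based loop is the safer choice.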
VonC Oct 01 2018-12-12T00: 00Z