How to get the number of characters in a string?

Question

How can I get the number of characters of a string in Go?

For example, if I have the string "hello", the method should return 5. I saw that len(str) returns the number of bytes, not the number of characters, so len("£") returns 2 instead of 1, because £ is encoded with two bytes in UTF-8.

+80

string go character string-length

Ammar 01 Oct

4 answers

There is a way to get the number of runes without any packages: convert the string to []rune and take its length, as in len([]rune(YOUR_STRING)):

    package main

    import "fmt"

    func main() {
        russian := "Спутник и погром"
        english := "Sputnik & pogrom"

        // len on a string counts bytes; len on a []rune counts code points.
        fmt.Println("count of bytes:", len(russian), len(english))
        fmt.Println("count of runes:", len([]rune(russian)), len([]rune(english)))
    }

Output:

count of bytes: 30 16
count of runes: 16 16

+20

Denis Kreshikhin Apr 03 '16 at 16:54

It depends a lot on your definition of what a "character" is. If "a rune equals a character" is good enough for your task (usually it is not), then VonC's answer is perfect for you. Otherwise, it should probably be noted that there are few situations where the number of runes in a Unicode string is an interesting value. And even in those situations it is better, if possible, to compute the count while traversing the string as the runes are processed, to avoid doubling the UTF-8 decoding effort.
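A minimal sketch of counting during traversal, assuming the per-rune work is something like counting letters. The helper name countWhileScanning is hypothetical and only illustrates the idea of doing both jobs in one decoding pass:

```go
package main

import (
	"fmt"
	"unicode"
)

// countWhileScanning is a hypothetical helper: it does some per-rune work
// (here, counting letters) and counts the runes in the same pass, so the
// string is decoded from UTF-8 only once.
func countWhileScanning(s string) (runes, letters int) {
	for _, r := range s { // range over a string yields one rune per step
		runes++
		if unicode.IsLetter(r) {
			letters++
		}
	}
	return runes, letters
}

func main() {
	runes, letters := countWhileScanning("Hello, 世界")
	fmt.Println(runes, letters) // 9 7
}
```

Calling utf8.RuneCountInString first and then looping over the string would decode it twice; folding the count into the loop avoids that.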

+5

zzzz 01 Oct

If you need to take grapheme clusters into account, use the regexp or unicode package. Counting the number of code points (runes) or bytes is also needed for validation, since the length of a grapheme cluster is unlimited. If you want to reject extremely long sequences, check whether the text conforms to the stream-safe text format.

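    package main

    import (
        "fmt"
        "regexp"
        "strings"
        "unicode"
    )

    // A non-mark followed by any number of marks, or any single character.
    var graphemeRe = regexp.MustCompile(`\PM\pM*|.`)

    // A non-mark followed by 30 or more marks.
    var longMarkRunRe = regexp.MustCompile(`\PM\pM{30,}`)

    func main() {
        str := "\u0308" + "a\u0308" + "o\u0308" + "u\u0308"
        str2 := "a" + strings.Repeat("\u0308", 1000)

        // Every line below prints true.
        fmt.Println(4 == GraphemeCountInString(str))
        fmt.Println(4 == GraphemeCountInString2(str))
        fmt.Println(1 == GraphemeCountInString(str2))
        fmt.Println(1 == GraphemeCountInString2(str2))
        fmt.Println(true == IsStreamSafeString(str))
        fmt.Println(false == IsStreamSafeString(str2))
    }

    // GraphemeCountInString counts regexp matches of "base plus marks".
    func GraphemeCountInString(str string) int {
        return len(graphemeRe.FindAllString(str, -1))
    }

    // GraphemeCountInString2 does the same count with a single pass over the runes.
    func GraphemeCountInString2(str string) int {
        length := 0
        checked := false // becomes true once the first non-mark rune is seen
        for _, c := range str {
            if !unicode.Is(unicode.M, c) {
                length++
                checked = true
            } else if !checked {
                // Marks before the first base character each count as one.
                length++
            }
        }
        return length
    }

    // IsStreamSafeString reports false if a base is followed by 30 or more marks.
    func IsStreamSafeString(str string) bool {
        return !longMarkRunRe.MatchString(str)
    }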
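As a rough sketch of the validation point above: because one grapheme cluster can hold arbitrarily many code points, an input check might cap the rune count before doing any per-cluster work. The helper name withinRuneLimit and the limit of 64 are assumptions made for this illustration, not part of the answer.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// withinRuneLimit is a hypothetical helper: it reports whether s is valid
// UTF-8 and contains at most maxRunes code points.
func withinRuneLimit(s string, maxRunes int) bool {
	return utf8.ValidString(s) && utf8.RuneCountInString(s) <= maxRunes
}

func main() {
	const maxRunes = 64 // assumed limit, chosen only for this example

	short := "a\u0308"                           // one cluster, 2 runes
	long := "a" + strings.Repeat("\u0308", 1000) // one cluster, 1001 runes

	fmt.Println(withinRuneLimit(short, maxRunes)) // true
	fmt.Println(withinRuneLimit(long, maxRunes))  // false
}
```

Both strings count as a single grapheme by the functions above, so only the rune (or byte) count tells them apart.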

+4

masakielastic Nov 04

VonC · Accepted Answer · 2012-10-01 07:06

You can try RuneCountInString from the unicode/utf8 package.

returns the number of runes in p

As this script shows, the length of "World" can be 6 (when written in Chinese: "世界"), but its rune count is 2:

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        // Prints the string, its byte length, and its rune count.
        fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
    }

Phrozen adds in the comments:

Actually, you can do len() over runes by just type casting. len([]rune("世界")) will print 2. At least in Go 1.3.

Stefan Steiger points to the blog post "Text normalization in Go":

What is a character?

As mentioned in the strings blog post, characters can span multiple runes.
For example, an "e" and "◌́" (acute, "\u0301") can combine to form "é" ("e\u0301" in NFD). Together, these two runes are one character.
The definition of a character may vary by application.
For normalization, we define it as:
a sequence of runes that starts with a starter,
a rune that does not modify or combine backwards with any other rune,
followed by a possibly empty sequence of non-starters, that is, runes that do (typically accents).
The normalization algorithm processes one character at a time.

Using this package and its Iter type, the actual number of characters would be:

    package main

    import (
        "fmt"

        "golang.org/x/text/unicode/norm"
    )

    func main() {
        var ia norm.Iter
        ia.InitString(norm.NFKD, "école")

        // Each call to Next advances past one normalized segment.
        nc := 0
        for !ia.Done() {
            nc++
            ia.Next()
        }
        fmt.Printf("Number of chars: %d\n", nc)
    }

This uses the Unicode normalization form NFKD, "Compatibility Decomposition".
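To see why normalization matters for counting, here is a small standard-library-only sketch comparing the precomposed and decomposed spellings of "é". The helper byteAndRuneLen is a hypothetical name introduced just for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen is a hypothetical helper returning the byte length and
// the rune count of s.
func byteAndRuneLen(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	precomposed := "\u00e9" // é as a single code point
	decomposed := "e\u0301" // e followed by a combining acute accent

	fmt.Println(byteAndRuneLen(precomposed)) // 2 1
	fmt.Println(byteAndRuneLen(decomposed))  // 3 2
	fmt.Println(precomposed == decomposed)   // false
}
```

Both strings render as one character, yet neither the byte length nor the rune count agrees between them, which is why the answer above counts normalized segments instead.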
