Length returns the number of elements when considering a string as an array.
- For strings with 8-bit element types (ANSI, UTF-8), Length gives you the number of bytes, because the number of bytes is the same as the number of elements.
- For strings with 16-bit element types (UTF-16), Length is half the number of bytes, because each element is 2 bytes wide.
Your string '1¢' has two code points, but the second code point requires two bytes when encoded in UTF-8. Therefore, Length(Utf8String('1¢')) evaluates to three.
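As a minimal sketch (a hypothetical console demo, not from the question), you can see both behaviours side by side:

    program ElementCountDemo; // hypothetical name
    {$APPTYPE CONSOLE}
    var
      U8: UTF8String;
      U16: string;
    begin
      U8 := UTF8String('1¢');
      U16 := '1¢';
      Writeln(Length(U8));  // 3: '¢' occupies two 8-bit elements in UTF-8
      Writeln(Length(U16)); // 2: each code point fits in one 16-bit element
    end.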
You mention SizeOf in the title of the question. Passing a string variable to SizeOf will always return the size of a pointer, because a string variable is, under the hood, just a pointer.
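For instance, a sketch (hypothetical console demo) showing that SizeOf reports only the pointer size:

    program SizeOfDemo; // hypothetical name
    {$APPTYPE CONSOLE}
    var
      A: AnsiString;
      U: string;
    begin
      A := 'hello';
      U := 'hello';
      // Both variables are just pointers to heap-allocated string data,
      // so SizeOf reports the pointer size regardless of the contents.
      Writeln(SizeOf(A)); // 4 on 32-bit targets, 8 on 64-bit targets
      Writeln(SizeOf(U)); // same value: the size of a pointer
    end.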
For your specific questions:
Why is there no difference in processing at all?
There is only a difference if you think of Length in terms of bytes. But that is the wrong way to think about it. Length always returns an element count, and when viewed that way, the behaviour is uniform across all string types, and indeed all array types.
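To illustrate that uniformity, a hypothetical sketch (TArray<Integer> and the array literal assume a reasonably recent Delphi with System.SysUtils):

    program ElementCountUniform; // hypothetical name
    {$APPTYPE CONSOLE}
    uses
      System.SysUtils;
    var
      Ints: TArray<Integer>;
      S: string;
    begin
      Ints := [10, 20, 30];
      S := 'abc';
      // Length counts elements, never bytes, for arrays and strings alike.
      Writeln(Length(Ints)); // 3 elements (12 bytes of payload)
      Writeln(Length(S));    // 3 elements (6 bytes of payload)
    end.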
Why does Length() not do what is expected, and only return the length of the parameter (as in, the number of elements) instead of giving the size in bytes in some cases?
It always returns an element count. It so happens that when the element size is one byte, the element count and the byte count are the same. In fact, the documentation you refer to also contains the following: Returns the number of characters in a string or of elements in an array. That is the key text. The excerpt you included is intended to illustrate the implications of that text.
Why does he claim that it divides the result by 2 for Unicode (UTF-16) strings? AFAIK UTF-16 uses up to 4 bytes per code point, and thus this would give incorrect results.
UTF-16 character elements are always 16 bits wide. However, some Unicode code points require two character elements to encode. These pairs of character elements are called surrogate pairs.
You are hoping, I think, that Length will return the number of code points in the string. But it does not. It returns the number of character elements. And for variable-length encodings, the number of code points does not necessarily match the number of character elements. If your string were encoded as UTF-32, the number of code points would be the same as the number of character elements, because UTF-32 is a fixed-size encoding.
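For example, a code point outside the Basic Multilingual Plane occupies two character elements. A hypothetical sketch, using an explicit surrogate pair so the source file encoding does not matter:

    program SurrogateDemo; // hypothetical name
    {$APPTYPE CONSOLE}
    var
      S: string;
    begin
      // U+1D11E (musical symbol G clef) is encoded in UTF-16 as the
      // surrogate pair $D834 $DD1E: two character elements, one code point.
      S := #$D834#$DD1E;
      Writeln(Length(S)); // 2, even though the string holds a single code point
    end.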
A way to count code points is to step through the string checking for surrogate pairs. When you encounter a surrogate pair, count one code point. Otherwise, when you encounter a character element that is not part of a surrogate pair, count one code point. In pseudo-code:
    N := 0;
    for C in S do
      if C.IsSurrogate then
        inc(N)
      else
        inc(N, 2);
    CodePointCount := N div 2;
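Wrapped in a compilable function, the same logic might look like this (a sketch: the name CodePointCount is mine, and it assumes the Char helper in System.Character for IsSurrogate):

    program CodePointCountDemo; // hypothetical name
    {$APPTYPE CONSOLE}
    uses
      System.Character;

    function CodePointCount(const S: string): Integer;
    var
      C: Char;
      N: Integer;
    begin
      // Each element of a surrogate pair contributes 1; every other element
      // contributes 2. Dividing by 2 then yields one count per code point.
      N := 0;
      for C in S do
        if C.IsSurrogate then
          Inc(N)
        else
          Inc(N, 2);
      Result := N div 2;
    end;

    begin
      Writeln(CodePointCount(#$D834#$DD1E + 'a')); // 2: one astral code point plus 'a'
    end.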
Another point is that the number of code points does not match the number of visible characters. Some code points are combining characters, and they combine with their neighbouring code points to form a single visible character, or glyph.
Finally, if all you are hoping to do is find the byte size of the string payload, use this expression:
    Length(S) * SizeOf(S[1])
This expression works for all types of strings.
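A small sketch (hypothetical console demo) of that expression applied to two string types:

    program PayloadSizeDemo; // hypothetical name
    {$APPTYPE CONSOLE}
    var
      A: AnsiString;
      U: string;
    begin
      A := 'abc';
      U := 'abc';
      // SizeOf is resolved at compile time from the element type of S[1],
      // so no element is actually read.
      Writeln(Length(A) * SizeOf(A[1])); // 3: three 1-byte elements
      Writeln(Length(U) * SizeOf(U[1])); // 6: three 2-byte elements
    end.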
Be very careful with the System.SysUtils.ByteLength function. At first glance, it looks like just what you want. However, that function returns the byte length of a UTF-16 encoded string. So if you pass it an AnsiString, say, the value returned by ByteLength will be twice the number of elements in the AnsiString.
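For example (hypothetical demo), the implicit conversion is what makes ByteLength misleading for an AnsiString:

    program ByteLengthCaveat; // hypothetical name
    {$APPTYPE CONSOLE}
    uses
      System.SysUtils;
    var
      A: AnsiString;
    begin
      A := 'abc';
      // ByteLength takes a UnicodeString, so A is converted to UTF-16 first
      // and the result describes that converted copy, not the AnsiString.
      Writeln(ByteLength(A));            // 6
      Writeln(Length(A) * SizeOf(A[1])); // 3: the actual payload size of A
    end.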