How to manipulate substrings, not subarrays, of a UnicodeString?

I am testing a migration from Delphi 5 to XE. Since I am not familiar with UnicodeString, I would like to give some background before asking my question.

The Delphi XE string-oriented functions Copy, Delete and Insert take an Index parameter that tells them where to start the operation. Index can be any integer from 1 up to the length of the string the function is applied to. Since a string can contain multi-element characters, an operation can begin at an element (a byte of a multi-byte sequence, or a surrogate) that belongs to a multi-element series encoding a single Unicode code point. So, starting from a reasonable string and applying one of these functions, we can get an unreasonable result.

The phenomenon can be illustrated by the following cases, which apply the Copy function to strings representing the same array of named code points (i.e. significant characters):

($61, $13000, $63) 

This is the concatenation of 'a', EGYPTIAN_HIEROGLYPH_A001 and 'c'.

[image: the rendered three-character string]

Case 1. Copy of AnsiString (element = byte)

Let's start with the aforementioned UnicodeString #$61#$13000#$63 and convert it to an AnsiString s0 encoded as UTF-8.

Then we check the function

  copy (s0, index, 1) 

for all possible index values; there are 6 of them, since s0 has a length of 6 bytes.

  procedure Copy_Utf8Test;
  type
    TAnsiStringUtf8 = type AnsiString(CP_UTF8);
  var
    ss: string;
    s0, s1: TAnsiStringUtf8;
    ii: Integer;
  begin
    ss := #$61#$13000#$63; // mem dump of ss: $61 $00 $0C $D8 $00 $DC $63 $00
    s0 := ss;              // mem dump of s0: $61 $F0 $93 $80 $80 $63
    ii := Length(s0);      // sets ii=6 (bytes)
    s1 := Copy(s0, 1, 1);  // 'a'
    s1 := Copy(s0, 2, 1);  // #$F0 F means "start of 4-byte series"; no corresponding named code-point
    s1 := Copy(s0, 3, 1);  // #$93 "trailing in multi-byte series"; no corresponding named code-point
    s1 := Copy(s0, 4, 1);  // #$80 "trailing in multi-byte series"; no corresponding named code-point
    s1 := Copy(s0, 5, 1);  // #$80 "trailing in multi-byte series"; no corresponding named code-point
    s1 := Copy(s0, 6, 1);  // 'c'
  end;

The first and last results are reasonable in the UTF-8 codepage, while the other 4 are not.

Case 2. Copy of UnicodeString (element = word)

Let's start with the same UnicodeString s0 := #$61#$13000#$63 .

Then we check the function

  copy (s0, index, 1) 

for all possible index values; there are 4 of them, since s0 has a length of 4 words.

  procedure Copy_Utf16Test;
  var
    s0, s1: string;
    ii: Integer;
  begin
    s0 := #$61#$13000#$63; // mem dump of s0: $61 $00 $0C $D8 $00 $DC $63 $00
    ii := Length(s0);      // sets ii=4 (words, not bytes)
    s1 := Copy(s0, 1, 1);  // 'a'
    s1 := Copy(s0, 2, 1);  // #$D80C surrogate pair member; no corresponding named code-point
    s1 := Copy(s0, 3, 1);  // #$DC00 surrogate pair member; no corresponding named code-point
    s1 := Copy(s0, 4, 1);  // 'c'
  end;

The first and last results are reasonable within the code page CP_UNICODE (1200), while the other 2 are not.

Conclusion

The string-oriented functions Copy, Delete and Insert work fine on a string considered as a mere array of bytes or words. But they are not helpful if the string is considered as what it essentially is, i.e. a representation of an array of named code points.

Both cases above deal with strings that represent the same array of 3 named code points. I regard them as representations (encodings) of the same text consisting of 3 significant characters (a wording chosen to avoid abusing the term "characters").

One may want to extract (copy) any of those significant characters, regardless of whether a particular representation (encoding) of the text uses one element or several for it. I have spent quite some time looking for a satisfactory equivalent of the Copy I was used to in Delphi 5.

Question. Do such equivalents exist, or should I write them myself?

+3
2 answers

What you have described is how Copy(), Delete() and Insert() have ALWAYS worked, even for AnsiString. The functions operate on elements (i.e. code units in Unicode terminology) and always have.

AnsiString is a string of 8-bit AnsiChar elements, which can be encoded in any 8-bit ANSI/MBCS format, including UTF-8.

UnicodeString (and WideString) is a string of 16-bit WideChar elements, which are encoded in UTF-16.

The functions DO NOT take encoding into account. Not for an MBCS AnsiString. Not for a UTF-16 UnicodeString. The indexes are absolute element indexes from the beginning of the string.

If you need encoding-aware Copy/Delete/Insert functions that operate on logical code point boundaries, where each code point may occupy 1 or more elements in the string, then you have to write your own functions or find third-party functions that do what you need. The RTL has no MBCS/UTF-aware versions of these functions.
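As an illustration of what "write your own" could look like for a UTF-16 UnicodeString, here is a minimal sketch. The names IsHighSurrogateUnit, IsLowSurrogateUnit, CodePointUnits and CopyCodePoints are hypothetical helpers made up for this example, not RTL functions. It assumes Index and Count are measured in code points, and it treats an unpaired surrogate as a single one-element code point.

```pascal
program CodePointCopyDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: True if W is a UTF-16 high (leading) surrogate.
function IsHighSurrogateUnit(W: WideChar): Boolean;
begin
  Result := (Ord(W) >= $D800) and (Ord(W) <= $DBFF);
end;

// Hypothetical helper: True if W is a UTF-16 low (trailing) surrogate.
function IsLowSurrogateUnit(W: WideChar): Boolean;
begin
  Result := (Ord(W) >= $DC00) and (Ord(W) <= $DFFF);
end;

// Hypothetical helper: number of WideChar elements (1 or 2) occupied by
// the code point that starts at element index I. Relies on the default
// short-circuit Boolean evaluation so S[I + 1] is only read when it exists.
function CodePointUnits(const S: UnicodeString; I: Integer): Integer;
begin
  if IsHighSurrogateUnit(S[I]) and (I < Length(S)) and
     IsLowSurrogateUnit(S[I + 1]) then
    Result := 2
  else
    Result := 1;
end;

// Hypothetical code-point-aware Copy: Index and Count are counted in
// code points, not in WideChar elements.
function CopyCodePoints(const S: UnicodeString;
  Index, Count: Integer): UnicodeString;
var
  I, CP, StartPos: Integer;
begin
  Result := '';
  if (Index < 1) or (Count < 1) then
    Exit;

  // Skip the first Index - 1 code points.
  I := 1;
  CP := 1;
  while (I <= Length(S)) and (CP < Index) do
  begin
    Inc(I, CodePointUnits(S, I));
    Inc(CP);
  end;
  StartPos := I;

  // Advance over at most Count code points.
  CP := 0;
  while (I <= Length(S)) and (CP < Count) do
  begin
    Inc(I, CodePointUnits(S, I));
    Inc(CP);
  end;

  // The element-based Copy is safe here: both ends are on boundaries.
  Result := Copy(S, StartPos, I - StartPos);
end;

var
  S: UnicodeString;
begin
  S := #$61#$13000#$63;
  // Second code point is the hieroglyph: both halves of the surrogate pair.
  Writeln(Length(CopyCodePoints(S, 2, 1))); // 2 elements
  // Third code point is 'c', although it sits at element index 4.
  Writeln(CopyCodePoints(S, 3, 1) = 'c');   // TRUE
end.
```

Delete and Insert equivalents could follow the same pattern: translate the code point index into an element index first, then call the element-based RTL function. Note that this only respects code point boundaries; combining sequences made of several code points are a separate problem.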

+4

You have to parse the Unicode string yourself. Fortunately, the Unicode encodings are designed to make parsing easy. Here is an example of how to parse a UTF-8 string:

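  program Project9;

  {$APPTYPE CONSOLE}

  uses
    SysUtils;

  // Returns the size in bytes of the first code point in S,
  // judged from its lead byte, or -1 if that byte cannot start a sequence.
  // S must not be empty.
  function GetFirstCodepointSize(const S: UTF8String): Integer;
  var
    B: Byte;
  begin
    B := Byte(S[1]);
    if (B and $80) = 0 then Result := 1         // 0xxxxxxx: single byte
    else if (B and $E0) = $C0 then Result := 2  // 110xxxxx: 2-byte lead
    else if (B and $F0) = $E0 then Result := 3  // 1110xxxx: 3-byte lead
    else if (B and $F8) = $F0 then Result := 4  // 11110xxx: 4-byte lead
    else Result := -1;                          // invalid code
  end;

  var
    S: string;
  begin
    // S is a UnicodeString; it is implicitly converted to UTF8String
    // when passed to GetFirstCodepointSize.
    S := #$61#$13000#$63;
    Writeln(GetFirstCodepointSize(S));
    S := #$13000#$63;
    Writeln(GetFirstCodepointSize(S));
    S := #$63;
    Writeln(GetFirstCodepointSize(S));
    Readln;
  end.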
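The same lead-byte test can be applied repeatedly to walk a whole UTF-8 string. The sketch below is only an illustration of that idea: Utf8SeqSize and Utf8CodePointAt are hypothetical names, not RTL functions, and the code assumes the input is well-formed apart from the lead-byte and truncation checks shown (it does not validate continuation bytes or overlong forms).

```pascal
program Utf8WalkDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: size in bytes of the UTF-8 sequence whose lead
// byte is B, or -1 if B cannot start a sequence.
function Utf8SeqSize(B: Byte): Integer;
begin
  if (B and $80) = 0 then Result := 1
  else if (B and $E0) = $C0 then Result := 2
  else if (B and $F0) = $E0 then Result := 3
  else if (B and $F8) = $F0 then Result := 4
  else Result := -1;
end;

// Hypothetical helper: returns the bytes of the N-th code point (1-based)
// of S, or '' if N is out of range or a bad/truncated sequence is met.
function Utf8CodePointAt(const S: UTF8String; N: Integer): UTF8String;
var
  I, Size, CP: Integer;
begin
  Result := '';
  I := 1;
  CP := 1;
  while I <= Length(S) do
  begin
    Size := Utf8SeqSize(Byte(S[I]));
    // Stop on an invalid lead byte or a sequence cut off by the string end.
    if (Size < 1) or (I + Size - 1 > Length(S)) then
      Exit;
    if CP = N then
    begin
      Result := Copy(S, I, Size); // byte-based Copy, on a boundary
      Exit;
    end;
    Inc(I, Size);
    Inc(CP);
  end;
end;

var
  W: string;
  U: UTF8String;
begin
  W := #$61#$13000#$63;
  U := UTF8String(W); // explicit UTF-16 to UTF-8 conversion
  Writeln(Length(Utf8CodePointAt(U, 2))); // 4 bytes: $F0 $93 $80 $80
  Writeln(Length(Utf8CodePointAt(U, 3))); // 1 byte: 'c'
end.
```

Walking from the start makes each lookup linear in the string length, which is the price of indexing a variable-width encoding by code point.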
+2
