String to byte array in UTF-8?

Question

How do I convert a WideString (or another long string type) to a UTF-8 byte array?

+7

utf-8 freepascal lazarus

Mariusz Mar 08 '11 at 14:01

6 answers

You can use TEncoding.UTF8.GetBytes, declared in SysUtils.pas.
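As a minimal sketch of that call, assuming Delphi 2009 or later (where TEncoding is available and string is UnicodeString), the program below converts a WideString by way of a string variable. The program and variable names are only illustrative.

```pascal
program GetBytesDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  ws: WideString;
  S: string;
  Bytes: TBytes;
begin
  ws := 'abc';
  S := ws; // implicit WideString -> UnicodeString conversion (Delphi 2009+)
  Bytes := TEncoding.UTF8.GetBytes(S);
  Writeln(Length(Bytes)); // number of UTF-8 bytes produced
end.
```

GetBytes returns a TBytes without a trailing zero, so append one yourself if the consumer expects a null-terminated buffer.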

+8

Mikael Eriksson Mar 08 '11 at 14:53

If you are using Delphi 2009 or later (Unicode versions), converting WideString to UTF8String is a simple assignment statement:

    var
      ws: WideString;
      u8s: UTF8String;

    u8s := ws;

The compiler inserts the appropriate library call to perform the conversion, because it knows that values of type UTF8String have the code page CP_UTF8.

In Delphi 7 and later, you can use the provided library function Utf8Encode. For earlier versions, you can get this function from third-party libraries such as the JCL.

You can also write your own conversion function using the Windows API:

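    function CustomUtf8Encode(const ws: WideString): UTF8String;
    var
      n: Integer;
    begin
      Result := '';
      if ws = '' then
        Exit; // WideCharToMultiByte fails when given a zero-length input
      // First call: ask how many bytes the UTF-8 form needs.
      n := WideCharToMultiByte(CP_UTF8, 0, PWideChar(ws), Length(ws), nil, 0, nil, nil);
      Win32Check(n <> 0);
      SetLength(Result, n);
      // Second call: perform the conversion into the result buffer.
      n := WideCharToMultiByte(CP_UTF8, 0, PWideChar(ws), Length(ws), PAnsiChar(Result), n, nil, nil);
      Win32Check(n = Length(Result));
    end;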

In most cases, you can just use the UTF8String as though it were an array, but if you really need a byte array, you can use the functions from David's and Cosmin's answers. If you write your own character conversion function, you can skip UTF8String and go directly to the byte array; just change the return type to TBytes or array of Byte. (You might want to increase the length by one if you want the array to be null-terminated. SetLength does this implicitly for a string, but not for an array.)
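A sketch of that direct-to-array variant, assuming Delphi on Windows with TBytes available from SysUtils; the name WideStringToUtf8Bytes is hypothetical and not part of any library. It uses the same two-call WideCharToMultiByte pattern, but writes into the array instead of a string.

```pascal
program WideToUtf8BytesDemo;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

// Hypothetical helper: converts straight into a byte array,
// without an intermediate UTF8String. No null terminator is added.
function WideStringToUtf8Bytes(const ws: WideString): TBytes;
var
  n: Integer;
begin
  Result := nil;
  if ws = '' then
    Exit; // WideCharToMultiByte fails on a zero-length input

  // First call: query the required size in bytes.
  n := WideCharToMultiByte(CP_UTF8, 0, PWideChar(ws), Length(ws), nil, 0, nil, nil);
  Win32Check(n <> 0);

  // Second call: convert into the array.
  SetLength(Result, n);
  n := WideCharToMultiByte(CP_UTF8, 0, PWideChar(ws), Length(ws),
    PAnsiChar(@Result[0]), n, nil, nil);
  Win32Check(n = Length(Result));
end;

var
  ws: WideString;
  Bytes: TBytes;
begin
  ws := 'abc';
  Bytes := WideStringToUtf8Bytes(ws);
  Writeln(Length(Bytes));
end.
```

If a trailing zero is needed, call SetLength(Result, n + 1) and set the last element to 0 after the conversion.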

If you have another type of string that is neither WideString, UnicodeString, nor UTF8String, then the way to convert it to UTF-8 is to convert it to WideString or UnicodeString first and then convert that to UTF-8.

+5

Rob Kennedy Mar 08 '11 at 15:01

    var
      S: UTF8String;
      B: TBytes;
    begin
      S := 'Șase sași în șase saci';
      SetLength(B, Length(S)); // Length(S) = 26 for this 22-character string.
      CopyMemory(@B[0], @S[1], Length(S));
    end.

Depending on what you need the bytes for, you might want to include the null terminator.

For production code, make sure you test for an empty string. Adding the 3-4 lines of code that requires would just make the sample harder to read.

+4

Cosmin Prund Mar 08 '11 at 14:09

I have the following two functions (the source code can be downloaded from http://www.csinnovations.com/framework_utilities.htm):

function CsiBytesToStr(const pInData: TByteDynArray; pStringEncoding: TECsiStringEncoding; pIncludesBom: Boolean): string;

function CsiStrToBytes(const pInStr: string; pStringEncoding: TECsiStringEncoding; pIncludeBom: Boolean): TByteDynArray;

+1

Misha Mar 08 '11 at 23:51

WideString → UTF-8:

http://www.freepascal.org/docs-html/rtl/system/utf8encode.html

The opposite direction:

http://www.freepascal.org/docs-html/rtl/system/utf8decode.html

Note that assigning a WideString to an AnsiString on pre-D2009 systems (including the current Free Pascal) converts to the local ANSI encoding, which garbles characters that encoding cannot represent.

For the TBytes part, see Rob Kennedy's remark above.
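Putting the two pieces together, here is a sketch that encodes with UTF8Encode and then copies the result into a byte array. It assumes a Free Pascal or Delphi version whose SysUtils declares TBytes; the function name WideToUtf8Bytes is hypothetical.

```pascal
program Utf8EncodeDemo;

{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}

uses
  SysUtils;

// Hypothetical helper: encode with UTF8Encode, then copy the bytes out.
function WideToUtf8Bytes(const ws: WideString): TBytes;
var
  u8: UTF8String;
begin
  u8 := UTF8Encode(ws);
  SetLength(Result, Length(u8));
  if Length(u8) > 0 then
    Move(u8[1], Result[0], Length(u8));
end;

var
  ws: WideString;
  Bytes: TBytes;
begin
  // Build the string from code points so the source file encoding does not matter.
  ws := 'na';
  ws := ws + WideChar($00EF); // U+00EF, two bytes in UTF-8
  ws := ws + 've';
  Bytes := WideToUtf8Bytes(ws);
  Writeln(Length(Bytes)); // 6: five characters, one of which needs two bytes
end.
```

Going through UTF8Encode explicitly avoids the implicit WideString-to-AnsiString conversion described above.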

0

Marco van de Voort Mar 09 '11 at 12:57

David Heffernan · Accepted Answer · 2011-03-08T14:20:08+0000

A function like this will do what you need:

    function UTF8Bytes(const s: UTF8String): TBytes;
    begin
      Assert(StringElementSize(s)=1);
      SetLength(Result, Length(s));
      if Length(Result)>0 then
        Move(s[1], Result[0], Length(s));
    end;

You can call it with any type of string, and the RTL will convert from the encoding of the string that is passed to UTF-8. So don't be misled into thinking you must convert to UTF-8 before calling; just pass in any string and let the RTL do the work.

After that, it is a fairly standard array copy. Note the assertion, which makes explicit the assumption about the element size of a UTF-8 encoded string.

If you want to get a null terminator, you should write it like this:

    function UTF8Bytes(const s: UTF8String): TBytes;
    begin
      Assert(StringElementSize(s)=1);
      SetLength(Result, Length(s)+1);
      if Length(s)>0 then // guard on s: Result is never empty here
        Move(s[1], Result[0], Length(s));
      Result[high(Result)] := 0;
    end;
