Can I get one canonical UTF-8 string from a Unicode string?

Question

I have a twelve-year-old Windows program. As may be obvious to those in the know, it was designed for ASCII characters, not Unicode. Most of it has been converted, but there is one place that still needs to be changed. There is a serious constraint, however: the exact same ~~ASCII~~ byte sequence MUST be created by different encoders, some of which will run on systems other than Windows.

I am trying to determine whether UTF-8 will do the trick. I have heard in passing that different UTF-8 sequences can represent the same Unicode string, which would be a problem here.

So the question is: given the Unicode string, can I expect one canonical UTF-8 sequence to be generated by any standards-compliant converter implementation? Or are there several possibilities?

unicode utf-8

Head geek Nov 12 '10 at 15:21

2 answers

, UTF-8, .

- " ". UTF-16 - . Java - ( 3- 4- ). MySQL , .

, , U + FFFF. , , "" : -)

. , .

Mihai Nita 13 . '10 9:29

John Knoeller · Accepted Answer · 2010-11-12T15:35:57+0000

Any given Unicode string, meaning a specific sequence of code points, has exactly one valid representation in UTF-8.

I think the confusion here is that Unicode offers several ways to get the same visual output in some languages, for example a precomposed accented letter versus a base letter followed by a combining mark. Not to mention that Unicode has several characters with no visual representation at all.

But this has nothing to do with UTF-8; it is a property of Unicode itself. Encoding Unicode as UTF-8 is a purely mechanical process, and it is perfectly reversible.
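To make the distinction concrete, here is a minimal Python sketch. It shows that two visually identical strings made of different code points encode to different UTF-8 bytes, while each one round-trips exactly. The helper `canonical_utf8` is a hypothetical name, and normalizing to NFC before encoding is my assumption about how one might get identical bytes on every platform; the answer itself does not prescribe a normalization form.

```python
import unicodedata


def canonical_utf8(text: str) -> bytes:
    """Hypothetical helper: normalize to NFC (an assumed choice), then encode."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


precomposed = "\u00e9"   # "é" as one code point (LATIN SMALL LETTER E WITH ACUTE)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Same visual output, different code points, hence different UTF-8 bytes.
precomposed_bytes = precomposed.encode("utf-8")   # b"\xc3\xa9"
decomposed_bytes = decomposed.encode("utf-8")     # b"e\xcc\x81"

# The encoding step itself is mechanical and reversible for each string.
round_trip_ok = (
    precomposed_bytes.decode("utf-8") == precomposed
    and decomposed_bytes.decode("utf-8") == decomposed
)

# After normalization, both spellings yield the same byte sequence.
same_after_nfc = canonical_utf8(precomposed) == canonical_utf8(decomposed)
```

If the requirement is byte-for-byte agreement between different systems, the thing to pin down is the normalization form applied before encoding, since the UTF-8 step adds no ambiguity of its own.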

See: http://en.wikipedia.org/wiki/UTF-8
