How to create a UTF-8 string literal in Visual C++ 2008

In VC++ 2003, I could just save the source file as UTF-8 and the strings were used as is. In other words, the following code would print the strings exactly as they are to the console. If the source file was saved as UTF-8, the output would be UTF-8.

printf("Chinese (Traditional)"); printf("中国語 (繁体)"); printf("중국어 (번체)"); printf("Chinês (Tradicional)"); 

I saved the file in UTF-8 format with a UTF-8 BOM. However, compiling with VC2008 results in:

warning C4566: character represented by universal-character-name '\uC911' cannot be represented in the current code page (932)
warning C4566: character represented by universal-character-name '\uAD6D' cannot be represented in the current code page (932)
etc.

The characters that cause these warnings are corrupted. Those that do fit in the locale (in this case 932 = Japanese) are converted to the locale encoding, i.e. Shift-JIS.

I cannot find a way to get VC++ 2008 to compile this for me. Note that it does not matter which locale I use in the source file. There does not appear to be a locale that says "I know what I'm doing, so don't change my string literals." In particular, the useless UTF-8 pseudo-locale does not work:

 #pragma setlocale(".65001") => error C2175: '.65001' : invalid locale 

Also, "C" is not executed:

 #pragma setlocale("C") => see warnings above (in particular locale is still 932) 

It seems that VC2008 forces all characters into the specified (or default) locale, and that locale cannot be UTF-8. I do not want to change the file to use escape sequences like "\xbf\x11...", because the same source is compiled using gcc, which is quite happy with UTF-8 files.

Is there any way to specify that compilation of the source file should leave the string literals intact?

To phrase it differently: what compile flags can I use to specify backward compatibility with VC2003 when compiling this source file? That is, do not modify the string literals, use them byte for byte as they are.

Update

Thanks for the suggestions, but I want to avoid wchar. Since this application deals only with strings in UTF-8, using wchar would require me to convert all strings back to UTF-8, which is unnecessary. All input, output, and internal processing is in UTF-8. It is a simple program that works fine both on Linux and when compiled with VC2003. I want to be able to compile the same program with VC2008 and have it work.

For this to happen, I need VC2008 not to try to convert the source to the local code page (Japanese, 932). I want VC2008 to be backward compatible with VC2003: a locale or compiler setting that says strings are used as is, essentially as opaque char arrays, or as UTF-8. It looks like I may be stuck with VC2003 and gcc; VC2008 is trying to be too smart here.

+58
c++ visual-c++ utf-8
Mar 27 '09 at 6:48
17 answers

Update:

I decided there is no guaranteed way to do this. The solution I present below works for the English version of VC2003, but fails when compiling with the Japanese version of VC2003 (or perhaps it is the Japanese OS). In any case, it cannot be depended on to work. Note that even declaring everything as L"" strings did not work (and is painful in gcc, as described below).

Instead, I believe you just need to bite the bullet and move all the text into a data file and load it from there. I now store and access the text in INI files via SimpleIni (a cross-platform INI file library). At least it is guaranteed to work, since all the text is out of the program.
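For reference, loading such a string with SimpleIni looks roughly like this (a minimal sketch; the file name, section, and key are made up for illustration):

#include "SimpleIni.h"
#include <cstdio>

int main()
{
    CSimpleIniA ini;      // char-based (UTF-8) interface
    ini.SetUnicode();     // store and return data as UTF-8

    if (ini.LoadFile("strings.ini") < 0)
        return 1;

    // hypothetical section/key; GetValue returns the raw UTF-8 bytes
    const char *text = ini.GetValue("ui", "greeting", "");
    std::printf("%s\n", text);
    return 0;
}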

Original:

I am answering this myself, since only Evan seemed to understand the problem. The answers about what Unicode is and how to use wchar_t are not relevant to this problem; this is not about internationalization, but about a misunderstanding of Unicode and character encodings. I appreciate your attempts to help, though, and I apologize if I was not clear enough.

The problem is that I have source files that need to be cross-compiled under various platforms and compilers. The program processes UTF-8 only; it does not care about any other encodings. I want string literals in UTF-8, which currently works with gcc and VC2003. How do I do this with VC2008? (i.e. a backward-compatible solution).

Here is what I found:

gcc (v4.3.2 20081105):

  • string literals are used as is (raw strings)
  • supports UTF-8 encoded source files
  • source files must not have a UTF-8 BOM

VC2003:

  • string literals are used as is (raw strings)
  • supports UTF-8 encoded source files
  • source files may or may not have a UTF-8 BOM (it doesn't matter)

VC2005+:

  • string literals are massaged by the compiler (no raw strings)
  • char string literals are transcoded to the specified locale
  • UTF-8 is not supported as a target locale
  • source files must have a UTF-8 BOM

So the simple answer is that for this specific purpose, VC2005+ is broken and does not provide a backward-compatible compile path. The only way to get Unicode strings into the compiled program is via UTF-8 + BOM + wchar, which means that I need to convert all strings back to UTF-8 at time of use.

There is no simple cross-platform method of converting wchar to UTF-8; for example, what size and encoding is wchar_t? On Windows, UTF-16. On other platforms? It varies. For details, see:
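To illustrate just the size part of that problem, a trivial sketch (my addition):

#include <cstdio>

int main()
{
    // 2 bytes (UTF-16 code units) on Windows, typically 4 (UTF-32) on
    // Linux and most other platforms.
    std::printf("sizeof(wchar_t) = %u bytes\n",
                static_cast<unsigned>(sizeof(wchar_t)));
    return 0;
}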

+31
Mar 30 '09 at 4:52

Brofield,

I had the same problem, and I just stumbled upon a solution that does not require converting the source strings to wide characters and back: save the source file as UTF-8 without a signature (BOM), and VC2008 will leave it alone. It worked great once I ditched the signature. To summarize:

"Unicode (UTF-8 without signature) - Codepage 65001" does not trigger the C4566 warning in VC2008 and does not cause VC to transcode, while "Codepage 65001 (UTF-8 with signature)" triggers C4566 (as you found).

I hope it is not too late to help you, but removing the workaround may speed up your VC2008 application.

+16
Sep 01 '09 at 1:51

While it is probably better to use wide strings and then convert to UTF-8 as needed, I think your best bet is, as you already mentioned, to use hexadecimal escape sequences in the strings. Suppose you want code point \uC911; you could just do this:

 const char *str = "\xEC\xA4\x91"; 

I believe this will work fine; it is just not very readable, so if you do this, add a comment to explain it.
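For instance, a quick sanity check of those bytes (a sketch I am adding; EC A4 91 is the UTF-8 encoding of U+C911):

#include <cassert>
#include <cstring>

int main()
{
    const char *str = "\xEC\xA4\x91";  // U+C911 in UTF-8
    assert(std::strlen(str) == 3);
    assert(static_cast<unsigned char>(str[0]) == 0xEC);
    assert(static_cast<unsigned char>(str[1]) == 0xA4);
    assert(static_cast<unsigned char>(str[2]) == 0x91);
    return 0;
}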

+15
Mar 28 '09 at 10:12

File / Advanced Save Options / Encoding: "Unicode (UTF-8 without signature) - Codepage 65001"

+14
Mar 09 '10 at 19:06

Visual C++ (2005+) standard COMPILER behavior for source files:

  • CP1252 (for this example, the Western European code page):
    • "Ä" → C4 00
    • 'Ä' → C4
    • L"Ä" → 00C4 0000
    • L'Ä' → 00C4
  • UTF-8 without BOM:
    • "Ä" → C3 84 00 (= UTF-8)
    • 'Ä' → warning: multi-character constant
    • "Ω" → E2 84 A6 00 (= UTF-8, as expected)
    • L"Ä" → 00C3 0084 0000 (wrong!)
    • L'Ä' → warning: multi-character constant
    • L"Ω" → 00E2 0084 00A6 0000 (wrong!)
  • UTF-8 with BOM:
    • "Ä" → C4 00 (= CP1252, no longer UTF-8)
    • 'Ä' → C4
    • "Ω" → error: cannot be converted to CP1252!
    • L"Ä" → 00C4 0000 (correct)
    • L'Ä' → 00C4
    • L"Ω" → 2126 0000 (correct)

As you can see, the compiler processes UTF-8 files without a BOM the same way as CP1252. As a result, it is impossible to mix UTF-8 and UTF-16 strings in the compiled output. So you have to decide, for a single source code file:

  • either use UTF-8 with a BOM and generate UTF-16 strings only (i.e. always use the L prefix),
  • or use UTF-8 without a BOM and generate UTF-8 strings only (i.e. never use the L prefix);
  • 7-bit ASCII characters are not affected and can be used with or without the L prefix.

Independently of this, the EDITOR can auto-detect UTF-8 files without a BOM as UTF-8 files.
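A quick way to check which case applies to a given build is to dump the bytes the compiler actually stored for a narrow literal. A minimal sketch (my addition, assuming the source file is saved as UTF-8):

#include <cstdio>
#include <cstring>

int main()
{
    // Expect C3 84 if the literal was kept as UTF-8,
    // C4 if the source was treated as CP1252.
    const char *s = "Ä";
    for (std::size_t i = 0; i < std::strlen(s); ++i)
        std::printf("%02X ", static_cast<unsigned char>(s[i]));
    std::printf("\n");
    return 0;
}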

+8
Jul 19 '12 at 11:58

From a comment on this very nice blog post,
"Using UTF-8 as an internal representation for strings in C and C++ with Visual Studio"
=> http://www.nubaria.com/en/blog/?p=289

 #pragma execution_character_set("utf-8") 

It requires Visual Studio 2008 SP1 and the following hotfix:

http://support.microsoft.com/kb/980263
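For illustration, a minimal sketch of how the pragma is typically used (my addition; it assumes the source file is saved as UTF-8 with a BOM, per the blog post above):

#pragma execution_character_set("utf-8")

#include <cstdio>

int main()
{
    // The bytes stored in this char literal stay UTF-8 in the binary
    // instead of being transcoded to the local code page.
    const char *s = "中国語 (繁体)";
    std::printf("%s\n", s);
    return 0;
}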

+6
Feb 14

How about this: you keep the strings in a UTF-8 encoded file and then preprocess them into an ASCII-encoded C++ source file, storing the UTF-8 bytes inside the strings as hexadecimal escape sequences. The string

 "中国語 (繁体)" 

is converted to

 "\xE4\xB8\xAD\xE5\x9B\xBD\xE8\xAA\x9E (\xE7\xB9\x81\xE4\xBD\x93)" 

Of course, no human can read this, but the point is to avoid the compiler problems.

You could either use the C++ preprocessor to reference the strings in a converted header file, or convert your entire UTF-8 source to ASCII before compilation using this trick (see the sketch below).
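A rough sketch of what such a conversion step could look like (my own illustration, not from the answer; note the literal is split after each \xNN escape so a following hex digit is not absorbed into the escape):

#include <cstdio>

int main(int argc, char **argv)
{
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <utf8-file>\n", argv[0]);
        return 1;
    }
    std::FILE *in = std::fopen(argv[1], "rb");
    if (!in) {
        std::perror("fopen");
        return 1;
    }

    // Emit the file's contents as one C string literal per input line,
    // hex-escaping every non-ASCII byte.
    std::printf("\"");
    int c;
    while ((c = std::fgetc(in)) != EOF) {
        if (c == '\n')                  std::printf("\\n\"\n\"");
        else if (c == '"')              std::printf("\\\"");
        else if (c == '\\')             std::printf("\\\\");
        else if (c >= 0x20 && c < 0x7F) std::putchar(c);
        else                            std::printf("\\x%02X\" \"", c);
    }
    std::printf("\"\n");
    std::fclose(in);
    return 0;
}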

+4
Sep 15 '09 at 2:32

A portable way to convert from whatever the native encoding is, is to use std::ctype<wchar_t>::widen():

#include <locale>
#include <string>
#include <vector>

/////////////////////////////////////////////////////////
// NativeToUtf16 - Convert a string from the native
//                 encoding to Unicode UTF-16
// Parameters:
//   sNative (in): Input String
// Returns:  Converted string
/////////////////////////////////////////////////////////
std::wstring NativeToUtf16(const std::string &sNative)
{
    std::locale locNative;

    // The UTF-16 will never be longer than the input string
    std::vector<wchar_t> vUtf16(1 + sNative.length());

    // convert
    std::use_facet< std::ctype<wchar_t> >(locNative).widen(
        sNative.c_str(),
        sNative.c_str() + sNative.length(),
        &vUtf16[0]);

    // only sNative.length() characters were written; don't copy the
    // trailing NUL into the wstring
    return std::wstring(vUtf16.begin(), vUtf16.begin() + sNative.length());
}

In theory, the return journey from UTF-16 to UTF-8 should be just as simple, but I found that UTF-8 locales do not work properly on my system (VC10 Express on Win7).

So I wrote a simple converter based on RFC 3629:

/////////////////////////////////////////////////////////
// Utf16ToUtf8 - Convert a character from UTF-16
//               encoding to UTF-8.
//               NB: Does not handle surrogate pairs.
//                   Does not test for badly formed UTF-16.
// Parameters:
//   chUtf16 (in): Input char
// Returns:  UTF-8 version as a string
/////////////////////////////////////////////////////////
std::string Utf16ToUtf8(wchar_t chUtf16)
{
    // From RFC 3629
    // 0000 0000-0000 007F   0xxxxxxx
    // 0000 0080-0000 07FF   110xxxxx 10xxxxxx
    // 0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx

    // max output length is 3 bytes (plus one for the nul)
    unsigned char szUtf8[4] = "";

    if (chUtf16 < 0x80)
    {
        szUtf8[0] = static_cast<unsigned char>(chUtf16);
    }
    else if (chUtf16 < 0x800)   // 0x800, so U+07FF still encodes as 2 bytes
    {
        szUtf8[0] = static_cast<unsigned char>(0xC0 | ((chUtf16 >> 6) & 0x1F));
        szUtf8[1] = static_cast<unsigned char>(0x80 | (chUtf16 & 0x3F));
    }
    else
    {
        szUtf8[0] = static_cast<unsigned char>(0xE0 | ((chUtf16 >> 12) & 0xF));
        szUtf8[1] = static_cast<unsigned char>(0x80 | ((chUtf16 >> 6) & 0x3F));
        szUtf8[2] = static_cast<unsigned char>(0x80 | (chUtf16 & 0x3F));
    }
    return reinterpret_cast<char *>(szUtf8);
}

/////////////////////////////////////////////////////////
// Utf16ToUtf8 - Convert a string from UTF-16 encoding
//               to UTF-8
// Parameters:
//   sUtf16 (in): Input String
// Returns:  Converted string
/////////////////////////////////////////////////////////
std::string Utf16ToUtf8(const std::wstring &sUtf16)
{
    std::string sUtf8;
    std::wstring::const_iterator itr;
    for (itr = sUtf16.begin(); itr != sUtf16.end(); ++itr)
        sUtf8 += Utf16ToUtf8(*itr);
    return sUtf8;
}

I believe this should work on any platform, but I have not been able to test it anywhere except my own system, so it may have errors.

#include <iostream>
#include <fstream>

int main()
{
    const char szTest[] =
        "Das tausendschöne Jungfräulein,\n"
        "Das tausendschöne Herzelein,\n"
        "Wollte Gott, wollte Gott,\n"
        "ich wär' heute bei ihr!\n";

    std::wstring sUtf16 = NativeToUtf16(szTest);
    std::string sUtf8 = Utf16ToUtf8(sUtf16);

    std::ofstream ofs("test.txt");
    if (ofs)
        ofs << sUtf8;
    return 0;
}
+3
Dec 19

Perhaps try an experiment:

 #pragma setlocale(".UTF-8") 

or

 #pragma setlocale("english_england.UTF-8") 
+1
Sep 15 '09 at 3:15

I had a similar problem: UTF-8 string literals were being converted to the current system code page at compile time. I opened the .obj files in a hex viewer and they were already mangled; for example, the character ć was only one byte.

The solution for me was to save the source as UTF-8 WITHOUT a BOM. That is how I tricked the compiler: now it thinks the file is just ordinary source and does not translate the strings. In the .obj files, ć is now two bytes.

Ignore some of the commenters. I know what you want; I want it too: UTF-8 source, UTF-8 generated files, UTF-8 input files, UTF-8 over communication lines, all without translation.

Perhaps this helps ...

+1
Dec 18 '09 at 13:10

I know I'm late to the party, but I think I need to spread the word: for Visual C++ 2005 and above, if the source file does not contain a BOM (byte order mark) and your system locale is not English, VC will assume that your source file is not in Unicode.

To have your UTF-8 source files compile correctly, you must save them as UTF-8 without a BOM, and the system locale (the language for non-Unicode programs) must be English.


+1
May 04 '17 at 16:35

I had a similar problem; the solution was to save as UTF-8 with BOM using the Advanced Save Options.

0
Mar 28 '09 at 15:51

Things have changed, and now I have a solution.

First of all, you must run under an English local code page, so that cl.exe does not mangle the character codes.

Secondly, save the source code as UTF-8 with NO BOM (note: NO BOM), then compile with cl.exe. Don't call any C APIs such as printf or wprintf; all that stuff does not work, I don't know why :) ... maybe later ...

Then just compile and run, and you will see the result ... my email is luoyonggang, (google) hope it helps some ...

The wscript (waf build script):

#! /usr/bin/env python
# encoding: utf-8
# Yonggang Luo

# the following two variables are used by the target "waf dist"
VERSION = '0.0.1'
APPNAME = 'cc_test'

top = '.'

import waflib.Configure

def options(opt):
    opt.load('compiler_c')

def configure(conf):
    conf.load('compiler_c')
    conf.check_lib_msvc('gdi32')
    conf.check_libs_msvc('kernel32 user32')

def build(bld):
    bld.program(
        features = 'c',
        source = 'chinese-utf8-no-bom.c',
        includes = '. ..',
        cflags = ['/wd4819'],
        target = 'myprogram',
        use = 'KERNEL32 USER32 GDI32')

The run script, run.bat:

rd /s /q build
waf configure build --msvc_version "msvc 6.0"
build\myprogram

rd /s /q build
waf configure build --msvc_version "msvc 9.0"
build\myprogram

rd /s /q build
waf configure build --msvc_version "msvc 10.0"
build\myprogram

main.c:

//encoding : utf8 no-bom
#include <stdio.h>
#include <string.h>
#include <stdlib.h>  /* for malloc/free */
#include <Windows.h>

char* ConvertFromUtf16ToUtf8(const wchar_t *wstr)
{
    int requiredSize = WideCharToMultiByte(CP_UTF8, 0, wstr, -1, 0, 0, 0, 0);
    if (requiredSize > 0)
    {
        char *buffer = malloc(requiredSize + 1);
        buffer[requiredSize] = 0;
        WideCharToMultiByte(CP_UTF8, 0, wstr, -1, buffer, requiredSize, 0, 0);
        return buffer;
    }
    return NULL;
}

wchar_t* ConvertFromUtf8ToUtf16(const char *cstr)
{
    int requiredSize = MultiByteToWideChar(CP_UTF8, 0, cstr, -1, 0, 0);
    if (requiredSize > 0)
    {
        wchar_t *buffer = malloc((requiredSize + 1) * sizeof(wchar_t));
        printf("converted size is %d 0x%x\n", requiredSize, buffer);
        buffer[requiredSize] = 0;
        MultiByteToWideChar(CP_UTF8, 0, cstr, -1, buffer, requiredSize);
        printf("Finished\n");
        return buffer;
    }
    printf("Convert failed\n");
    return NULL;
}

void ShowUtf8LiteralString(char const *name, char const *str)
{
    int i = 0;
    wchar_t *name_w = ConvertFromUtf8ToUtf16(name);
    wchar_t *str_w = ConvertFromUtf8ToUtf16(str);

    printf("UTF8 sequence\n");
    for (i = 0; i < strlen(str); ++i)
    {
        printf("%02x ", (unsigned char)str[i]);
    }
    printf("\nUTF16 sequence\n");
    for (i = 0; i < wcslen(str_w); ++i)
    {
        printf("%04x ", str_w[i]);
    }

    // Why not use printf or wprintf? Because they do not work :)
    MessageBoxW(NULL, str_w, name_w, MB_OK);
    free(name_w);
    free(str_w);
}

int main()
{
    ShowUtf8LiteralString("English english_c", "Chinese (Traditional)");
    ShowUtf8LiteralString("简体 s_chinese_c", "你好世界");
    ShowUtf8LiteralString("繁体 t_chinese_c", "中国語 (繁体)");
    ShowUtf8LiteralString("Korea korea_c", "중국어 (번체)");
    ShowUtf8LiteralString("What? what_c", "Chinês (Tradicional)");
}
0
Jul 08 '11 at 17:20

UTF-8 source files

  • Without BOM: treated as raw, unless your system uses a code page with more than 1 byte/char (such as Shift-JIS). You would need to change the system code page to a single-byte one; then you can use Unicode characters inside literals and compile without problems (at least I hope).
  • With BOM: char and string literals are transcoded to the system code page at compile time. You can check the current system code page with GetACP() (a small sketch follows this list). AFAIK there is no way to set the system code page to 65001 (UTF-8), so there is no way to use UTF-8 directly with a BOM.
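A minimal sketch of such a check (my addition; GetACP() is the real Win32 call, the rest is illustrative):

#include <cstdio>
#include <Windows.h>

int main()
{
    // 932 = Shift-JIS (Japanese), 1252 = Western European, 65001 = UTF-8
    std::printf("Active ANSI code page: %u\n", GetACP());
    return 0;
}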

The only portable and compiler-independent way is to use ASCII encoding with escape sequences, because there is no guarantee that every compiler will accept a UTF-8 encoded file.

0
Apr 09 '13 at 9:05

I had a similar problem compiling UTF-8 (char) string literals, and what I found is that basically I had to have both the UTF-8 BOM and #pragma execution_character_set("utf-8") [1], or neither the BOM nor the pragma [2]. Using one without the other resulted in an incorrect conversion.

I wrote down the details at https://github.com/jay/compiler_string_test

[1]: Visual Studio 2012 does not support execution_character_set. Visual Studio 2010 and 2015 work fine with it, and, as noted above, with the hotfix it also works in 2008.

[2]: Some comments in this thread have noted that using neither the BOM nor the pragma may result in an incorrect conversion for developers using a multi-byte local code page (for example, Japanese).

0
Dec 08 '17 at 4:31

Read up on the subject. First of all, you do not want UTF-8. UTF-8 is just one way of representing characters. You want wide characters (wchar_t). You write them as L"yourtextgoeshere". The type of this literal is wchar_t*. If you're in a hurry, just look up wprintf.

-6
Mar 28 '09 at 21:22


