Cross-platform Unicode in C/C++: which encoding should I use?

I am currently working on a hobby project (C/C++) that should run on both Windows and Linux, with full Unicode support. Unfortunately, Windows and Linux use different default encodings, which complicates things.

In my code I try to keep the data in as universal a form as possible, to simplify the work on both Windows and Linux. On Windows, wchar_t is encoded as UTF-16 by default, and as UCS-4 on Linux (correct me if I'm wrong).
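You can verify the platform difference yourself; this tiny sketch just exposes the width of wchar_t, which is typically 2 bytes (UTF-16 code units) on Windows and 4 bytes (UTF-32/UCS-4) on Linux/glibc:

```cpp
#include <cstddef>

// sizeof(wchar_t) is implementation-defined: commonly 2 on Windows
// (UTF-16 code units) and 4 on Linux/glibc (UTF-32/UCS-4 code points).
constexpr std::size_t wchar_bytes = sizeof(wchar_t);
```

This width difference is exactly why wchar_t-based code is not portable as-is.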

My software opens files with _wfopen (UTF-16 paths) on Windows and fopen (UTF-8 paths) on Linux, and writes the data to those files in UTF-8. All of that works so far. Then I decided to use SQLite.

The SQLite C/C++ interface accepts strings in one-byte (UTF-8) or two-byte (UTF-16) encodings. Of course, this does not match wchar_t on Linux, since wchar_t on Linux is 4 bytes by default. Therefore, writing to and reading from SQLite requires a conversion on Linux.
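A minimal sketch of that conversion (the helper name to_utf8 is mine, not SQLite's): it assumes each wchar_t holds a whole code point, which is true on Linux with 4-byte wchar_t but would need surrogate-pair handling on Windows. The resulting std::string could then be handed to sqlite3_bind_text:

```cpp
#include <cstdint>
#include <string>

// Encode a wchar_t string (UCS-4 on Linux) as UTF-8 for SQLite's
// one-byte API. Assumes each wchar_t is a full Unicode code point;
// on Windows (2-byte wchar_t) surrogate pairs would need merging first.
std::string to_utf8(const std::wstring& in) {
    std::string out;
    for (wchar_t wc : in) {
        uint32_t cp = static_cast<uint32_t>(wc);
        if (cp < 0x80) {                       // 1-byte sequence (ASCII)
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2-byte sequence
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3-byte sequence
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4-byte sequence
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

In practice you would likely use a library (ICU, iconv, or the platform APIs) rather than hand-rolling this, but it shows how small the Linux-side conversion actually is.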

Currently, the code is cluttered with Windows/Linux special cases. I was hoping to stick with the standard idea of storing data in wchar_t:

  • wchar_t on Windows: file paths are no problem, and reading/writing SQLite is no problem. Data written to files should be UTF-8 anyway.
  • wchar_t on Linux: special cases for file paths because of their UTF-8 encoding, plus a conversion before reading/writing SQLite (wchar_t to UTF-8 and back), and the same conversion as on Windows when writing data to files.

After reading ( here ) I was convinced I should stick with wchar_t on Windows. But once all of that worked, the problems started when porting to Linux.

I'm currently considering redoing all of this with plain char strings in UTF-8, because that works on both Windows and Linux, accepting the fact that on Windows I need WideCharToMultiByte to convert every string to UTF-8. Using plain char* strings would also significantly reduce the number of Linux/Windows special cases.
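That "UTF-8 everywhere" idea can be sketched roughly like this (open_utf8 and widen are hypothetical helper names of mine): strings live as UTF-8 in std::string throughout the program, and are only widened at the Windows API boundary, here using MultiByteToWideChar for the path argument of _wfopen:

```cpp
#include <cstdio>
#include <cstring>
#include <string>
#ifdef _WIN32
#include <windows.h>

// Widen a UTF-8 string to UTF-16 for Windows APIs (Windows only).
std::wstring widen(const std::string& utf8) {
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    std::wstring out(n ? n - 1 : 0, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &out[0], n);
    return out;
}
#endif

// One portable entry point: the rest of the program only ever sees
// UTF-8 char strings; the #ifdef special case lives in a single place.
FILE* open_utf8(const std::string& path, const char* mode) {
#ifdef _WIN32
    std::wstring wmode(mode, mode + strlen(mode)); // mode is plain ASCII
    return _wfopen(widen(path).c_str(), wmode.c_str());
#else
    return fopen(path.c_str(), mode); // Linux paths are already UTF-8
#endif
}
```

The point is that the Windows/Linux split is confined to a handful of boundary functions instead of being scattered through the code.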

Do you have any experience with cross-platform Unicode? Any thoughts on simply storing data in UTF-8 instead of using wchar_t?

+8
linux windows cross-platform unicode wchar-t
2 answers

Using UTF-8 on all platforms, with just-in-time conversion to UTF-16 for Windows API calls, is a common tactic for cross-platform Unicode.

+6

Our software is also cross-platform, and we faced similar problems. We decided that our goal should be the fewest possible conversions. This means that we use wchar_t on Windows and char on Unix/Mac.

We do this by providing _T, LPCTSTR and the like on Unix, and by having common functions that easily convert between std::string and std::wstring. We also have a common std::basic_string<TCHAR> ( tstring ) that we use in most places.

So far this works very well. Basically, most functions take a tstring or LPCTSTR, and those that don't convert their parameters to or from tstring. This means that most of the time we don't convert our strings at all, and pass most parameters through unchanged.
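The core of this approach can be sketched as follows, assuming the non-Windows side simply defines the names that Windows' <tchar.h> provides (name_length is just an illustrative function of mine):

```cpp
#include <cstddef>
#include <string>

// On Windows, <tchar.h> supplies TCHAR and _T; on Unix/Mac we define
// equivalents so the same source compiles everywhere.
#ifdef _WIN32
#include <tchar.h>
#else
typedef char TCHAR;
#define _T(x) x
#endif

// The common string type: std::wstring on Windows, std::string elsewhere.
typedef std::basic_string<TCHAR> tstring;

// A function written once against the "native" string type of each platform.
std::size_t name_length(const tstring& s) { return s.size(); }
```

The trade-off is that code written this way compiles to different string types per platform, so any external API with a fixed encoding (like SQLite or network protocols) still needs a conversion at that one boundary.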

+2
