Thursday, 7 November 2013

UTF-8 locales in FreeBSD, Linux and Windows

1. FreeBSD

This setup uses FreeBSD 8.3 and gcc 4.6.4, with the OS terminal accessed through PuTTY.
In your home directory, edit .login_conf:

me:\
:charset=UTF-8:\
:lang=ru_RU.UTF-8:\
:setenv=LC_COLLATE=C:

Restart all PuTTY instances.
In PuTTY, set Change settings... → Translation → Remote character set = UTF-8.
Check the locale in the terminal; it should be:

$ locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_COLLATE=C
LC_TIME="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_ALL=

In your C++ program, use UTF-8 natively as char* strings and write them to std::cout. You can also convert UTF-8 strings to UTF-16 and print them to std::wcout; the output is the same as with cout. For cout you don't need to call setlocale (LC_ALL, ""); from <clocale>; wcout does need it, as noted in the Windows section below.
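As a minimal sketch of the cout route, the snippet below writes a UTF-8 byte string straight to a narrow stream. The name print_greeting and the sample word are only illustrative (they are not from the original setup); the bytes are spelled out as escapes so the example does not depend on the encoding of the source file.

```cpp
#include <iostream>
#include <string>

// "Привет" as explicit UTF-8 bytes (two bytes per Cyrillic letter).
const std::string greeting =
    "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

// Writes the bytes unchanged; a UTF-8 terminal renders them as text.
// Hypothetical helper, used here for illustration only.
void print_greeting (std::ostream& out)
{
    out << greeting << '\n';
}
```

Calling print_greeting (std::cout) from main should show the word on a terminal configured as above, with no setlocale call involved.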

Note for those who, like me, don't like Russian UIs: you can set en_US.UTF-8 instead of ru_RU.UTF-8 in .login_conf to get system messages in English. The output will be just as correct.

2. Windows

There are many approaches to working around the lack of UTF-8 support in the Windows console. The following "lazy" approach has worked for me, and it is easily portable to Unix systems. With it, you keep everything inside your software in UTF-8 and convert strings from UTF-8 to UTF-16 with the portable utf8cpp library only when writing them to the console.

It's pretty straightforward. At the very start of your program, add:
setlocale (LC_ALL, "Russian");
Use wcin and wcout, and convert UTF-8 strings to UTF-16 with utf8::utf8to16 when outputting.
There is no need to run chcp in the console!
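To make the conversion step concrete, here is a rough sketch of what a UTF-8 to UTF-16 conversion does. It is a hand-written stand-in, not utf8cpp code: the name utf8_to_utf16_sketch is hypothetical, and it assumes well-formed input with no validation. With utf8cpp itself the call would be utf8::utf8to16 (s.begin (), s.end (), std::back_inserter (w)).

```cpp
#include <string>

// Hypothetical stand-in for utf8::utf8to16: decodes well-formed UTF-8 and
// emits UTF-16 code units into a std::wstring ready for wcout.
// Malformed input is NOT detected.
std::wstring utf8_to_utf16_sketch (const std::string& in)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size ()) {
        unsigned char lead = static_cast<unsigned char> (in[i]);
        unsigned long cp;        // decoded code point
        std::size_t extra;       // number of continuation bytes
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        for (std::size_t k = 0; k < extra && i < in.size (); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char> (in[i]) & 0x3F);
        if (cp < 0x10000) {
            out.push_back (static_cast<wchar_t> (cp));
        } else {
            cp -= 0x10000;       // outside the BMP: emit a surrogate pair
            out.push_back (static_cast<wchar_t> (0xD800 + (cp >> 10)));
            out.push_back (static_cast<wchar_t> (0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Note that wchar_t is 16 bits wide on Windows but usually 32 bits on Unix systems, so surrogate pairs are only meaningful to wcout on Windows; for Russian text every character fits in a single unit either way.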

It is portable to FreeBSD in the following way. On a FreeBSD system configured as described earlier, you get the same output from wcout with UTF-16 strings as from cout with UTF-8 strings. Just try it, but don't forget to add setlocale (LC_ALL, "") or setlocale (LC_ALL, "ru_RU.UTF-8"); on FreeBSD this is important for printing with wcout.

3. Linux

You can proceed exactly as in FreeBSD. I checked this on a fresh Ubuntu Server 13.04 with gcc 4.7.3, and the Russian locale had to be installed. First, check which locales you already have with locale -a. Don't let the "utf8" part of the names confuse you: using "UTF-8" everywhere is just right, and this holds across all Unix-based systems. Second, if you don't have the Russian locales, here is how to add them:

sudo locale-gen ru_RU
sudo locale-gen ru_RU.UTF-8
sudo dpkg-reconfigure locales
sudo update-locale
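From inside a program, one way to check whether a locale is installed is to ask setlocale for it and look at the result, since setlocale returns a null pointer when the request cannot be honored. The sketch below does that for the LC_CTYPE category only and then restores the previous setting; the helper name locale_available is hypothetical.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: true if the C library accepts the given locale name.
bool locale_available (const char* name)
{
    // Copy the current setting first: the returned pointer may be
    // invalidated by the next setlocale call.
    const char* current = std::setlocale (LC_CTYPE, nullptr);
    std::string saved = current ? current : "C";

    bool ok = std::setlocale (LC_CTYPE, name) != nullptr;

    std::setlocale (LC_CTYPE, saved.c_str ());   // restore the previous locale
    return ok;
}
```

For example, locale_available ("ru_RU.UTF-8") should become true once the commands above have generated the locale.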

Compared with the FreeBSD experience, bear in mind one very important thing: avoid mixing fprintf/wprintf, cout/wcout, and so on. For example, after setting your locale, if you use cout and then wcout, the latter prints junk. Remove all uses of cout, and wcout starts to work properly. In fact, one should always avoid mixing the wide and narrow versions of output functions; it's a good rule of thumb. This bad pattern passes (or at least doesn't reveal any errors) in the Windows and FreeBSD consoles, but not in the Linux console, and the Linux behavior is undoubtedly the right one in general.
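One plausible explanation, stated here as an assumption rather than something verified on the systems above, is the C notion of stream orientation: the first narrow or wide operation on a FILE fixes it as byte-oriented or wide-oriented, and by default the C++ standard streams are synchronized with C stdout. The sketch below only queries the orientation with fwide; the helper name stream_orientation is hypothetical.

```cpp
#include <cstdio>
#include <cwchar>

// Returns <0 for a byte-oriented stream, >0 for a wide-oriented one,
// and 0 if no orientation has been set yet.
int stream_orientation (std::FILE* stream)
{
    return std::fwide (stream, 0);   // mode 0 only queries the orientation
}
```

After a first printf or a first write through cout, stream_orientation (stdout) would be expected to report a byte-oriented stream, which is consistent with later wide output misbehaving.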

Note: in the VMware console window, localized output is crappy: some symbols are OK, while others look like ◊. I don't know how to deal with this, so for now I connect to my VM through PuTTY. As a bonus, mouse input and the Russian keyboard layout work that way.

4. Combining across OS-es

In general, setting LC_ALL is not good practice, yet it works :)

There are two ways to combine the approaches above.

With setlocale on Unix systems, easily portable to Windows

1) Call setlocale (LC_ALL, "Russian") at the start of the program on Windows, and setlocale (LC_ALL, "ru_RU.UTF-8") on FreeBSD/Linux
2) Write to the console with wcout, converting output from UTF-8 with utf8::utf8to16
3) Don't use cout at all
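Step 1 of this first approach could be wrapped roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and _WIN32 is used as the usual compiler-defined macro for telling Windows builds apart.

```cpp
#include <clocale>
#include <string>

// Locale name per platform, taken from step 1 above.
std::string console_locale_name ()
{
#ifdef _WIN32
    return "Russian";
#else
    return "ru_RU.UTF-8";
#endif
}

// Call once at program start, before any output.
// Returns false if the locale is not available; setlocale then
// leaves the current locale unchanged.
bool init_console_locale ()
{
    return std::setlocale (LC_ALL, console_locale_name ().c_str ()) != nullptr;
}
```

Checking the return value is worthwhile on Linux in particular, where the Russian locale may not be installed yet.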

Without setlocale on Unix systems, not so easily portable to Windows

1) Use cout on Unix-based systems and wcout on Windows
2) Convert UTF-8 to UTF-16 with utf8::utf8to16 only on Windows
3) Call setlocale (LC_ALL, "Russian") at the start only on Windows
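A sketch of how steps 1 and 2 of this second approach might look behind a single function. The name print_utf8_line is hypothetical, and the Windows branch assumes utf8cpp's utf8.h is on the include path; the setlocale call from step 3 is assumed to have been made elsewhere at startup.

```cpp
#include <iostream>
#include <string>
#ifdef _WIN32
#include <iterator>
#include "utf8.h"   // utf8cpp, assumed to be on the include path
#endif

// Prints one line of UTF-8 text to the console.
void print_utf8_line (const std::string& text)
{
#ifdef _WIN32
    // Windows: convert to UTF-16 and use the wide stream.
    std::wstring wide;
    utf8::utf8to16 (text.begin (), text.end (), std::back_inserter (wide));
    std::wcout << wide << L'\n';
#else
    // Unix: the terminal already expects UTF-8 bytes.
    std::cout << text << '\n';
#endif
}
```

The preprocessor branches are what make this variant less convenient to port: every output path needs the same split.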

The same applies when printing to other streams (file streams, piped output, stderr).

One can avoid the drawback of LC_ALL with a function like this:

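#include <ios>
#include <locale>

// Takes basic_ios so that imbue also reaches the stream buffer (ios_base::imbue alone does not).
template <typename CharT, typename Traits>
void attach_to_rus_locale (std::basic_ios<CharT, Traits>& stream)
{
    std::locale loc ("ru_RU.UTF-8");   // throws std::runtime_error if the locale is not installed
    stream.imbue (loc);
}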

and pass std::wcout, file streams and so on to it. Unfortunately, utf8::utf8to16 also seems to be affected by the locale somehow. Someone should figure out how to deal with that.
