Why does wide file-stream in C++ narrow written data by default?

Question

Honestly, I just don't get the following design decision in the C++ Standard Library. When writing wide characters to a file, wofstream converts wchar_t into char characters:

#include <fstream>
#include <string>

int main()
{
    using namespace std;

    wstring someString = L"Hello StackOverflow!";
    wofstream file(L"Test.txt");

    file << someString; // the output file will consist of ASCII characters!
}

I am aware that this has to do with the standard codecvt. There is a codecvt for UTF-8 in Boost. Also, there is a codecvt for UTF-16 by Martin York here on SO. The question is: why does the standard codecvt convert wide characters? Why not write the characters as they are?

Also, are we gonna get real unicode streams with C++0x or am I missing something here?

Good question. I hope you can dig up an answer. Personally I'm leaning towards the "IOStreams is just a badly designed library" theory... ;) It probably doesn't help that Unicode wasn't exactly well established when the library was designed. They might have thought that serializing to/from plain chars was the most portable approach.
@jalf Thanks. I am not really proficient with streams but this question bothers me a lot :D

AProgrammer · Accepted Answer · 2009-10-02 15:10:10Z

The model used by C++ for charsets is inherited from C, and so dates back to at least 1989.

Two main points:

IO is done in terms of char.
It is the job of the locale to determine how wide chars are serialized.
The default locale (named "C") is very minimal (I don't remember the constraints from the standard; here it is able to handle only 7-bit ASCII as both the narrow and the wide character set).
There is an environment-determined locale named "".

So to get anything, you have to set the locale.

If I use the simple program

#include <locale>
#include <fstream>
#include <ostream>
#include <iostream>

int main()
{
    wchar_t c = 0x00FF;
    std::locale::global(std::locale(""));
    std::wofstream os("test.dat");
    os << c << std::endl;
    if (!os) {
        std::cout << "Output failed\n";
    }
}

which uses the environment locale and outputs the wide character with code 0x00FF to a file. If I ask to use the "C" locale, I get

$ env LC_ALL=C ./a.out
Output failed

the locale has been unable to handle the wide character, and we get notified of the problem because the IO failed. If I ask for a UTF-8 locale, I get

$ env LC_ALL=en_US.utf8 ./a.out
$ od -t x1 test.dat
0000000 c3 bf 0a
0000003

(od -t x1 just dumps the file in hex), which is exactly what I expect for a UTF-8 encoded file.
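
The same experiment can also be run without changing the global locale, by imbuing only the stream in question. The following is a minimal sketch and not part of the original answer: write_wide is a hypothetical helper name, and the locale is passed in by the caller because locale names are platform-specific.

```cpp
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: writes `text` to `path`, serializing the wide
// characters with the codecvt facet of `loc` instead of the global locale.
// Returns false if the file could not be opened or a character could not
// be converted by that locale.
bool write_wide(const std::string& path, const std::wstring& text,
                const std::locale& loc)
{
    std::wofstream os;
    os.imbue(loc);                            // set the locale before any I/O
    os.open(path.c_str(), std::ios::binary);
    os << text;
    os.flush();                               // force the conversion to happen now
    return static_cast<bool>(os);
}
```

Calling write_wide("test.dat", L"\u00FF", std::locale("")) would repeat the test above with the environment locale. Constructing std::locale("") can itself throw std::runtime_error when the environment names a locale the system does not provide.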

I bet the output failed because it was expecting another character. And the second one is not what I would expect, unless it is completely ignoring the high bits of the wchar_t. What happens if you output c = 0xABCD? Is it encoding the CD into UTF-8 and ignoring the AB, or is the whole thing encoded? What happens when the UTF-8 character is three bytes long?
Also I get different results. C: (ff 0a) en_US.utf8: (std::runtime_error[locale::facet::_S_create_c_locale name not valid])
I don't understand why C3 BF isn't the encoding of 0x00FF you were expecting. And for 0xABCD it gives EA AF 8D, which is what I expected. What I didn't expect is that it allowed 0xDCBA (it is a surrogate and not a valid code point) and other invalid code points.
Locale names are not standardized, so you'd need to find out whether you have a UTF-8 locale, what its name is, and how to set it (under POSIX, running locale -a in the shell gives you a list). I don't have time now to find out what the constraints on the "C" locale are for wide characters -- I guess it is implementation-defined.
OK. So it is converting UCS-2 (internal fixed width format) into UTF-8 (External multibyte format). That makes some sense. Note: UCS-2 does not support surrogate pairs (it just encodes them as code points so there is not loss of information when transforming between UTF-16 and UCS-2).
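
The byte sequences mentioned in these comments can be checked by hand. The following is a small illustrative encoder, assuming a code point no larger than 0x10FFFF; encode_utf8 is a hypothetical name and not something the standard library provides. Like the conversion observed above, it does not reject surrogates.

```cpp
#include <string>

// Hypothetical helper: encodes one code point (assumed <= 0x10FFFF) as UTF-8.
// No validation is done, so surrogate values are encoded like any other.
std::string encode_utf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {                          // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                  // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, encode_utf8(0x00FF) yields the bytes C3 BF and encode_utf8(0xABCD) yields EA AF 8D, matching the od output and the comment above.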

Éric Malenfant · Answer 2 · 2009-10-02 13:21:05Z

A very partial answer for the first question: a file is a sequence of bytes so, when dealing with wchar_t's, at least some conversion between wchar_t and char must occur. Making this conversion "intelligently" requires knowledge of the character encodings, which is why this conversion is allowed to be locale-dependent, by virtue of using a facet in the stream's locale.

Then, the question is how that conversion should be made in the only locale required by the standard: the "classic" one. There is no "right" answer for that, and the standard is thus very vague about it. I understand from your question that you assume that blindly casting (or memcpy()-ing) between wchar_t[] and char[] would have been a good way. This is not unreasonable, and is in fact what is (or at least was) done in some implementations.

Another POV would be that, since a codecvt is a locale facet, it is reasonable to expect that the conversion is made using the "locale's encoding" (I'm handwavy here, as the concept is pretty fuzzy). For example, one would expect a Turkish locale to use ISO-8859-9, or a Japanese one to use Shift JIS. By similarity, the "classic" locale would convert to this "locale's encoding". Apparently, Microsoft chose to simply trim (which leads to ISO-8859-1 if we assume that wchar_t represents UTF-16 and that we stay in the Basic Multilingual Plane), while the Linux implementation I know about decided to stick to ASCII.
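
What a given implementation does in the "classic" locale can be probed directly by asking the locale for its codecvt facet and converting a single character. This is only a sketch: narrow_one is a hypothetical helper, and the result for non-ASCII input is exactly the implementation-dependent part discussed above.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

typedef std::codecvt<wchar_t, char, std::mbstate_t> wide_codecvt;

// Hypothetical helper: converts one wide character using the codecvt facet
// of `loc`. On success the resulting bytes are appended to `bytes`.
bool narrow_one(const std::locale& loc, wchar_t wc, std::string& bytes)
{
    const wide_codecvt& cvt = std::use_facet<wide_codecvt>(loc);
    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    char buf[MB_LEN_MAX];
    const wchar_t* from_next = 0;
    char* to_next = 0;
    const std::codecvt_base::result r =
        cvt.out(state, &wc, &wc + 1, from_next, buf, buf + sizeof buf, to_next);
    if (r != std::codecvt_base::ok)
        return false;                         // the locale cannot represent wc
    bytes.append(buf, static_cast<std::size_t>(to_next - buf));
    return true;
}
```

Calling narrow_one(std::locale::classic(), wchar_t(0x00FF), bytes) is the interesting case: an implementation that trims would be expected to succeed with a single byte, while one that sticks to ASCII would be expected to report failure.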

For your second question:

Also, are we gonna get real unicode streams with C++0x or am I missing something here?

In the [locale.codecvt] section of n2857 (the latest C++0x draft I have at hand), one can read:

The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encodings schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encodings schemes. codecvt<wchar_t,char,mbstate_t> converts between the native character sets for narrow and wide characters.

In the [locale.stdcvt] section, we find:

For the facet codecvt_utf8: — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program. [...]

For the facet codecvt_utf16: — The facet shall convert between UTF-16 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program. [...]

For the facet codecvt_utf8_utf16: — The facet shall convert between UTF-8 multibyte sequences and UTF-16 (one or two 16-bit codes) within the program.

So I guess that this means "yes", but you'd have to be more precise about what you mean by "real unicode streams" to be sure.
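
As a rough illustration of how one of the facets quoted above could be attached to a wide file stream, here is a minimal sketch using std::codecvt_utf8 as it was eventually standardized in C++11 in the <codecvt> header (that header was later deprecated in C++17, so a compiler may warn). write_utf8 is a hypothetical helper name.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: writes `text` to `path` as UTF-8, independently of
// the environment locale, by installing codecvt_utf8 in the stream's locale.
bool write_utf8(const std::string& path, const std::wstring& text)
{
    std::wofstream os;
    // The locale takes ownership of the facet allocated with new.
    os.imbue(std::locale(std::locale::classic(),
                         new std::codecvt_utf8<wchar_t>));
    os.open(path.c_str(), std::ios::binary);
    os << text;
    os.flush();
    return static_cast<bool>(os);
}
```

Writing the single character 0x00FF this way should produce the bytes C3 BF, the same result the accepted answer obtained through a UTF-8 environment locale.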

@Éric Thanks. Finally we are getting real Unicode streams :)
@Éric I meant that streams are Unicode aware, as C++0x is. I'm still looking for a rational answer to the main question.

sellibitze · Answer 3 · 2009-10-02 13:22:42Z

I don't know about wofstream. But C++0x will include new distinct character types (char16_t, char32_t) of guaranteed width and signedness (unsigned) which can be portably used for UTF-8, UTF-16 and UTF-32. In addition, there will be new string literals (u"Hello!" for a UTF-16 coded string literal, for example).

Check out the most recent C++0x draft (N2960).
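
A short sketch of what these types and literals look like in code, as they were standardized in C++11. The variable names are only illustrative.

```cpp
#include <string>

// UTF-16 and UTF-32 string literals with the new character types.
// U+00FF fits in one UTF-16 code unit; U+1D11E lies outside the Basic
// Multilingual Plane and therefore needs a surrogate pair in UTF-16.
const std::u16string utf16_text = u"\u00FF\U0001D11E";   // 1 + 2 code units
const std::u32string utf32_text = U"\u00FF\U0001D11E";   // 1 + 1 code units
```

Here utf16_text.size() is 3 and utf32_text.size() is 2, because size() counts code units of the respective type rather than characters.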

ltcmelo · Answer 4 · 2009-10-02 15:13:41Z

For your first question, this is my guess.

The IOStreams library was constructed under a couple of premises regarding encodings. For converting between Unicode and other, less common encodings, for example, it's assumed that:

Inside your program, you should use a (fixed-width) wide-character encoding.
Only external storage should use (variable-width) multibyte encodings.

I believe that is the reason for the existence of the two template specializations of std::codecvt. One that maps between char types (maybe you're simply working with ASCII) and another that maps between wchar_t (internal to your program) and char (external devices). So whenever you need to perform a conversion to a multibyte encoding you should do it byte-by-byte. Notice that you can write a facet that handles encoding state when you read/write each byte from/to the multibyte encoding.

Thinking this way, the behavior of the C++ standard is understandable. After all, you're using wide-character ASCII encoded (assuming this is the default on your platform and you did not switch locales) strings. The "natural" conversion would be to convert each wide-character ASCII character to an ordinary (in this case, one char) ASCII character. (The conversion exists and is straightforward.)

By the way, I'm not sure if you know, but you can avoid this by creating a facet that returns noconv for the conversions. Then, you would have your file with wide-characters.
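
A sketch of a facet along these lines follows. Note that it deliberately takes a slightly different route than returning noconv: how a file buffer treats a noconv facet for wchar_t appears to differ between implementations (this is an assumption worth checking on your platform), so the facet below copies the bytes of each wchar_t explicitly and reports ok. raw_wchar_codecvt and write_raw are hypothetical names, only the output direction is handled, and the bytes are written in the machine's native byte order and wchar_t size.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical facet: emits every wchar_t as its raw bytes (native order).
class raw_wchar_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
public:
    explicit raw_wchar_codecvt(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}

protected:
    result do_out(state_type&,
                  const intern_type* from, const intern_type* from_end,
                  const intern_type*& from_next,
                  extern_type* to, extern_type* to_end,
                  extern_type*& to_next) const override
    {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            if (static_cast<std::size_t>(to_end - to_next) < sizeof(wchar_t))
                return partial;               // output buffer is full
            std::memcpy(to_next, from_next, sizeof(wchar_t));
            to_next += sizeof(wchar_t);
            ++from_next;
        }
        return ok;
    }

    result do_unshift(state_type&, extern_type* to, extern_type*,
                      extern_type*& to_next) const override
    {
        to_next = to;                         // stateless: nothing to flush
        return noconv;
    }

    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override
    { return static_cast<int>(sizeof(wchar_t)); }   // fixed width per character
    int do_max_length() const noexcept override
    { return static_cast<int>(sizeof(wchar_t)); }
};

// Hypothetical helper: writes `text` to `path` as raw wchar_t units.
bool write_raw(const std::string& path, const std::wstring& text)
{
    std::wofstream os;
    os.imbue(std::locale(std::locale::classic(), new raw_wchar_codecvt));
    os.open(path.c_str(), std::ios::binary);
    os << text;
    os.flush();
    return static_cast<bool>(os);
}
```

The resulting file is not portable between platforms with a different wchar_t size or byte order, which is one reason the standard leaves the serialization to the locale instead.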

Your premises will probably not hold. UTF-16 is multibyte. Most people consider UTF-32 too wasteful for character data (I don't), so we will end up using UTF-16 and having all the extra code to handle the special corner case of surrogate pairs.
@Martin: UTF-8 and UTF-16 are all multibyte. I didn't say they were fixed-width. I don't understand exactly what you're saying.

score 2 · Answer 5

Check this out: http://msdn.microsoft.com/en-us/library/tzf8k3z8(v=VS.80).aspx

You can alter the default behavior by setting a wide char buffer, using pubsetbuf. Once you have done that, the output will be wchar_t and not char.

In other words for your example you will have:

wofstream file(L"Test.txt", ios_base::binary); // binary mode is important here!
wchar_t buffer[128];
file.rdbuf()->pubsetbuf(buffer, 128);
file.put(0xFEFF); // this is the BOM; UTF-16 needs it, but Microsoft's UNICODE doesn't, so you can skip this line if you like
file << someString; // the output file will consist of Unicode characters! Without the call to pubsetbuf, the output file will be ANSI (current regional settings)
