UTF-8 and Unicode FAQ for Unix/Linux

by Markus Kuhn

This text is a very comprehensive one-stop information resource on how you can use Unicode/UTF-8 on POSIX systems (Linux, Unix). You will find here both introductory information for every user, as well as detailed references for the experienced developer.

Unicode now replaces ASCII, ISO 8859 and EUC at all levels. It enables users to handle not only practically any script and language used on this planet, it also supports a comprehensive set of mathematical and technical symbols to simplify scientific information exchange.

With the UTF-8 encoding, Unicode can be used in a convenient and backwards compatible way in environments that were designed entirely around ASCII, like Unix. UTF-8 is the way in which Unicode is used under Unix, Linux, and similar systems. Make sure that you are well familiar with it and that your software supports UTF-8 smoothly.

What are UCS and ISO 10646?

The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility to other character sets. This means simply that no information is lost if you convert any text string to UCS and then back to its original encoding.

UCS contains the characters required to represent practically all known languages. This includes not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, Japanese and Korean Han ideographs as well as scripts such as Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar, Sinhala, Thaana, Yi, and others. For scripts not yet covered, research on how to best encode them for computer usage is still going on and they will be added eventually. This includes not only historic scripts such as Cuneiform, Hieroglyphs and various Indo-European notations, but even some selected artistic scripts such as Tolkien’s Tengwar and Cirth. UCS also covers a large number of graphical, typographical, mathematical and scientific symbols, including those provided by TeX, PostScript, APL, the International Phonetic Alphabet (IPA), MS-DOS, MS-Windows, Macintosh, OCR fonts, as well as many word processing and publishing systems. The standard continues to be maintained and updated. Ever more exotic and specialized symbols and characters will be added for many years to come.

ISO 10646 originally defined a 31-bit character set. The subsets of 2^16 characters where the elements differ (in a 32-bit integer representation) only in the 16 least-significant bits are called the planes of UCS.

The most commonly used characters, including all those found in major older encoding standards, have been placed into the first plane (0x0000 to 0xFFFD), which is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that were later added outside the 16-bit BMP are mostly for specialist applications such as historic scripts and scientific notation. Current plans are that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters. The ISO 10646-1 standard was first published in 1993 and defines the architecture of the character set and the content of the BMP. A second part ISO 10646-2 was added in 2001 and defines characters encoded outside the BMP. In the 2003 edition, the two parts were combined into a single ISO 10646 standard. New characters are still being added on a continuous basis, but the existing characters will not be changed any more and are stable.

UCS assigns to each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value is commonly preceded by “U+” as in U+0041 for the character “Latin capital letter A”. The UCS characters U+0000 to U+007F are identical to those in US-ASCII (ISO 646 IRV) and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF and also larger ranges outside the BMP are reserved for private use. UCS also defines several methods for encoding a string of characters as a sequence of bytes, such as UTF-8 and UTF-16.

The full reference for the UCS standard is

International Standard ISO/IEC 10646, Information technology — Universal Multiple-Octet Coded Character Set (UCS). Third edition, International Organization for Standardization, Geneva, 2003.

The standard can be ordered online from ISO as a set of PDF files on CD-ROM for 112 CHF.

In September 2006, ISO released a free online PDF copy of ISO 10646 on its Freely Available Standards web page. The ZIP file is 80 MB long.

What are combining characters?

Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the previous character. This way, it is possible to place any accent on any character. The most important accented characters, like those used in the orthographies of common languages, have codes of their own in UCS to ensure backwards compatibility with older character sets. They are known as precomposed characters and are available in UCS for backwards compatibility with older encodings, such as ISO 8859, that have no combining characters. The combining-character mechanism allows one to add accents and other diacritical marks to any character. This is especially important for scientific notations such as mathematical formulae and the International Phonetic Alphabet, where any possible combination of a base character and one or several diacritical marks could be needed.

Combining characters follow the character which they modify. For example, the German umlaut character Ä (“Latin capital letter A with diaeresis”) can either be represented by the precomposed UCS code U+00C4, or alternatively by the combination of a normal “Latin capital letter A” followed by a “combining diaeresis”: U+0041 U+0308. Several combining characters can be applied when it is necessary to stack multiple accents or add combining marks both above and below the base character. The Thai script, for example, needs up to two combining characters on a single base character.
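
As a minimal illustration, the following C fragment (a sketch, assuming the program writes to a UTF-8 terminal) prints the two equivalent representations; the string literals are simply the UTF-8 byte sequences for U+00C4 and for U+0041 U+0308:

  #include <stdio.h>

  int main(void)
  {
      puts("\xC3\x84");    /* U+00C4, precomposed A with diaeresis */
      puts("A\xCC\x88");   /* U+0041 U+0308, A plus combining diaeresis */
      return 0;
  }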

What are UCS implementation levels?

Not all systems can be expected to support all the advanced mechanisms of UCS, such as combining characters. Therefore, ISO 10646 specifies the following three implementation levels:

Level 1
Combining characters and Hangul Jamo characters are not supported.
[Hangul Jamo are an alternative representation of precomposed modern Hangul syllables as a sequence of consonants and vowels. They are required to fully support the Korean script including Middle Korean.]
Level 2
Like level 1, however in some scripts, a fixed list of combining characters is now allowed (e.g., for Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai and Lao). These scripts cannot be represented adequately in UCS without support for at least certain combining characters.
Level 3
All UCS characters are supported, such that, for example, mathematicians can place a tilde or an arrow (or both) on any character.

Has UCS been adopted as a national standard?

Yes, a number of countries have published national adoptions of ISO 10646, sometimes after adding additional annexes with cross-references to older national standards, implementation guidelines, and specifications of various national implementation subsets:

  • China: GB 13000.1-93
  • Japan: JIS X 0221-1:2001
  • Korea: KS X 1005-1:1995 (includes ISO 10646-1:1993 amendments 1-7)
  • Vietnam: TCVN 6909:2001
    (This “16-bit Coded Vietnamese Character Set” is a small UCS subset, to be implemented for data interchange with and within government agencies as of 2002-07-01.)
  • Iran: ISIRI 6219:2002, Information Technology — Persian Information Interchange and Display Mechanism, using Unicode. (This is not a version or subset of ISO 10646, but a separate document that provides additional national guidance and clarification on handling the Persian language and the Arabic script in Unicode.)

What is Unicode?

In the late 1980s, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO), the other was the Unicode Project organized by a consortium of (initially mostly US) manufacturers of multi-lingual software. Fortunately, the participants of both projects realized in around 1991 that two different unified character sets is not exactly what the world needs. They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently, however the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and they closely coordinate any further extensions. Unicode 1.1 corresponded to ISO 10646-1:1993, Unicode 3.0 to ISO 10646-1:2000, Unicode 3.2 added ISO 10646-2:2001, Unicode 4.0 corresponds to ISO 10646:2003, and Unicode 5.0 to ISO 10646:2003 plus its amendments 1–3. All Unicode versions since 2.0 are compatible: only new characters will be added, no existing characters will be removed or renamed in the future.

The Unicode Standard can be ordered like any normal book, for instance via amazon.com for around 60 USD:

The Unicode Consortium: The Unicode Standard 5.0,
Addison-Wesley, 2006,
ISBN 0-321-48091-0.

If you work frequently with text processing and character sets, you definitely should get a copy. Unicode 5.0 is also available online.

So what is the difference between Unicode and ISO 10646?

The Unicode Standard published by the Unicode Consortium corresponds to ISO 10646 at implementation level 3. All characters are at the same positions and have the same names in both standards.

The Unicode Standard defines in addition much more semantics associated with some of the characters and is in general a better reference for implementors of high-quality typographic publishing systems. Unicode specifies algorithms for rendering presentation forms of some scripts (say Arabic), handling of bi-directional texts that mix for instance Latin and Hebrew, algorithms for sorting and string comparison, and much more.

The ISO 10646 standard on the other hand is not much more than a simple character set table, comparable to the old ISO 8859 standards. It specifies some terminology related to the standard, defines some encoding alternatives, and it contains specifications of how to use UCS in connection with other established ISO standards such as ISO 6429 and ISO 2022. There are other closely related ISO standards, for instance ISO 14651 on sorting UCS strings. A nice feature of the ISO 10646-1 standard is that it provides CJK example glyphs in five different style variants, while the Unicode standard shows the CJK ideographs only in a Chinese variant.

What is UTF-8?

UCS and Unicode are first of all just code tables that assign integer numbers to characters. There exist several alternatives for how a sequence of such characters or their respective integer values can be represented as a sequence of bytes. The two most obvious encodings store Unicode text as sequences of either 2 or 4 bytes per character. The official terms for these encodings are UCS-2 and UCS-4, respectively. Unless otherwise specified, the most significant byte comes first in these (Bigendian convention). An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every ASCII byte. If we want to have a UCS-4 file, we have to insert three 0x00 bytes instead before every ASCII byte.
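
As an illustration, this padding step is a one-line loop in C. The following sketch (the function name is invented for this example) converts a Latin-1 byte string into Bigendian UCS-2:

  #include <stddef.h>

  /* Expand an ISO 8859-1 (Latin-1) string of n bytes into Bigendian
     UCS-2. The output buffer must provide room for 2*n bytes. */
  size_t latin1_to_ucs2be(const unsigned char *in, size_t n,
                          unsigned char *out)
  {
      size_t i;

      for (i = 0; i < n; i++) {
          out[2*i]     = 0x00;    /* high byte: always zero for Latin-1 */
          out[2*i + 1] = in[i];   /* low byte: the Latin-1 code itself */
      }
      return 2 * n;
  }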

Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings with these encodings can contain as parts of many wide characters bytes like “\0” or “/” which have a special meaning in filenames and other C library function parameters. In addition, the majority of UNIX tools expect ASCII files and cannot read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in filenames, text files, environment variables, etc.

The UTF-8 encoding defined in ISO 10646-1:2000 Annex D and also described in RFC 3629 as well as section 3.9 of the Unicode 4.0 standard does not have these problems. It is clearly the way to go for using Unicode under Unix-style operating systems.

UTF-8 has the following properties:

  • UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
  • All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
  • The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
  • All possible 2^31 UCS codes can be encoded.
  • UTF-8 encoded characters may theoretically be up to six bytes long, however 16-bit BMP characters are only up to three bytes long.
  • The sorting order of Bigendian UCS-4 byte strings is preserved.
  • The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.

The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:

U-00000000 – U-0000007F: 0xxxxxxx
U-00000080 – U-000007FF: 110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
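
This bit-shuffling is straightforward to express in C. The following sketch (the function name is invented here, not part of any standard API) encodes a single UCS code number of up to 31 bits according to the table above; for instance, utf8_encode(0xA9, buf) produces exactly the two bytes shown in the first example below:

  /* Encode the UCS code number c (up to U-7FFFFFFF) as UTF-8 into
     out[], which must hold at least 6 bytes. Returns the number of
     bytes written. */
  int utf8_encode(unsigned long c, unsigned char *out)
  {
      if (c < 0x80) {                        /* 0xxxxxxx */
          out[0] = c;
          return 1;
      } else if (c < 0x800) {                /* 110xxxxx 10xxxxxx */
          out[0] = 0xC0 | (c >> 6);
          out[1] = 0x80 | (c & 0x3F);
          return 2;
      } else if (c < 0x10000) {              /* 1110xxxx + 2 bytes */
          out[0] = 0xE0 | (c >> 12);
          out[1] = 0x80 | ((c >> 6) & 0x3F);
          out[2] = 0x80 | (c & 0x3F);
          return 3;
      } else if (c < 0x200000) {             /* 11110xxx + 3 bytes */
          out[0] = 0xF0 | (c >> 18);
          out[1] = 0x80 | ((c >> 12) & 0x3F);
          out[2] = 0x80 | ((c >> 6) & 0x3F);
          out[3] = 0x80 | (c & 0x3F);
          return 4;
      } else if (c < 0x4000000) {            /* 111110xx + 4 bytes */
          out[0] = 0xF8 | (c >> 24);
          out[1] = 0x80 | ((c >> 18) & 0x3F);
          out[2] = 0x80 | ((c >> 12) & 0x3F);
          out[3] = 0x80 | ((c >> 6) & 0x3F);
          out[4] = 0x80 | (c & 0x3F);
          return 5;
      } else {                               /* 1111110x + 5 bytes */
          out[0] = 0xFC | (c >> 30);
          out[1] = 0x80 | ((c >> 24) & 0x3F);
          out[2] = 0x80 | ((c >> 18) & 0x3F);
          out[3] = 0x80 | ((c >> 12) & 0x3F);
          out[4] = 0x80 | ((c >> 6) & 0x3F);
          out[5] = 0x80 | (c & 0x3F);
          return 6;
      }
  }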

Examples: The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as

    11000010 10101001 = 0xC2 0xA9

and character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:

    11100010 10001001 10100000 = 0xE2 0x89 0xA0

The official name and spelling of this encoding is UTF-8, where UTF stands for UCS Transformation Format. Please do not write UTF-8 in any documentation text in other ways (such as utf8 or UTF_8), unless of course you refer to a variable name and not the encoding itself.

An important note for developers of UTF-8 decoding routines: For security reasons, a UTF-8 decoder must not accept UTF-8 sequences that are longer than necessary to encode a character. For example, the character U+000A (line feed) must be accepted from a UTF-8 stream only in the form 0x0A, but not in any of the following five possible overlong forms:

  0xC0 0x8A
  0xE0 0x80 0x8A
  0xF0 0x80 0x80 0x8A
  0xF8 0x80 0x80 0x80 0x8A
  0xFC 0x80 0x80 0x80 0x80 0x8A

Any overlong UTF-8 sequence could be abused to bypass UTF-8 substring tests that look only for the shortest possible encoding. All overlong UTF-8 sequences start with one of the following byte patterns:

1100000x (10xxxxxx)
11100000 100xxxxx (10xxxxxx)
11110000 1000xxxx (10xxxxxx 10xxxxxx)
11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx)
11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx)

Also note that the code positions U+D800 to U+DFFF (UTF-16 surrogates) as well as U+FFFE and U+FFFF must not occur in normal UTF-8 or UCS-4 data. UTF-8 decoders should treat them like malformed or overlong sequences for safety reasons.
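
The following decoder sketch (invented for this answer, not a reference implementation) shows how these safety checks fit together with the length and continuation-byte rules:

  /* Decode one character from the NUL-terminated UTF-8 string *ps.
     On success, advance *ps past the sequence and return the code
     number; return -1 for malformed, overlong, or forbidden input. */
  long utf8_decode_safe(const unsigned char **ps)
  {
      static const long min[7] = { 0, 0, 0x80, 0x800, 0x10000,
                                   0x200000, 0x4000000 };
      const unsigned char *s = *ps;
      long c;
      int i, len;

      if      (s[0] < 0x80) { c = s[0];        len = 1; }
      else if (s[0] < 0xC0) return -1;           /* stray continuation byte */
      else if (s[0] < 0xE0) { c = s[0] & 0x1F; len = 2; }
      else if (s[0] < 0xF0) { c = s[0] & 0x0F; len = 3; }
      else if (s[0] < 0xF8) { c = s[0] & 0x07; len = 4; }
      else if (s[0] < 0xFC) { c = s[0] & 0x03; len = 5; }
      else if (s[0] < 0xFE) { c = s[0] & 0x01; len = 6; }
      else return -1;                            /* 0xFE and 0xFF never occur */

      for (i = 1; i < len; i++) {
          if ((s[i] & 0xC0) != 0x80)
              return -1;                         /* truncated sequence */
          c = (c << 6) | (s[i] & 0x3F);
      }
      if (c < min[len])
          return -1;                             /* overlong form */
      if (c >= 0xD800 && c <= 0xDFFF)
          return -1;                             /* UTF-16 surrogate */
      if (c == 0xFFFE || c == 0xFFFF)
          return -1;
      *ps = s + len;
      return c;
  }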

Markus Kuhn’s UTF-8 decoder stress test file contains a systematic collection of malformed and overlong UTF-8 sequences and will help you to verify the robustness of your decoder.

Who invented UTF-8?

The encoding known today as UTF-8 was invented by Ken Thompson. It was born during the evening hours of 1992-09-02 in a New Jersey diner, where he designed it in the presence of Rob Pike on a placemat (see Rob Pike’s UTF-8 history). It replaced an earlier attempt to design a FSS/UTF (file system safe UCS transformation format) that was circulated in an X/Open working document in August 1992 by Gary Miller (IBM), Greger Leijonhufvud and John Entenmann (SMI) as a replacement for the division-heavy UTF-1 encoding from the first edition of ISO 10646-1. By the end of the first week of September 1992, Pike and Thompson had turned AT&T Bell Lab’s Plan 9 into the world’s first operating system to use UTF-8. They reported about their experience at the USENIX Winter 1993 Technical Conference, San Diego, January 25-29, 1993, Proceedings, pp. 43-50. FSS/UTF was briefly also referred to as UTF-2 and later renamed into UTF-8, and pushed through the standards process by the X/Open Joint Internationalization Group XOJIG.

Where do I find nice UTF-8 example files?

A few interesting UTF-8 example files for tests and demonstrations are:

What different encodings are there?

Both the UCS and Unicode standards are first of all large tables that assign to every character an integer number. If you use the term “UCS”, “ISO 10646”, or “Unicode”, this just refers to a mapping between characters and integers. This does not yet specify how to store these integers as a sequence of bytes in memory.

ISO 10646-1 defines the UCS-2 and UCS-4 encodings. These are sequences of 2 bytes and 4 bytes per character, respectively. ISO 10646 was from the beginning designed as a 31-bit character set (with possible code positions ranging from U-00000000 to U-7FFFFFFF), however it took until 2001 for the first characters to be assigned beyond the Basic Multilingual Plane (BMP), that is beyond the first 2^16 character positions (see ISO 10646-2 and Unicode 3.1). UCS-4 can represent all UCS and Unicode characters, UCS-2 can represent only those from the BMP (U+0000 to U+FFFF).

“Unicode” originally implied that the encoding was UCS-2 and it initially didn’t make any provisions for characters outside the BMP (U+0000 to U+FFFF). When it became clear that more than 64k characters would be needed for certain special applications (historic alphabets and ideographs, mathematical and musical typesetting, etc.), Unicode was turned into a sort of 21-bit character set with possible code points in the range U-00000000 to U-0010FFFF. The 2×1024 surrogate characters (U+D800 to U+DFFF) were introduced into the BMP to allow 1024×1024 non-BMP characters to be represented as a sequence of two 16-bit surrogate characters. This way UTF-16 was born, which represents the extended “21-bit” Unicode in a way backwards compatible with UCS-2. The term UTF-32 was introduced in Unicode to describe a 4-byte encoding of the extended “21-bit” Unicode. UTF-32 is the exact same thing as UCS-4, except that by definition UTF-32 is never used to represent characters above U-0010FFFF, while UCS-4 can cover all 2^31 code positions up to U-7FFFFFFF. The ISO 10646 working group has agreed to modify their standard to exclude code positions beyond U-0010FFFF, in order to turn the new UCS-4 and UTF-32 into practically the same thing.

In addition to all that, UTF-8 was introduced to provide an ASCII backwards compatible multi-byte encoding. The definitions of UTF-8 in UCS and Unicode originally differed slightly, because in UCS, up to 6-byte long UTF-8 sequences were possible to represent characters up to U-7FFFFFFF, while in Unicode only up to 4-byte long UTF-8 sequences are defined to represent characters up to U-0010FFFF. (The difference was in essence the same as between UCS-4 and UTF-32.)

No endianness is implied by the encoding names UCS-2, UCS-4, UTF-16, and UTF-32, though ISO 10646-1 says that Bigendian should be preferred unless otherwise agreed. It has become customary to append the letters “BE” (Bigendian, high-byte first) and “LE” (Littleendian, low-byte first) to the encoding names in order to explicitly specify a byte order.

In order to allow the automatic detection of the byte order, it has become customary on some platforms (notably Win32) to start every Unicode file with the character U+FEFF (ZERO WIDTH NO-BREAK SPACE), also known as the Byte-Order Mark (BOM). Its byte-swapped equivalent U+FFFE is not a valid Unicode character, therefore it helps to unambiguously distinguish the Bigendian and Littleendian variants of UTF-16 and UTF-32.
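
A byte-order probe is thus a matter of inspecting the first two bytes of a file. A minimal sketch for UTF-16 (assuming, as ISO 10646-1 suggests, Bigendian where no BOM is present):

  #include <stdio.h>

  int main(int argc, char **argv)
  {
      FILE *f;
      int b0, b1;

      if (argc < 2 || !(f = fopen(argv[1], "rb")))
          return 1;
      b0 = getc(f);
      b1 = getc(f);
      if (b0 == 0xFE && b1 == 0xFF)
          puts("UTF-16BE (BOM present)");
      else if (b0 == 0xFF && b1 == 0xFE)
          puts("UTF-16LE (BOM present)");
      else
          puts("no BOM; assume Bigendian");
      fclose(f);
      return 0;
  }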

A full-featured character encoding converter will have to provide the following 13 encoding variants of Unicode and UCS:

UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE

Where no byte order is explicitly specified, use the byte order of the CPU on which the conversion takes place and in an input stream swap the byte order whenever U+FFFE is encountered. The difference between outputting UCS-4 versus UTF-32 and UTF-16 versus UCS-2 lies in handling out-of-range characters. The fallback mechanism for non-representable characters has to be activated in UTF-32 (for characters > U-0010FFFF) or UCS-2 (for characters > U+FFFF) even where UCS-4 or UTF-16 respectively would offer a representation.

Really just of historic interest are UTF-1, UTF-7, SCSU and a dozen other less widely publicised UCS encoding proposals with various properties, none of which ever enjoyed any significant use. Their use should be avoided.

A good encoding converter will also offer options for adding or removing the BOM:

  • Unconditionally prefix the output text with U+FEFF.
  • Prefix the output text with U+FEFF unless it is already there.
  • Remove the first character if it is U+FEFF.

It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF) as a signature to mark the beginning of a UTF-8 file. This practice should definitely not be used on POSIX systems for several reasons:

  • On POSIX systems, the locale (and not a magic file-type code) defines the encoding of plain text files. Mixing the two concepts would add a lot of complexity and break existing functionality.
  • Adding a UTF-8 signature at the start of a file would interfere with many established conventions such as the kernel looking for “#!” at the beginning of a plaintext executable to locate the appropriate interpreter.
  • Handling BOMs properly would add undesirable complexity even to simple programs like cat or grep that mix contents of several files into one.

In addition to the encoding alternatives, Unicode also specifies various Normalization Forms, which provide reasonable subsets of Unicode, especially to remove encoding ambiguities caused by the presence of precomposed and compatibility characters:

  • Normalization Form D (NFD): Split up (decompose) precomposed characters into combining sequences where possible, e.g. use U+0041 U+0308 (LATIN CAPITAL LETTER A, COMBINING DIAERESIS) instead of U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS). Also avoid deprecated characters, e.g. use U+0041 U+030A (LATIN CAPITAL LETTER A, COMBINING RING ABOVE) instead of U+212B (ANGSTROM SIGN).
  • Normalization Form C (NFC): Use precomposed characters instead of combining sequences where possible, e.g. use U+00C4 (“Latin capital letter A with diaeresis”) instead of U+0041 U+0308 (“Latin capital letter A”, “combining diaeresis”). Also avoid deprecated characters, e.g. use U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) instead of U+212B (ANGSTROM SIGN).
    NFC is the preferred form for Linux and WWW.
  • Normalization Form KD (NFKD): Like NFD, but avoid in addition the use of compatibility characters, e.g. use “fi” instead of U+FB01 (LATIN SMALL LIGATURE FI).
  • Normalization Form KC (NFKC): Like NFC, but avoid in addition the use of compatibility characters, e.g. use “fi” instead of U+FB01 (LATIN SMALL LIGATURE FI).

A full-featured character encoding converter should also offer conversion between normalization forms. Care should be taken when mapping to NFKD or NFKC, as semantic information might be lost (for instance U+00B2 (SUPERSCRIPT TWO) maps to 2) and extra mark-up information might have to be added to preserve it (e.g., <SUP>2</SUP> in HTML).

What programming languages support Unicode?

More recent programming languages that were developed after around 1993 already have special data types for Unicode/ISO 10646-1 characters. This is the case with Ada95, Java, TCL, Perl, Python, C# and others.

ISO C 90 specifies mechanisms to handle multi-byte encoding and wide characters. These facilities were improved with Amendment 1 to ISO C 90 in 1994 and even further improvements were made in the ISO C 99 standard. These facilities were designed originally with various East-Asian encodings in mind. They are on one side slightly more sophisticated than what would be necessary to handle UCS (handling of “shift sequences”), but also lack support for more advanced aspects of UCS (combining characters, etc.). UTF-8 is an example of what the ISO C standard calls multi-byte encoding. The type wchar_t, which in modern environments is usually a signed 32-bit integer, can be used to hold Unicode characters. (Since wchar_t has ended up being a 16-bit type on some platforms and a 32-bit type on others, additional types char16_t and char32_t have been proposed in ISO TR 19769 for future revisions of the C language, to give application programmers more control over the representation of such wide strings.)

Unfortunately, wchar_t was already widely used for various Asian 16-bit encodings throughout the 1990s. Therefore, the ISO C 99 standard was bound by backwards compatibility. It could not be changed to require wchar_t to be used with UCS, like Java and Ada95 managed to do. However, the C compiler can at least signal to an application that wchar_t is guaranteed to hold UCS values in all locales. To do so, it defines the macro __STDC_ISO_10646__ to be an integer constant of the form yyyymmL. The year and month refer to the version of ISO/IEC 10646 and its amendments that have been implemented. For example, __STDC_ISO_10646__ == 200009L if the implementation covers ISO/IEC 10646-1:2000.
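
A program can therefore test at compile time whether wchar_t is guaranteed to hold UCS values. A minimal sketch:

  #include <stdio.h>

  int main(void)
  {
  #ifdef __STDC_ISO_10646__
      printf("wchar_t holds UCS values (ISO/IEC 10646 version %ld)\n",
             (long) __STDC_ISO_10646__);
  #else
      printf("no guarantee: wchar_t may use some other encoding\n");
  #endif
      return 0;
  }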

How should Unicode be used under Linux?

Before UTF-8 emerged, Linux users all over the world had to use various different language-specific extensions of ASCII. Most popular were ISO 8859-1 and ISO 8859-2 in Europe, ISO 8859-7 in Greece, KOI-8 / ISO 8859-5 / CP1251 in Russia, EUC and Shift-JIS in Japan, BIG5 in Taiwan, etc. This made the exchange of files difficult and application software had to worry about various small differences between these encodings. Support for these encodings was usually incomplete, untested, and unsatisfactory, because the application developers rarely used all these encodings themselves.

Because of these difficulties, major Linux distributors and application developers are now phasing out these older legacy encodings in favour of UTF-8. UTF-8 support has improved dramatically over the last few years and many people now use UTF-8 on a daily basis in

  • text files (source code, HTML files, email messages, etc.)
  • file names
  • standard input and standard output, pipes
  • environment variables
  • cut and paste selection buffers
  • telnet, modem, and serial port connections to terminal emulators

and in any other places where byte sequences used to be interpreted in ASCII.

In UTF-8 mode, terminal emulators such as xterm or the Linux console driver transform every keystroke into the corresponding UTF-8 sequence and send it to the stdin of the foreground process. Similarly, any output of a process on stdout is sent to the terminal emulator, where it is processed with a UTF-8 decoder and then displayed using a 16-bit font.

Full Unicode functionality with all bells and whistles (e.g. high-quality typesetting of the Arabic and Indic scripts) can only be expected from sophisticated multi-lingual word-processing packages. What Linux supports today on a broad base is far simpler and mainly aimed at replacing the old 8- and 16-bit character sets. Linux terminal emulators and command line tools usually only support a Level 1 implementation of ISO 10646-1 (no combining characters), and only scripts such as Latin, Greek, Cyrillic, Armenian, Georgian, CJK, and many scientific symbols are supported that need no further processing support. At this level, UCS support is very comparable to ISO 8859 support and the only significant difference is that we now have thousands of different characters available, that characters can be represented by multibyte sequences, and that ideographic Chinese/Japanese/Korean characters require two terminal character positions (double-width).

Level 2 support in the form of combining characters for selected scripts (in particular Thai) and Hangul Jamo is in parts also available (i.e., some fonts, terminal emulators and editors support it via simple overstriking), but precomposed characters should be preferred over combining character sequences where available. More formally, the preferred way of encoding text in Unicode under Linux should be Normalization Form C as defined in Unicode Technical Report #15.

One influential non-POSIX PC operating system vendor (whom we shall leave unnamed here) suggested that all Unicode files should start with the character ZERO WIDTH NO-BREAK SPACE (U+FEFF), which is in this role also referred to as the “signature” or “byte-order mark (BOM)”, in order to identify the encoding and byte-order used in a file. Linux/Unix does not use any BOMs or signatures. They would break far too many existing ASCII syntax conventions (such as scripts starting with #!). On POSIX systems, the selected locale already identifies the encoding expected in all input and output files of a process. (It has also been suggested to call UTF-8 files without a signature “UTF-8N” files, but this non-standard term is usually not used in the POSIX world.)

Before you switch to UTF-8 under Linux, update your installation to a recent distribution with up-to-date UTF-8 support. This is particularly the case if you use an installation older than SuSE 9.1 or Red Hat 8.0. Before these, UTF-8 support was not yet mature enough to be recommendable for daily use.

Red Hat Linux 8.0 (September 2002) was the first distribution to take the leap of switching to UTF-8 as the default encoding for most locales. The only exceptions were Chinese/Japanese/Korean locales, for which there were at the time still too many specialized tools available that did not yet support UTF-8. This first mass deployment of UTF-8 under Linux caused most remaining issues to be ironed out rather quickly during 2003. SuSE Linux then switched its default locales to UTF-8 as well, as of version 9.1 (May 2004). It was followed by Ubuntu Linux, the first Debian-derivative that switched to UTF-8 as the system-wide default encoding. With the migration of the three most popular Linux distributions, UTF-8 related bugs have now been fixed in practically all well-maintained Linux tools. Other distributions can be expected to follow soon.

How do I have to modify my software?

If you are a developer, there are several approaches to add UTF-8 support. We can split them into two categories, which I will call soft and hard conversion. In soft conversion, data is kept in its UTF-8 form everywhere and only very few software changes are necessary. In hard conversion, any UTF-8 data that the program reads will be converted into wide-character arrays and will be handled as such everywhere inside the application. Strings will only be converted back to UTF-8 at output time. Internally, a character remains a fixed-size memory object.

We can also distinguish hard-wired and locale-dependent approaches for supporting UTF-8, depending on how much the string processing relies on the standard library. C offers a number of string processing functions designed to handle arbitrary locale-specific multibyte encodings. An application programmer who relies entirely on these can remain unaware of the actual details of the UTF-8 encoding. Chances are then that by merely changing the locale setting, several other multi-byte encodings (such as EUC) will automatically be supported as well. The other way a programmer can go is to hardcode knowledge about UTF-8 into the application. This may lead in some situations to significant performance improvements. It may be the best approach for applications that will only be used with ASCII and UTF-8.

Even where support for every multi-byte encoding supported by libc is desired, it may well be worth adding extra code optimized for UTF-8. Thanks to UTF-8’s self-synchronizing features, it can be processed very efficiently. The locale-dependent libc string functions can be two orders of magnitude slower than equivalent hardwired UTF-8 procedures. A bad teaching example was GNU grep 2.5.1, which relied entirely on locale-dependent libc functions such as mbrlen() for its generic multi-byte encoding support. This made it about 100× slower in multibyte mode than in single-byte mode! Other applications with hardwired support for UTF-8 regular expressions (e.g., Perl 5.8) do not suffer this dramatic slowdown.

Most applications can do very fine with just soft conversion. This is what makes the introduction of UTF-8 on Unix feasible at all. To name two trivial examples, programs such as cat and echo do not have to be modified at all. They can remain completely ignorant as to whether their input and output is ISO 8859-2 or UTF-8, because they handle just byte streams without processing them. They only recognize ASCII characters and control codes such as '\n' which do not change in any way under UTF-8. Therefore the UTF-8 encoding and decoding is done for these applications completely in the terminal emulator.

A small modification will be necessary for any program that determines the number of characters in a string by counting the bytes. With UTF-8, as with other multi-byte encodings, where the length of a text string is of concern, programmers have to distinguish clearly between

  1. the number of bytes,
  2. the number of characters,
  3. the display width (e.g., the number of cursor position cells in a VT100 terminal emulator)

of a string.

C’s strlen(s) function always counts the number of bytes. This is the number relevant, for example, for memory management (determination of string buffer sizes). Where the output of strlen is used for such purposes, no change will be necessary.

The number of characters can be counted in C in a portable way using mbstowcs(NULL,s,0). This works for UTF-8 like for any other supported encoding, as long as the appropriate locale has been selected. A hard-wired technique to count the number of characters in a UTF-8 string is to count all bytes except those in the range 0x80 – 0xBF, because these are just continuation bytes and not characters of their own. However, the need to count characters arises surprisingly rarely in applications.
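
A sketch of that hard-wired counting technique (the function name is invented for illustration):

  #include <stddef.h>

  /* Count the characters in a UTF-8 string by counting all bytes
     except continuation bytes of the form 10xxxxxx (0x80-0xBF). */
  size_t utf8_strlen(const char *s)
  {
      size_t count = 0;

      for (; *s; s++)
          if (((unsigned char) *s & 0xC0) != 0x80)
              count++;
      return count;
  }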

In applications written for ASCII or ISO 8859, a far more common use of strlen is to predict the number of columns that the cursor of the terminal will advance if a string is printed. With UTF-8, neither a byte nor a character count will predict the display width, because ideographic characters (Chinese, Japanese, Korean) will occupy two column positions, whereas control and combining characters occupy none. To determine the width of a string on the terminal screen, it is necessary to decode the UTF-8 sequence and then use the wcwidth function to test the display width of each character, or wcswidth to measure the entire string.
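
The following sketch makes the three measures concrete. It assumes it runs in a UTF-8 locale and that the C library provides wcswidth(); the sample string consists of two CJK ideographs, U+6F22 U+5B57, given here as explicit UTF-8 bytes:

  #define _XOPEN_SOURCE        /* needed for wcswidth() on some systems */
  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <wchar.h>

  int main(void)
  {
      const char *s = "\xE6\xBC\xA2\xE5\xAD\x97";   /* U+6F22 U+5B57 */
      wchar_t w[16];
      size_t n;

      setlocale(LC_CTYPE, "");          /* must select a UTF-8 locale */
      n = mbstowcs(w, s, 16);
      printf("%zu bytes, %zu characters, %d columns\n",
             strlen(s), mbstowcs(NULL, s, 0), wcswidth(w, n));
      return 0;        /* prints: 6 bytes, 2 characters, 4 columns */
  }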

For instance, the ls program had to be modified, because without knowing the column widths of filenames, it cannot format the table layout in which it presents directories to the user. Similarly, all programs that assume somehow that the output is presented in a fixed-width font and format it accordingly have to learn how to count columns in UTF-8 text. Editor functions such as deleting a single character have to be slightly modified to delete all bytes that might belong to one character. Affected were for instance editors (vi, emacs, readline, etc.) as well as programs that use the ncurses library.

Any Unix-style kernel can do fine with soft conversion and needs only very minor modifications to fully support UTF-8. Most kernel functions that handle strings (e.g. file names, environment variables, etc.) are not affected at all by the encoding. Modifications were necessary in Linux in the following places:

  • The console display and keyboard driver (another VT100 emulator) have to encode and decode UTF-8 and should support at least some subset of the Unicode character set. This had already been available in Linux as early as kernel 1.2 (send ESC %G to the console to activate UTF-8 mode).
  • External file system drivers such as VFAT and WinNT have to convert file name character encodings. UTF-8 is one of the available conversion options, and the mount command has to tell the kernel driver that user processes shall see UTF-8 file names. Since VFAT and WinNT already use Unicode anyway, UTF-8 is the only available encoding that guarantees a lossless conversion here.
  • The tty driver of any POSIX system supports a “cooked” mode, in which some primitive line editing functionality is available. In order to allow the character-erase function (which is activated when you press backspace) to work properly with UTF-8, someone needs to tell it not to count continuation bytes in the range 0x80-0xBF as characters, but to delete them as part of a UTF-8 multi-byte sequence. Since the kernel is ignorant of the libc locale mechanics, another mechanism is needed to tell the tty driver about UTF-8 being used. Linux kernel versions 2.6 or newer support a bit IUTF8 in the c_iflag member variable of struct termios. If it is set, the “cooked” mode line editor will treat UTF-8 multi-byte sequences correctly. This mode can be set from the command shell with “stty iutf8”. Xterm and friends should set this bit automatically when called in a UTF-8 locale (see the sketch after this list).
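
Here is a sketch of how a terminal emulator might set that flag (the helper name is invented; it only works where the headers define IUTF8):

  #include <termios.h>

  /* Set the IUTF8 flag on the tty fd, so that the "cooked" mode line
     editor erases UTF-8 multi-byte sequences correctly (Linux >= 2.6). */
  int tty_enable_iutf8(int fd)
  {
  #ifdef IUTF8
      struct termios t;

      if (tcgetattr(fd, &t) < 0)
          return -1;
      t.c_iflag |= IUTF8;
      return tcsetattr(fd, TCSANOW, &t);
  #else
      return -1;    /* this system does not know about IUTF8 */
  #endif
  }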

C support for Unicode and UTF-8

Starting with GNU glibc 2.2, the type wchar_t is officially intended to be used only for 32-bit ISO 10646 values, independent of the currently used locale. This is signalled to applications by the definition of the __STDC_ISO_10646__ macro as required by ISO C99. The ISO C multi-byte conversion functions (mbsrtowcs(), wcsrtombs(), etc.) are fully implemented in glibc 2.2 or higher and can be used to convert between wchar_t and any locale-dependent multibyte encoding, including UTF-8, ISO 8859-1, etc.

For example, you can write

  #include <stdio.h>
  #include <locale.h>

  int main()
  {
    if (!setlocale(LC_CTYPE, "")) {
      fprintf(stderr, "Can't set the specified locale! "
              "Check LANG, LC_CTYPE, LC_ALL.\n");
      return 1;
    }
    printf("%ls\n", L"Schöne Grüße");
    return 0;
  }

Call this program with the locale setting LANG=de_DE and the output will be in ISO 8859-1. Call it with LANG=de_DE.UTF-8 and the output will be in UTF-8. The %ls format specifier in printf calls wcsrtombs in order to convert the wide character argument string into the locale-dependent multi-byte encoding.

Many of C’s string functions are locale-independent and they just look at zero-terminated byte sequences:

  strcpy strncpy strcat strncat strcmp strncmp strdup strchr strrchr
  strcspn strspn strpbrk strstr strtok

Some of these (e.g. strcpy) can equally be used for single-byte (ISO 8859-1) and multi-byte (UTF-8) encoded character sets, as they need no notion of how many bytes long a character is, while others (e.g., strchr) depend on one character being encoded in a single char value and are of less use for UTF-8 (strchr still works fine if you just search for an ASCII character in a UTF-8 string).

Other C functions are locale-dependent and work in UTF-8 locales just as well:

  strcoll strxfrm

How should the UTF-8 mode be activated?

If your application is soft converted and does not use the standard locale-dependent C multibyte routines (mbsrtowcs(), wcsrtombs(), etc.) to convert everything into wchar_t for processing, then it might have to find out in some way whether it is supposed to assume that the text data it handles is in some 8-bit encoding (like ISO 8859-1, where 1 byte = 1 character) or UTF-8. Once everyone uses only UTF-8, you can just make it the default, but until then both the classical 8-bit sets and UTF-8 may still have to be supported.

The first wave of applications with UTF-8 support used a whole lot of different command line switches to activate their respective UTF-8 modes, for instance the famous xterm -u8. That turned out to be a very bad idea. Having to remember a special command line option or other configuration mechanism for every application is very tedious, which is why command line options are not the proper way of activating a UTF-8 mode.

The proper way to activate UTF-8 is the POSIX locale mechanism. A locale is a configuration setting that contains information about culture-specific conventions of software behaviour, including the character encoding, the date/time notation, alphabetic sorting rules, the measurement system and common office paper size, etc. The names of locales usually consist of ISO 639-1 language and ISO 3166-1 alpha-2 country codes, sometimes with additional encoding names or other qualifiers.

You can get a list of all locales installed on your system (usually in /usr/lib/locale/) with the command locale -a. Set the environment variable LANG to the name of your preferred locale. When a C program executes the setlocale(LC_CTYPE, "") function, the library will test the environment variables LC_ALL, LC_CTYPE, and LANG in that order, and the first one of these that has a value will determine which locale data is loaded for the LC_CTYPE category (which controls the multibyte conversion functions). The locale data is split up into separate categories. For example, LC_CTYPE defines the character encoding and LC_COLLATE defines the string sorting order. The LANG environment variable is used to set the default locale for all categories, but the LC_* variables can be used to override individual categories. Do not worry too much about the country identifiers in the locales. Locales such as en_GB (English in Great Britain) and en_AU (English in Australia) differ usually only in the LC_MONETARY category (name of currency, rules for printing monetary amounts), which practically no Linux application ever uses. LC_CTYPE=en_GB and LC_CTYPE=en_AU have exactly the same effect.

Effect of locale on sorting order: If you had not set a locale previously, you may quickly notice that setting one (e.g., LANG=en_US.UTF-8 or LANG=en_GB.UTF-8), also changes the sorting order used by some tools: the “ls” command now sorts filenames with uppercase and lowercase first character next to each other (like in a dictionary), and file globbing no longer uses the ASCII order either (e.g. “echo [a-z]*” also lists filenames starting uppercase). To get the old ASCII sorting order back that you are used to, simply set in addition also LC_COLLATE=POSIX (or equivalently LC_COLLATE=C), and you will quickly feel at home again.

You can query the name of the character encoding in your current locale with the command locale charmap. This should say UTF-8 if you successfully picked a UTF-8 locale in the LC_CTYPE category. The command locale -m provides a list with the names of all installed character encodings.

If you use exclusively C library multibyte functions to do all the conversion between the external character encoding and the wchar_t encoding that you use internally, then the C library will take care of using the right encoding according to LC_CTYPE for you and your program does not even have to know explicitly what the current multibyte encoding is.

However, if you prefer not to do everything using the libc multi-byte functions (e.g., because you think this would require too many changes in your software or is not efficient enough), then your application has to find out for itself when to activate the UTF-8 mode. To do this, on any X/Open compliant systems, where <langinfo.h> is available, you can use a line such as

  utf8_mode = (strcmp(nl_langinfo(CODESET), "UTF-8") == 0);

in order to detect whether the current locale uses the UTF-8 encoding. You have of course to add a setlocale(LC_CTYPE, "") at the beginning of your application to set the locale according to the environment variables first. The standard function call nl_langinfo(CODESET) is also what locale charmap calls to find the name of the encoding specified by the current locale for you. It is available on pretty much every modern Unix now. FreeBSD added nl_langinfo(CODESET) support with version 4.6 (2002-06). If you need an autoconf test for the availability of nl_langinfo(CODESET), here is the one Bruno Haible suggested:

======================== m4/codeset.m4 ================================
#serial AM1

dnl From Bruno Haible.

AC_DEFUN([AM_LANGINFO_CODESET],
[
  AC_CACHE_CHECK([for nl_langinfo and CODESET], am_cv_langinfo_codeset,
    [AC_TRY_LINK([#include <langinfo.h>],
      [char* cs = nl_langinfo(CODESET);],
      am_cv_langinfo_codeset=yes,
      am_cv_langinfo_codeset=no)
    ])
  if test $am_cv_langinfo_codeset = yes; then
    AC_DEFINE(HAVE_LANGINFO_CODESET, 1,
      [Define if you have <langinfo.h> and nl_langinfo(CODESET).])
  fi
])
=======================================================================

[You could also try to query the locale environment variables yourself without using setlocale(). In the sequence LC_ALL, LC_CTYPE, LANG, look for the first of these environment variables that has a value. Make the UTF-8 mode the default (still overridable by command line switches) when this value contains the substring UTF-8, as this indicates reasonably reliably that the C library has been asked to use a UTF-8 locale. An example code fragment that does this is

  char *s;
  int utf8_mode = 0;

  if (((s = getenv("LC_ALL"))   && *s) ||
      ((s = getenv("LC_CTYPE")) && *s) ||
      ((s = getenv("LANG"))     && *s)) {
    if (strstr(s, "UTF-8"))
      utf8_mode = 1;
  }

This relies of course on all UTF-8 locales having the name of the encoding in their name, which is not always the case, therefore the nl_langinfo() query is clearly the better method. If you are really concerned that calling nl_langinfo() might not be portable enough, there is also Markus Kuhn’s portable public domain nl_langinfo(CODESET) emulator for systems that do not have the real thing (and another one from Bruno Haible), and you can use the norm_charmap() function to standardize the output of nl_langinfo(CODESET) on different platforms.]

How do I get a UTF-8 version of xterm?

The xterm version that comes with XFree86 4.0 or higher (maintained by Thomas Dickey) includes UTF-8 support. To activate it, start xterm in a UTF-8 locale and use a font with iso10646-1 encoding, for instance with

  LC_CTYPE=en_GB.UTF-8 xterm \
    -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'

and then cat some example file, such as UTF-8-demo.txt, in the newly started xterm and enjoy what you see.

If you are not using XFree86 4.0 or newer, then you can alternatively download the latest xterm development version separately and compile it yourself with “./configure --enable-wide-chars ; make” or alternatively with “xmkmf; make Makefiles; make; make install; make install.man”.

If you do not have UTF-8 locale support available, use command line option -u8 when you invoke xterm to switch input and output to UTF-8.

How much of Unicode does xterm support?

Xterm in XFree86 4.0.1 only supported Level 1 (no combining characters) of ISO 10646-1 with fixed character width and left-to-right writing direction. In other words, the terminal semantics were basically the same as for ISO 8859-1, except that it can now decode UTF-8 and can access 16-bit characters.

With XFree86 4.0.3, two important functions were added:

  • automatic switching to a double-width font for CJK ideographs
  • simple overstriking combining characters

If the selected normal font is X × Y pixels large, then xterm will attempt to load in addition a 2X × Y pixels large font (same XLFD, except for a doubled value of the AVERAGE_WIDTH property). It will use this font to represent all Unicode characters that have been assigned the East Asian Wide (W) or East Asian Full Width (F) property in Unicode Technical Report #11.

The following fonts coming with XFree86 4.x are suitable for display of Japanese and Korean Unicode text with terminal emulators and editors:

  6x13    -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1
  6x13B   -Misc-Fixed-Bold-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1
  6x13O   -Misc-Fixed-Medium-O-SemiCondensed--13-120-75-75-C-60-ISO10646-1
  12x13ja -Misc-Fixed-Medium-R-Normal-ja-13-120-75-75-C-120-ISO10646-1

  9x18    -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
  9x18B   -Misc-Fixed-Bold-R-Normal--18-120-100-100-C-90-ISO10646-1
  18x18ja -Misc-Fixed-Medium-R-Normal-ja-18-120-100-100-C-180-ISO10646-1
  18x18ko -Misc-Fixed-Medium-R-Normal-ko-18-120-100-100-C-180-ISO10646-1

Some simple support for nonspacing or enclosing combining characters (i.e., those with general category code Mn or Me in the Unicode database) is now also available, which is implemented by just overstriking (logical OR-ing) a base-character glyph with up to two combining-character glyphs. This produces acceptable results for accents below the base line and accents on top of small characters. It also works well, for example, for Thai and Korean Hangul Conjoining Jamo fonts that were specifically designed for use with overstriking. However, the results might not be fully satisfactory for combining accents on top of tall characters in some fonts, especially with the fonts of the “fixed” family. Therefore precomposed characters will continue to be preferable where available.

The fonts below that come with XFree86 4.x are suitable for display of Latin etc. combining characters (extra head-space). Other fonts will only look nice with combining accents on small x-high characters.

  6x12    -Misc-Fixed-Medium-R-Semicondensed--12-110-75-75-C-60-ISO10646-1
  9x18    -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
  9x18B   -Misc-Fixed-Bold-R-Normal--18-120-100-100-C-90-ISO10646-1

The following fonts coming with XFree86 4.x are suitable for display of Thai combining characters:

  6x13    -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1
  9x15    -Misc-Fixed-Medium-R-Normal--15-140-75-75-C-90-ISO10646-1
  9x15B   -Misc-Fixed-Bold-R-Normal--15-140-75-75-C-90-ISO10646-1
  10x20   -Misc-Fixed-Medium-R-Normal--20-200-75-75-C-100-ISO10646-1
  9x18    -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1

The fonts 18x18ko, 18x18Bko, 16x16Bko, and 16x16ko are suitable for displaying Hangul Jamo (using the same simple overstriking character mechanism used for Thai).

A note for programmers of text mode applications:

With support for CJK ideographs and combining characters, the output of xterm behaves a little bit more like that of a proportional font, because a Latin/Greek/Cyrillic/etc. character requires one column position, a CJK ideograph two, and a combining character zero.

The Open Group’s Single UNIX Specification specifies the two C functions wcwidth() and wcswidth() that allow an application to test how many column positions a character will occupy:

  #include <wchar.h>
  int wcwidth(wchar_t wc);
  int wcswidth(const wchar_t *pwcs, size_t n);

Markus Kuhn’s free wcwidth() implementation can be used by applications on platforms where the C library does not yet provide a suitable function.

Xterm will for the foreseeable future probably not support the following functionality, which you might expect from a more sophisticated full Unicode rendering engine:

  • bidirectional output of Hebrew and Arabic characters
  • substitution of Arabic presentation forms
  • substitution of Indic/Syriac ligatures
  • arbitrary stacks of combining characters

Hebrew and Arabic users will therefore have to use application programs that reverse and left-pad Hebrew and Arabic strings before sending them to the terminal. In other words, the bidirectional processing has to be done by the application and not by xterm. The situation for Hebrew and Arabic improves over ISO 8859 at least in the form of the availability of precomposed glyphs and presentation forms. It is far from clear at the moment whether bidirectional support should really go into xterm and how precisely this should work. Both ISO 6429 = ECMA-48 and the Unicode bidi algorithm provide alternative starting points. See also ECMA Technical Report TR/53.

If you plan to support bidirectional text output in your application, have a look at either Dov Grobgeld’s FriBidi or Mark Leisher’s Pretty Good Bidi Algorithm, two free implementations of the Unicode bidi algorithm.

Xterm currently does not support the Arabic, Syriac, or Indic text formatting algorithms, although Robert Brady has published some experimental patches towards bidi support. It is still unclear whether it is feasible or preferable to do this in a VT100 emulator at all. Applications can apply the Arabic and Hangul formatting algorithms themselves easily, because xterm allows them to output the necessary presentation forms. For Hangul, Unicode contains the presentation forms needed for modern (post-1933) Korean orthography. For Indic scripts, the X font mechanism at the moment does not even support the encoding of the necessary ligature variants, so there is little xterm could offer anyway. Applications requiring Indic or Syriac output would be better off using a proper Unicode X11 rendering library such as Pango instead of a VT100 emulator like xterm.

Where do I find ISO 10646-1 X11 fonts?

Quite a number of Unicode fonts have become available for X11 over the past few months, and the list is growing quickly:

  • Markus Kuhn together with a number of other volunteers has extended the old -misc-fixed-*-iso8859-1 fonts that come with X11 towards a repertoire that covers all European characters (Latin, Greek, Cyrillic, intl. phonetic alphabet, mathematical and technical symbols, in some fonts even Armenian, Georgian, Katakana, Thai, and more). For more information see the Unicode fonts and tools for X11 page. These fonts are now also distributed with XFree86 4.0.1 or higher.
  • Markus has also prepared ISO 10646-1 versions of all the Adobe and B&H BDF fonts in the X11R6.4 distribution. These fonts already contained the full PostScript font repertoire (around 30 additional characters, mostly those used also by CP1252 MS-Windows, e.g. smart quotes, dashes, etc.), which were however not available under the ISO 8859-1 encoding. They are now all accessible in the ISO 10646-1 version, along with many additional precomposed characters covering ISO 8859-1,2,3,4,9,10,13,14,15. These fonts are now also distributed with XFree86 4.1 or higher.
  • XFree86 4.0 comes with an integrated TrueType font engine that can make available any Apple/Microsoft font to your X application in the ISO 10646-1 encoding.
  • Some future XFree86 release might also remove most old BDF fonts from the distribution and replace them with ISO 10646-1 encoded versions. The X server will be extended with an automatic encoding converter that creates other font encodings such as ISO 8859-* from the ISO 10646-1 font file on-the-fly when such a font is requested by old 8-bit software. Modern software should preferably use the ISO 10646-1 font encoding directly.
  • ClearlyU (cu12) is a 12 point, 100 dpi proportional ISO 10646-1 BDF font for X11 with over 3700 characters by Mark Leisher (example images).
  • The Electronic Font Open Laboratory in Japan is also working on a family of Unicode bitmap fonts.
  • Dmitry Yu. Bolkhovityanov created a Unicode VGA font in BDF for use by text mode IBM PC emulators etc.
  • Roman Czyborra’s GNU Unicode font project works on collecting a complete and free 8×16/16×16 pixel Unicode font. It currently covers over 34000 characters.
  • etl-unicode is an ISO 10646-1 BDF font prepared by Primoz Peterlin.
  • Primoz Peterlin has also started the freefont project, which extends to better UCS coverage some of the 35 core PostScript outline fonts that URW++ donated to the ghostscript project, with the help of pfaedit.
  • George Williams has created a Type1 Unicode font family, which is also available in BDF. He also developed the PfaEdit PostScript and bitmap font editor.
  • Everson Mono is a shareware monospaced font with over 3000 European glyphs, also available from the DKUUG server.
  • Birger Langkjer has prepared a Unicode VGA Console Font for Linux.
  • Alan Wood has a list of Microsoft fonts that support various Unicode ranges.
  • CODE2000 is a Unicode font by James Kass.

Unicode X11 font names end with -ISO10646-1. This is now the officially registered value for the X Logical Font Descriptor (XLFD) fields CHARSET_REGISTRY and CHARSET_ENCODING for all Unicode and ISO 10646-1 16-bit fonts. The *-ISO10646-1 fonts contain some unspecified subset of the entire Unicode character set, so users have to make sure that the font they select covers the characters they need.

The *-ISO10646-1 fonts usually also specify a DEFAULT_CHAR value that points to a special non-Unicode glyph for representing any character that is not available in the font (usually a dashed box, the size of an H, located at 0x00). This ensures that users at least see clearly that there is an unsupported character. The smaller fixed-width fonts such as 6x13 etc. for xterm will never be able to cover all of Unicode, because many scripts such as Kanji can only be represented in considerably larger pixel sizes than those widely used by European users. Typical Unicode fonts for European usage will contain only subsets of between 1000 and 3000 characters, such as the CEN MES-3 repertoire.

You might notice that in the *-ISO10646-1 fonts the shapes of the ASCII quotation marks have changed slightly to bring them in line with the standards and practice on other platforms.

What are the issues related to UTF-8 terminal emulators?

VT100 terminal emulators accept ISO 2022 (= ECMA-35) ESC sequences in order to switch between different character sets.

UTF-8 is in the sense of ISO 2022 an “other coding system” (see section 15.4 of ECMA 35). UTF-8 is outside the ISO 2022 SS2/SS3/G0/G1/G2/G3 world, so if you switch from ISO 2022 to UTF-8, all SS2/SS3/G0/G1/G2/G3 states become meaningless until you leave UTF-8 and switch back to ISO 2022. UTF-8 is a stateless encoding, i.e. a self-terminating short byte sequence determines completely which character is meant, independent of any switching state. G0 and G1 in ISO 10646-1 are those of ISO 8859-1, and G2/G3 do not exist in ISO 10646, because every character has a fixed position and no switching takes place. With UTF-8, it is not possible that your terminal remains switched to strange graphics-character mode after you accidentally dumped a binary file to it. This makes a terminal in UTF-8 mode much more robust than with ISO 2022, and it is therefore useful to have a way of locking a terminal into UTF-8 mode such that it cannot accidentally go back to the ISO 2022 world.

The ISO 2022 standard specifies a range of ESC % sequences for leaving the ISO 2022 world (designation of other coding system, DOCS), and a number of such sequences have been registered for UTF-8 in section 2.8 of the ISO 2375 International Register of Coded Character Sets:

  • ESC %G activates UTF-8 with an unspecified implementation level from ISO 2022 in a way that allows switching back to ISO 2022 again.
  • ESC %@ goes back from UTF-8 to ISO 2022, in case UTF-8 had been entered via ESC %G.
  • ESC %/G switches to UTF-8 Level 1 with no return.
  • ESC %/H switches to UTF-8 Level 2 with no return.
  • ESC %/I switches to UTF-8 Level 3 with no return.

While a terminal emulator is in UTF-8 mode, any ISO 2022 escape sequences such as for switching G2/G3 etc. are ignored. The only ISO 2022 sequence on which a terminal emulator might act in UTF-8 mode is ESC %@ for returning from UTF-8 back to the ISO 2022 scheme.
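As an illustration, an application can emit these DOCS sequences itself. The following is a minimal C sketch, not any standard API; whether the switch is honoured depends entirely on the emulator:

  #include <stdio.h>

  int main(void)
  {
      printf("\033%%G");           /* ESC % G: switch terminal to UTF-8 */
      printf("\xe2\x82\xac\n");    /* U+20AC EURO SIGN, sent as UTF-8   */
      printf("\033%%@");           /* ESC % @: return to ISO 2022       */
      return 0;
  }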

UTF-8 still allows you to use C1 control characters such as CSI, even though UTF-8 also uses bytes in the range 0x80-0x9F. It is important to understand that a terminal emulator in UTF-8 mode must apply the UTF-8 decoder to the incoming byte stream before interpreting any control characters. C1 characters are UTF-8 decoded just like any other character above U+007F.
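For example, in UTF-8 mode the C1 control character CSI (U+009B) arrives as the two-byte sequence 0xC2 0x9B; a bare 0x9B byte would merely be a stray continuation byte. Here is a minimal C sketch of the two-byte decoding step (for brevity it does not reject overlong 0xC0/0xC1 sequences):

  #include <stdio.h>

  /* Decode a two-byte UTF-8 sequence (U+0080..U+07FF); -1 if malformed. */
  static int decode2(unsigned char b1, unsigned char b2)
  {
      if ((b1 & 0xE0) != 0xC0 || (b2 & 0xC0) != 0x80)
          return -1;
      return ((b1 & 0x1F) << 6) | (b2 & 0x3F);
  }

  int main(void)
  {
      /* prints: 0xC2 0x9B decodes to U+009B, i.e. CSI */
      printf("0xC2 0x9B decodes to U+%04X\n", (unsigned) decode2(0xC2, 0x9B));
      return 0;
  }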

Many text-mode applications available today expect to speak to the terminal using a legacy encoding, or use ISO 2022 sequences for switching terminal fonts. In order to use such applications within a UTF-8 terminal emulator, it is possible to use a conversion layer that will translate between ISO 2022 and UTF-8 on the fly. Examples of such utilities are Juliusz Chroboczek’s luit and pluto. If all you need is ISO 8859 support in a UTF-8 terminal, you can also use screen (version 4.0 or newer) by Michael Schröder and Jürgen Weigert. As the implementation of ISO 2022 is a complex and error-prone task, avoid implementing ISO 2022 yourself. Implement only UTF-8 and point users who need ISO 2022 to luit (or screen).

What UTF-8 enabled applications are available?

Warning: As of mid-2003, this section is becoming increasingly incomplete. UTF-8 support is now a pretty standard feature for most well-maintained packages. This list will soon have to be converted into a list of the most popular programs that still have problems with UTF-8.

Terminal emulation and communication

  • xterm as shipped with XFree86 4.0 or higher works correctly in UTF-8 locales if you use an *-iso10646-1 font. Just try it with, for example, LC_CTYPE=en_GB.UTF-8 xterm -fn '-Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1'.
  • C-Kermit has supported UTF-8 as the transfer, terminal, and file character set since version 7.0.
  • mlterm is a multi-lingual terminal emulator that supports UTF-8 among many other encodings, as well as combining characters and XIM.
  • Edmund Grimley Evans extended the BOGL Linux framebuffer graphics library with UCS font support and built a simple UTF-8 console terminal emulator called bterm with it.
  • Uterm purports to be a UTF-8 terminal emulator for the Linux framebuffer console.
  • Pluto, Juliusz Chroboczek’s paranormal Unicode converter, can guess which encoding is being used in a terminal session and converts it on the fly to UTF-8. (Wonderful for reading IRC channels with mixed ISO 8859 and UTF-8 messages!)

Editing and word processing

  • Vim (the popular clone of the classic vi editor) supports UTF-8 with wide characters and up to two combining characters starting from version 6.0.
  • Emacs has quite good basic UTF-8 support starting from version 21.3. Emacs 23 changed the internal encoding to UTF-8.
  • Yudit is Gaspar Sinai’s free X11 Unicode editor.
  • Mined 2000 by Thomas Wolff is a very nice UTF-8 capable text editor, ahead of the competition with features such as support for double-width and combining characters, bidirectional scripts, keyboard mappings for a wide range of scripts, script-dependent highlighting, etc.
  • JOE is a popular WordStar-like editor that supports UTF-8 as of version 3.0.
  • Cooledit offers UTF-8 and UCS support starting with version 3.15.0.
  • QEmacs is a small editor for use on UTF-8 terminals.
  • less is a popular plain-text file viewer that has had UTF-8 support since version 348. (Version 358 had a bug related to the handling of UTF-8 characters and backspace underlining/boldification as used by nroff/man, for which a patch is available; version 381 still has problems with UTF-8 characters in the search-mode input line.)
  • GNU bash and readline provide single-line editing and introduced support for multi-byte character encodings, such as UTF-8, with bash 2.05b and readline 4.3.
  • gucharmap and UMap are tools to select and paste any Unicode character into your application.
  • LaTeX has supported UTF-8 in its base package since March 2004 (still experimental). You can simply write \usepackage[utf8]{inputenc} and then encode at least some of TeX’s standard character repertoire in UTF-8 in your LaTeX sources. (Before that, UTF-8 was already available in the form of Dominique Unruh’s package, which covered far more characters but was rather resource hungry.) XeTeX is a reengineered version of TeX that reads and understands (UTF-8 encoded) Unicode text.
  • Abiword.

Programming

  • Perl offers usable Unicode and UTF-8 support starting with version 5.8.1. Strings are now tagged in memory as either byte strings or character strings, and the latter are stored internally as UTF-8 but appear to the programmer just as sequences of UCS characters. There is now also comprehensive support for encoding conversion and normalization included. Read “man perluniintro” for details.
  • Python got Unicode support added in version 1.6.
  • Tcl/Tk started using Unicode as its base character set with version 8.1. ISO10646-1 fonts are supported in Tk version 8.3.3 and newer.
  • CLISP can work with all multi-byte encodings (including UTF-8), and its functions char-width and string-width provide an API comparable to wcwidth() and wcswidth().

Mail and Internet

  • The Mutt email client has worked since version 1.3.24 in UTF-8 locales. When compiled and linked with ncursesw (ncurses built with wide-character support), Mutt 1.3.x works decently in UTF-8 locales under UTF-8 terminal emulators such as xterm.
  • Exmh is a GUI frontend for the MH or nmh mail system and partially supports Unicode starting with version 2.1.1 if Tcl/Tk 8.3.3 or newer is used. To enable displaying UTF-8 email, make sure you have the *-iso10646-1 fonts installed and add to .Xdefaults the line “exmh.mimeUCharsets: utf-8”. Much of the Exmh-internal MIME charset mechanics, however, still dates from the days before Tcl 8.1, therefore ignores Tcl/Tk’s more recent Unicode support, and could now be simplified and improved significantly. In particular, writing or replying to UTF-8 mail is still broken.
  • Most modern web browsers such as Mozilla Firefox have pretty decent UTF-8 support today.
  • The popular Pine email client lacks UTF-8 support and is no longer maintained. Switch to its successor Alpine, a complete reimplementation by the same authors, which has excellent UTF-8 support.

Printing

  • Cedilla is Juliusz Chroboczek’s best-effort Unicode to PostScript text printer.
  • Markus Kuhn’s hpp is a very simple plain-text formatter for HP PCL printers that supports the repertoire of characters covered by the standard PCL fixed-width fonts in all the character encodings for which your C library has a locale mapping. Markus Kuhn’s utf2ps is an early quick-and-dirty proof-of-concept UTF-8 formatter for PostScript; it was only written to demonstrate which character repertoire can easily be printed using only the standard PostScript fonts and was never intended for actual use.
  • Some post-2004 HP printers have UTF-8 PCL firmware support (more). The relevant PCL5 commands appear to be “␛&t1008P” (encoding method: UTF-8) and “␛(18N” (Unicode code page). Recent PCL printers from other manufacturers (e.g., Kyocera) also advertise UTF-8 support (for SAP compatibility).
  • The Common UNIX Printing System comes with a texttops tool that converts plain-text UTF-8 to PostScript.
  • txtbdf2ps by Serge Winitzki is a Perl script to print UTF-8 plain text to PostScript using BDF pixel fonts.

Misc

  • The PostgreSQL DBMS has had support for UTF-8 since version 7.1, both as the frontend encoding and as the backend storage encoding. Data conversion between frontend and backend encodings is performed automatically.
  • FIGlet is a tool to output banner text in large letters using monospaced characters as block graphics elements; it added UTF-8 support in version 2.2.
  • Charlint is a character normalization tool for the W3C character model.
  • The first available UTF-8 tools for Unix came out of the Plan 9 project, Bell Labs’ Unix successor and the world’s first operating system using UTF-8. Plan 9’s Sam editor and 9term terminal emulator have also been ported to Unix. Wily started out as a Unix implementation of the Plan 9 Acme editor and is a mouse-oriented, text-based working environment for programmers. More recently the Plan 9 from User Space (aka plan9port) package has emerged, a port of many Plan 9 programs from their native Plan 9 environment to Unix-like operating systems.
  • The Gnumeric spreadsheet is fully Unicode based from version 1.1.
  • The Heirloom Toolchest is a collection of standard Unix utilities derived from original Unix material released as open source by Caldera, with support for multibyte character sets, especially UTF-8.
  • convmv is a tool to convert the filenames in entire directory trees from a legacy encoding to UTF-8.

What patches to improve UTF-8 support are available?

Many of these patches have already been included in the respective main distribution.

  • The Advanced Utility Development subgroup of the OpenI18N (formerly Li18nux) project has prepared various internationalization patches for tools such as cut, fold, glibc, join, sed, uniq, xterm, etc. that might improve UTF-8 support.
  • A collection of UTF-8 patches for various tools, as well as a UTF-8 support status list, is in Bruno Haible’s Unicode-HOWTO.
  • Bruno Haible has also prepared various patches for stty, the Linux kernel tty, etc.
  • The multilingualization patch (w3m-m17n) for the text-mode web browser w3m allows you to view documents in all the common encodings on a UTF-8 terminal like xterm (also switch the option “Use alternate expression with ASCII for entity” to OFF after pressing “o”). Another multilingual version (w3mmee) is available as well (not yet tried here).

Are there free libraries for dealing with Unicode available?

  • Ulrich Drepper’s GNU C library glibc has featured since version 2.2 full multi-byte locale support for UTF-8, an ISO 14651 sorting order algorithm, and it can recode into many other encodings. All current Linux distributions come with glibc 2.2 or newer, so you definitely should upgrade now if you are still using an earlier Linux C library.
  • The International Components for Unicode (ICU) (formerly IBM Classes for Unicode) have become what is probably the most powerful cross-platform standard library for more advanced Unicode character processing functions.
  • X.Net’s xIUA is a package designed to retrofit existing code for ICU support by providing locale management, so that users do not have to modify internal calling interfaces to pass locale parameters. It uses more familiar APIs (for example, to collate you call xiua_strcoll) and is thread-safe.
  • Mark Leisher’s UCData Unicode character property and bidi library, as well as his wchar_t support test code.
  • Bruno Haible’s libiconv character-set conversion library provides an iconv() implementation, for use on systems which do not have one, or whose implementation cannot convert from/to Unicode.
    It also contains the libcharset character-encoding query library that allows applications to determine in a highly portable way the character encoding of the current locale, avoiding the portability concerns of using nl_langinfo(CODESET) directly.
  • Bruno Haible’s libutf8 provides various functions for handling UTF-8 strings, especially for platforms that do not yet offer proper UTF-8 locales.
  • Tom Tromey’s libunicode library is part of the Gnome Desktop project, but can be built independently of Gnome. It contains various character class and conversion functions. (CVS)
  • FriBidi is Dov Grobgeld’s free implementation of the Unicode bidi algorithm.
  • Markus Kuhn’s free wcwidth() implementation can be used by applications on platforms where the C library does not yet provide an equivalent function to find how many column positions a character or string will occupy on a UTF-8 terminal emulator screen (see the sketch after this list).
  • Markus Kuhn’s transtab is a transliteration table for applications that have to make a best-effort conversion from Unicode to ASCII or some 8-bit character set. It contains a comprehensive list of substitution strings for Unicode characters, comparable to the fallback notations that people commonly use in email and on typewriters to represent unavailable characters. The table comes in ISO/IEC TR 14652 format, to allow simple inclusion into POSIX locale definition files.
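As referenced above, here is a minimal C sketch of how an application might use wcswidth() to predict how many terminal columns a UTF-8 string occupies. It assumes a UTF-8 locale and an XSI-conforming C library; on platforms without wcswidth(), Markus Kuhn’s free implementation can be substituted:

  #define _XOPEN_SOURCE 700
  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <wchar.h>

  int main(void)
  {
      wchar_t wbuf[256];
      setlocale(LC_CTYPE, "");              /* activate the UTF-8 locale  */
      const char *s = "日本語 text";        /* mixed double/single width  */
      size_t n = mbstowcs(wbuf, s, 256);    /* UTF-8 -> wide characters   */
      if (n == (size_t) -1)
          return 1;                         /* invalid multibyte sequence */
      printf("columns: %d\n", wcswidth(wbuf, n));  /* 11 for this string  */
      return 0;
  }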

What is the status of Unicode support for various X widget libraries?

What packages with UTF-8 support are currently under development?

  • Native Unicode support is planned for Emacs 23. If you are interested in contributing/testing, please join the emacs-devel@gnu.org mailing list.
  • The Linux Console Project works on a complete revision of the VT100 emulator built into the Linux kernel, which will improve the simplistic UTF-8 support already there.

How does UTF-8 support work under Solaris?

Starting with Solaris 2.8, UTF-8 is at least partially supported. To use it, just set one of the UTF-8 locales, for instance by typing

 setenv LANG en_US.UTF-8

in a C shell.

Now the dtterm terminal emulator can be used to input and output UTF-8 text, and the mp print filter will print UTF-8 files on PostScript printers. The en_US.UTF-8 locale is at the moment supported by Motif and CDE desktop applications and libraries, but not by OpenWindows, XView, and OPENLOOK DeskSet applications and libraries.

For more information, read Sun’s Overview of en_US.UTF-8 Locale Support web page.

Can I use UTF-8 on the Web?

Yes. There are two ways in which an HTTP server can indicate to a client that a document is encoded in UTF-8:

  • Make sure that the HTTP header of a document contains the line
      Content-Type: text/html; charset=utf-8
    
    if the file is HTML, or the line
      Content-Type: text/plain; charset=utf-8
    
    if the file is plain text. How this can be achieved depends on your web server. If you use Apache and you have a subdirectory in which all *.html or *.txt files are encoded in UTF-8, then create a file .htaccess there and add to it the two lines
      AddType text/html;charset=UTF-8 html
      AddType text/plain;charset=UTF-8 txt
    
    A webmaster can modify /etc/httpd/mime.types to make the same change for all subdirectories simultaneously.
  • If you cannot influence the HTTP headers that the web server prefixes to your documents automatically, then add in an HTML document under HEAD the element
      <META http-equiv=Content-Type content="text/html; charset=UTF-8">
    
    which usually has the same effect. This obviously works only for HTML files, not for plain text. It also announces the encoding of the file to the parser only after the parser has already started to read the file, so it is clearly the less elegant approach.

The most widely used browsers now support UTF-8 well enough to generally recommend UTF-8 for use on web pages. The old Netscape 4 browser used an annoyingly large single font for displaying any UTF-8 document. Best upgrade to Mozilla, Netscape 6, or some other recent browser (Netscape 4 is generally very buggy and no longer maintained).

There is also the question of how non-ASCII characters entered into HTML forms are encoded in the subsequent HTTP GET or POST request that transfers the field contents to a CGI script on the server. Unfortunately, both standardization and implementation are still a huge mess here, as discussed in the FORM submission and i18n tutorial by Alan Flavell. We can only hope that a practice of doing all this in UTF-8 will emerge eventually. See also the discussion about Mozilla bug 18643.

How are PostScript glyph names related to UCS codes?

See Adobe’s Unicode and Glyph Names guide.

Are there any well-defined UCS subsets?

With over 40000 characters, the design of a font that covers every single Unicode character is an enormous project, not just regarding the number of glyphs that need to be created, but also in terms of the calligraphic expertise required to do an adequate job for each script. As a result, there are hardly any fonts that try to cover “all of Unicode”. While a few projects have attempted to create single complete Unicode fonts, their quality is not comparable with that of many good smaller fonts. For example, the Unicode and ISO 10646 books are still printed using a large collection of different fonts that only together cover the entire repertoire. Any high-quality font can only cover the Unicode subset for which the designer feels competent and confident.

Older, regional character encoding standards defined both an encoding and a repertoire of characters that an individual calligrapher could handle. Unicode lacks the latter, but in the interest of interoperability it is useful to have a handful of standardized subsets, each a few hundred to a few thousand characters large and targeted at particular markets, that font designers could practically aim to cover. A number of different UCS subsets have already been established:

  • The Windows Glyph List 4.0 (WGL4) is a set of 650 characters that covers all the 8-bit MS-DOS, Windows, Mac, and ISO code pages that Microsoft had used before. All Windows fonts now cover at least the WGL4 repertoire. WGL4 is a superset of CEN MES-1. (WGL4 test file).
  • Three European UCS subsets MES-1, MES-2, and MES-3 have been defined by the European standards committee CEN/TC304 in CWA 13873:
    • MES-1 is a very small Latin subset with only 335 characters. It contains exactly all characters found in ISO 6937 plus the EURO SIGN. This means MES-1 contains all characters of ISO 8859 parts 1,2,3,4,9,10,15. [Note: If your aim is to provide only the cheapest and simplest reasonable Central European UCS subset, I would implement MES-1 plus the following important 14 additional characters found in Windows code page 1252 but not in MES-1: U+0192, U+02C6, U+02DC, U+2013, U+2014, U+201A, U+201E, U+2020, U+2021, U+2022, U+2026, U+2030, U+2039, U+203A.]
    • MES-2 is a Latin/Greek/Cyrillic/Armenian/Georgian subset with 1052 characters. It covers every language and every 8-bit code page used in Europe (not just the EU!) and European language countries. It also adds a small collection of mathematical symbols for use in technical documentation. MES-2 is a superset of MES-1. If you are developing only for a European or Western market, MES-2 is the recommended repertoire. [Note: For bizarre committee-politics reasons, the following eight WGL4 characters are missing from MES-2: U+2113, U+212E, U+2215, U+25A1, U+25AA, U+25AB, U+25CF, U+25E6. If you implement MES-2, you should definitely also add those, and then you can claim WGL4 conformance in addition.]
    • MES-3 is a very comprehensive UCS subset with 2819 characters. It simply includes every UCS collection that seemed of potential use to European users. This is for the more ambitious implementors. MES-3 is a superset of MES-2 and WGL4.
  • JIS X 0221-1995 specifies 7 non-overlapping UCS subsets for Japanese users:
    • Basic Japanese (6884 characters): JIS X 0208-1997, JIS X 0201-1997
    • Japanese Non-ideographic Supplement (1913 characters): JIS X 0212-1990 non-kanji, plus various other non-kanji
    • Japanese Ideographic Supplement 1 (918 characters): some JIS X 0212-1990 kanji
    • Japanese Ideographic Supplement 2 (4883 characters): remaining JIS X 0212-1990 kanji
    • Japanese Ideographic Supplement 3 (8745 characters): remaining Chinese characters
    • Full-width Alphanumeric (94 characters): for compatibility
    • Half-width Katakana (63 characters): for compatibility
  • The ISO 10646 standard splits up its repertoire into a number of collections that can be used to define and document implemented subsets. Unicode defines similar, but not quite identical, blocks of characters, which correspond to sections in the Unicode standard.
  • RFC 1815 is a memo written in 1995 by someone who obviously did not like ISO 10646 and was unaware of JIS X 0221-1995. It discusses a UCS subset called “ISO-10646-J-1” consisting of 14 UCS collections, some of which are intersected with JIS X 0208. This is just what a particular font in an old Japanese Windows NT version from 1995 happened to implement. RFC 1815 is completely obsolete and irrelevant today and should best be ignored.
  • Markus Kuhn has defined in the ucs-fonts.tar.gz README three UCS subsets TARGET1, TARGET2, TARGET3 that are sensible extensions of the corresponding MES subsets and that were the basis for the completion of this xterm font package.

Markus Kuhn’s uniset Perl script allows convenient set arithmetic over UCS subsets for anyone who wants to define a new one or wants to check coverage of an implementation.

What issues are there to consider when converting encodings?

The Unicode Consortium maintains a collection of mapping tables between Unicode and various older encoding standards. It is important to understand that the primary purpose of these tables was to demonstrate that Unicode is a superset of the mapped legacy encodings, and to document the motivation and origin behind those Unicode characters that were included into the standard primarily for round-trip compatibility reasons with older character sets. The implementation of good character encoding conversion routines is a significantly more complex task than just blindly applying these example mapping tables! This is because some character sets distinguish characters that others unify.

The Unicode mapping tables alone are to some degree well suited to directly convert text from the older encodings to Unicode. High-end conversion tools nevertheless should provide interactive mechanisms, where characters that are unified in the legacy encoding but distinguished in Unicode can interactively or semi-automatically be disambiguated on a case-by-case basis.

Conversion in the opposite direction, from Unicode to a legacy character set, requires non-injective (= many-to-one) extensions of these mapping tables. Several Unicode characters have to be mapped to a single code point in many legacy encodings. The Unicode Consortium currently does not maintain standard many-to-one tables for this purpose and does not define any standard behavior of coded character set conversion tools.

Here are some examples of the many-to-one mappings that have to be handled when converting from Unicode into something else:

  U+00B5 MICRO SIGN
  U+03BC GREEK SMALL LETTER MU
      → 0xB5 in ISO 8859-1

  U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
  U+212B ANGSTROM SIGN
      → 0xC5 in ISO 8859-1

  U+03B2 GREEK SMALL LETTER BETA
  U+00DF LATIN SMALL LETTER SHARP S
      → 0xE1 in CP437

  U+03A9 GREEK CAPITAL LETTER OMEGA
  U+2126 OHM SIGN
      → 0xEA in CP437

  U+03B5 GREEK SMALL LETTER EPSILON
  U+2208 ELEMENT OF
      → 0xEE in CP437

  U+005C REVERSE SOLIDUS
  U+FF3C FULLWIDTH REVERSE SOLIDUS
      → 0x2140 in JIS X 0208

A first approximation of such many-to-one tables can be generated from available normalization information, but these then still have to be manually extended and revised. For example, it seems obvious that the character 0xE1 in the original IBM PC character set was meant to be usable as both a Greek small beta (because it is located between the code positions for alpha and gamma) and as a German sharp-s character (because that code is produced when pressing this letter on a German keyboard). Similarly, 0xEE can be either the mathematical element-of sign or a small epsilon. These characters are not Unicode normalization equivalents, because although they look similar in low-resolution video fonts, they are very different characters in high-quality typography. IBM’s tables for CP437 reflected one usage in some cases, Microsoft’s the other, both equally sensible. A good code converter should aim to be compatible with both, and not just blindly use the Microsoft mapping table alone when converting from Unicode.
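To make this concrete, here is a minimal C sketch of the table-lookup core of such a Unicode-to-Latin-1 converter, using two of the example mappings from the table above (the table contents and the helper name are illustrative, not any standard API):

  #include <stdio.h>

  struct fallback { unsigned int ucs; unsigned char latin1; };

  /* Extra many-to-one equivalents, sorted by UCS code position. */
  static const struct fallback extra[] = {
      { 0x03BC, 0xB5 },   /* GREEK SMALL LETTER MU -> MICRO SIGN */
      { 0x212B, 0xC5 },   /* ANGSTROM SIGN -> A WITH RING ABOVE  */
  };

  /* Map a UCS code to ISO 8859-1; returns 0 if no mapping exists. */
  static int to_latin1(unsigned int ucs, unsigned char *out)
  {
      size_t i;
      if (ucs < 0x100) {            /* Latin-1 is the first UCS page */
          *out = (unsigned char) ucs;
          return 1;
      }
      for (i = 0; i < sizeof extra / sizeof extra[0]; i++)
          if (extra[i].ucs == ucs) {
              *out = extra[i].latin1;
              return 1;
          }
      return 0;                     /* a transliteration table could
                                       be consulted next */
  }

  int main(void)
  {
      unsigned char c;
      if (to_latin1(0x212B, &c))    /* U+212B ANGSTROM SIGN */
          printf("0x%02X\n", c);    /* prints 0xC5          */
      return 0;
  }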

The Unicode database does contain in field 5 the Character Decomposition Mapping that can be used to generate some of the above example mappings automatically. As a rule, the output of a Unicode-to-Something converter should not depend on whether the Unicode input has first been converted into Normalization Form C or not. For equivalence information on Chinese, Japanese, and Korean Han/Kanji/Hanja characters, use the Unihan database. In the cases of the IBM PC characters in the above examples, where the normalization tables do not offer adequate mappings, the cross-references to similar looking characters in the Unicode book are a valuable source of suggestions for equivalence mappings. In the end, which mappings are used and which not is a matter of taste and observed usage.

The Unicode Consortium used to maintain mapping tables to CJK character set standards, but has declared them to be obsolete, because their presence on the Unicode web server led to the development of a number of inadequate and naive EUC converters. In particular, the (now obsolete) CJK Unicode mapping tables sometimes had to be slightly modified to preserve information in combination encodings. For example, the standard mappings provide round-trip compatibility for conversion chains ASCII to Unicode to ASCII as well as for JIS X 0208 to Unicode to JIS X 0208. However, the EUC-JP encoding covers the union of ASCII and JIS X 0208, and the UCS repertoire covered by the ASCII and JIS X 0208 mapping tables overlaps for one character, namely U+005C REVERSE SOLIDUS. EUC-JP converters therefore have to use a slightly modified JIS X 0208 mapping table, such that the JIS X 0208 code 0x2140 (0xA1 0xC0 in EUC-JP) gets mapped to U+FF3C FULLWIDTH REVERSE SOLIDUS. This way, round-trip compatibility from EUC-JP to Unicode to EUC-JP can be guaranteed without any loss of information. Unicode Standard Annex #11: East Asian Width provides further guidance on this issue. Another problem area is compatibility with older conversion tables, as explained in an essay by Tomohiro Kubota.

In addition to just using standard normalization mappings, developers of code converters can also offer transliteration support. Transliteration is the conversion of a Unicode character into a graphically and/or semantically similar character in the target code, even if the two are distinct characters in Unicode after normalization. Examples of transliteration:

  U+0022 QUOTATION MARK
  U+201C LEFT DOUBLE QUOTATION MARK
  U+201D RIGHT DOUBLE QUOTATION MARK
  U+201E DOUBLE LOW-9 QUOTATION MARK
  U+201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK
      → 0x22 in ISO 8859-1

The Unicode Consortium does not provide or maintain any standard transliteration tables at this time. CEN/TC304 has a draft report “European fallback rules” on recommended ASCII fallback characters for MES-2 in the pipeline, but this is not yet mature. Which transliterations are appropriate or not can in some cases depend on language, application field, and most of all personal preference. Available Unicode transliteration tables include, for example, those found in Bruno Haible’s libiconv, the glibc 2.2 locales, and Markus Kuhn’s transtab package.
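One convenient way to use such tables from C is the “//TRANSLIT” suffix on the target encoding that both glibc’s iconv() and Bruno Haible’s libiconv understand. A minimal sketch:

  #include <iconv.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      char in[] = "\xe2\x80\x9cquoted\xe2\x80\x9d";  /* “quoted” in UTF-8 */
      char out[64];
      char *ip = in, *op = out;
      size_t il = strlen(in), ol = sizeof out - 1;

      /* Request best-effort transliteration into ASCII. */
      iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
      if (cd == (iconv_t) -1)
          return 1;
      if (iconv(cd, &ip, &il, &op, &ol) == (size_t) -1)
          return 1;
      *op = '\0';
      printf("%s\n", out);          /* "quoted" with ASCII quotes */
      iconv_close(cd);
      return 0;
  }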

Is X11 ready for Unicode?

The X11 R7.0 release (2005) is the latest version of the X Consortium’s sample implementation of the X11 Window System standards. The bulk of the current X11 standards and parts of the sample implementation still pre-date widespread interest in Unicode under Unix.

Among the things that have already been fixed are:

  • Keysyms: Since X11R6.9, a keysym value has been allocated for every Unicode character in Appendix A of the X Window System Protocol specification. Any UCS character in the range U-00000100 to U-00FFFFFF can now be represented by a keysym value in the range 0x01000100 to 0x01ffffff (see the sketch after this list). This scheme was proposed by Markus Kuhn in 1998 and has been supported by a number of applications for many years, starting with xterm. The revised Appendix A now also contains an official UCS cross reference column in its table of pre-Unicode legacy keysyms.
  • UTF-8 locales: The X11R6.8 sample implementation added support for UTF-8 locales.
  • Fonts: A number of comprehensive Unicode standard fonts were added in X11R6.8, and they are now supported by some of the classic standard tools, such as xterm.
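The keysym allocation rule referenced above is simple enough to state in a few lines of C; a minimal sketch (the function name is illustrative):

  #include <stdio.h>

  /* Map a UCS code position to its X11 keysym value. */
  static unsigned long ucs_to_keysym(unsigned int ucs)
  {
      if (ucs >= 0x100 && ucs <= 0xFFFFFF)
          return 0x01000000UL | ucs;  /* e.g. U+20AC -> 0x010020AC */
      return ucs;  /* printable Latin-1 keysyms equal their UCS code */
  }

  int main(void)
  {
      printf("0x%08lX\n", ucs_to_keysym(0x20AC));  /* prints 0x010020AC */
      return 0;
  }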

There remain a number of problems in the X11 standards, and some inconveniences in the sample implementation for Unicode users, that still need to be fixed in one of the next X11 releases:

  • UTF-8 cut and paste: The ICCCM standard still does not specify how to transfer UCS strings in selections. Some vendors have added UTF-8 as yet another encoding to the existing COMPOUND_TEXT mechanism (CTEXT). This is not a good solution, for at least the following reasons:

    • CTEXT is a rather complicated ISO 2022 mechanism, and Unicode offers the opportunity to provide not just another add-on to CTEXT, but to replace the entire monster with something far simpler, more convenient, and equally powerful.
    • Many existing applications can communicate selections via CTEXT, but do not support a newly added UTF-8 option. A user of CTEXT has to decide whether to use the old ISO 2022 encodings or the new UTF-8 encoding, but both cannot be offered simultaneously. In other words, adding UTF-8 to CTEXT seriously breaks backwards compatibility with existing CTEXT applications.
    • The current CTEXT specification even explicitly forbids the addition of UTF-8 in section 6: “ISO registered ‘other coding systems’ are not used in Compound Text; extended segments are the only mechanism for non-2022 encodings.”

    Juliusz Chroboczek has written an Inter-Client Exchange of Unicode Text draft proposal for an extension of the ICCCM to handle UTF-8 selections with a new UTF8_STRING atom that can be used as a property type and selection target. This clean approach fixes all of the above problems. UTF8_STRING is just as state-less and easy to use as the existing STRING atom (which is reserved exclusively for ISO 8859-1 strings and therefore not usable for UTF-8), and adding a new selection target allows applications to offer selections in both the old CTEXT and the new UTF8_STRING format simultaneously, which maximizes interoperability. The use of UTF8_STRING can be negotiated between the selection holder and requestor, leading to no compatibility issues whatsoever. Markus Kuhn has prepared an ICCCM patch that adds the necessary definition to the standard. Current status: The UTF8_STRING atom has now been officially registered with X.Org, and we hope for an update of the ICCCM in one of the next releases.

  • Application window properties: In order to assist the window manager in correctly labeling windows, the ICCCM 2.0 specification requires applications to assign properties such as WM_NAME, WM_ICON_NAME and WM_CLIENT_MACHINE to each window. The old ICCCM 2.0 (1993) defines these to be of the polymorphic type TEXT, which means that they can have their text encoding indicated using one of the property types STRING (ISO 8859-1), COMPOUND_TEXT (an ISO 2022 subset), or C_STRING (unknown character set). Simply adding UTF8_STRING as a new option for TEXT would break backwards compatibility with old window managers that do not know about this type. Therefore, the freedesktop.org draft standard developed in the Window Manager Specification Project adds new additional window properties _NET_WM_NAME, _NET_WM_ICON_NAME, etc. that have type UTF8_STRING.
  • Inefficient font data structures: The Xlib API and X11 protocol data structures used for representing font metric information are extremely inefficient when handling sparsely populated fonts. The most common way of accessing a font in an X client is a call to XLoadQueryFont(), which allocates memory for an XFontStruct and fetches its content from the server. XFontStruct contains an array of XCharStruct entries (12 bytes each). The size of this array is the code position of the last character minus the code position of the first character plus one. Therefore, any “*-iso10646-1” font that contains both U+0020 and U+FFFD will cause an XCharStruct array with 65502 elements to be allocated (even for CharCell fonts), which requires 786 kilobytes of client-side memory and data transmission, even if the font contains only a thousand characters.

    A few workarounds have been used so far:

    • The non-Asian -misc-fixed-*-iso10646-1 fonts that come with XFree86 4.0 contain no characters above U+31FF. This reduces the memory requirement to 153 kilobytes, which is still bad, but much less so. (There are actually many useful characters above U+31FF present in the BDF files, waiting for the day when this problem will be fixed, but they currently all have an encoding of -1 and are therefore ignored by the X server. If you need these characters, then just install the original fonts without applying the bdftruncate script.)
    • Starting with XFree86 4.0.3, the truncation of a BDF font can also be done by specifying a character code subrange at the end of the XLFD, as described in the XLFD specification, section 3.1.2.12. For example,
      -Misc-Fixed-Medium-R-Normal--20-200-75-75-C-100-ISO10646-1[0x1200_0x137f]
      
      will load only the Ethiopic part of this BDF font, with a correspondingly nicely small XFontStruct. Earlier X server versions will simply ignore the font subset brackets and will give you the full font, so there is no compatibility problem with using that.
    • Bruno Haible has written a BIGFONT protocol extension for XFree86 4.0, which uses a compressed transmission of XCharStruct from server to client and also uses shared memory in Xlib between several clients which have loaded the same font.

    These workarounds do not solve the underlying problem that XFontStruct is unsuitable for sparsely populated fonts, but they do provide a significant efficiency improvement without requiring any changes in the API or client source code. One real solution would be to extend or replace XFontStruct with something slightly more flexible that contains a sorted list or hash table of characters as opposed to an array. This redesign of XFontStruct would at the same time also allow the addition of the urgently needed provisions for combining characters and ligatures.

    Another approach would be to introduce a new font encoding, which could be called for instance “ISO10646-C” (the C stands for combining, complex, compact, or character-glyph mapped, as you prefer). In this encoding, the numbers assigned to each glyph are really font-specific glyph numbers and are not equivalent to any UCS character code positions. The information necessary to do a character-to-glyph mapping would have to be stored in to-be-standardized new properties. This new font encoding would be used by applications together with a few efficient C functions that perform the character-to-glyph code mapping:

    • makeiso10646cglyphmap(XFontStruct *font, iso10646cglyphmap *map)
      Reads the character-to-glyph mapping table from the font properties into a compact and efficient in-memory representation.
    • freeiso10646cglyphmap(iso10646cglyphmap *map)
      Frees that in-memory representation.
    • mbtoiso10646c(char *string, iso10646cglyphmap *map, XChar2b *output)
      wctoiso10646c(wchar_t *string, iso10646cglyphmap *map, XChar2b *output)
      These take a Unicode character string and convert it into an XChar2b glyph string suitable for output by XDrawString16 with the ISO10646-C font from which the iso10646cglyphmap was extracted.

    ISO10646-C fonts would still be limited to not more than 65536 glyphs, but these could come from anywhere in UCS, not just from the BMP. This solution also easily provides for glyph substitution, such that we can finally handle the Indic fonts. It solves the huge-XFontStruct problem of ISO10646-1, as XFontStruct now grows proportionally with the number of glyphs, not with the highest character. It could also provide for simple overstriking combining characters, but then the glyphs for combining characters would have to be stored with negative width inside an ISO10646-C font. It can even provide support for variable combining accent positions, by having several alternative combining glyphs with accents at different heights for the same combining character, with the ligature substitution tables encoding which combining glyph to use with which base character.

    TODO: write a specification for the ISO10646-C properties, write sample implementations of the mapping routines, and add these to xterm, GTK, and other applications and libraries. Any volunteers? (A rough sketch of one such mapping routine appears after this list.)

  • Combining characters: The X11 specification does not support combining characters in any way. The font information lacks the data necessary to perform high-quality automatic accent placement (as it is found, for example, in all TeX fonts). Various people have experimented with implementing the simplest overstriking combining characters using zero-width characters with ink on the left side of the origin, but details of how to do this exactly are unspecified (e.g., are zero-width characters allowed in CharCell and Monospaced fonts?) and this is therefore not yet widely established practice.
  • Ligatures: The Indic scripts need font file formats that support ligature substitution, which is at the moment just as completely out of the scope of the X11 specification as are combining characters.
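As referenced in the TODO above, here is a purely hypothetical C sketch of what one of the proposed mapping routines might look like, assuming the in-memory glyph map were kept as an array sorted by UCS code position; neither the types nor the encoding are standardized anywhere:

  #include <stdlib.h>
  #include <X11/Xlib.h>        /* for XChar2b */

  typedef struct {
      unsigned int   ucs;      /* UCS code position          */
      unsigned short glyph;    /* font-specific glyph number */
  } iso10646centry;

  typedef struct {
      iso10646centry *entries; /* sorted by ucs */
      size_t          count;
  } iso10646cglyphmap;

  static int entrycmp(const void *a, const void *b)
  {
      const iso10646centry *x = a, *y = b;
      return (x->ucs > y->ucs) - (x->ucs < y->ucs);
  }

  /* Convert a wide-character string into font-specific glyph numbers. */
  void wctoiso10646c(wchar_t *string, iso10646cglyphmap *map, XChar2b *output)
  {
      for (; *string; string++, output++) {
          iso10646centry key = { (unsigned int) *string, 0 }, *hit;
          hit = bsearch(&key, map->entries, map->count,
                        sizeof key, entrycmp);
          unsigned short g = hit ? hit->glyph : 0;  /* 0 = default glyph */
          output->byte1 = g >> 8;
          output->byte2 = g & 0xFF;
      }
  }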

Several XFree86 team members have worked on these issues. X.Org, the official successor of the X Consortium and the Open Group as the custodian of the X11 standards and the sample implementation, has taken over the results or is still considering them.

With regard to the font-related problems, the solution will probably be to dump the old server-side font mechanisms entirely and use instead XFree86’s new Xft. Another related work in progress is the Standard Type Services (ST) framework that Sun has been working on.

What are useful Perl one-liners for working with UTF-8?

These examples assume that you have Perl 5.8.1 or newer and that you work in a UTF-8 locale (i.e., “locale charmap” outputs “UTF-8”).

Under Perl 5.8.0, the option -C is not needed; there the examples will work without -C in a UTF-8 locale. You really should no longer use Perl 5.8.0 anyway, as its Unicode support had lots of bugs.

Print the euro sign (U+20AC) to stdout:

  perl -C -e 'print pack("U",0x20ac)."\n"'
  perl -C -e 'print "\x{20ac}\n"'           # works only from U+0100 upwards

Locate malformed UTF-8 sequences:

  perl -ne '/^(([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})*)(.*)$/;print "$ARGV:$.:".($-[3]+1).":$_" if length($3)'

Locate non-ASCII bytes:

  perl -ne '/^([\x00-\x7f]*)(.*)$/;print "$ARGV:$.:".($-[2]+1).":$_" if length($2)'

Convert non-ASCII characters into SGML/HTML/XML-style decimal numeric character references (e.g., Ş becomes &#350;):

  perl -C -pe 's/([^\x00-\x7f])/sprintf("&#%d;", ord($1))/ge;'

Convert (hexa)decimal numeric character references to UTF-8:

  perl -C -pe 's/&\#(\d+);/chr($1)/ge;s/&\#x([a-fA-F\d]+);/chr(hex($1))/ge;'

How can I enter Unicode characters?

There is a range of techniques for entering Unicode characters that are not available directly on your keyboard.

Application-independent methods

  • Copy-and-paste from a small file that lists your most commonly used Unicode characters in a convenient arrangement suited to your needs. This is usually the most convenient and appropriate method for relatively rarely required special characters, such as more esoteric mathematical operators.
  • Extend your keyboard mapping using xmodmap. This is particularly convenient if your keyboard has an AltGr key, which is meant for exactly this purpose (some US keyboards have just a right Alt key instead of AltGr; others unfortunately lack that key entirely, in which case some other key must be assigned the Mode_switch function). Write a file "~/.Xmodmap" with entries such as
      keycode 113 = Mode_switch Mode_switch
      keysym d = d NoSymbol degree        NoSymbol
      keysym m = m NoSymbol emdash        mu
      keysym n = n NoSymbol endash        NoSymbol
      keysym 2 = 2 quotedbl twosuperior   NoSymbol
      keysym 3 = 3 sterling threesuperior NoSymbol
      keysym 4 = 4 dollar   EuroSign      NoSymbol
      keysym space = space  NoSymbol      nobreakspace NoSymbol
      keysym minus = minus  underscore    U2212        NoSymbol
      keycode 34 = bracketleft  braceleft  leftsinglequotemark  leftdoublequotemark
      keycode 35 = bracketright braceright rightsinglequotemark rightdoublequotemark
      keysym KP_Subtract = KP_Subtract NoSymbol U2212    NoSymbol
      keysym KP_Multiply = KP_Multiply NoSymbol multiply NoSymbol
      keysym KP_Divide   = KP_Divide   NoSymbol division NoSymbol
    
    and load it with "xmodmap ~/.Xmodmap" from your X11 startup script into your X server. You will then find that AltGr easily gets you the following new characters out of your keyboard:
    AltGr+d         → °  (degree sign)
    AltGr+space     → no-break space
    AltGr+[         → ‘  (left single quotation mark)
    AltGr+]         → ’  (right single quotation mark)
    AltGr+{         → “  (left double quotation mark)
    AltGr+}         → ”  (right double quotation mark)
    AltGr+2         → ²
    AltGr+3         → ³
    AltGr+-         → −  (minus sign)
    AltGr+n         → –  (en dash)
    AltGr+m         → —  (em dash)
    AltGr+M         → µ  (micro sign)
    AltGr+keypad-/  → ÷
    AltGr+keypad-*  → ×

    The above example file is meant for a UK keyboard, but is easily adapted to other layouts and extended with your own choice of characters. If you use Microsoft Windows, try the Microsoft Keyboard Layout Creator to make similar customizations.

  • ISO 14755 defines a hexadecimal input method: Hold down both the Ctrl and Shift keys while typing the hexadecimal Unicode number. After releasing Ctrl and Shift, you have entered the corresponding Unicode character.

    This is currently implemented in GTK+ 2, and works in applications such as GNOME Terminal, Mozilla and Firefox.

Application-specific methods

  • In VIM, type Ctrl-V u followed by a hexadecimal number. Example: Ctrl-V u 20ac
  • In Microsoft Windows, press the Alt key while typing the decimal Unicode number with a leading zero on the numeric keypad. Example: press-Alt 08364 release-Alt
  • In Microsoft Word, type a hexadecimal number and then press Alt+X to turn it into the corresponding Unicode character. Example: 20ac Alt-X

Are there any good mailing lists on these issues?

You should certainly be on the linux-utf8@nl.linux.org mailing list. That’s the place to meet for everyone interested in working towards better UTF-8 support for GNU/Linux or Unix systems and applications. To subscribe, send a message to linux-utf8-request@nl.linux.org with the subject subscribe. You can also browse the linux-utf8 archive and subscribe from there via a web interface.

There is also the unicode@unicode.org mailing list, which is the best way of finding out what the authors of the Unicode standard and a lot of other gurus have to say. To subscribe, send to unicode-request@unicode.org a message with the subject line “subscribe” and the text “subscribe YOUR@EMAIL.ADDRESS unicode”.

The relevant mailing list for discussions about Unicode support in Xlib and the X server is now xorg at xorg.org. In the past, there were also the fonts and i18n at xfree86.org mailing lists, whose archives still contain valuable information.

Further references

Suggestions for improvement are welcome.

Special thanks to Ulrich Drepper, Bruno Haible, Robert Brady, Juliusz Chroboczek, Shuhei Amakawa, Jungshik Shin, Robert Rogers, Roman Czyborra, Josef Hinteregger and many others for valuable comments, and to SuSE GmbH, Nürnberg, for their past support.

Creative Commons Licence

Markus Kuhn


https://www.cl.cam.ac.uk/~mgk25/unicode.html
