A short overview of ISO/IEC 10646 and Unicode



By Olle Järnefors <ojarnef@admin.kth.se>

Summary

The purpose of this text is to give a brief technical overview of the new character set standard ISO/IEC 10646 and the closely related Unicode standard. I have omitted descriptions of the history of the standard as well as general talk about why a standard of this type is badly needed.

Previous knowledge

The reader should have some knowledge about coded character sets, have seen an ASCII table, and know of some 8-bit character sets, like Latin-1 (ISO/IEC 8859-1).

Document history

Various drafts of this text have previously been available over the Internet, the latest of which is version Ap4 (from 1993-09-14).

1993-09-14, version Ap4: Last draft

1996-02-24, version A: Final document, prepared for the IAB character set workshop 1996-02-29/1996-03-01

1996-02-26, version Ar1: Added one item to the author presentation. HTML home added. Section 1: added three limitations of plain text removed by UCS. Section 5: paragraph about privat

About the author

Having joined SIS-ITS/AG2 (the Swedish standardization working group corresponding to ISO/IEC JTC1/SC2 -- Character sets and information coding) in 1988, I made contributions to the Swedish comments on several drafts of the ISO/IEC 10646 standard. I also had the pleasure to take part in the big merger of Unicode and ISO/IEC 10646 that was accomplished at three meetings during 1991 in San Francisco, Geneva and Paris, representing Sweden on the ISO side. I have also worked with character set standardization in European standardization (CEN/TC304) and within IETF. Lately, I have provided character set knowledge to and edited the first proposal for extending ISO/IEC 10646 with a major historical script, the Runic script.

Original home

The latest version of this text is available at
<URL:ftp://ftp.admin.kth.se/pub/misc/ucs/unicode-iso10646-oview.txta;type=A>

HTML home

An HTML version of this text is available at
<URL:http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html>

Table of contents with synopsis

1. Most important facts
ISO/IEC 10646 = UCS. Universal in scope. Multi-octet character set. Relation to Unicode. Plain and rich text

2. The structure of the coding space
The half-filled UCS-2. The unused UCS-4. Cell, row, plane, group. Relation to ISO/IEC 8859-1. UCS-2 = BMP = plane 0 of group 0.

3. Implementation levels
Level 1 (enough for Europe, the Middle East, East Asia). Level 2 (needed for South Asia). Bi-directional text. Precomposed characters, combining characters, composite sequences.

4. Adaptation to data communication needs
UCS transformation formats. UTF-8: UCS represented in 8-bit text. UTF-7: UCS-2 represented in 7-bit text. UTF-16: Part of UCS-4 represented in UCS-2.

5. What is accepted as a character in UCS?
Existing coded character sets amalgamated. CJK unification. Characters not shapes, not meanings. Compatibility characters. Private use characters.

6. References

7. Annex: Overview of the BMP (group=00, plane=00)

1. Most important facts

ISO/IEC 10646 is a relatively new character set standard, published in 1993 by the International Organization for Standardization (ISO). Its name is "Universal Multiple-Octet Coded Character Set". Throughout this overview I use its acronym, UCS.

UCS is the first officially standardized coded character set intended eventually to include all characters used in all the written languages of the world (and, in addition, all mathematical and other symbols). This is certainly a very ambitious goal, but the current first edition at least covers all major languages and all commercially important languages.

To be able to give every character of this grand repertoire a unique coded representation, the designers of UCS chose a uniform encoding, using bit sequences consisting of 16 or 31 bits (in the two coding forms, UCS-2 and UCS-4). This is the reason for the phrase "multi-octet" in the name of the standard.

Unicode is a coded character set specified by a consortium of major American computer manufacturers, primarily to overcome the chaos of different coded character sets in use when creating multilingual programs and internationalizing software. From version 1.1 on, Unicode is scrupulously kept compatible with ISO/IEC 10646 and its extensions. The consortium is also an important contributor to the ISO work to further develop ISO/IEC 10646.

In short, Unicode can be characterized as the (restricted) 2-octet form of UCS on (the most general) implementation level 3, with addition of a more precise specification of the bi-directional behavior of characters, when used in the Arabic and Hebrew scripts. Unicode is presently at version 1.1. Extensions in the soon forthcoming version 2.0 will make it possible to access also the wider coding space of UCS-4, within this 16-bit encoding.

UCS is intended to be usable both for internal data representation in computer systems and in data communication. UCS is already employed in commercial products from Microsoft, Novell, Apple and others. It is implemented in free software like Linux, and is proposed for inclusion in advanced data communication standards like HTML.

Strong but in my opinion ill-founded criticism has met UCS from programmer groups in Japan. It has, however, recently been adopted as a Japanese national standard.

ISO/IEC 10646 is a fundamental standard, potentially affecting almost all parts of information technology. But it specifies only a coded character set, not a complete system for text representation. It provides the basis for internationalization, but does not in itself give a complete solution of the problems in this field.

The simple kind of text for whose representation a coded character set standard is sufficient, plain text, is essentially only a linear sequence of graphic characters, with a fixed division into lines and possibly pages.

ISO/IEC 10646 and Unicode remove some assumptions often made about plain text. These assumptions simplify implementations but are untenable in multilingual text, and in monolingual text in some languages:

Plain text does not need to be monospaced. (Proportional plain text in the Latin script has existed in the Apple Macintosh computers since the middle of the 80's.)
Characters cannot be identified with glyphs. Different graphic forms to be used in different situations are needed for some characters, e.g. Arabic letters.
Characters do not in general specify the language of the text. UCS is a completely language-neutral standard.

For several important aspects of text, as treated in modern text processing programs, UCS needs to be supplemented by further standards or rules, so-called higher-level text protocols. Some examples of these aspects are tables, mathematical formulas, information about the language of text fragments, text variations like italic text and different text sizes, choice of particular fonts, content mark-up, document structure, hyperlinks. This is called rich text. (Some standards for rich text are HTML, SGML, Microsoft RTF.)

The evolution of ISO/IEC 10646 and, in parallel, Unicode will continue for a long period of time, mostly by additions of scripts and symbol collections. This overview describes the first edition of the standard from 1993, but some of the extensions that are about to be adopted are also touched upon.

2. The structure of the coding space

The first version of UCS includes 34203 different characters. Of these, 21204 are ideographic characters used in Chinese, Japanese and Korean, and 6656 are Korean Hangul syllabograms. To guarantee that the coding space will not be filled up even in the future -- 2 octets give 65536 different character positions -- a 4-octet form of UCS (UCS-4) is also defined.

The 65536 positions in the 2-octet form of UCS are divided into 256 rows with 256 cells in each. The first octet of a character representation gives the row number, the second the cell number. The first row, row 0, contains exactly the same characters as ISO/IEC 8859-1. The first 128 characters are thus the ASCII characters. The octet representing an ISO/IEC 8859-1 character is easily transformed to the representation in UCS, by putting a 0 octet in front of it. UCS includes the same control characters as ISO/IEC 8859 and these are also in row 0. An overview of the content of all rows is found in the annex.
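The row/cell split and the conversion from ISO/IEC 8859-1 described above can be sketched in a few lines of Python. This is only an illustration: the function names `row_and_cell` and `latin1_to_ucs2` are made up for this example, and the sketch assumes that the row octet is stored first, as the paragraph above describes.

```python
def row_and_cell(code: int) -> tuple[int, int]:
    """Split a 16-bit UCS-2 code into (row number, cell number)."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("not a UCS-2 code")
    return code >> 8, code & 0xFF


def latin1_to_ucs2(data: bytes) -> bytes:
    """Convert ISO/IEC 8859-1 octets to UCS-2 by prefixing each with a 0 octet."""
    out = bytearray()
    for octet in data:
        out.append(0x00)   # row 0
        out.append(octet)  # cell number = the original Latin-1 octet
    return bytes(out)


print(row_and_cell(0x01FA))                        # (1, 250)
print(latin1_to_ucs2(bytes([0xC5, 0x62])).hex())   # 00c50062
```

Converting back is possible only for codes in row 0, since the other rows have no counterpart in ISO/IEC 8859-1.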

In the 4-octet form more than 2 billion (2147483648) different characters can be represented. (The first bit of the first octet must be 0, so only 31 of the 32 bits are used by UCS.) This coding space is subdivided into 128 groups, each containing 256 planes. The first octet in a character representation indicates the group number and the second the plane number. The third and fourth octets give the row number and the cell number of the character. Those characters that can be represented by the 2-octet form of UCS belong to plane 0 of group 0, which is called the Basic Multilingual Plane, BMP. The 4-octet representation of a character in the BMP is produced by putting two 0 octets before its 2-octet representation.
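As a small worked example of the group/plane/row/cell structure, the following Python sketch splits a UCS-4 code into its four octet fields and widens a 2-octet BMP representation to 4 octets. The names `ucs4_fields` and `ucs2_to_ucs4` are hypothetical helpers for this illustration only.

```python
def ucs4_fields(code: int) -> tuple[int, int, int, int]:
    """Return (group, plane, row, cell) for a 31-bit UCS-4 code."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("UCS-4 uses only 31 bits")
    return (code >> 24) & 0x7F, (code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF


def ucs2_to_ucs4(two_octets: bytes) -> bytes:
    """Widen a 2-octet BMP representation by putting two 0 octets in front."""
    if len(two_octets) != 2:
        raise ValueError("expected exactly two octets")
    return b"\x00\x00" + two_octets


# 128 groups x 256 planes x 256 rows x 256 cells
print(128 * 256 * 256 * 256)                    # 2147483648
print(ucs4_fields(0x000001FA))                  # (0, 0, 1, 250)
print(ucs2_to_ucs4(bytes([0x01, 0xFA])).hex())  # 000001fa
```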

No characters have yet been allocated to positions outside the BMP, and only the 2-octet form is used in practice.

3. Implementation levels

Independently of the two encoding forms of UCS, the standard ISO/IEC 10646 also draws a distinction between three different implementation levels. The full coded character set is available on level 3. On the lower levels certain subsets of the characters are not usable. This restricts the range of languages that can be coded on these levels. On the other hand it makes simpler implementations possible.

A full implementation of the Unicode standard amounts to an implementation at level 3 of UCS.

The simplest implementation level 1 works exactly like the older simple coded character sets, such as ASCII and ISO/IEC 8859-1: Each graphic character occupies one position and moves the active position one step in the writing direction (even though the movement need not be constant; it is not if a proportional font is used). This model works well for among others the Latin, Greek, and Cyrillic scripts. On this level the composite letters, consisting of a base letter and one or more diacritical marks, which are used in certain languages, are included as single characters in their own right. UCS includes the composite letters of all official languages and also of most other languages with a well-established orthography using these scripts.

Also the Arabic and Hebrew scripts are handled on this level, but they introduce an extra complication: Arabic and Hebrew are normally written from right to left, but when words in e.g. the Latin script are included within such text, these are written in their normal direction, from left to right. In computer memory all characters are stored in the logical order, i.e. the order in which the sounds of the words are pronounced and the letters normally are input. When displayed, the writing direction of some words must be changed, relative to the order in memory. Two alternative methods to handle bi-directional text can be used together with UCS, one based on the international standard ISO/IEC 6429 and one defined for Unicode.

Other languages for which implementation level 1 is sufficient are Japanese and Chinese. These are not affected by either of the two complications noted above. For these languages it is the large number of different characters that makes implementations difficult.

On implementation level 2 the South-Asian scripts, e.g. Devanagari used on the Indian subcontinent, can also be handled. These cause further complications for display software, since in many cases both the appearance and the relative position of a certain letter are determined by the nearest surrounding letters.

On the full implementation level 3, conforming programs must also be able to handle independent combining characters, e.g. accents and other diacritical marks that are printed over, under or through ordinary letters. Such characters can be freely combined with other characters, and UCS sets no limit on the number of combining characters attached to a base character. A difference compared to some other coded character sets is that the codes for combining characters are stored after the code of the base character, not before it.

A complication for programming is that on this level some composite characters can each be coded in several different ways. As an example, the Danish letter "A with ring above and acute accent" can be represented in three different ways:
```
      01FA
      (the simple representation that must be used on level 1 and 2)

      00C5 0301
      ("A with ring above" + combining acute accent)

      0041 030A 0301
      ("A" + combining ring above + combining acute accent)
```
(The code positions in UCS are usually given in hexadecimal notation. 01FA indicates two octets, first the octet with the value 1, corresponding to row 1, then the octet with the hexadecimal value FA, corresponding to cell 250 in that row.)

Formally, the first alternative above is considered as a representation of a single precomposed character, while the second and third alternatives represent different composite sequences of several characters. Programs on implementation level 3 should, however, treat these three alternatives as fully equivalent representations of the same thing.
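One way to see that the three alternatives denote the same thing is to bring them to a common form before comparing. The sketch below uses the `unicodedata` module of the Python standard library. Note the assumption: the normalization forms it applies (NFC and NFD) were specified by Unicode after this text was written, so this illustrates the equivalence rather than a procedure defined in ISO/IEC 10646 itself.

```python
import unicodedata

# The three representations from the example above.
forms = [
    "\u01FA",              # precomposed character
    "\u00C5\u0301",        # "A with ring above" + combining acute accent
    "\u0041\u030A\u0301",  # "A" + combining ring above + combining acute accent
]

# Fully composed and fully decomposed versions of each representation.
composed = {unicodedata.normalize("NFC", s) for s in forms}
decomposed = {unicodedata.normalize("NFD", s) for s in forms}


def equivalent(a: str, b: str) -> bool:
    """Compare two strings after decomposing both to the same form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(len(composed), len(decomposed))  # 1 1
print(equivalent(forms[0], forms[2]))  # True
```

A program that compares the raw code sequences directly would treat the three forms as different strings, which is the programming complication mentioned above.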

Implementation level 3 is necessary for full support of the Korean Hangul script and also for full support of IPA, the International Phonetic Alphabet. It also removes artificial restrictions on combining accents and similar marks with letters in ways not anticipated when the composite characters of implementation level 1 were chosen.

4. Adaptation to data communication needs

Many data communication protocols treat octets with values in the hexadecimal range 00-1F specially; they represent control characters in most 7-bit and 8-bit character sets. The most used protocol for electronic mail, classical SMTP, even explicitly forbids the 128 octets > hex 7F. In certain datatypes used in data communication, e.g. domain names on the Internet, even harder restrictions are imposed on allowed octets. In some important operating systems, notably Unix, even some octets that in ASCII represent graphic characters cannot be used in file names.

When UCS is used in these contexts, the simple solution of just partitioning the 16-bit or 31-bit codes into 2 or 4 octets does not work. For many graphic characters this will produce octets in the ranges forbidden by the above-mentioned protocols and operating system designs.
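The problem is easy to demonstrate. The following Python sketch partitions BMP characters into two octets each and lists the octets that fall in the control range 00-1F or above hex 7F. The helper names are hypothetical, and the sketch assumes that Python's `utf-16-be` codec can stand in for plain UCS-2 with the row octet first, which holds for BMP characters only.

```python
def naive_octets(text: str) -> bytes:
    """Partition each BMP character into two octets (row, then cell)."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("this sketch handles BMP characters only")
    return text.encode("utf-16-be")


def problem_octets(data: bytes) -> list[int]:
    """Octets in the control range 00-1F or above hex 7F."""
    return [octet for octet in data if octet < 0x20 or octet > 0x7F]


# Even plain ASCII "A" (0041) yields a 00 octet.
print([f"{o:02X}" for o in problem_octets(naive_octets("A"))])       # ['00']
# Greek capital omega (03A9) yields a control octet and an octet > 7F.
print([f"{o:02X}" for o in problem_octets(naive_octets("\u03A9"))])  # ['03', 'A9']
```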

For these reasons, several algorithmic transformation methods have been defined for UCS data. The UTF-1 method (UCS Transformation Format No. 1), defined in an annex to ISO/IEC 10646, is of little interest and will be withdrawn. More important are the following:

UTF-8: The codes in the first half of the first row of the BMP, i.e. the characters that also can be found in ASCII, are in this transformation format replaced by their ASCII codes, which are octets in the range hex 00-7F. The other codes of UCS are transformed to between two and six octets in the range hex 80-FF. A text only containing characters in the BMP is transformed to the same octet sequence, irrespective of whether it was coded with UCS-2 or UCS-4.
UTF-7: This is a transformation format specially designed for the extreme requirements of Internet e-mail using the classical SMTP protocol. It transforms UCS-2-coded text to a sequence of octets that are all <= hex 7F. In this encoding most ASCII characters of the UCS-2 text are replaced by their ASCII octet. All other characters are transformed to a representation using around 2.7 octets per character.
UTF-16: Unlike UTF-8 and UTF-7, this transformation reduces UCS-4-coded text to a UCS-2-based encoding, and the result can only be used by so-called 8-bit safe programs and processes, where all octet values are allowed. All UCS-4 codes in the BMP are reduced to the corresponding code in UCS-2. In addition, UCS-4 codes in the 16 (hex 10) following planes of group 0 are each transformed to two UCS-2 codes. 2048 codes in the BMP are reserved for this, 1024 for the first code of a pair and 1024 for the second, which gives 1024 x 1024 = 1048576 pairs. This makes the characters that in the future may be allocated to 1048576 code positions of UCS-4 outside the BMP available in the 16-bit UCS-2 coded character set. The other code positions in UCS-4 are still unusable in the UTF-16 transformation format. One motivation for defining UTF-16 has been that it will make it possible for software implementing Unicode to cope with the expansion of UCS outside the BMP for the foreseeable future.
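To make the UTF-8 description concrete, here is a minimal Python sketch of an encoder covering the full 31-bit range with one to six octets. The function name `utf8_encode` and the table layout are choices made for this example. The bit patterns follow the commonly published UTF-8 scheme (a lead octet announcing the length, followed by trail octets carrying 6 bits each), and the sketch assumes that scheme matches the definition referred to here.

```python
def utf8_encode(code: int) -> bytes:
    """Encode one UCS code (0..7FFFFFFF) as 1 to 6 UTF-8 octets."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("not a 31-bit UCS code")
    if code < 0x80:
        return bytes([code])  # ASCII range: unchanged single octet
    # (number of octets, exclusive upper limit, lead-octet marker)
    table = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    )
    for length, limit, lead in table:
        if code < limit:
            break
    trail = []
    for _ in range(length - 1):
        trail.append(0x80 | (code & 0x3F))  # low 6 bits go into a trail octet
        code >>= 6
    return bytes([lead | code] + trail[::-1])


print(utf8_encode(0x41).hex())        # 41
print(utf8_encode(0x01FA).hex())      # c7ba
print(utf8_encode(0x7FFFFFFF).hex())  # fdbfbfbfbfbf
```

Because the input is a code value rather than a UCS-2 or UCS-4 octet sequence, the result for a BMP character is the same whichever of the two forms the text was originally coded in. Python's built-in `utf-8` codec follows a later, more restricted definition that stops at four octets, so it agrees with this sketch only for the codes it accepts.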

UTF-8 and UTF-16 will be added to ISO/IEC 10646 in the next revision of the standard, and are included in the forthcoming Unicode version 2.0. UTF-7 is a specification of IETF, the Internet Engineering Task Force, and formally unrelated to ISO/IEC 10646.
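The pairing used by UTF-16, as described in the list above, can be sketched as follows. The concrete reserved ranges used here (hex D800-DBFF for the first code of a pair, DC00-DFFF for the second) are not stated in this text. They are taken from the UTF-16 definition as later published and should be read as an assumption of this example, as should the function name `utf16_encode`.

```python
def utf16_encode(code: int) -> list[int]:
    """Return the 16-bit codes representing one UCS-4 code in UTF-16."""
    if 0xD800 <= code <= 0xDFFF:
        raise ValueError("reserved for pairs, not a character code")
    if 0 <= code < 0x10000:
        return [code]  # BMP: the corresponding UCS-2 code
    if code > 0x10FFFF:
        raise ValueError("outside the 16 planes reachable by UTF-16")
    offset = code - 0x10000             # 20 bits, 0 .. 0xFFFFF
    first = 0xD800 + (offset >> 10)     # upper 10 bits
    second = 0xDC00 + (offset & 0x3FF)  # lower 10 bits
    return [first, second]


print([f"{c:04X}" for c in utf16_encode(0x01FA)])    # ['01FA']
print([f"{c:04X}" for c in utf16_encode(0x10000)])   # ['D800', 'DC00']
print([f"{c:04X}" for c in utf16_encode(0x10FFFF)])  # ['DBFF', 'DFFF']
```

Each half of a pair carries 10 bits, so 1024 x 1024 = 1048576 positions outside the BMP become reachable.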

5. What is accepted as a character in UCS?

The character repertoire of the first version of UCS is based on an amalgamation of all internationally standardized coded character sets and the most important company-defined de facto standards for coded character sets that existed in 1991. Whenever what was deemed to be the same character was found in different coded character sets, these were unified into one character with one code in UCS. But two different characters in the same coded character set were never unified. The letters of some scripts with no existing standard coded character set, and vast collections of mathematical symbols, technical symbols, geometric shapes, dingbats and other conventional signs, were also included in the repertoire of UCS.

When deciding whether a graphic character should be added to UCS, the most important principle has been that a new character must differ from all already included characters both in meaning and in appearance to be accepted.

Alternative graphic forms of existing characters (font variants, glyphs) are consequently not given UCS codes of their own. In Chinese, Japanese and Korean there is a very large number of ideographic characters which have the same historical origin and only minor differences in appearance between the three languages. These national variants of the same ideographic character have been given a joint UCS code, a solution which is known as CJK unification.

On the other hand, not even a completely new way of using an existing character -- the same appearance but different meanings -- is sufficient justification to get it included in UCS as a separate character. For example the punctuation mark asterisk, "*", of considerable age in itself, has in recent years also been used as multiplication sign in different programming languages. This case is regarded as two different uses of the same character, which is given only one UCS representation.

There are two important exceptions from the criteria for character sameness outlined above:

Letters with exactly the same appearance that occur in different scripts are given different codes. There are for example one Latin "P", one Greek "P" (capital rho), and one Cyrillic "P" (Cyrillic R).
A comparatively small number of characters have been accepted in UCS only because they occur in other, practically important coded character sets. This is to make possible the fully reversible conversion of data coded in these coded character sets to UCS and back again to the original character set (round-trip convertibility). Such characters are called compatibility characters. One example is the character SUPERSCRIPT TWO which can be found in UCS only because it is included in the coded character set ISO/IEC 8859-1.
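Both exceptions can be inspected with the `unicodedata` module of the Python standard library, as in the sketch below. The assumption here is that the character database shipped with Python, which reflects a much later Unicode version than the one described in this text, still records these characters in the same way.

```python
import unicodedata

# Three letters that look alike but belong to different scripts.
look_alikes = ["\u0050", "\u03A1", "\u0420"]
names = {f"{ord(ch):04X}": unicodedata.name(ch) for ch in look_alikes}
for code, name in names.items():
    print(code, name)
# 0050 LATIN CAPITAL LETTER P
# 03A1 GREEK CAPITAL LETTER RHO
# 0420 CYRILLIC CAPITAL LETTER ER

# SUPERSCRIPT TWO is recorded as a variant of DIGIT TWO (0032),
# which marks it as a compatibility character.
superscript_two = unicodedata.decomposition("\u00B2")
print(superscript_two)  # <super> 0032
```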

What is said here is only a general outline of the principles used to identify individual characters to be given a code position in UCS and Unicode. These are unfortunately not described at all in the text of ISO/IEC 10646. In many specific cases it is of course not at all clear how to apply them. Quite a number of the decisions made are fairly arbitrary.

One important feature of UCS is that a large number of code positions are reserved for private use characters. No future revision of ISO/IEC 10646 will use these positions. There is room for 6400 private characters in the 2-octet form, and more in the 4-octet form.
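The figure of 6400 follows from the layout in the annex, where rows E0 to F8 of the BMP are marked as the Private Use Area. A short Python check, with the hypothetical helper `is_bmp_private_use`, shows the arithmetic:

```python
# Rows E0 through F8 of the BMP, as listed in the annex.
FIRST_ROW, LAST_ROW = 0xE0, 0xF8

rows = LAST_ROW - FIRST_ROW + 1  # 25 rows
positions = rows * 256           # 256 cells per row
print(rows, positions)           # 25 6400


def is_bmp_private_use(code: int) -> bool:
    """True if a 16-bit code lies in rows E0-F8."""
    return FIRST_ROW <= (code >> 8) <= LAST_ROW


print(is_bmp_private_use(0xE000), is_bmp_private_use(0xF900))  # True False
```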

6. References

UCS is defined in:

ISO/IEC International Standard 10646-1:1993(E): Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. International Organization for Standardization, Geneva, 1993.

Unicode version 1.0 is defined in two books:

The Unicode Consortium: The Unicode Standard Worldwide Character Encoding. Version 1.0. Volume 1 (Architecture, non-ideographic characters) Addison-Wesley, 1991

The Unicode Consortium: The Unicode Standard Worldwide Character Encoding. Version 1.0. Volume 2 (Ideographic characters) Addison-Wesley, 1992

The changes made between version 1.0 and version 1.1 are specified in:

Unicode Technical Report #4: The Unicode Standard, Version 1.1 The Unicode Consortium, 1993

Definitions of the various transformation formats proposed to be included in ISO/IEC 10646 and Unicode 2.0 are available on the Internet:

UTF-7 Encoding Form
[HTML-version of RFC 1642]
http://www.stonehand.com/unicode/standard/utf7.html

UCS Transformation Format 8 (UTF-8) [HTML-version of ISO-document ISO/IEC JTC1/SC2/WG2 N1036] http://www.stonehand.com/unicode/standard/wg2n1036.html

UCS Transformation Format 16 (UTF-16) [HTML-version of ISO-document ISO/IEC JTC1/SC2/WG2 N1035] http://www.stonehand.com/unicode/standard/wg2n1035.html

Internet sites with much information about Unicode:

http://www.stonehand.com/unicode/

ftp://ftp.stonehand.com/pub/

ftp://unicode.org/pub/

A good account of the history of ISO work on multi-octet character sets and the merger between ISO/IEC 10646 and Unicode can be found in:

Michael Y. Ksar: Untying tongues. ISO/IEC breaks down computer barriers in processing worldwide languages ISO Bulletin, No. 6 (June 1993)

Annex: Overview of the BMP (group=00, plane=00)

_______ ___________________________________________________________________

Row(s)  Content (script, other groups of characters, reserved area)
_______ ___________________________________________________________________

======= A-ZONE (alphabetical characters and symbols) =======================
00      (Control characters,) Basic Latin, Latin-1 Supplement (=ISO/IEC 8859-1)
01      Latin Extended-A, Latin Extended-B
02      Latin Extended-B, IPA Extensions, Spacing Modifier Letters
03      Combining Diacritical Marks, Basic Greek, Greek Symbols and Coptic
04      Cyrillic
05      Armenian, Hebrew
06      Basic Arabic, Arabic Extended
07--08  (Reserved for future standardization)
09      Devanagari, Bengali
0A      Gurmukhi, Gujarati
0B      Oriya, Tamil
0C      Telugu, Kannada
0D      Malayalam
0E      Thai, Lao
0F      (Reserved for future standardization)
10      Georgian
11      Hangul Jamo
12--1D  (Reserved for future standardization)
1E      Latin Extended Additional
1F      Greek Extended
20      General Punctuation, Super/subscripts, Currency, Combining Symbols
21      Letterlike Symbols, Number Forms, Arrows
22      Mathematical Operators
23      Miscellaneous Technical Symbols
24      Control Pictures, OCR, Enclosed Alphanumerics
25      Box Drawing, Block Elements, Geometric Shapes
26      Miscellaneous Symbols
27      Dingbats
28--2F  (Reserved for future standardization)
30      CJK Symbols and Punctuation, Hiragana, Katakana
31      Bopomofo, Hangul Compatibility Jamo, CJK Miscellaneous
32      Enclosed CJK Letters and Months
33      CJK Compatibility
34--4D  Hangul

======= I-ZONE (ideographic characters) ===================================
4E--9F  CJK Unified Ideographs

======= O-ZONE (open zone) ================================================
A0--DF  (Reserved for future standardization)

======= R-ZONE (restricted use zone) ======================================
E0--F8  (Private Use Area)
F9--FA  CJK Compatibility Ideographs
FB      Alphabetic Presentation Forms, Arabic Presentation Forms-A
FC--FD  Arabic Presentation Forms-A
FE      Combining Half Marks, CJK Compatibility Forms, Small Forms, Arabic-B
FF      Halfwidth and Fullwidth Forms, Specials
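
The zone boundaries in the table above can be turned into a small lookup, sketched here in Python. The names `ZONES` and `zone_of` are invented for this example, and the boundaries are exactly those of this annex, which describes the first edition of the standard.

```python
# (first row, last row, zone letter), copied from the table above.
ZONES = [
    (0x00, 0x4D, "A"),  # alphabetical characters and symbols
    (0x4E, 0x9F, "I"),  # ideographic characters
    (0xA0, 0xDF, "O"),  # open zone
    (0xE0, 0xFF, "R"),  # restricted use zone
]


def zone_of(code: int) -> str:
    """Return the zone letter for a 16-bit BMP code."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("not a BMP code")
    row = code >> 8
    for first, last, zone in ZONES:
        if first <= row <= last:
            return zone
    raise AssertionError("unreachable: the zones cover rows 00-FF")


print(zone_of(0x0041), zone_of(0x4E00), zone_of(0xA000), zone_of(0xE000))  # A I O R
```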


Author: Olle Järnefors <ojarnef@admin.kth.se>
Maintainer: Peter Svanberg <psv@nada.kth.se>
Organization: Royal Institute of Technology (KTH), Stockholm, Sweden
Version: Ar1
Document type: overview
Newest version at: ftp://ftp.admin.kth.se/pub/misc/ucs/unicode-iso10646-oview.txta
URL: http://www.nada.kth.se/i18n/unicode-iso10646-oview.html
This version updated: 1996-02-26
