Character sets and codepages

Archived on Fri Jan 21 12:19:02 2005

Abstract

The goal of this document is to:

Define terminology relating to character sets.
Explain how characters are mapped to glyphs.
Describe the Windows 95 WGL4 character set.
List standard codepages for Windows 95.
Explain the codepage/Unicode range encoding within a text font.

Additional references

TrueType 1.0 Font File Specification, v.1.65, Microsoft
See this specification for more information about character sets, the WGL4 list of characters, Macintosh compatibility, and language encoding within a font.

The Unicode Standard, Version 1.0, Addison-Wesley, 1991
See this book for more information about Unicode script ranges and the characters they cover.

Developing International Software for Windows 95 and Windows NT: A Handbook for International Software Design, Microsoft
See this handbook for more information about various writing systems, input methods, Far East character mappings, NT-specific issues, and programming with Unicode.


Characters, glyphs and fonts

We often speak inaccurately of character sets: we may refer to a "Greek character set" or a "Latin character set". But in order to understand how different writing systems are supported by Windows, we need to be more precise about characters.

Users don't view or print characters: a user views or prints glyphs. A glyph is a representation of a character. The character "Capital Letter A" is represented by one glyph "A" in Times New Roman Bold, and by a differently drawn glyph "A" in Arial Bold. A font is a collection of glyphs. Windows is able to retrieve the appropriate glyphs by using mapping information about the keyboard, the language system in use, and the glyphs associated with each character.

Fonts are designed with character sets in mind: a font for use in Russia will include glyphs representing Cyrillic characters. There is no magic word or blessing uttered to create a "character set." A character set is only a collection of characters. However, characters from different language systems are conventionally divided into different "character sets", primarily because, in the past, a limited number of characters could be "addressed" at any one time.


Preparing for TrueType Open: glyph substitution

Glyphs can also represent combinations of characters and alternate forms of characters: there is not a strict one-to-one correspondence between glyphs and characters. For example, two characters may be typed in a document, but represented by only a single glyph (a ligature glyph). Conversely, different versions of a character may appear at the beginning, middle, or ending of a word. Thus, a single character can be represented by several different glyphs in a font.

TrueType Open will provide a substitution table to handle one-to-one, one-to-many, and many-to-one mappings.
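The three kinds of mapping can be illustrated with a minimal Python sketch. It is not the TrueType Open table format: the glyph names, the RULES dictionary, and the substitute function are all hypothetical, and the greedy longest-match strategy is only one simple way to apply such rules.

```python
# Hypothetical substitution rules: a tuple of input glyph names maps to a list
# of output glyph names. None of these names come from a real font.
RULES = {
    ("f", "i"): ["f_i"],   # many-to-one: two characters, one ligature glyph
    ("ae",): ["a", "e"],   # one-to-many: one character, two glyphs
    ("a",): ["a.alt"],     # one-to-one: an alternate form of a character
}

def substitute(glyphs, rules):
    """Apply the rules in a single left-to-right pass, longest match first."""
    longest = max((len(key) for key in rules), default=1)
    out = []
    i = 0
    while i < len(glyphs):
        for size in range(min(longest, len(glyphs) - i), 0, -1):
            key = tuple(glyphs[i:i + size])
            if key in rules:
                out.extend(rules[key])
                i += size
                break
        else:
            # No rule matched at this position: keep the glyph unchanged.
            out.append(glyphs[i])
            i += 1
    return out

print(substitute(["f", "i", "n", "a", "ae"], RULES))
# prints ['f_i', 'n', 'a.alt', 'a', 'e']
```

Because the pass is single, glyphs produced by a rule are not substituted again, which is why the "a" that comes out of the "ae" rule stays as it is.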


Character codes

Characters are represented by character codes. Character codes are generated and stored when a user inputs a document. Single-Byte character sets (SBCS) provide 256 character codes (2 to the 8th power). This is an adequate number to encode most of the characters needed for Western Europe. For example, the Windows Extended ANSI character set contains 256 characters consisting of Latin letters, Arabic numerals, punctuation, and drawing characters.

However, 256 character codes are not enough to represent all the characters needed by multi-lingual users in a single font, or by users in the Far East, where over 12,000 characters may need to be addressed at any one time. Consequently, Multi-Byte character sets (commonly known as Double-Byte character sets) are necessary. Double-Byte character sets (DBCS) are a mixture of Single-Byte and Double-Byte character encodings and provide over 65,000 character codes (2 to the 16th power).


Unicode

Unicode is a 16-bit encoding that encompasses many characters used in general text interchange throughout the world. Each Unicode index refers unambiguously to a given character. Unicode allows a larger range of characters to be addressed than is possible using a Single-Byte character encoding. All Unicode values are Double-Byte, which simplifies the way a Unicode-based system reads a string of text. In comparison, a Double-Byte system must determine which values in a string are Single-Byte character codes and which are Double-Byte character codes.
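The difference in parsing effort can be sketched in Python. This example assumes codepage 932 (Shift-JIS) as the Double-Byte character set; the lead-byte ranges below are the commonly documented ones for that codepage and are not taken from this document. The function names are hypothetical.

```python
def is_lead_byte(value):
    # Assumed lead-byte ranges for codepage 932 (Shift-JIS).
    return 0x81 <= value <= 0x9F or 0xE0 <= value <= 0xFC

def split_dbcs(data):
    """Split a DBCS byte string into one chunk per character."""
    chunks = []
    i = 0
    while i < len(data):
        # A lead byte announces that the next byte belongs to the same character.
        size = 2 if is_lead_byte(data[i]) and i + 1 < len(data) else 1
        chunks.append(data[i:i + size])
        i += size
    return chunks

text = "A日本"

# DBCS: each byte must be inspected to find the character boundaries.
dbcs = text.encode("cp932")
decoded = [chunk.decode("cp932") for chunk in split_dbcs(dbcs)]

# 16-bit Unicode: every character in this string is exactly two bytes.
utf16 = text.encode("utf-16-le")
units = [utf16[i:i + 2] for i in range(0, len(utf16), 2)]

print(len(dbcs), decoded)   # mixed widths: 1 + 2 + 2 bytes
print(len(utf16), len(units))
```

The DBCS string needs a scan with a lead-byte test, while the 16-bit string can be sliced at fixed offsets.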

NT internally uses Unicode for character encoding. Under NT, applications can still support existing Single-Byte codepages (discussed below) using the NLS APIs. DBCS-to-Unicode mappings are handled via the MultiByteToWideChar and WideCharToMultiByte APIs.

Windows 95 does not use Unicode internally for character encoding. However, Windows 95 is able to handle Multi-Byte character sets, and is able to map to Unicode using international APIs (such as MultiByteToWideChar, mentioned above).


Codepages

The meaning of the term "codepage" has evolved over time. Only one definition concerns us now: In Windows 95 and NT, a codepage is a list of selected character codes in a certain order. Codepages are usually defined to support specific languages or groups of languages which share common writing systems. For example, codepage 1253 provides character codes required in the Greek writing system.
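As a small illustration of why the codepage matters, the Python sketch below decodes the same single byte under three Windows codepages. Python's built-in codecs stand in here for the system's own mapping tables; the variable and function names are hypothetical.

```python
# The same stored byte means a different character in each codepage.
RAW = bytes([0xC1])

def describe(codepage):
    """Return the character and its Unicode value for RAW in this codepage."""
    character = RAW.decode(codepage)
    return character, f"U+{ord(character):04X}"

for codepage in ("cp1252", "cp1253", "cp1251"):
    print(codepage, *describe(codepage))
# prints:
# cp1252 Á U+00C1
# cp1253 Α U+0391
# cp1251 Б U+0411
```

Under codepage 1253 the byte is the Greek capital letter Alpha, while the Western and Cyrillic codepages assign it to entirely different letters.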

The order of the character codes in a codepage allows the system to provide the appropriate character code to an application when a user presses a key on the keyboard. When a new codepage is loaded, different character codes are provided to the application.

In Windows 95, codepages can be changed on-the-fly by the user, without changing the default language system in use. An application can determine which codepages a specific font supports and can then present language options to the user.


Preparing for TrueType Open: saving writing system information within a text stream

When a user changes codepages, character codes from the new codepage are stored in the text stream. However, most codepages support multiple writing systems, each of which may have special rules about substituting or placing glyphs. TrueType Open will allow the flexibility for multiple writing systems to be supported by a single character set. Glyph substitution and placement rules can be associated with a writing system and stored in the font. Applications requiring these advanced features will need to save in the document an indication of the writing system in use, as well as the character codes entered.


The WGL4 character set

Traditionally, a font has been designed to contain all the glyphs required by a single codepage. However, Microsoft has now defined a character set standard which includes characters required by Western, Central, and Eastern European writing systems, as well as characters required by Greek and Turkish. This "PanEuropean" character set contains 652 characters and is called WGL4: Windows Glyph List 4. WGL4 takes advantage of the ability of Windows 95 to address characters according to their Unicode Double-Byte character codes using API extensions.

Note: WGL4 fonts are not required under Windows 95. Windows 95 will continue to support fonts which worked under Windows 3.1.

The WGL4 character set covers several codepages: 1250, 1251, 1252, 1253, and 1254. A user can load a single WGL4 font, and change codepages as needed. Previously, a user desiring to switch from English to Cyrillic to Greek while typing would have to choose three different fonts: first typing in Times New Roman, then in Times New Roman Cyrillic, and then in Times New Roman Greek.
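To make the codepage coverage concrete, the following Python sketch reports which of the five codepages named above can encode a given piece of text. It relies on Python's built-in codecs rather than on any font data, and the names WGL4_CODEPAGES and codepages_for are hypothetical.

```python
# The five codepages the text says WGL4 covers, as Python codec names.
WGL4_CODEPAGES = ("cp1250", "cp1251", "cp1252", "cp1253", "cp1254")

def codepages_for(text):
    """Return the codepages (from the five above) that can encode all of text."""
    found = []
    for codepage in WGL4_CODEPAGES:
        try:
            text.encode(codepage)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this codepage
        found.append(codepage)
    return found

print(codepages_for("Ж"))    # Cyrillic letter
print(codepages_for("Ω"))    # Greek letter
print(codepages_for("AЖΩ"))  # mixed scripts
```

The mixed-script string fits none of the individual codepages, which is the situation that previously forced a user to switch between separate fonts.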

Microsoft is supporting font developers as they create new WGL4 fonts. Windows 95 will also enable font developers to create fonts for large character sets other than WGL4, and users will be able to access all the glyphs as long as the associated characters exist in codepages supported by Windows 95.

The WGL4 character set is listed in Chapter 4 of the TrueType 1.0 Font File Specification (available on MSDN). The character set is compared to Win 3.1 ANSI, UGL, and Macintosh character sets.


Identifying writing system information within a font

As mentioned earlier, Windows cannot determine an intended writing system or language based solely on the glyphs contained in a font. Before giving the user or application writing system options, Windows must know which writing systems a font covers.

Fortunately, fonts contain a great deal of information about their glyphs: in well-designed fonts you'll find hinting instructions, metrics, language information, attachment points for diacritical marks, underline and strikethrough information, and more. Fonts are composed of many data structures, commonly referred to as tables, each containing specific information.

Language information about a font is stored in the "OS/2" table of the font. This table contains a variety of information about typeface weight, superscripts, strikeouts, ascender/descender values, PANOSE classification, licensing info, and more. For more information about the structure of TrueType Font Files, see the TrueType 1.0 Font File Specification (available on MSDN).

Writing systems covered by the glyphs in a font can be specified according to the Unicode script ranges covered by the font, or the codepages covered by the font. A font manufacturer specifies script ranges and/or codepages by setting the appropriate bits of the ulUnicodeRange fields or the ulCodePageRange fields in the OS/2 table of the font. Multiple ranges can be specified for a single font. This encoding cannot be changed by the user.
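Reading such a bit field can be sketched in Python as follows. The bit assignments are an assumption taken from the published OS/2 table documentation for the first ulCodePageRange field, not from this document, and only the five WGL4 codepages are listed. The sample value and the function name are hypothetical.

```python
# Assumed bit assignments in the first ulCodePageRange field of the OS/2 table.
CODEPAGE_BITS = {
    0: 1252,  # Latin 1
    1: 1250,  # Latin 2 (Central/Eastern Europe)
    2: 1251,  # Cyrillic
    3: 1253,  # Greek
    4: 1254,  # Turkish
}

def codepages_from_range(ul_code_page_range1):
    """List the codepages whose bits are set in the given 32-bit field."""
    return [
        codepage
        for bit, codepage in sorted(CODEPAGE_BITS.items())
        if ul_code_page_range1 & (1 << bit)
    ]

# Hypothetical field value with bits 0 and 3 set.
print(codepages_from_range(0b01001))
# prints [1252, 1253]
```

A font that set all five of these bits would advertise coverage of every codepage in the WGL4 set.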


http://www.microsoft.com/typography/unicode/cscp.htm

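When compiling CJK (Chinese, Japanese, Korean) characters with LaTeX, you may encounter the error "package cjk error: invalid character code. \xdufrontmatter". This error is usually caused by an invalid character code in the CJK source.

Before resolving this error, it helps to understand a few related concepts. The CJK package is a LaTeX macro package for handling Chinese, Japanese, and Korean characters; it provides a convenient way to input and process them. "xdufrontmatter" is a command or option that may be provided by a template or package such as XDUThesis, used to set up the cover and front matter of a thesis.

To resolve the "package cjk error: invalid character code. \xdufrontmatter" error, the following approaches can be tried:

1. Check the character codes: make sure no invalid character codes are used in the CJK source. Common mistakes include codes outside the Unicode range and undefined codes. Finding and fixing these helps eliminate the invalid character code problem.
2. Check the version of the CJK package: make sure you are using the latest version. Older versions may have known bugs and limitations, and upgrading may resolve some problems.
3. Check the documentation of the template or package: if you are using a specific template or package and run into problems with the xdufrontmatter command or option, read the related documentation carefully. Template or package authors sometimes provide specific usage notes or workarounds.
4. Avoid invalid CJK codes: if certain CJK codes keep causing the error, they may be considered invalid or unsupported. Try other valid codes, or consider a different method of handling CJK characters.

In short, the "package cjk error: invalid character code. \xdufrontmatter" error is caused by invalid character codes in the CJK source. Checking the character codes, upgrading the CJK package, consulting the documentation, and avoiding invalid CJK codes should resolve the error and allow the LaTeX document to compile.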