C Reference Manual Reading Notes: 003 Multibyte and Wide Characters

    To accommodate non-english alphabets that may contain a large number of characters, Standard C introduces wide characters and wide strings.  To present wide characters and wide strings in the external, byte-oriented world, the concept of multibyte characters is introduced.

 

    Wide Characters And Strings.  A wide character is a binary representation of an element of an extended character set. It has the integer type wchar_t which is declared in header file stddef.h. Standard C does not specify the encodingof the extended character set other than "null wide character"(zero, 0) and the existence of WEOF(-1).

 

    Multibyte Character is the representation of a wide character in either the source or execution character set.(There may be different encoding for each). A multibyte stirng is a normal C string, but whose characters can be interpreted as a series of multibyte characters. The form of multibyte characters and the mapping between multibyte and wide characters is implementation-defined. This mapping is performed for wide-character and wide string constants at compile time, and the standard library provides function that perform this mapping at run time. Multibyte characters encoding can be state dependent or independent.

 

   Standard C places some restrictions on multibyte characters:

     (1).  All characters from the standard character set must be present in the encoding.

     (2). In the initial shift state, all single-byte characters from the standard character set retain their normal interpretation and do not affect the shift state.

     (3). A byte containing all zeros is taken to be the null character regardless of shift state. No multibyte character can use a byte containing all zeros as its second or subsequent character.

   Together, these rules ensure that multibyte sequences can be processed as normal C strings(e.g. they will not contain embedded null characters ) and a C string without special multibyte codes will have the expected interpretation as a multibyte sequence.

 

    Source and execution use of multibyte characters. Multibyte character may appear in comments, idenrifiers, preprocessor header names, string constants, and character constants. Multibyte characters in the physical representation of the source are recognized and translated to the source character set before any lexical analysis, preprocessing, or even splicing of continuation lines. During process, character appearing in string and character constants are translated to the execution character set before they are interpreted as multibyte sequences.

 

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值