4. Character encoding basics

Text in a computer or on the Web is composed of characters. Characters represent letters of the alphabet, punctuation, or other symbols. Characters are grouped into a character set. This is called a coded character set when each character is assigned a particular number, called a code point. These code points are represented in the computer by one or more bytes. A character encoding is the key that unlocks the code: it is a set of mappings between the bytes in the computer and the characters in the coded character set. Without the key, the data looks like garbage.

To start with, you have to understand a little about computers. Information on a computer is stored and transmitted as bits, which are grouped into bytes. Characters in a character set are stored as one or more bytes, so certain combinations of bits equate to certain characters. There are many different character encodings. If the wrong encoding is applied to the bytes in memory, the result will be unintelligible text. It is therefore important, if people are to read your content, that you correctly label the character encoding used.
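The effect of applying the wrong encoding can be sketched in a few lines. This is only an illustration, assuming Python 3 and its built-in codecs (the original text names no language); the variable names are made up for the example.

```python
# A single character, stored as bytes using the UTF-8 encoding.
text = "\u00e9"                      # the letter é
data = text.encode("utf-8")          # two bytes: 0xC3 0xA9

# Decoding with the encoding that was actually used recovers the text.
right = data.decode("utf-8")

# Decoding the same bytes as ISO 8859-1 produces unintelligible text.
wrong = data.decode("iso-8859-1")

print(right)  # é
print(wrong)  # Ã©
```

The bytes never change; only the key used to interpret them does.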

At first developers thought that one byte per character, which allows at most 256 different values, must be enough. Computers were for English-speaking people only and few special characters were allowed. There were good reasons for this: memory was expensive, and a fixed size for characters made programming easier. The first widely used way of storing text like this was called ASCII, which itself defines only 128 characters.

Time went by and people discovered that they needed letters that were not in the original set. So they replaced a few special characters with the ones they needed. Different organizations assembled different sets of characters and created encodings for them. But different people needed different characters, and soon there were hundreds of sets to select from. This mess is what you see if you select View -> (Character) encodings -> More (encodings) in your browser. In addition, it is usually impossible to combine different encodings on the same Web page or in a database, so it is usually very difficult to support multilingual pages using 'legacy' approaches to encoding.

For example, in the coded character set called ISO 8859-1 (also known as Latin1) the decimal code point value for the letter é is 233. In ISO 8859-5, the same code point represents the Cyrillic character щ. These character sets contain fewer than 256 characters and map code points to byte values directly. So a code point with the value 233 is represented by a single byte with a value of 233. Note however that that byte may represent either é or щ, depending on the context.
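As a small sketch of this ambiguity, assuming Python 3 and its standard "iso-8859-1" and "iso-8859-5" codecs, the same byte can be decoded both ways:

```python
raw = bytes([233])  # a single byte with the value 233

as_latin1 = raw.decode("iso-8859-1")    # interpreted as ISO 8859-1
as_cyrillic = raw.decode("iso-8859-5")  # interpreted as ISO 8859-5

print(as_latin1)    # é
print(as_cyrillic)  # щ
```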

There are other ways of handling characters from a range of scripts. For example, with the Unicode character set, you can represent both characters in the same set. In fact, Unicode contains, in a single set, most characters you are likely to ever need. While the value 233 still represents é, the Cyrillic character щ now has a code point value of 1097. This is too large a number to be represented by a single byte. If you use the character encoding for Unicode text called UTF-8, щ will be represented by two bytes, but the code point value is not simply derived from the value of the two bytes spliced together; some more complicated decoding is needed. Other Unicode characters map to one, three or four bytes in the UTF-8 encoding.
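The "more complicated decoding" for a two-byte UTF-8 sequence can be worked through by hand. The following is a minimal sketch in Python 3, not a general UTF-8 decoder: it handles only the two-byte case, and the variable names are illustrative.

```python
ch = "\u0449"                        # the Cyrillic letter щ
code_point = ord(ch)                 # 1097
encoded = ch.encode("utf-8")         # two bytes: 0xD1 0x89

# Naively splicing the two bytes together does not give the code point.
spliced = int.from_bytes(encoded, "big")   # 0xD189 = 53641

# A two-byte UTF-8 sequence has the bit layout 110xxxxx 10xxxxxx.
# Only the x bits carry the code point: 5 bits from the first byte,
# 6 bits from the second.
first, second = encoded
decoded = ((first & 0b00011111) << 6) | (second & 0b00111111)

print(code_point, spliced, decoded)  # 1097 53641 1097
```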

UTF-8 is the most widely used way to represent Unicode text in web pages. But UTF-8 is only one of the possible ways of encoding Unicode characters. In other words, a single code point in the Unicode character set can be mapped to different byte sequences, depending on which encoding was used for the document. Unicode code points can be mapped to bytes using any one of the encodings called UTF-8, UTF-16 or UTF-32. The Devanagari character क, with code point 2325 (which is 915 in hexadecimal notation), will be represented by two bytes when using the UTF-16 encoding (09 15), three bytes with UTF-8 (E0 A4 95), or four bytes with UTF-32 (00 00 09 15).
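These byte sequences can be checked with a short sketch, assuming Python 3. The "-be" codec names are a choice made for this example: they request big-endian byte order without a byte order mark, which matches the byte values quoted above.

```python
ch = "\u0915"  # Devanagari letter KA, code point 2325 (hex 915)

utf8 = ch.encode("utf-8")       # three bytes
utf16 = ch.encode("utf-16-be")  # two bytes
utf32 = ch.encode("utf-32-be")  # four bytes

print(ord(ch))      # 2325
print(utf8.hex())   # e0a495
print(utf16.hex())  # 0915
print(utf32.hex())  # 00000915
```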

There can be further complications beyond those described above (such as byte order and escape sequences), but the detail given here shows why it is important that the application you are working with knows which character encoding is appropriate for your data, and knows how to handle that encoding.
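Byte order is one such complication. As a hedged illustration, assuming Python 3 and its "utf-16-be" and "utf-16-le" codecs, the same UTF-16 code unit can be stored with its two bytes in either order, and reading it with the wrong order silently yields a different character:

```python
ch = "\u0915"  # Devanagari letter KA

big = ch.encode("utf-16-be")     # most significant byte first
little = ch.encode("utf-16-le")  # least significant byte first

# Reading little-endian bytes as big-endian raises no error here;
# it simply produces another code point.
misread = little.decode("utf-16-be")

print(big.hex(), little.hex())  # 0915 1509
print(hex(ord(misread)))        # 0x1509
```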

 The Unicode Consortium provides a large, single character set that aims to include all the characters needed for any writing system in the world. It is now fundamental to the architecture of the Web and operating systems, and is supported by all major web browsers and applications.

A font is a collection of glyph definitions, i.e. definitions of the shapes used to display characters.

Once your application has worked out what characters it is dealing with, it will then look in the font for glyphs in order to display or print those characters. (Of course, if the encoding information was wrong, it will be looking up glyphs for the wrong characters.)

A given font will usually cover a single character set, or in the case of a large character set like Unicode, just a subset of all the characters in the set. If your font doesn't have a glyph for a particular character, some applications will look for the missing character in other fonts on your system (which will mean that the glyph will look different from the surrounding text, like a ransom note). Otherwise you will typically see a square box, a question mark or some other character instead. For example:

[Image: mojibake3.gif]

 It is important to clearly distinguish between the concepts of a character set versus a character encoding.

...The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters. This conversion fits naturally with the scheme of Web activity: servers send HTML documents to user agents as a stream of bytes; user agents interpret them as a sequence of characters. The conversion method can range from simple one-to-one correspondence to complex switching schemes or algorithms...

Reference: Section 5.2, "Character encodings", in the HTML Document Representation chapter of the W3C HTML Recommendation

The character encoding declaration tells the browser and the validator which mapping to use when converting the bytes to characters.
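A rough sketch of how a program might use the "charset" parameter follows. It assumes Python 3 and uses the standard library's email.message.Message class only as a convenient header parser. The helper name charset_from_content_type, the sample header, and the fallback to UTF-8 are assumptions made for this example, not anything the HTML specification prescribes.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type value (hypothetical helper).

    The default is an assumption for this sketch, used when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the parameter in lower case, or None.
    return msg.get_content_charset() or default


body = bytes([233])                       # bytes as received from a server
header = "text/html; charset=ISO-8859-5"  # illustrative header value

charset = charset_from_content_type(header)
print(charset)               # iso-8859-5
print(body.decode(charset))  # щ
```

Had the header declared ISO-8859-1 instead, the same byte would have been displayed as é, which is why the declaration has to match the encoding that was really used.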

Choosing an encoding

Everyone developing content, whether content authors or programmers, must decide what character encoding to use. UTF-8 is a popular recommendation these days, but there may still be things you should consider before using it. Content developers and webmasters may also need to ensure that the server delivers content with the correct character encoding declarations, since server settings can override in-document declarations.

 

 

Reference:
http://www.w3.org/International/articles/definitions-characters/
http://www.w3.org/International/questions/qa-what-is-encoding

Reposted from: https://my.oschina.net/u/556267/blog/71488
