Character Sets Overview


nathena


Tags: character byte encoding java stream output


ASCII, the American Standard Code for Information Interchange, is a seven-bit character set. It therefore defines 2^7, or 128, different characters whose numeric values range from 0 to 127. These characters are sufficient for handling most of American English and can make reasonable approximations to most European languages (with the notable exceptions of Russian and Greek). It is often used as a lowest common denominator format for different computers. If you were to read a byte value between 0 and 127 from a stream and then cast it to a char, the result would be the corresponding ASCII character.
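
The read-and-cast step described above can be sketched as follows. This is a minimal illustration, not code from the original text: the class name AsciiCastDemo and the helper asciiToString are hypothetical, and the sketch assumes every byte in the input is in the range 0-127.

```java
import java.io.ByteArrayInputStream;

public class AsciiCastDemo {
    // Reads each byte from an in-memory stream and casts it to a char.
    // Assumption: the data is pure ASCII (every value between 0 and 127).
    static String asciiToString(byte[] bytes) {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {   // read() returns -1 at end of stream
            sb.append((char) b);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = {72, 101, 108, 108, 111};   // ASCII codes for H, e, l, l, o
        System.out.println(asciiToString(data));  // prints "Hello"
    }
}
```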

ASCII characters 0-31 and character 127 are nonprinting control characters. Characters 32-47 are various punctuation and space characters. Characters 48-57 are the digits 0-9. Characters 58-64 are another group of punctuation characters. Characters 65-90 are the capital letters A-Z. Characters 91-96 are a few more punctuation marks. Characters 97-122 are the lowercase letters a-z. Finally, characters 123 through 126 are a few remaining punctuation symbols. The complete ASCII character set is shown in Table B.1 in Appendix B.
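
As a rough illustration of these ranges, the sketch below maps a numeric code to the group it falls in. The class AsciiRanges, the method classify, and the group labels are made up for this example; only the range boundaries come from the text.

```java
public class AsciiRanges {
    // Returns a label for the ASCII group that the given code belongs to.
    static String classify(int code) {
        if (code < 0 || code > 127) return "not ASCII";
        if (code <= 31 || code == 127) return "control";
        if (code <= 47) return "punctuation or space";   // 32-47
        if (code <= 57) return "digit";                  // 48-57
        if (code <= 64) return "punctuation";            // 58-64
        if (code <= 90) return "uppercase letter";       // 65-90
        if (code <= 96) return "punctuation";            // 91-96
        if (code <= 122) return "lowercase letter";      // 97-122
        return "punctuation";                            // 123-126
    }

    public static void main(String[] args) {
        System.out.println(classify('A'));   // prints "uppercase letter"
        System.out.println(classify('7'));   // prints "digit"
        System.out.println(classify(127));   // prints "control"
    }
}
```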

All Java programs can be expressed in pure ASCII. Non-ASCII Unicode characters are encoded as Unicode escapes; that is, written as a backslash (\), followed by a u, followed by four hexadecimal digits; for example, \u00A9. This is discussed further in Section 1.3.3, later in this chapter.
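
A small sketch of a Unicode escape in source code, under the assumption that the file itself is saved as pure ASCII. The class name UnicodeEscapeDemo and the constant COPYRIGHT are illustrative only.

```java
public class UnicodeEscapeDemo {
    // The six ASCII characters of the escape are translated by the compiler
    // into one char, the copyright sign (code 0x00A9).
    static final String COPYRIGHT = "\u00A9";

    public static void main(String[] args) {
        System.out.println(COPYRIGHT.length());          // prints 1
        System.out.println((int) COPYRIGHT.charAt(0));   // prints 169
    }
}
```

The string holds a single character even though the source spells it with six, which is why any Java program can be written in ASCII alone.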

ISO Latin-1

ISO Latin-1 (ISO 8859-1) is an eight-bit character set that is a strict superset of ASCII. It defines 2^8, or 256, different characters whose numeric values range from 0 to 255. The first 128 characters (that is, those numbers with the high-order bit equal to zero) correspond exactly to the ASCII character set. Thus 65 is ASCII A and ISO Latin-1 A; 66 is ASCII B and ISO Latin-1 B; and so on. Where ISO Latin-1 and ASCII diverge is in the characters between 128 and 255 (characters with the high bit equal to one). ASCII does not define these characters. ISO Latin-1 uses them for various accented letters like ü needed for non-English languages written in a Roman script, additional punctuation marks and symbols like ©, and additional control characters. The upper, non-ASCII half of the ISO Latin-1 character set is shown in Table B.2.

Latin-1 provides enough characters to write most Western European languages (again with the notable exception of Greek). It is a popular lowest common denominator format for different computers. If you were to read an unsigned byte value from a stream and then cast it to a char, the result would be the corresponding ISO Latin-1 character.
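
The word "unsigned" matters here because Java's byte type is signed. The sketch below, with the hypothetical names Latin1CastDemo, toLatin1Char, and naiveCast, contrasts masking a stored byte to the range 0-255 before the cast with casting it directly.

```java
public class Latin1CastDemo {
    // Masks the signed byte to 0-255 first, then casts to char.
    static char toLatin1Char(byte b) {
        return (char) (b & 0xFF);
    }

    // Casts directly; a negative byte is sign-extended and gives the wrong char.
    static char naiveCast(byte b) {
        return (char) b;
    }

    public static void main(String[] args) {
        byte uUmlaut = (byte) 0xFC;   // Latin-1 code 252, the letter u with diaeresis
        System.out.println(Integer.toHexString(toLatin1Char(uUmlaut)));  // prints "fc"
        System.out.println(Integer.toHexString(naiveCast(uUmlaut)));     // prints "fffc"
    }
}
```

InputStream.read() already returns an int between 0 and 255, so the mask is only needed when the value has been stored in a byte variable or array.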

Unicode

ISO Latin-1 suffices for most Western European languages, but it doesn't have anywhere near the number of characters required to represent Cyrillic, Greek, Arabic, Hebrew, Persian, or Devanagari, not to mention pictographic languages like Chinese and Japanese. Chinese alone has over 80,000 different characters. To handle these scripts and many others, the Unicode character set was invented. As originally designed, Unicode is a 2-byte, 16-bit character set with 2^16, or 65,536, different possible characters. (Only about 40,000 are used in practice, the rest being reserved for future expansion.) Unicode can handle most of the world's living languages and a number of dead ones as well.

The first 256 characters of Unicode (that is, the characters whose high-order byte is zero) are identical to the characters of the ISO Latin-1 character set. Thus 65 is ASCII A, ISO Latin-1 A, and Unicode A; 66 is ASCII B, ISO Latin-1 B, and Unicode B; and so on.

Java streams do not do a good job of reading Unicode text. (This is why readers and writers were added in Java 1.1.) Streams generally read a byte at a time, but each Unicode character occupies two bytes. Thus, to read a Unicode character, you multiply the first byte read by 256, add it to the second byte read, and cast the result to a char. For example:

int b1 = in.read();              // high-order byte (0-255, or -1 at end of stream)
int b2 = in.read();              // low-order byte
char c = (char) (b1*256 + b2);   // combine the two bytes into one 16-bit char

You must be careful to ensure that you don't inadvertently read the last byte of one character and the first byte of the next instead. Thus, for the most part, when reading text encoded in Unicode or any other format, you should use a reader rather than an input stream. Readers handle the conversion of bytes in one character set to Java chars without any extra effort. For similar reasons, you should use a writer rather than an output stream to write text.
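
For comparison with the manual two-byte arithmetic, here is a sketch of the reader-based approach. It assumes the bytes are 16-bit big-endian Unicode (the same byte order the arithmetic above implies) and uses an in-memory stream for brevity; the names ReaderDemo and decodeUtf16BE are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReaderDemo {
    // Lets an InputStreamReader pair up the bytes instead of doing it by hand.
    // Assumption: the input is big-endian UTF-16 without a byte order mark.
    static String decodeUtf16BE(byte[] bytes) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_16BE)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {   // read() returns one char, or -1
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 0x48, 0x00, 0x69};    // two-byte codes for H and i
        System.out.println(decodeUtf16BE(data));   // prints "Hi"
    }
}
```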

UTF-8

Unicode is a relatively inefficient encoding when most of your text consists of ASCII characters. Every character requires the same number of bytes (two), even though some characters are used much more frequently than others. A more efficient encoding would use fewer bits for the more common characters. This is what UTF-8 does.

In UTF-8 the ASCII alphabet is encoded using a single byte, just as in ASCII. The next 1,920 characters (codes 128 through 2,047) are encoded in two bytes. The remaining 16-bit Unicode characters are encoded in three bytes. However, since these three-byte characters are relatively uncommon, especially in English text, the savings achieved by encoding ASCII in a single byte more than make up for it.
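
The one-, two-, and three-byte cases can be checked with a short sketch. The class Utf8LengthDemo and the helper utf8Length are illustrative names; the sample characters are chosen for this example and are written as Unicode escapes so the source stays pure ASCII.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    // Returns the number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));        // ASCII letter: prints 1
        System.out.println(utf8Length("\u00FC"));   // Latin-1 accented letter: prints 2
        System.out.println(utf8Length("\u4E2D"));   // Chinese character: prints 3
    }
}
```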

Java's .class files use a slightly modified form of UTF-8 internally to store string literals. Data input streams and data output streams also read and write strings in this format. However, this is all hidden from direct view of the programmer, unless perhaps you're trying to write a Java compiler or parse the output of a data stream without using the DataInputStream class.
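
A minimal sketch of what a data output stream actually writes for a string, using in-memory streams. The names DataUtfDemo, encode, and decode are hypothetical; writeUTF and readUTF are the standard DataOutputStream and DataInputStream methods, which prefix the encoded bytes with a two-byte length.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class DataUtfDemo {
    // Writes the string with writeUTF and returns the raw bytes produced.
    static byte[] encode(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the string back with readUTF.
    static String decode(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = encode("Hi");
        System.out.println(raw.length);    // prints 4: two length bytes plus H and i
        System.out.println(decode(raw));   // prints "Hi"
    }
}
```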

Other encodings

ASCII, ISO Latin-1, and Unicode are hardly the only character sets in common use, though they are the ones handled most directly by Java. There are many other character sets, both those that encode different scripts and those that encode the same scripts in different ways. For example, IBM mainframes have long used a non-ASCII eight-bit character set called EBCDIC. EBCDIC has most of the same characters as ASCII but assigns them to different numbers. Macintoshes commonly use an eight-bit encoding called MacRoman that matches ASCII in the lower 128 places and has most of the same characters as ISO Latin-1 in the upper 128 places but in different positions. Big-5 and SJIS are encodings of Chinese and Japanese, respectively, that are designed to allow these large scripts to be input from a standard English keyboard.

Java's Reader, Writer, and String classes understand how to convert these character sets to and from Unicode.
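
One way such a conversion might look with the String class is sketched below. The names TranscodeDemo and transcode are hypothetical. The example sticks to ISO Latin-1 and UTF-8 because those two are guaranteed on every Java platform; whether a charset such as Big5 is available depends on the installed runtime, so the sketch only queries it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeDemo {
    // Decodes bytes from one character set into a String (Unicode),
    // then encodes that String in another character set.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from);
        return text.getBytes(to);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xFC};   // u with diaeresis in ISO Latin-1
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(utf8.length);                   // prints 2
        System.out.println(Charset.isSupported("Big5"));   // depends on the runtime
    }
}
```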
